4. A More Realistic Example: Softmax
Vector multiplication demonstrates the basics, but real neural network workloads require math functions like exp(), log(), and sqrt(). The softmax function — used in attention layers, classification heads, and probability normalization — is a perfect example:
$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$
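The stabilized formula is easy to pin down as a host-side reference first (plain std Rust, shown here purely for illustration — the benchmarks later in this chapter verify NPU output against exactly this kind of CPU reference):

```rust
/// CPU reference softmax. Subtracting max(x) before exponentiating keeps
/// every argument to exp() non-positive, so exp() cannot overflow.
fn softmax_ref(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Shift invariance is the point of the max subtraction: softmax of `[1001, 1002, 1003]` equals softmax of `[1, 2, 3]`, but without the trick `exp(1001.0)` would overflow `f32` to infinity.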
4.1 Math Intrinsics in ascend_std
ascend-rs exposes hardware math operations as Rust methods on primitive types. Under the hood, `f32::exp()` maps to the `expf32` compiler intrinsic, which the MLIR codegen backend lowers to `llvm.intr.exp` — ultimately executing as a native NPU math instruction.
```rust
// In ascend_std: these methods are available on f32/f64 in kernel code
let y = x.exp();  // expf32  → llvm.intr.exp
let y = x.ln();   // logf32  → llvm.intr.log
let y = x.sqrt(); // sqrtf32 → llvm.intr.sqrt
```
4.2 The Softmax Kernel
Here is a complete softmax kernel written in Rust for the Ascend NPU:
```rust
#![feature(no_core)]
#![no_std]
#![no_core]

#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len as usize;

        // Step 1: Find max value for numerical stability
        let mut max_val = *input;
        let mut i = 1usize;
        loop {
            if i >= n { break; }
            let val = *input.wrapping_add(i);
            if val > max_val { max_val = val; }
            i = i + 1;
        }

        // Step 2: Compute exp(x_i - max) and accumulate sum
        let mut sum: f32 = 0.0;
        i = 0;
        loop {
            if i >= n { break; }
            let exp_val = (*input.wrapping_add(i) - max_val).exp();
            *output.wrapping_add(i) = exp_val;
            sum = sum + exp_val;
            i = i + 1;
        }

        // Step 3: Normalize
        i = 0;
        loop {
            if i >= n { break; }
            *output.wrapping_add(i) = *output.wrapping_add(i) / sum;
            i = i + 1;
        }
    }
}
```
The key line is `(*input.wrapping_add(i) - max_val).exp()` — this calls `f32::exp()`, which compiles through the MLIR backend into a native NPU exponential instruction. The subtraction of `max_val` before exponentiation is the standard numerical stability trick that prevents overflow.
This demonstrates that ascend-rs kernel code isn’t limited to simple arithmetic — it can express the same algorithms you’d write in C++ AscendC, with Rust’s safety guarantees.
4.3 Performance: Rust vs C++ on Real Hardware
How does a Rust kernel perform compared to hand-written C++ on actual NPU hardware? We benchmarked the softmax kernel on an Ascend 310P NPU with four implementations:
- **C++ naive (scalar)** — A hand-written C++ kernel using scalar loops with `GetValue`/`SetValue` accessors
- **C++ optimized (vector)** — An expert-written C++ kernel using AscendC vector intrinsics (`ReduceMax`, `Exp`, `Muls`)
- **Rust scalar** — The Rust kernel above, compiled through the MLIR-to-C++ codegen pipeline
- **Rust vector** — A Rust kernel using ascend-rs vector intrinsics (`ascend_reduce_max_f32`, `ascend_exp_f32`, `ascend_muls_f32`), compiled through the same pipeline
Each kernel processes f32 input arrays, with 1 warmup iteration and 10 timed iterations per configuration. All results are verified against a CPU reference for correctness.
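The warmup-then-measure protocol can be sketched as a small host-side helper. This is an illustrative sketch, not the real harness: in the actual benchmark the closure would launch the NPU kernel and synchronize, while here it is any callable so the protocol itself is testable on the host.

```rust
use std::time::Instant;

/// Run `f` once as a warmup (excluded from timing), then `iters` timed
/// runs; return (min, avg) in milliseconds. Hypothetical helper — the
/// real harness uses device-side event timers, not Instant.
fn bench<F: FnMut()>(mut f: F, iters: u32) -> (f64, f64) {
    f(); // warmup iteration: first-call overheads don't pollute the stats
    let mut times = Vec::with_capacity(iters as usize);
    for _ in 0..iters {
        let t0 = Instant::now();
        f();
        times.push(t0.elapsed().as_secs_f64() * 1e3);
    }
    let min = times.iter().cloned().fold(f64::INFINITY, f64::min);
    let avg = times.iter().sum::<f64>() / times.len() as f64;
    (min, avg)
}
```

Reporting the minimum alongside the average is the usual way to separate the kernel's best-case latency from scheduling noise, which is how the per-tile numbers in §4.6 are presented.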
| Size | C++ Naive (ms) | C++ Opt (ms) | Rust Scalar (ms) | Rust Vector (ms) | Scalar vs Naive | Vector vs Opt |
|---|---|---|---|---|---|---|
| 256 | 0.100 | 0.078 | 0.099 | 0.077 | 0.99x | 0.99x |
| 1,024 | 0.191 | 0.077 | 0.202 | 0.076 | 1.06x | 0.99x |
| 4,096 | 0.568 | 0.079 | 0.607 | 0.079 | 1.07x | 1.00x |
| 16,384 | 2.073 | 0.089 | 2.221 | 0.087 | 1.07x | 0.98x |
Key findings:
- **Rust vector matches C++ optimized performance.** The Rust vectorized kernel, using `ascend_std` vector intrinsics that map to AscendC operations, performs within 1-2% of the hand-optimized C++ kernel across all sizes. At 16,384 elements, the Rust vector kernel (0.087 ms) is actually slightly faster than C++ optimized (0.089 ms). This means there is zero performance penalty for writing vectorized NPU kernels in Rust instead of C++.
- **Vector intrinsics provide massive speedups.** Both vectorized kernels are about 1.3x faster at small sizes and up to 25x faster at 16,384 elements compared to their scalar counterparts. The vector pipeline processes 256 bits (8 floats) per cycle vs one element per cycle for scalar code.
- **Rust scalar is within 5-7% of C++ scalar.** The scalar codegen path also produces competitive code, with the small overhead coming from different UB access patterns (direct pointer arithmetic vs accessor methods).
- **All implementations are numerically correct.** Every kernel-size combination produces results matching the CPU reference (max error < 1e-8, output sum ≈ 1.0). The vector implementations achieve even lower error than scalar (max_err ~1e-10 vs ~1e-8) due to hardware-optimized math operations.
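The correctness criteria quoted here (max error against the CPU reference, output sum ≈ 1.0) amount to a short check. A host-side sketch of what such a verifier might look like (illustrative helper, not the benchmark's actual code):

```rust
/// Verify an NPU softmax result against a CPU reference:
/// elementwise max absolute error plus the sum-to-one property.
fn verify_softmax(npu_out: &[f32], reference: &[f32], tol: f32) -> bool {
    let max_err = npu_out.iter().zip(reference.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    let sum: f32 = npu_out.iter().sum();
    // Both conditions must hold: pointwise agreement and valid distribution.
    max_err < tol && (sum - 1.0).abs() < 1e-4
}
```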
Here is what the Rust vectorized softmax kernel looks like — it reads almost identically to the C++ version:
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
    unsafe {
        let n = *len_buf;
        let in_buf = ascend_std::ascend_buf_alloc(n);
        let out_buf = ascend_std::ascend_buf_alloc(n);
        let work = ascend_std::ascend_buf_alloc(n);
        let rwork = ascend_std::ascend_buf_alloc(n);
        ascend_std::ascend_buf_load_f32(in_buf, input, n);
        ascend_std::ascend_pipe_barrier();
        let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
        ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
        ascend_std::ascend_exp_f32(out_buf, out_buf, n);
        let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
        ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, out_buf, n);
    }
}
```
The `ascend_buf_alloc` / `ascend_buf_load_f32` / `ascend_reduce_max_f32` calls are `extern "C"` stubs in `ascend_std` that the MLIR codegen backend recognizes and translates to AscendC API calls (`TBuf`, `DataCopy`, `ReduceMax`, etc.) during C++ code generation. This gives Rust kernels direct access to the NPU's vector pipeline with zero overhead.
4.4 Beyond Softmax: Activation Function Benchmarks
To validate the breadth of the vector intrinsic API, we benchmarked three additional activation functions — Relu, Sigmoid, and Tanh — each composed from the same primitive operations. Unlike softmax, these activations don’t have dedicated AscendC builtins; instead they are constructed from composable vector primitives:
- Relu(x) = max(x, 0) → `Maxs`
- Sigmoid(x) = 1 / (1 + exp(-x)) → `Muls` → `Exp` → `Adds` → `Reciprocal`
- Tanh(x) = 2 · sigmoid(2x) - 1 → `Muls` → `Exp` → `Adds` → `Reciprocal` → `Muls` → `Adds`
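These compositions can be checked numerically on the host. Below are scalar mirrors of the primitive chains (illustrative std Rust, not `ascend_std` code — the NPU versions chain the corresponding vector intrinsics over whole buffers):

```rust
// Scalar mirrors of the NPU primitive chains.
fn relu(x: f32) -> f32 { x.max(0.0) }                 // Maxs

fn sigmoid(x: f32) -> f32 {
    // Muls(-1) → Exp → Adds(1) → Reciprocal
    ((-x).exp() + 1.0).recip()
}

fn tanh_via_sigmoid(x: f32) -> f32 {
    // Muls(2) → sigmoid chain → Muls(2) → Adds(-1)
    2.0 * sigmoid(2.0 * x) - 1.0
}
```

The tanh identity holds exactly in real arithmetic (2σ(2x) − 1 = (1 − e⁻²ˣ)/(1 + e⁻²ˣ) = tanh x); in f32 the chained form agrees with a direct `tanh` to within a few ulps, which is why the composed NPU kernel stays under the quoted 3e-3 error bound.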
For each function, we compare a C++ implementation (TQue pipeline) against the equivalent Rust-style code (TBuf pipeline matching the mlir_to_cpp output):
| Size | Relu C++ (ms) | Relu Rust (ms) | Sigmoid C++ (ms) | Sigmoid Rust (ms) | Tanh C++ (ms) | Tanh Rust (ms) |
|---|---|---|---|---|---|---|
| 256 | 0.078 | 0.075 | 0.075 | 0.075 | 0.075 | 0.077 |
| 1,024 | 0.075 | 0.076 | 0.075 | 0.074 | 0.075 | 0.076 |
| 4,096 | 0.075 | 0.076 | 0.077 | 0.077 | 0.076 | 0.078 |
| 16,384 | 0.083 | 0.083 | 0.086 | 0.086 | 0.085 | 0.086 |
All six kernels perform identically within measurement noise. Relu achieves exact correctness (max_err = 0), while Sigmoid and Tanh achieve max_err < 3e-3 at sizes ≥ 1024. The size=256 correctness issue affects both C++ and Rust equally — it’s an AscendC hardware-level precision artifact at small vector sizes, not a codegen issue.
This confirms that the Rust vector intrinsic API generalizes beyond softmax. For the activation functions tested here — each a composition of AscendC vector primitives — Rust and C++ produce identical performance. We expect this to hold for any kernel composed purely from vector intrinsics, since the codegen maps each Rust intrinsic call 1:1 to the same AscendC C++ call. Cube engine operations (matmul via Mmad) and multi-level buffer hierarchies (L1/L0A/L0B/L0C) are supported at the API level but have not yet been hardware-verified through the full pipeline.
4.5 Formal Equivalence Verification: AscendC vs AscendRS
Performance parity is compelling, but the strongest argument for the Rust codegen pipeline is bitwise equivalence — proving that Rust-generated kernels produce exactly the same numerical results as hand-written AscendC C++ kernels on real NPU hardware.
We selected three representative kernels that cover the most common neural network operation patterns:
- **ReLU** — single vector op: `output[i] = max(input[i], 0)` → `ascend_maxs_f32`
- **Sigmoid** — chained vector ops: `output[i] = 1/(1 + exp(-input[i]))` → `Muls` → `Exp` → `Adds` → `Reciprocal`
- **Vec Add** — binary vector op: `z[i] = x[i] + y[i]` → `ascend_add_f32`
For each kernel, we compiled two implementations:
- **AscendC original** — idiomatic C++ using the TQue pipeline (EnQue/DeQue implicit synchronization), as a 910B production engineer would write it
- **AscendRS equivalent** — C++ generated from Rust source via the `mlir_to_cpp` pipeline (TBuf + explicit `pipe_barrier(PIPE_ALL)`)
Both were run on the 310P NPU with identical inputs (256 f32 elements, deterministic PRNG) and compared at three levels:
| Test | C++ vs CPU | RS vs CPU | C++ vs RS |
|---|---|---|---|
| ReLU | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |
| Sigmoid | PASS (err=2.4e-3) | PASS (err=2.4e-3) | PASS (err=0.00) |
| Vec Add | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |
The C++ vs RS column shows bitwise identical output (max error = 0.0) for all three kernels. The NPU produces exactly the same bits whether the kernel was written in C++ or Rust. The small sigmoid CPU difference (2.4e-3) is the NPU’s Exp() vector unit precision vs x86 expf() — it affects both implementations equally and is not a codegen issue.
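"Bitwise identical" is a stronger claim than passing a tolerance check, and it can be expressed directly by comparing raw bit patterns. A sketch of such a comparator (illustrative helper, assuming both outputs have been copied back to host memory):

```rust
/// Strict bitwise equality over f32 slices: compares bit patterns,
/// so -0.0 vs 0.0 or differing NaN payloads count as mismatches,
/// even though `==` would treat some of those as equal.
fn bitwise_equal(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len()
        && a.iter().zip(b.iter()).all(|(x, y)| x.to_bits() == y.to_bits())
}
```

This is why "max error = 0.0" in the C++ vs RS column is meaningful: with float subtraction alone, a −0.0/+0.0 discrepancy would be invisible, whereas a bit-level comparison catches every divergence in the output buffers.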
Here is the Rust sigmoid kernel — four lines of vector intrinsic calls that produce identical NPU output to the 40-line AscendC C++ class:
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn sigmoid(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_muls_f32(buf_out, buf_in, -1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_exp_f32(buf_out, buf_out, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_adds_f32(buf_out, buf_out, 1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_reciprocal_f32(buf_out, buf_out, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}
```
A notable discovery during this work: in-place chained vector operations on the 310P require an explicit `pipe_barrier(PIPE_ALL)` between each step. Without barriers between `Muls` → `Exp` → `Adds` → `Reciprocal` on the same buffer, the next operation reads stale data. This is a hardware synchronization requirement that the Rust codegen pipeline now handles correctly — and the equivalence test serves as a regression test for this behavior.
4.6 Double-Buffering Results (910B2, 2026-04-02)
The single-tile softmax in §4.3–4.5 spends most of its wall-clock time waiting for one DMA to finish before the next compute step can start. The fix is the textbook double-buffer: issue two tile loads back-to-back, then compute on the first while the second's DMA is still in flight. The Rust tile API expresses this as a four-line prologue — `tile_load_f32` for tile 0, `tile_prefetch_f32` for tile 1 — and `mlir_to_pto` lowers each to a `pto.tload` with a distinct `partition_view` row offset, which is the signal `ptoas` needs to schedule the two DMAs concurrently on the Mte2 pipe.
| Variant (1×1024 f32, 910B2) | Per-tile min | Per-tile avg | Speedup vs single |
|---|---|---|---|
| single tile (PTO, §4.3) | 4.0 µs | 4.6 µs | 1.00× (baseline) |
| double-buffer (2 tiles) | 2.4 µs | 3.4 µs | 1.65× (min) / 1.35× (avg) |
Numerics tie with the single-tile path: max_err = 3.26e-9, sum within 1 ulp of 1.0. The full reproducer — kernel source, the generated PTO-MLIR with the two distinct row-offset partition_view ops, and the build/run commands — is in Appendix J §J4.
The bug fixes that made this example work — `make_pv` not propagating GEP offsets, and Pattern 3 flattening the alias chain — are documented at the end of that example. Double-buffer was the test case that surfaced both, because they only matter when two `partition_view` ops with different offsets need to coexist in the same kernel.
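The measured gain sits where the textbook overlap model predicts. A toy cost model (illustrative arithmetic only — hypothetical helper names, not a simulation of the Mte2 pipe or the `ptoas` scheduler):

```rust
/// Serial (single-buffer) cost: each tile pays its full DMA, then computes.
fn single_buffer_total(tiles: u32, t_dma: f64, t_compute: f64) -> f64 {
    tiles as f64 * (t_dma + t_compute)
}

/// Double-buffered cost: only the first DMA and last compute are exposed;
/// in between, compute on tile i overlaps the DMA for tile i+1.
fn double_buffer_total(tiles: u32, t_dma: f64, t_compute: f64) -> f64 {
    t_dma + (tiles as f64 - 1.0) * t_dma.max(t_compute) + t_compute
}
```

For two tiles with DMA and compute roughly balanced, the model gives 8/6 ≈ 1.33×, in the same range as the measured 1.35× average; as the tile count grows, the predicted speedup approaches 2×.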
4.7 Softmax via the linalg Bridge: Importing Upstream MLIR
So far every softmax kernel in this chapter began as Rust source. The same kernel can also arrive at the NPU from the opposite direction: written elsewhere in the standard upstream linalg dialect, ingested through the ascend-rs linalg bridge, and emitted to the same AscendC C++ that mlir_to_cpp would produce from Rust. The bridge is what lets ascend-rs absorb kernels from third-party frontends — torch-mlir, iree, hand-written linalg in upstream MLIR tests — without re-authoring them in ascend_std.
The upstream form is essentially two ops:
```mlir
// benchmarks/linalg/kernels_upstream_shape_matched/softmax_upstream_1x1024.mlir
func.func @upstream_softmax_1x1024(%arg0: tensor<1x1024xf32>) -> tensor<1x1024xf32> {
  %0 = tensor.empty() : tensor<1x1024xf32>
  %1 = linalg.softmax dimension(1) ins(%arg0 : tensor<1x1024xf32>)
                      outs(%0 : tensor<1x1024xf32>) -> tensor<1x1024xf32>
  return %1 : tensor<1x1024xf32>
}
```
`linalg_to_ascend_tile` rewrites the `linalg.softmax` op into the same `ascend_tile_*` intrinsic call sequence that the Rust front-end emits, so the downstream pipeline is byte-identical past that point: the AscendC C++ produced by `mlir_to_cpp` differs in zero bytes from the version generated from a hand-written ascendrs-form kernel.
Ingest paths verified on 2026-04-22 (910B2, chip 0/2, 3 repeat runs):
| Pair (1×1024 f32) | Source | NPU min (µs) | Δ vs hand-written | Match |
|---|---|---|---|---|
| add | upstream linalg | ~5.0 | ≤ 0.4 µs (~5%) | ✓ |
| add | torch-mlir FX | ~4.2 | 0.02–0.48 µs | ✓ |
| exp | upstream linalg | ~4.6 | ≤ 0.1 µs (<2%) | ✓ |
| exp | torch-mlir FX | ~4.5 | 0.08–0.26 µs | ✓ |
| softmax | upstream linalg | ~5.2 | ≤ 0.4 µs (<8%) | ✓ |
| matmul 32×64×32 | upstream linalg | 1586 | < 0.3 µs (<0.02%) | ✓ |
The matmul row is the decisive one: at 1.58 ms/call the AclEvent timer noise floor is roughly 0.1% of runtime, so a tied min/p50/mean across three runs is genuine numerical equivalence — not measurement uncertainty. For softmax specifically, the bit-identical AscendC emit means any difference in observed throughput would have to come from compiler caches or DMA scheduling, neither of which produced a measurable delta in the bench.
What this means for the running example. Softmax in this chapter has now travelled three routes onto the same 910B2 chip:
```
(a) Rust scalar     ─┐
(b) Rust vector     ─┼─ rustc + mlir_to_cpp ─── AscendC ─── bisheng ─── 910B2
(c) Rust tile API   ─┘ ─── mlir_to_pto ─── ptoas ─── ccec ─── 910B2
(d) upstream linalg ─── linalg_to_ascend_tile ─── mlir_to_cpp ─── AscendC ─── 910B2
(e) torch-mlir FX   ─── linalg_to_ascend_tile ─── mlir_to_cpp ─── AscendC ─── 910B2
```
(d) and (e) reuse the exact emitter from (b). The "zero overhead" claim here is therefore not a benchmark trick — it is a structural property of the bridge: the ingress lowers linalg to `ascend_tile` and then calls the same emitter that the Rust front-end calls. There is no place left for a slowdown to hide.
The reproducer is in Appendix J §J5.
The 30-second walk-through below shows the four routes back-to-back on adablue. Each stage prints the source, runs the host-side step (or shows a committed artifact), and prints the first lines of the emitted form — the punchline being that routes (a), (b) and (e) all converge to the same mlir_to_cpp emit, while route (c) takes the parallel mlir_to_pto + ptoas path:

4.8 Cross-Pipeline Safety: The Same Oracle Watches All Five Routes
Adding ingress paths (d) and (e) raises an honest question: every Rust route in this chapter goes through the rustc front-end, which has already type-checked, borrow-checked, and (via the safety oracle in Chapter 11) statically inspected the kernel for placement and aliasing bugs. Kernels arriving via the linalg bridge skip Rust entirely. Do they get the same safety analysis?
The answer is yes — by reusing the same oracle on the bridge’s intermediate forms. Chapter 12 describes both wirings:
- **Path A** projects the `ascend_tile` MLIR (the bridge's intermediate form, after hop 1) into a stage-2 `Plan` and runs five of the six Chapter 11 passes on it. The same softmax fixture above projects to a clean plan.
- **Path C** lowers the same kernel through `mlir_to_pto → ptoas --print-after-all`, parses the post-`PlanMemoryPass` MLIR, and runs the full six-pass oracle on it. The clean softmax stays clean; an injected dead-tile variant fails capacity at the post-blocking layer that Path A's projector cannot see.
The contrast — same `.acl.pto` softmax, same `ptoas` compiler, two outcomes from the oracle — is the demo recorded in §11.6. With the bridge wired to `ACLRS_LINALG_SAFETY=path-a` (or `path-c`), an upstream linalg kernel that would silently corrupt VEC at runtime is now a compile-time finding before it ever reaches bisheng.