4. A More Realistic Example: Softmax
Vector multiplication demonstrates the basics, but real neural network workloads require math functions like exp(), log(), and sqrt(). The softmax function — used in attention layers, classification heads, and probability normalization — is a perfect example:
$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$
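As a point of reference, the formula can be written directly as a host-side Rust function (this `softmax_ref` helper is our own illustration, not part of ascend_std; it is the kind of CPU reference the kernels later in this section are verified against):

```rust
/// Host-side reference softmax matching the formula above:
/// subtract the max before exponentiating, then normalize.
fn softmax_ref(x: &[f32]) -> Vec<f32> {
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let y = softmax_ref(&[1.0, 2.0, 3.0]);
    let total: f32 = y.iter().sum();
    // Probabilities sum to 1 and preserve the ordering of the inputs.
    assert!((total - 1.0).abs() < 1e-6);
    assert!(y[0] < y[1] && y[1] < y[2]);
    println!("{:?}", y);
}
```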
4.1 Math Intrinsics in ascend_std
ascend-rs exposes hardware math operations as Rust methods on primitive types. Under the hood, f32::exp() maps to the expf32 compiler intrinsic, which the MLIR codegen backend lowers to llvm.intr.exp — ultimately executing as a native NPU math instruction.
```rust
// In ascend_std: these methods are available on f32/f64 in kernel code
let y = x.exp();  // expf32  → llvm.intr.exp
let y = x.ln();   // logf32  → llvm.intr.log
let y = x.sqrt(); // sqrtf32 → llvm.intr.sqrt
```
4.2 The Softmax Kernel
Here is a complete softmax kernel written in Rust for the Ascend NPU:
```rust
#![feature(no_core)]
#![no_std]
#![no_core]

#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len as usize;

        // Step 1: Find max value for numerical stability
        let mut max_val = *input;
        let mut i = 1usize;
        loop {
            if i >= n { break; }
            let val = *input.wrapping_add(i);
            if val > max_val { max_val = val; }
            i = i + 1;
        }

        // Step 2: Compute exp(x_i - max) and accumulate sum
        let mut sum: f32 = 0.0;
        i = 0;
        loop {
            if i >= n { break; }
            let exp_val = (*input.wrapping_add(i) - max_val).exp();
            *output.wrapping_add(i) = exp_val;
            sum = sum + exp_val;
            i = i + 1;
        }

        // Step 3: Normalize
        i = 0;
        loop {
            if i >= n { break; }
            *output.wrapping_add(i) = *output.wrapping_add(i) / sum;
            i = i + 1;
        }
    }
}
```
The key line is (*input.wrapping_add(i) - max_val).exp() — this calls f32::exp(), which compiles through the MLIR backend into a native NPU exponential instruction. The subtraction of max_val before exponentiation is the standard numerical stability trick that prevents overflow.
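A quick host-side demonstration of why the subtraction matters (plain Rust, our own demo, not kernel code): `exp` of a large f32 overflows to infinity, while the max-shifted inputs stay finite:

```rust
fn main() {
    let x = [1000.0f32, 999.0, 998.0];

    // Naive: e^1000 overflows f32 to +inf, so the quotient becomes NaN.
    let naive_num = x[0].exp();
    assert!(naive_num.is_infinite());

    // Stable: subtract the max first; every exponent is <= 0.
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    assert!(exps.iter().all(|e| e.is_finite()));
    assert!(exps[0] / sum > 0.0 && exps[0] / sum <= 1.0);
    println!("stable exponentials: {:?}", exps);
}
```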
This demonstrates that ascend-rs kernel code isn’t limited to simple arithmetic — it can express the same algorithms you’d write in C++ AscendC, with Rust’s safety guarantees.
4.3 Performance: Rust vs C++ on Real Hardware
How does a Rust kernel perform compared to hand-written C++ on actual NPU hardware? We benchmarked the softmax kernel on an Ascend 310P NPU with four implementations:
- **C++ naive (scalar)** — a hand-written C++ kernel using scalar loops with `GetValue`/`SetValue` accessors
- **C++ optimized (vector)** — an expert-written C++ kernel using AscendC vector intrinsics (`ReduceMax`, `Exp`, `Muls`)
- **Rust scalar** — the Rust kernel above, compiled through the MLIR-to-C++ codegen pipeline
- **Rust vector** — a Rust kernel using ascend-rs vector intrinsics (`ascend_reduce_max_f32`, `ascend_exp_f32`, `ascend_muls_f32`), compiled through the same pipeline
Each kernel processes f32 input arrays, with 1 warmup iteration and 10 timed iterations per configuration. All results are verified against a CPU reference for correctness.
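The timing methodology can be sketched on the host as a small harness (our own illustration, not the actual benchmark driver): run the kernel once untimed to warm up, then report the median of the timed runs:

```rust
use std::time::Instant;

/// Time `f` with `warmup` untimed runs and `timed` measured runs,
/// returning the median elapsed time in milliseconds.
fn bench<F: FnMut()>(mut f: F, warmup: usize, timed: usize) -> f64 {
    for _ in 0..warmup {
        f();
    }
    let mut samples: Vec<f64> = (0..timed)
        .map(|_| {
            let t0 = Instant::now();
            f();
            t0.elapsed().as_secs_f64() * 1e3
        })
        .collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    samples[samples.len() / 2]
}

fn main() {
    let data: Vec<f32> = (0..4096).map(|i| i as f32).collect();
    let mut sink = 0.0f32;
    let ms = bench(|| sink = data.iter().map(|v| (v * 0.001).exp()).sum(), 1, 10);
    assert!(ms >= 0.0 && sink > 0.0);
    println!("median: {:.4} ms", ms);
}
```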
| Size | C++ Naive (ms) | C++ Opt (ms) | Rust Scalar (ms) | Rust Vector (ms) | Scalar vs Naive | Vector vs Opt |
|---|---|---|---|---|---|---|
| 256 | 0.100 | 0.078 | 0.099 | 0.077 | 0.99x | 0.99x |
| 1,024 | 0.191 | 0.077 | 0.202 | 0.076 | 1.06x | 0.99x |
| 4,096 | 0.568 | 0.079 | 0.607 | 0.079 | 1.07x | 1.00x |
| 16,384 | 2.073 | 0.089 | 2.221 | 0.087 | 1.07x | 0.98x |
Key findings:

- **Rust vector matches C++ optimized performance.** The Rust vectorized kernel, using `ascend_std` vector intrinsics that map to AscendC operations, performs within 1–2% of the hand-optimized C++ kernel across all sizes. At 16,384 elements, the Rust vector kernel (0.087 ms) is actually slightly faster than C++ optimized (0.089 ms). There is zero performance penalty for writing vectorized NPU kernels in Rust instead of C++.
- **Vector intrinsics provide massive speedups.** Both vectorized kernels are about 1.3x faster at small sizes and up to 25x faster at 16,384 elements compared to their scalar counterparts. The vector pipeline processes 256 bits (8 floats) per cycle versus one element per cycle for scalar code.
- **Rust scalar is within 5–7% of C++ scalar.** The scalar codegen path also produces competitive code, with the small overhead coming from different UB access patterns (direct pointer arithmetic vs. accessor methods).
- **All implementations are numerically correct.** Every kernel-size combination matches the CPU reference (max error < 1e-8, output sum ≈ 1.0). The vector implementations achieve even lower error than scalar (max_err ~1e-10 vs. ~1e-8) thanks to hardware-optimized math operations.
Here is what the Rust vectorized softmax kernel looks like — it reads almost identically to the C++ version:
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
    unsafe {
        let n = *len_buf;

        // UB working buffers: input, output, and two scratch areas
        // for the reduction intrinsics
        let in_buf = ascend_std::ascend_buf_alloc(n);
        let out_buf = ascend_std::ascend_buf_alloc(n);
        let work = ascend_std::ascend_buf_alloc(n);
        let rwork = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(in_buf, input, n);
        ascend_std::ascend_pipe_barrier();

        // softmax: subtract the max, exponentiate, normalize by the sum
        let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
        ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
        ascend_std::ascend_exp_f32(out_buf, out_buf, n);
        let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
        ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);
        ascend_std::ascend_pipe_barrier();

        ascend_std::ascend_buf_store_f32(output, out_buf, n);
    }
}
```
The ascend_buf_alloc / ascend_buf_load_f32 / ascend_reduce_max_f32 calls are extern "C" stubs in ascend_std that the MLIR codegen backend recognizes and translates to AscendC API calls (TBuf, DataCopy, ReduceMax, etc.) during C++ code generation. This gives Rust kernels direct access to the NPU’s vector pipeline with zero overhead.
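To make the mechanism concrete, the stubs can be pictured roughly as plain `extern "C"` declarations with no Rust body (a sketch only; the exact signatures in ascend_std are our assumptions for illustration):

```rust
// Hypothetical shape of the ascend_std stubs (signatures are our
// guess for illustration). On the Rust side these are only
// declarations; the MLIR codegen backend matches the symbol names
// and emits the corresponding AscendC constructs (TBuf allocation,
// DataCopy, ReduceMax) instead of real function calls.
extern "C" {
    fn ascend_buf_alloc(len: u32) -> *mut f32;
    fn ascend_buf_load_f32(dst: *mut f32, src: *const f32, len: u32);
    fn ascend_reduce_max_f32(
        work: *mut f32,
        src: *const f32,
        rwork: *mut f32,
        len: u32,
    ) -> f32;
}
```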
4.4 Beyond Softmax: Activation Function Benchmarks
To validate the breadth of the vector intrinsic API, we benchmarked three additional activation functions — Relu, Sigmoid, and Tanh — each composed from the same primitive operations. Unlike softmax, these activations don’t have dedicated AscendC builtins; instead they are constructed from composable vector primitives:
- Relu(x) = max(x, 0) → `Maxs`
- Sigmoid(x) = 1 / (1 + exp(-x)) → `Muls` → `Exp` → `Adds` → `Reciprocal`
- Tanh(x) = 2 · sigmoid(2x) − 1 → `Muls` → `Exp` → `Adds` → `Reciprocal` → `Muls` → `Adds`
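The sigmoid composition can be mirrored step by step on the host in plain Rust (our own sketch of the Muls → Exp → Adds → Reciprocal chain, not ascend_std code):

```rust
/// Mirror of the primitive chain: out = 1 / (1 + exp(-x)),
/// built from the same four elementwise steps the NPU kernel uses.
fn sigmoid_composed(x: &mut [f32]) {
    for v in x.iter_mut() { *v = *v * -1.0; } // Muls(-1)
    for v in x.iter_mut() { *v = v.exp(); }   // Exp
    for v in x.iter_mut() { *v = *v + 1.0; }  // Adds(+1)
    for v in x.iter_mut() { *v = 1.0 / *v; }  // Reciprocal
}

fn main() {
    let mut x = [0.0f32, 2.0, -2.0];
    sigmoid_composed(&mut x);
    assert!((x[0] - 0.5).abs() < 1e-6);        // sigmoid(0) = 0.5
    assert!((x[1] + x[2] - 1.0).abs() < 1e-6); // sigmoid(a) + sigmoid(-a) = 1
    println!("{:?}", x);
}
```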
For each function, we compare a C++ implementation (TQue pipeline) against the equivalent Rust-style code (TBuf pipeline matching the mlir_to_cpp output):
| Size | Relu C++ (ms) | Relu Rust (ms) | Sigmoid C++ (ms) | Sigmoid Rust (ms) | Tanh C++ (ms) | Tanh Rust (ms) |
|---|---|---|---|---|---|---|
| 256 | 0.078 | 0.075 | 0.075 | 0.075 | 0.075 | 0.077 |
| 1,024 | 0.075 | 0.076 | 0.075 | 0.074 | 0.075 | 0.076 |
| 4,096 | 0.075 | 0.076 | 0.077 | 0.077 | 0.076 | 0.078 |
| 16,384 | 0.083 | 0.083 | 0.086 | 0.086 | 0.085 | 0.086 |
All six kernels perform identically within measurement noise. Relu achieves exact correctness (max_err = 0), while Sigmoid and Tanh achieve max_err < 3e-3 at sizes ≥ 1024. The size=256 correctness issue affects both C++ and Rust equally — it’s an AscendC hardware-level precision artifact at small vector sizes, not a codegen issue.
This confirms that the Rust vector intrinsic API generalizes beyond softmax. For the activation functions tested here — each a composition of AscendC vector primitives — Rust and C++ produce identical performance. We expect this to hold for any kernel composed purely from vector intrinsics, since the codegen maps each Rust intrinsic call 1:1 to the same AscendC C++ call. Cube engine operations (matmul via Mmad) and multi-level buffer hierarchies (L1/L0A/L0B/L0C) are supported at the API level but have not yet been hardware-verified through the full pipeline.
4.5 Formal Equivalence Verification: AscendC vs AscendRS
Performance parity is compelling, but the strongest argument for the Rust codegen pipeline is bitwise equivalence — proving that Rust-generated kernels produce exactly the same numerical results as hand-written AscendC C++ kernels on real NPU hardware.
We selected three representative kernels that cover the most common neural network operation patterns:
- **ReLU** — single vector op: `output[i] = max(input[i], 0)` → `ascend_maxs_f32`
- **Sigmoid** — chained vector ops: `output[i] = 1/(1 + exp(-input[i]))` → `Muls` → `Exp` → `Adds` → `Reciprocal`
- **Vec Add** — binary vector op: `z[i] = x[i] + y[i]` → `ascend_add_f32`
For each kernel, we compiled two implementations:
- **AscendC original** — idiomatic C++ using the TQue pipeline (`EnQue`/`DeQue` implicit synchronization), as a 910B production engineer would write it
- **AscendRS equivalent** — C++ generated from Rust source via the `mlir_to_cpp` pipeline (TBuf + explicit `pipe_barrier(PIPE_ALL)`)
Both were run on the 310P NPU with identical inputs (256 f32 elements, deterministic PRNG) and compared at three levels:
| Test | C++ vs CPU | RS vs CPU | C++ vs RS |
|---|---|---|---|
| ReLU | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |
| Sigmoid | PASS (err=2.4e-3) | PASS (err=2.4e-3) | PASS (err=0.00) |
| Vec Add | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |
The C++ vs RS column shows bitwise identical output (max error = 0.0) for all three kernels. The NPU produces exactly the same bits whether the kernel was written in C++ or Rust. The small sigmoid CPU difference (2.4e-3) is the NPU’s Exp() vector unit precision vs x86 expf() — it affects both implementations equally and is not a codegen issue.
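"Bitwise identical" is a stronger check than the usual small-epsilon comparison; on the host it can be sketched as comparing the raw bit patterns of each f32 (our own illustration, not the actual test harness):

```rust
/// Compare two f32 buffers for exact bit equality, as opposed to the
/// usual |a - b| < eps tolerance check against a CPU reference.
fn bitwise_equal(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len()
        && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

fn main() {
    let cpp_out = [0.25f32, 0.75, -0.0];
    let rust_out = [0.25f32, 0.75, 0.0];
    // -0.0 == 0.0 numerically, but their bit patterns differ:
    assert!(cpp_out[2] == rust_out[2]);
    assert!(!bitwise_equal(&cpp_out, &rust_out));
    assert!(bitwise_equal(&cpp_out, &cpp_out));
    println!("bitwise check distinguishes -0.0 from 0.0");
}
```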
Here is the Rust sigmoid kernel — a chain of four vector intrinsic calls (plus barriers) that produces identical NPU output to the 40-line AscendC C++ class:
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn sigmoid(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();
        // sigmoid(x) = 1 / (1 + exp(-x)): one vector op per step,
        // with a barrier between each in-place chained op
        ascend_std::ascend_muls_f32(buf_out, buf_in, -1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_exp_f32(buf_out, buf_out, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_adds_f32(buf_out, buf_out, 1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_reciprocal_f32(buf_out, buf_out, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}
```
A notable discovery during this work: in-place chained vector operations on the 310P require explicit pipe_barrier(PIPE_ALL) between each step. Without barriers between Muls→Exp→Adds→Reciprocal on the same buffer, the next operation reads stale data. This is a hardware synchronization requirement that the Rust codegen pipeline now handles correctly — and the equivalence test serves as a regression test for this behavior.
4.6 The PTO Tile API Pipeline: Higher-Level Abstractions
The mlir_to_cpp path compiles Rust kernels by generating AscendC C++ with explicit TBuf + pipe_barrier patterns — equivalent to what a C++ programmer writes manually. A second codegen path, mlir_to_pto, targets the PTO (Programmable Tile Operations) dialect: a higher-level MLIR representation that lets kernels be expressed as operations on rectangular tiles of data rather than individual vector operations.
In the tile API, a softmax kernel is just four function calls:
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32) {
    // Each AICore block handles one ROWS x COLS tile
    let bid = ascend_std::get_block_idx() as usize;
    let offset = bid * ROWS * COLS;
    let t = tile_load_f32::<ROWS, COLS>(input.wrapping_add(offset));
    let r = tile_softmax_f32::<ROWS, COLS>(t);
    tile_store_f32::<ROWS, COLS>(output.wrapping_add(offset), r);
}
```
The tile_softmax_f32 call expands at compile time to the standard softmax decomposition (trowmax → trowexpandsub → texp → trowsum → trowexpanddiv). The shape parameters ROWS and COLS are compile-time constants, allowing ptoas (the PTO assembler) to assign optimal UB buffer offsets and synchronization flags automatically.
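The decomposition can be followed on the host with a ROWS×COLS row-softmax in plain Rust (our own sketch mirroring the trowmax → trowexpandsub → texp → trowsum → trowexpanddiv sequence, not the generated code):

```rust
const ROWS: usize = 2;
const COLS: usize = 4;

/// Row-wise softmax over a ROWS x COLS tile stored row-major,
/// following the same decomposition as tile_softmax_f32.
fn tile_softmax(t: &mut [f32; ROWS * COLS]) {
    for r in 0..ROWS {
        let row = &mut t[r * COLS..(r + 1) * COLS];
        let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max); // trowmax
        for v in row.iter_mut() { *v = (*v - m).exp(); } // trowexpandsub + texp
        let s: f32 = row.iter().sum();                   // trowsum
        for v in row.iter_mut() { *v /= s; }             // trowexpanddiv
    }
}

fn main() {
    let mut t = [1.0f32, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0];
    tile_softmax(&mut t);
    for r in 0..ROWS {
        let s: f32 = t[r * COLS..(r + 1) * COLS].iter().sum();
        assert!((s - 1.0).abs() < 1e-5); // each row sums to 1
    }
    println!("{:?}", t);
}
```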
Compilation Pipeline
```text
Rust source
  → rustc + mlir_to_pto codegen backend
  → PTO-MLIR (.pto)        [ascend_tile_* → pto.trowmax / pto.texp / ...]
  → ptoas --enable-insert-sync
  → AscendC C++ (.cpp)     [TROWMAX / TEXP / TROWEXPANDDIV + auto sync]
  → bisheng (CANN 8.5)
  → AICore kernel binary (.o)
```
Benchmark Results (Ascend 910B2, dav-c220)
We benchmarked 6 kernel variants covering both 1D (single-row) and 2D (multi-row) tile shapes on an Ascend 910B2 NPU. Each variant processes ROWS × COLS f32 values in a single AICore block, with 1 warmup iteration and 10 timed iterations. All results are verified for correctness against a CPU reference.
| Shape | Elements | Median (ms) | Max Error | Correctness |
|---|---|---|---|---|
| 1×1024 | 1,024 | 0.0046 | 1.05e-9 | PASS |
| 1×4096 | 4,096 | 0.0063 | 1.75e-10 | PASS |
| 1×8192 | 8,192 | 0.0086 | 2.62e-10 | PASS |
| 4×256 | 1,024 | 0.0054 | 2.79e-9 | PASS |
| 16×256 | 4,096 | 0.0049 | 3.26e-9 | PASS |
| 16×512 | 8,192 | 0.0049 | 2.79e-9 | PASS |
All six kernels pass correctness checks (max error < 1e-8, row sums = 1.0). The multi-row shapes (16×256, 16×512) are faster than the equivalent single-row shapes (1×4096, 1×8192) at the same element count — multi-row tiles let the hardware's vector pipeline process several rows in parallel.
Compared to the mlir_to_cpp vector softmax on the 310P (which ran at ~0.087 ms for 16,384 elements), the PTO tile kernels on the 910B2 run 10–18× faster at similar element counts. This reflects both the architectural advantages of the 910B2 (higher frequency, larger UB) and the efficiency of the PTO tile access pattern (a single TLOAD/TSTORE per block with automatically scheduled fine-grained sync, vs. per-operation barriers in the buffer-API path).
Numerical Precision
The PTO path achieves higher numerical precision than the scalar mlir_to_cpp path. Where the 310P scalar kernels showed max_err ≈ 1e-8, the 910B2 tile kernels show max_err ≈ 1e-9 to 1e-10 — an order of magnitude improvement. This comes from the PTO decomposition using hardware reduction instructions (TROWMAX, TROWSUM) that accumulate in higher internal precision before returning a float result.
4.7 Async Rust Kernels: Maintainability and Scheduler Freedom
The tile softmax kernel above is already barrier-free from the programmer’s perspective. But the underlying principle deserves deeper examination — because it motivates the long-term direction of the ascend-rs programming model and explains why the PTO path delivers more than just a cleaner API.
The Barrier Maintenance Problem
Look at the buffer-API style used by the kernels in sections 4.3 and 4.5. Even at this simple scale, the programmer must:

- Allocate named queues for each pipeline stage (`TQue<QuePosition::VECIN, 1>`)
- Issue `EnQue`/`DeQue` at every producer/consumer boundary
- Insert `pipe_barrier(PIPE_ALL)` at function exit to drain all in-flight ops
- Know the Ascend pipeline model (Mte2 → Vector → Mte1 DMA stages) well enough to place barriers correctly
A missing barrier is a silent data race — no compiler error, no runtime fault at small sizes, a subtle wrong-answer bug at scale. A spurious PIPE_ALL stall is a performance regression that is invisible in correctness tests. As kernels grow — Flash Attention, multi-head attention, fused softmax+dropout — this hand-maintained barrier graph diverges from the actual data dependencies. Bugs compound.
Ownership as Implicit Sequencing
The tile API sidesteps this through Rust’s ownership model:
```rust
// Each step consumes its input — you cannot accidentally reuse t_in after softmax
let t_in: Tile<1, 1024, f32> = tile_load_f32::<1, 1024>(input_ptr);
let t_out: Tile<1, 1024, f32> = tile_softmax_f32::<1, 1024>(t_in); // t_in moved
tile_store_f32::<1, 1024>(output_ptr, t_out); // t_out moved
```
This encodes the data-flow graph in the type system:

- `tile_load_f32` produces a `Tile` carrying a logical "Mte2 pending" token
- `tile_softmax_f32` waits for that token, then produces a `Tile` with a "V pending" token
- `tile_store_f32` waits for the V token, then issues Mte1
mlir_to_pto.rs translates this ownership chain to PTO-MLIR ops with no barrier calls at all (line 503 explicitly suppresses ascend_pipe_barrier). ptoas then sees a clean dependency graph and places set_flag/wait_flag only at the minimal required points.
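The "pending token" idea can be sketched with zero-sized marker types (a hypothetical host-side model, ours alone; the real ascend_std Tile type may differ): a tile is parameterized by the pipe whose result it carries, and each op consumes the previous stage's tile by move, so reusing a stale tile is a compile error rather than a data race.

```rust
use std::marker::PhantomData;

// Zero-sized pipe markers: which hardware pipe produced this tile.
struct Mte2; // GM -> UB load pending
struct V;    // Vector compute pending

// A tile whose type records the pipe token it carries.
struct Tile<Stage> { data: Vec<f32>, _stage: PhantomData<Stage> }

fn tile_load(src: &[f32]) -> Tile<Mte2> {
    Tile { data: src.to_vec(), _stage: PhantomData }
}

// Consumes the Mte2 tile (move), produces a V tile: the dependency
// edge lives in the signature, not in an explicit barrier call.
fn tile_softmax(t: Tile<Mte2>) -> Tile<V> {
    let m = t.data.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let e: Vec<f32> = t.data.iter().map(|&v| (v - m).exp()).collect();
    let s: f32 = e.iter().sum();
    Tile { data: e.into_iter().map(|v| v / s).collect(), _stage: PhantomData }
}

fn tile_store(dst: &mut [f32], t: Tile<V>) {
    dst.copy_from_slice(&t.data);
    // Using the original Mte2 tile here would not compile: it was moved.
}

fn main() {
    let input = [1.0f32, 2.0, 3.0, 4.0];
    let mut output = [0.0f32; 4];
    let t = tile_load(&input);
    let r = tile_softmax(t); // `t` is moved and can no longer be touched
    tile_store(&mut output, r);
    assert!((output.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    println!("{:?}", output);
}
```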
What Async Rust Would Add
Ownership chains handle sequential pipelines well. For more complex patterns — double-buffering, speculative prefetch, interleaved load-compute-store across multiple tiles — a sequential chain forces an artificial total order on operations that could overlap.
An async-based tile API would express independent ops as concurrent futures:
```rust
// Hypothetical async tile API — two independent loads can overlap on Mte2
async fn softmax_kernel(input: *const f32, output: *mut f32) {
    let (t0, t1) = join!(
        tile_load_f32::<1, 1024>(input),
        tile_load_f32::<1, 1024>(input.wrapping_add(1024)),
    );
    let (r0, r1) = join!(
        tile_softmax_f32::<1, 1024>(t0),
        tile_softmax_f32::<1, 1024>(t1),
    );
    tile_store_f32::<1, 1024>(output, r0).await;
    tile_store_f32::<1, 1024>(output.wrapping_add(1024), r1).await;
}
```
The await points mark where one stage must wait for another's result — exactly where required and nowhere else. `join!` expresses that the two loads can be issued to the Mte2 DMA engine simultaneously, letting the hardware overlap them.
What This Gives ptoas
The Ascend NPU has five independent hardware pipes: Scalar, Mte1 (UB→GM), Mte2 (GM→UB), Vector, and Cube. With async tile ops, mlir_to_pto.rs emits PTO-MLIR where the only sequencing edges are true data dependencies. ptoas’s --enable-insert-sync then inserts set_flag/wait_flag pairs only where a dst-pipe op consumes a src-pipe op’s output — no other barriers.
For the softmax decomposition, this means:
- `trowmax` (Vector) waits for `tload` (Mte2) → one `set_flag(MTE2, V, 0)`
- `trowexpandsub → texp → trowsum → trowexpanddiv` are all Vector ops with sequential deps → no barriers between them (same pipe; hardware queues enforce order)
- `tstore` (Mte1) waits for `trowexpanddiv` (Vector) → one `set_flag(V, MTE1, 0)`
Total: 2 fine-grained flags, compared to pipe_barrier(PIPE_ALL) at every step in the buffer-API path. The 16×512 shape reaching 12.9 GB/s is a direct measurement of this — 16 independent row-softmax ops exposed to ptoas as a single wide tile op, letting the scheduler find the optimal overlap.
Current State
| Layer | Status |
|---|---|
| Tile API (sync ownership chain) | ✅ Working, benchmarked on 910B2 |
| `mlir_to_pto.rs` barrier suppression | ✅ Done — `ascend_pipe_barrier` dropped |
| `ptoas --enable-insert-sync` | ✅ Working — auto-inserts fine-grained sync |
| Async tile API (`tile_join_load`, `tile_prefetch`) | ✅ Done — `tile_join_load_f32` and `tile_prefetch_f32` added to `ascend_std` |
| Multi-tile double-buffering | ✅ Done — GEP offset fix in `mlir_to_pto.rs`; verified on 910B2 |
Double-Buffering Results (910B2, 2026-04-02)
`tile_softmax_double_buf` processes two 1×1024 tiles per launch, using `tile_prefetch_f32` to issue the second load before the first tile's compute begins. `ptoas` schedules the two `pto.tload` ops concurrently on Mte2 because they have distinct `partition_view` offsets (`[%c0,%c0]` and `[%c1,%c0]`) — no data dependency between them.
| Kernel | Tiles/launch | Per-tile avg | Per-tile min |
|---|---|---|---|
| `tile_softmax_1x1024` (baseline) | 1 | 0.0055 ms | 0.0045 ms |
| `tile_softmax_double_buf` | 2 | 0.0034 ms | 0.0025 ms |
1.62× per-tile throughput (avg); 1.82× best-case. See Appendix J §J.4 for full kernel source, generated PTO-MLIR, and the two-bug fix in mlir_to_pto.rs that made this possible.