Appendix J: Step-by-Step Reproducible Examples
This appendix walks through six complete, runnable ascend-rs examples from scratch. Each example includes the full source code, the exact shell commands to build and run it, the expected terminal output, and transcripts captured from real hardware runs. The goal is to let anyone with an Ascend NPU reproduce every result in this book.
Prerequisites
Hardware and Software
| Requirement | Minimum | Tested |
|---|---|---|
| Ascend NPU | Ascend 310P / 910B | Ascend 310P3, Ascend 910B2 |
| CANN | 8.1.RC1 | 8.1.RC1 (310P), 8.5.0 (910B) |
| Rust toolchain | nightly-2025-05-01 | nightly-2025-08-04 |
| OS | Linux aarch64 / x86_64 | Ubuntu 22.04 aarch64 |
| Driver | ≥ 24.1 | bundled with CANN |
One-time Environment Setup
# 1. Clone the repository
git clone https://github.com/ascend-rs/ascend-rs
cd ascend-rs
# 2. Source the CANN environment (adjust path for your installation)
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
# Or for CANN 8.5 standalone:
# source /usr/local/Ascend/cann-8.5.0/set_env.sh
# 3. Set the target SoC (adjust for your hardware)
export ACLRS_SOC_VERSION=Ascend310P3 # for 310P
# export ACLRS_SOC_VERSION=Ascend910B2 # for 910B2
# export ACLRS_SOC_VERSION=Ascend910_9392 # for 910 (older 9392 variant)
# 4. Verify the NPU is visible
npu-smi info
Expected output of npu-smi info (310P example):
+-------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+------------------+-------------------+-------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) HBM-Usage(MB) Aicore(%) |
| Chip | | Bus-Id |
+==================+===================+=================================================+
| 0 310P3 | OK | 14 42 372 / 8192 0 |
| 0 | | 0000:82:00.0 |
+------------------+-------------------+-------------------------------------------------+
Example 1: Hello World — ACL Device Initialization
The simplest possible ascend-rs program: initialize the ACL runtime, open a device, create a context and stream, print the device descriptor, and exit. This verifies that your driver, CANN, and Rust toolchain are all working together.
Source Code
examples/acl_hello_world/src/main.rs:
use anyhow::Result;
use ascend_rs::prelude::*;
use log::info;
use simple_logger::SimpleLogger;
fn main() -> Result<()> {
SimpleLogger::new().env().init().ok();
// Each of these RAII wrappers acquires a resource on construction
// and releases it automatically on drop. The compiler enforces the
// correct lifetime nesting: Device < AclContext < AclStream.
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
info!("Device {} initialized successfully", device.descriptor());
info!("Context handle: {:p}", context.as_ptr());
info!("Stream handle: {:p}", stream.as_ptr());
// Resources are released in reverse order when they go out of scope.
Ok(())
}
Build and Run
# From the repository root:
cd examples/acl_hello_world
RUST_LOG=info cargo run --release
Expected Output
2026-03-31T09:14:02Z INFO [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO [acl_hello_world] Stream handle: 0x55a7b2c30080
The device name (Ascend310P3, Ascend910B2, etc.) will match the SoC set in
ACLRS_SOC_VERSION. If you see Device startup failed the driver is not
running — check npu-smi info and ensure the device shows Health: OK.
Screenshot (310P hardware)
$ cd examples/acl_hello_world && RUST_LOG=info cargo run --release
Compiling acl_hello_world v0.1.0
Finished `release` profile [optimized] target(s) in 3.2s
Running `target/release/acl_hello_world`
2026-03-31T09:14:02Z INFO [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO [acl_hello_world] Stream handle: 0x55a7b2c30080
What the output tells you:
- Device Ascend310P3 initialized successfully — the ACL runtime found the device and the CANN driver stack is functional.
- The context and stream handles are non-null kernel objects allocated by the driver; they are freed automatically when main returns.
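The reverse-order release is ordinary Rust drop semantics rather than anything ACL-specific. A standalone sketch (hypothetical Guard type, no NPU required) shows that locals drop in reverse declaration order, mirroring the stream → context → device teardown:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A tiny RAII stand-in: records its label into a shared log when dropped.
struct Guard(&'static str, Rc<RefCell<Vec<&'static str>>>);
impl Drop for Guard {
    fn drop(&mut self) {
        self.1.borrow_mut().push(self.0);
    }
}

/// Returns the order in which three nested guards are released.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _device = Guard("device", log.clone());
        let _context = Guard("context", log.clone());
        let _stream = Guard("stream", log.clone());
    } // scope ends: locals drop in reverse declaration order
    Rc::try_unwrap(log).unwrap().into_inner()
}

fn main() {
    assert_eq!(drop_order(), vec!["stream", "context", "device"]);
    println!("drop order: stream, context, device");
}
```

This is the same guarantee the compiler-enforced lifetime nesting in the example relies on: the stream can never outlive its context, nor the context its device.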
Example 2: Vector Softmax — Rust Kernel on Real Hardware
This example runs the full softmax kernel from Chapter 4 on real NPU hardware:
a 1024-element f32 array passes through max → exp → sum → divide on the NPU
vector pipeline, and the result is verified against a CPU reference.
Source Code
Kernel (examples/bench_softmax_rs/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
/// Vectorized row softmax kernel.
///
/// Uses the ascend_std vector intrinsics which the mlir_to_cpp backend
/// translates to AscendC DataCopy / ReduceMax / Exp / Muls / ReduceSum calls.
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
unsafe {
let n = *len_buf;
// Allocate UB (Unified Buffer) scratch tiles
let in_buf = ascend_std::ascend_buf_alloc(n);
let out_buf = ascend_std::ascend_buf_alloc(n);
let work = ascend_std::ascend_buf_alloc(n);
let rwork = ascend_std::ascend_buf_alloc(n);
// DMA: global memory → UB
ascend_std::ascend_buf_load_f32(in_buf, input, n);
ascend_std::ascend_pipe_barrier(); // wait for Mte2 engine
// Numerically stable softmax: subtract max before exp
let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
ascend_std::ascend_exp_f32(out_buf, out_buf, n);
let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);
// DMA: UB → global memory
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(output, out_buf, n);
}
}
Host (examples/bench_softmax_rs/src/main.rs, abridged):
use ascend_rs::prelude::*;
fn main() -> anyhow::Result<()> {
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
let n: u32 = 1024;
let input: Vec<f32> = (0..n as usize)
.map(|i| ((i as f32) * 0.01).sin() * 3.0)
.collect();
// Transfer input to device, allocate output and length buffers
let mut d_input = DeviceBuffer::from_slice(&input)?;
let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(n as usize)? };
let mut d_len = DeviceBuffer::from_slice(&[n])?;
// Load and launch the kernel (1 block)
let kernel_loader = KernelLoader::new()?;
let kernel = kernel_loader.get_kernel("softmax")?;
let mut args: [*mut std::ffi::c_void; 3] = [
d_input.as_mut_ptr() as *mut _,
d_output.as_mut_ptr() as *mut _,
d_len.as_mut_ptr() as *mut _,
];
unsafe { kernel.launch(1, &stream, &mut args)?; }
stream.synchronize()?;
// Verify against CPU reference
let output = d_output.to_host()?;
let sum: f32 = output.iter().sum();
println!("sum = {:.6} (expected ≈ 1.0)", sum);
println!("output[0..4] = {:?}", &output[..4]);
Ok(())
}
Build and Run
cd examples/bench_softmax_rs
# Build the kernel (triggers the CANN compilation pipeline):
# Rust source → MLIR → C++ (mlir_to_cpp) → bisheng → .acl.o
RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv
The kernel compilation step (bisheng) takes ~5 seconds on first build;
subsequent builds use the cargo cache.
Expected Output
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Running softmax benchmark
size=256 pass=true max_err=1.22e-8 sum=1.000000 rust_vec=0.077ms
size=1024 pass=true max_err=8.34e-9 sum=1.000000 rust_vec=0.076ms
size=4096 pass=true max_err=7.11e-9 sum=1.000000 rust_vec=0.079ms
size=16384 pass=true max_err=6.89e-9 sum=1.000000 rust_vec=0.087ms
Screenshot (310P hardware, full benchmark comparison)
$ RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv
Compiling bench_softmax_rs v0.1.0
Finished `release` profile [optimized] target(s) in 8.4s
Running `target/release/bench_softmax_rs --csv /tmp/softmax_results.csv`
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=256 rust_vec=0.077ms pass=true max_err=1.22e-8
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=1024 rust_vec=0.076ms pass=true max_err=8.34e-9
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=4096 rust_vec=0.079ms pass=true max_err=7.11e-9
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=16384 rust_vec=0.087ms pass=true max_err=6.89e-9
CSV written to /tmp/softmax_results.csv
Running the full comparison (Rust vs C++ side-by-side):
# From repository root:
cd benchmarks/softmax
bash bench.sh
=== Softmax Benchmark ===
--- Rust softmax benchmark ---
size=16384 rust_scalar=2.221ms rust_vec=0.087ms pass=true
--- C++ softmax benchmark ---
size=16384 cpp_naive=2.073ms cpp_opt=0.089ms pass=true
Performance summary (16384 elements):
Rust vector vs C++ optimized: 0.087ms vs 0.089ms → Rust is 1.02x faster
Vector speedup over scalar: 25.5x
Correctness: all sizes PASS (max_err < 1e-8)
How the Pipeline Works
Each step in the compilation pipeline can be inspected by looking at the
intermediate files in kernels/target/:
kernels/target/davinci-huawei-none/release/deps/
├── softmax_kernels.mlir ← MLIR output from rustc codegen
├── softmax_kernels.mlir.acl.gen.cpp ← C++ generated by mlir_to_cpp
└── softmax_kernels.acl.o ← NPU object file from bisheng
The generated C++ (acl.gen.cpp) shows the direct AscendC API calls that the
Rust intrinsics compile to:
// Generated from: ascend_std::ascend_exp_f32(out_buf, out_buf, n)
Exp(out_buf_local, out_buf_local, n);
pipe_barrier(PIPE_V);
Example 3: Tile Softmax — PTO Codegen Path on Ascend 910B
This example demonstrates the newer PTO (Programmable Tile Operations)
codegen path, which targets the Ascend 910B (dav-c220) matrix pipeline. The
tile API expresses 2D tile operations (tile_load, tile_softmax,
tile_store) that compile through ptoas — the PTO assembler — rather than
the standard C++ codegen.
This is the most advanced example and requires an Ascend 910B device with
ptoas available. It demonstrates the complete pipeline:
Rust tile API → MLIR → PTO-MLIR → ptoas → CCE C++ → ccec → .acl.o
Source Code
Kernel (examples/tile_softmax/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
use ascend_std::tile::{tile_load_f32, tile_softmax_f32, tile_store_f32, Tile};
/// Row-wise softmax over a ROWS × COLS tile of f32 values.
///
/// The tile API is a 2D abstraction over the NPU's vector engine:
/// - `tile_load_f32` → PTO `tload` (DMA from global memory to UB tile)
/// - `tile_softmax_f32` → PTO reduction ops: trowmax → trowexpandsub →
/// texp → trowsum → trowexpanddiv
/// - `tile_store_f32` → PTO `tstore` (DMA from UB tile to global memory)
///
/// The `ptoas --enable-insert-sync` flag automatically inserts set_flag /
/// wait_flag barriers between tile operations.
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax(input: *const f32, output: *mut f32) {
let block_idx = ascend_std::get_block_idx() as usize;
let offset = block_idx * 1 * 1024; // ROWS=1, COLS=1024
// Load tile from global memory
let t_in: Tile<1, 1024, f32> =
tile_load_f32::<1, 1024>(input.wrapping_add(offset));
// Compute softmax: max → shift → exp → sum → divide
let t_out: Tile<1, 1024, f32> = tile_softmax_f32::<1, 1024>(t_in);
// Store result to global memory
tile_store_f32::<1, 1024>(output.wrapping_add(offset), t_out);
}
Host (examples/tile_softmax/src/main.rs, abridged):
use ascend_rs::prelude::*;
fn main() -> anyhow::Result<()> {
const ROWS: usize = 1;
const COLS: usize = 1024;
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
// Sinusoidal input for visual verification
let input: Vec<f32> = (0..ROWS * COLS)
.map(|i| ((i as f32) * 0.01).sin() * 3.0)
.collect();
let mut d_input = DeviceBuffer::from_slice(&input)?;
let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(ROWS * COLS)? };
let kernel_loader = KernelLoader::new()?;
let kernel = kernel_loader.get_kernel("tile_softmax")?;
let mut args: [*mut std::ffi::c_void; 2] = [
d_input.as_mut_ptr() as *mut _,
d_output.as_mut_ptr() as *mut _,
];
unsafe { kernel.launch(1, &stream, &mut args)?; } // 1 block
stream.synchronize()?;
let output = d_output.to_host()?;
let sum: f32 = output.iter().sum();
let max_err = output.iter()
.zip(softmax_cpu(&input, ROWS, COLS).iter())
.map(|(a, b)| (a - b).abs())
.fold(0.0f32, f32::max);
println!("tile_softmax: max_err={:.4e} sum={:.6} {}",
max_err, sum,
if max_err < 1e-5 && (sum - 1.0).abs() < 1e-4 { "PASS" } else { "FAIL" });
Ok(())
}
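The host code compares against softmax_cpu, whose body is elided in the abridged listing above. A plain-Rust sketch consistent with the call site (signature inferred from the call, not the committed source) looks like this:

```rust
/// Hypothetical sketch of the CPU reference: row-wise, numerically stable
/// softmax. Signature inferred from the host code's call site.
fn softmax_cpu(input: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; rows * cols];
    for r in 0..rows {
        let row = &input[r * cols..(r + 1) * cols];
        // Subtract the row max before exponentiating, matching the kernel.
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        for c in 0..cols {
            out[r * cols + c] = exps[c] / sum;
        }
    }
    out
}

fn main() {
    let out = softmax_cpu(&[1.0, 2.0, 3.0, 4.0], 1, 4);
    let sum: f32 = out.iter().sum();
    assert!((sum - 1.0).abs() < 1e-6); // each row sums to 1
    assert!(out[3] > out[0]); // softmax preserves ordering
    println!("sum = {sum}");
}
```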
Build and Run
# Required environment (Ascend 910B with CANN 8.5 and ptoas)
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910_9392 # adjust for your SoC
export ACLRS_CODEGEN_PATH=pto # enable PTO path
export ACLRS_PTOAS_PATH=/path/to/ptoas # ptoas assembler binary
export ACLRS_PTO_ISA_PATH=/path/to/pto-isa/include # pto-isa headers
export LD_LIBRARY_PATH=/data/llvm20/lib:${ACLRS_CANN_PATH}/aarch64-linux/lib64:\
/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common
source ${ACLRS_CANN_PATH}/set_env.sh
export PATH=${ACLRS_CANN_PATH}/tools/ccec_compiler/bin:$PATH
cd examples/tile_softmax
cargo run --release
Compilation Pipeline Trace
The build system prints each step. With RUST_LOG=debug you can see the exact
commands:
# Step 1: Rust → MLIR (rustc with custom codegen backend)
rustc --crate-type lib -Z codegen-backend=librustc_codegen_mlir.so ...
→ tile_softmax_kernels.mlir
# Step 2: MLIR → PTO-MLIR (mlir_to_pto.rs)
→ tile_softmax_kernels.acl.pto
# Step 3: PTO-MLIR → CCE C++ (ptoas)
ptoas --enable-insert-sync --pto-arch=a3 tile_softmax_kernels.acl.pto \
-o tile_softmax_kernels.acl.pto.cpp
# Step 4: CCE C++ → NPU object (ccec)
ccec -c -O3 -x cce -DMEMORY_BASE --cce-aicore-arch=dav-c220-vec \
-mllvm -cce-aicore-addr-transform \
-mllvm -cce-aicore-dcci-insert-for-scalar=false \
-I/path/to/pto-isa/include \
tile_softmax_kernels.acl.pto.cpp \
-o tile_softmax_kernels.acl.o
Intermediate Artifacts (Committed)
The intermediate files generated during the verified 2026-04-01 run on Ascend 910B2
are committed to the repository under examples/tile_softmax/artifacts/. You can
inspect each stage of the pipeline without installing any tools:
| File | Stage | Description |
|---|---|---|
| tile_softmax_kernels.acl.pto | MLIR → PTO-MLIR | PTO-MLIR dialect emitted by mlir_to_pto.rs |
| tile_softmax_kernels.acl.pto.cpp | PTO-MLIR → CCE C++ | AscendC C++ generated by ptoas --enable-insert-sync |
| tile_softmax_kernels.acl.pto.compat-a3.hpp | CANN 8.5 shim | Compatibility header patched by pto-compat-cann85.hpp |
For the multi-shape benchmark, see the equivalent artifacts in
examples/bench_softmax_tile/artifacts/.
The complete PTO-MLIR output for the 1×1024 softmax kernel
(tile_softmax_kernels.acl.pto):
// Generated by ascend-rs mlir_to_pto — DO NOT EDIT
// Compile: ptoas --enable-insert-sync <file.pto> -o <file.cpp>
module {
func.func @tile_softmax(%arg601: !pto.ptr<f32>, %arg602: !pto.ptr<f32>) {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c1024 = arith.constant 1024 : index
%pto0 = pto.make_tensor_view %arg601, shape = [%c1, %c1024], strides = [%c1024, %c1] : !pto.tensor_view<?x?xf32>
%pto1 = pto.partition_view %pto0, offsets = [%c0, %c0], sizes = [%c1, %c1024] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<1x1024xf32>
%pto2 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, v_row=1, v_col=1024, blayout=row_major, slayout=none_box, fractal=512, pad=0>
pto.tload ins(%pto1 : !pto.partition_tensor_view<1x1024xf32>) outs(%pto2 : !pto.tile_buf<...>)
// scratch tile for trowmax
%pto3 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=8, cols=1, ...> // row-max result
%pto4 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // scratch
%pto5 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // shifted
%pto6 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // exp result
%pto7 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=8, cols=1, ...> // row-sum result
%pto8 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // final output
// softmax decomposition:
pto.trowmax ins(%pto2, %pto4 : ...) outs(%pto3 : ...) // Step 1: max per row
pto.trowexpandsub ins(%pto2, %pto3 : ...) outs(%pto5 : ...) // Step 2: x - max
pto.texp ins(%pto5 : ...) outs(%pto6 : ...) // Step 3: exp(x - max)
pto.trowsum ins(%pto6, %pto4 : ...) outs(%pto7 : ...) // Step 4: sum
pto.trowexpanddiv ins(%pto6, %pto7 : ...) outs(%pto8 : ...) // Step 5: / sum
%pto9 = pto.make_tensor_view %arg602, shape = [%c1, %c1024], strides = [%c1024, %c1] : !pto.tensor_view<?x?xf32>
%pto10 = pto.partition_view %pto9, offsets = [%c0, %c0], sizes = [%c1, %c1024] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<1x1024xf32>
pto.tstore ins(%pto8 : ...) outs(%pto10 : ...)
return
}
}
After ptoas --enable-insert-sync, the CCE C++ kernel entry point
(tile_softmax_kernels.acl.pto.cpp,
excerpt):
extern "C" __global__ AICORE void tile_softmax(__gm__ float* v1, __gm__ float* v2) {
// ptoas allocates UB tiles at compile-time offsets (v8..v14)
Tile<TileType::Vec, float, 1, 1024, BLayout::RowMajor, ...> v18; // input tile
TLOAD(v18, v17);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0); // auto-inserted sync
// Softmax reduction ops:
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
TROWMAX(v20, v18, v23); pipe_barrier(PIPE_V);
TROWEXPANDSUB(v24, v18, v20); pipe_barrier(PIPE_V);
TEXP(v25, v24); pipe_barrier(PIPE_V);
TROWSUM(v27, v25, v23); pipe_barrier(PIPE_V);
TROWEXPANDDIV(v30, v25, v27);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0); // auto-inserted sync
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
TSTORE(v33, v30);
pipe_barrier(PIPE_ALL);
}
The __global__ keyword marks this as a host-callable entry point. Without it,
ccec compiles the function successfully but the runtime cannot dispatch it
(symptom: MTE DDR address out of range, error code 0x800000). This was a
non-obvious bug fixed in commit 04c80ac6.
Expected Output
2026-04-01T12:17:35Z INFO [tile_softmax] tile_softmax test: ROWS=1, COLS=1024, n=1024
2026-04-01T12:17:35Z INFO [tile_softmax] Device Ascend910_9392 initialized
2026-04-01T12:17:35Z INFO [tile_softmax] Launching tile_softmax kernel (1 block, 1x1024 f32)...
2026-04-01T12:17:36Z INFO [tile_softmax] tile_softmax: max_err=1.8626e-9 sum=1.000000 sum_ok=true PASS
2026-04-01T12:17:36Z INFO [tile_softmax] tile_softmax PASSED
The max_err=1.8626e-9 result was recorded on 2026-04-01 on Ascend 910B2 hardware (Ascend910_9392, dav-c220). The PTO tile reduction instructions (TROWMAX, TROWSUM) accumulate with higher internal precision before returning f32, achieving ~10× better numerical accuracy than the scalar mlir_to_cpp path (which yields max_err ≈ 1e-8 on the same data).
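The accuracy gap that comes from accumulator precision is easy to reproduce on the CPU. This illustrative sketch (synthetic data, no NPU required) accumulates the same f32 inputs with an f32 accumulator versus an f64 accumulator, the same trade-off the tile reduction instructions make internally:

```rust
// Sum with a plain f32 accumulator: rounding drift grows with n.
fn sum_f32(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Widen each element to f64 before accumulating, like a
// higher-internal-precision reduction would.
fn sum_f64(xs: &[f32]) -> f64 {
    xs.iter().map(|&x| x as f64).sum()
}

fn main() {
    let xs = vec![0.1f32; 1_000_000];
    // Each element is exactly the f64 value of 0.1f32, so this is the
    // mathematically exact sum of the inputs.
    let exact = 0.1f32 as f64 * 1_000_000.0;
    let err32 = (sum_f32(&xs) as f64 - exact).abs();
    let err64 = (sum_f64(&xs) - exact).abs();
    // The f32 accumulator drifts by orders of magnitude more.
    assert!(err32 > 100.0 * err64);
    println!("err32 = {err32:.3}, err64 = {err64:.3e}");
}
```

The numbers differ from the NPU's (the hardware reductions are tree-shaped, not sequential), but the direction of the effect is the same: a wider accumulator shrinks max_err.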
What Makes This Different from Example 2
| | Example 2 (Vector Softmax) | Example 3 (Tile Softmax) |
|---|---|---|
| Codegen path | mlir_to_cpp → bisheng | mlir_to_pto → ptoas → ccec |
| Abstraction | Scalar intrinsics (ascend_reduce_max_f32) | 2D tile ops (tile_softmax_f32) |
| Target hardware | 310P or 910B (vector engine) | 910B (dav-c220, a2a3 path) |
| Intermediate format | AscendC C++ | PTO-MLIR dialect |
| Barriers | Manual (ascend_pipe_barrier) | Auto-inserted by ptoas --enable-insert-sync |
| Parallelism model | 1 block, scalar loops | 1 block, 2D tile |
| Verified max_err | ~1e-8 (310P hardware) | ~1.9e-9 (910B2 hardware, 2026-04-01) |
Example 4: Double-Buffer Tile Softmax
Extends Example 3 to process two tiles per kernel launch using tile_prefetch_f32, overlapping Mte2 DMA (tile 1 load) with Vector compute (tile 0 softmax). See §4.7 for the performance results.
Source Code
Kernel (examples/tile_softmax_double_buf/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
use ascend_std::tile::{
tile_load_f32, tile_prefetch_f32, tile_softmax_f32, tile_store_f32, Tile,
};
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax_double_buf(input: *const f32, output: *mut f32) {
const ROWS: usize = 1;
const COLS: usize = 1024;
const TILE_ELEMS: usize = ROWS * COLS;
// --- Prologue: issue both loads before any compute ---
// t0 loads tile 0 (offset 0); t1 prefetches tile 1 (offset TILE_ELEMS).
let t0: Tile<ROWS, COLS, f32> = tile_load_f32::<ROWS, COLS>(input);
let t1: Tile<ROWS, COLS, f32> =
tile_prefetch_f32::<ROWS, COLS>(input.wrapping_add(TILE_ELEMS));
// --- Compute tile 0 (Mte2 for t1 can overlap this on the hardware) ---
let r0: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t0);
// --- Compute tile 1 ---
let r1: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t1);
// --- Store results ---
tile_store_f32::<ROWS, COLS>(output, r0);
tile_store_f32::<ROWS, COLS>(output.wrapping_add(TILE_ELEMS), r1);
}
The move-ownership pattern enforces the pipeline at compile time: t0 is consumed by tile_softmax_f32 before t1 is used, so there is no data race. tile_prefetch_f32 is identical to tile_load_f32 at the hardware level; the different name documents the programmer’s intent.
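The pattern can be demonstrated with plain Rust types (hypothetical stand-ins, not the ascend_std tile API): each stage takes its input by value, so the borrow checker rejects any reordering that would reuse a consumed tile.

```rust
// Hypothetical stage types standing in for UB tiles.
struct Loaded(Vec<f32>);
struct Computed(Vec<f32>);

fn tile_load(data: &[f32]) -> Loaded {
    Loaded(data.to_vec())
}

// Takes `Loaded` by value: each tile can be computed on exactly once.
fn tile_softmax(t: Loaded) -> Computed {
    let max = t.0.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = t.0.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    Computed(exps.iter().map(|e| e / sum).collect())
}

fn tile_store(t: Computed) -> Vec<f32> {
    t.0
}

fn main() {
    let t0 = tile_load(&[0.0, 1.0]);
    let r0 = tile_softmax(t0);
    // tile_softmax(t0); // error[E0382]: use of moved value: `t0`
    let out = tile_store(r0);
    assert!((out.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    println!("{out:?}");
}
```

Uncommenting the second tile_softmax(t0) call fails at compile time, which is the property the kernel relies on: no stage can observe a tile the pipeline has already consumed.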
Build and Run
# Same environment as Example 3 (Ascend 910B with CANN 8.5 and ptoas)
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910_9392
export ACLRS_CODEGEN_PATH=pto
export ACLRS_PTOAS_PATH=/path/to/ptoas
export ACLRS_PTO_ISA_PATH=/path/to/pto-isa/include
export LD_LIBRARY_PATH=/data/llvm20/lib:${ACLRS_CANN_PATH}/aarch64-linux/lib64:\
/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common
source ${ACLRS_CANN_PATH}/set_env.sh
export PATH=${ACLRS_CANN_PATH}/tools/ccec_compiler/bin:$PATH
cd examples/tile_softmax_double_buf
cargo run --release
Generated PTO-MLIR
The key difference from Example 3 is that the two loads produce distinct partition_view ops with different row offsets:
// tile 0: load from row 0
%pto1 = pto.partition_view %pto0, offsets = [%c0, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto1 : ...) outs(%pto2 : ...)
// tile 1: load from row 1 (offset 1024 elements = row 1 with cols=1024)
%pto3 = pto.partition_view %pto0, offsets = [%c1, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto3 : ...) outs(%pto4 : ...)
// softmax(t0) — Vector pipe; Mte2 can overlap with tload above
pto.trowmax ins(%pto2, ...) outs(...)
pto.trowexpandsub ...
pto.texp ...
pto.trowsum ...
pto.trowexpanddiv ins(...) outs(%pto10 : ...)
// softmax(t1)
pto.trowmax ins(%pto4, ...) outs(...)
...
pto.trowexpanddiv ins(...) outs(%pto16 : ...)
// stores — both at row 0 and row 1 of output
%pto18 = pto.partition_view %pto17, offsets = [%c0, %c0], ...
pto.tstore ins(%pto10 : ...) outs(%pto18 : ...)
%pto19 = pto.partition_view %pto17, offsets = [%c1, %c0], ...
pto.tstore ins(%pto16 : ...) outs(%pto19 : ...)
With offsets=[%c0,%c0] and offsets=[%c1,%c0] encoding different rows, ptoas recognises the two tload ops as accessing independent memory regions and schedules them concurrently on the Mte2 pipe.
Expected Output
2026-04-02T06:14:07Z INFO [tile_softmax_double_buf] double_buf 2×(1×1024): total avg=0.0068ms min=0.0049ms max=0.0140ms | per-tile avg=0.0034ms min=0.0024ms | max_err=3.26e-9 PASS
Raw results: examples/tile_softmax_double_buf/results/bench_double_buf_910b2_2026-04-02.csv.
The GEP Offset Bug Fix
Before this example could work correctly, mlir_to_pto.rs had two bugs:
Bug 1 — make_pv always emitted offsets=[%c0,%c0]:
The GEP index was tracked in gep_offsets but never passed to make_pv. Fixed by adding elem_offset: u32 to make_pv and converting it to (row_off, col_off) using cols as stride.
Bug 2 — Pattern 3 alias chain was flattened:
The load-from-alloca pattern (Pattern 3) called ctx.resolve_ptr(&stored) before inserting the alias, which skipped the intermediate GEP node (%gep → %arg0) where gep_offsets[%gep] = 1024 was recorded. Fixed by storing the immediate alias without resolving first, so resolve_offset can traverse the full chain.
Example 5: Linalg-Bridge Softmax — Upstream MLIR on 910B2
This example walks the same softmax kernel through the linalg ingress bridge, on real 910B2 hardware. The Rust front-end is not used at all; the source is a two-line linalg.softmax op in upstream MLIR, the kind of fixture you would find in an upstream MLIR test suite or extract from a torch-mlir FX export. See §4.7 for the explanatory context.
Source
The complete fixture is two lines of upstream linalg:
// benchmarks/linalg/kernels_upstream_shape_matched/softmax_upstream_1x1024.mlir
func.func @upstream_softmax_1x1024(%arg0: tensor<1x1024xf32>) -> tensor<1x1024xf32> {
%0 = tensor.empty() : tensor<1x1024xf32>
%1 = linalg.softmax dimension(1) ins(%arg0 : tensor<1x1024xf32>)
outs(%0 : tensor<1x1024xf32>) -> tensor<1x1024xf32>
return %1 : tensor<1x1024xf32>
}
A torch-mlir export captured from a 4-line PyTorch script (/tmp/torch_mlir_linalg/dump_simple.py on adablue) arrives in essentially the same shape — see benchmarks/linalg/kernels_torch_mlir_shape_matched/ for add_tm.mlir, exp_tm.mlir, silumul_tm.mlir. The pinned torch-mlir wheel (torch-mlir-20260421.789) did not export linalg.softmax directly; instead it lowers softmax to a linalg.generic reduction sequence that the bridge handles via the GenericUnaryKind::Exp + GenericBinop matchers added in commit 299de147.
Build and Run
# adablue (host-side build) — turn upstream linalg into AscendC C++
cd /home/y00577373/ascend-rs-priv
cargo build -p mlir_to_cpp_tests --release --bin linalg_to_ascendc
crates/mlir_to_cpp_tests/target/release/linalg_to_ascendc \
benchmarks/linalg/kernels_upstream_shape_matched/softmax_upstream_1x1024.mlir \
/tmp/sm_upstream.cce
# 910c (NPU-side build + run) — sync, then compile to .acl.o and execute
ssh 910c
cd /data/yuyijun/ascend-rs/benchmarks/linalg_bridge_bench
ASCEND_DEVICE_ID=2 cargo run --release
Expected Output (910B2 chip 2, 2026-04-22, 3 repeat runs)
[bridge_bench] pair=softmax_1x1024
ascendrs (hand-written) : min= 4.83 µs p50= 5.21 µs mean= 5.34 µs
upstream linalg (bridge): min= 4.95 µs p50= 5.27 µs mean= 5.42 µs
Δmin= 0.12 µs Δp50= 0.06 µs Δmean= 0.08 µs (<8% across all stats)
max_err vs CPU reference = 1.86e-9 PASS
[bridge_bench] pair=add_1x1024
ascendrs : min= 4.18 µs upstream: min= 4.20 µs Δ= 0.02 µs PASS
[bridge_bench] pair=exp_1x1024
ascendrs : min= 4.46 µs upstream: min= 4.54 µs Δ= 0.08 µs PASS
[bridge_bench] pair=matmul_32x64x32
ascendrs : min= 1586.1 µs upstream: min= 1586.4 µs Δ= 0.3 µs PASS (<0.02%)
What the AscendC C++ Looks Like (Byte-Identical Check)
The headline claim is byte-identical emit. The host-only test that proves it:
$ cargo test -p mlir_to_cpp_tests --release \
--test upstream_matches_ascendrs_byte_identical -- --nocapture
running 5 tests
test add_1x1024_byte_identical ... ok
test exp_1x1024_byte_identical ... ok
test softmax_1x1024_byte_identical ... ok
test matmul_32x64x32_byte_identical ... ok
test silumul_1x1024_byte_identical ... ok (CPU side; not runnable on 910B2 today)
5 passed; 0 failed
Each test runs linalg_to_ascendc on both kernels_ascendrs/<name>.mlir (hand-written ascendrs-form) and kernels_upstream_shape_matched/<name>_upstream.mlir (upstream linalg), then byte-compares the resulting .cce. Zero diff bytes means the bridge is structurally a no-op past hop 1; the downstream mlir_to_cpp emitter has nothing to distinguish them.
Pipeline Diagram
Rust path (Examples 2–4)
┌────────────────────────────┐
softmax.rs ── rustc ──┐ │ rustc_codegen_mlir │
│ │ │ │
│ │ ▼ │
│ │ MLIR (LLVM-D) │
│ └─────────────┬───────────────┘
│ │
│ Bridge path (this example)
│ ┌─────────────────────────────┐
upstream.mlir ────────┴─────► │ linalg_to_ascend_tile │
torch-mlir.mlir ──────────► │ │ │
│ ▼ │
│ ascend_tile MLIR │
└─────────────┬───────────────┘
│
▼ (same emitter from here)
mlir_to_cpp
│
▼
AscendC C++
│
▼
bisheng
│
▼
910B2 NPU
Branches reconverge at mlir_to_cpp. Past that point the hardware sees the same bytes regardless of which branch the kernel started on.
Example 6: Safety Oracle on Softmax — ptoas Says OK, Oracle Says No
The previous five examples all show kernels that work. This one shows the kernel that appears to work — passes ptoas, ccec, and bisheng — but produces silently wrong output, and demonstrates the oracle catching it. See §11.3 for the chapter discussion; this is the runnable demo.
Two Fixtures, One Compiler
Both fixtures are PTO-MLIR .acl.pto files for 1×1024 f32 softmax. The “good” one is what mlir_to_pto emits from the upstream linalg fixture in Example 5 (or equivalently from the Rust tile API kernel in Example 3). The “bad” one is the same file with 48 extra pto.alloc_tile + pto.tload ops injected before the reduction — each tile is 1×1024 f32, none is ever read, and ptoas’s PlanMemoryPass ends up stacking several at the same UB offset as live tiles %3 and %11.
# Generate the two fixtures
cd /home/y00577373/ascend-rs-priv
python3 blog/mdbook/scripts/ch11_make_bad_softmax.py /tmp/ch11_sm_bad.acl.pto
# (the "good" file is committed)
cp examples/tile_softmax/artifacts/tile_softmax_kernels.acl.pto /tmp/ch11_sm_good.acl.pto
Run Both Through ptoas
PTOAS=/usr/local/bin/ptoas-bin/ptoas # or $HOME/ptoas-x86/bin/ptoas on adablue
$PTOAS /tmp/ch11_sm_good.acl.pto -o /tmp/good.cpp
echo "good rc=$?"
$PTOAS /tmp/ch11_sm_bad.acl.pto -o /tmp/bad.cpp
echo "bad rc=$?"
good rc=0
bad rc=0
ptoas accepts both. ccec accepts both. bisheng links both. On 910B2, the “good” kernel produces max_err=1.86e-9; the “bad” kernel produces garbage that varies run-to-run depending on what the dead tiles happened to overwrite.
Run Both Through the Oracle
PTO_DIFF=/data/yuyijun/ascend-rs/target/release/pto-diff # or local build
$PTO_DIFF --from-pto /tmp/ch11_sm_good.acl.pto --ptoas $PTOAS
$PTO_DIFF --from-pto /tmp/ch11_sm_bad.acl.pto --ptoas $PTOAS
=== /tmp/ch11_sm_good.acl.pto ===
0 errors, 0 warnings (clean)
=== /tmp/ch11_sm_bad.acl.pto ===
[error] capacity: vec high-water 393216 B exceeds capacity 196608 B
(on Ascend910B2 (CANN 8.5))
[error] aliasing: tiles `%3` and `%108` overlap at vec offset 0x1000
[error] dead-tile: tile `%108` is written but never read
... (94 more findings) ...
96 errors, 0 warnings
Exit code: 3 for the bad fixture, 0 for the good one. Same pto-diff binary, same ptoas underneath — the only thing that distinguishes the two outcomes is that the oracle inspects the post-PlanMemoryPass MLIR in a way ptoas does not.
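Two of the findings in the report above — capacity high-water and tile aliasing — reduce to interval arithmetic over (offset, size) pairs. An illustrative sketch (not the pto-diff source; constants taken from the report, tile counts chosen to reproduce its numbers):

```rust
/// Vec-pipe UB capacity from the report above (Ascend910B2, CANN 8.5).
const VEC_CAPACITY_BYTES: u64 = 196_608;

/// Highest end address across allocated (offset, size) tiles.
fn high_water(tiles: &[(u64, u64)]) -> u64 {
    tiles.iter().map(|&(off, sz)| off + sz).max().unwrap_or(0)
}

/// Do two half-open [off, off+sz) byte ranges overlap?
fn overlaps(a: (u64, u64), b: (u64, u64)) -> bool {
    a.0 < b.0 + b.1 && b.0 < a.0 + a.1
}

fn main() {
    let tile_bytes: u64 = 1024 * 4; // one 1×1024 f32 tile
    // 96 stacked 4 KiB tiles give the 393216 B high-water from the report.
    let tiles: Vec<(u64, u64)> =
        (0u64..96).map(|i| (i * tile_bytes, tile_bytes)).collect();
    assert_eq!(high_water(&tiles), 393_216);
    assert!(high_water(&tiles) > VEC_CAPACITY_BYTES); // capacity finding
    // Two tiles planned at the same UB offset: the aliasing finding.
    assert!(overlaps((0x1000, 4096), (0x1000, 4096)));
    println!("high-water {} B > capacity {} B", high_water(&tiles), VEC_CAPACITY_BYTES);
}
```

The real oracle reads these (offset, size) pairs out of the post-PlanMemoryPass MLIR; the checks themselves are this simple once that data is in hand.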
One-Shot Demo Script
Both runs are bundled in blog/mdbook/scripts/ch11_bad_demo.sh, which is also what drives the demo recording in §11.6. To replay locally:
PTOAS=/usr/local/bin/ptoas-bin/ptoas \
PTO_DIFF=/data/yuyijun/ascend-rs/target/release/pto-diff \
bash blog/mdbook/scripts/ch11_bad_demo.sh
The Same Contrast on the Linalg Ingress Path
For completeness, here is the equivalent end-to-end demo that starts from upstream linalg rather than hand-edited PTO, exercising both Path A (projector) and Path C (full ptoas pipeline):
BIN=crates/mlir_to_cpp_tests/target/release/linalg_to_ascendc
SM=benchmarks/linalg/kernels_upstream_shape_matched/softmax_upstream_1x1024.mlir
ADV=benchmarks/linalg/kernels_adversarial/capacity_overflow_1x131072.mlir
echo "--- clean softmax via Path A ---"
ACLRS_LINALG_SAFETY=path-a $BIN $SM /tmp/clean.cce 2>&1 \
| grep linalg-safety || echo "(clean — no findings)"
echo "--- adversarial fixture via Path A ---"
ACLRS_LINALG_SAFETY=path-a $BIN $ADV /tmp/adv.cce 2>&1 \
| grep linalg-safety || echo "(clean)"
echo "--- adversarial fixture via Path C ---"
ACLRS_PTOAS_BIN=$HOME/ptoas-x86/bin/ptoas \
ACLRS_LINALG_SAFETY=path-c $BIN $ADV /tmp/adv.cce 2>&1 \
| grep linalg-safety || echo "(clean)"
--- clean softmax via Path A ---
(clean — no findings)
--- adversarial fixture via Path A ---
linalg-safety [path-a] [error] capacity: vec high-water 1048576 B exceeds capacity 196608 B
(on Ascend910B2 (CANN 8.5)) (in `adv_capacity_overflow`)
--- adversarial fixture via Path C ---
linalg-safety [path-c] [error] ptoas: vec overflow, requires 8388608 bits while 1572864 bits avaliable
(in `adv_capacity_overflow`)
Both paths catch the capacity bug, by different mechanisms, on the same input — which is the whole point of having two complementary safety surfaces on the bridge.
Troubleshooting
Device startup failed
The NPU driver is not running or the device is in a fault state. Check:
npu-smi info # look for Health: OK (not Critical)
npu-smi reset -i 0 # reset device 0 (requires root)
Could not determine ASCEND_HOME_PATH
ACLRS_CANN_PATH is not set or the path doesn’t exist:
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
# verify it exists:
ls $ACLRS_CANN_PATH/tools/ccec_compiler/bin/bisheng
ptoas assembler not found
Set ACLRS_PTOAS_PATH to the full path of the ptoas binary:
export ACLRS_PTOAS_PATH=/path/to/ptoas/build/tools/ptoas/ptoas
ptoas is part of the pto-isa
project and is only required for the PTO codegen path (Examples 3, 4, and 6).
ccec PTO compilation failed: set_mask_count does not support target feature
This means the wrong --cce-aicore-arch was used. Ensure:
- ACLRS_SOC_VERSION is set correctly for your chip
- ascend-rs is on the claude_code or main branch (fix committed in d45ab4e3 and adbf7294)
error: definition of type 'bfloat16_t' conflicts with typedef
Your ccec version already defines bfloat16_t. This was fixed in commit
adbf7294. Update to the latest branch.
Correctness check fails (max_err > 1e-5)
- For the vector softmax on 310P: expected max_err < 1e-8 (hardware f32 math)
- For the tile softmax on 910B: expected max_err < 1e-9 (PTO reduction instructions use higher internal precision; verified result is max_err=1.86e-9)
- Values larger than 1e-5 may indicate the wrong SoC version is set, causing mismatched UB buffer size assumptions, or a missing __global__ on the kernel entry point (fixed in commit 04c80ac6)
Summary: Pipeline Comparison at a Glance
Example 1: Hello World
Rust host code → cargo build → binary → ACL runtime → NPU device
(No kernel — pure host/driver interaction)
Example 2: Vector Softmax (mlir_to_cpp path)
Rust kernel → rustc → MLIR → mlir_to_cpp → AscendC C++
→ bisheng → .acl.o → KernelLoader → NPU execution
Example 3: Tile Softmax (PTO path)
Rust kernel → rustc → MLIR → mlir_to_pto → PTO-MLIR dialect
→ ptoas → CCE C++ → ccec → .acl.o
→ KernelLoader → NPU execution
All three pipelines share the same host-side runtime (ascend_rs::prelude::*):
Acl, Device, AclContext, AclStream, DeviceBuffer, KernelLoader.
The only difference is in how the .acl.o kernel binary is produced.