# 7. End-to-End Pipeline Walkthrough
Let’s trace the complete journey from source code to NPU execution during a single cargo run.
## 7.1 Compilation Phase
```mermaid
graph TD
A["Rust Kernel Source<br/>kernels/src/lib.rs"] -->|"rustc + rustc_codegen_mlir"| B["Rust MIR<br/>Type-checked, monomorphized"]
B -->|"builder_methods.rs:<br/>MIR ops → MLIR ops"| C["MLIR Modules<br/>LLVM · Arith · CF dialects<br/>hacc.entry attribute"]
C -->|"compile_ascend.rs:<br/>merge all modules"| D["Merged MLIR<br/>kernel code + ascend_std deps"]
D -->|"mlir_to_cpp<br/>(default)"| E["Generated C++<br/>AscendC class with TBuf,<br/>DataCopy, ReduceMax, Exp, ..."]
D -->|"mlir_to_pto<br/>(ACLRS_CODEGEN_PATH=pto)"| P["PTO Assembly<br/>pto.tload, pto.tadd, pto.tmatmul,<br/>pto.trowmax, pto.texp, ..."]
P -->|"ptoas --enable-insert-sync"| E
E --> F["ascend_compile crate<br/>Target abstraction · Validation<br/>Bisheng invocation · C ABI + CLI"]
F -->|"310P: --cce-aicore-arch=dav-m200"| G["NPU Binary · kernel.acl.o<br/>Ascend 310P machine code"]
F -->|"910B: --cce-aicore-arch=dav-c220"| H["NPU Binary · kernel.acl.o<br/>Ascend 910B machine code<br/>(413 tests verified)"]
```
### 7.1.1 The ascend_compile Compilation Hub
The ascend_compile crate (crates/ascend_compile/) is a standalone compilation library that decouples kernel compilation from the rustc_codegen_mlir backend. Any C++ kernel generator — whether from ascend-rs’s own MLIR-to-C++ pipeline, TileLang, Triton, PyPTO (CANN’s tile-level operator DSL), or future frontends — can use it to compile AscendC kernels:
```mermaid
graph TD
A1["ascend-rs<br/>Rust→MLIR→C++"] --> E["AscendC C++ kernel source"]
A2["TileLang<br/>Python DSL→AscendC (planned)"] -.-> E
A3["Triton<br/>GPU kernel compiler (planned)"] -.-> E
A4["PyTorch<br/>torch.compile (planned)"] -.-> E
A5["PyPTO<br/>CANN tile-level DSL (planned)"] -.-> E
E --> F["ascend_compile<br/><br/>Rust API · C ABI · CLI · Python<br/><br/>3 validation passes<br/>Dual flag paths · 310P + 910B<br/>Object or shared library output"]
F --> G["NPU Binary · .o / .so"]
```
This architecture enables the broader Ascend ecosystem to benefit from ascend-rs’s validated compilation pipeline without depending on Rust or rustc. The dashed edges indicate planned integrations not yet implemented.
### 7.1.2 Alternative Codegen Path: PTOAS (Programmable Tile Operation Assembly)
In addition to the default mlir_to_cpp path, ascend-rs supports an experimental PTO (Programmable Tile Operations) codegen path that targets the pto-isa virtual ISA — the same tile-level instruction set used internally by CANN’s FlashAttention implementation on Ascend 910B.
**Activation.** Set ACLRS_CODEGEN_PATH=pto to route kernel compilation through the PTO path instead of direct C++ generation:
```sh
export ACLRS_CODEGEN_PATH=pto          # Enable PTO path (default: cpp)
export ACLRS_PTOAS_PATH=/path/to/ptoas # Optional: explicit ptoas binary location
```
**Pipeline.** The PTO path adds an intermediate representation layer between MLIR and the final C++ that bisheng compiles:
```mermaid
graph LR
A["Merged MLIR<br/>(LLVM dialect)"] -->|"mlir_to_pto"| B["PTO Assembly<br/>(pto dialect MLIR)"]
B -->|"ptoas<br/>--enable-insert-sync"| C["AscendC C++"]
C -->|"bisheng"| D[".acl.o"]
```
The key advantage of this intermediate step is that ptoas automatically inserts synchronization barriers (set_flag/wait_flag) between pipeline stages. In the direct C++ path, the codegen must explicitly emit pipe_barrier(PIPE_ALL) between DMA and compute operations — getting this wrong causes silent data corruption or NPU hangs. The PTO path delegates barrier insertion to the ptoas assembler, which has exact knowledge of the hardware pipeline topology.
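The hazard being avoided can be sketched on the host with standard threads. This is an analogy only: `staged_sum` is an illustrative name, and the `Barrier`/`Mutex` stand in for the NPU's set_flag/wait_flag pipe synchronization.

```rust
// Sketch using std threads (NOT NPU pipes): the set_flag/wait_flag handshake
// that ptoas inserts automatically between the DMA and compute stages.
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

fn staged_sum() -> u32 {
    let buf = Arc::new(Mutex::new(vec![0u32; 4]));
    let ready = Arc::new(Barrier::new(2)); // set_flag / wait_flag analogue

    let dma = {
        let (buf, ready) = (Arc::clone(&buf), Arc::clone(&ready));
        thread::spawn(move || {
            *buf.lock().unwrap() = vec![1, 2, 3, 4]; // "DMA" fills the tile
            ready.wait();                            // set_flag: copy landed
        })
    };

    ready.wait(); // wait_flag: compute must not start before the copy is done
    let sum = buf.lock().unwrap().iter().sum();
    dma.join().unwrap();
    sum
}

fn main() {
    // Without the rendezvous, the compute stage could read zeros (silent
    // corruption); with it, the result is deterministic.
    let sum = staged_sum();
    assert_eq!(sum, 10);
    println!("sum = {sum}");
}
```

Dropping the two `ready.wait()` calls reintroduces exactly the race that a missing pipe_barrier causes on hardware: the read may observe the buffer before the copy lands.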
**Tile intrinsics API.** The ascend_std::tile module provides safe Rust wrappers for PTO tile operations:
```rust
use ascend_std::tile::*;

pub unsafe fn tile_softmax(input: *const f32, output: *mut f32) {
    // Load a 32×32 tile from global memory
    let x: Tile<32, 32, f32> = tile_load_f32(input);
    // Numerically stable softmax decomposition (5 PTO ops):
    //   1. Row-wise max:   pto.trowmax
    //   2. Subtract max:   pto.trowexpandsub
    //   3. Exponential:    pto.texp
    //   4. Row-wise sum:   pto.trowsum
    //   5. Divide by sum:  pto.trowexpanddiv
    let y: Tile<32, 32, f32> = tile_softmax_f32(x);
    // Store the result to global memory
    tile_store_f32(output, y);
}
```
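The five-step decomposition can be cross-checked on the host with a plain-Rust reference implementation (a sketch; `softmax_rows` is an illustrative helper, not part of ascend-rs):

```rust
// Host-side reference for the 5-step stable softmax (plain Rust, no NPU).
fn softmax_rows(x: &[f32], cols: usize) -> Vec<f32> {
    x.chunks(cols)
        .flat_map(|row| {
            // 1. row-wise max (trowmax)
            let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            // 2.+3. subtract the max, then exponentiate (trowexpandsub, texp)
            let e: Vec<f32> = row.iter().map(|v| (v - m).exp()).collect();
            // 4. row-wise sum (trowsum)
            let s: f32 = e.iter().sum();
            // 5. divide by the sum (trowexpanddiv)
            e.into_iter().map(move |v| v / s)
        })
        .collect()
}

fn main() {
    let y = softmax_rows(&[1.0, 2.0, 3.0, 1000.0, 1000.0, 1000.0], 3);
    // Each row sums to 1, and the large-magnitude row does not overflow,
    // which is the point of subtracting the row max first.
    assert!((y[0..3].iter().sum::<f32>() - 1.0).abs() < 1e-6);
    assert!((y[3..6].iter().sum::<f32>() - 1.0).abs() < 1e-6);
    println!("row sums ok");
}
```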
The Tile<ROWS, COLS, T> type is a move-only handle (no Copy) that ensures single-ownership semantics — preventing double-DMA and enforcing compile-time safety. Const generic parameters carry shape information through the type system, catching dimension mismatches at compile time rather than at NPU runtime.
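A toy version of this discipline, assuming nothing from ascend_std (the `Tile` and `matmul` below are illustrative stand-ins, not the real API), shows exactly what the compiler enforces:

```rust
// Sketch of the ownership discipline: const generics carry the shape, and
// the deliberately missing Copy impl makes every handle move-only.
struct Tile<const ROWS: usize, const COLS: usize> {
    _buf: Vec<f32>, // stand-in for an on-chip buffer handle
}

impl<const ROWS: usize, const COLS: usize> Tile<ROWS, COLS> {
    fn zeros() -> Self {
        Tile { _buf: vec![0.0; ROWS * COLS] }
    }
    fn shape(&self) -> (usize, usize) {
        (ROWS, COLS)
    }
}

// (M×K) @ (K×N) → (M×N): the compiler checks the shared K dimension and
// consumes both operands, so neither tile can be DMA'd twice afterwards.
fn matmul<const M: usize, const K: usize, const N: usize>(
    _a: Tile<M, K>,
    _b: Tile<K, N>,
) -> Tile<M, N> {
    Tile::zeros()
}

fn main() {
    let a = Tile::<32, 16>::zeros();
    let b = Tile::<16, 8>::zeros();
    let c = matmul(a, b);
    assert_eq!(c.shape(), (32, 8));
    // `matmul(a, b)` again: compile error, both handles were moved.
    // `matmul(Tile::<32, 16>::zeros(), Tile::<32, 16>::zeros())`:
    // compile error, 16 != 32 on the shared dimension.
    println!("{:?}", c.shape());
}
```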
**Matmul via cube unit.** Tile matmul maps to the hardware’s cube engine through a multi-level memory hierarchy pipeline:
```rust
use ascend_std::tile::*;

// Illustrative wrapper so the raw pointers are in scope.
// (M×K) @ (K×N) → (M×N), routed through L1 → L0A/L0B → Cube → L0C
pub unsafe fn tile_matmul_example(
    a_ptr: *const f32,
    b_ptr: *const f32,
    c_ptr: *mut f32,
) {
    let a: Tile<32, 32, f32> = tile_load_f32(a_ptr);
    let b: Tile<32, 32, f32> = tile_load_f32(b_ptr);
    let c: Tile<32, 32, f32> = tile_matmul_f32(a, b); // pto.tmatmul
    tile_store_f32(c_ptr, c);
}
```
The mlir_to_pto translator generates the full cube-unit pipeline: GM→CBUF staging tiles (pto.tload), CBUF→L0A/L0B movement (pto.tmov), matrix multiply on L0C (pto.tmatmul), and writeback — all with correct buffer layout attributes (blayout, slayout, fractal) for each memory level.
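For numerically validating pto.tmatmul output on the host, a straightforward reference multiply is useful (a sketch; `matmul_ref` is an illustrative helper, not part of ascend-rs):

```rust
// Host-side reference matmul in row-major layout: (M×K) @ (K×N) → (M×N).
fn matmul_ref(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for kk in 0..k {
            let av = a[i * k + kk];
            for j in 0..n {
                c[i * n + j] += av * b[kk * n + j]; // accumulate like L0C
            }
        }
    }
    c
}

fn main() {
    // 2×3 @ 3×2
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0];
    let c = matmul_ref(&a, &b, 2, 3, 2);
    assert_eq!(c, vec![58.0, 64.0, 139.0, 154.0]);
    println!("{:?}", c);
}
```

Comparing a kernel's `.acl.o` output tile against such a reference (within a floating-point tolerance) is the usual way to confirm that the layout attributes were emitted correctly.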
**PTO virtual ISA.** The translator emits the following PTO-dialect operations:
| Category | Operations | Description |
|---|---|---|
| Memory | pto.tload, pto.tstore | GM↔local tile DMA transfers |
| Element-wise | pto.tadd, pto.tmul, pto.texp | Vectorized arithmetic and transcendentals |
| Reduction | pto.trowmax, pto.trowsum, pto.trowexpandsub, pto.trowexpanddiv | Row-wise reductions with broadcast |
| Cube | pto.tmatmul, pto.tmov | Matrix multiply and inter-level data movement |
| Memory mgmt | pto.alloc_tile, pto.make_tensor_view, pto.partition_view | Buffer allocation and GM partitioning |
Each PTO tile buffer carries explicit layout metadata specifying its memory level (vec, mat, left, right, acc), data layout (row_major/col_major), and fractal size — enabling ptoas to generate correct data movement instructions for the hardware’s fractal memory architecture.
## 7.2 Runtime Phase
```mermaid
graph TD
subgraph Host["Host CPU"]
H1["Acl::new()"] --> H2["Device::new"]
H2 --> H3["AclContext"]
H3 --> H4["AclStream"]
H4 --> H5["DeviceBuffer::from_slice()"]
H5 --> H6["kernel.launch()"]
H6 --> H7["stream.sync()"]
H7 --> H8["z_device.to_host()"]
H8 --> H9["Verify results"]
H9 --> H10["RAII Drop · auto-clean"]
end
subgraph Device["NPU Device"]
D1["AI Core 0<br/>block_idx=0<br/>Process x 0..8"]
D2["AI Core 1<br/>block_idx=1<br/>Process x 8..16"]
D3["Device Memory<br/>x: Input A · y: Input B<br/>z: Output = A * B"]
end
H4 -.->|"stream binds"| D3
H5 -.->|"Host → Device copy"| D3
H6 -.->|"Kernel execution"| D1
H6 -.->|"Kernel execution"| D2
H7 -.->|"Completion signal"| Device
H8 -.->|"Device → Host transfer"| D3
H10 -.->|"Resources freed"| Device
```
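The two-core split in the diagram can be mimicked on the host with plain threads (a sketch; `launch_mul` is an illustrative name, not an ascend-rs API): each "core" derives its slice of the work from its block index, exactly as the kernel derives it from block_idx.

```rust
// Host-side sketch of the block_idx work split: block_dim "cores" each
// process a contiguous chunk of z = A * B.
fn launch_mul(x: &[f32], y: &[f32], block_dim: usize) -> Vec<f32> {
    let len = x.len() / block_dim;
    let mut z = vec![0.0f32; x.len()];
    std::thread::scope(|s| {
        for (block_idx, chunk) in z.chunks_mut(len).enumerate() {
            s.spawn(move || {
                let base = block_idx * len; // this "core"'s starting offset
                for i in 0..len {
                    chunk[i] = x[base + i] * y[base + i]; // z = A * B
                }
            });
        }
    });
    z
}

fn main() {
    let x: Vec<f32> = (0..16).map(|i| i as f32).collect();
    let y = vec![2.0f32; 16];
    // block_dim = 2: core 0 handles x[0..8], core 1 handles x[8..16]
    let z = launch_mul(&x, &y, 2);
    assert_eq!(z[15], 30.0); // 15.0 * 2.0
    println!("z[15] = {}", z[15]);
}
```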
## 7.3 Memory Safety Guarantees
Throughout this process, ascend-rs provides the following compile-time safety guarantees:
| Safety Issue | C++ Approach | ascend-rs Approach |
|---|---|---|
| Device memory leak | Manual aclrtFree | Drop on DeviceBuffer<T> |
| Wrong deallocation order | Programmer convention | Lifetime system prevents at compile time |
| Use-after-free stream | No check | Compile error |
| Send unsafe type to device | No check | DeviceSend trait bound |
| Forgetting to synchronize | Silent data corruption | Type system can be extended to enforce synchronization |