Appendix G: Tile API vs. Buffer API — A Comparison with FlashTile/PTO
Summary finding: The tile-based kernel API (ascend_std::tile) shrinks a 50-line softmax to 5 lines and eliminates all explicit pipe-barrier management. For V-pipe workloads like softmax, PTO provides no runtime performance advantage over the buffer API — both target the same vector engine. The true performance case for PTO is cube-unit (M-pipe) kernels: pto.tmatmul drives L0A/L0B/L0C memory and the matrix multiplier, which is architecturally inaccessible through the buffer/vector API. A GEMM benchmark comparing mlir_to_pto against the buffer path is the correct experiment to demonstrate this advantage.
G.1 The Three Codegen Paths in ascend-rs
ascend-rs supports three distinct codegen paths for Rust NPU kernels. Each path targets a different level of the Ascend software stack, and all three share a common Rust frontend and MLIR intermediate stage:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ascend-rs Toolchain │
│ │
│ Rust kernel source (.rs) │
│ │ │
│ ▼ │
│ rustc + rustc_codegen_mlir ←── custom codegen backend (melior/MLIR) │
│ │ │
│ ▼ │
│ LLVM-dialect MLIR (.mlir) │
│ │ │
│ ┌─────┴──────────────────────────────┐ │
│ │ │ │
│ │ ACLRS_CODEGEN_PATH=cpp (default) │ ACLRS_CODEGEN_PATH=pto │
│ │ │ │
│ ▼ ▼ │
│ mlir_to_cpp.rs mlir_to_pto.rs │
│ (5,956 lines) (714 lines) │
│ │ │ │
│ ▼ ▼ │
│ AscendC C++ (.cpp) PTO-MLIR (.pto) │
│ │ │ │ │
│ │ │ ┌─────┘ │
│ │ │ ▼ │
│ │ │ ptoas (PTO assembler) │
│ │ │ [Huawei internal tool] │
│ │ │ │ │
│ │ └────────────────────────┤ │
│ │ ▼ │
│ │ AscendC C++ (.cpp) │
│ │ │ │
│ └────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ bisheng / ccec (Huawei CCE compiler) │
│ │ │
│ ▼ │
│ NPU binary (.o / .bin) │
│ │ │
│ ▼ │
│ KernelLoader + AclStream (ascend_rs host API) │
│ │ │
│ ▼ │
│ Ascend NPU hardware execution │
└─────────────────────────────────────────────────────────────────────────────┘
PyPTO (FlashTile) is a parallel Python-based codegen path that targets the same PTO assembler from a different frontend:
┌─────────────────────────────────────────────────────────────────────────────┐
│ PyPTO / FlashTile Toolchain (for comparison) │
│ │
│ Python DSL (FlashTile decorators) │
│ │ │
│ ▼ │
│ to_pto_converter (Python bindings → MLIR Python API) │
│ │ │
│ ▼ │
│ PTO-MLIR (.pto) │
│ │ │
│ ▼ │
│ ptoas → AscendC C++ → bisheng → NPU binary │
└─────────────────────────────────────────────────────────────────────────────┘
The ascend-rs PTO path and the PyPTO path share the same ptoas assembler and the same PTO-MLIR dialect. This means the two ecosystems are interoperable at the .pto boundary: a tile kernel described in either Rust or Python produces a structurally identical intermediate representation.
The three ascend-rs paths differ in their target use cases:
| Path | Env var | Translator | Output | Status |
|---|---|---|---|---|
| Buffer API | ACLRS_CODEGEN_PATH=cpp (default) | mlir_to_cpp | AscendC C++ with TBuf, DataCopy, pipe_barrier | Production — verified on 310P and 910B2 |
| Tile→CPP | ACLRS_CODEGEN_PATH=cpp + tile intrinsics | mlir_to_cpp tile handlers | AscendC C++ with scalar GetValue/SetValue loops | Working — all 6 multi-row shapes pass correctness; ~10 Melem/s (scalar bottleneck) |
| Tile→PTO | ACLRS_CODEGEN_PATH=pto | mlir_to_pto | PTO-MLIR dialect for ptoas | Experimental — full softmax (trowmax→trowexpandsub→texp→trowsum→trowexpanddiv) verified through ptoas; blocked at bisheng step (CANN 8.5.0 / pto-inst.hpp incompatibility) |
The tile API path implements Phase 3 of the PTO/FlashTile integration plan. PTO (Programmable Tile Operations) is a virtual ISA for Ascend NPUs, and ptoas is its assembler. FlashTile refers to the tile-level programming model exposed through the PTO ISA — tile loads, stores, and fused operations like tile.softmax — as distinct from the lower-level buffer/DMA model of AscendC.
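The path selection above is driven entirely by the ACLRS_CODEGEN_PATH environment variable. A minimal sketch of how a host-side build step might dispatch on it (the CodegenPath enum and select_codegen_path function are illustrative, not the actual KernelBuilder API):

```rust
use std::env;

/// Which translator the build invokes (hypothetical enum; the real
/// KernelBuilder in ascend-rs may model this differently).
#[derive(Debug, PartialEq)]
enum CodegenPath {
    Cpp, // mlir_to_cpp → AscendC C++ (default)
    Pto, // mlir_to_pto → PTO-MLIR → ptoas → AscendC C++
}

fn select_codegen_path() -> CodegenPath {
    // ACLRS_CODEGEN_PATH=cpp is the default; =pto enables the tile→PTO path.
    match env::var("ACLRS_CODEGEN_PATH").as_deref() {
        Ok("pto") => CodegenPath::Pto,
        _ => CodegenPath::Cpp,
    }
}

fn main() {
    env::remove_var("ACLRS_CODEGEN_PATH");
    assert_eq!(select_codegen_path(), CodegenPath::Cpp); // unset → default cpp path
    env::set_var("ACLRS_CODEGEN_PATH", "pto");
    assert_eq!(select_codegen_path(), CodegenPath::Pto);
    println!("ok");
}
```

Both branches converge on bisheng, so the choice only affects which intermediate C++ is produced.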
G.2 The Usability Gap: Softmax as a Case Study
The same row-wise softmax computation requires very different amounts of code in each API:
Buffer API (mha/kernels/src/lib.rs, ~50 lines of kernel code):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn softmax_rows_f16(
    input: *const u16, output: *mut u16,
    row_len: *const u32, num_rows: *const u32,
) {
    let cols = *row_len;
    let rows = *num_rows;
    let buf_in = ascend_std::ascend_buf_alloc(cols);
    let buf_out = ascend_std::ascend_buf_alloc(cols);
    let buf_work = ascend_std::ascend_buf_alloc(cols);
    let buf_rwork = ascend_std::ascend_buf_alloc(cols);
    let mut row = 0u32;
    loop {
        if row >= rows { break; }
        let in_ptr = input.wrapping_add((row * cols) as usize);
        let out_ptr = output.wrapping_add((row * cols) as usize);
        ascend_std::ascend_buf_load_f16(buf_in, in_ptr, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 1
        let max_val = ascend_std::ascend_reduce_max_f16(buf_rwork, buf_in, buf_work, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 2
        ascend_std::ascend_adds_f16(buf_out, buf_in, -max_val, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 3
        ascend_std::ascend_exp_f16(buf_out, buf_out, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 4
        let sum_val = ascend_std::ascend_reduce_sum_f16(buf_rwork, buf_out, buf_work, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 5
        ascend_std::ascend_muls_f16(buf_out, buf_out, 1.0f32 / sum_val, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 6
        ascend_std::ascend_buf_store_f16(out_ptr, buf_out, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 7
        row = row + 1;
    }
}
```
Tile API (tile_softmax/kernels/src/lib.rs, 5 lines of kernel logic):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax<const ROWS: usize, const COLS: usize>(
    input: *const f32,
    output: *mut f32,
) {
    let block_idx = ascend_std::get_block_idx() as usize;
    let offset = block_idx * ROWS * COLS;
    let t_in = tile_load_f32::<ROWS, COLS>(input.add(offset));
    let t_out = tile_softmax_f32::<ROWS, COLS>(t_in);
    tile_store_f32::<ROWS, COLS>(output.add(offset), t_out);
}
```
The difference is stark: 7 explicit pipe_barrier() calls, 4 named buffer allocations, and a manual row loop in the buffer API vs. zero barriers, zero explicit buffers, and no loop in the tile API. The tile API codegen path — mlir_to_pto — automatically suppresses pipe_barrier calls because PTO manages pipeline synchronization implicitly.
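Under the hood, the single tile_softmax call lowers to the five-op PTO decomposition listed in the G.1 table (trowmax → trowexpandsub → texp → trowsum → trowexpanddiv). The op semantics can be modeled numerically on the host; this is an illustrative Rust sketch of what each PTO row op computes, not the device code:

```rust
/// Host-side model of the five PTO row ops that tile_softmax lowers to.
/// The tile is `rows x cols`, stored row-major in a flat slice.
fn softmax_via_pto_ops(tile: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; rows * cols];
    for r in 0..rows {
        let row = &tile[r * cols..(r + 1) * cols];
        // pto.trowmax: per-row maximum
        let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        // pto.trowexpandsub: broadcast-subtract the row max,
        // pto.texp: elementwise exponential
        let exps: Vec<f32> = row.iter().map(|x| (x - m).exp()).collect();
        // pto.trowsum: per-row sum
        let s: f32 = exps.iter().sum();
        // pto.trowexpanddiv: broadcast-divide by the row sum
        for c in 0..cols {
            out[r * cols + c] = exps[c] / s;
        }
    }
    out
}

fn main() {
    let tile = [1.0f32, 2.0, 3.0, 4.0, 1.0, 1.0, 1.0, 1.0];
    let sm = softmax_via_pto_ops(&tile, 2, 4);
    // Each output row sums to 1.
    for r in 0..2 {
        let s: f32 = sm[r * 4..(r + 1) * 4].iter().sum();
        assert!((s - 1.0).abs() < 1e-6);
    }
    // A uniform input row maps to a uniform distribution.
    assert!((sm[4] - 0.25).abs() < 1e-6);
    println!("ok");
}
```

On the NPU, the synchronization between these five steps is exactly what the seven explicit pipe_barrier calls enforce in the buffer API; PTO inserts the equivalent dependencies implicitly.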
G.3 The PTO Format: MLIR Dialect, Not Text Assembly
A critical finding emerged when ptoas was located on the 910c server. The actual .pto format consumed by ptoas is MLIR with a pto dialect — not a line-oriented text assembly.
The correct PTO format uses MLIR structured ops:
```mlir
module {
  func.func @softmax_kernel(%arg0: !pto.ptr<f32>, %arg1: !pto.ptr<f32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c32 = arith.constant 32 : index
    %in_tv = pto.make_tensor_view %arg0, shape = [%c32, %c32] strides = [%c32, %c1] : !pto.tensor_view<2xf32>
    %out_tv = pto.make_tensor_view %arg1, shape = [%c32, %c32] strides = [%c32, %c1] : !pto.tensor_view<2xf32>
    %in_pt = pto.partition_view %in_tv, offsets = [%c0, %c0], sizes = [%c32, %c32] : !pto.tensor_view<2xf32> -> !pto.partition_tensor_view<32x32xf32>
    %out_pt = pto.partition_view %out_tv, offsets = [%c0, %c0], sizes = [%c32, %c32] : !pto.tensor_view<2xf32> -> !pto.partition_tensor_view<32x32xf32>
    %buf_in = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32, ...>
    %buf_out = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32, ...>
    pto.tload ins(%in_pt : ...) outs(%buf_in : ...)
    pto.tsoftmax ins(%buf_in : ...) outs(%buf_out : ...)
    pto.tstore ins(%buf_out : ...) outs(%out_pt : ...)
    return
  }
}
```
When this is fed to ptoas, the tool lowers it through several MLIR passes (PTO Infer Mem Scope → PTO plan Mem → PTOToEmitC) and emits AscendC C++:
```cpp
#include "common/pto_instr.hpp"
using namespace pto;
__global__ AICORE void softmax_kernel(__gm__ float* v1, __gm__ float* v2) {
    using T = float;
    // ... pto-generated AscendC vector ops ...
}
```
This means ptoas is a source-to-source compiler from PTO-MLIR to AscendC C++, not an assembler to machine code. The pipeline is:
PTO-MLIR (.pto file) → ptoas → AscendC C++ (.cpp) → bisheng → NPU binary (.o)
G.4 Comparison with FlashTile
FlashTile (PyPTO) is CANN’s tile-level operator programming framework. It exposes approximately 90 tile operations through a Python DSL that compiles via ptoas to AscendC C++. The ascend-rs tile API (ascend_std::tile) targets the same PTO ISA from the Rust side.
| Dimension | FlashTile/PyPTO | ascend-rs buffer API | ascend-rs tile→CPP | ascend-rs tile→PTO |
|---|---|---|---|---|
| Frontend language | Python DSL | Rust (no_std) | Rust (no_std) | Rust (no_std) |
| Tile shape encoding | Runtime Python objects | Runtime count args | Compile-time const generics | Compile-time const generics |
| Shape mismatch detection | Runtime error | Runtime (wrong result) | Compile error | Compile error |
| Barrier management | Implicit (PTO) | Explicit (7 per softmax) | Implicit (generated) | Implicit (PTO) |
| Memory safety | Python GC; no device-side safety | Rust ownership | Rust ownership | Rust ownership |
| Codegen path | Python → PTO-MLIR → ptoas → C++ | Rust → MLIR → mlir_to_cpp → C++ | Rust → MLIR → mlir_to_cpp (tile handlers) → C++ | Rust → MLIR → mlir_to_pto → PTO-MLIR → ptoas → C++ |
| MLIR optimization stage | None | No (pass-through) | No (pass-through) | Yes — MLIR passes before ptoas |
| ptoas required | Yes | No | No | Yes — same dependency |
| V-pipe (softmax, eltwise) | ~same as buffer API | Best (440–788 Melem/s) | ~10 Melem/s (scalar workaround) | ~same as buffer API |
| M-pipe (GEMM, matmul) | Full cube-unit via pto.tmatmul | Not accessible (V-pipe only) | Not accessible (V-pipe only) | Full cube-unit via pto.tmatmul |
| Current hardware status | CANN internal distribution | Production (310P + 910B2) | Working — all 6 softmax shapes pass | Experimental — mlir_to_pto.rs done; build integration pending |
The key structural advantage of the Rust approach over PyPTO is the compile-time shape system: Tile<16, 1024, f32> is a distinct type from Tile<1, 1024, f32>, and passing the wrong tile to tile_softmax_f32 is a type error caught by rustc before any code runs. In Python, tile shape mismatches are runtime errors.
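This compile-time guarantee can be shown with a minimal const-generic tile type (a standalone sketch; the real ascend_std::tile types are more elaborate):

```rust
/// Minimal stand-in for a tile whose shape is part of its type
/// (sketch only; not the actual ascend_std::tile definition).
struct Tile<const ROWS: usize, const COLS: usize> {
    data: [[f32; COLS]; ROWS],
}

/// Accepts only a tile of exactly the shape it returns.
fn tile_softmax<const R: usize, const C: usize>(t: Tile<R, C>) -> Tile<R, C> {
    let mut out = t.data;
    for row in out.iter_mut() {
        let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let s: f32 = row.iter().map(|x| (x - m).exp()).sum();
        for x in row.iter_mut() {
            *x = (*x - m).exp() / s;
        }
    }
    Tile { data: out }
}

fn main() {
    let t: Tile<2, 4> = Tile { data: [[1.0; 4]; 2] };
    let sm = tile_softmax(t); // shapes match: compiles and runs
    assert!((sm.data[0][0] - 0.25).abs() < 1e-6);

    // let wrong: Tile<1, 4> = Tile { data: [[1.0; 4]; 1] };
    // let _ = tile_softmax::<2, 4>(wrong);
    // ^ rejected by rustc: expected `Tile<2, 4>`, found `Tile<1, 4>`
    println!("ok");
}
```

The commented-out lines are the failure mode: a shape mismatch never reaches the NPU, whereas the Python DSL only discovers it at trace or run time.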
The key advantage of PyPTO is maturity: it ships with CANN and is tested against real hardware. ascend-rs’s tile path depends on ptoas, which is not yet publicly available.
G.5 Quantitative Summary
V-pipe workloads (softmax) — ergonomics
| Metric | Buffer API | Tile→CPP | Tile→PTO |
|---|---|---|---|
| Kernel source lines | ~50 | 5 | 5 |
| Explicit pipe_barrier calls | 7/row | 0 | 0 |
| Named buffer allocations | 4 | 0 | 0 |
| Multi-row correctness | 1D only | ✓ 6 shapes | expected |
| Shape safety | runtime | compile-time | compile-time |
V-pipe workloads (softmax) — performance on Ascend 910B2
| Size | Buffer API | Tile→CPP (scalar) | Tile→PTO (expected) |
|---|---|---|---|
| 1×1,024 | 0.0085 ms | 0.109 ms | ~0.009 ms |
| 1×4,096 | 0.0093 ms | 0.419 ms | ~0.010 ms |
| 1×8,192 | 0.0104 ms | 0.831 ms | ~0.011 ms |
| Throughput | 440–788 Melem/s | ~9–10 Melem/s | ~440–788 Melem/s |
| Hardware | ✓ 910B2 | ✓ 910B2, 6 shapes | bisheng compat pending |
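The throughput row follows directly from the timings above (elements divided by kernel time). A quick consistency check using the 1×4,096 and 1×8,192 measurements:

```rust
/// Throughput in Melem/s from an element count and a kernel time in ms.
fn melem_per_s(elems: u64, time_ms: f64) -> f64 {
    elems as f64 / (time_ms * 1e-3) / 1e6
}

fn main() {
    // Buffer API: 4,096 elems in 0.0093 ms and 8,192 elems in 0.0104 ms
    // bracket the 440-788 Melem/s range quoted in the table.
    assert!((melem_per_s(4096, 0.0093) - 440.4).abs() < 1.0);
    assert!((melem_per_s(8192, 0.0104) - 787.7).abs() < 1.0);
    // Tile→CPP scalar path: 8,192 elems in 0.831 ms is under 10 Melem/s,
    // the 40-80x gap the scalar fallback leaves on the table.
    assert!(melem_per_s(8192, 0.831) < 10.0);
    println!("ok");
}
```

(The 1×1,024 buffer-API run amortizes less launch overhead, which is why the quoted range starts at the 4,096-element point.)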
M-pipe workloads (matrix multiply/GEMM)
| Metric | Buffer API | Tile→CPP | Tile→PTO |
|---|---|---|---|
| Cube unit accessible | No | No | Yes |
| mlir_to_pto handler | — | — | ✓ loc=mat/left/right/acc |
| Measured perf | — | ~0.17–0.27 GFlop/s | — |
| Peak theoretical | V-pipe only | V-pipe only | ~32 TFlop/s |
| Hardware-verified | No | ✓ scalar, 5 shapes | bisheng compat pending |
The M-pipe row is where PTO’s performance rationale is strongest: the 910B2 cube unit is architecturally separate from the V-pipe and orders of magnitude faster for matrix operations — and it is only reachable through PTO.
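"Orders of magnitude" can be made concrete from the table: ~0.17–0.27 GFlop/s measured on the scalar tile→CPP path against a ~32 TFlop/s theoretical cube-unit peak is roughly five orders of magnitude of headroom. A trivial check:

```rust
/// Ratio of peak to measured throughput (both in Flop/s).
fn headroom(peak_flops: f64, measured_flops: f64) -> f64 {
    peak_flops / measured_flops
}

fn main() {
    let cube_peak = 32e12; // ~32 TFlop/s theoretical cube-unit peak
    let scalar_best = 0.27e9; // upper end of measured scalar GEMM path
    let ratio = headroom(cube_peak, scalar_best);
    // ~1.2e5x, i.e. about five orders of magnitude.
    assert!(ratio > 1e5 && ratio < 2e5);
    println!("headroom ≈ {:.0}x", ratio);
}
```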
G.6 Current Status and Next Steps
What is done:
- mlir_to_pto.rs has been rewritten (950+ lines) to emit correct PTO-MLIR dialect ops (pto.make_tensor_view, pto.partition_view, pto.alloc_tile, pto.tload, pto.tstore, pto.tadd, pto.texp, pto.trowmax, pto.trowsum, pto.trowexpandsub, pto.trowexpanddiv).
- The ptoas binary accepts the generated .pto files and emits AscendC C++. 10/10 unit tests pass.
- The full softmax decomposition (trowmax → trowexpandsub → texp → trowsum → trowexpanddiv) is verified end-to-end through ptoas — all five ops compile correctly to TROWMAX/TROWEXPANDSUB/TEXP/TROWSUM/TROWEXPANDDIV AscendC C++.
- ptoas is now wired into KernelBuilder: ACLRS_CODEGEN_PATH=pto runs the full MLIR → PTO-MLIR → ptoas → .pto.cpp → bisheng pipeline.
- SSA alias tracking (getelementptr, alloca store/load, bitcast) was added to mlir_to_pto.rs to resolve the block-offset GM pointers generated by the real Rust→MLIR codegen.
translate_matmul() now emits correct cube-unit tile types. Previously it used loc=vec tiles for all operands, which violates TMatmul.hpp static assertions in pto-isa. The corrected mlir_to_pto.rs emits the full pipeline: pto.alloc_tile with loc=mat (CBUF staging), loc=left (L0A), loc=right (L0B), and loc=acc (L0C, fractal=1024), followed by pto.tload GM→mat, pto.tmov mat→left/right (MTE1 pipeline), and pto.tmatmul left×right→acc. The output through ptoas is verified to emit the correct __ca__/__cb__/__cc__ AscendC buffer qualifiers.
What remains:
1. Compiler version gap.
ptoas-generated C++ includes pto/pto-inst.hpp, which is incompatible with the Clang 15-based bisheng in CANN 8.5.0. The errors span multiple symbols: MrgSortExecutedNumList (missing from pto_instr.hpp forward declarations), copy_gm_to_ubuf_align_b32 (builtin not supported for the dav-c220 target feature set in Clang 15), and bfloat16_t (not defined in Clang 15 aicore mode). This is an upstream compatibility issue: pto-inst.hpp is designed for a newer bisheng. Resolution: upgrade to CANN 9.x, or request a Clang 15 compatibility shim from the pto-isa maintainers.
2. Hardware benchmark comparison.
Once the compiler version gap is resolved, the efficiency question — whether ptoas-generated AscendC avoids the LocalTensor::operator[] sub-view issue that forces the scalar fallback in the mlir_to_cpp tile path — can be answered empirically on 910B2. Based on the data, PTO-generated code should achieve ~440–800 Melem/s instead of ~10 Melem/s, recovering the 40–80× gap currently left by the scalar fallback.