Appendix G: Tile API vs. Buffer API — A Comparison with FlashTile/PTO
Summary finding: The tile-based kernel API (ascend_std::tile) shrinks a 50-line softmax to 5 lines and eliminates all explicit pipe-barrier management. For V-pipe workloads like softmax, PTO provides no runtime performance advantage over the buffer API — both target the same vector engine. The true performance case for PTO is cube-unit (M-pipe) kernels: pto.tmatmul drives L0A/L0B/L0C memory and the matrix multiplier, which is architecturally inaccessible through the buffer/vector API. A GEMM benchmark comparing mlir_to_pto against the buffer path is the correct experiment to demonstrate this advantage.
G.1 The Three Codegen Paths in ascend-rs
ascend-rs supports three distinct codegen paths for Rust NPU kernels. Each path targets a different level of the Ascend software stack, and all three share a common Rust frontend and MLIR intermediate stage:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ascend-rs Toolchain │
│ │
│ Rust kernel source (.rs) │
│ │ │
│ ▼ │
│ rustc + rustc_codegen_mlir ←── custom codegen backend (melior/MLIR) │
│ │ │
│ ▼ │
│ LLVM-dialect MLIR (.mlir) │
│ │ │
│ ┌─────┴──────────────────────────────┐ │
│ │ │ │
│ │ ACLRS_CODEGEN_PATH=cpp (default) │ ACLRS_CODEGEN_PATH=pto │
│ │ │ │
│ ▼ ▼ │
│ mlir_to_cpp.rs mlir_to_pto.rs │
│ (5,956 lines) (714 lines) │
│ │ │ │
│ ▼ ▼ │
│ AscendC C++ (.cpp) PTO-MLIR (.pto) │
│ │ │ │ │
│ │ │ ┌─────┘ │
│ │ │ ▼ │
│ │ │ ptoas (PTO assembler) │
│ │ │ [Huawei internal tool] │
│ │ │ │ │
│ │ └────────────────────────┤ │
│ │ ▼ │
│ │ AscendC C++ (.cpp) │
│ │ │ │
│ └────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ bisheng / ccec (Huawei CCE compiler) │
│ │ │
│ ▼ │
│ NPU binary (.o / .bin) │
│ │ │
│ ▼ │
│ KernelLoader + AclStream (ascend_rs host API) │
│ │ │
│ ▼ │
│ Ascend NPU hardware execution │
└─────────────────────────────────────────────────────────────────────────────┘
PyPTO (FlashTile) is a parallel Python-based codegen path that targets the same PTO assembler from a different frontend:
┌─────────────────────────────────────────────────────────────────────────────┐
│ PyPTO / FlashTile Toolchain (for comparison) │
│ │
│ Python DSL (FlashTile decorators) │
│ │ │
│ ▼ │
│ to_pto_converter (Python bindings → MLIR Python API) │
│ │ │
│ ▼ │
│ PTO-MLIR (.pto) │
│ │ │
│ ▼ │
│ ptoas → AscendC C++ → bisheng → NPU binary │
└─────────────────────────────────────────────────────────────────────────────┘
The ascend-rs PTO path and the PyPTO path share the same ptoas assembler and the same PTO-MLIR dialect. This means the two ecosystems are interoperable at the .pto boundary: a tile kernel described in either Rust or Python produces a structurally identical intermediate representation.
The three ascend-rs paths differ in their target use cases:
| Path | Env var | Translator | Output | Status |
|---|---|---|---|---|
| Buffer API | ACLRS_CODEGEN_PATH=cpp (default) | mlir_to_cpp | AscendC C++ with TBuf, DataCopy, pipe_barrier | Production — verified on 310P and 910B2 |
| Tile→CPP | ACLRS_CODEGEN_PATH=cpp + tile intrinsics | mlir_to_cpp tile handlers | AscendC C++ with scalar GetValue/SetValue loops | Working — all 6 multi-row shapes pass correctness; ~10 Melem/s (scalar bottleneck) |
| Tile→PTO | ACLRS_CODEGEN_PATH=pto | mlir_to_pto | PTO-MLIR dialect for ptoas | Experimental — full softmax (trowmax→trowexpandsub→texp→trowsum→trowexpanddiv) verified through ptoas; blocked at bisheng step (CANN 8.5.0 / pto-inst.hpp incompatibility) |
The tile API path implements Phase 3 of the PTO/FlashTile integration plan. PTO (Programmable Tile Operations) is a virtual ISA for Ascend NPUs, and ptoas is its assembler. FlashTile refers to the tile-level programming model exposed through the PTO ISA — tile loads, stores, and fused operations like tile.softmax — as distinct from the lower-level buffer/DMA model of AscendC.
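The path selection above is driven entirely by the ACLRS_CODEGEN_PATH environment variable. A minimal sketch of how a host-side build step might dispatch on it (the CodegenPath enum and select_codegen_path function are illustrative, not the actual KernelBuilder API):

```rust
use std::env;

/// Which translator the build invokes (hypothetical enum; the real
/// KernelBuilder in ascend-rs may model this differently).
#[derive(Debug, PartialEq)]
enum CodegenPath {
    Cpp, // mlir_to_cpp → AscendC C++ (default)
    Pto, // mlir_to_pto → PTO-MLIR → ptoas → AscendC C++
}

fn select_codegen_path() -> CodegenPath {
    // ACLRS_CODEGEN_PATH=cpp is the default; =pto enables the tile→PTO path.
    match env::var("ACLRS_CODEGEN_PATH").as_deref() {
        Ok("pto") => CodegenPath::Pto,
        _ => CodegenPath::Cpp,
    }
}

fn main() {
    env::remove_var("ACLRS_CODEGEN_PATH");
    assert_eq!(select_codegen_path(), CodegenPath::Cpp); // unset → default cpp path
    env::set_var("ACLRS_CODEGEN_PATH", "pto");
    assert_eq!(select_codegen_path(), CodegenPath::Pto);
    println!("ok");
}
```

Both branches converge on bisheng, so the choice only affects which intermediate C++ is produced.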
G.2 The Usability Gap: Softmax as a Case Study
The same row-wise softmax computation requires very different amounts of code in each API:
Buffer API (mha/kernels/src/lib.rs, ~50 lines of kernel code):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn softmax_rows_f16(
    input: *const u16, output: *mut u16,
    row_len: *const u32, num_rows: *const u32,
) {
    let cols = *row_len;
    let rows = *num_rows;
    let buf_in = ascend_std::ascend_buf_alloc(cols);
    let buf_out = ascend_std::ascend_buf_alloc(cols);
    let buf_work = ascend_std::ascend_buf_alloc(cols);
    let buf_rwork = ascend_std::ascend_buf_alloc(cols);
    let mut row = 0u32;
    loop {
        if row >= rows { break; }
        let in_ptr = input.wrapping_add((row * cols) as usize);
        let out_ptr = output.wrapping_add((row * cols) as usize);
        ascend_std::ascend_buf_load_f16(buf_in, in_ptr, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 1
        let max_val = ascend_std::ascend_reduce_max_f16(buf_rwork, buf_in, buf_work, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 2
        ascend_std::ascend_adds_f16(buf_out, buf_in, -max_val, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 3
        ascend_std::ascend_exp_f16(buf_out, buf_out, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 4
        let sum_val = ascend_std::ascend_reduce_sum_f16(buf_rwork, buf_out, buf_work, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 5
        ascend_std::ascend_muls_f16(buf_out, buf_out, 1.0f32 / sum_val, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 6
        ascend_std::ascend_buf_store_f16(out_ptr, buf_out, cols);
        ascend_std::ascend_pipe_barrier(); // barrier 7
        row = row + 1;
    }
}
```
Tile API (tile_softmax/kernels/src/lib.rs, 5 lines of kernel logic):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax<const ROWS: usize, const COLS: usize>(
    input: *const f32,
    output: *mut f32,
) {
    let block_idx = ascend_std::get_block_idx() as usize;
    let offset = block_idx * ROWS * COLS;
    let t_in = tile_load_f32::<ROWS, COLS>(input.add(offset));
    let t_out = tile_softmax_f32::<ROWS, COLS>(t_in);
    tile_store_f32::<ROWS, COLS>(output.add(offset), t_out);
}
```
The difference is stark: 7 explicit pipe_barrier() calls, 4 named buffer allocations, and a manual row loop in the buffer API vs. zero barriers, zero explicit buffers, and no loop in the tile API. The tile API codegen path — mlir_to_pto — automatically suppresses pipe_barrier calls because PTO manages pipeline synchronization implicitly.
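Under the hood, the single tile_softmax call lowers to the five-op PTO decomposition listed in the G.1 table (trowmax → trowexpandsub → texp → trowsum → trowexpanddiv). The op semantics can be modeled numerically on the host; this is an illustrative Rust sketch of what each PTO row op computes, not the device code:

```rust
/// Host-side model of the five PTO row ops that tile_softmax lowers to.
/// The tile is `rows x cols`, stored row-major in a flat slice.
fn softmax_via_pto_ops(tile: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; rows * cols];
    for r in 0..rows {
        let row = &tile[r * cols..(r + 1) * cols];
        // pto.trowmax: per-row maximum
        let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        // pto.trowexpandsub: broadcast-subtract the row max,
        // pto.texp: elementwise exponential
        let exps: Vec<f32> = row.iter().map(|x| (x - m).exp()).collect();
        // pto.trowsum: per-row sum
        let s: f32 = exps.iter().sum();
        // pto.trowexpanddiv: broadcast-divide by the row sum
        for c in 0..cols {
            out[r * cols + c] = exps[c] / s;
        }
    }
    out
}

fn main() {
    let tile = [1.0f32, 2.0, 3.0, 4.0, 1.0, 1.0, 1.0, 1.0];
    let sm = softmax_via_pto_ops(&tile, 2, 4);
    // Each output row sums to 1.
    for r in 0..2 {
        let s: f32 = sm[r * 4..(r + 1) * 4].iter().sum();
        assert!((s - 1.0).abs() < 1e-6);
    }
    // A uniform input row maps to a uniform distribution.
    assert!((sm[4] - 0.25).abs() < 1e-6);
    println!("ok");
}
```

On the NPU, the synchronization between these five steps is exactly what the seven explicit pipe_barrier calls enforce in the buffer API; PTO inserts the equivalent dependencies implicitly.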
G.3 The PTO Format: MLIR Dialect, Not Text Assembly
A critical finding emerged when ptoas was located on the 910c server. The actual .pto format consumed by ptoas is MLIR with a pto dialect — not a line-oriented text assembly.
The correct PTO format uses MLIR structured ops:
```mlir
module {
  func.func @softmax_kernel(%arg0: !pto.ptr<f32>, %arg1: !pto.ptr<f32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c32 = arith.constant 32 : index
    %in_tv = pto.make_tensor_view %arg0, shape = [%c32, %c32] strides = [%c32, %c1] : !pto.tensor_view<2xf32>
    %out_tv = pto.make_tensor_view %arg1, shape = [%c32, %c32] strides = [%c32, %c1] : !pto.tensor_view<2xf32>
    %in_pt = pto.partition_view %in_tv, offsets = [%c0, %c0], sizes = [%c32, %c32] : !pto.tensor_view<2xf32> -> !pto.partition_tensor_view<32x32xf32>
    %out_pt = pto.partition_view %out_tv, offsets = [%c0, %c0], sizes = [%c32, %c32] : !pto.tensor_view<2xf32> -> !pto.partition_tensor_view<32x32xf32>
    %buf_in = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32, ...>
    %buf_out = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32, ...>
    pto.tload ins(%in_pt : ...) outs(%buf_in : ...)
    pto.tsoftmax ins(%buf_in : ...) outs(%buf_out : ...)
    pto.tstore ins(%buf_out : ...) outs(%out_pt : ...)
    return
  }
}
```
When this is fed to ptoas, the tool lowers it through several MLIR passes (PTO Infer Mem Scope → PTO plan Mem → PTOToEmitC) and emits AscendC C++:
```cpp
#include "common/pto_instr.hpp"
using namespace pto;
__global__ AICORE void softmax_kernel(__gm__ float* v1, __gm__ float* v2) {
    using T = float;
    // ... pto-generated AscendC vector ops ...
}
```
This means ptoas is a source-to-source compiler from PTO-MLIR to AscendC C++, not an assembler to machine code. The pipeline is:
PTO-MLIR (.pto file) → ptoas → AscendC C++ (.cpp) → bisheng → NPU binary (.o)
G.4 Comparison with FlashTile
FlashTile (PyPTO) is CANN’s tile-level operator programming framework. It exposes approximately 90 tile operations through a Python DSL that compiles via ptoas to AscendC C++. The ascend-rs tile API (ascend_std::tile) targets the same PTO ISA from the Rust side.
| Dimension | FlashTile/PyPTO | ascend-rs buffer API | ascend-rs tile→CPP | ascend-rs tile→PTO |
|---|---|---|---|---|
| Frontend language | Python DSL | Rust (no_std) | Rust (no_std) | Rust (no_std) |
| Tile shape encoding | Runtime Python objects | Runtime count args | Compile-time const generics | Compile-time const generics |
| Shape mismatch detection | Runtime error | Runtime (wrong result) | Compile error | Compile error |
| Barrier management | Implicit (PTO) | Explicit (7 per softmax) | Implicit (generated) | Implicit (PTO) |
| Memory safety | Python GC; no device-side safety | Rust ownership | Rust ownership | Rust ownership |
| Codegen path | Python → PTO-MLIR → ptoas → C++ | Rust → MLIR → mlir_to_cpp → C++ | Rust → MLIR → mlir_to_cpp (tile handlers) → C++ | Rust → MLIR → mlir_to_pto → PTO-MLIR → ptoas → C++ |
| MLIR optimization stage | None | No (pass-through) | No (pass-through) | Yes — MLIR passes before ptoas |
| ptoas required | Yes | No | No | Yes — same dependency |
| V-pipe (softmax, eltwise) | ~same as buffer API | Best (440–788 Melem/s) | ~10 Melem/s (scalar workaround) | ~same as buffer API |
| M-pipe (GEMM, matmul) | Full cube-unit via pto.tmatmul | Not accessible (V-pipe only) | Not accessible (V-pipe only) | Full cube-unit via pto.tmatmul |
| Current hardware status | CANN internal distribution | Production (310P + 910B2) | Working — all 6 softmax shapes pass | Experimental — mlir_to_pto.rs done; build integration pending |
The key structural advantage of the Rust approach over PyPTO is the compile-time shape system: Tile<16, 1024, f32> is a distinct type from Tile<1, 1024, f32>, and passing the wrong tile to tile_softmax_f32 is a type error caught by rustc before any code runs. In Python, tile shape mismatches are runtime errors.
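This compile-time guarantee can be shown with a minimal const-generic tile type (a standalone sketch; the real ascend_std::tile types are more elaborate):

```rust
/// Minimal stand-in for a tile whose shape is part of its type
/// (sketch only; not the actual ascend_std::tile definition).
struct Tile<const ROWS: usize, const COLS: usize> {
    data: [[f32; COLS]; ROWS],
}

/// Accepts only a tile of exactly the shape it returns.
fn tile_softmax<const R: usize, const C: usize>(t: Tile<R, C>) -> Tile<R, C> {
    let mut out = t.data;
    for row in out.iter_mut() {
        let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let s: f32 = row.iter().map(|x| (x - m).exp()).sum();
        for x in row.iter_mut() {
            *x = (*x - m).exp() / s;
        }
    }
    Tile { data: out }
}

fn main() {
    let t: Tile<2, 4> = Tile { data: [[1.0; 4]; 2] };
    let sm = tile_softmax(t); // shapes match: compiles and runs
    assert!((sm.data[0][0] - 0.25).abs() < 1e-6);

    // let wrong: Tile<1, 4> = Tile { data: [[1.0; 4]; 1] };
    // let _ = tile_softmax::<2, 4>(wrong);
    // ^ rejected by rustc: expected `Tile<2, 4>`, found `Tile<1, 4>`
    println!("ok");
}
```

The commented-out lines are the failure mode: a shape mismatch never reaches the NPU, whereas the Python DSL only discovers it at trace or run time.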
The key advantage of PyPTO is maturity: it ships with CANN and is tested against real hardware. ascend-rs’s tile path depends on ptoas, which is not yet publicly available.
G.5 Quantitative Summary
V-pipe workloads (softmax) — ergonomics
| Metric | Buffer API | Tile→CPP | Tile→PTO |
|---|---|---|---|
| Kernel source lines | ~50 | 5 | 5 |
| Explicit pipe_barrier calls | 7/row | 0 | 0 |
| Named buffer allocations | 4 | 0 | 0 |
| Multi-row correctness | 1D only | ✓ 6 shapes | expected |
| Shape safety | runtime | compile-time | compile-time |
V-pipe workloads (softmax) — performance on Ascend 910B2
| Size | Buffer API | Tile→CPP (scalar) | Tile→PTO (expected) |
|---|---|---|---|
| 1×1,024 | 0.0085 ms | 0.109 ms | ~0.009 ms |
| 1×4,096 | 0.0093 ms | 0.419 ms | ~0.010 ms |
| 1×8,192 | 0.0104 ms | 0.831 ms | ~0.011 ms |
| Throughput | 440–788 Melem/s | ~9–10 Melem/s | ~440–788 Melem/s |
| Hardware | ✓ 910B2 | ✓ 910B2, 6 shapes | bisheng compat pending |
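The throughput row follows directly from the timings above (elements divided by kernel time). A quick consistency check using the 1×4,096 and 1×8,192 measurements:

```rust
/// Throughput in Melem/s from an element count and a kernel time in ms.
fn melem_per_s(elems: u64, time_ms: f64) -> f64 {
    elems as f64 / (time_ms * 1e-3) / 1e6
}

fn main() {
    // Buffer API: 4,096 elems in 0.0093 ms and 8,192 elems in 0.0104 ms
    // bracket the 440-788 Melem/s range quoted in the table.
    assert!((melem_per_s(4096, 0.0093) - 440.4).abs() < 1.0);
    assert!((melem_per_s(8192, 0.0104) - 787.7).abs() < 1.0);
    // Tile→CPP scalar path: 8,192 elems in 0.831 ms is under 10 Melem/s,
    // the 40-80x gap the scalar fallback leaves on the table.
    assert!(melem_per_s(8192, 0.831) < 10.0);
    println!("ok");
}
```

(The 1×1,024 buffer-API run amortizes less launch overhead, which is why the quoted range starts at the 4,096-element point.)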
M-pipe workloads (matrix multiply/GEMM)
| Metric | Buffer API | Tile→CPP | Tile→PTO |
|---|---|---|---|
| Cube unit accessible | No | No | Yes |
| mlir_to_pto handler | — | — | ✓ loc=mat/left/right/acc |
| Measured perf | — | ~0.17–0.27 GFlop/s | — |
| Peak theoretical | V-pipe only | V-pipe only | ~32 TFlop/s |
| Hardware-verified | No | ✓ scalar, 5 shapes | bisheng compat pending |
The M-pipe row is where PTO’s performance rationale is strongest: the 910B2 cube unit is architecturally separate from the V-pipe and orders of magnitude faster for matrix operations — and it is only reachable through PTO.
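"Orders of magnitude" can be made concrete from the table: ~0.17–0.27 GFlop/s measured on the scalar tile→CPP path against a ~32 TFlop/s theoretical cube-unit peak is roughly five orders of magnitude of headroom. A trivial check:

```rust
/// Ratio of peak to measured throughput (both in Flop/s).
fn headroom(peak_flops: f64, measured_flops: f64) -> f64 {
    peak_flops / measured_flops
}

fn main() {
    let cube_peak = 32e12; // ~32 TFlop/s theoretical cube-unit peak
    let scalar_best = 0.27e9; // upper end of measured scalar GEMM path
    let ratio = headroom(cube_peak, scalar_best);
    // ~1.2e5x, i.e. about five orders of magnitude.
    assert!(ratio > 1e5 && ratio < 2e5);
    println!("headroom ≈ {:.0}x", ratio);
}
```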
G.6 Current Status and Next Steps
What is done:
- mlir_to_pto.rs has been rewritten (950+ lines) to emit correct PTO-MLIR dialect ops (pto.make_tensor_view, pto.partition_view, pto.alloc_tile, pto.tload, pto.tstore, pto.tadd, pto.texp, pto.trowmax, pto.trowsum, pto.trowexpandsub, pto.trowexpanddiv).
- The ptoas binary accepts the generated .pto files and emits AscendC C++. 10/10 unit tests pass.
- The full softmax decomposition (trowmax → trowexpandsub → texp → trowsum → trowexpanddiv) is verified end-to-end through ptoas — all five ops compile correctly to TROWMAX/TROWEXPANDSUB/TEXP/TROWSUM/TROWEXPANDDIV AscendC C++.
- ptoas is now wired into KernelBuilder: ACLRS_CODEGEN_PATH=pto runs the full MLIR → PTO-MLIR → ptoas → .pto.cpp → bisheng pipeline.
- SSA alias tracking (getelementptr, alloca store/load, bitcast) was added to mlir_to_pto.rs to resolve the block-offset GM pointers generated by the real Rust→MLIR codegen.
translate_matmul() now emits correct cube-unit tile types. Previously it used loc=vec tiles for all operands, which violates TMatmul.hpp static assertions in pto-isa. The corrected mlir_to_pto.rs emits the full pipeline: pto.alloc_tile with loc=mat (CBUF staging), loc=left (L0A), loc=right (L0B), and loc=acc (L0C, fractal=1024), followed by pto.tload GM→mat, pto.tmov mat→left/right (MTE1 pipeline), and pto.tmatmul left×right→acc. The output through ptoas is verified to emit the correct __ca__/__cb__/__cc__ AscendC buffer qualifiers.
What remains:
1. Compiler version gap.
ptoas-generated C++ includes pto/pto-inst.hpp, which is incompatible with the Clang 15-based bisheng in CANN 8.5.0. The errors span multiple symbols: MrgSortExecutedNumList (missing from pto_instr.hpp forward declarations), copy_gm_to_ubuf_align_b32 (builtin not supported for the dav-c220 target feature set in Clang 15), and bfloat16_t (not defined in Clang 15 aicore mode). This is an upstream compatibility issue: pto-inst.hpp is designed for a newer bisheng. Resolution: upgrade to CANN 9.x, or request a Clang 15 compatibility shim from the pto-isa maintainers.
2. Hardware benchmark comparison.
Once the compiler version gap is resolved, the efficiency question — whether ptoas-generated AscendC avoids the LocalTensor::operator[] sub-view issue that forces the scalar fallback in the mlir_to_cpp tile path — can be answered empirically on 910B2. Based on the data, PTO-generated code should achieve ~440–800 Melem/s instead of ~10 Melem/s, recovering the 40–80× gap currently left by the scalar fallback.