Catching ptoas Blind Spots with a Rust Safety Oracle - ascend-rs: Memory-Safe NPU Kernel Programming in Rust


Summary: The PTO-MLIR compiler ptoas is the Ascend NPU’s cube-path lowering tool. It verifies the input MLIR against its own dialect rules, but it does not re-verify the output of its own PlanMemoryPass, the pass that assigns every tile a byte range in UB, L1, L0A/L0B/L0C, and FB. Bad placements therefore survive all the way to codegen. This chapter builds a small Rust crate, pto_to_rust, that rebuilds ptoas’s stage-2 plan as a typed Rust value, runs six safety checks against it, and reports violations with the original .acl.pto file as the locus. It is demonstrated end-to-end on two real hand-written smoke kernels that ptoas 0.26 accepts with rc=0 but that would silently corrupt data on-device.

Versions used throughout this chapter: ptoas 0.26 (CANN 8.5.0, installed at /usr/local/bin/ptoas-bin/ptoas on the Ascend 910B2 test host), pto_to_rust 0.1.0 (tag pto_checks, commit f41b29b1), rustc 1.91.0-nightly (f34ba774c 2025-08-03). All numeric results reproduce exactly on these versions; newer ptoas builds may shift placement decisions and therefore the specific byte offsets reported. The flag names quoted below are ptoas 0.26’s; pto-diff --from-pto auto-selects the right flags for the installed ptoas (§10.3.5).

10.1 Why ptoas Needs an External Oracle

ptoas is a stage-lowering compiler: PTO-MLIR (tile dialect) in, AscendC C++ out, bisheng-ready. Internally it runs a pipeline whose most load-bearing pass is PlanMemoryPass — the point at which every abstract pto.alloc_tile becomes a concrete (address_space, offset, rows, cols, dtype, blayout, slayout) record. After that pass, the IR is still MLIR and ptoas --print-after-all will dump it, but ptoas itself does not re-verify several invariants that are trivial to verify after you have the post-pass plan in hand.

Six concrete invariants it silently skips:

#	Invariant	Failure mode if violated
1	Two live tiles with different shapes must not occupy overlapping bytes in the same address space	Silent clobber at runtime; kernel returns wrong data
2	Per-space high-water byte usage must not exceed the device’s capacity (`DeviceSpec`)	SRAM overrun; kernel faults or corrupts neighbouring tile
3	`pto.tmatmul` operands live in the correct L0 subspace (lhs∈Left, rhs∈Right, acc∈Acc) with a dtype triple in the cube unit’s accepted set	Descriptor garbage; numerics wrong on some CANN revs
4	ptoas’s descriptor caps: OUTER < 2²⁴, ROW < 2¹⁶	Truncated descriptor; wrong N dimension
5	Every tile allocated should be used	Wasted UB budget — not a bug, but a correctness smell ptoas never mentions
6	Linear-use of tiles: a write should be followed by at least one read before the next write (advisory, loops flattened)	Dead store; earlier value lost

The remainder of this chapter builds the smallest possible tool that enforces all six and proves it by catching real violations.
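As a first taste of how small these checks are, here is a minimal sketch of invariant 4. The two caps come from the table above; the function name, the error type, and the idea that the caller passes OUTER and ROW as plain integers are assumptions made for illustration, not the crate's actual API.

```rust
/// Descriptor caps as quoted in the table above: OUTER < 2^24, ROW < 2^16.
const OUTER_CAP: u64 = 1 << 24;
const ROW_CAP: u64 = 1 << 16;

/// Hypothetical violation type; the real crate reports a `[matmul-bounds]` diagnostic.
#[derive(Debug, PartialEq, Eq)]
enum BoundsViolation {
    Outer(u64),
    Row(u64),
}

/// Rejects descriptor values that would not fit the capped fields.
/// How OUTER and ROW are derived from a tile's shape is left to the caller,
/// because the chapter does not spell that mapping out.
fn check_matmul_bounds(outer: u64, row: u64) -> Result<(), BoundsViolation> {
    if outer >= OUTER_CAP {
        return Err(BoundsViolation::Outer(outer));
    }
    if row >= ROW_CAP {
        return Err(BoundsViolation::Row(row));
    }
    Ok(())
}
```

The comparison is strict on purpose: a value equal to the cap already needs one more bit than the field offers, which is the "truncated descriptor" failure mode listed in the table.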

10.2 Design: Three Steps, Three Artifacts

The oracle is built around a deliberately simple pipeline. Each step produces one artifact that the next step consumes; each artifact is plain text so a human can read it mid-pipeline.

  [step 1]                 [step 2]                       [step 3]
┌──────────────┐   .pto   ┌──────────────┐   plan.rs   ┌───────────────┐   report   ┌────────────────┐
│  ptoas       │ ───────▶ │ pto_to_rust::│ ──────────▶ │ pto_to_rust:: │ ─────────▶ │ pto-diff CLI   │
│ --print-...  │          │ parse_stage2 │             │   check_all   │            │ (human output) │
└──────────────┘          └──────────────┘             └───────────────┘            └────────────────┘
    post-                   typed Rust                  SafetyReport                  error/warn lines
 PlanMemoryPass            `Plan { funcs }`             { violations }               file:line:kind:msg
    MLIR dump                                                                          ready for diff

1. Dump the stage-2 PTO-MLIR. Run ptoas --print-after-all <file.acl.pto> and keep the last module (the one that follows IR Dump After PlanMemoryPass). This IR has concrete (offset, size) annotations for every tile, which is exactly what the oracle needs.
2. Parse it into typed Rust. pto_to_rust::parse_stage2(&str) -> Plan turns the MLIR text into a Plan { arch, funcs: Vec<PlanFunc> } value, where each PlanFunc has a BTreeMap<Ssa, TileSlotX> of concrete tile slots and a Vec<PlanOp> of the ops referencing them. This is the point at which Rust’s type system takes over; once the parser accepts the dump, all subsequent reasoning happens on statically typed values.
3. Run check_all and map violations back to .acl.pto. SafetyReport::check_all(&plan, &device_spec) runs the six passes above and produces a SafetyReport { violations: Vec<SafetyViolation> }. The pto-diff CLI takes the original .acl.pto path, prepends it to every violation message, and emits lines in a file: severity: [kind] func: message format that is diffable, grep-friendly, and looks exactly like a compiler diagnostic.
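The artifacts of steps 2 and 3 can be pictured with a small sketch. The type names Plan, PlanFunc, SafetyReport and SafetyViolation and the output format come from the text above; every field name, the simplified TileSlot, and the render helper are assumptions made for illustration and may not match the real crate.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity { Error, Warn }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SafetyKind { Aliasing, Capacity, OpConstraint, MatmulBounds, DeadTile, LinearUse }

impl SafetyKind {
    /// Label printed between square brackets in a diagnostic.
    fn label(self) -> &'static str {
        match self {
            SafetyKind::Aliasing => "aliasing",
            SafetyKind::Capacity => "capacity",
            SafetyKind::OpConstraint => "op-constraint",
            SafetyKind::MatmulBounds => "matmul-bounds",
            SafetyKind::DeadTile => "dead-tile",
            SafetyKind::LinearUse => "linear-use",
        }
    }
}

// Step 2 artifact (field names are hypothetical).
#[derive(Debug)]
struct TileSlot { space: String, offset: u64, bytes: u64 }
#[derive(Debug)]
struct PlanFunc { name: String, slots: BTreeMap<String, TileSlot> }
#[derive(Debug)]
struct Plan { arch: String, funcs: Vec<PlanFunc> }

// Step 3 artifact (field names are hypothetical).
struct SafetyViolation { severity: Severity, kind: SafetyKind, func: String, message: String }
struct SafetyReport { violations: Vec<SafetyViolation> }

/// Renders one `file: severity: [kind] func: message` line per violation.
fn render(path: &str, report: &SafetyReport) -> Vec<String> {
    report
        .violations
        .iter()
        .map(|v| {
            let sev = match v.severity {
                Severity::Error => "error",
                Severity::Warn => "warn",
            };
            format!("{}: {}: [{}] {}: {}", path, sev, v.kind.label(), v.func, v.message)
        })
        .collect()
}
```

Keeping the path out of SafetyViolation and adding it only at render time is what lets the same report be attributed either to the stage-2 dump or to the original .acl.pto.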

The critical design decision is step 1: rather than reimplementing PlanMemoryPass in Rust (months of work, perpetually out of sync with ptoas), the oracle trusts ptoas’s placement and only checks the invariants that follow from it. This keeps pto_to_rust at under 600 lines of Rust while giving it teeth against real bugs.

10.3 Step-by-Step Walkthrough on a Real Kernel

We will demonstrate the whole flow on smoke_tstore_fp_v1.acl.pto, a hand-written 6-op kernel that probes the pto.tstore_fp dequant path. ptoas 0.26 accepts it (rc=0) and emits a .cpp; the oracle finds two real issues that would only manifest at runtime.

10.3.1 The Input

// smoke_tstore_fp_v1.acl.pto — abridged
module {
  func.func @m(%arg0: !pto.ptr<i8>, %arg1: !pto.ptr<i8>, %arg2: !pto.ptr<f16>) {
    %c0 = arith.constant 0 : index
    // … tensor views …

    // lhs: i8 [16×128] in Left
    %l_t = pto.alloc_tile : !pto.tile_buf<loc=left, dtype=i8, rows=16, cols=128, …>
    pto.tload ins(%pv_l) outs(%l_t)

    // rhs: i8 [128×256] in Right
    %r_t = pto.alloc_tile : !pto.tile_buf<loc=right, dtype=i8, rows=128, cols=256, …>
    pto.tload ins(%pv_r) outs(%r_t)

    // acc: i32 [16×256] in Acc
    %a_t = pto.alloc_tile : !pto.tile_buf<loc=acc, dtype=i32, rows=16, cols=256, …>
    pto.tmatmul ins(%l_t, %r_t) outs(%a_t)

    // scale: f16 [1×256] in Scaling — row_major slayout (bug #4)
    %s_t = pto.alloc_tile : !pto.tile_buf<loc=scaling, dtype=f16, rows=1, cols=256, slayout=row_major, …>
    pto.tload ins(%pv_s) outs(%s_t)

    pto.tstore_fp ins(%a_t, %s_t) outs(%pv_o)
    return
  }
}

Two human-visible issues are lurking:

The scaling tile’s shape [1 × 256] at f16 needs 512 B of Scaling space, which is fine on its own — but PlanMemoryPass places it at an offset that tips the high-water mark over the 4096 B Scaling cap on 910B2/CANN 8.5.
The scaling tile’s slayout is row_major, but pto.tstore_fp requires none_box for the fb-dequant hop.

ptoas catches neither.
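The arithmetic behind the first issue can be sketched in a few lines. The 512 B tile size and the 4096 B cap are from the text; the helper names are hypothetical, and the offset 3840 used in the usage note is inferred from the reported high-water mark (4352 − 512), not read from the dump.

```rust
/// Bytes occupied by a dense tile of `rows x cols` elements.
fn tile_bytes(rows: u64, cols: u64, elem_bytes: u64) -> u64 {
    rows * cols * elem_bytes
}

/// High-water mark of one address space: the largest `offset + size` over its slots.
/// Each slot is an `(offset, size)` pair in bytes.
fn high_water(slots: &[(u64, u64)]) -> u64 {
    slots.iter().map(|&(offset, size)| offset + size).max().unwrap_or(0)
}

/// Returns an error message when the high-water mark exceeds the space's capacity.
fn check_capacity(space: &str, slots: &[(u64, u64)], cap: u64) -> Result<(), String> {
    let hw = high_water(slots);
    if hw > cap {
        Err(format!("{} high-water {} B exceeds capacity {} B", space, hw, cap))
    } else {
        Ok(())
    }
}
```

With tile_bytes(1, 256, 2) = 512 and an assumed placement at offset 3840, check_capacity("scaling", &[(3840, 512)], 4096) fails with a high-water mark of 4352 B, while the same tile at offset 0 passes. The tile is small enough; only its placement breaks the cap.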

10.3.2 Running the Three Steps Manually

# Step 1 — dump stage-2 IR
$ /usr/local/bin/ptoas-bin/ptoas \
    --print-after-all /tmp/smoke_tstore_fp_v1.acl.pto \
    -o /tmp/out.cpp 2> /tmp/stage2.dump
$ echo "ptoas rc=$?"
ptoas rc=0

# keep everything after the first "IR Dump After PlanMemoryPass" banner (dumps of any later passes are retained too)
$ awk '/IR Dump After PlanMemoryPass/{flag=1; next} flag' /tmp/stage2.dump > /tmp/stage2.mlir
$ wc -l /tmp/stage2.mlir
74 /tmp/stage2.mlir

# Step 2 — parse into typed Rust (we invoke the library via pto-diff)
# Step 3 — run checks and emit diagnostics
$ ./target/release/pto-diff /tmp/stage2.mlir
/tmp/stage2.mlir: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
/tmp/stage2.mlir: warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
/tmp/stage2.mlir: 1 error(s), 1 warning(s)

Two diagnostics, both real. The error ends the kernel’s correctness (SRAM overrun); the warning ends its usability (fb-dequant silently dropped). Neither was present in the ptoas output.
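The dump extraction in step 1 can also be done in Rust rather than awk. The sketch below keeps only the last block that follows an IR Dump After PlanMemoryPass banner and stops at the next banner. It assumes every pass banner sits on its own line and contains the text "IR Dump After"; the function name is hypothetical and this is not necessarily how pto-diff scrapes stderr.

```rust
/// Returns the body of the last "IR Dump After PlanMemoryPass" block in `stderr`,
/// or `None` if no such banner is present.
fn last_plan_memory_dump(stderr: &str) -> Option<String> {
    let mut current: Option<Vec<&str>> = None; // block being collected, if any
    let mut last: Option<Vec<&str>> = None; // most recent completed block

    for line in stderr.lines() {
        if line.contains("IR Dump After") {
            // Any banner closes the block that was open.
            if let Some(block) = current.take() {
                last = Some(block);
            }
            // Only the PlanMemoryPass banner opens a new one.
            if line.contains("IR Dump After PlanMemoryPass") {
                current = Some(Vec::new());
            }
        } else if let Some(block) = current.as_mut() {
            block.push(line);
        }
    }
    // The dump may run to the end of the stream without a closing banner.
    if let Some(block) = current {
        last = Some(block);
    }
    last.map(|lines| lines.join("\n"))
}
```

Stopping at the next banner matters when other passes run after PlanMemoryPass: their modules would otherwise be concatenated into the text handed to the parser.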

10.3.3 Running the Three Steps as One Command

For convenience pto-diff bundles all three via --from-pto:

$ ./target/release/pto-diff --from-pto /tmp/smoke_tstore_fp_v1.acl.pto
/tmp/smoke_tstore_fp_v1.acl.pto: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
/tmp/smoke_tstore_fp_v1.acl.pto: warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
/tmp/smoke_tstore_fp_v1.acl.pto: 1 error(s), 1 warning(s)

The file path in each line is the original .acl.pto, not the transient stage-2 dump, so an IDE or git diff view can click through to the right place. This is the mapping-back step: although the checks run on the post-PlanMemoryPass Plan, each diagnostic is attributed to whichever upstream artifact the tool was given.

10.3.4 What Each Diagnostic Field Means

/tmp/smoke_tstore_fp_v1.acl.pto: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
├──────────────── locus ─────────┤  │     │             │
                                    │     │             └── function name inside the module
                                    │     └─── SafetyKind label (aliasing/capacity/op-constraint/
                                    │         matmul-bounds/dead-tile/linear-use)
                                    └── Severity (error=kernel wrong; warn=likely bug, advisory)

The DeviceSpec in the message (Ascend910B2 (CANN 8.5)) is the capacity table used for the check. pto-diff --device spec.toml lets a user supply a different one when targeting other SoC revisions.
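Because the fields are positional, a CI script can split a diagnostic back into its parts. The sketch below is one way to do that for the pto-diff line format shown above; the Diagnostic struct and parse_diagnostic are hypothetical helpers, and the parser assumes the file path itself contains no ": " sequence.

```rust
#[derive(Debug, PartialEq, Eq)]
struct Diagnostic<'a> {
    locus: &'a str,
    severity: &'a str,
    kind: &'a str,
    func: &'a str,
    message: &'a str,
}

/// Parses `file: severity: [kind] func: message`.
/// Returns `None` for anything else, including the trailing summary line.
fn parse_diagnostic(line: &str) -> Option<Diagnostic<'_>> {
    // Split into at most four parts so colons inside the message survive.
    let mut parts = line.splitn(4, ": ");
    let locus = parts.next()?;
    let severity = parts.next()?;
    let kind_func = parts.next()?;
    let message = parts.next()?;
    if severity != "error" && severity != "warn" {
        return None;
    }
    // `[capacity] m` becomes kind = "capacity", func = "m".
    let (kind, func) = kind_func.strip_prefix('[')?.split_once("] ")?;
    Some(Diagnostic { locus, severity, kind, func, message })
}
```

Limiting the split to four parts keeps messages such as the pto.tstore_fp warning intact, since that message contains its own colon.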

10.3.5 ptoas 0.26 vs ptoas 0.29 Flag Divergence

The bash transcript above uses ptoas 0.26’s flags, which dump every pass’s IR to stderr inline. ptoas 0.29 (shipping with later CANN 8.5 patches and with CANN 9.x) renamed those flags and redirected the dumps to the filesystem:

ptoas 0.26 (stderr)	ptoas 0.29 (tree directory)
`--print-after-all`	`--mlir-print-ir-after-all`
`--print-module-scope`	`--mlir-print-ir-tree-dir=<dir>`
Output on stderr as `IR Dump After PlanMemoryPass` blocks	One file per pass under `<dir>/builtin_module_*/N_<pass-name>.mlir`; the plan-memory dump is `3_pto-plan-memory.mlir`

pto-diff --from-pto handles both versions transparently. It tries the 0.29 tree-dir path first — creating a per-PID temp directory, invoking ptoas with the new flags, reading 3_pto-plan-memory.mlir, and cleaning up — and falls back to 0.26-style stderr scraping if the tree-dir path produces no dumps. Users get the same diagnostic output regardless of which ptoas is on PATH. (If both paths fail, pto-diff reports both error messages so the user can tell which compatibility assumption was wrong.)
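The try-then-fall-back behaviour can be sketched independently of ptoas itself. In the sketch below the two dump strategies are passed in as closures, so the control flow is visible without inventing any ptoas invocation details. The function names, the scratch directory name, and the error wording are assumptions; only the file name 3_pto-plan-memory.mlir and the builtin_module_ prefix come from the table above.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Per-process scratch directory for the tree-dir dumps (name is hypothetical).
fn scratch_dir() -> PathBuf {
    std::env::temp_dir().join(format!("pto-diff-{}", std::process::id()))
}

/// Looks for `<tree_dir>/builtin_module_*/3_pto-plan-memory.mlir`.
fn find_plan_memory_dump(tree_dir: &Path) -> Option<PathBuf> {
    fs::read_dir(tree_dir)
        .ok()?
        .flatten()
        .map(|entry| entry.path())
        .find_map(|dir| {
            let is_module = dir.file_name()?.to_str()?.starts_with("builtin_module_");
            let candidate = dir.join("3_pto-plan-memory.mlir");
            (is_module && candidate.is_file()).then_some(candidate)
        })
}

/// Tries the 0.29-style path first, then the 0.26-style path.
/// If both fail, the error carries both messages.
fn dump_stage2<F, G>(tree_dir_path: F, stderr_path: G) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String>,
    G: FnOnce() -> Result<String, String>,
{
    match tree_dir_path() {
        Ok(ir) => Ok(ir),
        Err(e29) => match stderr_path() {
            Ok(ir) => Ok(ir),
            Err(e26) => Err(format!(
                "tree-dir path failed: {}; stderr path failed: {}",
                e29, e26
            )),
        },
    }
}
```

Reporting both errors in the final case is the design choice the text describes: a single "could not dump IR" message would hide which compatibility assumption was wrong.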

10.4 A Second Kernel: Aliasing and Dead Tiles

The same three-step pipeline, applied to smoke_tdequant_v3.acl.pto, surfaces two different violations — demonstrating the oracle generalises.

$ ./target/release/pto-diff --from-pto /tmp/smoke_tdequant_v3.acl.pto
/tmp/smoke_tdequant_v3.acl.pto: error: [aliasing] m: slots %7 and %5 overlap in vec at [1024, 5120) and [4096, 4352)
/tmp/smoke_tdequant_v3.acl.pto: warn: [dead-tile] m: slot `%3` allocated in vec at offset 8192 but never used
/tmp/smoke_tdequant_v3.acl.pto: 1 error(s), 1 warning(s)

Aliasing (error). %5 is an i8 tile placed at UB offset 4096; the reported range [4096, 4352) makes it 256 B long. %7 is a 16×64 f32 tile placed at UB offset 1024, length 4096 B, so it occupies [1024, 5120). The two ranges overlap at [4096, 4352): all 256 bytes of the i8 tile lie inside the f32 tile. PlanMemoryPass deliberately reused the region because the liveness analysis decided they did not co-exist, but the two tiles have different shapes, so the oracle demotes the reuse from “deliberate” to “probably a bug”. In this case it really is a bug: both are live simultaneously in the op schedule.
Dead tile (warning). %3 is allocated but never referenced as a read or write of any op in the function — 4 KiB of UB budget wasted. ptoas neither reclaims nor warns about it.
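The overlap arithmetic behind the aliasing error can be sketched as a sort-and-scan over one address space. This is an illustration under simplifying assumptions, not the crate's check_aliasing: the Slot struct is hypothetical, and the shape and liveness filters the oracle applies are left out, so every byte-range overlap is reported.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Slot {
    ssa: &'static str,
    offset: u64,
    size: u64,
}

/// Reports every overlapping pair in one address space as
/// `(first_ssa, second_ssa, overlap_start, overlap_end)`.
fn overlaps(slots: &[Slot]) -> Vec<(&'static str, &'static str, u64, u64)> {
    // Zero-sized slots cannot clobber anything.
    let mut sorted: Vec<&Slot> = slots.iter().filter(|s| s.size > 0).collect();
    sorted.sort_by_key(|s| s.offset);

    let mut found = Vec::new();
    for (i, a) in sorted.iter().enumerate() {
        let a_end = a.offset + a.size;
        // Slots are sorted by start, so once one starts at or after `a_end`,
        // no later slot can overlap `a` either.
        for b in &sorted[i + 1..] {
            if b.offset >= a_end {
                break;
            }
            found.push((a.ssa, b.ssa, b.offset, a_end.min(b.offset + b.size)));
        }
    }
    found
}
```

Fed the three vec slots from the diagnostics above (%7 at 1024 with 4096 B, %5 at 4096 with 256 B, %3 at 8192), the scan reports the single pair %7/%5 with the overlap [4096, 4352).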

Both kernels still produce a runnable .cpp via ptoas. Both would silently misbehave on-device. The oracle surfaces the failure at compile time, before ccec and bisheng and the long edit-compile-run loop on the NPU.

10.5 Mapping Oracle Violations Back to ptoas

Because the oracle runs on ptoas’s own output (stage-2 MLIR), every violation it finds is a specific candidate for upstream inclusion:

Oracle check	Where to fold it into ptoas
`[aliasing]`	A new `VerifyAfterPlanMemoryPass` — sort slots per-space by offset, scan pairs. The oracle’s sort-and-scan implementation in `check_aliasing` (`O(n log n)` per space, `n < 64` in practice) can be ported almost verbatim.
`[capacity]`	Already knowable in `PlanMemoryPass` itself — it is literally the value the pass computes. A one-line `assert(high_water <= cap)` at the end of the pass would turn a runtime fault into a compile-time error.
`[op-constraint]` lhs/rhs/acc	An op verifier on `pto.tmatmul` / `pto.tmatmul.acc` / `pto.tstore_fp`. ptoas already has infrastructure for op verifiers; these checks are ~10 lines each.
`[matmul-bounds]`	A stage-2 verifier that runs over the plan. Descriptor cap knowledge (OUTER<2²⁴, ROW<2¹⁶) already exists in the lowering — exposing it to the verifier is a refactor, not a new analysis.
`[dead-tile]`	A cheap post-pass: for every slot, check if its SSA appears in any op’s `reads() ∪ writes()`. Warn only; not every dead tile is a bug.
`[linear-use]`	Advisory heuristic; would need scope-aware analysis (`scf.for` currently flattens) to promote to a hard rule.

Folding any of the first four would make this oracle redundant for those checks — and that is the point. The oracle exists to demonstrate which invariants are reachable as a compile-time guarantee without rewriting ptoas from scratch, and to give users a workaround until upstream lands them.
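To show how cheap the dead-tile post-pass in the table is, here is a minimal sketch. It follows the description literally: a slot is dead when its SSA name appears in no op's reads or writes. The Op struct and the function name are hypothetical stand-ins for the crate's PlanOp and check.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for an op in the plan: the tiles it reads and writes.
struct Op {
    reads: Vec<&'static str>,
    writes: Vec<&'static str>,
}

/// Returns the slots that no op reads or writes, in allocation order.
fn dead_tiles(slots: &[&'static str], ops: &[Op]) -> Vec<&'static str> {
    // Union of reads and writes over all ops.
    let used: BTreeSet<&str> = ops
        .iter()
        .flat_map(|op| op.reads.iter().chain(op.writes.iter()).copied())
        .collect();
    slots.iter().copied().filter(|s| !used.contains(s)).collect()
}
```

A tile that is loaded but never read counts as used under this rule, which is why the table keeps the check at warning level rather than treating it as proof of a bug.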

10.6 End-to-End Reproducer

A single bash script, blog/mdbook/scripts/ch11_safety_demo.sh, runs the whole demo non-interactively. It builds pto-diff, installs two smoke .acl.pto files to /tmp, and runs the oracle on each, printing the expected diagnostics verbatim.

$ bash blog/mdbook/scripts/ch11_safety_demo.sh
== Tool versions ==
ptoas 0.26
pto_to_rust 0.1.0  (tag pto_checks, commit f41b29b1)
rustc 1.91.0-nightly

== Demo 1: smoke_tstore_fp_v1 ==
ptoas rc=0
oracle findings:
  error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
  warn:  [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box

== Demo 2: smoke_tdequant_v3 ==
ptoas rc=0
oracle findings:
  error: [aliasing] m: slots %7 and %5 overlap in vec at [1024, 5120) and [4096, 4352)
  warn:  [dead-tile] m: slot `%3` allocated in vec at offset 8192 but never used

== Summary ==
ptoas accepted both files with rc=0.
Oracle found 2 errors + 2 warnings across the two files.

The script is read-only (it does not write any files outside /tmp) and assumes only that ptoas is on PATH and the oracle binary has been built at target/release/pto-diff. On the 910B2 test host the whole demo runs in under two seconds.

10.6.1 Live Recording on 910B2 / CANN 8.5 / ptoas 0.29

A companion pair of scripts, blog/mdbook/scripts/ch11_bad_demo.sh (runs on the 910c host) and blog/mdbook/scripts/ch11_bad_demo_remote.sh (ssh wrapper run from a workstation), contrasts a clean softmax against a bad softmax generated by ch11_make_bad_softmax.py. The bad fixture injects 48 dead-but-tloaded VEC tiles; ptoas 0.29 still returns rc=0, but its PlanMemoryPass reuses offset 4096 for 48 tiles — clobbering the live working slots %3 and %11. The oracle flags 96 aliasing errors. Recorded live against the real /usr/local/bin/ptoas-bin/ptoas 0.29 on the 910B2 test host:

[Recording: ch11 bad-softmax demo, ptoas rc=0 vs oracle 96 errors]

The point is the contrast on the same compiler: ptoas takes both files, the oracle takes both files, and only the oracle distinguishes the one that would corrupt memory.

10.7 Limits and Non-Goals

The oracle trusts ptoas’s placement. If PlanMemoryPass produces an incorrect offset (a ptoas bug), the oracle will either miss the violation or report the wrong byte range. The goal is not to second-guess ptoas’s allocator; it is to verify the allocator’s output against a separate set of invariants.
Loops are flattened. check_linear_use collapses scf.for bodies — a tile that is legitimately re-written every iteration may be flagged as WAW. This is why the check is Severity::Warning, not Error. A scope-aware liveness analysis would lift the restriction at the cost of a more complex pass.
DeviceSpec is per-SoC. The bundled spec is Ascend910B2 (CANN 8.5). Other SoC revisions (Ascend 910_9392, 310P3, upcoming 910C) have different capacity and dtype rules; they can be expressed as a TOML file and passed with --device.
The oracle is advisory, not normative. It emits diagnostics; the user’s build system decides whether a warning becomes a hard error. When integrated into rustc_codegen_mlir (the default PTO codegen path), setting ACLRS_PTO_SAFETY=error promotes every violation to a build failure; the default leaves warnings as warnings.
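The linear-use limitation is easier to see with the check written out. The sketch below works on an already flattened access trace, which is the simplification the second item describes; the Access enum and the function name are hypothetical, not the crate's check_linear_use.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy)]
enum Access {
    Read(&'static str),
    Write(&'static str),
}

/// Flags write-after-write with no read in between.
/// Returns `(tile, index_of_overwritten_write)` for each dead store.
fn dead_stores(trace: &[Access]) -> Vec<(&'static str, usize)> {
    // Tile name mapped to the index of its latest write that has not been read yet.
    let mut pending: BTreeMap<&'static str, usize> = BTreeMap::new();
    let mut found = Vec::new();
    for (i, access) in trace.iter().enumerate() {
        match *access {
            Access::Read(tile) => {
                pending.remove(tile);
            }
            Access::Write(tile) => {
                // `insert` hands back the previous unread write, if there was one.
                if let Some(prev) = pending.insert(tile, i) {
                    found.push((tile, prev));
                }
            }
        }
    }
    found
}
```

Because the trace carries no loop structure, the function cannot tell a genuine dead store from a write whose reader only exists on another iteration, which is the reason the result is treated as advisory.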

10.8 Where This Fits in the Bigger Story

The argument threaded through the rest of this book has been that Rust’s type system can be the load-bearing verifier for accelerator kernel code — sharper than C++ at catching ABI bugs, lighter than a bespoke formal-methods stack. This chapter shifts the same argument one level down: the type system of a tiny 600-line Rust crate is enough to catch real bugs in the output of a production MLIR compiler whose own verifier is silent about them. No SMT solvers, no model checkers, no re-implementations — just parse → typed Plan → six passes → print.

The .acl.pto → Plan path is the same shape as the reverse-codegen work in Chapters 5 and 6: a producer-side tool (ptoas/AscendC) is paired with a consumer-side tool (pto_to_rust/ascend-rs) that rebuilds its output in typed Rust and asks Rust “does this type-check?”. Every time the answer is “no”, we find a bug that the producer happily shipped.

10.9 End-to-End: Watching the Bad Kernel Fault on 910B2

§10.6 showed the oracle flagging aliasing at compile time. This section closes the loop by running both fixtures on real NPU hardware to observe the consequence ptoas hides. The crate is examples/ch11_exploit/ — a self-contained binary whose build.rs drives ptoas → ccec for the two .acl.pto fixtures (ch11_sm_good.acl.pto, ch11_sm_bad.acl.pto), and whose main.rs launches each on a 910B2 chip with the same deterministic input and compares to a CPU softmax reference.

Why two subprocesses

A subtle runtime discovery landed with the demo: registering two PTO device binaries in one process and launching one after the other crashes the 910B2 vector core on the second launch (Error: Vector core execution exception) even when both binaries are well-formed. The fault isn’t about our binaries — swapping the good binary for the independently-verified examples/tile_softmax output reproduces it. It is an interaction between two back-to-back rt_dev_binary_register calls that we have not yet root-caused.

So the demo forks: the parent spawns its own executable twice with --variant good and --variant bad, each child runs one Acl::new() → KernelLoader::from_bin_path → kernel.launch cycle and prints a single RESULT line, and the parent assembles the table. This is also how a future CI check should shape itself — one kernel per process, diff the RESULT lines against a golden transcript.
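The one-kernel-per-process shape can be sketched with the standard library alone. The --variant flag is from the text; the exact layout of the RESULT line is not shown in this chapter, so the "RESULT <variant> <verdict>" form below is an assumption, as are the struct and function names.

```rust
use std::process::Command;

#[derive(Debug, PartialEq, Eq)]
struct RunResult {
    variant: String,
    verdict: String,
}

/// Finds the first line of the assumed form `RESULT <variant> <verdict...>`.
fn parse_result_line(stdout: &str) -> Option<RunResult> {
    let rest = stdout.lines().find_map(|l| l.strip_prefix("RESULT "))?;
    let (variant, verdict) = rest.split_once(' ')?;
    Some(RunResult {
        variant: variant.to_string(),
        verdict: verdict.trim().to_string(),
    })
}

/// Re-runs the current executable as a child with `--variant <name>`
/// and parses the child's stdout. A child that dies before printing
/// a RESULT line yields `Ok(None)`.
fn run_variant(variant: &str) -> std::io::Result<Option<RunResult>> {
    let exe = std::env::current_exe()?;
    let output = Command::new(exe).args(["--variant", variant]).output()?;
    Ok(parse_result_line(&String::from_utf8_lossy(&output.stdout)))
}
```

Isolating each launch in its own child means a device fault in one variant cannot take down the parent or poison the other variant's runtime state, and the parent only ever handles plain text.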

The observed transcript

$ ASCEND_DEVICE_ID=1 cargo run --release -p ch11_exploit
=== ch11_exploit: 1×1024 f32 softmax on 910B2 ===
  variant   max_abs_err  max_rel_err        sum    nan verdict
  good          2.33e-9      7.20e-7   1.000000      0 PASS
  bad               NaN          NaN        NaN      0 CRASH: Vector core execution exception.

Compile-time: pto-diff flagged `bad` with aliasing/capacity warnings
              (see §10.6 in blog/mdbook/src/ch11-safety-oracle.md).
Runtime:      `bad` produced wrong output / crashed — the oracle was right.

The good line is the kind of result any well-written softmax earns: a max_rel_err of 7.20e-7 is within a few f32 ULPs of the CPU reference (f32 machine epsilon is about 1.19e-7), and the row sums to 1.000000. The computation is correct in the numerical sense.
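The table's error columns can be reproduced with a small reference implementation. The driver's actual metric definitions are not shown in this chapter, so the sketch below is one reasonable reading: a numerically stable CPU softmax accumulated in f64, and maxima of absolute and relative error that propagate NaN instead of hiding it. All names are hypothetical.

```rust
/// Numerically stable CPU softmax over one row, accumulated in f64.
fn softmax_ref(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f64> = x.iter().map(|&v| ((v - max) as f64).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| (e / sum) as f32).collect()
}

/// Returns `(max_abs_err, max_rel_err)` of `got` against `want`.
/// Any NaN difference makes both results NaN, because `f32::max`
/// would otherwise silently drop it.
fn max_errors(got: &[f32], want: &[f32]) -> (f32, f32) {
    let mut max_abs = 0.0f32;
    let mut max_rel = 0.0f32;
    for (&g, &w) in got.iter().zip(want) {
        let abs = (g - w).abs();
        if abs.is_nan() {
            return (f32::NAN, f32::NAN);
        }
        max_abs = max_abs.max(abs);
        if w != 0.0 {
            max_rel = max_rel.max(abs / w.abs());
        }
    }
    (max_abs, max_rel)
}
```

The explicit NaN branch matters for a harness like this one: without it, an output buffer full of NaN would report zero error and pass.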

The bad line is the expensive answer. The same kernel body with 48 extra dead tloads (tiles that are loaded and never read) doesn’t silently produce wrong numbers — it crashes the vector core hard enough that the runtime returns “Vector core execution exception” before any output is written back to host memory. What happened: ptoas’s PlanMemoryPass placed the dead tiles at offsets that overlap live working tiles, so at runtime the hardware tried to execute MTE2 loads into the same UB slots the V-pipe was consuming from. The collision fires an uninitialised-tensor fault on the aicore, not a numerical error.

This is a stronger runtime signal than the numerical corruption we predicted in §10.6: the bug is bad enough to prevent silent wrong output. That is not comforting. It means the oracle’s compile-time flag is the only signal between “ptoas says OK” and “your kernel faults on first launch”. If the bad fixture had aliased onto tiles the hardware tolerates more gently (for instance, an MTE3 store overlapping a completed V-pipe temporary), we would instead be looking at max_abs_err = 1.7e-2 with sum = 0.84 and no fault, a result any reasonable test harness would treat as “numerical noise, ship it”.

Either way: compile-time advisory + runtime transcript is the full story. pto-diff is the only layer between the user and a device fault or a plausible-looking wrong answer, because everything downstream of ptoas — bisheng, the runtime, the hardware — treats the aliased plan as valid.

Reproducing it

The fixtures and driver live at examples/ch11_exploit/. On any 910B2 host with CANN 8.5 and ptoas on PATH:

# one-time: source CANN env, point to LLVM 20 for codegen deps
source /usr/local/Ascend/cann-8.5.0/set_env.sh
export MLIR_SYS_200_PREFIX=/data/yuyijun/llvm20
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910B2
export PATH=/usr/local/bin/ptoas-bin:$PATH

# pick a free chip from `npu-smi info`
ASCEND_DEVICE_ID=1 cargo run --release -p ch11_exploit

The binary’s exit status is 0 whenever the good fixture passes. The bad crash is the expected behaviour and the table already reports it, so CI should diff the table against a golden transcript rather than rely on the exit code. An exit status of 2 means the clean fixture regressed, which is a build/device problem worth investigating before looking at the oracle.
