Catching ptoas Blind Spots with a Rust Safety Oracle - ascend-rs: Memory-Safe NPU Kernel Programming in Rust


Summary: The PTO-MLIR compiler ptoas is the Ascend NPU’s cube-path lowering tool. It verifies the input MLIR against its own dialect rules, but it does not re-verify the output of its own PlanMemoryPass, the pass that assigns every tile a byte range in UB, L1, L0A/L0B/L0C, and FB. Once placement is done, bad placements survive all the way to codegen. This chapter builds a small Rust crate, pto_to_rust, that rebuilds ptoas’s stage-2 plan as a typed Rust value, runs six safety checks against it, and reports violations with the original .acl.pto file as the locus. It is demonstrated end-to-end on two real hand-written smoke kernels that ptoas 0.26 accepts with rc=0 but whose generated code would silently corrupt data on-device.

Versions used throughout this chapter: ptoas 0.26 (CANN 8.5.0, installed at /usr/local/bin/ptoas-bin/ptoas on the Ascend 910B2 test host), pto_to_rust 0.1.0 (tag pto_checks, commit f41b29b1), rustc 1.91.0-nightly (f34ba774c 2025-08-03). All numeric results reproduce exactly on these versions; newer ptoas builds may shift placement decisions and therefore the specific byte offsets reported.

11.1 Why ptoas Needs an External Oracle

ptoas is a stage-lowering compiler: PTO-MLIR (tile dialect) in, AscendC C++ out, bisheng-ready. Internally it runs a pipeline whose most load-bearing pass is PlanMemoryPass — the point at which every abstract pto.alloc_tile becomes a concrete (address_space, offset, rows, cols, dtype, blayout, slayout) record. After that pass, the IR is still MLIR and ptoas --print-after-all will dump it, but ptoas itself does not re-verify several invariants that are trivial to verify after you have the post-pass plan in hand.

Six concrete invariants it silently skips:

| # | Invariant | Failure mode if violated |
|---|---|---|
| 1 | Two live tiles with different shapes must not occupy overlapping bytes in the same address space | Silent clobber at runtime; kernel returns wrong data |
| 2 | Per-space high-water byte usage must not exceed the device’s capacity (`DeviceSpec`) | SRAM overrun; kernel faults or corrupts neighbouring tile |
| 3 | `pto.tmatmul` operands live in the correct L0 subspace (lhs∈Left, rhs∈Right, acc∈Acc) with a dtype triple in the cube unit’s accepted set | Descriptor garbage; numerics wrong on some CANN revs |
| 4 | ptoas’s descriptor caps: OUTER < 2²⁴, ROW < 2¹⁶ | Truncated descriptor; wrong N dimension |
| 5 | Every tile allocated should be used | Wasted UB budget: not a bug, but a correctness smell ptoas never mentions |
| 6 | Linear-use of tiles: a write should be followed by at least one read before the next write (advisory, loops flattened) | Dead store; earlier value lost |
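Invariant 4 is the simplest of the six to state in code. The following is a minimal sketch, not the crate's implementation: `check_descriptor_caps` is a hypothetical helper, and it assumes the OUTER and ROW descriptor fields have already been extracted as plain integers (how they are derived from a tile's shape is not shown here).

```rust
/// Descriptor caps from invariant 4: OUTER < 2^24 and ROW < 2^16.
const OUTER_CAP: u64 = 1 << 24;
const ROW_CAP: u64 = 1 << 16;

/// Hypothetical helper (not the crate's API): returns one message per violated cap.
fn check_descriptor_caps(outer: u64, row: u64) -> Vec<String> {
    let mut violations = Vec::new();
    if outer >= OUTER_CAP {
        violations.push(format!("OUTER {} must be below {}", outer, OUTER_CAP));
    }
    if row >= ROW_CAP {
        violations.push(format!("ROW {} must be below {}", row, ROW_CAP));
    }
    violations
}
```

Both caps are strict upper bounds, so a value equal to the cap is already a violation.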

The remainder of this chapter builds the smallest possible tool that enforces all six and proves it by catching real violations.

11.2 Design: Three Steps, Three Artifacts

The oracle is built around a deliberately simple pipeline. Each step produces one artifact that the next step consumes; each artifact is plain text so a human can read it mid-pipeline.

  [step 1]                 [step 2]                       [step 3]
┌──────────────┐   .pto   ┌──────────────┐   plan.rs   ┌───────────────┐   report   ┌────────────────┐
│  ptoas       │ ───────▶ │ pto_to_rust::│ ──────────▶ │ pto_to_rust:: │ ─────────▶ │ pto-diff CLI   │
│ --print-...  │          │ parse_stage2 │             │   check_all   │            │ (human output) │
└──────────────┘          └──────────────┘             └───────────────┘            └────────────────┘
    post-                   typed Rust                  SafetyReport                  error/warn lines
 PlanMemoryPass            `Plan { funcs }`             { violations }               file: sev: [kind] fn: msg
    MLIR dump                                                                          ready for diff

1. Dump the stage-2 PTO-MLIR. Run ptoas --print-after-all <file.acl.pto> and keep the last module (the one that follows IR Dump After PlanMemoryPass). This IR has concrete (offset, size) annotations for every tile, which is exactly what the oracle needs.
2. Parse it into typed Rust. pto_to_rust::parse_stage2(&str) -> Plan turns the MLIR text into a Plan { arch, funcs: Vec<PlanFunc> } value, where each PlanFunc has a BTreeMap<Ssa, TileSlotX> of concrete tile slots and a Vec<PlanOp> of the ops referencing them. This is the point at which Rust’s type system takes over; once the parser accepts it, all subsequent reasoning happens on statically typed values.
3. Run check_all and map violations back to .acl.pto. SafetyReport::check_all(&plan, &device_spec) runs the six passes above and produces a SafetyReport { violations: Vec<SafetyViolation> }. The pto-diff CLI takes the original .acl.pto path, prepends it to every violation message, and emits lines in a file: severity: [kind] func: message format that is diffable, grep-friendly, and looks exactly like a compiler diagnostic.
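To make the shape of the typed plan concrete, here is a simplified sketch of the data model the steps above describe. Only the names `Plan`, `PlanFunc`, `PlanOp`, `Ssa`, `arch` and `funcs` come from the text; `TileSlot` is a reduced stand-in for `TileSlotX`, and every other field name, the enum variants, and the assumption that a tile occupies exactly rows × cols × element-size bytes are illustrative guesses rather than the crate's actual definitions.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Address spaces named in this chapter (variant names are assumptions).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Space { Vec, L1, Left, Right, Acc, Scaling }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Dtype { I8, F16, I32, F32 }

impl Dtype {
    /// Element size in bytes.
    fn size_bytes(self) -> u64 {
        match self {
            Dtype::I8 => 1,
            Dtype::F16 => 2,
            Dtype::I32 | Dtype::F32 => 4,
        }
    }
}

/// SSA value name such as "%7".
type Ssa = String;

/// Reduced stand-in for `TileSlotX`: one tile with a concrete placement.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct TileSlot { space: Space, offset: u64, rows: u64, cols: u64, dtype: Dtype }

impl TileSlot {
    /// Assumes a densely packed tile with no padding.
    fn len_bytes(&self) -> u64 { self.rows * self.cols * self.dtype.size_bytes() }
    /// Half-open byte range [offset, offset + len).
    fn byte_range(&self) -> Range<u64> { self.offset..self.offset + self.len_bytes() }
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct PlanOp { name: String, reads: Vec<Ssa>, writes: Vec<Ssa> }

#[allow(dead_code)]
#[derive(Debug, Clone, Default)]
struct PlanFunc { name: String, slots: BTreeMap<Ssa, TileSlot>, ops: Vec<PlanOp> }

#[allow(dead_code)]
#[derive(Debug, Clone, Default)]
struct Plan { arch: String, funcs: Vec<PlanFunc> }
```

Keeping slots in a `BTreeMap` keyed by SSA name gives deterministic iteration order, which is what makes the emitted diagnostics stable enough to diff.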

The critical design decision is step 1: rather than reimplementing PlanMemoryPass in Rust (months of work, perpetually out of sync with ptoas), the oracle trusts ptoas’s placement and only checks the invariants that follow from it. This keeps pto_to_rust at under 600 lines of Rust while giving it teeth against real bugs.

11.3 Step-by-Step Walkthrough on a Real Kernel

We will demonstrate the whole flow on smoke_tstore_fp_v1.acl.pto, a hand-written 6-op kernel that probes the pto.tstore_fp dequant path. ptoas 0.26 accepts it (rc=0) and emits a .cpp; the oracle finds two real issues that would only manifest at runtime.

11.3.1 The Input

// smoke_tstore_fp_v1.acl.pto — abridged
module {
  func.func @m(%arg0: !pto.ptr<i8>, %arg1: !pto.ptr<i8>, %arg2: !pto.ptr<f16>) {
    %c0 = arith.constant 0 : index
    // … tensor views …

    // lhs: i8 [16×128] in Left
    %l_t = pto.alloc_tile : !pto.tile_buf<loc=left, dtype=i8, rows=16, cols=128, …>
    pto.tload ins(%pv_l) outs(%l_t)

    // rhs: i8 [128×256] in Right
    %r_t = pto.alloc_tile : !pto.tile_buf<loc=right, dtype=i8, rows=128, cols=256, …>
    pto.tload ins(%pv_r) outs(%r_t)

    // acc: i32 [16×256] in Acc
    %a_t = pto.alloc_tile : !pto.tile_buf<loc=acc, dtype=i32, rows=16, cols=256, …>
    pto.tmatmul ins(%l_t, %r_t) outs(%a_t)

    // scale: f16 [1×256] in Scaling — row_major slayout (bug #4)
    %s_t = pto.alloc_tile : !pto.tile_buf<loc=scaling, dtype=f16, rows=1, cols=256, slayout=row_major, …>
    pto.tload ins(%pv_s) outs(%s_t)

    pto.tstore_fp ins(%a_t, %s_t) outs(%pv_o)
    return
  }
}

Two human-visible issues are lurking:

The scaling tile’s shape [1 × 256] at f16 needs 512 B of Scaling space, which is fine on its own — but PlanMemoryPass places it at an offset that tips the high-water mark over the 4096 B Scaling cap on 910B2/CANN 8.5.
The scaling tile’s slayout is row_major, but pto.tstore_fp requires none_box for the fb-dequant hop.

ptoas catches neither.
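The capacity issue reduces to one comparison per address space. The sketch below is a hedged illustration with hypothetical function names, not the crate's code. A slot is reduced to an (offset, length) pair. The offset 3840 used in the usage note is inferred (the reported 4352 B high-water mark minus the 512 B tile), on the assumption that the scaling tile is the slot that ends at the high-water mark; it is not read from the stage-2 dump.

```rust
/// High-water mark of one address space: the largest `offset + len` over its slots.
/// Each slot is an `(offset, len)` pair in bytes; an empty space has high-water 0.
fn high_water(slots: &[(u64, u64)]) -> u64 {
    slots.iter().map(|&(offset, len)| offset + len).max().unwrap_or(0)
}

/// Hypothetical capacity check: the message layout mirrors the oracle's wording.
fn check_capacity(space: &str, slots: &[(u64, u64)], cap: u64) -> Result<(), String> {
    let hw = high_water(slots);
    if hw > cap {
        Err(format!("{} high-water {} B exceeds capacity {} B", space, hw, cap))
    } else {
        Ok(())
    }
}
```

With the 512 B scaling tile (1 × 256 elements × 2 bytes for f16) at the inferred offset 3840, `check_capacity("scaling", &[(3840, 512)], 4096)` fails because 3840 + 512 = 4352 > 4096, while the same tile at offset 0 passes.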

11.3.2 Running the Three Steps Manually

# Step 1 — dump stage-2 IR
$ /usr/local/bin/ptoas-bin/ptoas \
    --print-after-all /tmp/smoke_tstore_fp_v1.acl.pto \
    -o /tmp/out.cpp 2> /tmp/stage2.dump
$ echo "ptoas rc=$?"
ptoas rc=0

# keep the module that follows the "IR Dump After PlanMemoryPass" header
$ awk '/IR Dump After PlanMemoryPass/{flag=1; next} flag' /tmp/stage2.dump > /tmp/stage2.mlir
$ wc -l /tmp/stage2.mlir
74 /tmp/stage2.mlir

# Step 2 — parse into typed Rust (we invoke the library via pto-diff)
# Step 3 — run checks and emit diagnostics
$ ./target/release/pto-diff /tmp/stage2.mlir
/tmp/stage2.mlir: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
/tmp/stage2.mlir: warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
/tmp/stage2.mlir: 1 error(s), 1 warning(s)

Two diagnostics, both real. The error breaks the kernel’s correctness (an SRAM overrun); the warning breaks the feature under test (the fb-dequant step is silently dropped). ptoas reported neither.

11.3.3 Running the Three Steps as One Command

For convenience pto-diff bundles all three via --from-pto:

$ ./target/release/pto-diff --from-pto /tmp/smoke_tstore_fp_v1.acl.pto
/tmp/smoke_tstore_fp_v1.acl.pto: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
/tmp/smoke_tstore_fp_v1.acl.pto: warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
/tmp/smoke_tstore_fp_v1.acl.pto: 1 error(s), 1 warning(s)

The file path in each line is the original .acl.pto, not the transient stage-2 dump, so an IDE or a git diff view can click through to the right file. This is the mapping-back step: although the checks run on the post-PlanMemoryPass Plan, each diagnostic can be re-labelled with whichever upstream artifact the tool was given.

11.3.4 What Each Diagnostic Field Means

/tmp/smoke_tstore_fp_v1.acl.pto: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
├──────────────── locus ─────────┤  │     │             │
                                    │     │             └── function name inside the module
                                    │     └─── SafetyKind label (aliasing/capacity/op-constraint/
                                    │         matmul-bounds/dead-tile/linear-use)
                                    └── Severity (error=kernel wrong; warn=likely bug, advisory)

The DeviceSpec in the message (Ascend910B2 (CANN 8.5)) is the capacity table used for the check. pto-diff --device spec.toml lets a user supply a different one when targeting other SoC revisions.
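The field layout described in this subsection can be sketched as a small rendering function. This is an illustration under assumptions: `Severity` and the kind labels are named in the text, but the `Violation` struct, its field names and `render` are hypothetical and may not match `SafetyViolation` in the crate.

```rust
use std::fmt;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity { Error, Warning }

impl fmt::Display for Severity {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The CLI output shown in this chapter spells the two levels "error" and "warn".
        f.write_str(match self {
            Severity::Error => "error",
            Severity::Warning => "warn",
        })
    }
}

/// Hypothetical, simplified violation record.
struct Violation {
    severity: Severity,
    kind: &'static str, // e.g. "capacity", "aliasing", "dead-tile"
    func: String,       // function name inside the module
    message: String,
}

/// Renders `locus: severity: [kind] func: message`.
/// `locus` is whichever file the caller wants the diagnostic attributed to.
fn render(locus: &str, v: &Violation) -> String {
    format!("{}: {}: [{}] {}: {}", locus, v.severity, v.kind, v.func, v.message)
}
```

Because the locus is a parameter rather than a field of the violation, the same violation can be printed against the stage-2 dump or against the original .acl.pto without re-running any check.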

11.4 A Second Kernel: Aliasing and Dead Tiles

The same three-step pipeline, applied to smoke_tdequant_v3.acl.pto, surfaces two different violations, which shows that the oracle generalises beyond a single kernel.

$ ./target/release/pto-diff --from-pto /tmp/smoke_tdequant_v3.acl.pto
/tmp/smoke_tdequant_v3.acl.pto: error: [aliasing] m: slots %7 and %5 overlap in vec at [1024, 5120) and [4096, 4352)
/tmp/smoke_tdequant_v3.acl.pto: warn: [dead-tile] m: slot `%3` allocated in vec at offset 8192 but never used
/tmp/smoke_tdequant_v3.acl.pto: 1 error(s), 1 warning(s)

Aliasing (error). %5 is an i8 tile placed at UB offset 4096; its reported range [4096, 4352) is 256 B long. %7 is a 16×64 f32 tile placed at UB offset 1024 with length 16 × 64 × 4 = 4096 B, so it occupies [1024, 5120). The smaller range lies entirely inside the larger one: all 256 bytes of the i8 tile are also bytes of the f32 tile. PlanMemoryPass reused the region deliberately because its liveness analysis decided the two tiles never co-exist, but the tiles have different shapes, so the oracle demotes the reuse from “deliberate” to “probably a bug”. In this case it really is a bug: both tiles are live simultaneously in the op schedule.
Dead tile (warning). %3 is allocated but never referenced as a read or write of any op in the function — 4 KiB of UB budget wasted. ptoas neither reclaims nor warns about it.
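The overlap arithmetic behind the aliasing diagnostic can be reproduced with a short sort-and-scan sketch. It is an illustration, not the crate's `check_aliasing`: the `Slot` struct and `find_overlaps` are hypothetical, the shape comparison the oracle applies before reporting is omitted, and the 4 KiB length given to %3 is taken from the prose above rather than from the dump.

```rust
use std::ops::Range;

/// One placed tile in a single address space, as a half-open byte range.
#[derive(Debug, Clone)]
struct Slot { ssa: String, start: u64, end: u64 }

/// Sorts slots by start offset, then scans forward from each slot until the next
/// slot starts at or past its end. Returns (earlier slot, later slot, shared bytes).
/// Cost is the sort plus one step per reported pair, so it can approach quadratic
/// only when many slots overlap each other.
fn find_overlaps(mut slots: Vec<Slot>) -> Vec<(String, String, Range<u64>)> {
    slots.sort_by_key(|s| (s.start, s.end));
    let mut out = Vec::new();
    for i in 0..slots.len() {
        for j in (i + 1)..slots.len() {
            if slots[j].start >= slots[i].end {
                break; // sorted by start: no later slot can overlap slot i
            }
            // slots[j].start >= slots[i].start, so the overlap begins at slots[j].start.
            let shared = slots[j].start..slots[i].end.min(slots[j].end);
            out.push((slots[i].ssa.clone(), slots[j].ssa.clone(), shared));
        }
    }
    out
}
```

Fed the three vec slots from this kernel, %7 at [1024, 5120), %5 at [4096, 4352) and %3 at [8192, 12288), the scan reports exactly one pair, (%7, %5), sharing [4096, 4352).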

Both kernels still produce a runnable .cpp via ptoas. Both would silently misbehave on-device. The oracle surfaces the failure at compile time, before ccec and bisheng and the long edit-compile-run loop on the NPU.

11.5 Mapping Oracle Violations Back to ptoas

Because the oracle runs on ptoas’s own output (stage-2 MLIR), every violation it finds is a specific candidate for upstream inclusion:

| Oracle check | Where to fold it into ptoas |
|---|---|
| `[aliasing]` | A new `VerifyAfterPlanMemoryPass`: sort slots per-space by offset, scan pairs. The oracle’s sort-and-scan implementation in `check_aliasing` (`O(n log n)` per space, `n < 64` in practice) can be ported almost verbatim. |
| `[capacity]` | Already knowable in `PlanMemoryPass` itself: it is literally the value the pass computes. A one-line `assert(high_water <= cap)` at the end of the pass would turn a runtime fault into a compile-time error. |
| `[op-constraint]` lhs/rhs/acc | An op verifier on `pto.tmatmul` / `pto.tmatmul.acc` / `pto.tstore_fp`. ptoas already has infrastructure for op verifiers; these checks are ~10 lines each. |
| `[matmul-bounds]` | A stage-2 verifier that runs over the plan. Descriptor cap knowledge (OUTER<2²⁴, ROW<2¹⁶) already exists in the lowering; exposing it to the verifier is a refactor, not a new analysis. |
| `[dead-tile]` | A cheap post-pass: for every slot, check if its SSA appears in any op’s `reads() ∪ writes()`. Warn only; not every dead tile is a bug. |
| `[linear-use]` | Advisory heuristic; would need scope-aware analysis (`scf.for` currently flattens) to promote to a hard rule. |
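The `[dead-tile]` row is small enough to sketch in full. The version below is a hedged illustration: `Op` and `dead_tiles` are hypothetical names, and ops are reduced to the lists of SSA names they read and write.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified op: just the SSA names it reads and writes.
struct Op { reads: Vec<String>, writes: Vec<String> }

/// Returns every slot whose SSA name appears in no op's reads or writes.
fn dead_tiles(slots: &[&str], ops: &[Op]) -> Vec<String> {
    // Union of reads and writes across all ops in the function.
    let used: BTreeSet<&str> = ops
        .iter()
        .flat_map(|op| op.reads.iter().chain(op.writes.iter()))
        .map(String::as_str)
        .collect();
    slots
        .iter()
        .copied()
        .filter(|s| !used.contains(s))
        .map(str::to_owned)
        .collect()
}
```

For a function with slots %3, %5 and %7 in which one op reads %5 and writes %7, the function returns only %3, matching the shape of the warning shown for smoke_tdequant_v3.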

Folding any of the first four would make this oracle redundant for those checks — and that is the point. The oracle exists to demonstrate which invariants are reachable as a compile-time guarantee without rewriting ptoas from scratch, and to give users a workaround until upstream lands them.

11.6 End-to-End Reproducer

A single bash script, blog/mdbook/scripts/ch11_safety_demo.sh, runs the whole demo non-interactively. It builds pto-diff, installs two smoke .acl.pto files to /tmp, and runs the oracle on each, printing the expected diagnostics verbatim.

$ bash blog/mdbook/scripts/ch11_safety_demo.sh
== Tool versions ==
ptoas 0.26
pto_to_rust 0.1.0  (tag pto_checks, commit f41b29b1)
rustc 1.91.0-nightly

== Demo 1: smoke_tstore_fp_v1 ==
ptoas rc=0
oracle findings:
  error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
  warn:  [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box

== Demo 2: smoke_tdequant_v3 ==
ptoas rc=0
oracle findings:
  error: [aliasing] m: slots %7 and %5 overlap in vec at [1024, 5120) and [4096, 4352)
  warn:  [dead-tile] m: slot `%3` allocated in vec at offset 8192 but never used

== Summary ==
ptoas accepted both files with rc=0.
Oracle found 2 errors + 2 warnings across the two files.

The script is non-destructive: apart from building pto-diff at target/release/pto-diff it writes no files outside /tmp, and it assumes only that ptoas is on PATH. On the 910B2 test host the whole demo runs in under two seconds.

11.7 Limits and Non-Goals

The oracle trusts ptoas’s placement. If PlanMemoryPass produces an incorrect offset (a ptoas bug), the oracle will either miss the violation or report the wrong byte range. The goal is not to second-guess ptoas’s allocator; it is to verify the allocator’s output against a separate set of invariants.
Loops are flattened. check_linear_use collapses scf.for bodies, so a tile that is legitimately re-written every iteration may be flagged as a write-after-write (WAW) dead store. This is why the check is Severity::Warning, not Error. A scope-aware liveness analysis would lift the restriction at the cost of a more complex pass.
DeviceSpec is per-SoC. The bundled spec is Ascend910B2 (CANN 8.5). Other SoC revisions (Ascend 910_9392, 310P3, upcoming 910C) have different capacity and dtype rules; they can be expressed as a TOML file and passed with --device.
The oracle is advisory, not normative. It emits diagnostics; the user’s build system decides whether a warning becomes a hard error. When integrated into rustc_codegen_mlir (the default PTO codegen path), setting ACLRS_PTO_SAFETY=error promotes every violation to a build failure; the default leaves warnings as warnings.
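The loop-flattening limit is easier to see with the linear-use rule written out. The sketch below is an assumption-laden illustration, not `check_linear_use` itself: `Op` and `dead_stores` are hypothetical, and the op list is taken to be the already-flattened schedule, so the code has no notion of loop iterations at all.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified op over a flattened schedule.
struct Op { reads: Vec<&'static str>, writes: Vec<&'static str> }

/// Reports `(index of the overwritten write, tile)` for every write that is
/// followed by another write to the same tile with no read in between.
fn dead_stores(ops: &[Op]) -> Vec<(usize, &'static str)> {
    // tile -> index of its most recent write that has not been read yet
    let mut pending: BTreeMap<&'static str, usize> = BTreeMap::new();
    let mut out = Vec::new();
    for (i, op) in ops.iter().enumerate() {
        // Reads are handled first, so an op that reads and writes the same
        // tile (an accumulate, for instance) consumes the earlier write.
        for r in &op.reads {
            pending.remove(r);
        }
        for w in &op.writes {
            if let Some(prev) = pending.insert(*w, i) {
                out.push((prev, *w));
            }
        }
    }
    out
}
```

Because the scan sees only one linear order, it cannot tell a genuine dead store from a write pattern that is valid once loop structure is taken into account, which is why a rule of this kind is best kept advisory.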

11.8 Where This Fits in the Bigger Story

The argument threaded through the rest of this book has been that Rust’s type system can be the load-bearing verifier for accelerator kernel code: sharper than C++ at catching ABI bugs, lighter than a bespoke formal-methods stack. This chapter shifts the same argument one level down: the type system of a tiny Rust crate of under 600 lines is enough to catch real bugs in the output of a production MLIR compiler whose own verifier is silent about them. No SMT solvers, no model checkers, no re-implementations: just parse → typed Plan → six passes → print.

The .acl.pto → Plan path is the same shape as the reverse-codegen work in Chapters 5 and 6: a producer-side tool (ptoas/AscendC) is paired with a consumer-side tool (pto_to_rust/ascend-rs) that rebuilds its output in typed Rust and asks Rust “does this type-check?”. Every time the answer is “no”, we find a bug that the producer happily shipped.
