11. Catching ptoas Blind Spots with a Rust Safety Oracle
Summary: The PTO-MLIR compiler ptoas is the Ascend NPU’s cube-path lowering tool. It verifies the input MLIR against its own dialect rules, but it does not re-verify the output of its own PlanMemoryPass, the pass that assigns every tile a byte range in UB, L1, L0A/L0B/L0C, and FB. Once placement is done, bad placements survive all the way to codegen. This chapter builds a small Rust crate, pto_to_rust, that rebuilds ptoas’s stage-2 plan as a typed Rust value, runs six safety checks against it, and reports violations back with the original .acl.pto file as the locus. It is demonstrated end-to-end on two real hand-written smoke kernels that ptoas 0.26 accepts with rc=0 but whose kernels would silently corrupt data on-device.
Versions used throughout this chapter: ptoas 0.26 (CANN 8.5.0, installed at /usr/local/bin/ptoas-bin/ptoas on the Ascend 910B2 test host), pto_to_rust 0.1.0 (tag pto_checks, commit f41b29b1), rustc 1.91.0-nightly (f34ba774c 2025-08-03). All numeric results reproduce exactly on these versions; newer ptoas builds may shift placement decisions and therefore the specific byte offsets reported.
11.1 Why ptoas Needs an External Oracle
ptoas is a stage-lowering compiler: PTO-MLIR (tile dialect) in, AscendC C++ out, bisheng-ready. Internally it runs a pipeline whose most load-bearing pass is PlanMemoryPass — the point at which every abstract pto.alloc_tile becomes a concrete (address_space, offset, rows, cols, dtype, blayout, slayout) record. After that pass, the IR is still MLIR and ptoas --print-after-all will dump it, but ptoas itself does not re-verify several invariants that are trivial to verify after you have the post-pass plan in hand.
Six concrete invariants it silently skips:
| # | Invariant | Failure mode if violated |
|---|---|---|
| 1 | Two live tiles with different shapes must not occupy overlapping bytes in the same address space | Silent clobber at runtime; kernel returns wrong data |
| 2 | Per-space high-water byte usage must not exceed the device’s capacity (DeviceSpec) | SRAM overrun; kernel faults or corrupts neighbouring tile |
| 3 | pto.tmatmul operands live in the correct L0 subspace (lhs∈Left, rhs∈Right, acc∈Acc) with a dtype triple in the cube unit’s accepted set | Descriptor garbage; numerics wrong on some CANN revs |
| 4 | ptoas’s descriptor caps: OUTER < 2²⁴, ROW < 2¹⁶ | Truncated descriptor; wrong N dimension |
| 5 | Every tile allocated should be used | Wasted UB budget — not a bug, but a correctness smell ptoas never mentions |
| 6 | Linear-use of tiles: a write should be followed by at least one read before the next write (advisory, loops flattened) | Dead store; earlier value lost |
The remainder of this chapter builds the smallest possible tool that enforces all six and proves it by catching real violations.
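To give a sense of how small these checks can be, here is a minimal sketch of invariant 4. The two cap values come from the table above; the function name `descriptor_violations` and the parameter names `outer` and `row` are hypothetical and are not taken from the pto_to_rust crate.

```rust
/// Descriptor caps from invariant 4 (values taken from the table above).
const OUTER_CAP: u64 = 1 << 24; // OUTER must be < 2^24
const ROW_CAP: u64 = 1 << 16; // ROW must be < 2^16

/// Hypothetical helper: returns one message per cap that is exceeded.
fn descriptor_violations(outer: u64, row: u64) -> Vec<String> {
    let mut out = Vec::new();
    if outer >= OUTER_CAP {
        out.push(format!("OUTER {} is not below 2^24 ({})", outer, OUTER_CAP));
    }
    if row >= ROW_CAP {
        out.push(format!("ROW {} is not below 2^16 ({})", row, ROW_CAP));
    }
    out
}
```

Calling `descriptor_violations(16, 256)` returns an empty vector, while a value of exactly 2^24 for `outer` is already rejected because the cap is a strict bound.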
11.2 Design: Three Steps, Three Artifacts
The oracle is built around a deliberately simple pipeline. Each step produces one artifact that the next step consumes; each artifact is plain text so a human can read it mid-pipeline.
    [step 1]                  [step 2]                     [step 3]
┌──────────────┐   .pto   ┌──────────────┐   plan.rs   ┌───────────────┐   report   ┌────────────────┐
│ ptoas        │ ───────▶ │ pto_to_rust::│ ──────────▶ │ pto_to_rust:: │ ─────────▶ │ pto-diff CLI   │
│ --print-...  │          │ parse_stage2 │             │ check_all     │            │ (human output) │
└──────────────┘          └──────────────┘             └───────────────┘            └────────────────┘
     post-                   typed Rust                  SafetyReport                error/warn lines
 PlanMemoryPass           `Plan { funcs }`              { violations }              file:line:kind:msg
   MLIR dump                                                                          ready for diff
- Dump the stage-2 PTO-MLIR. Run ptoas --print-after-all <file.acl.pto> and keep the last module (the one that follows IR Dump After PlanMemoryPass). This IR has concrete (offset, size) annotations for every tile, which is exactly what the oracle needs.
- Parse it into typed Rust. pto_to_rust::parse_stage2(&str) -> Plan turns the MLIR text into a Plan { arch, funcs: Vec<PlanFunc> } value, where each PlanFunc has a BTreeMap<Ssa, TileSlotX> of concrete tile slots and a Vec<PlanOp> of the ops referencing them. This is the point at which Rust’s type system takes over; once the parser accepts it, all subsequent reasoning happens on statically typed values.
- Run check_all and map violations back to .acl.pto. SafetyReport::check_all(&plan, &device_spec) runs the six passes above and produces a SafetyReport { violations: Vec<SafetyViolation> }. The pto-diff CLI takes the original .acl.pto path, prepends it to every violation message, and emits lines in a file: severity: [kind] func: message format that is diffable, grep-friendly, and looks exactly like a compiler diagnostic.
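The typed value produced in step 2 might look roughly like the sketch below. Only the type names (Plan, PlanFunc, PlanOp, TileSlotX, Ssa) and the container shapes come from the text; every field name, the use of `String` for `Ssa`, and the `byte_range` helper are assumptions made for illustration and may differ from the real crate.

```rust
use std::collections::BTreeMap;

/// SSA value name such as "%7" (assumed here to be a plain string).
type Ssa = String;

/// One concrete tile slot; field names are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct TileSlotX {
    space: String,   // e.g. "vec", "left", "right", "acc", "scaling"
    offset: u64,     // byte offset assigned by PlanMemoryPass
    rows: u64,
    cols: u64,
    elem_bytes: u64, // size in bytes of one element of the tile's dtype
}

impl TileSlotX {
    /// Half-open byte range [start, end) occupied by the tile.
    fn byte_range(&self) -> (u64, u64) {
        (self.offset, self.offset + self.rows * self.cols * self.elem_bytes)
    }
}

/// One op together with the SSA names it reads and writes (assumed shape).
#[derive(Debug, Clone)]
struct PlanOp {
    name: String,
    reads: Vec<Ssa>,
    writes: Vec<Ssa>,
}

#[derive(Debug, Clone)]
struct PlanFunc {
    name: String,
    slots: BTreeMap<Ssa, TileSlotX>,
    ops: Vec<PlanOp>,
}

#[derive(Debug, Clone)]
struct Plan {
    arch: String,
    funcs: Vec<PlanFunc>,
}
```

With this shape, a 16×64 f32 tile at offset 1024 yields the byte range (1024, 5120), and every later check can be written as a plain function over `&Plan` without touching MLIR text again.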
The critical design decision is step 1: rather than reimplementing PlanMemoryPass in Rust (months of work, perpetually out of sync with ptoas), the oracle trusts ptoas’s placement and only checks the invariants that follow from it. This keeps pto_to_rust at under 600 lines of Rust while giving it teeth against real bugs.
11.3 Step-by-Step Walkthrough on a Real Kernel
We will demonstrate the whole flow on smoke_tstore_fp_v1.acl.pto, a hand-written 6-op kernel that probes the pto.tstore_fp dequant path. ptoas 0.26 accepts it (rc=0) and emits a .cpp; the oracle finds two real issues that would only manifest at runtime.
11.3.1 The Input
// smoke_tstore_fp_v1.acl.pto — abridged
module {
func.func @m(%arg0: !pto.ptr<i8>, %arg1: !pto.ptr<i8>, %arg2: !pto.ptr<f16>) {
%c0 = arith.constant 0 : index
// … tensor views …
// lhs: i8 [16×128] in Left
%l_t = pto.alloc_tile : !pto.tile_buf<loc=left, dtype=i8, rows=16, cols=128, …>
pto.tload ins(%pv_l) outs(%l_t)
// rhs: i8 [128×256] in Right
%r_t = pto.alloc_tile : !pto.tile_buf<loc=right, dtype=i8, rows=128, cols=256, …>
pto.tload ins(%pv_r) outs(%r_t)
// acc: i32 [16×256] in Acc
%a_t = pto.alloc_tile : !pto.tile_buf<loc=acc, dtype=i32, rows=16, cols=256, …>
pto.tmatmul ins(%l_t, %r_t) outs(%a_t)
// scale: f16 [1×256] in Scaling — row_major slayout (bug #4)
%s_t = pto.alloc_tile : !pto.tile_buf<loc=scaling, dtype=f16, rows=1, cols=256, slayout=row_major, …>
pto.tload ins(%pv_s) outs(%s_t)
pto.tstore_fp ins(%a_t, %s_t) outs(%pv_o)
return
}
}
Two human-visible issues are lurking:
- The scaling tile’s shape [1 × 256] at f16 needs 512 B of Scaling space, which is fine on its own, but PlanMemoryPass places it at an offset that tips the high-water mark over the 4096 B Scaling cap on 910B2/CANN 8.5.
- The scaling tile’s slayout is row_major, but pto.tstore_fp requires none_box for the fb-dequant hop.
ptoas catches neither.
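The arithmetic behind the first issue can be sketched as follows. The function names are hypothetical. The offset 3840 used in the usage note is not stated in this chapter; it is inferred as 4352 - 512 under the assumption that the scaling tile is the slot that sets the high-water mark.

```rust
/// Bytes occupied by a tile of the given shape.
fn tile_bytes(rows: u64, cols: u64, elem_bytes: u64) -> u64 {
    rows * cols * elem_bytes
}

/// High-water mark of one address space: the largest end offset of any slot.
/// Each slot is an (offset, length) pair in bytes.
fn high_water(slots: &[(u64, u64)]) -> u64 {
    slots.iter().map(|&(off, len)| off + len).max().unwrap_or(0)
}

/// Returns Err with a diagnostic-style message when the cap is exceeded.
fn check_capacity(space: &str, slots: &[(u64, u64)], cap: u64) -> Result<(), String> {
    let hw = high_water(slots);
    if hw > cap {
        Err(format!("{} high-water {} B exceeds capacity {} B", space, hw, cap))
    } else {
        Ok(())
    }
}
```

Here `tile_bytes(1, 256, 2)` gives the 512 B quoted above, and `check_capacity("scaling", &[(3840, 512)], 4096)` fails because 3840 + 512 = 4352 exceeds 4096. The same tile placed at offset 0 would pass.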
11.3.2 Running the Three Steps Manually
# Step 1 — dump stage-2 IR
$ /usr/local/bin/ptoas-bin/ptoas \
--print-after-all /tmp/smoke_tstore_fp_v1.acl.pto \
-o /tmp/out.cpp 2> /tmp/stage2.dump
$ echo "ptoas rc=$?"
ptoas rc=0
# grep for the last "IR Dump After PlanMemoryPass" block
$ awk '/IR Dump After PlanMemoryPass/{flag=1; next} flag' /tmp/stage2.dump > /tmp/stage2.mlir
$ wc -l /tmp/stage2.mlir
74 /tmp/stage2.mlir
# Step 2 — parse into typed Rust (we invoke the library via pto-diff)
# Step 3 — run checks and emit diagnostics
$ ./target/release/pto-diff /tmp/stage2.mlir
/tmp/stage2.mlir: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
/tmp/stage2.mlir: warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
/tmp/stage2.mlir: 1 error(s), 1 warning(s)
Two diagnostics, both real. The error ends the kernel’s correctness (SRAM overrun); the warning ends its usability (fb-dequant silently dropped). Neither was present in the ptoas output.
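Step 1's text extraction can also be done in Rust. The sketch below assumes only that each dump is introduced by a header line containing the text "IR Dump After <pass name>"; the function name `last_dump_after` is hypothetical. Unlike the awk one-liner above, which prints everything after the first matching header, this variant starts at the last matching header and stops at the next dump header, if there is one.

```rust
/// Returns the text that follows the last header line containing
/// "IR Dump After <pass>", up to the next dump header (or end of input).
/// Returns None if the header never appears.
fn last_dump_after<'a>(dump: &'a str, pass: &str) -> Option<&'a str> {
    let marker = format!("IR Dump After {}", pass);
    let pos = dump.rfind(marker.as_str())?;
    // Skip the remainder of the header line itself.
    let rest = &dump[pos..];
    let body_start = rest.find('\n').map(|i| i + 1).unwrap_or(rest.len());
    let body = &rest[body_start..];
    // Cut at the start of the line holding the next dump header, if any.
    let end = match body.find("IR Dump ") {
        Some(i) => body[..i].rfind('\n').map(|j| j + 1).unwrap_or(0),
        None => body.len(),
    };
    Some(&body[..end])
}
```

The returned slice borrows from the input, so a caller could pass it to a parser without copying the dump.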
11.3.3 Running the Three Steps as One Command
For convenience pto-diff bundles all three via --from-pto:
$ ./target/release/pto-diff --from-pto /tmp/smoke_tstore_fp_v1.acl.pto
/tmp/smoke_tstore_fp_v1.acl.pto: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
/tmp/smoke_tstore_fp_v1.acl.pto: warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
/tmp/smoke_tstore_fp_v1.acl.pto: 1 error(s), 1 warning(s)
The file path in each line is the original .acl.pto, not the transient stage-2 dump, so an IDE or git diff view can jump to the right file. This is the mapping-back step: although the checks run on the post-PlanMemoryPass Plan, each diagnostic is attributed to the upstream artifact the tool was given.
11.3.4 What Each Diagnostic Field Means
/tmp/smoke_tstore_fp_v1.acl.pto: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
├─────────── locus ───────────┤  │      │          │
                                 │      │          └── function name inside the module
                                 │      └─── SafetyKind label (aliasing/capacity/op-constraint/
                                 │           matmul-bounds/dead-tile/linear-use)
                                 └── Severity (error=kernel wrong; warn=likely bug, advisory)
The DeviceSpec in the message (Ascend910B2 (CANN 8.5)) is the capacity table used for the check. pto-diff --device spec.toml lets a user supply a different one when targeting other SoC revisions.
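The field layout above maps naturally onto a `Display` implementation. The sketch below is illustrative: the struct `Diagnostic` and its field names are assumptions, and only the printed layout and the `Severity` variant names are taken from this chapter.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Error,
    Warning,
}

/// Hypothetical diagnostic record; only the printed layout comes from the text.
struct Diagnostic<'a> {
    locus: &'a str,   // path of the file the user handed to the tool
    severity: Severity,
    kind: &'a str,    // e.g. "capacity", "aliasing"
    func: &'a str,    // function name inside the module
    message: &'a str,
}

impl<'a> fmt::Display for Diagnostic<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let sev = match self.severity {
            Severity::Error => "error",
            Severity::Warning => "warn",
        };
        write!(
            f,
            "{}: {}: [{}] {}: {}",
            self.locus, sev, self.kind, self.func, self.message
        )
    }
}
```

Because the locus is just a field, swapping the stage-2 dump path for the original .acl.pto path changes nothing else in the record.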
11.4 A Second Kernel: Aliasing and Dead Tiles
The same three-step pipeline, applied to smoke_tdequant_v3.acl.pto, surfaces two different violations — demonstrating the oracle generalises.
$ ./target/release/pto-diff --from-pto /tmp/smoke_tdequant_v3.acl.pto
/tmp/smoke_tdequant_v3.acl.pto: error: [aliasing] m: slots %7 and %5 overlap in vec at [1024, 5120) and [4096, 4352)
/tmp/smoke_tdequant_v3.acl.pto: warn: [dead-tile] m: slot `%3` allocated in vec at offset 8192 but never used
/tmp/smoke_tdequant_v3.acl.pto: 1 error(s), 1 warning(s)
- Aliasing (error). %5 is a 16×64 i8 tile placed at UB offset 4096, length 1024 B. %7 is a 16×64 f32 tile placed at UB offset 1024, length 4096 B. Their byte ranges [4096, 4352) and [1024, 5120) overlap at [4096, 4352): 256 bytes of the f32 tile are the i8 tile. PlanMemoryPass deliberately reused the region because the liveness analysis decided they did not co-exist, but the two tiles have different shapes, so the oracle demotes the reuse from “deliberate” to “probably a bug”. In this case it really is a bug: both are live simultaneously in the op schedule.
- Dead tile (warning). %3 is allocated but never referenced as a read or write of any op in the function, so 4 KiB of UB budget is wasted. ptoas neither reclaims nor warns about it.
Both kernels still produce a runnable .cpp via ptoas. Both would silently misbehave on-device. The oracle surfaces the failure at compile time, before ccec and bisheng and the long edit-compile-run loop on the NPU.
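A sort-and-scan overlap check of the kind behind the aliasing diagnostic can be sketched as below. This is an illustration, not the crate's check_aliasing: the `Slot` type and function names are hypothetical, and the sketch ignores the liveness and shape conditions of invariant 1, comparing byte ranges only. The usage note reuses the two ranges quoted in the diagnostic, plus a 4 KiB range at offset 8192 for `%3`.

```rust
/// A placed tile: SSA name plus its half-open byte range [start, end).
#[derive(Debug, Clone, PartialEq)]
struct Slot {
    ssa: &'static str,
    start: u64,
    end: u64,
}

/// Returns every overlapping pair within one address space.
/// Sorting by start offset lets the inner scan stop as soon as the next
/// slot begins at or after the current slot's end.
fn overlapping_pairs(mut slots: Vec<Slot>) -> Vec<(Slot, Slot)> {
    slots.sort_by_key(|s| s.start);
    let mut out = Vec::new();
    for i in 0..slots.len() {
        for j in (i + 1)..slots.len() {
            if slots[j].start >= slots[i].end {
                break; // all later slots start even further to the right
            }
            out.push((slots[i].clone(), slots[j].clone()));
        }
    }
    out
}

/// Intersection of two half-open ranges, if it is non-empty.
fn intersection(a: &Slot, b: &Slot) -> Option<(u64, u64)> {
    let (lo, hi) = (a.start.max(b.start), a.end.min(b.end));
    if lo < hi {
        Some((lo, hi))
    } else {
        None
    }
}
```

Given `%7` at [1024, 5120), `%5` at [4096, 4352), and `%3` at [8192, 12288), the scan reports exactly one pair, `%7` with `%5`, and their intersection is [4096, 4352), the 256 bytes named in the diagnostic.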
11.5 Mapping Oracle Violations Back to ptoas
Because the oracle runs on ptoas’s own output (stage-2 MLIR), every violation it finds is a specific candidate for upstream inclusion:
| Oracle check | Where to fold it into ptoas |
|---|---|
| [aliasing] | A new VerifyAfterPlanMemoryPass: sort slots per-space by offset, scan pairs. The oracle’s sort-and-scan implementation in check_aliasing (O(n log n) per space, n < 64 in practice) can be ported almost verbatim. |
| [capacity] | Already knowable in PlanMemoryPass itself; it is literally the value the pass computes. A one-line assert(high_water <= cap) at the end of the pass would turn a runtime fault into a compile-time error. |
| [op-constraint] lhs/rhs/acc | An op verifier on pto.tmatmul / pto.tmatmul.acc / pto.tstore_fp. ptoas already has infrastructure for op verifiers; these checks are ~10 lines each. |
| [matmul-bounds] | A stage-2 verifier that runs over the plan. Descriptor cap knowledge (OUTER<2²⁴, ROW<2¹⁶) already exists in the lowering; exposing it to the verifier is a refactor, not a new analysis. |
| [dead-tile] | A cheap post-pass: for every slot, check if its SSA appears in any op’s reads() ∪ writes(). Warn only; not every dead tile is a bug. |
| [linear-use] | Advisory heuristic; would need scope-aware analysis (scf.for currently flattens) to promote to a hard rule. |
Folding any of the first four would make this oracle redundant for those checks — and that is the point. The oracle exists to demonstrate which invariants are reachable as a compile-time guarantee without rewriting ptoas from scratch, and to give users a workaround until upstream lands them.
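The dead-tile row in the table is simple enough to sketch in full. The `Op` type and the function name `dead_tiles` below are hypothetical stand-ins; the logic is just the "SSA appears in reads ∪ writes" test described in the table.

```rust
use std::collections::BTreeSet;

/// Minimal op shape for this sketch: the SSA names an op reads and writes.
struct Op {
    reads: Vec<&'static str>,
    writes: Vec<&'static str>,
}

/// Returns the slots whose SSA name appears in no op's reads or writes,
/// in the order they were given.
fn dead_tiles(slots: &[&'static str], ops: &[Op]) -> Vec<&'static str> {
    let used: BTreeSet<&str> = ops
        .iter()
        .flat_map(|op| op.reads.iter().chain(op.writes.iter()))
        .copied()
        .collect();
    slots.iter().copied().filter(|s| !used.contains(s)).collect()
}
```

For a function with slots `%3`, `%5`, and `%7` whose only op reads `%5` and writes `%7`, the function returns `%3` alone. The result is advisory, matching the table's "warn only" recommendation.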
11.6 End-to-End Reproducer
A single bash script, blog/mdbook/scripts/ch11_safety_demo.sh, runs the whole demo non-interactively. It builds pto-diff, installs two smoke .acl.pto files to /tmp, and runs the oracle on each, printing the expected diagnostics verbatim.
$ bash blog/mdbook/scripts/ch11_safety_demo.sh
== Tool versions ==
ptoas 0.26
pto_to_rust 0.1.0 (tag pto_checks, commit f41b29b1)
rustc 1.91.0-nightly
== Demo 1: smoke_tstore_fp_v1 ==
ptoas rc=0
oracle findings:
error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
== Demo 2: smoke_tdequant_v3 ==
ptoas rc=0
oracle findings:
error: [aliasing] m: slots %7 and %5 overlap in vec at [1024, 5120) and [4096, 4352)
warn: [dead-tile] m: slot `%3` allocated in vec at offset 8192 but never used
== Summary ==
ptoas accepted both files with rc=0.
Oracle found 2 errors + 2 warnings across the two files.
The script writes no files outside /tmp and assumes only that ptoas is on PATH and the oracle binary has been built at target/release/pto-diff. On the 910B2 test host the whole demo runs in under two seconds.
11.7 Limits and Non-Goals
- The oracle trusts ptoas’s placement. If PlanMemoryPass produces an incorrect offset (a ptoas bug), the oracle will either miss the violation or report the wrong byte range. The goal is not to second-guess ptoas’s allocator; it is to verify the allocator’s output against a separate set of invariants.
- Loops are flattened. check_linear_use collapses scf.for bodies, so a tile that is legitimately re-written every iteration may be flagged as WAW. This is why the check is Severity::Warning, not Error. A scope-aware liveness analysis would lift the restriction at the cost of a more complex pass.
- DeviceSpec is per-SoC. The bundled spec is Ascend910B2 (CANN 8.5). Other SoC revisions (Ascend 910_9392, 310P3, upcoming 910C) have different capacity and dtype rules; they can be expressed as a TOML file and passed with --device.
- The oracle is advisory, not normative. It emits diagnostics; the user’s build system decides whether a warning becomes a hard error. When integrated into rustc_codegen_mlir (the default PTO codegen path), setting ACLRS_PTO_SAFETY=error promotes every violation to a build failure; the default leaves warnings as warnings.
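The flattened linear-use rule from the second bullet can be illustrated with a short scan over an access sequence. This is a sketch under assumptions: the `Access` enum and `waw_candidates` are hypothetical names, and the input is taken to be an already-flattened list of (tile, access) pairs with no loop structure left.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Access {
    Read,
    Write,
}

/// Scans a flattened access sequence and returns the SSA names that are
/// written twice with no read in between (write-after-write).
fn waw_candidates(accesses: &[(&'static str, Access)]) -> Vec<&'static str> {
    // true = the most recent write to this tile has not been read yet
    let mut pending: BTreeMap<&str, bool> = BTreeMap::new();
    let mut flagged = Vec::new();
    for &(ssa, access) in accesses {
        match access {
            Access::Read => {
                pending.insert(ssa, false);
            }
            Access::Write => {
                // insert returns the previous state for this tile, if any
                if pending.insert(ssa, true) == Some(true) && !flagged.contains(&ssa) {
                    flagged.push(ssa);
                }
            }
        }
    }
    flagged
}
```

A write, read, write sequence on one tile produces no finding, while two back-to-back writes flag the tile. Because the scan sees no loop boundaries, it cannot tell whether two writes belong to different iterations, which is one reason to keep such a check advisory.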
11.8 Where This Fits in the Bigger Story
The argument threaded through the rest of this book has been that Rust’s type system can be the load-bearing verifier for accelerator kernel code — sharper than C++ at catching ABI bugs, lighter than a bespoke formal-methods stack. This chapter shifts the same argument one level down: the type system of a tiny 600-line Rust crate is enough to catch real bugs in the output of a production MLIR compiler whose own verifier is silent about them. No SMT solvers, no model checkers, no re-implementations — just parse → typed Plan → six passes → print.
The .acl.pto → Plan path is the same shape as the reverse-codegen work in Chapters 5 and 6: a producer-side tool (ptoas/AscendC) is paired with a consumer-side tool (pto_to_rust/ascend-rs) that rebuilds its output in typed Rust and asks Rust “does this type-check?”. Every time the answer is “no”, we find a bug that the producer happily shipped.