11. Extending the Oracle to Ingested linalg Kernels

Summary: Chapter 10 ran the safety oracle on .acl.pto files — PTO-MLIR produced by our own mlir_to_pto backend. This chapter extends the same oracle to kernels that arrive from outside the ascend-rs pipeline: upstream linalg dialect MLIR emitted by third-party frontends like torch-mlir. Two paths are described. Path A is a ~300-line projector that synthesises a stage-2 Plan directly from ascend_tile MLIR and runs a subset of the Chapter 10 passes on it. Path C pipes every ingested kernel through the real mlir_to_pto → ptoas --print-after-all → parse_stage2 chain and runs the full six-pass oracle on the post-PlanMemoryPass plan. Both are wired to the ingress driver via a single environment variable, ACLRS_LINALG_SAFETY, and both run host-only on adablue using an x86 ptoas build — no NPU required. The two paths are complementary: Path A is fast and catches whole-tile issues; Path C is higher fidelity, especially on blocked matmuls where Path A conservatively over-approximates capacity.


11.1 Path A: A Projector for Ingested ascend_tile

Chapter 10 ran the oracle on .acl.pto files — PTO-MLIR produced by our own mlir_to_pto backend. A fair follow-up question is whether the same oracle has anything to say about kernels that arrive from outside the ascend-rs pipeline: specifically, upstream linalg dialect MLIR emitted by third-party frontends like torch-mlir. The linalg bridge (Chapter 7) ingests those kernels, lowers them to our ascend_tile form, and hands them to mlir_to_cpp to produce AscendC. Ingested kernels were, until now, the one code path in the repo with zero Rust-side safety analysis — a visible gap in the “Rust safety card” story.

This section plugs that gap. The same check_* passes from section 10.4, re-aimed at the ingress path, catch three bug classes in the four adversarial fixtures shipped under benchmarks/linalg/kernels_adversarial/. The path is deliberately minimal — a ~300-line projector that synthesises a stage-2 Plan directly from ascend_tile MLIR — and is wired to the ingress driver by a single environment variable.

11.1.1 Why Ingress Looks Different from .acl.pto

The stage-2 oracle in section 10.2 starts from ptoas --print-after-all, where every tile already has a concrete (space, offset, rows, cols, dtype, blayout, slayout). Ingested linalg has none of that: the frontend emits llvm.func @kernel(...) attributes {hacc.entry} with a body of llvm.call @ascend_tile_<op>_<dt>(%args...) intrinsics — pure op-and-operand soup, no placement.

We have two honest options:

  1. Path A (this section): synthesise a naive stage-2 plan by giving every SSA its own UB slot with sequential offsets, then run the oracle against that plan. Cheap to build, bounded in value — can catch whole-tile issues but never sees real buffer reuse.
  2. Path C (section 11.2): pipe every ingested kernel through the real mlir_to_pto → ptoas --print-after-all → parse_stage2 chain and reuse section 10.2 verbatim. Higher fidelity, especially on blocked matmuls where Path A conservatively over-approximates capacity.

Path A is the fast default; Path C runs the full oracle on the post-PlanMemoryPass plan and catches one bug class Path A structurally cannot see. Everything below describes Path A as implemented at commit 381340fc; section 11.2 covers Path C.

11.1.2 An SSA Property That Changes Which Checks Apply

Before describing the projector, one observation about the input format is load-bearing: linalg from torch-mlir is in SSA form, and SSA form automatically renames duplicates. A source-level pattern like y = x + x lowers to a single tile passed twice to linalg.generic; a back-to-back WAW (%t = f(%a); %t = g(%b)) is impossible to express because the second binding gets a fresh name. Two of the oracle’s six checks therefore do not apply to ingested linalg:

  • check_aliasing looks for distinct SSA names sharing overlapping offsets — SSA form prevents that case by construction.
  • The original check_linear_use WAW rule looks for a write followed by another write to the same slot — SSA form renames the second write, so it cannot trigger.

What can survive into a projected plan is the pattern write-never-read: an op produces an SSA value that no later op consumes. This is the canonical shape into which both source-level aliasing and source-level WAW collapse after SSA. To catch it, we added one new check:

// crates/pto_to_rust/src/safety.rs
pub fn check_dead_writes(f: &PlanFunc, rep: &mut SafetyReport) {
    let mut read_slots:   BTreeSet<&Ssa> = BTreeSet::new();
    let mut written_slots: BTreeSet<&Ssa> = BTreeSet::new();
    for op in &f.ops {
        for s in op.reads()  { read_slots.insert(s); }
        for s in op.writes() { written_slots.insert(s); }
    }
    for w in &written_slots {
        if !read_slots.contains(w) {
            let producer = f.ops.iter()
                .position(|op| op.writes().iter().any(|s| s == w));
            let where_clause = producer
                .map(|i| format!(" (produced by op #{})", i))
                .unwrap_or_default();
            rep.violations.push(SafetyViolation::warn(
                &f.name, SafetyKind::DeadTile,
                format!("tile `{}` is written but never read{} \
                         — the producing op is dead code",
                        w.0, where_clause),
            ));
        }
    }
}

check_dead_writes is wired into check_all (so it also improves coverage on hand-written PTO — the original 50-case corpus still passes) and into the new check_ingress subset.
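
To see why write-never-read is the shape that survives SSA, here is a minimal, self-contained sketch of the same two-set scan on a toy op list. The `Op` struct, `dead_writes`, and `waw_after_ssa` are hypothetical stand-ins invented for illustration, not the real PlanFunc API, and the toy plan only approximates what a projected double-write kernel might look like.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a plan op: the SSA names it reads and writes.
struct Op {
    reads: Vec<&'static str>,
    writes: Vec<&'static str>,
}

/// Every SSA name written by some op but read by none, in sorted order.
/// Same idea as the two-set scan in `check_dead_writes`, minus reporting.
fn dead_writes(ops: &[Op]) -> Vec<&'static str> {
    let mut read: BTreeSet<&'static str> = BTreeSet::new();
    let mut written: BTreeSet<&'static str> = BTreeSet::new();
    for op in ops {
        read.extend(op.reads.iter().copied());
        written.extend(op.writes.iter().copied());
    }
    written.difference(&read).copied().collect()
}

/// Toy plan (an assumption) for a source-level double write
/// `t = f(a); t = g(a); store t` after SSA has renamed the two writes.
fn waw_after_ssa() -> Vec<Op> {
    vec![
        Op { reads: vec![], writes: vec!["%t0"] },      // op #0: load a
        Op { reads: vec!["%t0"], writes: vec!["%t1"] }, // op #1: first write
        Op { reads: vec!["%t0"], writes: vec!["%t2"] }, // op #2: second write, renamed
        Op { reads: vec!["%t2"], writes: vec![] },      // op #3: store
    ]
}
```

Calling `dead_writes(&waw_after_ssa())` yields only `%t1`: once the second write gets its own name, the first write has no reader, so the source-level WAW surfaces as a dead write.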

11.1.3 The Projector

pto_to_rust::project(&ascend_tile_src) -> ProjectResult { plan, warnings } (~300 LoC, crates/pto_to_rust/src/ascend_tile_ingress.rs) walks the ascend_tile MLIR text and emits a PlanFunc per llvm.func @name ... attributes {hacc.entry}. The rules are intentionally small:

| Input form | Projected slot/op |
|---|---|
| %c = llvm.mlir.constant(N : i32) | remembered shape constant |
| llvm.call @ascend_tile_load_<dt>(%buf, %r, %c) -> %t | allocate slot %t in UB at next sequential offset; TLoad op |
| llvm.call @ascend_tile_store_<dt>(%buf, %t, %r, %c) | TStore { tile: %t } |
| llvm.call @ascend_tile_<unop>_<dt>(%a) -> %t for exp/log/sqrt/rsqrt/tanh/abs/neg/sigmoid/silu/relu/softmax/rms_norm | allocate %t; TUnary { src: %a, dst: %t } |
| llvm.call @ascend_tile_<binop>_<dt>(%a, %b) -> %t for add/sub/mul/div/max/min | allocate %t; TBinary { a, b, dst } |
| llvm.call @ascend_tile_matmul_<dt>(%a, %b) -> %t | allocate %t; TMatmul (all in UB, see below) |
| any other llvm.call @ascend_tile_* | TUnary placeholder + warning |

Two design choices are worth flagging:

  • Every SSA gets its own slot. The projector does not model buffer reuse — mlir_to_cpp’s real allocator does that later. Capacity is therefore a conservative over-approximation: a kernel that the real allocator slims to 64 KiB might project to 512 KiB. The trade-off is deliberate — for adversarial fixtures the over-approximation is exactly the signal we want; for production-fit kernels it will false-positive on capacity (a documented limit, on Path C’s to-do list).
  • Matmul goes in UB, not L0. There is no Left/Right/Acc annotation to recover from ascend_tile form. The projector puts every operand in UB and tags the op TMatmul, but the check_ingress subset does not run check_op_constraint or check_matmul_bounds — doing so would report every matmul as mis-placed. Those checks are Path C territory.

check_ingress runs exactly five passes: aliasing + capacity + dead_tiles + dead_writes + linear_use. The two it skips are check_op_constraint and check_matmul_bounds, for the reason given above. (The first two are still worth running: aliasing is vacuous on SSA-projected plans, so it is effectively a no-op, and capacity catches egregious whole-tile cases.)
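
The "every SSA gets its own slot" rule and the resulting capacity over-approximation can be sketched as follows. This is an illustration under stated assumptions, not the projector's code: `Tile` and `project_slots` are hypothetical names, alignment padding is ignored, and the two-tile example (one loaded tile plus one exp result) is inferred from the fixture description rather than read from the real projected plan.

```rust
/// Hypothetical tile description: shape plus element size in bytes.
struct Tile {
    name: &'static str,
    rows: u64,
    cols: u64,
    elem_bytes: u64,
}

/// UB capacity in bytes as quoted for Ascend910B2 (192 KiB).
const UB_CAP_BYTES: u64 = 196_608;

/// Give each tile the next sequential offset (no reuse) and return the
/// (name, offset) pairs together with the high-water mark in bytes.
fn project_slots(tiles: &[Tile]) -> (Vec<(&'static str, u64)>, u64) {
    let mut offset = 0u64;
    let mut slots = Vec::with_capacity(tiles.len());
    for t in tiles {
        slots.push((t.name, offset));
        offset += t.rows * t.cols * t.elem_bytes;
    }
    (slots, offset)
}

/// True when the naive plan would exceed the UB capacity.
fn over_capacity(high_water: u64) -> bool {
    high_water > UB_CAP_BYTES
}
```

With two 1×131072 f32 tiles, the second slot starts at offset 524288 and the high-water mark is 1048576 B, well above 196608 B. Because nothing is ever reused, the sum only grows with the number of SSA values, which is why the figure is an upper bound rather than the shipping footprint.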

11.1.4 Wiring into the Ingress Driver

The linalg_to_ascendc binary (the tool that consumes linalg MLIR and produces an AscendC .cce) gained one opt-in block:

// crates/mlir_to_cpp_tests/src/bin/linalg_to_ascendc.rs
if let Ok(mode) = std::env::var("ACLRS_LINALG_SAFETY") {
    let projected = pto_to_rust::project(&ascend_tile);
    for w in &projected.warnings {
        eprintln!("linalg-safety [projector]: {}", w);
    }
    let spec = pto_to_rust::default_a5_910b2_cann85();
    let report = pto_to_rust::check_ingress(&projected.plan, &spec);
    let mut err_count = 0usize;
    for v in &report.violations {
        let sev = match v.severity {
            pto_to_rust::Severity::Error => { err_count += 1; "error" }
            pto_to_rust::Severity::Warning => "warning",
        };
        eprintln!("linalg-safety [{}] {}: {} (in `{}`)",
                  sev, v.kind.label(), v.message, v.func);
    }
    if mode == "error" && err_count > 0 {
        eprintln!("linalg-safety: {} error(s), aborting \
                   (ACLRS_LINALG_SAFETY=error)", err_count);
        std::process::exit(3);
    }
}
  • ACLRS_LINALG_SAFETY=1 runs advisory: warnings print, emission proceeds.
  • ACLRS_LINALG_SAFETY=error promotes any Severity::Error to exit code 3, matching the convention ACLRS_PTO_SAFETY=error already uses for the .acl.pto path.

A sibling helper, linalg_safety_dump, prints the projected Plan (slots + ops) alongside the full report — useful when an ingress fixture behaves unexpectedly and you want to see what the projector actually built.

11.1.5 Four Adversarial Fixtures

benchmarks/linalg/kernels_adversarial/ ships four .mlir inputs, each written to trigger one bug class. They are intentionally small (one function, ≤3 ops) so the projected plan is transparent.

| Fixture | Source-level pattern | Expected finding |
|---|---|---|
| aliasing_same_tensor_twice.mlir | linalg.generic { %arg0, %arg0 } → add | clean: SSA dedupes the second operand |
| capacity_overflow_1x131072.mlir | exp on a 1×131072 f32 tile (512 KiB) | capacity error (UB cap 192 KiB) |
| dead_tile_unused_intermediate.mlir | %t = exp(%a) produced then discarded; return %a + %b | dead-tile warning on %t |
| waw_double_write.mlir | two linalg.generic ops with the same outs | dead-tile warning on the first op’s SSA (renamed by SSA) |

Running the driver on each with ACLRS_LINALG_SAFETY=1 gives the verbatim output below (captured on adablue, commit 381340fc, release build of linalg_to_ascendc):

$ for f in aliasing_same_tensor_twice capacity_overflow_1x131072 \
           dead_tile_unused_intermediate waw_double_write; do
    echo "=== $f ==="
    ACLRS_LINALG_SAFETY=1 crates/mlir_to_cpp_tests/target/release/linalg_to_ascendc \
      benchmarks/linalg/kernels_adversarial/$f.mlir /tmp/out.cce 2>&1 \
      | grep -E '^linalg-safety' || echo '(clean — no findings)'
  done
=== aliasing_same_tensor_twice ===
(clean — no findings)
=== capacity_overflow_1x131072 ===
linalg-safety [error] capacity: vec high-water 1048576 B exceeds capacity 196608 B
  (on Ascend910B2 (CANN 8.5)) (in `adv_capacity_overflow`)
=== dead_tile_unused_intermediate ===
linalg-safety [warning] dead-tile: tile `%t2` is written but never read
  (produced by op #2) — the producing op is dead code (in `adv_dead_tile`)
=== waw_double_write ===
linalg-safety [warning] dead-tile: tile `%t1` is written but never read
  (produced by op #1) — the producing op is dead code (in `adv_waw`)

The clean line on the aliasing fixture is the honest part of the story: SSA renames %arg0, %arg0 to a single operand before the projector ever sees it, and the oracle says so by staying silent. Error-mode promotes the capacity finding to exit 3:

$ ACLRS_LINALG_SAFETY=error crates/mlir_to_cpp_tests/target/release/linalg_to_ascendc \
    benchmarks/linalg/kernels_adversarial/capacity_overflow_1x131072.mlir /tmp/out.cce
linalg-safety [error] capacity: ...
linalg-safety: 1 error(s), aborting (ACLRS_LINALG_SAFETY=error)
$ echo $?
3

11.1.6 Reproducer

Two test suites cover the wiring end-to-end; both are green on adablue-probe:

$ cargo test -p pto_to_rust --test adversarial_ingress --release
test adv_aliasing_same_tensor_twice_clean            ... ok
test adv_capacity_overflow_flagged                   ... ok
test adv_dead_intermediate_and_dead_write_flagged    ... ok
test adv_waw_double_write_flagged                    ... ok
test ingress_aliasing_projects_cleanly               ... ok
test ingress_capacity_1x131072_flagged               ... ok
test ingress_dead_intermediate_caught_by_dead_write  ... ok
test ingress_waw_caught_as_dead_write                ... ok
8 passed; 0 failed

The first four exercise hand-crafted PlanFunc values (the oracle proper); the last four exercise the projector itself — starting from the .mlir text and asserting that project() + check_ingress() produces the expected Violation set. Adding a new adversarial pattern is therefore an .mlir plus one test: no new oracle code.

11.1.7 What This Path Does Not Catch

It is worth naming the limits explicitly so the claim lands as “Rust safety on ingested kernels, within these bounds” rather than “Rust catches all the things”:

  • Cross-op buffer reuse bugs. The projector gives every SSA its own slot, so real allocator-level collisions in mlir_to_cpp::analyze_kernel pass through unchecked. Closing this gap is the Path A follow-up: feed reuse decisions back into the projector so the capacity figure and aliasing surface match the shipping footprint.
  • Matmul placement + blocked shapes. No Left/Right/Acc in the projected plan, so check_op_constraint and check_matmul_bounds are deliberately skipped. Worse, for blocked matmuls — where mlir_to_pto tiles a large N into many per-op chunks — Path A’s capacity check reports the pre-blocking footprint, which is a false positive. Matmul fidelity is Path C (section 11.2); the matmul_row_overflow fixture below is the empirical demonstration.
  • Numerics. The oracle is structural; a fixture that produces wrong output but allocates correctly will pass.

Despite the limits, the four demo fixtures establish the new baseline: ingested linalg is no longer an un-analysed input. The same six-pass oracle from section 10.4 now sees both sides of the ascend-rs ingress boundary, and the ACLRS_LINALG_SAFETY=error setting gives downstream build systems the same advisory-or-hard knob that ACLRS_PTO_SAFETY=error already provides for self-emitted kernels.


11.2 Path C: The Full Oracle on Post-PlanMem Plans

Section 11.1 was honest about its ceiling: Path A can only see what a single text walk of ascend_tile tells it, and it cannot see buffer reuse, cannot see matmul placement, and cannot see mlir_to_pto’s own shape decisions (tile blocking, Kb selection, fractal packing). The interesting question is whether those gaps need a whole new analysis or whether the existing six-pass oracle from section 10.4 can be re-used verbatim on a plan that already has all that information in it. Path C says yes — just lower the ingested linalg through the real compilation pipeline and run the oracle on the post-PlanMemoryPass MLIR that ptoas --print-after-all emits. No new passes, no new plan format, one new driver.

11.2.1 Host-only on adablue

The assumption that blocked Path C earlier was that ptoas lives only on 910c (aarch64, NPU hardware). It does not: it also ships as an x86 build at ~/ptoas-x86/bin/ptoas on adablue, and that binary produces correct --print-after-all output for static analysis without any NPU present. Path C is therefore cleanly separable from NPU execution: static safety analysis is a pure host concern, while numerical validation is where you need 910c. This matches how cargo check and cargo test split on a cross-compilation project.

11.2.2 The Five Hops

 linalg.mlir                      ── hop 1 ── linalg_to_ascend_tile
  │
 ascend_tile MLIR                 ── hop 2 ── mlir_to_pto
  │
 .acl.pto (PTO-MLIR)              ── hop 3 ── ptoas --print-after-all (x86)
  │
 stage-2 MLIR in stderr           ── hop 4 ── pto_to_rust::parse_stage2
  │
 post-PlanMem `Plan`              ── hop 5 ── check_all (all 6 passes)
  │
 SafetyReport

Hops 1 and 2 are the existing ingress path. Hop 3 invokes the unmodified x86 ptoas as a subprocess and captures --print-after-all output on stderr. Hops 4 and 5 are the section 10.2 flow, unchanged — same parse_stage2, same check_all, same DeviceSpec. Path C contributes only the plumbing between them.

A standalone probe binary (linalg_path_c_probe, one .rs file) drives the full chain with PASS/FAIL per hop; it exists mainly as a diagnostic tool for adding new fixtures. Production use goes through the driver (section 11.2.4).

11.2.3 Where Path C Beats Path A

The advertised win of Path C is “tighter capacity, catches matmul bounds”, so it is worth stating what the current fixtures actually demonstrate. Running Path C against every fixture in benchmarks/linalg/ (commit b6db7cae) produces the following findings table:

| Fixture | hop 3 rc | hop 5 findings |
|---|---|---|
| upstream/{add,exp,matmul,softmax} | 0 | clean |
| adv/aliasing_same_tensor_twice | 0 | clean (SSA dedup, matches Path A) |
| adv/capacity_overflow_1x131072 | 1 | ptoas: vec overflow, requires 8388608 bits while 1572864 bits avaliable |
| adv/dead_tile_unused_intermediate | 0 | dead-tile on %5 (post-PlanMem SSA) |
| adv/waw_double_write | 0 | dead-tile on %3 (post-PlanMem SSA) |
| adv/matmul_row_overflow (16×16 × 16×65536) | 0 | clean (Path A reports capacity 8 MiB; Path C is correct) |

The last row is the empirical value-add. Path A’s projector sums raw linalg tensor footprints: the output tile 16×65536×4 = 4 MiB alone exceeds the 192 KiB UB cap on 910B2, so check_capacity fires as an error. But mlir_to_pto blocks that N=65536 into many per-op chunks of N=32 before ever emitting pto.tmatmul. The post-PlanMemoryPass plan never has a tile that big, and Path C reports clean — the correct answer. This is the empirical instance of the “conservative over-approximation” caveat that section 11.1.7 warned about; Path C is the remedy.
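
The arithmetic behind that row can be checked with a few lines. This is a sketch under assumptions: the helper names are hypothetical, all three tiles are taken to be f32, and the per-chunk output shape 16×32 is inferred from the N=32 blocking mentioned above rather than read from an actual post-PlanMemoryPass plan.

```rust
/// UB capacity in bytes as quoted for Ascend910B2 (192 KiB).
const UB_CAP_BYTES: u64 = 196_608;

/// Footprint of a dense tile in bytes.
fn tile_bytes(rows: u64, cols: u64, elem_bytes: u64) -> u64 {
    rows * cols * elem_bytes
}

/// Number of chunks when a dimension of size `n` is split into pieces of
/// at most `block` elements (ceiling division).
fn num_chunks(n: u64, block: u64) -> u64 {
    (n + block - 1) / block
}

/// Pre-blocking footprint as Path A would sum it: A (16×16), B (16×65536)
/// and the output (16×65536), each in its own slot.
fn path_a_footprint() -> u64 {
    tile_bytes(16, 16, 4) + tile_bytes(16, 65536, 4) + tile_bytes(16, 65536, 4)
}

/// Output footprint of one blocked chunk, assuming N is split into 32s.
fn blocked_output_chunk() -> u64 {
    tile_bytes(16, 32, 4)
}
```

Under these assumptions the unblocked sum is 8389632 B, while a single blocked output chunk is 2048 B across 2048 chunks, which fits comfortably under the 196608 B cap. The two paths disagree only because they measure different plans.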

Two other honest findings from running the probe end to end:

  • ptoas has its own sanity bounds. On large shapes (dims > 4095) ptoas’s built-in verifier rejects pto.tmatmul well before our check_matmul_bounds (ROW < 2^16) would trigger, so that pass is mostly dormant on ingested linalg. The rejection still surfaces — Path C treats ptoas rc≠0 as an Error finding, so the violation reaches the user either way — just from a different layer than section 10.4’s check.
  • SSA names differ between Path A and Path C. Path A reports %t2; Path C reports %5. Both are correct (same tile, different dialects — ascend_tile vs post-PlanMemoryPass MLIR), and Path C’s names match the emitted C++ byte-for-byte.

11.2.4 Driver Wiring

The ingress driver gained a Path C mode alongside the existing Path A one:

// crates/mlir_to_cpp_tests/src/bin/linalg_to_ascendc.rs
if let Ok(mode) = std::env::var("ACLRS_LINALG_SAFETY") {
    let abort_env = std::env::var("ACLRS_LINALG_SAFETY_ABORT")
        .ok().as_deref() == Some("1");
    let abort_on_error = abort_env || mode == "error";
    let err_count = if mode == "path-c" {
        run_path_c(&ascend_tile)   // hops 2..5
    } else {
        run_path_a(&ascend_tile)   // project + check_ingress
    };
    if abort_on_error && err_count > 0 {
        eprintln!("linalg-safety: {} error(s), aborting", err_count);
        std::process::exit(3);
    }
}

The knobs are:

| Env var | Effect |
|---|---|
| ACLRS_LINALG_SAFETY=1 (or path-a) | Path A (projector + check_ingress), advisory |
| ACLRS_LINALG_SAFETY=path-c | Path C (full pipeline through ptoas), advisory |
| ACLRS_LINALG_SAFETY=error | Path A + abort on error findings (back-compat with §11.1) |
| ACLRS_LINALG_SAFETY_ABORT=1 | Abort on error findings, combinable with either path |
| ACLRS_PTOAS_BIN=<path> | Override the default $HOME/ptoas-x86/bin/ptoas |
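
How the first four knobs combine can be written as a small pure function. This is a sketch that mirrors the driver snippet above; `Path`, `SafetyMode`, and `decode_mode` are hypothetical names, and the values are passed in as arguments instead of being read from the environment so the logic is easy to test.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Path {
    A,
    C,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct SafetyMode {
    path: Path,
    abort_on_error: bool,
}

/// Decode the two environment values into a mode.
/// `safety` is the value of ACLRS_LINALG_SAFETY (None means unset, so no
/// analysis runs); `abort` is the value of ACLRS_LINALG_SAFETY_ABORT.
fn decode_mode(safety: Option<&str>, abort: Option<&str>) -> Option<SafetyMode> {
    let mode = safety?;
    // As in the driver snippet, anything other than "path-c" selects Path A.
    let path = if mode == "path-c" { Path::C } else { Path::A };
    let abort_on_error = abort == Some("1") || mode == "error";
    Some(SafetyMode { path, abort_on_error })
}
```

In a real driver the arguments would come from something like `std::env::var("ACLRS_LINALG_SAFETY").ok().as_deref()`. Note that, following the snippet above, an unrecognised value falls through to Path A in advisory mode rather than being rejected.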

run_path_c surfaces a non-zero ptoas exit code as a Severity::Error finding rather than a hard crash. A broken kernel that ptoas itself rejects is a safety finding — it just happens to be one that a different layer catches. Treating it as a structured error keeps the reporting surface uniform.
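
A minimal sketch of that idea, using only the standard library: run a subprocess, capture its stderr, and turn a non-zero exit into a structured finding instead of panicking. `Finding`, `Severity`, and `run_and_classify` are hypothetical names for illustration and are not the actual `run_path_c` implementation; the command and arguments are supplied by the caller.

```rust
use std::io;
use std::process::Command;

#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
}

#[derive(Debug)]
struct Finding {
    severity: Severity,
    message: String,
}

/// Run `bin args...`, capture stderr, and map a non-zero exit status to an
/// Error finding. Returns the captured stderr either way so a later stage
/// can still parse it. An `Err` is returned only if the process could not
/// be spawned at all.
fn run_and_classify(bin: &str, args: &[&str]) -> io::Result<(String, Option<Finding>)> {
    let out = Command::new(bin).args(args).output()?;
    let stderr = String::from_utf8_lossy(&out.stderr).into_owned();
    let finding = if out.status.success() {
        None
    } else {
        // `code()` is None when the process was terminated by a signal.
        let rc = out
            .status
            .code()
            .map_or_else(|| "signal".to_string(), |c| c.to_string());
        let first_line = stderr.lines().next().unwrap_or("").to_string();
        Some(Finding {
            severity: Severity::Error,
            message: format!("{} exited with rc={}: {}", bin, rc, first_line),
        })
    };
    Ok((stderr, finding))
}
```

The design choice is that a rejected kernel and a crashed tool look different to the caller: the former is `Ok` with a finding attached, the latter is an `io::Error` from the spawn.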

11.2.5 Demo: Path A vs Path C on matmul_row_overflow

$ BIN=crates/mlir_to_cpp_tests/target/release/linalg_to_ascendc
$ ACLRS_LINALG_SAFETY=path-a $BIN \
    benchmarks/linalg/kernels_adversarial/matmul_row_overflow.mlir /tmp/a.cce \
    2>&1 | grep linalg-safety
linalg-safety [path-a] [error] capacity: vec high-water 8389632 B exceeds capacity
  196608 B (on Ascend910B2 (CANN 8.5)) (in `adv_matmul_row_overflow`)

$ ACLRS_LINALG_SAFETY=path-c $BIN \
    benchmarks/linalg/kernels_adversarial/matmul_row_overflow.mlir /tmp/c.cce \
    2>&1 | grep linalg-safety
(no output — Path C reports clean)

Path A false-positives with a capacity claim of roughly 8 MiB (8389632 B); Path C correctly sees the post-blocking plan and stays silent. Same kernel, same oracle passes, different layer of MLIR as input, and that is the whole point of having Path C.

11.2.6 Reproducer

Three integration tests exercise the driver end-to-end; they spawn the release binary with each mode and assert on exit code + stderr:

$ cargo test --manifest-path crates/mlir_to_cpp_tests/Cargo.toml \
    --test path_c_driver --release
test path_c_clean_upstream_add                         ... ok
test path_c_clean_where_path_a_overapproximates        ... ok
test path_c_surfaces_ptoas_capacity_overflow           ... ok
3 passed; 0 failed

The tests auto-discover ptoas at $ACLRS_PTOAS_BIN or $HOME/ptoas-x86/bin/ptoas and skip with a message if neither exists, so CI on machines without an x86 ptoas build stays green.
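
The discovery-then-skip behaviour can be sketched like this. `discover_ptoas` is a hypothetical helper, not the code in the test suite; it takes the two environment values as arguments so it can be exercised without mutating the process environment, and it assumes the override is tried first with the $HOME default as the fallback.

```rust
use std::path::{Path, PathBuf};

/// Return the first candidate that exists on disk, trying the explicit
/// override (the value of ACLRS_PTOAS_BIN) before the default location
/// under `home`. `None` means the caller should skip the test.
fn discover_ptoas(override_bin: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    let candidates = vec![
        override_bin.map(PathBuf::from),
        home.map(|h| Path::new(h).join("ptoas-x86").join("bin").join("ptoas")),
    ];
    candidates.into_iter().flatten().find(|p| p.is_file())
}

/// What a test prologue might print before returning early.
fn skip_message() -> &'static str {
    "skipping: no ptoas found at $ACLRS_PTOAS_BIN or $HOME/ptoas-x86/bin/ptoas"
}
```

A test would call `discover_ptoas(std::env::var("ACLRS_PTOAS_BIN").ok().as_deref(), std::env::var("HOME").ok().as_deref())` and, on `None`, print the skip message and return without failing.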

11.2.7 Non-goals

Path C is not a claim that the ingress oracle has closed every gap:

  • check_op_constraint and check_matmul_bounds remain largely dormant on ingress. mlir_to_pto pre-filters most violating shapes at hop 2, and ptoas filters the rest at hop 3 with a tighter dims ≤ 4095 bound than the oracle’s ROW < 2^16. Those two checks stay useful for hand-written .acl.pto (the original section 10.2 target), but on the ingress path they are rarely the first line of defence.
  • Path C still trusts ptoas’s own pipeline. If ptoas silently accepts a plan with a placement bug that neither it nor our passes catch, Path C will report clean. The oracle-catches-ptoas-blind-spots claim from section 10.3 still applies only to the slots the oracle knows how to read.
  • Numerics remain out of scope. Same as Path A.

What Path C does close is the specific gap named in §11.1.7: blocked matmuls where Path A conservatively fails safe. Any future ingested matmul-heavy kernel (LLM MLPs, attention projections, batched GEMMs) now has a clean structural signal instead of a capacity false positive, and it does so by reusing the exact six passes from section 10.4 on a plan that ptoas already constructed. No new oracle code; the value comes from running the old oracle at a more informative point in the lowering.