Appendix D: Ecosystem Integration — Workflows, Demos, and Vulnerability Prevention
The Python AI/ML ecosystem generates NPU kernel code through multiple paths: TileLang lowers Python DSL to AscendC C++, PyTorch’s torch.compile with an Ascend backend produces fused kernels, Triton’s Ascend backend lowers GPU-style tile programs, and PyPTO compiles its virtual ISA to AscendC. All four paths share a common failure mode: the generated C++ is compiled by bisheng with no awareness of target hardware constraints. ascend_compile sits between code generation and compilation, catching hardware-specific bugs before they reach the NPU.
D.1 The ascend_compile Integration Hub
The ascend_compile crate provides four integration interfaces, each suited to a different ecosystem role:
- Rust API — `ascend_compile::compile_kernel(source, &config)` for native Rust toolchains
- C ABI — `libascend_compile.so` with `extern "C"` functions (`ascend_compile_kernel`, `ascend_compile_config_new`, etc.) for embedding in C/C++ runtimes
- CLI — `ascend-compile kernel.cpp --soc Ascend910B3 --shared` for shell scripts and CI pipelines
- Python wrapper — `ascend_compile.py` (ctypes over the C ABI) for direct use in Python ML frameworks
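The Python wrapper binds the C ABI through ctypes. A minimal loading sketch is shown below; the symbol names (`ascend_compile_kernel`, `ascend_compile_config_new`) come from the list above, but the signatures are assumptions for illustration, not the actual `libascend_compile.so` API:

```python
# Hypothetical ctypes binding sketch for the C ABI described above.
# The argument and return types are assumed; verify against the real
# header before using this against an actual libascend_compile.so.
import ctypes

def load_ascend_compile(path: str = "libascend_compile.so"):
    lib = ctypes.CDLL(path)  # raises OSError if the library is absent
    # assumed signatures for illustration only
    lib.ascend_compile_config_new.restype = ctypes.c_void_p
    lib.ascend_compile_kernel.argtypes = [
        ctypes.c_char_p,  # kernel source (UTF-8 C++ text)
        ctypes.c_void_p,  # config handle from ascend_compile_config_new
    ]
    lib.ascend_compile_kernel.restype = ctypes.c_int
    return lib
```

Binding through ctypes keeps the wrapper dependency-free, which matters for embedding in ML framework processes that already have heavy native dependencies.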
Before invoking the bisheng compiler, ascend_compile runs three validation passes that scan the kernel source text:
C++ kernel source
|
v
+-----------------------------+
| Pass 1: Entry Point Check |
| __aicore__ present? |
+-----------------------------+
|
v
+-----------------------------+
| Pass 2: DMA/Sync Barrier |
| DataCopy without |
| pipe_barrier()? |
| 310P → error |
| 910B → warning |
+-----------------------------+
|
v
+-----------------------------+
| Pass 3: Buffer Size Check |
| InitBuffer size vs target |
| UB limit: |
| 910B → 192KB (196608 B) |
| 310P → 256KB (262144 B) |
+-----------------------------+
|
v
bisheng compilation
|
v
kernel binary
The Rust implementation of these three passes (crates/ascend_compile/src/validate.rs) operates entirely on string scanning — no compilation or parsing is needed. The validate_kernel() function returns a Vec<ValidationDiagnostic>, where each diagnostic carries a severity (Error or Warning) and an optional line number:
#![allow(unused)]
fn main() {
// crates/ascend_compile/src/validate.rs
pub fn validate_kernel(source: &str, target: AscendTarget) -> Vec<ValidationDiagnostic> {
let mut diags = Vec::new();
check_entry_point(source, &mut diags); // Pass 1
check_sync_barriers(source, target, &mut diags); // Pass 2
check_buffer_sizes(source, target, &mut diags); // Pass 3
diags
}
}
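To make the three passes concrete, here is an illustrative Python re-implementation of the scanning logic. The heuristics are deliberately simplified: this sketch matches only integer-literal buffer sizes and omits the line-number tracking that the real `validate.rs` diagnostics carry.

```python
# Simplified Python sketch of the three string-scan passes.
# crates/ascend_compile/src/validate.rs is authoritative.
import re

UB_LIMITS = {"Ascend910B3": 196608, "Ascend310P1": 262144}  # bytes

def validate_kernel(source: str, soc: str) -> list[str]:
    diags = []
    # Pass 1: entry-point check
    if "__aicore__" not in source:
        diags.append("error: no __aicore__ entry point found")
    # Pass 2: DMA without a sync barrier (error on 310P, warning on 910B)
    if "DataCopy" in source and "pipe_barrier" not in source:
        sev = "error" if "310P" in soc else "warning"
        diags.append(f"{sev}: DMA operations found but no pipe_barrier/sync")
    # Pass 3: per-buffer InitBuffer size vs the target's UB limit
    # (integer literals only in this sketch)
    for m in re.finditer(r"InitBuffer\([^,]+,\s*\d+\s*,\s*(\d+)\s*\)", source):
        size = int(m.group(1))
        if size > UB_LIMITS[soc]:
            diags.append(f"error: InitBuffer size {size} bytes exceeds "
                         f"{soc} UB limit of {UB_LIMITS[soc]} bytes")
    return diags
```

Because every pass is a substring or regex scan, validation cost is linear in source length, which is consistent with running it on every generated kernel before invoking bisheng.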
D.2 TileLang Integration
Note: The `ascend_compile` validation layer (D.1) works today on any C++ kernel source. The “ascend-rs mitigation” workflows described in D.2–D.5 are architectural designs showing how each tool could target Rust instead of C++. The Rust kernel examples compile through the MLIR backend, but the end-to-end integration (tool → Rust → MLIR → C++ → NPU) has not been implemented in any upstream tool. These sections describe a feasible path, not a shipped feature.
Workflow. TileLang generates AscendC C++ from its Python DSL through the LibraryGenerator.compile_lib() method, which internally runs subprocess.run(bisheng, ...). By replacing that final compilation step with ascend_compile.compile_kernel(), TileLang gains target-aware validation without modifying its code generation pipeline.
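The substitution described above is a wrapper around the existing compile step. A hedged, dependency-free sketch of the pattern (the function names `bisheng_compile` and `validate` are illustrative stand-ins, not TileLang or ascend_compile APIs):

```python
# Illustrative wrapper: run target-aware validation before the bare
# bisheng invocation that TileLang's LibraryGenerator.compile_lib()
# would otherwise perform. Names here are placeholders for the sketch.
def with_validation(bisheng_compile, validate):
    """Return a compile step that validates, then compiles."""
    def compile_lib(source: str, soc: str):
        errors = [d for d in validate(source, soc) if d.startswith("error")]
        if errors:
            # fail fast with actionable diagnostics instead of a hung NPU
            raise RuntimeError("validation failed:\n" + "\n".join(errors))
        return bisheng_compile(source, soc)
    return compile_lib
```

Because the wrapper only intercepts the final compilation call, TileLang's lowering pipeline itself needs no changes.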
Demo — compiling a TileLang-generated matmul kernel with validation:
from ascend_compile import compile_kernel
# TileLang generates this C++ source from Python DSL
kernel_source = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void tilelang_matmul(
GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace) {
AscendC::TPipe pipe;
AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueA, inQueueB;
AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueC;
pipe.InitBuffer(inQueueA, 1, 32 * sizeof(half));
pipe.InitBuffer(inQueueB, 1, 32 * sizeof(half));
pipe.InitBuffer(outQueueC, 1, 32 * sizeof(half));
AscendC::GlobalTensor<half> aGm;
aGm.SetGlobalBuffer((__gm__ half*)a);
AscendC::LocalTensor<half> aLocal = inQueueA.AllocTensor<half>();
// DMA load
AscendC::DataCopy(aLocal, aGm, {1, 32, 0, 0});
// compute — but no pipe_barrier between DMA and compute!
AscendC::Mmad(cLocal, aLocal, bLocal, 16, 16, 16);
// DMA store
AscendC::DataCopy(cGm, cLocal, {1, 32, 0, 0});
}
'''
# Compile with validation — catches missing pipe_barrier!
try:
binary = compile_kernel(
kernel_source,
soc="Ascend310P1", # 310P requires explicit barriers
shared=True,
validate=True,
)
except RuntimeError as e:
print(f"Caught: {e}")
# "validation failed:
# error: line 16: DMA operations found but no pipe_barrier/sync
# — required on Ascend310P1 (add pipe_barrier(PIPE_ALL)
# between DMA and compute)"
Vulnerability prevented. Without ascend_compile, TileLang’s bare subprocess.run(bisheng) would compile this kernel successfully. On 310P, the kernel would silently hang — DataCopy completes via the MTE2/MTE3 DMA pipelines, but the compute unit reads stale data from Unified Buffer because no pipe_barrier(PIPE_ALL) separates DMA from compute. The scalar pipeline sees old values, produces garbage output, and the kernel may never terminate. This is vulnerability pattern V6 (missing sync) from Appendix C. The 910B target has auto-sync support that can mask this bug, making it surface only on 310P hardware — exactly the kind of target-dependent failure that ascend_compile catches at compile time.
ascend-rs mitigation. While ascend_compile detects missing barriers, ascend-rs eliminates the vulnerability class entirely. In the safer workflow, TileLang’s Python DSL generates a Rust kernel instead of C++ — the ascend-rs codegen then produces C++ with barriers guaranteed by construction:
#![allow(unused)]
fn main() {
// Rust kernel: TileLang DSL → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn tilelang_softmax(input: *const f32, output: *mut f32, n_ptr: *const u32) {
unsafe {
let n = *n_ptr;
let buf_in = ascend_std::ascend_buf_alloc(n);
let buf_out = ascend_std::ascend_buf_alloc(n);
let work = ascend_std::ascend_buf_alloc(n);
ascend_std::ascend_buf_load_f32(buf_in, input, n);
ascend_std::ascend_pipe_barrier(); // codegen also auto-inserts after DMA
// kernel_ops::softmax_f32 has 4 embedded pipe_barrier() calls —
// impossible to forget any of them
ascend_std::kernel_ops::softmax_f32(buf_out, buf_in, work, n);
ascend_std::ascend_pipe_barrier(); // codegen also auto-inserts before DMA
ascend_std::ascend_buf_store_f32(output, buf_out, n);
}
}
}
The kernel_ops::softmax_f32 composite expands to ReduceMax → Adds → Exp → ReduceSum → Muls with a pipe_barrier(PIPE_ALL) between each step. Additionally, the MLIR→C++ codegen (mlir_to_cpp.rs) automatically inserts pipe_barrier(PIPE_ALL) after every DMA load and before every DMA store — providing a second layer of defense even if the programmer omits the explicit call. The result: synchronization bugs are structurally impossible in ascend-rs kernels, not merely detected.
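The five-step expansion mirrors the standard numerically stable softmax. A NumPy sketch of the same dataflow, where each step boundary is where the composite inserts a `pipe_barrier(PIPE_ALL)`; that the `Adds`/`Muls` steps take the negated max and the reciprocal of the sum is an assumption about the composite, though the end result is the textbook formula:

```python
import numpy as np

def softmax_steps(x: np.ndarray) -> np.ndarray:
    """Stable softmax decomposed into the five composite steps."""
    m = x.max()             # ReduceMax
    shifted = x + (-m)      # Adds   (add the negated max for stability)
    e = np.exp(shifted)     # Exp
    s = e.sum()             # ReduceSum
    return e * (1.0 / s)    # Muls   (scale by the reciprocal of the sum)
```

Subtracting the maximum before exponentiation is what makes the kernel safe for large inputs; without it, `Exp` overflows f32 for inputs above roughly 88.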
D.3 PyTorch Integration
Workflow. torch.compile with an Ascend backend generates AscendC C++ for fused operator subgraphs. The backend calls ascend_compile via the C ABI (libascend_compile.so), which the Python wrapper ascend_compile.py binds through ctypes. This path is suitable for production deployment where the compilation service runs as a long-lived process.
Demo — catching a buffer overflow in a torch.compile-generated kernel:
import torch
from ascend_compile import compile_kernel
# torch.compile's Ascend backend generates AscendC C++ for a fused GELU.
# The code generator computed buffer sizes for a GPU with 48KB shared memory
# per SM, but the Ascend 910B UB is 192KB — and the generated size is wrong.
generated_cpp = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void gelu_kernel(
GM_ADDR input, GM_ADDR output, GM_ADDR workspace) {
AscendC::TPipe pipe;
AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueue;
AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueue;
// torch.compile generated a 300KB buffer — exceeds 910B's 192KB UB!
pipe.InitBuffer(inQueue, 1, 300000);
pipe.InitBuffer(outQueue, 1, 300000);
AscendC::GlobalTensor<float> inputGm;
inputGm.SetGlobalBuffer((__gm__ float*)input);
AscendC::LocalTensor<float> xLocal = inQueue.AllocTensor<float>();
AscendC::DataCopy(xLocal, inputGm, {1, 64, 0, 0});
pipe_barrier(PIPE_ALL);
// ... GELU computation ...
}
'''
try:
binary = compile_kernel(generated_cpp, soc="Ascend910B3")
except RuntimeError as e:
print(f"Caught: {e}")
# "validation failed:
# error: line 10: InitBuffer size 300000 bytes exceeds
# Ascend910B3 UB limit of 196608 bytes
# error: line 11: InitBuffer size 300000 bytes exceeds
# Ascend910B3 UB limit of 196608 bytes"
Vulnerability prevented. Without ascend_compile, a buffer size that exceeds the NPU’s Unified Buffer would compile without error — bisheng does not validate buffer sizes against hardware SRAM limits. At runtime, the kernel writes past physical SRAM boundaries, corrupting adjacent memory regions. On the Ascend NPU, the UB is partitioned across multiple AI Cores; an oversized buffer on one core can overwrite another core’s working data, causing silent data corruption across independent kernels. This is a hardware-level buffer overflow that no C++ compiler can catch. ascend_compile validates InitBuffer sizes against each target’s exact UB limit: 196,608 bytes (192KB) for 910B, 262,144 bytes (256KB) for 310P.
ascend-rs mitigation. In the safer workflow, torch.compile’s Ascend backend generates a Rust kernel instead of C++. Buffer management is handled through typed newtype IDs (UbBuf, L1Buf, L0aBuf, etc.) returned by ascend_buf_alloc() — not raw pointers, not FreeTensor handles. The newtypes prevent mixing buffer memory levels (e.g., passing an L0aBuf to a UB vector operation is a compile error). The codegen translates these IDs to AscendC TBuf<TPosition::VECCALC> objects with sizes computed from the kernel’s data flow analysis:
#![allow(unused)]
fn main() {
// Rust kernel: torch.compile → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn fused_gelu(input: *const f32, output: *mut f32, n_ptr: *const u32) {
unsafe {
let n = *n_ptr;
// Typed buffer IDs (UbBuf) — no pointer arithmetic, no sizing errors
let buf = ascend_std::ascend_buf_alloc(n);
let tmp = ascend_std::ascend_buf_alloc(n);
let work = ascend_std::ascend_buf_alloc(n);
ascend_std::ascend_buf_load_f32(buf, input, n);
ascend_std::ascend_pipe_barrier();
// GELU via composites: x * sigmoid(1.702 * x)
ascend_std::kernel_ops::gelu_f32(tmp, buf, work, n);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(output, tmp, n);
}
}
}
The codegen determines InitBuffer sizes from the kernel’s ascend_buf_alloc(n) calls and the target’s UB limit — if n elements exceed UB capacity, it can tile the computation automatically. No manual buffer size calculation is needed, and no raw byte count is passed to InitBuffer by the programmer. The result: buffer overflow is eliminated by design, not merely detected.
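The automatic-tiling claim above reduces to a capacity calculation: split the element count so that every live buffer fits in UB at once. A hedged sketch of that arithmetic follows; the real codegen's policy (alignment, queue depth, double buffering) is not described in the text, so only the basic division is shown:

```python
# Hedged sketch: deriving a tile size from the target's UB limit.
# The actual mlir_to_cpp.rs tiling policy may differ in detail.
UB_LIMIT = {"910B": 196608, "310P": 262144}  # bytes

def tile_plan(n_elems: int, dtype_bytes: int, n_buffers: int, soc: str):
    """Split n_elems so all live buffers fit in UB simultaneously."""
    budget = UB_LIMIT[soc] // n_buffers       # bytes available per buffer
    per_tile = budget // dtype_bytes          # elements per tile
    n_tiles = -(-n_elems // per_tile)         # ceiling division
    return per_tile, n_tiles
```

For the `fused_gelu` kernel above (three f32 buffers on 910B), each buffer gets 65536 bytes, i.e. 16384 elements per tile; a 100000-element input would be processed in 7 tiles.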
D.4 Triton Integration
Workflow. Triton’s Ascend backend lowers Triton IR (designed for GPU tile programs) to AscendC C++ source. The lowering must translate GPU concepts (thread blocks, shared memory, tl.load/tl.store) to NPU concepts (AI Core blocks, Unified Buffer, DataCopy). A common translation error is omitting the __aicore__ attribute, since GPU kernels use __global__ alone.
Demo — catching a missing entry point annotation:
from ascend_compile import compile_kernel
# Triton's Ascend backend lowered a vector_add kernel from GPU IR to AscendC C++.
# The GPU→NPU translation preserved __global__ but forgot __aicore__.
triton_generated = '''
#include "kernel_operator.h"
extern "C" __global__ void vector_add( // Missing __aicore__!
GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace) {
AscendC::TPipe pipe;
AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX, inQueueY;
AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueZ;
pipe.InitBuffer(inQueueX, 1, 32768);
pipe.InitBuffer(inQueueY, 1, 32768);
pipe.InitBuffer(outQueueZ, 1, 32768);
AscendC::GlobalTensor<float> xGm;
xGm.SetGlobalBuffer((__gm__ float*)x);
AscendC::LocalTensor<float> xLocal = inQueueX.AllocTensor<float>();
AscendC::DataCopy(xLocal, xGm, {1, 64, 0, 0});
pipe_barrier(PIPE_ALL);
// ... vector add computation ...
}
'''
try:
binary = compile_kernel(triton_generated, soc="Ascend910B3")
except RuntimeError as e:
print(f"Caught: {e}")
# "validation failed:
# error: no __aicore__ entry point found"
Vulnerability prevented. The __aicore__ attribute instructs bisheng to generate code for the NPU’s AI Core processor rather than the host ARM/x86 CPU. Without it, bisheng may compile the function with the wrong calling convention, wrong register allocation, and wrong instruction set. The resulting binary exists and loads onto the NPU, but executes with a host ABI on AI Core hardware — producing garbage results, corrupting the stack, or hanging the AI Core entirely. This is a silent, catastrophic failure: no error is raised, the kernel binary is valid ELF, but every computation is wrong. ascend_compile catches it with a single string scan before compilation begins.
ascend-rs mitigation. In the safer workflow, a Triton-Ascend backend lowers Triton IR to a Rust kernel marked with #[aiv_kernel]. The codegen unconditionally emits the correct MLIR attributes (hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>) and the C++ entry point with both __global__ and __aicore__:
#![allow(unused)]
fn main() {
// Rust kernel: Triton IR → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel] // ← triggers automatic __aicore__ in codegen
pub unsafe fn vector_add(
x: *const f32, y: *const f32, z: *mut f32, n_ptr: *const u32,
) {
unsafe {
let n = *n_ptr;
let bx = ascend_std::ascend_buf_alloc(n);
let by = ascend_std::ascend_buf_alloc(n);
ascend_std::ascend_buf_load_f32(bx, x, n);
ascend_std::ascend_buf_load_f32(by, y, n);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_add_f32(bx, bx, by, n);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(z, bx, n);
}
}
}
The codegen in declare.rs detects the #[aiv_kernel] attribute and unconditionally adds the MLIR entry-point attributes. There is no code path where a Rust kernel function can be compiled without the __aicore__ annotation — the attribute is applied by the compiler, not by the programmer. This converts a human-error-prone annotation task into an automatic, toolchain-guaranteed property.
D.5 PyPTO Integration
Workflow. PyPTO defines a virtual ISA of approximately 90 tile-level instructions (pto.load, pto.matmul, pto.store, etc.) that compile to AscendC C++. PyPTO’s tile scheduler optimizes for throughput by using double-buffered tiles, which doubles the memory footprint. When the tile scheduler targets a GPU with abundant shared memory and the generated code is redirected to an NPU target with smaller SRAM, buffer sizes may exceed the physical Unified Buffer.
Demo — catching an oversized double-buffered allocation:
from ascend_compile import compile_kernel
# PyPTO generated C++ from tile-level Python operations:
# pto.load(tile_a) -> pto.matmul(tile_a, tile_b) -> pto.store(tile_c)
# The tile scheduler allocated 2 x 256KB for double-buffered tiles.
pypto_generated = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void pypto_tile_op(
GM_ADDR input, GM_ADDR output, GM_ADDR workspace) {
AscendC::TPipe pipe;
AscendC::TQue<AscendC::QuePosition::VECIN, 2> inQueue;
AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueue;
// PyPTO allocated 256KB per buffer for double-buffered tiles
// 2 buffers x 256KB = 512KB total — but 910B UB is only 192KB!
pipe.InitBuffer(inQueue, 2, 256 * 1024);
pipe.InitBuffer(outQueue, 1, 32768);
AscendC::GlobalTensor<float> inputGm;
inputGm.SetGlobalBuffer((__gm__ float*)input);
AscendC::LocalTensor<float> aLocal = inQueue.AllocTensor<float>();
  AscendC::DataCopy(aLocal, inputGm, {1, 64, 0, 0});
pipe_barrier(PIPE_ALL);
}
'''
try:
binary = compile_kernel(pypto_generated, soc="Ascend910B3")
except RuntimeError as e:
print(f"Caught: {e}")
# "validation failed:
# error: line 10: InitBuffer size 262144 bytes exceeds
# Ascend910B3 UB limit of 196608 bytes"
Vulnerability prevented. PyPTO’s tile scheduler optimizes for throughput by maximizing buffer sizes, but has no knowledge of the target NPU’s physical SRAM capacity. Without target-aware validation, the compiled kernel would attempt to use more Unified Buffer than physically exists. On the Ascend NPU, UB is not virtualizable — there is no page fault mechanism, no swap space, and no memory protection between buffers within a single AI Core. An oversized InitBuffer causes the runtime to lay out buffers that overlap in physical SRAM, resulting in silent memory corruption where one pipeline stage’s DMA writes overwrite another stage’s compute data. ascend_compile catches this because it stores each target’s exact UB size: 196,608 bytes for 910B variants, 262,144 bytes for 310P variants.
ascend-rs mitigation. In the safer workflow, PyPTO’s tile-level operations map to ascend-rs kernel_ops composites. Buffer allocation uses ascend_buf_alloc(n) with element counts, not byte sizes — the codegen computes the physical InitBuffer byte count from the element count and data type, and validates it against the target’s UB limit during code generation:
#![allow(unused)]
fn main() {
// Rust kernel: PyPTO tile ops → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn pypto_tile_matmul(
a: *const u16, b: *const u16, c: *mut f32, n_ptr: *const u32,
) {
unsafe {
let n = *n_ptr;
// Typed buffer allocation — codegen maps to TBuf with correct TPosition
let l1_a = ascend_std::ascend_buf_alloc_l1(n); // L1 buffer
let l0a = ascend_std::ascend_buf_alloc_l0a(n); // L0A buffer (cube input A)
let l0b = ascend_std::ascend_buf_alloc_l0b(n); // L0B buffer (cube input B)
let l0c = ascend_std::ascend_buf_alloc_l0c(n); // L0C buffer (cube output)
// Each alloc maps to a specific TBuf<TPosition::*> in codegen
// L0A → TBuf<TPosition::A1>, L0B → TBuf<TPosition::B1>, etc.
// Mixing positions is a compile error in the generated C++
ascend_std::ascend_mmad_f16(l0c, l0a, l0b, n, n, n, 1);
}
}
}
The codegen emits TBuf<TPosition::A1> for L0A, TBuf<TPosition::B1> for L0B, and TBuf<TPosition::CO1> for L0C — the AscendC type system enforces that L0A buffers cannot be passed to L0B operations, and vice versa. Combined with element-count-based allocation (not raw byte counts), buffer sizing errors are caught at code generation time rather than at hardware runtime. PyPTO’s tile scheduler can target ascend-rs kernels knowing that buffer position and size constraints are enforced by the type system.
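The position mapping and the mixing check can be sketched compactly. The mapping values come from the text above; the emit and check functions are illustrative of what the codegen does, not its actual API:

```python
# Buffer kind -> TPosition mapping, as described in the text.
TPOSITION = {"l0a": "A1", "l0b": "B1", "l0c": "CO1"}

def emit_tbuf(kind: str, size_bytes: int) -> str:
    """Emit the C++ TBuf declaration line for a buffer kind (sketch)."""
    pos = TPOSITION[kind]
    return f"AscendC::TBuf<AscendC::TPosition::{pos}> buf_{kind}; // {size_bytes} B"

def check_mmad_operands(dst: str, lhs: str, rhs: str) -> None:
    """Reject operand kinds that would be a C++ type error in the output."""
    if (dst, lhs, rhs) != ("l0c", "l0a", "l0b"):
        raise TypeError(f"Mmad expects (l0c, l0a, l0b), got ({dst}, {lhs}, {rhs})")
```

Performing the check at code generation time mirrors the guarantee the AscendC type system provides at C++ compile time: a mismatched operand never reaches bisheng.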
D.6 Summary: Detection vs. Structural Mitigation
ascend_compile detects vulnerabilities in C++ code; ascend-rs eliminates the vulnerability class entirely. The following table contrasts both levels of defense:
| Tool | Vulnerability | ascend_compile Detection | ascend-rs Structural Mitigation |
|---|---|---|---|
| TileLang | V6: Missing sync barriers | Error on 310P if DataCopy without pipe_barrier | kernel_ops composites embed all barriers; codegen auto-inserts DMA barriers |
| PyTorch | Buffer size overflow | Error if InitBuffer > target UB limit | ascend_buf_alloc(n) uses element counts; codegen computes byte sizes |
| Triton | Missing __aicore__ entry | Error if __aicore__ not found in source | #[aiv_kernel] triggers unconditional hacc.entry attribute in codegen |
| PyPTO | Buffer exceeds UB limit | Error if InitBuffer > target UB limit | Typed TBuf<TPosition::*> positions; element-count allocation |
The two layers are complementary. ascend_compile validation operates on any C++ kernel source, regardless of origin — it protects the entire ecosystem today. ascend-rs mitigation goes further by making the vulnerability structurally impossible in kernels authored through its Rust→MLIR→C++ pipeline. Tools that adopt ascend-rs as their backend would get both layers automatically. As of this writing, ascend_compile validation is ready for integration; the ascend-rs Rust backend is an architectural option that tool developers could adopt in future versions.
These three validation passes are lightweight — they operate on string scanning with no compilation, parsing, or AST construction needed. The validate_kernel() function adds less than 1ms to the compilation pipeline, even for large kernels. On the NPU, a hung kernel produces no stack trace, no core dump, and no error message — only a timeout. ascend_compile converts these opaque runtime failures into actionable compile-time errors with line numbers and target-specific explanations.
D.7 Golden-Value Testing with PyTorch
Beyond compilation integration, PyTorch serves a second role in the ascend-rs ecosystem: verification. The generate.py script (tests/kernel_correctness/golden/generate.py) produces reference outputs for 72 test cases across 6 categories, using PyTorch and NumPy as the source of truth.
# tests/kernel_correctness/golden/generate.py (excerpt)
import torch
import torch.nn.functional as F
# Generate reference conv2d output with deterministic seed
rng = torch.manual_seed(42)
x = torch.randn(1, 3, 7, 7)
w = torch.randn(8, 3, 3, 3)
y = F.conv2d(x, w, stride=1, padding=0)
# -> conv_golden.json: loaded by `cargo test -p kernel_correctness`
The golden values cover all kernel categories that require non-trivial numerical verification:
| Category | Test Cases | Operations |
|---|---|---|
| Convolution | 16 | conv1d, conv2d, conv3d, depthwise, transposed |
| Index | 14 | argmax/min, gather, scatter, scatter_add, embedding, index_select, masked_fill |
| Pooling | 12 | max_pool1d/2d/3d, avg_pool1d/2d/3d |
| Matmul | 13 | transposed_a, transposed_b, transposed_both, lower/upper triangular |
| Resize | 8 | bilinear upsample, nearest upsample, trilinear, bilinear downsample |
| Misc | 9 | where_broadcast, logic_and, power, masked_cumsum, triplet_loss, lamb_update |
| Total | 72 | |
The Rust test harness (cargo test -p kernel_correctness) loads these JSON files, runs the corresponding ascend-rs kernel implementations on CPU, and compares outputs against PyTorch’s reference values with a tolerance of 1e-4 for floating-point operations.
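The comparison loop described above can be sketched in a few lines. The JSON keys (`input`, `expected`) are assumptions for illustration; the real harness lives in the `kernel_correctness` crate and its file layout may differ:

```python
# Minimal sketch of a golden-value comparison step, assuming a JSON
# schema with "input" and "expected" arrays (hypothetical keys).
import json
import numpy as np

def check_against_golden(golden_path: str, kernel_fn, atol: float = 1e-4) -> bool:
    """Run kernel_fn on the golden input and compare to the reference output."""
    with open(golden_path) as f:
        case = json.load(f)
    got = kernel_fn(np.asarray(case["input"], dtype=np.float32))
    want = np.asarray(case["expected"], dtype=np.float32)
    return bool(np.allclose(got, want, atol=atol))
```

An absolute tolerance of 1e-4 accommodates f32 accumulation-order differences between the CPU reference path and PyTorch while still flagging genuine logic errors.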
Vulnerability prevention. Golden-value testing catches implementation errors that compile-time validation cannot: a gather kernel with an off-by-one index error (vulnerability pattern V2 from Appendix C) compiles cleanly and passes all three ascend_compile validation passes, but produces wrong outputs that diverge from PyTorch’s reference. The golden-value test catches it. Similarly, a conv2d kernel that accumulates in the wrong order (swapping input channel and spatial dimensions) produces numerically valid but semantically wrong results — only comparison against a reference implementation reveals the bug. By generating golden values from PyTorch — the same framework that most ML practitioners use — ascend-rs ensures that its kernel implementations match the numerical behavior that users expect from their models.