
Memory-Safe NPU Kernel Programming in Rust: The ascend-rs Project


Abstract

This article introduces ascend-rs, a framework providing safe Rust bindings for Huawei Ascend NPUs, currently in a private repository pending an open-source release decision. Starting from a Hello World example, we walk through an end-to-end vector multiplication kernel to demonstrate memory-safe NPU programming on both the host and device sides. We cover the current open-source landscape, the technical approach behind ascend-rs, and the road ahead.



1. Background: The State of NPU Programming

Why Memory Safety Matters

In heterogeneous computing, GPU/NPU programming has long relied on C/C++ ecosystems. Frameworks like CUDA, OpenCL, and SYCL are powerful but inherit all of C/C++’s memory safety problems: dangling pointers, buffer overflows, data races, and resource leaks. These issues are especially tricky in heterogeneous environments, where interactions between device and host memory add another layer of complexity.

A typical NPU programming mistake might look like this:

// C++ AscendC: Forgetting to free device memory → memory leak
void* devPtr;
aclrtMalloc(&devPtr, size, ACL_MEM_MALLOC_HUGE_FIRST);
// ... use devPtr for computation ...
// If an exception occurs here, aclrtFree is never called
aclrtFree(devPtr);

Rust’s ownership system and RAII (Resource Acquisition Is Initialization) pattern eliminate such problems at compile time. This is the core motivation behind the ascend-rs project.
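A minimal sketch of the RAII idea in isolation — `DevicePtr` here is a hypothetical stand-in backed by a host allocation, not a real ascend-rs type. The point is that `Drop` runs on every exit path, including early error returns, so the leak in the C++ snippet above cannot occur:

```rust
// Hypothetical RAII wrapper (illustration only, not the ascend-rs API).
struct DevicePtr {
    ptr: *mut u8,
    size: usize,
}

impl DevicePtr {
    fn alloc(size: usize) -> DevicePtr {
        // Stand-in for aclrtMalloc.
        let ptr = Box::into_raw(vec![0u8; size].into_boxed_slice()) as *mut u8;
        DevicePtr { ptr, size }
    }
}

impl Drop for DevicePtr {
    fn drop(&mut self) {
        // Stand-in for aclrtFree: the compiler guarantees this runs exactly once,
        // no matter how the owning scope is exited.
        unsafe {
            let slice: *mut [u8] = std::slice::from_raw_parts_mut(self.ptr, self.size);
            drop(Box::from_raw(slice));
        }
    }
}

fn compute(fail: bool) -> Result<(), &'static str> {
    let _buf = DevicePtr::alloc(1024);
    if fail {
        return Err("kernel launch failed"); // _buf is freed here too
    }
    // ... use _buf for computation ...
    Ok(())
} // _buf freed here on the success path
```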

The Open-Source Landscape

Several open-source projects have explored memory-safe heterogeneous computing:

| Project | Target | Approach | Status |
|---|---|---|---|
| rust-cuda | NVIDIA GPU | Rust → PTX compilation, safe CUDA bindings | Inactive |
| rust-gpu | GPU (Vulkan) | Rust → SPIR-V compilation | Active |
| krnl | GPU (Vulkan) | Safe GPU compute kernels | Active |
| cudarc | NVIDIA GPU | Safe CUDA runtime bindings | Active |
| ascend-rs | Huawei Ascend NPU | Rust → MLIR → NPU, safe ACL bindings | In development |

As you can see, ascend-rs is the only project in the Ascend NPU ecosystem attempting memory-safe Rust programming on both the host and device sides. This fills an important gap in the Ascend ecosystem.

ascend-rs Architecture

ascend-rs uses a three-layer architecture:

graph TD
    A["Application Layer<br/>User's Rust Program"] --> B["Host API Layer<br/>ascend_rs + ascend_sys<br/>Safe RAII wrappers"]
    A --> C["Device Runtime Layer<br/>ascend_std + rustc_codegen_mlir<br/>#![no_core] runtime | MLIR codegen backend"]
    B --> D["CANN SDK · Native C/C++ Libraries<br/>ACL Runtime · AscendCL · bisheng · bishengir · HIVM"]
    C --> D

The Host API layer uses bindgen to auto-generate FFI bindings, then builds safe Rust wrappers on top: Acl, Device, AclContext, AclStream, DeviceBuffer<T>, etc., using Rust’s lifetime system to enforce correct resource ordering.

The Device Runtime layer is more innovative: it contains a custom rustc codegen backend that compiles Rust code to MLIR. From there, a mlir_to_cpp translation pass converts the MLIR into C++ source with AscendC API calls, which is then compiled by bisheng (the CANN C++ compiler) into NPU-executable binaries for both Ascend 910B and 310P targets. This MLIR-to-C++ path is what enables the full AscendC feature set — DMA operations, vector intrinsics, pipe barriers, and TPipe infrastructure. The translator recognizes ascend_* function calls in MLIR and emits the corresponding AscendC vector operations.



2. Hello World: Your First NPU Program

Installation

ascend-rs is distributed as a self-contained package with a pre-built compiler backend and Rust source crates for the host and kernel APIs.

Prerequisites:

  • CANN toolkit (8.x or 9.x) installed on the target machine
  • Rust nightly toolchain (auto-installed by rustup from the included rust-toolchain.toml)

Setup:

# 1. Extract the distribution
tar xzf ascend-rs-0.1.1-$(uname -m).tar.gz
cd ascend-rs-0.1.1

# 2. Source the CANN environment
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash

# 3. Make the compiler backend discoverable
export LD_LIBRARY_PATH="$(pwd)/lib:$LD_LIBRARY_PATH"

# 4. Verify (compiles a kernel and runs it on the NPU)
bash test.sh --run

What’s in the package:

ascend-rs-0.1.1/
├── lib/librustc_codegen_mlir.so   # Compiler backend (Rust → NPU binary)
├── crates/
│   ├── ascend_rs/                 # Host API: device, stream, memory, kernel launch
│   ├── ascend_sys/                # FFI bindings (auto-generated from CANN headers)
│   ├── ascend_std/                # Kernel runtime: buffer ops, vector intrinsics
│   ├── ascend_std_macros/         # #[aiv_kernel] attribute macro
│   ├── ascend_rs_builder/         # Build-time kernel compiler (KernelBuilder)
│   └── ascend_rs_builder_config/  # CANN path detection
├── examples/vec_add/              # Working starter project
├── test.sh                        # Smoke test
└── rust-toolchain.toml            # Pinned nightly version

The compiler backend (librustc_codegen_mlir.so) is loaded by rustc during kernel compilation. It translates Rust kernel code through MLIR to AscendC C++, then invokes the CANN bisheng compiler to produce NPU binaries. Users interact with it indirectly through KernelBuilder in their build.rs scripts.


Let’s start with the simplest possible example. This Hello World demonstrates the basics of the ascend-rs host API — safely initializing the NPU, creating execution contexts, and launching kernels from Rust.

Kernel Code (C++)

At this stage, Hello World uses a C++ kernel, which is the native approach for the CANN SDK:

// hello_world.cpp
#include "kernel_operator.h"

extern "C" __global__ __aicore__ void hello_world() {
    AscendC::printf("Hello World!!!\n");
}

extern "C" void hello_world_do(uint32_t blockDim, void *stream) {
    hello_world<<<blockDim, nullptr, stream>>>();
}

Here, __global__ marks the function as a host-callable entry point, and __aicore__ indicates it runs on the Ascend AI Core. The <<<...>>> syntax, similar to CUDA, specifies parallelism and execution stream.

Host Code (Rust)

The host code demonstrates ascend-rs’s most important design principle — RAII resource management and lifetime safety:

use ascend_rs::prelude::*;
use std::error::Error;

// Declare FFI interface to the C++ kernel
unsafe extern "C" {
    fn hello_world_do(dim: u32, stream: *mut std::ffi::c_void);
}

fn main() -> Result<(), Box<dyn Error>> {
    // Step 1: Initialize ACL runtime
    let acl = Acl::new()?;

    // Step 2: Select and initialize device
    let device = Device::new(&acl)?;

    // Step 3: Create execution context and stream
    let context = AclContext::new(&device)?;
    let stream = AclStream::new(&context)?;

    // Step 4: Launch kernel (8 parallel blocks)
    unsafe {
        hello_world_do(8, stream.to_raw());
    }

    // Step 5: Synchronize and wait for kernel completion
    stream.synchronize()?;

    // Step 6: All resources automatically freed (RAII)
    // Drop order: stream → context → device → acl
    Ok(())
}

Key Design: Lifetime Chain

Notice the type signatures in this code:

Acl                    → Lifetime root
  Device<'acl>         → Must drop before Acl
    AclContext<'d>     → Must drop before Device
      AclStream<'c>    → Must drop before AclContext

If you try to use these resources in the wrong order, the code simply won’t compile. This is the power of Rust’s type system — guaranteeing correct resource management at compile time, whereas C++ can only rely on programmer discipline.
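The mechanism can be shown with a toy model (not the real ascend-rs types): a `Stream` borrows its `Context`, so the borrow checker rejects any program that would drop the `Context` while a derived `Stream` is still alive.

```rust
struct Context {
    id: u32,
}

struct Stream<'ctx> {
    ctx: &'ctx Context,
}

impl Context {
    fn new(id: u32) -> Context {
        Context { id }
    }
    // The returned Stream borrows `self`, chaining the lifetimes.
    fn stream(&self) -> Stream<'_> {
        Stream { ctx: self }
    }
}

impl Stream<'_> {
    fn context_id(&self) -> u32 {
        self.ctx.id
    }
}

fn demo() -> u32 {
    let ctx = Context::new(7);
    let stream = ctx.stream();
    // drop(ctx); // ← does not compile: `ctx` is still borrowed by `stream`
    stream.context_id()
}
```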

Comparison: Pitfalls in C++

The equivalent C++ code requires manual lifecycle management for every resource:

// C++ version: every resource requires manual cleanup
aclInit(nullptr);
aclrtSetDevice(0);
aclrtContext ctx;
aclrtCreateContext(&ctx, 0);
aclrtStream stream;
aclrtCreateStream(&stream);

hello_world_do(8, stream);
aclrtSynchronizeStream(stream);

// Must manually free in correct order, otherwise undefined behavior
aclrtDestroyStream(stream);
aclrtDestroyContext(ctx);
aclrtResetDevice(0);
aclFinalize();

If any step throws an exception or returns early, the subsequent cleanup code is skipped. In the Rust version, the Drop trait guarantees resources are always freed correctly, regardless of control flow changes.



3. Going Deeper: Writing NPU Kernels in Rust

Hello World demonstrated host-side safety. But ascend-rs has a bigger vision: using Rust on the device side too. This means writing NPU kernel code in Rust, not C++.

Let’s walk through a complete vector multiplication (vec_mul) example to demonstrate this.

3.1 The Rust Kernel

This is the Rust code that runs on the NPU:

// kernels/src/lib.rs

// Key: #![no_core] indicates a completely bare-metal environment
#![feature(no_core)]
#![no_std]
#![no_core]

/// Element-wise vector multiplication: z[i] = x[i] * y[i]
///
/// #[ascend_std::aiv_kernel] marks this function as an NPU kernel entry point
#[ascend_std::aiv_kernel]
pub unsafe fn mul(x: *const u16, y: *const u16, z: *mut u16) {
    unsafe {
        // Total elements = 16, divide work evenly across parallel blocks
        let block_size = 16usize / ascend_std::get_block_num();
        let start = ascend_std::get_block_idx() * block_size;
        let mut i = start;
        loop {
            // Multiply element-wise and write to output
            *z.wrapping_add(i) = *x.wrapping_add(i) * *y.wrapping_add(i);

            i = i + 1;
            if i == block_size + start {
                break;
            }
        }
    }
}

Several things worth noting about this code:

#![no_core] environment: The NPU has no operating system or standard library. ascend_std provides a minimal reimplementation of Rust’s core types (Copy, Clone, Add, Mul, etc.) so that Rust code can compile in a bare-metal environment.

#[ascend_std::aiv_kernel]: This attribute macro marks the function as an AIV (AI Vector core) kernel entry point. It expands to #[unsafe(no_mangle)] (so the host can look up the symbol by name) and #[ascend::aiv_kernel] (so the MLIR codegen backend recognizes it and adds the hacc.entry attribute).

NPU parallel model: Similar to CUDA’s block/thread model, the Ascend NPU uses blocks and sub-blocks to organize parallel computation. get_block_idx() and get_block_num() provide execution context so the kernel knows which data slice to process.
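The partitioning arithmetic in the kernel can be sketched on the host side: each block derives its own `[start, end)` slice from its index and the block count, exactly as the kernel does with `get_block_idx()` / `get_block_num()`.

```rust
// Even work split across parallel blocks (same math as the vec_mul kernel).
fn block_range(total: usize, block_num: usize, block_idx: usize) -> (usize, usize) {
    let block_size = total / block_num; // assumes total divides evenly, as in the kernel
    let start = block_idx * block_size;
    (start, start + block_size)
}
```

With 16 elements and 2 blocks, block 0 processes indices 0..8 and block 1 processes 8..16 — together covering the whole input with no overlap.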

3.2 The Host Code

The host code handles data transfer, kernel loading, and result verification:

// src/main.rs
use ascend_rs::prelude::*;

fn main() -> anyhow::Result<()> {
    // ── Phase 1: Initialization ──
    let acl = Acl::new()?;
    let device = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream = AclStream::new(&context)?;

    // ── Phase 2: Data preparation ──
    let x_host = common::read_buf_from_file::<u16>("test_data/input_x.bin");
    let y_host = common::read_buf_from_file::<u16>("test_data/input_y.bin");

    // Allocate device memory with HugeFirst policy (prefer huge pages for TLB efficiency)
    let mut x_device = DeviceBuffer::from_slice_with_policy(
        x_host.as_slice(), AclrtMemMallocPolicy::HugeFirst
    )?;
    let mut y_device = DeviceBuffer::from_slice_with_policy(
        y_host.as_slice(), AclrtMemMallocPolicy::HugeFirst
    )?;
    let mut z_device = unsafe {
        DeviceBuffer::<u16>::uninitialized_with_policy(
            x_host.len(), AclrtMemMallocPolicy::HugeFirst
        )?
    };

    // ── Phase 3: Kernel execution ──
    unsafe {
        // KernelLoader loads NPU binary from build.rs compilation artifacts
        let kernel_loader = KernelLoader::new()?;

        // Get kernel handle by symbol name "mul"
        let kernel = kernel_loader.get_kernel("mul")?;

        // Launch kernel with 2 parallel blocks
        let block_dim: u32 = 2;
        let mut args = [
            x_device.as_mut_ptr() as *mut _,
            y_device.as_mut_ptr() as *mut _,
            z_device.as_mut_ptr() as *mut _,
        ];
        kernel.launch(block_dim, &stream, &mut args)?;
    }

    // ── Phase 4: Synchronize and verify ──
    stream.synchronize()?;
    let res = z_device.to_host()?;

    for (idx, elem) in res.iter().enumerate() {
        let expected = x_host[idx].wrapping_mul(y_host[idx]);
        assert_eq!(*elem, expected);
    }

    Ok(())
}

3.3 The Build System

build.rs bridges the Rust toolchain and the CANN compiler:

// build.rs
use ascend_rs_builder::KernelBuilder;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("cargo:rerun-if-changed=kernels");
    ascend_rs_builder::add_ascend_link_args()?;

    let out_path = PathBuf::from(std::env::var("OUT_DIR").unwrap());
    let kernel = out_path.join("kernel.o");

    // Detects "kernels" is a directory → triggers Rust kernel compilation pipeline
    KernelBuilder::new("kernels").copy_to(&kernel).build()?;
    Ok(())
}

When KernelBuilder detects the input is a directory (containing Cargo.toml), it:

  1. Runs cargo build targeting nvptx64-nvidia-cuda
  2. Specifies -Zcodegen-backend=rustc_codegen_mlir for the custom codegen backend
  3. The backend translates Rust MIR to MLIR
  4. The mlir_to_cpp pass converts MLIR into C++ source with AscendC API calls (DMA, vector ops, pipe barriers)
  5. Invokes bisheng (CANN C++ compiler) to compile the generated C++ into NPU binary (.acl.o)

Steps 4–5 are key: although CANN includes bishengir-compile (an MLIR-native compiler for 910B), the production pipeline uses the mlir_to_cpp path for all targets (both 310P and 910B). This C++ codegen approach provides access to the full AscendC feature set — DMA operations via DataCopy, TPipe infrastructure, and vector intrinsics. When the Rust kernel calls functions like ascend_reduce_max_f32, the mlir_to_cpp pass recognizes these in the MLIR and emits the corresponding AscendC vector operations (ReduceMax, Exp, etc.). All 522 tests passing on 910B3 hardware use this path.



4. A More Realistic Example: Softmax

Vector multiplication demonstrates the basics, but real neural network workloads require math functions like exp(), log(), and sqrt(). The softmax function — used in attention layers, classification heads, and probability normalization — is a perfect example:

$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

4.1 Math Intrinsics in ascend_std

ascend-rs exposes hardware math operations as Rust methods on primitive types. Under the hood, f32::exp() maps to the expf32 compiler intrinsic, which the MLIR codegen backend lowers to llvm.intr.exp — ultimately executing as a native NPU math instruction.

// In ascend_std: these methods are available on f32/f64 in kernel code
let y = x.exp();   // expf32 → llvm.intr.exp
let y = x.ln();    // logf32 → llvm.intr.log
let y = x.sqrt();  // sqrtf32 → llvm.intr.sqrt

4.2 The Softmax Kernel

Here is a complete softmax kernel written in Rust for the Ascend NPU:

#![feature(no_core)]
#![no_std]
#![no_core]

#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len as usize;

        // Step 1: Find max value for numerical stability
        let mut max_val = *input;
        let mut i = 1usize;
        loop {
            if i >= n { break; }
            let val = *input.wrapping_add(i);
            if val > max_val { max_val = val; }
            i = i + 1;
        }

        // Step 2: Compute exp(x_i - max) and accumulate sum
        let mut sum: f32 = 0.0;
        i = 0;
        loop {
            if i >= n { break; }
            let exp_val = (*input.wrapping_add(i) - max_val).exp();
            *output.wrapping_add(i) = exp_val;
            sum = sum + exp_val;
            i = i + 1;
        }

        // Step 3: Normalize
        i = 0;
        loop {
            if i >= n { break; }
            *output.wrapping_add(i) = *output.wrapping_add(i) / sum;
            i = i + 1;
        }
    }
}

The key line is (*input.wrapping_add(i) - max_val).exp() — this calls f32::exp(), which compiles through the MLIR backend into a native NPU exponential instruction. The subtraction of max_val before exponentiation is the standard numerical stability trick that prevents overflow.

This demonstrates that ascend-rs kernel code isn’t limited to simple arithmetic — it can express the same algorithms you’d write in C++ AscendC, with Rust’s safety guarantees.
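For reference, here is a host-side version of the same numerically stable algorithm — the kind of CPU baseline the correctness checks later in this article compare kernel output against:

```rust
// CPU reference softmax with the same max-subtraction stability trick.
fn softmax_ref(input: &[f32]) -> Vec<f32> {
    let max = input.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = input.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

Note that the three loops of the kernel (max, exp+sum, normalize) collapse into three iterator passes here; the math is identical.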

4.3 Performance: Rust vs C++ on Real Hardware

How does a Rust kernel perform compared to hand-written C++ on actual NPU hardware? We benchmarked the softmax kernel on an Ascend 310P NPU with four implementations:

  • C++ naive (scalar) — A hand-written C++ kernel using scalar loops with GetValue/SetValue accessors
  • C++ optimized (vector) — An expert-written C++ kernel using AscendC vector intrinsics (ReduceMax, Exp, Muls)
  • Rust scalar — The Rust kernel above, compiled through the MLIR-to-C++ codegen pipeline
  • Rust vector — A Rust kernel using ascend-rs vector intrinsics (ascend_reduce_max_f32, ascend_exp_f32, ascend_muls_f32), compiled through the same pipeline

Each kernel processes f32 input arrays, with 1 warmup iteration and 10 timed iterations per configuration. All results are verified against a CPU reference for correctness.

| Size | C++ Naive (ms) | C++ Opt (ms) | Rust Scalar (ms) | Rust Vector (ms) | Scalar vs Naive | Vector vs Opt |
|---|---|---|---|---|---|---|
| 256 | 0.100 | 0.078 | 0.099 | 0.077 | 0.99x | 0.99x |
| 1,024 | 0.191 | 0.077 | 0.202 | 0.076 | 1.06x | 0.99x |
| 4,096 | 0.568 | 0.079 | 0.607 | 0.079 | 1.07x | 1.00x |
| 16,384 | 2.073 | 0.089 | 2.221 | 0.087 | 1.07x | 0.98x |

Key findings:

  1. Rust vector matches C++ optimized performance. The Rust vectorized kernel, using ascend_std vector intrinsics that map to AscendC operations, performs within 1-2% of the hand-optimized C++ kernel across all sizes. At 16,384 elements, the Rust vector kernel (0.087ms) is actually slightly faster than C++ optimized (0.089ms). This means there is zero performance penalty for writing vectorized NPU kernels in Rust instead of C++.

  2. Vector intrinsics provide massive speedups. Both vectorized kernels are 1.3x faster at small sizes and up to 25x faster at 16,384 elements compared to their scalar counterparts. The vector pipeline processes 256 bits (8 floats) per cycle vs one element per cycle for scalar code.

  3. Rust scalar is within 5-7% of C++ scalar. The scalar codegen path also produces competitive code, with the small overhead coming from different unified buffer (UB) access patterns (direct pointer arithmetic vs GetValue/SetValue accessor methods).

  4. All implementations are numerically correct. Every kernel-size combination produces results matching the CPU reference (max error < 1e-8, output sum ≈ 1.0). The vector implementations achieve even lower error than scalar (max_err ~1e-10 vs ~1e-8) due to hardware-optimized math operations.
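The comparison behind these error figures reduces to an element-wise check against the CPU reference; a sketch of that check (the actual verification harness is not shown in this article):

```rust
// Worst-case absolute deviation between device output and CPU reference.
fn max_abs_err(device: &[f32], reference: &[f32]) -> f32 {
    device
        .iter()
        .zip(reference)
        .map(|(d, r)| (d - r).abs())
        .fold(0.0_f32, f32::max)
}
```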

Here is what the Rust vectorized softmax kernel looks like — it reads almost identically to the C++ version:

#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
    unsafe {
        let n = *len_buf;
        let in_buf  = ascend_std::ascend_buf_alloc(n);
        let out_buf = ascend_std::ascend_buf_alloc(n);
        let work    = ascend_std::ascend_buf_alloc(n);
        let rwork   = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(in_buf, input, n);
        ascend_std::ascend_pipe_barrier();

        let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
        ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
        ascend_std::ascend_exp_f32(out_buf, out_buf, n);
        let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
        ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, out_buf, n);
    }
}

The ascend_buf_alloc / ascend_buf_load_f32 / ascend_reduce_max_f32 calls are extern "C" stubs in ascend_std that the MLIR codegen backend recognizes and translates to AscendC API calls (TBuf, DataCopy, ReduceMax, etc.) during C++ code generation. This gives Rust kernels direct access to the NPU’s vector pipeline with zero overhead.

4.4 Beyond Softmax: Activation Function Benchmarks

To validate the breadth of the vector intrinsic API, we benchmarked three additional activation functions — Relu, Sigmoid, and Tanh — each composed from the same primitive operations. Unlike softmax, these activations don’t have dedicated AscendC builtins; instead they are constructed from composable vector primitives:

  • Relu(x) = max(x, 0) → Maxs
  • Sigmoid(x) = 1 / (1 + exp(-x)) → Muls → Exp → Adds → Reciprocal
  • Tanh(x) = 2 · sigmoid(2x) - 1 → Muls → Exp → Adds → Reciprocal → Muls → Adds
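On the CPU, the same compositions look like this (scalar sketch; the kernels apply the identical sequence element-wise over vectors):

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp()) // Muls(-1) → Exp → Adds(1) → Reciprocal
}

fn tanh_via_sigmoid(x: f32) -> f32 {
    2.0 * sigmoid(2.0 * x) - 1.0 // extra Muls(2) and Adds(-1) around sigmoid
}

fn relu(x: f32) -> f32 {
    x.max(0.0) // Maxs(0)
}
```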

For each function, we compare a C++ implementation (TQue pipeline) against the equivalent Rust-style code (TBuf pipeline matching the mlir_to_cpp output):

| Size | Relu C++ (ms) | Relu Rust (ms) | Sigmoid C++ (ms) | Sigmoid Rust (ms) | Tanh C++ (ms) | Tanh Rust (ms) |
|---|---|---|---|---|---|---|
| 256 | 0.078 | 0.075 | 0.075 | 0.075 | 0.075 | 0.077 |
| 1,024 | 0.075 | 0.076 | 0.075 | 0.074 | 0.075 | 0.076 |
| 4,096 | 0.075 | 0.076 | 0.077 | 0.077 | 0.076 | 0.078 |
| 16,384 | 0.083 | 0.083 | 0.086 | 0.086 | 0.085 | 0.086 |

All six kernels perform identically within measurement noise. Relu achieves exact correctness (max_err = 0), while Sigmoid and Tanh achieve max_err < 3e-3 at sizes ≥ 1024. The size=256 correctness issue affects both C++ and Rust equally — it’s an AscendC hardware-level precision artifact at small vector sizes, not a codegen issue.

This confirms that the Rust vector intrinsic API generalizes beyond softmax. For the activation functions tested here — each a composition of AscendC vector primitives — Rust and C++ produce identical performance. We expect this to hold for any kernel composed purely from vector intrinsics, since the codegen maps each Rust intrinsic call 1:1 to the same AscendC C++ call. Cube engine operations (matmul via Mmad) and multi-level buffer hierarchies (L1/L0A/L0B/L0C) are supported at the API level but have not yet been hardware-verified through the full pipeline.

4.5 Formal Equivalence Verification: AscendC vs AscendRS

Performance parity is compelling, but the strongest argument for the Rust codegen pipeline is bitwise equivalence — proving that Rust-generated kernels produce exactly the same numerical results as hand-written AscendC C++ kernels on real NPU hardware.

We selected three representative kernels that cover the most common neural network operation patterns:

  • ReLU — single vector op: output[i] = max(input[i], 0) → ascend_maxs_f32
  • Sigmoid — chained vector ops: output[i] = 1/(1 + exp(-input[i])) → Muls → Exp → Adds → Reciprocal
  • Vec Add — binary vector op: z[i] = x[i] + y[i] → ascend_add_f32

For each kernel, we compiled two implementations:

  1. AscendC original — idiomatic C++ using the TQue pipeline (EnQue/DeQue implicit synchronization), as a 910B production engineer would write it
  2. AscendRS equivalent — C++ generated from Rust source via the mlir_to_cpp pipeline (TBuf + explicit pipe_barrier(PIPE_ALL))

Both were run on the 310P NPU with identical inputs (256 f32 elements, deterministic PRNG) and compared at three levels:

| Test | C++ vs CPU | RS vs CPU | C++ vs RS |
|---|---|---|---|
| ReLU | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |
| Sigmoid | PASS (err=2.4e-3) | PASS (err=2.4e-3) | PASS (err=0.00) |
| Vec Add | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |

The C++ vs RS column shows bitwise identical output (max error = 0.0) for all three kernels. The NPU produces exactly the same bits whether the kernel was written in C++ or Rust. The small sigmoid CPU difference (2.4e-3) is the NPU’s Exp() vector unit precision vs x86 expf() — it affects both implementations equally and is not a codegen issue.
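Bitwise equivalence is a stricter criterion than any epsilon tolerance; it amounts to comparing raw IEEE-754 bit patterns. A sketch of that comparison (not the actual harness):

```rust
// Bit-exact comparison: even 0.0 vs -0.0 (equal under ==) fails this check,
// as do NaNs with different payloads.
fn bitwise_equal(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```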

Here is the Rust sigmoid kernel — four lines of vector intrinsic calls that produce identical NPU output to the 40-line AscendC C++ class:

#[ascend_std::aiv_kernel]
pub unsafe fn sigmoid(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();

        ascend_std::ascend_muls_f32(buf_out, buf_in, -1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_exp_f32(buf_out, buf_out, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_adds_f32(buf_out, buf_out, 1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_reciprocal_f32(buf_out, buf_out, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}

A notable discovery during this work: in-place chained vector operations on the 310P require explicit pipe_barrier(PIPE_ALL) between each step. Without barriers between Muls→Exp→Adds→Reciprocal on the same buffer, the next operation reads stale data. This is a hardware synchronization requirement that the Rust codegen pipeline now handles correctly — and the equivalence test serves as a regression test for this behavior.


5. Scaling Up: 505 Kernels Across All MultiKernelBench Categories

Beyond individual benchmarks and equivalence tests, we systematically expanded ascend-rs kernel coverage to achieve complete 1:1 coverage of all 300 MultiKernelBench reference kernels across 15 categories (activation, architecture, attention, broadcast, convolution, fuse, index, loss, math, matmul, normalization, optimizer, pooling, reduce, resize).

ascend-rs now contains 505 Rust NPU kernels, all compilable through the MLIR codegen backend. These break down into tiers of verification:

  • 16 deployable kernels — compiled through the full Rust→MLIR→C++→bisheng pipeline, deployed and executed on NPU hardware
  • 413 tests passing NPU correctness verification on Ascend 910B3 — verified against CPU reference on real hardware with 0 failures and 0 crashes; bitwise-identical output to hand-written AscendC C++ confirmed for representative kernels (Section 4.5). This includes 37 matmul tests executed via CANN’s aclnn operator API (aclnnMm, aclnnAdd, aclnnAddmm, aclnnRelu, aclnnMul, aclnnReduceSum), as well as all convolution, pooling, resize, index, and optimizer kernels
  • 486 compiletest kernels — verified to compile through the MLIR backend and pass CPU-level correctness tests

Cube-engine matmul kernels — previously blocked by TPipe L1/CBUF queue allocation issues on mixed AIV/AIC binaries — now execute correctly via CANN’s built-in operator API. The two-phase aclnn operator pattern (GetWorkspaceSize + Execute) dynamically loaded from libopapi.so bypasses custom kernel compilation entirely, leveraging the cube engine’s optimized built-in operators. Composed operator chains (e.g., aclnnMm + aclnnRelu + aclnnAdd for ResNet residual blocks) enable fused matmul variants that would otherwise require custom cube kernel development.

| Category | Kernels | Approach |
|---|---|---|
| Activation (16) | relu, sigmoid, gelu, tanh, softmax, elu, selu, swish, mish, softplus, softsign, hardsigmoid, hardswish, leaky_relu, log_softmax, gelu_tanh | Direct vector intrinsics + kernel_ops composites |
| Architecture (41) | AlexNet/VGG/ResNet FC layers, DenseNet block, MobileNet/EfficientNet, ViT/Swin MLP, MinGPT, LSTM gates/cell, GRU gates, Mamba SSM | Matmul + activation + norm compositions |
| Attention (15) | scaled dot-product, causal, cross, multi-query, group-query, KV-cached, cross-modal, linear, sparse, windowed-causal, SwiGLU, GeGLU, masked fill | Scale + mask + softmax patterns |
| Broadcast (8) | add_bias, elementwise mul/div/sub/max/min, clamp, square | Binary vector intrinsics |
| Convolution (34) | standard conv2d, depthwise conv2d, transposed conv2d variants | Scalar nested-loop (no cube engine) |
| Fuse (86) | matmul+gelu, gemm+relu+divide, norm+activation, multi-op chains (3-6 ops fused) | Chained vector intrinsics with pipe barriers |
| Index (12) | gather, scatter, scatter_add, index_select, index_copy, index_add, embedding, masked_fill, inplace_update, take_along_dim | Scalar nested-loop with bounds-checked indexing |
| Loss (6) | MSE, Huber, hinge, cosine similarity, cross-entropy, KL divergence | Reduction + arithmetic |
| Math (5) | cumsum (3 variants), cumprod, matrix-scalar multiply | Scalar loops + vector ops |
| Matmul (17) | standard, batched, symmetric, bias, scaled, GEMM, wide, accumulate, diagonal-scale, outer product | Cube engine (Mmad FFI) |
| Normalization (9) | layernorm, rmsnorm, batch/group/instance norm, L1/L2/Frobenius norm | Reduction + normalize patterns |
| Optimizer (6) | SGD, SGD+momentum, Adagrad, RMSprop, Adam, + extended | In-place buffer arithmetic |
| Pooling (6) | global avg/max/min pool, fused pool+sigmoid, LP pool | Reduction-based |
| Reduce (5) | max, min, sum, mean, product | Hardware reduction intrinsics |
| Resize (5) | nearest, lerp, bicubic weight, weighted sum, trilinear | Interpolation arithmetic |
| Tiled (16) | 256-element tiled variants of activations and ops | Loop + tile-size buffer allocation |
| Multi-block (16) | AICore block-parallel variants | get_block_idx() work distribution |

To support this breadth, we added 17 composite operations to kernel_ops.rs — higher-level building blocks like elu_f32, mish_f32, rms_norm_f32, mse_loss_f32, and cosine_similarity_f32 — each built from primitive vector intrinsics with correct pipe barrier placement.
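As an illustration of what one of these composites computes, here is a host-side sketch of the RMS-norm math (the kernel_ops version builds the same computation from vector intrinsics with pipe barriers; the exact eps placement in the real composite is an assumption):

```rust
// Reference RMS normalization: scale x by the reciprocal of its root mean square.
fn rms_norm_ref(x: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let rms = (mean_sq + eps).sqrt();
    x.iter().map(|v| v / rms).collect()
}
```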

The convolution and index/gather/scatter categories are implemented using a scalar nested-loop pattern, achieving complete MultiKernelBench coverage at the API level. CPU correctness tests (cargo test -p kernel_correctness) validate numerical accuracy for 80 representative kernels across all categories. The remaining compiletests verify successful compilation through the MLIR backend without CPU-level numerical checks.

Progress report — verification status as of the current codebase (verified via count_kernels.sh and hardware test logs):

TierCountDescription
Compiletests passed486Compile through MLIR backend + CPU-level correctness (cargo test -p compiletest)
910B3 correctness verified413Pass NPU correctness harness on Ascend 910B3 (0 fail, 0 crash); includes 37 matmul via aclnn, all conv/pooling/resize/index/optimizer
Performance parity with AscendC4≤2% overhead vs hand-optimized C++ (Section 4.3–4.4): softmax, relu, sigmoid, tanh
Deployable (full pipeline)16Compiled through Rust→MLIR→C++→bisheng and executed on NPU hardware
Total kernels505All compilable through the MLIR codegen backend

The 413 passing NPU correctness tests on Ascend 910B3 cover all kernel categories: vector-intrinsic kernels (activations, reductions, fused chains, multi-block), cube-engine matmul (via aclnn operator composition), convolution, pooling, resize, index operations, and optimizers — with 0 failures and 0 crashes.



6. Memory Safety Case Studies: AscendC C++ vs ascend-rs

With 16 kernels deployed on NPU hardware, 413 passing NPU correctness tests on Ascend 910B3, and 505 total kernels compiling through the MLIR backend, ascend-rs’s value proposition extends beyond performance parity — the key advantage is memory safety. Below we present 6 paired case studies where each AscendC C++ kernel contains a real, exploitable memory safety vulnerability that the equivalent Rust ascend-rs kernel structurally prevents.

These aren’t contrived toy examples. Each vulnerability class is a real pattern that occurs in production AscendC C++ kernel development:

| Case | C++ Root Cause | Rust Prevention |
|---|---|---|
| 1. Type Confusion | GM_ADDR erases all type info at entry | Function signature encodes element type |
| 2. Buffer Overflow | GetValue(i)/SetValue(i,v) unchecked | Buffer-ID API with explicit count |
| 3. Use-After-Free | FreeTensor() then stale LocalTensor access | No manual free in API |
| 4. Missing Sync | Forgetting pipe_barrier() between DMA and compute | kernel_ops composites include barriers |
| 5. Double Free | FreeTensor() called twice | No free operation exists |
| 6. Integer Overflow | Silent u32 wrap in offset calculation | wrapping_mul makes overflow explicit |

6.1 Type Confusion via GM_ADDR Type Erasure

AscendC kernel entry points receive all tensor pointers as GM_ADDR (= uint8_t*). The kernel must manually cast to the correct element type. If the host passes f16 data but the kernel casts to float*, each element reads 4 bytes instead of 2 — producing garbage values with no warning. This occurs whenever a kernel is reused for a different dtype without updating the cast, or when a host wrapper passes the wrong tensor format.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelSoftmaxConfused {
public:
    __aicore__ inline void Init(GM_ADDR input, GM_ADDR output, GM_ADDR len_buf) {
        uint32_t n = *((__gm__ uint32_t *)len_buf);

        // BUG: Host passed half-precision (f16) data, but we cast to float.
        // Each "float" element reads 4 bytes instead of 2, so we get:
        //   - Half the expected number of meaningful values
        //   - Each value is garbage (two f16 bit patterns reinterpreted as one float)
        // The compiler cannot catch this because GM_ADDR is just uint8_t*.
        inputGm.SetGlobalBuffer((__gm__ float *)input, n);
        outputGm.SetGlobalBuffer((__gm__ float *)output, n);
        // ...
    }

    __aicore__ inline void Compute(int32_t len) {
        AscendC::LocalTensor<float> xLocal = inQueue.DeQue<float>();
        AscendC::LocalTensor<float> yLocal = outQueue.AllocTensor<float>();
        // All computation operates on garbage values due to the type confusion.
        // Silently wrong output — no crash, no error.
        AscendC::Exp(yLocal, xLocal, len);
        outQueue.EnQue<float>(yLocal);
        inQueue.FreeTensor(xLocal);
    }
    // ...
};

// The entry point uses GM_ADDR (= uint8_t*) for all tensor arguments.
// The caller can pass any data type — no type checking at this boundary.
extern "C" __global__ __aicore__ void softmax_confused(
        GM_ADDR input, GM_ADDR output, GM_ADDR len_buf) {
    KernelSoftmaxConfused op;
    op.Init(input, output, len_buf);
    op.Process();
}

Rust — Safe:

#![feature(no_core)]
#![no_std]
#![no_core]

/// The signature `input: *const f32` means the host MUST pass an f32 tensor.
/// If the host has f16 data (*const u16), calling this function is a type error:
///     softmax(f16_ptr, ...)  // ERROR: expected *const f32, found *const u16
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        let buf_work = ascend_std::ascend_buf_alloc(n);

        // Load f32 data — the _f32 suffix matches the pointer type.
        // There is no way to accidentally load f16 data through an f32 API.
        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();

        // softmax_f32 expects f32 buffers — type consistency maintained
        // throughout the entire pipeline without manual casts.
        ascend_std::kernel_ops::softmax_f32(buf_out, buf_in, buf_work, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}

Key insight: In C++, GM_ADDR is a type-erased uint8_t* that accepts any data format. In Rust, the function signature *const f32 is part of the type system — the compiler rejects mismatched types at compile time.
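The byte-level misread is easy to reproduce on the host. Below is a minimal std-Rust sketch (not the ascend-rs API) that reinterprets two adjacent f16 bit patterns as one f32, the same access pattern the wrong cast produces on device (assuming a little-endian host):

```rust
/// Reinterpret a buffer of f16 bit patterns as f32 values, mimicking the
/// misread that casting GM_ADDR to the wrong element type causes on device.
/// (Host-side illustration in plain std Rust, not the ascend-rs API.)
fn reinterpret_f16_pairs_as_f32(half_bits: &[u16]) -> Vec<f32> {
    half_bits
        .chunks_exact(2) // two f16 values occupy the 4 bytes of one "f32"
        .map(|pair| {
            // little-endian: the first f16 lands in the low half of the f32
            let bits = (pair[0] as u32) | ((pair[1] as u32) << 16);
            f32::from_bits(bits)
        })
        .collect()
}

fn main() {
    // f16 bit patterns for 1.0 (0x3C00) and 2.0 (0x4000)
    let halves = [0x3C00u16, 0x4000u16];
    let wrong = reinterpret_f16_pairs_as_f32(&halves);
    // Half the expected element count, and the value is garbage:
    assert_eq!(wrong.len(), 1);
    assert!(wrong[0] != 1.0 && wrong[0] != 2.0);
    println!("misread value = {}", wrong[0]);
}
```

Two sensible f16 inputs (1.0 and 2.0) collapse into one meaningless float of roughly 2.0037, and the element count halves: exactly the silent corruption described above.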

6.2 Buffer Overflow via Unchecked Tensor Index

AscendC’s GetValue(i) and SetValue(i, v) perform no bounds checking. If the loop bound is wrong — an off-by-one error, using the wrong length variable, or confusing input/output sizes — the kernel reads or writes out of bounds on local SRAM. This is especially dangerous because local SRAM is shared across all tensor allocations within a tile — an OOB write silently overwrites a neighboring tensor’s data.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelScalarSoftmax {
    // ...
    __aicore__ inline void Compute(int32_t len, int32_t alignedLen) {
        AscendC::LocalTensor<float> xLocal = inQueue.DeQue<float>();
        AscendC::LocalTensor<float> yLocal = outQueue.AllocTensor<float>();

        // Step 1: Find max (scalar loop)
        float maxVal = xLocal.GetValue(0);
        for (int32_t i = 1; i < len; i++) {
            float v = xLocal.GetValue(i);
            if (v > maxVal) maxVal = v;
        }

        // Step 2: Shift by the row max and accumulate the sum
        // (the exp step is elided to keep this scalar-indexing example short)
        float sum = 0.0f;
        for (int32_t i = 0; i < len; i++) {
            float v = xLocal.GetValue(i) - maxVal;
            yLocal.SetValue(i, v);
            sum += v;
        }

        // Step 3: Normalize
        float invSum = 1.0f / sum;

        // BUG: Off-by-one — loop condition uses <= instead of <.
        // When i == len, SetValue writes one element past the allocated buffer.
        // This overwrites whatever is adjacent in SRAM (another tensor's data,
        // queue metadata, etc.) with no error or warning.
        for (int32_t i = 0; i <= len; i++) {  // should be i < len
            yLocal.SetValue(i, yLocal.GetValue(i) * invSum);  // OOB at i==len
        }

        outQueue.EnQue<float>(yLocal);
        inQueue.FreeTensor(xLocal);
    }
    // ...
};

Rust — Safe:

#![feature(no_core)]
#![no_std]
#![no_core]

/// The count `n` passed to each vector op is the same value used to allocate
/// the buffer. There is no separate loop variable that could drift out of
/// sync. No element-wise indexing means no off-by-one.
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        let buf_work = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();

        // softmax_f32 operates on the entire buffer of `n` elements.
        // There is no loop index, no GetValue(i), no SetValue(i, v).
        // The count `n` is the same value used in ascend_buf_alloc —
        // the allocation and the operation are inherently consistent.
        ascend_std::kernel_ops::softmax_f32(buf_out, buf_in, buf_work, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}

Key insight: The C++ API exposes GetValue(i)/SetValue(i, v) with no bounds check — a classic source of off-by-one errors. The Rust buffer-ID API operates on whole buffers with an explicit count parameter, eliminating element-wise indexing entirely.

6.3 Use-After-Free of LocalTensor

AscendC requires manual FreeTensor() calls to return SRAM buffers to the queue’s free pool. After FreeTensor(), the LocalTensor handle remains valid at the C++ type level — it still holds the original buffer address. Any subsequent GetValue() or SetValue() compiles and runs, reading/writing memory that may already be reallocated for a different tensor.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelVecAddUAF {
    // ...
    __aicore__ inline void Compute(int32_t len) {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();

        AscendC::Add(zLocal, xLocal, yLocal, len);

        // Return buffers to the free pool
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);

        // BUG: xLocal was freed above, but the C++ handle still compiles.
        // The SRAM region has been returned to inQueueX's free list.
        // In a multi-tile kernel, this buffer may already be reallocated
        // by the next iteration's AllocTensor() call.
        half check = xLocal.GetValue(0);  // use-after-free!

        // The stale value may cause incorrect control flow decisions
        if ((float)check > 100.0f) {
            AscendC::Muls(zLocal, zLocal, (half)0.5f, len);  // based on garbage
        }

        outQueueZ.EnQue<half>(zLocal);
    }
    // ...
};

Rust — Safe:

#![feature(no_core)]
#![no_std]
#![no_core]

/// buf_x is a typed UbBuf ID — it never becomes invalid.
/// Compare with C++ where FreeTensor(xLocal) invalidates the buffer,
/// but xLocal.GetValue(0) still compiles and accesses freed SRAM.
#[ascend_std::aiv_kernel]
pub unsafe fn vec_add(x: *const u16, y: *const u16, z: *mut u16, len: *const u32) {
    unsafe {
        let n = *len;
        let block_idx = ascend_std::get_block_idx() as u32;
        let base = block_idx * n;

        let tile_size = 256u32;
        let buf_x = ascend_std::ascend_buf_alloc(tile_size);
        let buf_y = ascend_std::ascend_buf_alloc(tile_size);
        let buf_z = ascend_std::ascend_buf_alloc(tile_size);

        let mut offset = 0u32;
        loop {
            if offset >= n { break; }
            let mut len = tile_size;
            if offset + len > n { len = n - offset; }
            let gm_off = (base + offset) as usize;

            ascend_std::ascend_buf_load_f16(buf_x, x.wrapping_add(gm_off), len);
            ascend_std::ascend_buf_load_f16(buf_y, y.wrapping_add(gm_off), len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_add_f16(buf_z, buf_x, buf_y, len);
            ascend_std::ascend_pipe_barrier();

            // No FreeTensor needed. buf_x, buf_y, buf_z are still valid.
            // The same buffer IDs are reused in the next tile iteration.
            ascend_std::ascend_buf_store_f16(z.wrapping_add(gm_off), buf_z, len);
            offset = offset + tile_size;
        }
        // Kernel returns. All buffers implicitly released.
    }
}

Key insight: C++ LocalTensor handles remain syntactically valid after FreeTensor() — the compiler cannot distinguish freed from live handles. In Rust, buffer IDs are #[repr(transparent)] newtype wrappers (UbBuf, L1Buf, L0aBuf, L0bBuf, L0cBuf) with no free operation; “using a buffer after it’s freed” is not a meaningful concept. The newtypes also prevent passing a buffer to the wrong memory level — e.g., passing an L0aBuf to a vector operation that expects UbBuf is a compile error.
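The memory-level separation can be sketched with plain newtypes; the names below mirror the article's UbBuf and L0aBuf, but the bodies are placeholders:

```rust
/// Zero-cost typed buffer IDs, one newtype per memory level.
/// #[repr(transparent)] guarantees the wrapper has the same layout as u32.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct UbBuf(u32);

#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct L0aBuf(u32);

/// A vector op only accepts UbBuf; passing an L0aBuf is a compile error.
/// (Placeholder body: a real kernel would emit the vector intrinsic.)
fn vec_add_f16(dst: UbBuf, a: UbBuf, b: UbBuf) -> (UbBuf, UbBuf, UbBuf) {
    (dst, a, b)
}

fn main() {
    let a = UbBuf(0);
    let b = UbBuf(1);
    let dst = UbBuf(2);
    vec_add_f16(dst, a, b); // OK

    let _l0a = L0aBuf(0);
    // vec_add_f16(dst, _l0a, b);
    // ^ ERROR: expected `UbBuf`, found `L0aBuf` -- the mix-up cannot compile.

    // The wrapper costs nothing at runtime:
    assert_eq!(std::mem::size_of::<UbBuf>(), std::mem::size_of::<u32>());
}
```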

6.4 Missing Synchronization Between Pipeline Stages

Ascend NPUs execute DMA (MTE2/MTE3), vector (V), and scalar (S) pipelines concurrently. A pipe_barrier() is required between a DMA load and a subsequent vector operation to ensure the data has actually arrived in local SRAM before computation begins. Forgetting this barrier is among the most common NPU bugs — the kernel compiles and runs without error, but produces silently wrong results.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelSigmoidNoSync {
    // ...
    __aicore__ inline void CopyIn(int32_t offset, int32_t len) {
        AscendC::LocalTensor<float> xLocal = inQueue.AllocTensor<float>();
        AscendC::DataCopy(xLocal, inputGm[offset], len);
        // BUG: Missing pipe_barrier() between DMA load and EnQue.
        // The EnQue only marks the tensor as "available" in the queue,
        // but does NOT ensure the DMA transfer has completed.
        // If the DMA pipeline (MTE2) is slower than the scalar pipeline (S),
        // the subsequent DeQue + vector operations will read stale SRAM data.
        inQueue.EnQue(xLocal);
    }

    __aicore__ inline void Compute(int32_t len) {
        AscendC::LocalTensor<float> xLocal = inQueue.DeQue<float>();
        AscendC::LocalTensor<float> yLocal = outQueue.AllocTensor<float>();

        // Sigmoid = 1 / (1 + exp(-x))
        // Each of these vector operations may execute before the DMA load
        // completes, reading uninitialized or stale data from SRAM.
        AscendC::Muls(yLocal, xLocal, -1.0f, len);       // -x (stale data?)
        AscendC::Exp(yLocal, yLocal, len);                // exp(-x)
        AscendC::Adds(yLocal, yLocal, 1.0f, len);         // 1 + exp(-x)
        AscendC::Reciprocal(yLocal, yLocal, len);          // 1 / (1 + exp(-x))

        outQueue.EnQue<float>(yLocal);
        inQueue.FreeTensor(xLocal);
    }
    // ...
};

Rust — Safe:

#![feature(no_core)]
#![no_std]
#![no_core]

/// The pipe_barrier() between DMA load and compute is explicit and visible.
/// The sigmoid_f32 composite includes all internal barriers between its
/// four steps (muls → exp → adds → reciprocal).
#[ascend_std::aiv_kernel]
pub unsafe fn sigmoid(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);

        // DMA load from GM to UB
        ascend_std::ascend_buf_load_f32(buf_in, input, n);

        // Explicit barrier: guarantees DMA load is complete before
        // any vector operations read from buf_in.
        ascend_std::ascend_pipe_barrier();

        // sigmoid_f32 is a composite that internally does:
        //   muls(-1) → pipe_barrier → exp → pipe_barrier →
        //   adds(1) → pipe_barrier → reciprocal
        // All internal barriers are included — no way to forget one.
        ascend_std::kernel_ops::sigmoid_f32(buf_out, buf_in, n);

        // Explicit barrier: guarantees vector compute is complete
        // before DMA store reads from buf_out.
        ascend_std::ascend_pipe_barrier();

        // DMA store from UB to GM
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}

Key insight: The C++ queue model (EnQue/DeQue) provides the illusion of synchronization but does not actually ensure DMA completion. In Rust, every barrier is explicit (ascend_pipe_barrier()), and kernel_ops composites include all internal barriers — the programmer cannot accidentally omit one within a composite operation.
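Why a composite cannot "forget" a barrier: the barrier calls live inside the composite's body, so every caller inherits them. A host-side mock where the barrier merely records itself in a trace (stage names follow the sigmoid decomposition above; everything else is illustrative):

```rust
/// Mock barrier: records itself in a trace instead of stalling a pipeline.
fn pipe_barrier(trace: &mut Vec<&'static str>) {
    trace.push("barrier");
}

/// Mock composite: the barriers between the four sigmoid stages are baked
/// into the function body, so no caller can omit one (mirrors the idea of
/// kernel_ops::sigmoid_f32; stage names are illustrative labels).
fn sigmoid_composite(trace: &mut Vec<&'static str>) {
    trace.push("muls");
    pipe_barrier(trace);
    trace.push("exp");
    pipe_barrier(trace);
    trace.push("adds");
    pipe_barrier(trace);
    trace.push("reciprocal");
}

fn main() {
    let mut trace = Vec::new();
    sigmoid_composite(&mut trace);
    // Every pair of adjacent compute stages is separated by a barrier,
    // by construction rather than by programmer discipline.
    assert_eq!(
        trace,
        ["muls", "barrier", "exp", "barrier", "adds", "barrier", "reciprocal"]
    );
}
```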

6.5 Double-Free of Tensor Buffers

Calling FreeTensor() twice on the same LocalTensor inserts the same buffer address into the queue’s free list twice. The next two AllocTensor() calls will both return the same buffer, causing two “different” tensors to alias the same SRAM region. This manifests as intermittent data corruption that is tile-count-dependent.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelVecAddDoubleFree {
    // ...
    __aicore__ inline void Compute(int32_t len) {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();

        AscendC::Add(zLocal, xLocal, yLocal, len);

        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
        outQueueZ.EnQue<half>(zLocal);

        // BUG: Copy-paste error from a refactoring — FreeTensor called again.
        // xLocal's buffer is now in inQueueX's free list TWICE.
        // On the next two tile iterations, AllocTensor will return the same
        // buffer address for two "different" tensors, causing them to alias.
        // One tile's DMA load will silently overwrite another tile's data.
        inQueueX.FreeTensor(xLocal);  // double-free! Corrupts free list
    }
    // ...
};

Rust — Safe:

#![feature(no_core)]
#![no_std]
#![no_core]

/// Buffer IDs (buf_x, buf_y, buf_z) are allocated once and reused across
/// all tile iterations. No manual lifecycle management means no double-free.
#[ascend_std::aiv_kernel]
pub unsafe fn vec_add(x: *const u16, y: *const u16, z: *mut u16, len: *const u32) {
    unsafe {
        let n = *len;
        let block_idx = ascend_std::get_block_idx() as u32;
        let base = block_idx * n;
        let tile_size = 256u32;

        // Allocate buffers once. These IDs are valid for the entire kernel.
        let buf_x = ascend_std::ascend_buf_alloc(tile_size);
        let buf_y = ascend_std::ascend_buf_alloc(tile_size);
        let buf_z = ascend_std::ascend_buf_alloc(tile_size);

        let mut offset = 0u32;
        loop {
            if offset >= n { break; }
            let mut len = tile_size;
            if offset + len > n { len = n - offset; }
            let gm_off = (base + offset) as usize;

            ascend_std::ascend_buf_load_f16(buf_x, x.wrapping_add(gm_off), len);
            ascend_std::ascend_buf_load_f16(buf_y, y.wrapping_add(gm_off), len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_add_f16(buf_z, buf_x, buf_y, len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_buf_store_f16(z.wrapping_add(gm_off), buf_z, len);

            // No FreeTensor here. Even if this line were duplicated by
            // copy-paste, there is simply no free function to call.
            offset = offset + tile_size;
        }
        // Kernel returns — all buffers implicitly released.
    }
}

Key insight: In C++, FreeTensor() is a manual operation that can be accidentally duplicated. In Rust, there is no free operation — buffer IDs are typed newtype wrappers (UbBuf, L1Buf, etc.) that encode the memory level at compile time. “Double-freeing” a buffer ID is meaningless.
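The guarantee rests on ordinary Rust RAII: release runs in Drop, which the compiler invokes exactly once per value. A minimal mock (std Rust; the free counter stands in for returning SRAM to a queue's free list):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times a buffer has been released.
static FREE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Mock device buffer: release happens in Drop, which runs exactly once
/// per value. There is no free function a programmer could call twice.
struct MockBuf {
    #[allow(dead_code)]
    id: u32,
}

impl Drop for MockBuf {
    fn drop(&mut self) {
        FREE_COUNT.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    {
        let _x = MockBuf { id: 0 };
        let _y = MockBuf { id: 1 };
        // No explicit free calls exist; copy-pasting a line here cannot
        // duplicate a release the way FreeTensor(xLocal) can in C++.
    } // both buffers released here, exactly once each

    assert_eq!(FREE_COUNT.load(Ordering::SeqCst), 2);
    // Calling drop(x) twice would not even compile: "use of moved value".
}
```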

6.6 Silent Integer Overflow in Multi-Block Offset

Multi-block kernels distribute work across NPU cores by computing offset = blockIdx * perBlockLen. With uint32_t arithmetic, this multiplication silently wraps on overflow — e.g., 8192 * 524288 = 0x100000000 wraps to 0. The kernel reads/writes from the wrong memory region, potentially aliasing another block’s data. In C++, unsigned overflow is defined behavior (modular arithmetic), so no warning is generated.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelVecAddOverflow {
    // ...
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR len_buf) {
        uint32_t perBlockLen = *((__gm__ uint32_t *)len_buf);

        // BUG: Silent uint32_t overflow when blockIdx * perBlockLen > 2^32.
        //
        // Example: with perBlockLen = 524288 (512K elements) and more than
        // 8192 blocks (over 8 GB of half-precision data), block 8192 computes:
        //   offset = 8192 * 524288 = 4294967296 = 0x100000000
        // But uint32_t wraps: offset = 0. This block now aliases block 0's data.
        //
        // C++ provides no warning — unsigned overflow is well-defined as
        // modular arithmetic. The kernel silently reads the wrong data.
        uint32_t offset = AscendC::GetBlockIdx() * perBlockLen;

        xGm.SetGlobalBuffer((__gm__ half *)x + offset, perBlockLen);
        yGm.SetGlobalBuffer((__gm__ half *)y + offset, perBlockLen);
        zGm.SetGlobalBuffer((__gm__ half *)z + offset, perBlockLen);
        // ...
    }
    // ...
};

Rust — Safe:

#![feature(no_core)]
#![no_std]
#![no_core]

/// wrapping_mul documents that this multiplication may overflow for large
/// tensor sizes. A reviewer seeing wrapping_mul knows to check whether
/// the overflow is actually safe. In debug builds, plain `*` panics.
#[ascend_std::aiv_kernel]
pub unsafe fn vec_add(x: *const u16, y: *const u16, z: *mut u16, len: *const u32) {
    unsafe {
        let n = *len;
        let block_idx = ascend_std::get_block_idx() as u32;

        // wrapping_mul makes overflow semantics explicit.
        // A developer reading this line knows that:
        //   1. This multiplication CAN overflow for large inputs
        //   2. The overflow behavior is intentionally wrapping
        //   3. This is a potential correctness concern worth reviewing
        //
        // In debug builds (CPU-side testing), plain `*` would panic:
        //   let offset = block_idx * n;  // panics in debug if overflows!
        let offset = block_idx.wrapping_mul(n);

        let tile_size = 256u32;
        let buf_x = ascend_std::ascend_buf_alloc(tile_size);
        let buf_y = ascend_std::ascend_buf_alloc(tile_size);
        let buf_z = ascend_std::ascend_buf_alloc(tile_size);

        let mut tile_off = 0u32;
        loop {
            if tile_off >= n { break; }
            let mut len = tile_size;
            if tile_off + len > n { len = n - tile_off; }
            let gm_off = (offset.wrapping_add(tile_off)) as usize;

            ascend_std::ascend_buf_load_f16(buf_x, x.wrapping_add(gm_off), len);
            ascend_std::ascend_buf_load_f16(buf_y, y.wrapping_add(gm_off), len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_add_f16(buf_z, buf_x, buf_y, len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_buf_store_f16(z.wrapping_add(gm_off), buf_z, len);
            tile_off = tile_off + tile_size;
        }
    }
}

Key insight: In C++, blockIdx * perBlockLen silently wraps with no indication the developer considered overflow. In Rust, wrapping_mul explicitly documents the intent, and in debug builds regular * panics on overflow — catching the bug during development before it reaches hardware.
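The arithmetic can be checked on the host with std Rust's explicit overflow APIs:

```rust
fn main() {
    let block_idx: u32 = 8192;
    let per_block_len: u32 = 524288; // 512K elements

    // wrapping_mul: explicit modular arithmetic.
    // 2^13 * 2^19 = 2^32, which is 0 modulo 2^32.
    assert_eq!(block_idx.wrapping_mul(per_block_len), 0);

    // checked_mul: returns None on overflow, forcing the caller to decide.
    assert_eq!(block_idx.checked_mul(per_block_len), None);

    // Widening first avoids the overflow entirely, often the right fix
    // when global-memory offsets can exceed 32 bits:
    let offset = (block_idx as u64) * (per_block_len as u64);
    assert_eq!(offset, 0x1_0000_0000);

    // Plain `*` on u32 would panic in a debug build; in release it wraps
    // like C's unsigned arithmetic.
    println!("all overflow checks passed");
}
```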



7. End-to-End Pipeline Walkthrough

Let’s trace the complete journey from source code to NPU execution during a single cargo run.

7.1 Compilation Phase

graph TD
    A["Rust Kernel Source<br/>kernels/src/lib.rs"] -->|"rustc + rustc_codegen_mlir"| B["Rust MIR<br/>Type-checked, monomorphized"]
    B -->|"builder_methods.rs:<br/>MIR ops → MLIR ops"| C["MLIR Modules<br/>LLVM · Arith · CF dialects<br/>hacc.entry attribute"]
    C -->|"compile_ascend.rs:<br/>merge all modules"| D["Merged MLIR<br/>kernel code + ascend_std deps"]
    D -->|"mlir_to_cpp<br/>(default)"| E["Generated C++<br/>AscendC class with TBuf,<br/>DataCopy, ReduceMax, Exp, ..."]
    D -->|"mlir_to_pto<br/>(ACLRS_CODEGEN_PATH=pto)"| P["PTO Assembly<br/>pto.tload, pto.tadd, pto.tmatmul,<br/>pto.trowmax, pto.texp, ..."]
    P -->|"ptoas --enable-insert-sync"| E
    E --> F["ascend_compile crate<br/>Target abstraction · Validation<br/>Bisheng invocation · C ABI + CLI"]
    F -->|"310P: --cce-aicore-arch=dav-m200"| G["NPU Binary · kernel.acl.o<br/>Ascend 310P machine code"]
    F -->|"910B: --cce-aicore-arch=dav-c220"| H["NPU Binary · kernel.acl.o<br/>Ascend 910B machine code<br/>(413 tests verified)"]

7.1.1 The ascend_compile Compilation Hub

The ascend_compile crate (crates/ascend_compile/) is a standalone compilation library that decouples kernel compilation from the rustc_codegen_mlir backend. Any C++ kernel generator — whether from ascend-rs’s own MLIR-to-C++ pipeline, TileLang, Triton, PyPTO (CANN’s tile-level operator DSL), or future frontends — can use it to compile AscendC kernels:

graph TD
    A1["ascend-rs<br/>Rust→MLIR→C++"] --> E["AscendC C++ kernel source"]
    A2["TileLang<br/>Python DSL→AscendC (planned)"] -.-> E
    A3["Triton<br/>GPU kernel compiler (planned)"] -.-> E
    A4["PyTorch<br/>torch.compile (planned)"] -.-> E
    A5["PyPTO<br/>CANN tile-level DSL (planned)"] -.-> E
    E --> F["ascend_compile<br/><br/>Rust API · C ABI · CLI · Python<br/><br/>3 validation passes<br/>Dual flag paths · 310P + 910B<br/>Object or shared library output"]
    F --> G["NPU Binary · .o / .so"]

This architecture enables the broader Ascend ecosystem to benefit from ascend-rs’s validated compilation pipeline without depending on Rust or rustc. The dashed edges indicate planned integrations not yet implemented.

7.1.2 Alternative Codegen Path: PTOAS (Programmable Tile Operation Assembly)

In addition to the default mlir_to_cpp path, ascend-rs supports an experimental PTO (Programmable Tile Operations) codegen path that targets the pto-isa virtual ISA — the same tile-level instruction set used internally by CANN’s FlashAttention implementation on Ascend 910B.

Activation. Set ACLRS_CODEGEN_PATH=pto to route kernel compilation through the PTO path instead of direct C++ generation:

export ACLRS_CODEGEN_PATH=pto          # Enable PTO path (default: cpp)
export ACLRS_PTOAS_PATH=/path/to/ptoas # Optional: explicit ptoas binary location

Pipeline. The PTO path adds an intermediate representation layer between MLIR and the final C++ that bisheng compiles:

graph LR
    A["Merged MLIR<br/>(LLVM dialect)"] -->|"mlir_to_pto"| B["PTO Assembly<br/>(pto dialect MLIR)"]
    B -->|"ptoas<br/>--enable-insert-sync"| C["AscendC C++"]
    C -->|"bisheng"| D[".acl.o"]

The key advantage of this intermediate step is that ptoas automatically inserts synchronization barriers (set_flag/wait_flag) between pipeline stages. In the direct C++ path, the codegen must explicitly emit pipe_barrier(PIPE_ALL) between DMA and compute operations — getting this wrong causes silent data corruption or NPU hangs. The PTO path delegates barrier insertion to the ptoas assembler, which has exact knowledge of the hardware pipeline topology.

Tile intrinsics API. The ascend_std::tile module provides safe Rust wrappers for PTO tile operations:

use ascend_std::tile::*;

pub unsafe fn tile_softmax(input: *const f32, output: *mut f32) {
    // Load 32×32 tile from global memory
    let x: Tile<32, 32, f32> = tile_load_f32(input);

    // Numerically-stable softmax decomposition (5 PTO ops):
    // 1. Row-wise max: pto.trowmax
    // 2. Subtract max:  pto.trowexpandsub
    // 3. Exponential:   pto.texp
    // 4. Row-wise sum:  pto.trowsum
    // 5. Divide by sum: pto.trowexpanddiv
    let y: Tile<32, 32, f32> = tile_softmax_f32(x);

    // Store result to global memory
    tile_store_f32(output, y);
}

The Tile<ROWS, COLS, T> type is a move-only handle (no Copy) that ensures single-ownership semantics — preventing double-DMA and enforcing compile-time safety. Const generic parameters carry shape information through the type system, catching dimension mismatches at compile time rather than at NPU runtime.
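A stripped-down sketch of the same idea in plain const-generic Rust (illustrative; the real Tile also carries the element type and memory level):

```rust
/// A tile handle whose shape lives in the type. No Copy/Clone derive,
/// so every use moves it: a second DMA of the same handle cannot compile.
struct Tile<const R: usize, const C: usize>;

/// Matmul consumes both operands; the shared dimension K must match,
/// and the result shape R×N is computed in the type system.
fn tile_matmul<const R: usize, const K: usize, const N: usize>(
    _a: Tile<R, K>,
    _b: Tile<K, N>,
) -> Tile<R, N> {
    Tile
}

/// Read the shape back out of the type (for demonstration only).
fn shape<const R: usize, const C: usize>(_t: &Tile<R, C>) -> (usize, usize) {
    (R, C)
}

fn main() {
    let a: Tile<32, 16> = Tile;
    let b: Tile<16, 8> = Tile;
    let c = tile_matmul(a, b);
    assert_eq!(shape(&c), (32, 8));

    // tile_matmul(a, b);
    // ^ ERROR: use of moved value `a` -- double-DMA cannot compile.

    // let bad: Tile<4, 4> = Tile;
    // tile_matmul(c, bad);
    // ^ ERROR: expected `Tile<8, _>`, found `Tile<4, 4>` -- shape mismatch
    //   caught at compile time, not at NPU runtime.
}
```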

Matmul via cube unit. Tile matmul maps to the hardware’s cube engine through a multi-level memory hierarchy pipeline:

// (M×K) @ (K×N) → (M×N), routed through L1→L0A/L0B→Cube→L0C
let a: Tile<32, 32, f32> = tile_load_f32(a_ptr);
let b: Tile<32, 32, f32> = tile_load_f32(b_ptr);
let c: Tile<32, 32, f32> = tile_matmul_f32(a, b);  // pto.tmatmul
tile_store_f32(c_ptr, c);

The mlir_to_pto translator generates the full cube-unit pipeline: GM→CBUF staging tiles (pto.tload), CBUF→L0A/L0B movement (pto.tmov), matrix multiply on L0C (pto.tmatmul), and writeback — all with correct buffer layout attributes (blayout, slayout, fractal) for each memory level.

PTO virtual ISA. The translator emits the following PTO-dialect operations:

| Category | Operations | Description |
|---|---|---|
| Memory | pto.tload, pto.tstore | GM↔local tile DMA transfers |
| Element-wise | pto.tadd, pto.tmul, pto.texp | Vectorized arithmetic and transcendentals |
| Reduction | pto.trowmax, pto.trowsum, pto.trowexpandsub, pto.trowexpanddiv | Row-wise reductions with broadcast |
| Cube | pto.tmatmul, pto.tmov | Matrix multiply and inter-level data movement |
| Memory mgmt | pto.alloc_tile, pto.make_tensor_view, pto.partition_view | Buffer allocation and GM partitioning |

Each PTO tile buffer carries explicit layout metadata specifying its memory level (vec, mat, left, right, acc), data layout (row_major/col_major), and fractal size — enabling ptoas to generate correct data movement instructions for the hardware’s fractal memory architecture.

7.2 Runtime Phase

graph TD
    subgraph Host["Host CPU"]
        H1["Acl::new()"] --> H2["Device::new"]
        H2 --> H3["AclContext"]
        H3 --> H4["AclStream"]
        H4 --> H5["DeviceBuffer::from_slice()"]
        H5 --> H6["kernel.launch()"]
        H6 --> H7["stream.sync()"]
        H7 --> H8["z_device.to_host()"]
        H8 --> H9["Verify results"]
        H9 --> H10["RAII Drop · auto-clean"]
    end
    subgraph Device["NPU Device"]
        D1["AI Core 0<br/>block_idx=0<br/>Process x 0..8"]
        D2["AI Core 1<br/>block_idx=1<br/>Process x 8..16"]
        D3["Device Memory<br/>x: Input A · y: Input B<br/>z: Output = A * B"]
    end
    H4 -.->|"stream binds"| D3
    H5 -.->|"Host → Device copy"| D3
    H6 -.->|"Kernel execution"| D1
    H6 -.->|"Kernel execution"| D2
    H7 -.->|"Completion signal"| Device
    H8 -.->|"Device → Host transfer"| D3
    H10 -.->|"Resources freed"| Device

7.3 Memory Safety Guarantees

Throughout this process, ascend-rs provides the following compile-time safety guarantees:

| Safety Issue | C++ Approach | ascend-rs Approach |
|---|---|---|
| Device memory leak | Manual aclrtFree | Drop on DeviceBuffer<T> |
| Wrong deallocation order | Programmer convention | Lifetime system prevents at compile time |
| Use-after-free stream | No check | Compile error |
| Send unsafe type to device | No check | DeviceSend trait bound |
| Forgetting to synchronize | Silent data corruption | Type system extensible to enforce |
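The DeviceSend guarantee can be sketched with a marker trait in plain Rust (names are illustrative; the real trait and upload path live in ascend-rs):

```rust
/// Marker for types whose bit pattern is meaningful on the device.
/// Deliberately NOT implemented for pointer-carrying, host-only types.
trait DeviceSend {}

impl DeviceSend for f32 {}
impl DeviceSend for u32 {}

/// Mock upload: the trait bound rejects host-only types at compile time.
/// (The Vec stands in for an aclrtMemcpy to device memory.)
fn to_device<T: DeviceSend + Copy>(host: &[T]) -> Vec<T> {
    host.to_vec()
}

/// A host-only handle: its pointer is meaningless on the NPU.
struct HostHandle {
    #[allow(dead_code)]
    ptr: *const u8,
}

fn main() {
    let data = [1.0f32, 2.0, 3.0];
    let dev = to_device(&data);
    assert_eq!(dev, data);

    let _h = HostHandle { ptr: std::ptr::null() };
    // to_device(&[_h]);
    // ^ ERROR: the trait `DeviceSend` is not implemented for `HostHandle`
}
```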


8. Next Steps: Roadmap and Vision

Current Status

ascend-rs is in active development:

  • Host API: Alpha stage. ACL operations, memory management, kernel launching, BLAS, DVPP, profiling, and HCCL are implemented.
  • Build tooling: Alpha stage. Supports compilation of both C++ and Rust kernels with automatic codegen path selection.
  • ascend_compile crate: Standalone kernel compilation library with C ABI, CLI, and Python bindings. Decouples bisheng invocation from rustc, enabling any C++ kernel generator to compile for Ascend NPU.
  • Device runtime: 505 Rust NPU kernels (486 compiletests + 16 deployable + 6 tile) with complete 1:1 MultiKernelBench coverage across 17 categories, 413 tests passing NPU correctness verification on Ascend 910B3 (0 fail, 0 crash), including 37 matmul tests via aclnn operator composition, and 6 memory safety case studies demonstrating structural advantages over AscendC C++.
  • Benchmarks: Rust vector kernels match hand-optimized C++ performance (within 2% overhead, Section 4.3–4.4) on softmax, activations, vec_add, and matmul.

Short-term Goals

Vector intrinsic coverage: The vector intrinsic API covers a comprehensive set of operations for f32 and f16:

  • Arithmetic: Add, Sub, Mul, Div, Min, Max ✓ Implemented
  • Reductions: ReduceMax, ReduceMin, ReduceSum ✓ Implemented
  • Unary math: Exp, Abs, Ln, Sqrt, Rsqrt, Reciprocal ✓ Implemented
  • Scalar-vector: Adds, Muls, Maxs, Mins (f32 and f16) ✓ Implemented
  • Activation functions: Relu, Sigmoid, Tanh, GELU, Softmax, ELU, Swish, Mish, SELU, Softplus, Softsign, HardSigmoid, HardSwish, Leaky ReLU, Log Softmax ✓ Implemented (16 activations)
  • Composite operations: LayerNorm, RMSNorm, L1/L2 Norm, MSE/Huber/Hinge Loss, Cosine Similarity, SGD Update, Reduce Mean/Prod ✓ Implemented (17 composites in kernel_ops.rs)
  • Cube engine: matmul_f16 via Mmad FFI (f16 inputs → f32 output) ✓ Implemented
  • Cube engine transpose: matmul_f16_transpose_b with hardware L1→L0B transpose ✓ Implemented
  • Tiling and double-buffering: Queue-based (TQue) pipeline for overlapping DMA and compute
  • Type-safe buffer handles: #[repr(transparent)] newtype wrappers (UbBuf, L1Buf, L0aBuf, L0bBuf, L0cBuf) that prevent mixing buffer memory levels at compile time ✓ Implemented

End-to-end neural network operator examples:

  • Conv2D ✓ — Pre-built operator via OpsBuilder/atc, with host-side Model+Dataset execution and CPU reference verification
  • Multi-Head Attention (MHA) ✓ — Host-orchestrated scaled dot-product attention pipeline: Q*K^T (HGEMM) → scale (Rust kernel) → row-wise softmax (Rust kernel with f16 reduce/exp/muls intrinsics) → weights*V (HGEMM)
  • BLAS API improvement ✓ — acl_blas_gemm_ex alpha/beta changed from owned to borrowed (&DeviceBox<T>), enabling reuse across multiple GEMM calls in pipelines like MHA
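The CPU reference side of such a pipeline can be very small. A sketch of single-head scaled dot-product attention in plain f32 Rust (shapes and names are illustrative), the kind of oracle the NPU output is checked against:

```rust
/// Naive scaled dot-product attention for one head: softmax(Q·Kᵀ/√d)·V.
/// q, k, v are row-major [n][d]; returns [n][d]. CPU reference only.
fn attention_ref(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let n = q.len();
    let d = q[0].len();
    let scale = 1.0 / (d as f32).sqrt();

    (0..n)
        .map(|i| {
            // scores[j] = (q_i · k_j) / sqrt(d)
            let scores: Vec<f32> = (0..n)
                .map(|j| q[i].iter().zip(&k[j]).map(|(a, b)| a * b).sum::<f32>() * scale)
                .collect();

            // numerically stable row-wise softmax
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();

            // weighted sum of value rows
            (0..d)
                .map(|c| (0..n).map(|j| exps[j] / sum * v[j][c]).sum::<f32>())
                .collect()
        })
        .collect()
}

fn main() {
    // With identical query/key rows the attention weights are uniform,
    // so the output equals the mean of the value rows.
    let q = vec![vec![1.0, 0.0], vec![1.0, 0.0]];
    let k = q.clone();
    let v = vec![vec![2.0, 4.0], vec![6.0, 8.0]];
    let out = attention_ref(&q, &k, &v);
    for row in &out {
        assert!((row[0] - 4.0).abs() < 1e-5 && (row[1] - 6.0).abs() < 1e-5);
    }
}
```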

Device-side Rust language support: Core operators and codegen are complete:

  • Operators: Add, Sub, Mul, Div, Rem, bitwise ops (BitAnd, BitOr, Shl, Shr) ✓ Implemented
  • Codegen: Signed/float remainder, float-integer conversions ✓ Implemented
  • Type casting: Cast codegen for f16↔f32 conversions ✓ Implemented
  • Iterator combinators: map, filter, fold, zip, enumerate, etc.

Mid-term Goals: Ecosystem Integration

ascend_compile as the universal compilation backend: The standalone ascend_compile crate provides a single, validated compilation path for any tool that generates AscendC C++ kernels. It exposes four interfaces:

| Interface | Consumer | Use Case |
|---|---|---|
| Rust API | rustc_codegen_mlir | ascend-rs’s own MLIR→C++→binary pipeline |
| C ABI (libascend_compile.so) | Python via ctypes | Drop-in replacement for TileLang’s libgen.py |
| CLI (ascend-compile) | Shell scripts, CI | Ad-hoc compilation and validation |
| Python wrapper (ascend_compile.py) | TileLang, Triton backends | Direct Python integration |

Key features that benefit all consumers:

  • 3 validation passes before compilation: entry point check, DMA/sync barrier check (error on 310P, warning on 910B), buffer size vs. hardware limits
  • Dual flag paths: --cce-aicore-arch for 310P/310B and --npu-arch -xasc for 910B (TileLang-compatible)
  • Both object and shared library output: -c -o out.o or -fPIC --shared -o out.so
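A hedged sketch of the shape of such a pre-compilation pass (illustrative only — the `Target` variants and byte limits below are assumptions for the sketch, not the actual ascend_compile values):

```rust
// Illustrative validation pass: reject a kernel source with no extern "C"
// entry point, or a buffer request beyond a per-target limit.
// (Hypothetical simplification of the ascend_compile checks.)
pub enum Target { A310P, A910B }

pub fn validate(source: &str, requested_ub_bytes: usize, target: Target) -> Result<(), String> {
    // Entry point check: the kernel must export a C-ABI symbol.
    if !source.contains("extern \"C\"") {
        return Err("no extern \"C\" entry point found".to_string());
    }
    // Buffer size vs. hardware limits (limits assumed for this sketch).
    let ub_limit: usize = match target {
        Target::A310P => 248 * 1024,
        Target::A910B => 192 * 1024,
    };
    if requested_ub_bytes > ub_limit {
        return Err(format!("UB request {} exceeds limit {}", requested_ub_bytes, ub_limit));
    }
    Ok(())
}
```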

TileLang-Ascend integration: TileLang generates optimized AscendC C++ kernels from a Python DSL but relies on a bare subprocess.run(bisheng, ...) call with no validation. Replacing LibraryGenerator.compile_lib() with ascend_compile.compile_kernel() provides:

  • Automatic target detection and correct flag selection
  • Pre-compilation validation that catches common NPU bugs (missing sync barriers, buffer overflows)
  • Consistent compilation across tools — the same flags ascend-rs uses for its own validated kernels

PyPTO integration: PyPTO (Parallel Tile Operations) is CANN’s high-level operator programming framework that compiles Python-level tensor operations through a ~90-instruction PTO virtual ISA down to AscendC C++ code. When PyPTO is released alongside the CANN framework, ascend_compile can serve as the compilation backend, and an ascend-rs interface to PyPTO would enable memory-safe static analysis of tile-level operators — catching buffer overflows, missing synchronization barriers, and incorrect DMA parameters at compile time that PyPTO currently validates only at code-generation time.

Triton-Ascend backend: Triton’s compiler pipeline produces target-specific IR that must be lowered to device binaries. A Triton backend for Ascend can use ascend_compile to handle the final AscendC C++ → NPU binary step, benefiting from the same validation and target abstraction.

PyTorch integration path: torch.compile with an Ascend backend could leverage ascend_compile through its C ABI to compile generated kernels without a Python→Rust dependency, using the same libascend_compile.so that TileLang uses.

Complete host API: All major CANN API modules now have safe Rust wrappers:

  • Tensor descriptors ✓ — TensorDesc, DataBuffer, Dataset (28 methods)
  • Model inference ✓ — Model::from_file(), execute(), execute_async(), ModelDescription (16 methods)
  • Event management ✓ — AclEvent with record/sync/timing (8 methods)
  • DVPP image preprocessing ✓ — DvppChannel, PicDesc, resize/crop/JPEG/PNG (42 methods)
  • Profiling API ✓ — ProfSession, ProfConfig, StepInfo, ProfStamp (18 methods)
  • HCCL distributed communication ✓ — AllReduce, AllGather, Broadcast, ReduceScatter, Send/Recv (17 methods)

MLIR codegen backend improvements:

  • Rust intrinsics ✓ — bit manipulation (ctlz/cttz/ctpop/bswap/bitreverse/rotate), float math (floor/ceil/round/trunc/copysign/fma), overflow arithmetic, saturating arithmetic
  • Float constant support ✓ — proper MLIR attribute formatting with decimal points
  • C++ codegen intrinsic translation ✓ — all LLVM intrinsics now mapped to GCC builtins and C math functions
  • Correctness fixes ✓ — raw_eq (byte comparison), discriminant_value (enum match), const_uint_big (i128), static_addr_of (global symbols), codegen_static (initializer values)
  • Debug info generation (not yet started)
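These intrinsics correspond to ordinary standard-library methods in Rust source; a host-runnable sample of the operations the backend now lowers:

```rust
// The operations behind the newly supported intrinsics, as they appear
// in ordinary Rust source (each maps to one MLIR op / C++ builtin).
pub fn bit_ops(x: u32) -> (u32, u32, u32) {
    (x.leading_zeros(),    // ctlz
     x.count_ones(),       // ctpop
     x.rotate_left(8))     // rotate
}

pub fn float_ops(x: f32) -> (f32, f32) {
    (x.floor(),            // llvm.floor
     x.mul_add(2.0, 1.0))  // fma: x * 2.0 + 1.0 in one rounding step
}

pub fn sat_add(a: u8, b: u8) -> u8 {
    a.saturating_add(b)    // saturating arithmetic: clamps at u8::MAX
}
```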

Long-term Vision

Ascend target specification — davinci-huawei-none: We have prepared a concrete Tier-3 target proposal for the Rust compiler. The target triple davinci-huawei-none follows established conventions (nvptx64-nvidia-cuda, amdgcn-amd-amdhsa) and defines the ABI, calling conventions, and pointer sizes for the DaVinci NPU architecture. The target spec (upstream-tier3/compiler/rustc_target/src/spec/targets/davinci_huawei_none.rs) uses aarch64-unknown-none as the LLVM placeholder (since no DaVinci LLVM backend exists) and registers cfg(target_arch = "davinci") for conditional compilation. The upstream-tier3/ directory contains the complete submission package: the target spec, platform-support documentation, patches for mod.rs/platform-support.md/bootstrap/sanity.rs, and community engagement materials (Zulip post, optional MCP draft, PR description). Our engagement plan: (1) post to Zulip #t-compiler/help for early feedback on the triple name, (2) file an MCP if the novel MLIR codegen backend warrants compiler-team consensus, (3) open a draft PR to rust-lang/rust. Tier-3 targets have the lowest bar — no RFC, no CI requirement, single-reviewer approval — and our in-tree changes contain no proprietary code.

Reducing the no_core burden: Maintaining a parallel core library reimplementation is a massive engineering effort. The long-term direction is to explore using -Zbuild-std=core with the MLIR backend to compile the Rust standard library source directly, rather than reimplementing by hand.

A unified Ascend compilation stack: The ascend_compile crate is the first step toward a unified compilation infrastructure where multiple frontends (Rust, Python DSLs, compiler IRs) share the same validated, target-aware backend. This mirrors the LLVM model — many frontends, one backend — but specialized for Ascend NPU hardware:

graph TD
    A1["Rust kernels"] --> F["AscendC C++ · common IR"]
    A2["TileLang (planned)"] -.-> F
    A3["Triton (planned)"] -.-> F
    A4["torch.compile (planned)"] -.-> F
    A5["PyPTO (planned)"] -.-> F
    A6["Future DSLs (planned)"] -.-> F
    F --> G["ascend_compile: validate → target flags → bisheng → binary"]
    G --> H["NPU Binary · .o / .so"]

Community Involvement

ascend-rs is currently in a private repository, pending an organizational decision on open-sourcing. Once released, it will welcome community participation. If you have Ascend NPU hardware and are interested in exploring memory-safe kernel programming, here are areas where contributions would be valuable:

  1. Add new vector intrinsics to ascend_std: Following the established pattern of extern "C" stubs + mlir_to_cpp handlers.
  2. Write more compiletest tests: As new features are added to ascend_std, corresponding compile tests should follow.
  3. Expand host API wrappers: The CANN SDK has many unwrapped APIs, each of which can be contributed independently.
  4. Try writing more complex Rust kernels: Help discover gaps in the codegen backend and validate new intrinsics on NPU hardware.
  5. Integrate ascend_compile with your tool: If you work on TileLang, Triton, or other kernel compilers targeting Ascend, try replacing your compilation step with ascend_compile and report issues.


Conclusion

The ascend-rs project demonstrates that memory safety in NPU programming is achievable without sacrificing performance. Through Rust’s ownership system, lifetimes, and RAII patterns, we eliminate an entire class of memory safety errors at compile time — errors that traditional C++ NPU programming can only guard against through programmer experience and discipline.

From Hello World to the vectorized softmax kernel, we’ve seen the complete pipeline from source to NPU execution: Rust source → MLIR intermediate representation → C++ with AscendC vector intrinsics → NPU binary → device execution → safe result retrieval. All 413 tests pass on Ascend 910B3 hardware (0 failures, 0 crashes) across every kernel category, and benchmark results confirm that Rust vectorized kernels match the performance of hand-optimized C++ — with zero overhead. The experimental PTOAS codegen path (Section 7.1.2) further demonstrates that tile-level operations can be expressed in safe Rust and compiled through the PTO virtual ISA, with automatic synchronization barrier insertion by the ptoas assembler.

With the introduction of the ascend_compile crate, ascend-rs now extends its impact beyond Rust kernel authors. By providing a standalone, validated compilation library with C ABI and Python bindings, the project enables the broader Ascend ecosystem — TileLang, Triton, PyTorch, and future compiler frameworks — to share a common, well-tested compilation backend. The same validation passes that catch missing sync barriers and buffer overflows in Rust-generated kernels now protect kernels from any source.

The direction is clear: bring safety guarantees to every Ascend NPU user, whether they’re writing Rust kernels, Python DSLs, or integrating compiler toolchains — and make the entire ecosystem more reliable in the process.


About the Project

ascend-rs is developed internally at Huawei Boyle Research Center and is pending an open-source release decision. If you’re interested in memory-safe NPU programming or collaboration, please contact the author.


Author: Yijun Yu



Appendix: Real-World Memory Safety Vulnerabilities in GPU/NPU Ecosystems

The six memory safety case studies in Section 6 demonstrate structural patterns where Rust prevents common mistakes. However, memory safety in accelerator code is not merely a theoretical concern — it has led to actively exploited zero-day vulnerabilities, production crashes, and security incidents across every major GPU/NPU vendor. This appendix documents concrete, citable cases.

A.1 ARM Mali GPU: Use-After-Free Exploited by Spyware (CVE-2023-4211)

A use-after-free in the ARM Mali GPU kernel driver’s VMA tracking allowed privilege escalation on billions of Android devices. An attacker could split a multi-page tracking VMA via munmap(), causing the teardown routine to null out kctx->process_mm while bookkeeping was still pending. Google TAG confirmed this was actively exploited by a commercial surveillance vendor. Rust’s ownership model prevents use-after-free by construction — the freed VMA would be consumed/dropped, and any subsequent reference would be a compile-time error.

Sources: Google Project Zero; Arm Security Bulletin

A.2 ARM Bifrost/Valhall GPU: Actively Exploited Zero-Day (CVE-2024-4610)

Another use-after-free in ARM GPU drivers, this time affecting Bifrost and Valhall architectures (r34p0–r40p0). CISA confirmed active exploitation in the wild across hundreds of millions of smartphones and embedded devices. Rust’s borrow checker enforces exclusive mutable access, making the dangling reference pattern impossible.

Source: CISA KEV Catalog

A.3 NVIDIA GPU Driver: Out-of-Bounds Write (CVE-2024-0090)

An out-of-bounds write in the NVIDIA GPU display driver for Linux and Windows enabled privilege escalation. Rust’s bounds checking on slice access would catch this with a safe panic rather than silent memory corruption.

Source: NVD; SecurityWeek

A.4 AMDGPU Fence: Use-After-Free Race Condition (CVE-2023-51042)

A race condition in the Linux AMDGPU driver’s amdgpu_cs_wait_all_fences() allowed code to access a fence object after it was freed. This triggered kernel crashes and potential privilege escalation, requiring emergency patches from Red Hat, SUSE, and Ubuntu. Rust’s ownership model makes data races a compile-time error — the fence would be protected by Arc<Mutex<...>>, preventing both the use-after-free and the underlying race.

Source: NVD

A.5 NVIDIA CUDA Toolkit: Heap Buffer Overflow via Integer Overflow (CVE-2024-53873)

Nine vulnerabilities in NVIDIA CUDA Toolkit’s cuobjdump utility, caused by integer overflow during cubin file parsing leading to heap buffer overflow. Rust’s checked arithmetic (overflow panics in debug, wrapping_mul required for explicit wrapping) prevents the integer overflow, and Vec/slice bounds checking prevents the subsequent heap corruption.

Source: Palo Alto Unit42

A.6 Qualcomm Adreno GPU: Three Zero-Days Exploited in Targeted Attacks (CVE-2025-21479/21480/27038)

Three zero-day vulnerabilities in Qualcomm Adreno GPU drivers, including unauthorized GPU microcode command execution and a use-after-free during rendering. Actively exploited in targeted attacks on billions of Android devices. Rust’s memory safety guarantees prevent the UAF, and the ownership model constrains what operations are possible on GPU resources.

Sources: The Hacker News; BleepingComputer

A.7 PyTorch CUDA Kernel: Silent Out-of-Bounds Access (Issue #37153)

In PyTorch’s Reduce.cuh, accessing iter.shape()[0] on a scalar input (where iter.shape() returns an empty array) caused an out-of-bounds memory read. This led to flaky test failures that were extremely difficult to reproduce or diagnose — a classic silent data corruption pattern. Rust’s slice indexing panics on empty-slice access rather than silently reading garbage memory.

Source: PyTorch Issue #37153

A.8 TensorFlow GPU Kernels: Repeated Heap Buffer Overflows (CVE-2023-25668, CVE-2020-15198, CVE-2019-16778)

A pattern of heap buffer overflows in TensorFlow GPU kernels: QuantizeAndDequantize reading past tensor bounds (CVE-2023-25668), SparseCountSparseOutput with mismatched tensor shapes (CVE-2020-15198), and UnsortedSegmentSum truncating int64 to int32 producing negative indices (CVE-2019-16778). These are particularly dangerous because ML models loaded from untrusted sources can trigger them. Rust prevents all three: bounds checking catches overflows, the type system can enforce shape consistency, and explicit as cast semantics prevent silent truncation.

Sources: Snyk: CVE-2023-25668; GitHub Advisory: CVE-2019-16778

A.9 GPU Memory Exploitation for Fun and Profit (USENIX Security 2024)

Academic research demonstrating that buffer overflows in CUDA kernel global memory can be exploited for code injection, return-oriented programming on GPU, and cross-tenant ML model weight corruption. Unlike CPUs, GPU memory spaces lack ASLR, stack canaries, and other standard protections. A malicious GPU kernel can corrupt another tenant’s model weights in shared GPU cloud deployments. Rust’s bounds checking prevents buffer overflows entirely in safe code — exactly the class of attack this paper demonstrates.

Source: USENIX Security 2024

Summary

| CVE | Component | Bug Class | Exploited? |
|---|---|---|---|
| CVE-2023-4211 | ARM Mali GPU driver | Use-after-free | Yes (spyware) |
| CVE-2024-4610 | ARM Bifrost/Valhall GPU | Use-after-free | Yes |
| CVE-2024-0090 | NVIDIA GPU driver | Out-of-bounds write | Patched |
| CVE-2023-51042 | AMDGPU Linux driver | Use-after-free (race) | Patched |
| CVE-2024-53873 | NVIDIA CUDA Toolkit | Heap buffer overflow | Patched |
| CVE-2025-21479 | Qualcomm Adreno GPU | Memory corruption / UAF | Yes (targeted) |
| #37153 | PyTorch CUDA kernels | Out-of-bounds read | N/A |
| CVE-2023-25668+ | TensorFlow GPU kernels | Heap buffer overflow | N/A |
| USENIX ’24 | CUDA memory model | Buffer overflow (cross-tenant) | Demonstrated |

Every major GPU/NPU vendor — NVIDIA, AMD, ARM, Qualcomm — has shipped memory safety vulnerabilities in their accelerator drivers and toolchains. At least four were actively exploited in the wild. The bug classes — use-after-free, out-of-bounds writes, buffer overflows, race conditions — are precisely the categories that Rust’s ownership model, borrow checker, and bounds checking eliminate at compile time. This is the practical motivation for ascend-rs: not just cleaner code, but eliminating vulnerabilities that have real-world security consequences.



Appendix B: CVE Code Analysis — Vulnerable C++ vs Safe Rust Mitigations

This appendix presents the actual (or reconstructed) vulnerable C/C++ code from the CVEs documented in Appendix A, paired with ascend-rs-style Rust code that structurally prevents each vulnerability class.

B.1 Use-After-Free via Reference Count Drop (CVE-2023-51042, AMDGPU)

The Linux AMDGPU driver dereferences a fence pointer after dropping its reference count.

Vulnerable C code (from drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c, before fix 2e54154):

// Inside amdgpu_cs_wait_all_fences()
r = dma_fence_wait_timeout(fence, true, timeout);
dma_fence_put(fence);          // Reference dropped — fence may be freed
if (r < 0)
    return r;
if (r == 0)
    break;
if (fence->error)              // USE-AFTER-FREE: fence already freed
    return fence->error;

ascend-rs mitigation — Rust’s ownership ensures the value is consumed, not dangled:

// ascend_rs host API pattern: Arc<Fence> enforces lifetime
fn wait_all_fences(fences: &[Arc<Fence>], timeout: Duration) -> Result<()> {
    for fence in fences {
        let _status = fence.wait_timeout(timeout)?;
        // fence.error is checked WHILE we still hold the Arc reference
        if let Some(err) = fence.error() {
            return Err(err);
        }
        // Arc reference is alive until end of loop iteration —
        // Rust compiler rejects any code that uses fence after drop
    }
    Ok(())
}

Why Rust prevents this: Arc<Fence> is reference-counted. The compiler ensures you cannot access fence.error() after the Arc is dropped — the borrow checker rejects any reference to a moved/dropped value at compile time. There is no way to write the C pattern (use after put) in safe Rust.

B.2 Out-of-Bounds Write via Unchecked User Index (CVE-2024-0090, NVIDIA)

The NVIDIA GPU driver accepts a user-supplied index via ioctl without bounds checking.

Vulnerable C code (reconstructed from CVE description):

// NVIDIA GPU driver ioctl handler
struct gpu_resource_table {
    uint32_t entries[MAX_GPU_RESOURCES];
    uint32_t count;
};

static int nvidia_ioctl_set_resource(struct gpu_resource_table *table,
                                     struct user_resource_request *req)
{
    // BUG: No bounds check on user-supplied index
    table->entries[req->index] = req->value;   // OUT-OF-BOUNDS WRITE
    return 0;
}

ascend-rs mitigation — Rust slices enforce bounds at the type level:

// ascend_rs host API: DeviceBuffer<T> wraps a bounded slice
struct GpuResourceTable {
    entries: Vec<u32>,  // Vec tracks its own length
}

impl GpuResourceTable {
    fn set_resource(&mut self, index: usize, value: u32) -> Result<()> {
        // Option 1: indexing panics on out-of-bounds (debug + release):
        //     self.entries[index] = value;

        // Option 2: .get_mut() returns None for out-of-bounds (graceful)
        *self.entries.get_mut(index)
            .ok_or(Error::IndexOutOfBounds)? = value;
        Ok(())
    }
}

Why Rust prevents this: Vec<u32> tracks its length. Indexing with [] performs a bounds check and panics (safe termination, not memory corruption). Using .get_mut() returns None for out-of-bounds access. There is no way to silently write past the buffer in safe Rust.

B.3 Integer Overflow Leading to Heap Buffer Overflow (CVE-2024-53873, NVIDIA CUDA Toolkit)

The CUDA cuobjdump tool reads a 2-byte signed value from a crafted .cubin file, sign-extends it, and uses the corrupted size in memcpy.

Vulnerable C code (from Talos disassembly analysis):

// Parsing .nv_debug_source section in cubin ELF files
int16_t name_len_raw = *(int16_t*)(section_data);  // e.g., 0xFFFF = -1
int32_t name_len = (int32_t)name_len_raw;           // sign-extends to -1
int32_t alloc_size = name_len + 1;                   // -1 + 1 = 0
memcpy(dest_buf, src, (size_t)alloc_size);           // HEAP BUFFER OVERFLOW

ascend-rs mitigation — Rust’s checked arithmetic catches overflow:

// ascend_rs: parsing NPU binary metadata with safe arithmetic
fn parse_debug_section(section: &[u8], dest: &mut [u8]) -> Result<()> {
    let name_len_raw = i16::from_le_bytes(
        section.get(0..2).ok_or(Error::TruncatedInput)?.try_into()?
    );

    // checked_add returns None on overflow instead of wrapping
    let alloc_size: usize = (name_len_raw as i32)
        .checked_add(1)
        .and_then(|n| usize::try_from(n).ok())
        .ok_or(Error::IntegerOverflow)?;

    // The payload starts right after the 2-byte length field
    let offset = 2;
    // Slice bounds checking prevents buffer overflow
    let src = section.get(offset..offset + alloc_size)
        .ok_or(Error::BufferOverflow)?;
    dest.get_mut(..alloc_size)
        .ok_or(Error::BufferOverflow)?
        .copy_from_slice(src);
    Ok(())
}

Why Rust prevents this: checked_add() returns None on overflow. usize::try_from() rejects negative values. Slice indexing with .get() returns None for out-of-bounds ranges. The entire chain is safe — no silent wrapping, no unchecked memcpy.
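The arithmetic behind this CVE is easy to reproduce on the host. In this sketch, 0xFFFF (−1) yields a zero-size copy — harmless once slice bounds checks apply — while any length that remains negative after the +1 is rejected outright:

```rust
// Reproduce the CVE-2024-53873 arithmetic: a 2-byte length field read as
// i16 sign-extends, and the safe chain rejects any negative result.
pub fn safe_alloc_size(raw: [u8; 2]) -> Option<usize> {
    let name_len = i16::from_le_bytes(raw) as i32;  // e.g. 0xFFFE → -2
    name_len
        .checked_add(1)                         // no silent wrap
        .and_then(|n| usize::try_from(n).ok())  // negative → None
}
```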

B.4 Out-of-Bounds Read on Empty Container (PyTorch Issue #37153)

PyTorch’s CUDA reduce kernel indexes into iter.shape() which returns an empty array for scalar tensors.

Vulnerable C++ code (from aten/src/ATen/native/cuda/Reduce.cuh):

// iter.shape() returns empty IntArrayRef for scalar input
// iter.ndim() returns 0
int64_t dim0;
if (reduction_on_fastest_striding_dimension) {
    dim0 = iter.shape()[0];  // OUT-OF-BOUNDS: shape() is empty
    // dim0 = garbage value (e.g., 94599111233572)
}

ascend-rs mitigation — Rust’s Option type makes emptiness explicit:

// ascend_rs kernel: safe tensor shape access
fn configure_reduce_kernel(shape: &[usize], _strides: &[usize]) -> Result<KernelConfig> {
    // Option 1: .first() returns Option<&T> — None for empty slices:
    //     let dim0 = shape.first().copied()
    //         .ok_or(Error::ScalarTensorNotSupported)?;

    // Option 2: exhaustive pattern matching over the slice
    let (dim0, dim1) = match shape {
        [d0, d1, ..] => (*d0, *d1),
        [d0] => (*d0, 1),
        [] => return Err(Error::EmptyShape),
    };

    Ok(KernelConfig { dim0, dim1 })
}

Why Rust prevents this: shape.first() returns Option<&usize>, forcing the caller to handle the empty case. The match on slice patterns is exhaustive — the compiler requires the [] (empty) arm. shape[0] on an empty slice panics with a clear message instead of reading garbage.

B.5 Integer Truncation Bypassing Bounds Checks (CVE-2019-16778, TensorFlow)

TensorFlow’s UnsortedSegmentSum kernel implicitly truncates int64 tensor sizes to int32.

Vulnerable C++ code (from tensorflow/core/kernels/segment_reduction_ops.h):

template <typename T, typename Index>  // Index = int32
struct UnsortedSegmentFunctor {
    void operator()(OpKernelContext* ctx,
                    const Index num_segments,  // TRUNCATED: int64 → int32
                    const Index data_size,     // TRUNCATED: int64 → int32
                    const T* data, /* ... */)
    {
        if (data_size == 0) return;  // Bypassed: truncated value ≠ 0
        // data_size = 1 (truncated from 4294967297)
        // Actual tensor has 4 billion elements — massive OOB access
    }
};

ascend-rs mitigation — Rust’s type system rejects implicit narrowing:

// ascend_rs: explicit conversions prevent silent truncation
fn unsorted_segment_sum(
    data: &DeviceBuffer<f32>,
    segment_ids: &DeviceBuffer<i32>,
    num_segments: usize,         // Always full-width
) -> Result<DeviceBuffer<f32>> {
    let data_size: usize = data.len();  // usize, never truncated

    // If an i32 index is needed for the kernel, conversion is explicit:
    let data_size_i32: i32 = i32::try_from(data_size)
        .map_err(|_| Error::TensorTooLarge {
            size: data_size,
            max: i32::MAX as usize,
        })?;

    // Rust rejects: let x: i32 = some_i64;        // ERROR: mismatched types
    // Rust lints:   let x: i32 = some_i64 as i32; // clippy::cast_possible_truncation
    // ... launch the kernel with data_size_i32, yielding `output` ...
    Ok(output)
}

Why Rust prevents this: Rust has no implicit integer narrowing. let x: i32 = some_i64; is a compile error. The as cast exists, but clippy::cast_possible_truncation warns on it. TryFrom/try_into() returns Err when the value doesn’t fit, making truncation impossible without explicit acknowledgment.
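The truncation itself takes one line to demonstrate: the value from the report, 4294967297 (2^32 + 1), becomes 1 under as, while a checked conversion refuses it:

```rust
use std::num::TryFromIntError;

// CVE-2019-16778's failure mode in miniature: a 4-billion-element size
// truncates to 1 with `as`, but try_from surfaces the overflow as an error.
pub fn as_i32_checked(size: i64) -> Result<i32, TryFromIntError> {
    i32::try_from(size)
}
```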

B.6 Use-After-Free via Raw Pointer After Lock Release (CVE-2023-4211, ARM Mali)

The ARM Mali GPU driver copies a raw pointer from shared state, releases the lock, sleeps, then dereferences the now-dangling pointer.

Vulnerable C code (from mali_kbase_mem_linux.c, confirmed by Project Zero):

static void kbasep_os_process_page_usage_drain(struct kbase_context *kctx)
{
    struct mm_struct *mm;

    spin_lock(&kctx->mm_update_lock);
    mm = rcu_dereference_protected(kctx->process_mm, /*...*/);
    rcu_assign_pointer(kctx->process_mm, NULL);
    spin_unlock(&kctx->mm_update_lock);  // Lock released

    synchronize_rcu();  // SLEEPS — mm may be freed by another thread

    add_mm_counter(mm, MM_FILEPAGES, -pages);  // USE-AFTER-FREE
}

ascend-rs mitigation — Rust’s Arc + Mutex prevents dangling references:

// ascend_rs host API: device context with safe shared state
struct DeviceContext {
    process_mm: Mutex<Option<Arc<MmStruct>>>,
}

impl DeviceContext {
    fn drain_page_usage(&self) {
        // Take ownership of the Arc from the Mutex
        let mm = {
            let mut guard = self.process_mm.lock().unwrap();
            guard.take()  // Sets inner to None, returns Option<Arc<MmStruct>>
        };
        // Lock is released here (guard dropped)

        // If mm exists, we hold a strong reference — it CANNOT be freed
        if let Some(mm) = mm {
            synchronize_rcu();
            // mm is still alive — Arc guarantees it
            mm.add_counter(MmCounter::FilePages, -pages);
        }
        // mm dropped here — Arc ref count decremented
        // Only freed when the LAST Arc reference is dropped
    }
}

Why Rust prevents this: Arc<MmStruct> is a reference-counted smart pointer. Taking it from the Option gives us ownership of a strong reference. Even after the lock is released and other threads run, our Arc keeps the MmStruct alive. There is no way to obtain a dangling raw pointer from an Arc in safe Rust — the underlying memory is freed only when the last Arc is dropped.
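The same take-then-use pattern can be exercised on the host with plain std types (illustrative types only, not the ascend-rs API):

```rust
use std::sync::{Arc, Mutex};

// Minimal demonstration of the take-from-Mutex pattern: the Arc taken
// out of the lock keeps the value alive even after the guard is dropped.
pub struct Ctx {
    pub slot: Mutex<Option<Arc<String>>>,
}

pub fn drain(ctx: &Ctx) -> Option<usize> {
    let taken = ctx.slot.lock().unwrap().take(); // guard dropped at `;`
    // Other threads may run here, but our strong reference pins the value.
    taken.map(|v| v.len())
}
```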



Appendix C: Vulnerability Analysis of 300 MultiKernelBench Kernels

The 300 kernels in MultiKernelBench span 15 categories. If implemented as standard AscendC C++ kernels, each inherits the structural vulnerability patterns of the GM_ADDR/LocalTensor/FreeTensor API. We systematically classify which patterns affect which kernel categories, count the exposure, and show the highest-risk C++ vs. ascend-rs comparisons.

C.1 Vulnerability Pattern Prevalence

| Vulnerability Pattern | Affected Kernel Categories | Count (/300) | Severity |
|---|---|---|---|
| V1: GM_ADDR type erasure | All 15 categories | 300 | High |
| V2: Unchecked GetValue/SetValue OOB | Index (12), Conv (34), Pooling (6), Resize (10), Architecture (50), Attention (15), Math (6) | 133 | Critical |
| V3: Integer overflow in offset calc | All multi-block kernels: Activation (16), Broadcast (10), Reduce (5), Normalization (8), Fuse (100), Matmul (17), Optimizer (5) | 161 | High |
| V4: FreeTensor use-after-free | All tiled/pipelined kernels | 300 | High |
| V5: Double-free of LocalTensor | All tiled/pipelined kernels | 300 | Medium |
| V6: Missing pipe_barrier sync | All DMA+compute kernels | 300 | Critical |

Key finding: Every AscendC C++ kernel is structurally exposed to V1 (type erasure), V4 (use-after-free), V5 (double-free), and V6 (missing sync) because these are properties of the API itself, not of specific algorithms. The algorithmic vulnerabilities (V2, V3) affect subsets depending on whether the kernel uses element-indexed access or multi-block offset arithmetic.

C.2 Highest-Risk Category: Index Operations (12 kernels)

Index kernels (gather, scatter, scatter_add, index_select, index_copy, index_add, embedding, masked_fill, inplace_update, take_along_dim, argmax, argmin) are the highest-risk category because they combine all six vulnerability patterns simultaneously:

  • V1: GM_ADDR erases tensor element types
  • V2: User-provided index values access arbitrary offsets with no bounds check
  • V3: idx * row_len + j can overflow uint32_t for large tensors
  • V4/V5: Tiled implementations use FreeTensor lifecycle
  • V6: DMA ↔ compute synchronization required

C++ AscendC gather (vulnerable):

#include "kernel_operator.h"

// GM_ADDR erases all type info — caller can pass any dtype
extern "C" __global__ __aicore__
void gather(GM_ADDR input, GM_ADDR index, GM_ADDR output, GM_ADDR len_buf) {
    uint32_t n = *((__gm__ uint32_t *)len_buf);
    // V1: Manual cast from GM_ADDR — no compile-time type safety
    __gm__ float *in_ptr = (__gm__ float *)input;
    __gm__ uint32_t *idx_ptr = (__gm__ uint32_t *)index;
    __gm__ float *out_ptr = (__gm__ float *)output;

    for (uint32_t i = 0; i < n; i++) {
        uint32_t idx = idx_ptr[i];
        // V2: No bounds check on idx — attacker-controlled index
        // reads arbitrary memory within GM address space
        out_ptr[i] = in_ptr[idx];  // OOB if idx >= input_len
    }
}

ascend-rs gather (mitigated):

#[ascend_std::aiv_kernel]
pub unsafe fn gather(
    input: *const f32,   // V1 mitigated: typed pointer, not GM_ADDR
    index: *const u32,
    output: *mut f32,
    len: *const u32,
) {
    unsafe {
        let n = *len;
        let mut i = 0u32;
        loop {
            if i >= n { break; }  // Loop bounds explicit
            let idx = *index.wrapping_add(i as usize);
            // V2: wrapping_add is explicit about pointer arithmetic semantics
            // V3: no integer overflow — each offset cast individually
            *output.wrapping_add(i as usize) = *input.wrapping_add(idx as usize);
            i = i + 1;
        }
        // V4/V5: No FreeTensor — buffer IDs auto-managed
        // V6: No DMA/compute split — scalar ops on GM directly
    }
}

C.3 High-Risk Category: Convolution Kernels (34 kernels)

Convolution kernels have deeply nested loops with complex multi-dimensional index arithmetic (oc * in_ch * k_h * k_w + ic * k_h * k_w + kh * k_w + kw). A single wrong dimension in the index expression silently reads from wrong memory.

C++ AscendC conv2d index calculation (vulnerable):

// V2+V3: 6-level nested index arithmetic — easy to get a dimension wrong
for (int oc = 0; oc < out_ch; oc++) {
    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            float sum = 0.0f;
            for (int ic = 0; ic < in_ch; ic++) {
                for (int kh = 0; kh < k_h; kh++) {
                    for (int kw = 0; kw < k_w; kw++) {
                        int ih = oh * stride + kh * dilation;
                        int iw = ow * stride + kw * dilation;
                        // V3: 32-bit multiply chain can overflow
                        int in_idx = ic * in_h * in_w + ih * in_w + iw;
                        int w_idx = oc * in_ch * k_h * k_w
                                  + ic * k_h * k_w + kh * k_w + kw;
                        // V2: No bounds check — if ih >= in_h or iw >= in_w,
                        // reads out-of-bounds from GM
                        sum += (float)inLocal.GetValue(in_idx)
                             * (float)wLocal.GetValue(w_idx);
                    }
                }
            }
            outLocal.SetValue(oc * out_h * out_w + oh * out_w + ow, sum);
        }
    }
}

ascend-rs conv2d (mitigated):

#[ascend_std::aiv_kernel]
pub unsafe fn conv_standard_2d(
    input: *const f32, weight: *const f32, output: *mut f32,
    params: *const u32,  // [in_ch, out_ch, in_h, in_w, k_h, k_w, stride, dilation]
) {
    unsafe {
        // All params read from typed pointer — no GM_ADDR cast
        let in_ch = *params;
        let out_ch = *params.wrapping_add(1);
        // ... (read remaining params)
        let out_h = (in_h - (k_h - 1) * dilation - 1) / stride + 1;
        let out_w = (in_w - (k_w - 1) * dilation - 1) / stride + 1;

        let mut oc = 0u32;
        loop {
            if oc >= out_ch { break; }
            // ... nested loops with explicit bounds ...
            let ih = oh * stride + kh * dilation;
            let iw = ow * stride + kw * dilation;
            // V3 mitigated: wrapping semantics explicit via `as usize`
            // Debug builds panic on overflow, release wraps intentionally
            let in_idx = (ic * in_h * in_w + ih * in_w + iw) as usize;
            let w_idx = (oc * in_ch * k_h * k_w
                       + ic * k_h * k_w + kh * k_w + kw) as usize;
            sum = sum + *input.wrapping_add(in_idx) * *weight.wrapping_add(w_idx);
            // V4/V5: No FreeTensor needed
            // V6: No DMA — scalar GM access
        }
    }
}

C.4 High-Risk Category: Fused Operations (100 kernels)

Fused kernels (matmul+activation, conv+norm+activation, etc.) chain multiple pipeline stages. In C++, each stage requires its own AllocTensor/FreeTensor/pipe_barrier — missing any one produces silent data corruption.

C++ fused matmul+sigmoid (vulnerable):

// Fused matmul + sigmoid: C = sigmoid(A * B)
// V4: 4 tensors allocated/freed — each is a use-after-free opportunity
// V5: Copy-paste between fused variants can duplicate FreeTensor
// V6: 3 pipeline transitions (DMA→cube, cube→vector, vector→DMA)
//     — each requires pipe_barrier, forgetting any one = stale data

AscendC::LocalTensor<half> aLocal = inQueueA.AllocTensor<half>();
AscendC::DataCopy(aLocal, aGm, m * k);
inQueueA.EnQue(aLocal);
// V6: Need barrier here for DMA → cube
aLocal = inQueueA.DeQue<half>();

// ... matmul ...

inQueueA.FreeTensor(aLocal);
// V4: aLocal handle still valid — accidental read compiles and runs

AscendC::LocalTensor<float> cLocal = outQueue.AllocTensor<float>();
// V6: Need barrier here for cube → vector
AscendC::Muls(cLocal, cLocal, -1.0f, total);  // sigmoid step 1
AscendC::Exp(cLocal, cLocal, total);            // sigmoid step 2
// V6: Need inter-op barriers for in-place chained ops on 310P
AscendC::Adds(cLocal, cLocal, 1.0f, total);    // sigmoid step 3
AscendC::Reciprocal(cLocal, cLocal, total);     // sigmoid step 4
outQueue.FreeTensor(cLocal);

ascend-rs fused matmul+sigmoid (mitigated):

#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub unsafe fn fused_matmul_sigmoid(
    a: *const u16, b: *const u16, c: *mut f32, dims: *const u32,
) {
    unsafe {
        let m = *dims;
        let k = *dims.wrapping_add(1);
        let n = *dims.wrapping_add(2);

        // V6 mitigated: matmul_f16 handles DMA+cube internally
        ascend_std::kernel_ops::matmul_f16(c, a, b, m, k, n);
        ascend_std::ascend_pipe_barrier();  // Explicit, visible

        let total = m * n;
        let buf_c = ascend_std::ascend_buf_alloc(total);
        ascend_std::ascend_buf_load_f32(buf_c, c as *const f32, total);
        ascend_std::ascend_pipe_barrier();  // Explicit, visible

        // V6 mitigated: sigmoid_f32 includes ALL internal barriers
        // (muls → barrier → exp → barrier → adds → barrier → reciprocal)
        ascend_std::kernel_ops::sigmoid_f32(buf_c, buf_c, total);

        ascend_std::ascend_pipe_barrier();  // Explicit, visible
        ascend_std::ascend_buf_store_f32(c, buf_c, total);
        // V4/V5: No FreeTensor — buf_c auto-managed
    }
}
}

C.5 Vulnerability Tally: 300 Kernels x 6 Patterns

| Category | Kernels | V1 Type | V2 OOB | V3 Overflow | V4 UAF | V5 DblFree | V6 Sync | Total Exposures |
|---|---|---|---|---|---|---|---|---|
| Activation | 16 | 16 | 0 | 16 | 16 | 16 | 16 | 80 |
| Architecture | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 300 |
| Attention | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 90 |
| Broadcast | 10 | 10 | 0 | 10 | 10 | 10 | 10 | 50 |
| Convolution | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 204 |
| Fuse | 100 | 100 | 0 | 100 | 100 | 100 | 100 | 500 |
| Index | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 72 |
| Loss | 7 | 7 | 0 | 7 | 7 | 7 | 7 | 35 |
| Math | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 36 |
| Matmul | 17 | 17 | 0 | 17 | 17 | 17 | 17 | 85 |
| Normalization | 8 | 8 | 0 | 8 | 8 | 8 | 8 | 40 |
| Optimizer | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 25 |
| Pooling | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 36 |
| Reduce | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 25 |
| Resize | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 60 |
| **Total** | **300** | **300** | **133** | **300** | **300** | **300** | **300** | **1,633** |

C.6 How ascend-rs Eliminates Each Pattern

| Pattern | C++ Root Cause | ascend-rs Mitigation | Residual Risk |
|---|---|---|---|
| V1: Type erasure | `GM_ADDR = uint8_t*` for all tensors | Typed `*const f32` / `*const u16` in fn signatures | None (compile-time) |
| V2: Unchecked OOB | `GetValue(i)` / `SetValue(i,v)` with no bounds check | Vector intrinsics with explicit count `n`; scalar loops use `wrapping_add` | `unsafe` pointer arithmetic still unchecked at runtime |
| V3: Integer overflow | `blockIdx * perBlockLen` silent wraparound | `wrapping_mul` makes overflow explicit; debug builds panic | Developer must choose `wrapping_*` vs `checked_*` |
| V4: Use-after-free | `FreeTensor()` invalidates handle, C++ allows continued use | No `FreeTensor` API; buffer IDs are typed newtypes (`UbBuf`, `L1Buf`, etc.), not owning handles | None (API-level) |
| V5: Double-free | `FreeTensor()` called twice corrupts free list | No `FreeTensor` API; buffer lifecycle auto-managed | None (API-level) |
| V6: Missing sync | Manual `pipe_barrier()` between every pipeline transition | `kernel_ops` composites include all internal barriers; DMA barriers explicit and few | Developer must place DMA↔compute barriers (2 per kernel, not per-op) |

Net result: Of the 1,633 total vulnerability exposures across 300 kernels, ascend-rs eliminates 1,500 at the API/type level (V1, V4, V5 fully; V6 reduced from per-op to per-kernel). The remaining 133 OOB exposures (V2) are mitigated by replacing element-indexed access with whole-vector operations, though unsafe pointer arithmetic in scalar fallback kernels remains the programmer’s responsibility.


Appendix D: Ecosystem Integration — Workflows, Demos, and Vulnerability Prevention

The Python AI/ML ecosystem generates NPU kernel code through multiple paths: TileLang lowers Python DSL to AscendC C++, PyTorch’s torch.compile with an Ascend backend produces fused kernels, Triton’s Ascend backend lowers GPU-style tile programs, and PyPTO compiles its virtual ISA to AscendC. All four paths share a common failure mode: the generated C++ is compiled by bisheng with no awareness of target hardware constraints. ascend_compile sits between code generation and compilation, catching hardware-specific bugs before they reach the NPU.

D.1 The ascend_compile Integration Hub

The ascend_compile crate provides four integration interfaces, each suited to a different ecosystem role:

  1. **Rust API**: `ascend_compile::compile_kernel(source, &config)` for native Rust toolchains
  2. **C ABI**: `libascend_compile.so` with `extern "C"` functions (`ascend_compile_kernel`, `ascend_compile_config_new`, etc.) for embedding in C/C++ runtimes
  3. **CLI**: `ascend-compile kernel.cpp --soc Ascend910B3 --shared` for shell scripts and CI pipelines
  4. **Python wrapper**: `ascend_compile.py` (ctypes over the C ABI) for direct use in Python ML frameworks

Before invoking the bisheng compiler, ascend_compile runs three validation passes that scan the kernel source text:

                 C++ kernel source
                        |
                        v
          +-----------------------------+
          |  Pass 1: Entry Point Check  |
          |  __aicore__ present?        |
          +-----------------------------+
                        |
                        v
          +-----------------------------+
          |  Pass 2: DMA/Sync Barrier   |
          |  DataCopy without           |
          |  pipe_barrier()?            |
          |  310P → error               |
          |  910B → warning             |
          +-----------------------------+
                        |
                        v
          +-----------------------------+
          |  Pass 3: Buffer Size Check  |
          |  InitBuffer size vs target  |
          |  UB limit:                  |
          |  910B → 192KB (196608 B)    |
          |  310P → 256KB (262144 B)    |
          +-----------------------------+
                        |
                        v
               bisheng compilation
                        |
                        v
                  kernel binary

The Rust implementation of these three passes (crates/ascend_compile/src/validate.rs) operates entirely on string scanning — no compilation or parsing is needed. The validate_kernel() function returns a Vec<ValidationDiagnostic>, where each diagnostic carries a severity (Error or Warning) and an optional line number:

#![allow(unused)]
fn main() {
// crates/ascend_compile/src/validate.rs
pub fn validate_kernel(source: &str, target: AscendTarget) -> Vec<ValidationDiagnostic> {
    let mut diags = Vec::new();
    check_entry_point(source, &mut diags);       // Pass 1
    check_sync_barriers(source, target, &mut diags); // Pass 2
    check_buffer_sizes(source, target, &mut diags);  // Pass 3
    diags
}
}
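The passes themselves are simple enough to sketch. The following standalone Rust sketch (illustrative names, not the actual `validate.rs` internals) implements a Pass-3-style scan: find each `InitBuffer(...)` call, parse its byte-count argument when it is a plain integer literal, and compare it against the target's UB limit:

```rust
// Sketch of a Pass-3-style buffer-size check; identifiers are illustrative,
// not the real validate.rs internals.
#[derive(Debug, PartialEq)]
struct Diag {
    line: usize,
    msg: String,
}

/// UB limits from D.1: 910B -> 192 KB (196608 B), 310P -> 256 KB (262144 B).
fn ub_limit(soc: &str) -> usize {
    if soc.starts_with("Ascend910B") { 196_608 } else { 262_144 }
}

/// Scan each source line for `InitBuffer(queue, depth, bytes)` and compare
/// the last argument, when it is a plain integer literal, against the limit.
fn check_buffer_sizes(source: &str, soc: &str) -> Vec<Diag> {
    let limit = ub_limit(soc);
    let mut diags = Vec::new();
    for (i, line) in source.lines().enumerate() {
        if let Some(pos) = line.find("InitBuffer(") {
            let args = &line[pos + "InitBuffer(".len()..];
            // Take the last comma-separated argument, stripping `);`
            let last = args
                .trim_end_matches(|c| c == ';' || c == ')')
                .rsplit(',')
                .next()
                .unwrap_or("");
            if let Ok(bytes) = last.trim().parse::<usize>() {
                if bytes > limit {
                    diags.push(Diag {
                        line: i + 1,
                        msg: format!(
                            "InitBuffer size {bytes} bytes exceeds {soc} UB limit of {limit} bytes"
                        ),
                    });
                }
            }
        }
    }
    diags
}

fn main() {
    let src = "pipe.InitBuffer(inQueue, 1, 300000);\npipe.InitBuffer(outQueue, 1, 1024);";
    let diags = check_buffer_sizes(src, "Ascend910B3");
    assert_eq!(diags.len(), 1); // only the 300000-byte buffer exceeds 196608
    assert_eq!(diags[0].line, 1);
}
```

Expressions like `32 * sizeof(half)` fail the integer parse and are simply skipped, which keeps the sketch conservative; the real pass would need to evaluate or warn about such arguments.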

D.2 TileLang Integration

Note: The ascend_compile validation layer (D.1) works today on any C++ kernel source. The “ascend-rs mitigation” workflows described in D.2–D.5 are architectural designs showing how each tool could target Rust instead of C++. The Rust kernel examples compile through the MLIR backend, but the end-to-end integration (tool → Rust → MLIR → C++ → NPU) has not been implemented in any upstream tool. These sections describe a feasible path, not a shipped feature.

Workflow. TileLang generates AscendC C++ from its Python DSL through the LibraryGenerator.compile_lib() method, which internally runs subprocess.run(bisheng, ...). By replacing that final compilation step with ascend_compile.compile_kernel(), TileLang gains target-aware validation without modifying its code generation pipeline.

Demo — compiling a TileLang-generated matmul kernel with validation:

from ascend_compile import compile_kernel

# TileLang generates this C++ source from Python DSL
kernel_source = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void tilelang_matmul(
    GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace) {
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueA, inQueueB;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueC;
    pipe.InitBuffer(inQueueA, 1, 32 * sizeof(half));
    pipe.InitBuffer(inQueueB, 1, 32 * sizeof(half));
    pipe.InitBuffer(outQueueC, 1, 32 * sizeof(half));

    AscendC::GlobalTensor<half> aGm;
    aGm.SetGlobalBuffer((__gm__ half*)a);
    AscendC::LocalTensor<half> aLocal = inQueueA.AllocTensor<half>();
    // DMA load
    AscendC::DataCopy(aLocal, aGm, {1, 32, 0, 0});
    // compute — but no pipe_barrier between DMA and compute!
    AscendC::Mmad(cLocal, aLocal, bLocal, 16, 16, 16);
    // DMA store
    AscendC::DataCopy(cGm, cLocal, {1, 32, 0, 0});
}
'''

# Compile with validation — catches missing pipe_barrier!
try:
    binary = compile_kernel(
        kernel_source,
        soc="Ascend310P1",    # 310P requires explicit barriers
        shared=True,
        validate=True,
    )
except RuntimeError as e:
    print(f"Caught: {e}")
    # "validation failed:
    #   error: line 16: DMA operations found but no pipe_barrier/sync
    #   — required on Ascend310P1 (add pipe_barrier(PIPE_ALL)
    #     between DMA and compute)"

Vulnerability prevented. Without ascend_compile, TileLang’s bare subprocess.run(bisheng) would compile this kernel successfully. On 310P, the kernel would silently hang: DataCopy completes via the MTE2/MTE3 DMA pipelines, but the compute unit reads stale data from Unified Buffer because no pipe_barrier(PIPE_ALL) separates DMA from compute. The scalar pipeline sees old values, produces garbage output, and the kernel may never terminate. This is vulnerability pattern V6 (missing sync) from Appendix C. The 910B target has auto-sync support that can mask this bug, making it surface only on 310P hardware — exactly the kind of target-dependent failure that ascend_compile catches at compile time.

ascend-rs mitigation. While ascend_compile detects missing barriers, ascend-rs eliminates the vulnerability class entirely. In the safer workflow, TileLang’s Python DSL generates a Rust kernel instead of C++ — the ascend-rs codegen then produces C++ with barriers guaranteed by construction:

#![allow(unused)]
fn main() {
// Rust kernel: TileLang DSL → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn tilelang_softmax(input: *const f32, output: *mut f32, n_ptr: *const u32) {
    unsafe {
        let n = *n_ptr;
        let buf_in  = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        let work    = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();  // codegen also auto-inserts after DMA

        // kernel_ops::softmax_f32 has 4 embedded pipe_barrier() calls —
        // impossible to forget any of them
        ascend_std::kernel_ops::softmax_f32(buf_out, buf_in, work, n);

        ascend_std::ascend_pipe_barrier();  // codegen also auto-inserts before DMA
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}
}

The kernel_ops::softmax_f32 composite expands to ReduceMax → Adds → Exp → ReduceSum → Muls with a pipe_barrier(PIPE_ALL) between each step. Additionally, the MLIR→C++ codegen (mlir_to_cpp.rs) automatically inserts pipe_barrier(PIPE_ALL) after every DMA load and before every DMA store — providing a second layer of defense even if the programmer omits the explicit call. The result: synchronization bugs are structurally impossible in ascend-rs kernels, not merely detected.
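Numerically, that five-step sequence is the standard max-shifted softmax: subtract the row maximum before exponentiating so `exp` never overflows. A host-side sketch of the same arithmetic (plain Rust, not the device intrinsics) makes the data flow of the composite explicit:

```rust
/// Host-side sketch of the ReduceMax -> Adds -> Exp -> ReduceSum -> Muls
/// sequence that kernel_ops::softmax_f32 runs on the vector unit.
fn softmax_f32(x: &[f32]) -> Vec<f32> {
    // ReduceMax: subtracting the max keeps exp() from overflowing
    let m = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Adds(-max) then Exp, fused here into one pass
    let e: Vec<f32> = x.iter().map(|&v| (v - m).exp()).collect();
    // ReduceSum then Muls(1/sum)
    let s: f32 = e.iter().sum();
    e.iter().map(|&v| v / s).collect()
}

fn main() {
    let y = softmax_f32(&[1.0, 2.0, 3.0]);
    let total: f32 = y.iter().sum();
    assert!((total - 1.0).abs() < 1e-6); // probabilities sum to 1
    assert!(y[2] > y[1] && y[1] > y[0]); // monotone in the input
}
```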

D.3 PyTorch Integration

Workflow. torch.compile with an Ascend backend generates AscendC C++ for fused operator subgraphs. The backend calls ascend_compile via the C ABI (libascend_compile.so), which the Python wrapper ascend_compile.py binds through ctypes. This path is suitable for production deployment where the compilation service runs as a long-lived process.

Demo — catching a buffer overflow in a torch.compile-generated kernel:

import torch
from ascend_compile import compile_kernel

# torch.compile's Ascend backend generates AscendC C++ for a fused GELU.
# The code generator computed buffer sizes for a GPU with 48KB shared memory
# per SM, but the Ascend 910B UB is 192KB — and the generated size is wrong.
generated_cpp = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void gelu_kernel(
    GM_ADDR input, GM_ADDR output, GM_ADDR workspace) {
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueue;
    // torch.compile generated a 300KB buffer — exceeds 910B's 192KB UB!
    pipe.InitBuffer(inQueue, 1, 300000);
    pipe.InitBuffer(outQueue, 1, 300000);
    AscendC::GlobalTensor<float> inputGm;
    inputGm.SetGlobalBuffer((__gm__ float*)input);
    AscendC::LocalTensor<float> xLocal = inQueue.AllocTensor<float>();
    AscendC::DataCopy(xLocal, inputGm, {1, 64, 0, 0});
    pipe_barrier(PIPE_ALL);
    // ... GELU computation ...
}
'''

try:
    binary = compile_kernel(generated_cpp, soc="Ascend910B3")
except RuntimeError as e:
    print(f"Caught: {e}")
    # "validation failed:
    #   error: line 10: InitBuffer size 300000 bytes exceeds
    #   Ascend910B3 UB limit of 196608 bytes
    #   error: line 11: InitBuffer size 300000 bytes exceeds
    #   Ascend910B3 UB limit of 196608 bytes"

Vulnerability prevented. Without ascend_compile, a buffer size that exceeds the NPU’s Unified Buffer would compile without error — bisheng does not validate buffer sizes against hardware SRAM limits. At runtime, the kernel writes past physical SRAM boundaries, corrupting adjacent memory regions. On the Ascend NPU, the UB is partitioned across multiple AI Cores; an oversized buffer on one core can overwrite another core’s working data, causing silent data corruption across independent kernels. This is a hardware-level buffer overflow that no C++ compiler can catch. ascend_compile validates InitBuffer sizes against each target’s exact UB limit: 196,608 bytes (192KB) for 910B, 262,144 bytes (256KB) for 310P.

ascend-rs mitigation. In the safer workflow, torch.compile’s Ascend backend generates a Rust kernel instead of C++. Buffer management is handled through typed newtype IDs (UbBuf, L1Buf, L0aBuf, etc.) returned by ascend_buf_alloc() — not raw pointers, not FreeTensor handles. The newtypes prevent mixing buffer memory levels (e.g., passing an L0aBuf to a UB vector operation is a compile error). The codegen translates these IDs to AscendC TBuf<TPosition::VECCALC> objects with sizes computed from the kernel’s data flow analysis:

#![allow(unused)]
fn main() {
// Rust kernel: torch.compile → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn fused_gelu(input: *const f32, output: *mut f32, n_ptr: *const u32) {
    unsafe {
        let n = *n_ptr;
        // Typed buffer IDs (UbBuf) — no pointer arithmetic, no sizing errors
        let buf = ascend_std::ascend_buf_alloc(n);
        let tmp = ascend_std::ascend_buf_alloc(n);
        let work = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(buf, input, n);
        ascend_std::ascend_pipe_barrier();

        // GELU via composites: x * sigmoid(1.702 * x)
        ascend_std::kernel_ops::gelu_f32(tmp, buf, work, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, tmp, n);
    }
}
}

The codegen determines InitBuffer sizes from the kernel’s ascend_buf_alloc(n) calls and the target’s UB limit — if n elements exceed UB capacity, it can tile the computation automatically. No manual buffer size calculation is needed, and no raw byte count is passed to InitBuffer by the programmer. The result: buffer overflow is eliminated by design, not merely detected.
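The sizing decision described above can be sketched in a few lines (illustrative only, not the actual codegen): convert the element count to a byte count, and if it exceeds the UB limit, split the work into ceil-divided tiles:

```rust
/// Sketch of element-count-based sizing (illustrative, not the real codegen):
/// element count -> byte size -> (tile count, elements per tile) under a UB limit.
const UB_910B: usize = 196_608; // 192 KB Unified Buffer on 910B

fn f32_tiles(n_elems: usize, ub_limit: usize) -> (usize, usize) {
    let bytes = n_elems * std::mem::size_of::<f32>();
    if bytes <= ub_limit {
        (1, n_elems) // fits in a single pass
    } else {
        // Largest f32 count that fits in UB, then ceil-divide the work
        let per_tile = ub_limit / std::mem::size_of::<f32>();
        let tiles = (n_elems + per_tile - 1) / per_tile;
        (tiles, per_tile)
    }
}

fn main() {
    // 16K f32 = 64 KB: fits in the 192 KB UB
    assert_eq!(f32_tiles(16_384, UB_910B), (1, 16_384));
    // 100K f32 = 400 KB: must be tiled; 196608 / 4 = 49152 elems per tile
    assert_eq!(f32_tiles(100_000, UB_910B), (3, 49_152));
}
```

The point of the sketch: the programmer only ever states `n` elements, and the byte arithmetic against the physical limit happens in one place that knows the target.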

D.4 Triton Integration

Workflow. Triton’s Ascend backend lowers Triton IR (designed for GPU tile programs) to AscendC C++ source. The lowering must translate GPU concepts (thread blocks, shared memory, tl.load/tl.store) to NPU concepts (AI Core blocks, Unified Buffer, DataCopy). A common translation error is omitting the __aicore__ attribute, since GPU kernels use __global__ alone.

Demo — catching a missing entry point annotation:

from ascend_compile import compile_kernel

# Triton's Ascend backend lowered a vector_add kernel from GPU IR to AscendC C++.
# The GPU→NPU translation preserved __global__ but forgot __aicore__.
triton_generated = '''
#include "kernel_operator.h"
extern "C" __global__ void vector_add(    // Missing __aicore__!
    GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace) {
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueZ;
    pipe.InitBuffer(inQueueX, 1, 32768);
    pipe.InitBuffer(inQueueY, 1, 32768);
    pipe.InitBuffer(outQueueZ, 1, 32768);
    AscendC::GlobalTensor<float> xGm;
    xGm.SetGlobalBuffer((__gm__ float*)x);
    AscendC::LocalTensor<float> xLocal = inQueueX.AllocTensor<float>();
    AscendC::DataCopy(xLocal, xGm, {1, 64, 0, 0});
    pipe_barrier(PIPE_ALL);
    // ... vector add computation ...
}
'''

try:
    binary = compile_kernel(triton_generated, soc="Ascend910B3")
except RuntimeError as e:
    print(f"Caught: {e}")
    # "validation failed:
    #   error: no __aicore__ entry point found"

Vulnerability prevented. The __aicore__ attribute instructs bisheng to generate code for the NPU’s AI Core processor rather than the host ARM/x86 CPU. Without it, bisheng may compile the function with the wrong calling convention, wrong register allocation, and wrong instruction set. The resulting binary exists and loads onto the NPU, but executes with a host ABI on AI Core hardware — producing garbage results, corrupting the stack, or hanging the AI Core entirely. This is a silent, catastrophic failure: no error is raised, the kernel binary is valid ELF, but every computation is wrong. ascend_compile catches it with a single string scan before compilation begins.

ascend-rs mitigation. In the safer workflow, a Triton-Ascend backend lowers Triton IR to a Rust kernel marked with #[aiv_kernel]. The codegen unconditionally emits the correct MLIR attributes (hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>) and the C++ entry point with both __global__ and __aicore__:

#![allow(unused)]
fn main() {
// Rust kernel: Triton IR → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]  // ← triggers automatic __aicore__ in codegen
pub unsafe fn vector_add(
    x: *const f32, y: *const f32, z: *mut f32, n_ptr: *const u32,
) {
    unsafe {
        let n = *n_ptr;
        let bx = ascend_std::ascend_buf_alloc(n);
        let by = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(bx, x, n);
        ascend_std::ascend_buf_load_f32(by, y, n);
        ascend_std::ascend_pipe_barrier();

        ascend_std::ascend_add_f32(bx, bx, by, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(z, bx, n);
    }
}
}

The codegen in declare.rs detects the #[aiv_kernel] attribute and unconditionally adds the MLIR entry-point attributes. There is no code path where a Rust kernel function can be compiled without the __aicore__ annotation — the attribute is applied by the compiler, not by the programmer. This converts a human-error-prone annotation task into an automatic, toolchain-guaranteed property.

D.5 PyPTO Integration

Workflow. PyPTO defines a virtual ISA of approximately 90 tile-level instructions (pto.load, pto.matmul, pto.store, etc.) that compile to AscendC C++. PyPTO’s tile scheduler optimizes for throughput by using double-buffered tiles, which doubles the memory footprint. When the tile scheduler targets a GPU with abundant shared memory and the generated code is redirected to an NPU target with smaller SRAM, buffer sizes may exceed the physical Unified Buffer.

Demo — catching an oversized double-buffered allocation:

from ascend_compile import compile_kernel

# PyPTO generated C++ from tile-level Python operations:
# pto.load(tile_a) -> pto.matmul(tile_a, tile_b) -> pto.store(tile_c)
# The tile scheduler allocated 2 x 256KB for double-buffered tiles.
pypto_generated = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void pypto_tile_op(
    GM_ADDR input, GM_ADDR output, GM_ADDR workspace) {
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 2> inQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueue;
    // PyPTO allocated 256KB per buffer for double-buffered tiles
    // 2 buffers x 256KB = 512KB total — but 910B UB is only 192KB!
    pipe.InitBuffer(inQueue, 2, 256 * 1024);
    pipe.InitBuffer(outQueue, 1, 32768);

    AscendC::GlobalTensor<float> inputGm;
    inputGm.SetGlobalBuffer((__gm__ float*)input);
    AscendC::LocalTensor<float> aLocal = inQueue.AllocTensor<float>();
    AscendC::DataCopy(aLocal, inputGm, {1, 64, 0, 0});
    pipe_barrier(PIPE_ALL);
}
'''

try:
    binary = compile_kernel(pypto_generated, soc="Ascend910B3")
except RuntimeError as e:
    print(f"Caught: {e}")
    # "validation failed:
    #   error: line 10: InitBuffer size 262144 bytes exceeds
    #   Ascend910B3 UB limit of 196608 bytes"

Vulnerability prevented. PyPTO’s tile scheduler optimizes for throughput by maximizing buffer sizes, but has no knowledge of the target NPU’s physical SRAM capacity. Without target-aware validation, the compiled kernel would attempt to use more Unified Buffer than physically exists. On the Ascend NPU, UB is not virtualizable — there is no page fault mechanism, no swap space, and no memory protection between buffers within a single AI Core. An oversized InitBuffer causes the runtime to lay out buffers that overlap in physical SRAM, resulting in silent memory corruption where one pipeline stage’s DMA writes overwrite another stage’s compute data. ascend_compile catches this because it stores each target’s exact UB size: 196,608 bytes for 910B variants, 262,144 bytes for 310P variants.

ascend-rs mitigation. In the safer workflow, PyPTO’s tile-level operations map to ascend-rs kernel_ops composites. Buffer allocation uses ascend_buf_alloc(n) with element counts, not byte sizes — the codegen computes the physical InitBuffer byte count from the element count and data type, and validates it against the target’s UB limit during code generation:

#![allow(unused)]
fn main() {
// Rust kernel: PyPTO tile ops → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn pypto_tile_matmul(
    a: *const u16, b: *const u16, c: *mut f32, n_ptr: *const u32,
) {
    unsafe {
        let n = *n_ptr;
        // Typed buffer allocation — codegen maps to TBuf with correct TPosition
        let l1_a  = ascend_std::ascend_buf_alloc_l1(n);   // L1 buffer
        let l0a   = ascend_std::ascend_buf_alloc_l0a(n);  // L0A buffer (cube input A)
        let l0b   = ascend_std::ascend_buf_alloc_l0b(n);  // L0B buffer (cube input B)
        let l0c   = ascend_std::ascend_buf_alloc_l0c(n);  // L0C buffer (cube output)

        // Each alloc maps to a specific TBuf<TPosition::*> in codegen
        // L0A → TBuf<TPosition::A1>, L0B → TBuf<TPosition::B1>, etc.
        // Mixing positions is a compile error in the generated C++
        ascend_std::ascend_mmad_f16(l0c, l0a, l0b, n, n, n, 1);
    }
}
}

The codegen emits TBuf<TPosition::A1> for L0A, TBuf<TPosition::B1> for L0B, and TBuf<TPosition::CO1> for L0C — the AscendC type system enforces that L0A buffers cannot be passed to L0B operations, and vice versa. Combined with element-count-based allocation (not raw byte counts), buffer sizing errors are caught at code generation time rather than at hardware runtime. PyPTO’s tile scheduler can target ascend-rs kernels knowing that buffer position and size constraints are enforced by the type system.
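The newtype discipline behind this guarantee is ordinary Rust. A minimal sketch (the real ascend_std types differ in detail) shows why mixing memory levels cannot compile: each level is a distinct zero-cost wrapper, and every operation's signature names exactly the levels it accepts:

```rust
// Sketch of the buffer-ID newtype pattern (illustrative; the real
// ascend_std types differ). Each on-chip memory level is a distinct type.
#[derive(Clone, Copy)] struct UbBuf(u32);  // Unified Buffer
#[derive(Clone, Copy)] struct L0aBuf(u32); // L0A (cube input A)
#[derive(Clone, Copy)] struct L0bBuf(u32); // L0B (cube input B)
#[derive(Clone, Copy)] struct L0cBuf(u32); // L0C (cube output)

// A vector op accepts only UB buffers...
fn vector_add(dst: UbBuf, a: UbBuf, b: UbBuf, _n: u32) { let _ = (dst, a, b); }
// ...and a cube op accepts only L0 buffers in their proper roles.
fn mmad(dst: L0cBuf, a: L0aBuf, b: L0bBuf) { let _ = (dst, a, b); }

fn main() {
    let ub = UbBuf(0);
    vector_add(ub, ub, ub, 64);            // OK: all UB
    mmad(L0cBuf(3), L0aBuf(1), L0bBuf(2)); // OK: roles match
    // mmad(L0cBuf(3), L0bBuf(2), L0aBuf(1));
    //     ^ compile error: expected `L0aBuf`, found `L0bBuf`
    assert_eq!(std::mem::size_of::<UbBuf>(), 4); // zero-cost: just the u32 id
}
```

Because the wrappers carry no runtime data beyond the id, the check costs nothing at runtime; the entire enforcement happens in the type checker.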

D.6 Summary: Detection vs. Structural Mitigation

ascend_compile detects vulnerabilities in C++ code; ascend-rs eliminates the vulnerability class entirely. The following table contrasts both levels of defense:

| Tool | Vulnerability | ascend_compile Detection | ascend-rs Structural Mitigation |
|---|---|---|---|
| TileLang | V6: Missing sync barriers | Error on 310P if `DataCopy` without `pipe_barrier` | `kernel_ops` composites embed all barriers; codegen auto-inserts DMA barriers |
| PyTorch | Buffer size overflow | Error if `InitBuffer` > target UB limit | `ascend_buf_alloc(n)` uses element counts; codegen computes byte sizes |
| Triton | Missing `__aicore__` entry | Error if `__aicore__` not found in source | `#[aiv_kernel]` triggers unconditional `hacc.entry` attribute in codegen |
| PyPTO | Buffer exceeds UB limit | Error if `InitBuffer` > target UB limit | Typed `TBuf<TPosition::*>` positions; element-count allocation |

The two layers are complementary. ascend_compile validation operates on any C++ kernel source, regardless of origin — it protects the entire ecosystem today. ascend-rs mitigation goes further by making the vulnerability structurally impossible in kernels authored through its Rust→MLIR→C++ pipeline. Tools that adopt ascend-rs as their backend would get both layers automatically. As of this writing, ascend_compile validation is ready for integration; the ascend-rs Rust backend is an architectural option that tool developers could adopt in future versions.

The three validation passes are lightweight: simple string scans, with no compilation, parsing, or AST construction. The validate_kernel() function adds less than 1 ms to the compilation pipeline, even for large kernels. The payoff is disproportionate: on the NPU, a hung kernel produces no stack trace, no core dump, and no error message — only a timeout. ascend_compile converts these opaque runtime failures into actionable compile-time errors with line numbers and target-specific explanations.

D.7 Golden-Value Testing with PyTorch

Beyond compilation integration, PyTorch serves a second role in the ascend-rs ecosystem: verification. The generate.py script (tests/kernel_correctness/golden/generate.py) produces reference outputs for 72 test cases across 6 categories, using PyTorch and NumPy as the source of truth.

# tests/kernel_correctness/golden/generate.py (excerpt)
import torch
import torch.nn.functional as F

# Generate reference conv2d output with deterministic seed
rng = torch.manual_seed(42)
x = torch.randn(1, 3, 7, 7)
w = torch.randn(8, 3, 3, 3)
y = F.conv2d(x, w, stride=1, padding=0)
# -> conv_golden.json: loaded by `cargo test -p kernel_correctness`

The golden values cover all kernel categories that require non-trivial numerical verification:

| Category | Test Cases | Operations |
|---|---|---|
| Convolution | 16 | conv1d, conv2d, conv3d, depthwise, transposed |
| Index | 14 | argmax/min, gather, scatter, scatter_add, embedding, index_select, masked_fill |
| Pooling | 12 | max_pool1d/2d/3d, avg_pool1d/2d/3d |
| Matmul | 13 | transposed_a, transposed_b, transposed_both, lower/upper triangular |
| Resize | 8 | bilinear upsample, nearest upsample, trilinear, bilinear downsample |
| Misc | 9 | where_broadcast, logic_and, power, masked_cumsum, triplet_loss, lamb_update |
| **Total** | **72** | |

The Rust test harness (cargo test -p kernel_correctness) loads these JSON files, runs the corresponding ascend-rs kernel implementations on CPU, and compares outputs against PyTorch’s reference values with a tolerance of 1e-4 for floating-point operations.

Vulnerability prevention. Golden-value testing catches implementation errors that compile-time validation cannot: a gather kernel with an off-by-one index error (vulnerability pattern V2 from Appendix C) compiles cleanly and passes all three ascend_compile validation passes, but produces wrong outputs that diverge from PyTorch’s reference. The golden-value test catches it. Similarly, a conv2d kernel that accumulates in the wrong order (swapping input channel and spatial dimensions) produces numerically valid but semantically wrong results — only comparison against a reference implementation reveals the bug. By generating golden values from PyTorch — the same framework that most ML practitioners use — ascend-rs ensures that its kernel implementations match the numerical behavior that users expect from their models.
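The comparison step is easy to sketch. In the standalone example below (the real harness loads JSON golden files rather than inlining values), a correct gather and an off-by-one variant (pattern V2) both compile and run, but only the correct one stays within the 1e-4 tolerance:

```rust
/// Sketch of the golden-value check: max absolute difference vs reference.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// Correct gather: out[i] = data[idx[i]]
fn gather(data: &[f32], idx: &[usize]) -> Vec<f32> {
    idx.iter().map(|&i| data[i]).collect()
}

/// Off-by-one variant (pattern V2): reads data[idx[i] + 1] instead.
fn gather_off_by_one(data: &[f32], idx: &[usize]) -> Vec<f32> {
    idx.iter().map(|&i| data[i + 1]).collect()
}

fn main() {
    let data = [10.0, 20.0, 30.0, 40.0];
    let idx = [0usize, 2];
    let golden = [10.0, 30.0]; // reference values, e.g. from torch.gather

    // The correct kernel matches the golden output within tolerance...
    assert!(max_abs_diff(&gather(&data, &idx), &golden) < 1e-4);
    // ...while the off-by-one kernel compiles, runs, and is caught here.
    assert!(max_abs_diff(&gather_off_by_one(&data, &idx), &golden) > 1e-4);
}
```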


Appendix E: Complete Kernel Inventory

This appendix is auto-generated by scripts/generate_kernel_appendix.sh. Run bash scripts/generate_kernel_appendix.sh to regenerate.

Summary

| Metric | Count |
|---|---|
| Compiletest kernels | 486 |
| Deployable kernels | 19 |
| Total kernels | 505 |
| MultiKernelBench coverage | 300/300 (100%) |
| MKB categories covered | 15/15 (100%) |
| Memory safety vulnerability patterns | 6 classes (with attack examples) |

Vulnerability Pattern Legend

| ID | Vulnerability | C++ Root Cause | Rust Prevention | Attack Example |
|---|---|---|---|---|
| V1 | Type erasure | `GM_ADDR` erases all type info | Function signature encodes element type | case1 |
| V2 | Buffer overflow | `GetValue(i)` unchecked indexing | Buffer-ID API with explicit count | case2 |
| V3 | Integer overflow | Silent u32 wrap in offset calc | `wrapping_mul` makes overflow explicit | case6 |
| V4 | Use-after-free | `FreeTensor()` then stale access | No manual free in API | case3 |
| V5 | Double free | `FreeTensor()` called twice | No free operation exists | case5 |
| V6 | Missing sync | Forgotten `pipe_barrier()` | `kernel_ops` composites embed barriers | case4 |

Kernel Inventory by Category

Activation (17 kernels)

Applicable vulnerability patterns: V1 (type erasure), V2 (unchecked index), V6 (missing sync)

MKB reference: reference_kernels/activation/

Architecture (77 kernels)

Applicable vulnerability patterns: V1, V2, V3 (offset overflow), V6

MKB reference: reference_kernels/architecture/

| Kernel Function | Source File | MKB Reference | 910B3 Status |
|---|---|---|---|
| mlp_relu | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| mlp_gelu_bias | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| mlp_swish | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| ffn_prenorm | tests/compiletest/ui/arch_ops_kernel.rs | ffn_prenorm.py | PASS |
| down_proj | tests/compiletest/ui/arch_ops_kernel.rs | down_proj.py | PASS |
| attention_score_norm | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| rope_freq | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| embedding_scale | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| gated_residual | tests/compiletest/ui/arch_ops_kernel.rs | gated_residual.py | PASS |
| scaled_dot | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| classifier_head | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| regression_head | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| softmax_classifier | tests/compiletest/ui/arch_ops_kernel.rs | | PASS |
| mlp | tests/compiletest/ui/arch_ops_kernel.rs | mlp.py | PASS |
| deep_narrow_mlp | tests/compiletest/ui/arch_ops_kernel.rs | deep_narrow_mlp.py | PASS |
| shallow_wide_mlp | tests/compiletest/ui/arch_ops_kernel.rs | shallow_wide_mlp.py | PASS |
| vanilla_rnn | tests/compiletest/ui/arch_rnn_kernel.rs | vanilla_rnn.py | PASS |
| lstm_forget_gate | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| lstm_input_gate | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| lstm_cell_candidate | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| lstm_cell_update | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| lstm_output | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| gru_reset_gate | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| gru_update_gate | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| gru_candidate | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| gru_hidden_update | tests/compiletest/ui/arch_rnn_kernel.rs | | PASS |
| vanilla_rnn_hidden | tests/compiletest/ui/arch_rnn_kernel.rs | vanilla_rnn_hidden.py | PASS |
| lstm | tests/compiletest/ui/arch_rnn_kernel.rs | lstm.py | PASS |
| lstm_bidirectional | tests/compiletest/ui/arch_rnn_kernel.rs | lstm_bidirectional.py | PASS |
| lstm_cn | tests/compiletest/ui/arch_rnn_kernel.rs | lstm_cn.py | PASS |
| gru | tests/compiletest/ui/arch_rnn_kernel.rs | gru.py | PASS |
| gru_birectional | tests/compiletest/ui/arch_rnn_kernel.rs | gru_birectional.py | PASS |
| gru_bidirectional_hidden | tests/compiletest/ui/arch_rnn_kernel.rs | gru_bidirectional_hidden.py | PASS |
| gru_hidden | tests/compiletest/ui/arch_rnn_kernel.rs | gru_hidden.py | PASS |
| alexnet_fc | tests/compiletest/ui/arch_network_kernel.rs | alexnet_fc.py | PASS |
| vgg_fc | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| resnet_residual | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| densenet_block | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| mobilenet_pointwise | tests/compiletest/ui/arch_network_kernel.rs | mobilenet_pointwise.py | PASS |
| efficientnet_fc | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| inception_merge | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| squeezenet_fire | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| shufflenet_fc | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| regnet_stem | tests/compiletest/ui/arch_network_kernel.rs | regnet_stem.py | PASS |
| lenet_fc | tests/compiletest/ui/arch_network_kernel.rs | lenet_fc.py | PASS |
| unet_skip | tests/compiletest/ui/arch_network_kernel.rs | unet_skip.py | PASS |
| vit_mlp | tests/compiletest/ui/arch_network_kernel.rs | vit_mlp.py | PASS |
| swin_attention | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| mingpt_block | tests/compiletest/ui/arch_network_kernel.rs | mingpt_block.py | PASS |
| mlp_mixer | tests/compiletest/ui/arch_network_kernel.rs | mlp_mixer.py | PASS |
| mamba_ssm | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| densenet121 | tests/compiletest/ui/arch_network_kernel.rs | densenet121.py | PASS |
| densenet121_dense_block | tests/compiletest/ui/arch_network_kernel.rs | densenet121_dense_block.py | PASS |
| densenet121_transition_layer | tests/compiletest/ui/arch_network_kernel.rs | densenet121_transition_layer.py | PASS |
| densenet201 | tests/compiletest/ui/arch_network_kernel.rs | densenet201.py | PASS |
| efficientnet_b0 | tests/compiletest/ui/arch_network_kernel.rs | efficientnet_b0.py | PASS |
| efficientnet_b1 | tests/compiletest/ui/arch_network_kernel.rs | efficientnet_b1.py | PASS |
| efficientnet_b2 | tests/compiletest/ui/arch_network_kernel.rs | efficientnet_b2.py | PASS |
| resnet18 | tests/compiletest/ui/arch_network_kernel.rs | resnet18.py | PASS |
| resnet101 | tests/compiletest/ui/arch_network_kernel.rs | resnet101.py | PASS |
| resnet_basic_block | tests/compiletest/ui/arch_network_kernel.rs | resnet_basic_block.py | PASS |
| vgg16 | tests/compiletest/ui/arch_network_kernel.rs | vgg16.py | PASS |
| vgg19 | tests/compiletest/ui/arch_network_kernel.rs | vgg19.py | PASS |
| squeeze_net | tests/compiletest/ui/arch_network_kernel.rs | squeeze_net.py | PASS |
| squeeze_net_fire_module | tests/compiletest/ui/arch_network_kernel.rs | squeeze_net_fire_module.py | PASS |
| shufflenet | tests/compiletest/ui/arch_network_kernel.rs | shufflenet.py | PASS |
| shufflenet_unit | tests/compiletest/ui/arch_network_kernel.rs | shufflenet_unit.py | PASS |
| googlenet_inception_module | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| googlenet_inception_v1 | tests/compiletest/ui/arch_network_kernel.rs | | PASS |
| swin_mlp | tests/compiletest/ui/arch_network_kernel.rs | swin_mlp.py | PASS |
| swintransformer_v2 | tests/compiletest/ui/arch_network_kernel.rs | swintransformer_v2.py | PASS |
| mamba_return_final_state | tests/compiletest/ui/arch_network_kernel.rs | mamba_return_final_state.py | PASS |
| mamba_return_y | tests/compiletest/ui/arch_network_kernel.rs | mamba_return_y.py | PASS |
| convolutional_vision_transformer | tests/compiletest/ui/arch_network_kernel.rs | convolutional_vision_transformer.py | PASS |
| net_vlad_no_ghost_clusters | tests/compiletest/ui/arch_network_kernel.rs | net_vlad_no_ghost_clusters.py | PASS |
| net_vlad_with_ghost_clusters | tests/compiletest/ui/arch_network_kernel.rs | net_vlad_with_ghost_clusters.py | PASS |
| mobilenetv2_inverted | tests/compiletest/ui/arch_network_kernel.rs | mobilenetv2_inverted.py | PASS |

Attention (23 kernels)

Applicable vulnerability patterns: V1, V2, V3, V6 (multi-stage sync)

MKB reference: reference_kernels/attention/

| Kernel Function | Source File | MKB Reference | 910B3 Status |
|---|---|---|---|
| attention_softmax | tests/compiletest/ui/attention_kernel.rs | | PASS |
| residual_add_layernorm | tests/compiletest/ui/attention_kernel.rs | | PASS |
| residual_add_rmsnorm | tests/compiletest/ui/attention_kernel.rs | | PASS |
| swiglu | tests/compiletest/ui/attention_kernel.rs | swiglu.py | PASS |
| geglu | tests/compiletest/ui/attention_kernel.rs | | PASS |
| masked_fill | tests/compiletest/ui/attention_kernel.rs | masked_fill.py | PASS |
| causal_attention | tests/compiletest/ui/attention_extended_kernel.rs | | PASS |
| cross_attention | tests/compiletest/ui/attention_extended_kernel.rs | cross_attention.py | PASS |
| multi_query_attention | tests/compiletest/ui/attention_extended_kernel.rs | multi_query_attention.py | PASS |
| group_query_attention | tests/compiletest/ui/attention_extended_kernel.rs | group_query_attention.py | PASS |
| kv_cached_attention | tests/compiletest/ui/attention_extended_kernel.rs | | PASS |
| cross_modal_attention | tests/compiletest/ui/attention_extended_kernel.rs | cross_modal_attention.py | PASS |
| linear_attention | tests/compiletest/ui/attention_extended_kernel.rs | linear_attention.py | PASS |
| sparse_attention | tests/compiletest/ui/attention_extended_kernel.rs | sparse_attention.py | PASS |
| windowed_causal_attention | tests/compiletest/ui/attention_extended_kernel.rs | windowed_causal_attention.py | PASS |
| min_gpt_causal_attention | tests/compiletest/ui/attention_extended_kernel.rs | min_gpt_causal_attention.py | PASS |
| relu_self_attention | tests/compiletest/ui/attention_extended_kernel.rs | relu_self_attention.py | PASS |
| vision_attention | tests/compiletest/ui/attention_extended_kernel.rs | vision_attention.py | PASS |
| scaled_dot_product_attention | tests/compiletest/ui/attention_extended_kernel.rs | scaled_dot_product_attention.py | PASS |
| sdpa_inference | tests/compiletest/ui/attention_extended_kernel.rs | sdpa_inference.py | PASS |
| sdpa_long_context | tests/compiletest/ui/attention_extended_kernel.rs | sdpa_long_context.py | PASS |
| kv_cached_chat_batch_attention | tests/compiletest/ui/attention_extended_kernel.rs | kv_cached_chat_batch_attention.py | PASS |
| kv_cached_speculative_attention | tests/compiletest/ui/attention_extended_kernel.rs | kv_cached_speculative_attention.py | PASS |

Broadcast (12 kernels)

Applicable vulnerability patterns: V1 (type erasure), V2 (bounds), V5 (double free)

MKB reference: reference_kernels/broadcast/

Convolution (34 kernels)

Applicable vulnerability patterns: V2 (nested loop OOB), V3 (stride*index overflow)

MKB reference: reference_kernels/convolution/

| Kernel Function | Source File | MKB Reference | 910B3 Status |
|---|---|---|---|
| conv_standard_1d | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_1d.py | PASS |
| conv_standard_1d_dilated_strided | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_1d_dilated_strided.py | PASS |
| conv_standard_2d_square_square | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_2d_square_square.py | PASS |
| conv_standard_2d_asym_square | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_2d_asym_square.py | PASS |
| conv_standard_2d_square_asym | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_2d_square_asym.py | PASS |
| conv_standard_2d_asym_asym | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_2d_asym_asym.py | PASS |
| conv_standard_2d_dilated_padded | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_2d_dilated_padded.py | PASS |
| conv_standard_3d_square_square | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_3d_square_square.py | PASS |
| conv_standard_3d_asym_square | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_3d_asym_square.py | PASS |
| conv_standard_3d_square_asym | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_3d_square_asym.py | PASS |
| conv_standard_3d_asym_asym | tests/compiletest/ui/conv_standard_kernel.rs | conv_standard_3d_asym_asym.py | PASS |
| conv_depthwise_2d_sq_sq | tests/compiletest/ui/conv_depthwise_kernel.rs | conv_depthwise_2d_sq_sq.py | PASS |
| conv_depthwise_2d_asym_sq | tests/compiletest/ui/conv_depthwise_kernel.rs | conv_depthwise_2d_asym_sq.py | PASS |
| conv_depthwise_2d_sq_asym | tests/compiletest/ui/conv_depthwise_kernel.rs | conv_depthwise_2d_sq_asym.py | PASS |
| conv_depthwise_2d_asym_asym | tests/compiletest/ui/conv_depthwise_kernel.rs | conv_depthwise_2d_asym_asym.py | PASS |
| conv_depthwise_separable_2d | tests/compiletest/ui/conv_depthwise_kernel.rs | conv_depthwise_separable_2d.py | PASS |
| conv_pointwise_2d | tests/compiletest/ui/conv_depthwise_kernel.rs | conv_pointwise_2d.py | PASS |
| conv_transposed_1d | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_1d.py | PASS |
| conv_transposed_1d_dilated | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_1d_dilated.py | PASS |
| conv_transposed_1d_asym_padded_strided_dilated | tests/compiletest/ui/conv_transpose_kernel.rs | | PASS |
| conv_transposed_2d_sq_sq | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_2d_sq_sq.py | PASS |
| conv_transposed_2d_sq_asym | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_2d_sq_asym.py | PASS |
| conv_transposed_2d_asym_sq | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_2d_asym_sq.py | PASS |
| conv_transposed_2d_asym_asym | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_2d_asym_asym.py | PASS |
| conv_transposed_2d_asym_asym_padded | tests/compiletest/ui/conv_transpose_kernel.rs | | PASS |
| conv_transposed_2d_dilated_padded_strided | tests/compiletest/ui/conv_transpose_kernel.rs | | PASS |
| conv_transposed_2d_grouped | tests/compiletest/ui/conv_transpose_kernel.rs | | PASS |
| conv_transposed_3d_sq_sq | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_3d_sq_sq.py | PASS |
| conv_transposed_3d_sq_asym | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_3d_sq_asym.py | PASS |
| conv_transposed_3d_asym_sq | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_3d_asym_sq.py | PASS |
| conv_transposed_3d_asym_asym | tests/compiletest/ui/conv_transpose_kernel.rs | conv_transposed_3d_asym_asym.py | PASS |
| conv_transposed_3d_asym_sq_grouped | tests/compiletest/ui/conv_transpose_kernel.rs | | PASS |
| conv_transposed_3d_asym_asym_grouped | tests/compiletest/ui/conv_transpose_kernel.rs | | PASS |
| conv_transposed_3d_sq_sq_dilated | tests/compiletest/ui/conv_transpose_kernel.rs | | PASS |

Fuse (120 kernels)

Applicable vulnerability patterns: V1, V2, V4 (use-after-free in chain), V6 (inter-op sync)

MKB reference: reference_kernels/fuse/

| Kernel Function | Source File | MKB Reference | 910B3 Status |
|---|---|---|---|
| fused_relu_hardswish | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_relu_hardswish.py | PASS |
| fused_hardswish_relu | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_hardswish_relu.py | PASS |
| fused_mish_mish | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_mish_mish.py | PASS |
| fused_mish_tanh | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_mish_tanh.py | PASS |
| fused_min_tanh_tanh | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_min_tanh_tanh.py | PASS |
| fused_mul_leakyrelu_gelu | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_mul_leakyrelu_gelu.py | PASS |
| fused_sub_tanh_sub | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_sub_tanh_sub.py | PASS |
| fused_sigmoid_sum | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_sigmoid_sum.py | PASS |
| fused_add_scale_sigmoid | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_add_scale_sigmoid.py | PASS |
| fused_scale_min | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_scale_min.py | PASS |
| fused_leakyrelu_leakyrelu_gelu_gelu | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_leakyrelu_leakyrelu_gelu_gelu.py | PASS |
| fused_divide_leakyrelu | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_divide_leakyrelu.py | PASS |
| fused_sub_hardswish | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_sub_hardswish.py | PASS |
| fused_tanh_scale_bias_max | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_tanh_scale_bias_max.py | PASS |
| fused_relu_bias_add | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_relu_bias_add.py | PASS |
| fused_hardswish_relu_softmax_mean | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_hardswish_relu_softmax_mean.py | PASS |
| fused_leakyrelu_clamp_gelu | tests/compiletest/ui/fused_activation_chain_kernel.rs | fused_leakyrelu_clamp_gelu.py | PASS |
| fused_norm_add_mul | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_norm_add_mul.py | PASS |
| fused_scale_norm | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_scale_norm.py | PASS |
| fused_sub_mish_mish | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_sub_mish_mish.py | PASS |
| fused_sub_tanh_sub_mean | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_sub_tanh_sub_mean.py | PASS |
| fused_min_add_mul | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_min_add_mul.py | PASS |
| fused_elu_scale | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_elu_scale.py | PASS |
| fused_selu_add | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_selu_add.py | PASS |
| fused_softplus_tanh | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_softplus_tanh.py | PASS |
| fused_relu_scale_add | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_relu_scale_add.py | PASS |
| fused_sigmoid_gate | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_sigmoid_gate.py | PASS |
| fused_exp_reduce_sum | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_exp_reduce_sum.py | PASS |
| log_sum_exp | tests/compiletest/ui/fused_multi_op_kernel.rs | log_sum_exp.py | PASS |
| fused_max_lse_relu | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_max_lse_relu.py | PASS |
| fused_hardswish_gelu | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_hardswish_gelu.py | PASS |
| fused_softsign_scale_add | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_softsign_scale_add.py | PASS |
| fused_hardsigmoid_scale_clamp | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_hardsigmoid_scale_clamp.py | PASS |
| fused_abs_sum | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_abs_sum.py | PASS |
| fused_rmsnorm_mish_scale | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_rmsnorm_mish_scale.py | PASS |
| fused_reciprocal_scale_add | tests/compiletest/ui/fused_multi_op_kernel.rs | fused_reciprocal_scale_add.py | PASS |
| fused_layernorm_relu | tests/compiletest/ui/fused_norm_activation_kernel.rs | fused_layernorm_relu.py | PASS |
| fused_layernorm_sigmoid | tests/compiletest/ui/fused_norm_activation_kernel.rs | fused_layernorm_sigmoid.py | PASS |
| fused_rmsnorm_swish | tests/compiletest/ui/fused_norm_activation_kernel.rs | fused_rmsnorm_swish.py | PASS |
| fused_layernorm_tanh_hardswish | tests/compiletest/ui/fused_norm_activation_kernel.rs | fused_layernorm_tanh_hardswish.py | PASS |
| fused_softmax_mean | tests/compiletest/ui/fused_norm_activation_kernel.rs | fused_softmax_mean.py | PASS |
| fused_layernorm_gelu | tests/compiletest/ui/fused_norm_activation_kernel.rs | fused_layernorm_gelu.py | PASS |
| fused_rmsnorm_gelu | tests/compiletest/ui/fused_norm_activation_kernel.rs | fused_rmsnorm_gelu.py | PASS |
| fused_log_softmax_mean | tests/compiletest/ui/fused_norm_activation_kernel.rs | fused_log_softmax_mean.py | PASS |
| test_sigmoid | tests/compiletest/ui/composite_ops_kernel.rs | | PASS |
| test_tanh | tests/compiletest/ui/composite_ops_kernel.rs | | PASS |
| test_gelu | tests/compiletest/ui/composite_ops_kernel.rs | | PASS |
| test_softmax | tests/compiletest/ui/composite_ops_kernel.rs | | PASS |
| conv2d_activation_batch_norm | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_activation_batch_norm.py | PASS |
| conv2d_add_scale_sigmoid_group_norm | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_add_scale_sigmoid_group_norm.py | PASS |
| conv2d_avg_pool_sigmoid_sum | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_avg_pool_sigmoid_sum.py | PASS |
| conv2d_batch_norm_scaling | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_batch_norm_scaling.py | PASS |
| conv2d_gelu_global_avg_pool | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_gelu_global_avg_pool.py | PASS |
| conv2d_group_norm_scale_max_pool_clamp | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_group_norm_scale_max_pool_clamp.py | PASS |
| conv2d_group_norm_tanh_hard_swish_residual_add_log_sum_exp | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_group_norm_tanh_hard_swish_residual_add_log_sum_exp.py | PASS |
| conv2d_instance_norm_divide | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_instance_norm_divide.py | PASS |
| conv2d_subtract_hard_swish_max_pool_mish | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_subtract_hard_swish_max_pool_mish.py | PASS |
| conv2d_subtract_subtract_mish | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_subtract_subtract_mish.py | PASS |
| conv2d_subtract_tanh_subtract_avg_pool | tests/compiletest/ui/fused_conv2d_ext_kernel.rs | conv2d_subtract_tanh_subtract_avg_pool.py | PASS |
| conv3d_divide_max_global_avg_pool_bias_add_sum | tests/compiletest/ui/fused_conv3d_ext_kernel.rs | conv3d_divide_max_global_avg_pool_bias_add_sum.py | PASS |
| conv3d_leaky_relu_sum_clamp_gelu | tests/compiletest/ui/fused_conv3d_ext_kernel.rs | conv3d_leaky_relu_sum_clamp_gelu.py | PASS |
| conv3d_multiply_instance_norm_clamp_multiply_max | tests/compiletest/ui/fused_conv3d_ext_kernel.rs | conv3d_multiply_instance_norm_clamp_multiply_max.py | PASS |
| conv3d_relu_leaky_relu_gelu_sigmoid_bias_add | tests/compiletest/ui/fused_conv3d_ext_kernel.rs | conv3d_relu_leaky_relu_gelu_sigmoid_bias_add.py | PASS |
| conv3d_scaling_tanh_multiply_sigmoid | tests/compiletest/ui/fused_conv3d_ext_kernel.rs | conv3d_scaling_tanh_multiply_sigmoid.py | PASS |
| conv3d_softmax_max_pool_max_pool | tests/compiletest/ui/fused_conv3d_ext_kernel.rs | conv3d_softmax_max_pool_max_pool.py | PASS |
| conv_transpose2d_add_min_gelu_multiply | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | conv_transpose2d_add_min_gelu_multiply.py | PASS |
| conv_transpose2d_bias_add_clamp_scaling_clamp_divide | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | conv_transpose2d_bias_add_clamp_scaling_clamp_divide.py | PASS |
| conv_transpose2d_gelu_group_norm | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | conv_transpose2d_gelu_group_norm.py | PASS |
| conv_transpose2d_max_pool_hardtanh_mean_tanh | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | conv_transpose2d_max_pool_hardtanh_mean_tanh.py | PASS |
| conv_transpose2d_min_sum_gelu_add | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | conv_transpose2d_min_sum_gelu_add.py | PASS |
| conv_transpose2d_mish_add_hardtanh_scaling | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | conv_transpose2d_mish_add_hardtanh_scaling.py | PASS |
| conv_transpose2d_multiply_global_avg_pool_global_avg_pool_mean | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | conv_transpose2d_multiply_global_avg_pool_global_avg_pool_mean.py | PASS |
| conv_transpose2d_subtract_tanh | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | conv_transpose2d_subtract_tanh.py | PASS |
| convtranspose2d_batchnorm_tanh_maxpool_groupnorm | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | convtranspose2d_batchnorm_tanh_maxpool_groupnorm.py | PASS |
| convtranspose2d_globalavgpool_biasadd_logsumexp_sum_multiply | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | convtranspose2d_globalavgpool_biasadd_logsumexp_sum_multiply.py | PASS |
| convtranspose2d_softmax_biasadd_scaling_sigmoid | tests/compiletest/ui/fused_conv_transpose2d_kernel.rs | convtranspose2d_softmax_biasadd_scaling_sigmoid.py | PASS |
| conv_transpose3d_add_hard_swish | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_add_hard_swish.py | PASS |
| conv_transpose3d_avg_pool_clamp_softmax_multiply | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_avg_pool_clamp_softmax_multiply.py | PASS |
| conv_transpose3d_batch_norm_avg_pool_avg_pool | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_batch_norm_avg_pool_avg_pool.py | PASS |
| conv_transpose3d_batch_norm_subtract | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_batch_norm_subtract.py | PASS |
| conv_transpose3d_clamp_min_divide | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_clamp_min_divide.py | PASS |
| conv_transpose3d_layer_norm_gelu_scaling | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_layer_norm_gelu_scaling.py | PASS |
| conv_transpose3d_leaky_relu_multiply_leaky_relu_max | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_leaky_relu_multiply_leaky_relu_max.py | PASS |
| conv_transpose3d_log_sum_exp_hard_swish_subtract_clamp_max | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_log_sum_exp_hard_swish_subtract_clamp_max.py | PASS |
| conv_transpose3d_max_max_sum | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_max_max_sum.py | PASS |
| conv_transpose3d_max_pool_softmax_subtract_swish_max | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_max_pool_softmax_subtract_swish_max.py | PASS |
| conv_transpose3d_multiply_max_global_avg_pool_clamp | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_multiply_max_global_avg_pool_clamp.py | PASS |
| conv_transpose3d_scale_batch_norm_global_avg_pool | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_scale_batch_norm_global_avg_pool.py | PASS |
| conv_transpose3d_scaling_avg_pool_bias_add_scaling | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_scaling_avg_pool_bias_add_scaling.py | PASS |
| conv_transpose3d_softmax_sigmoid | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_softmax_sigmoid.py | PASS |
| conv_transpose3d_sum_layer_norm_avg_pool_gelu | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_sum_layer_norm_avg_pool_gelu.py | PASS |
| conv_transpose3d_sum_residual_add_multiply_residual_add | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_sum_residual_add_multiply_residual_add.py | PASS |
| conv_transpose3d_swish_group_norm_hard_swish | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | conv_transpose3d_swish_group_norm_hard_swish.py | PASS |
| convtranspose3d_mean_add_softmax_tanh_scaling | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | convtranspose3d_mean_add_softmax_tanh_scaling.py | PASS |
| convtranspose3d_relu_groupnorm | tests/compiletest/ui/fused_conv_transpose3d_kernel.rs | convtranspose3d_relu_groupnorm.py | PASS |
| gemm_add_relu | tests/compiletest/ui/fused_gemm_ext_kernel.rs | gemm_add_relu.py | PASS |
| gemm_batch_norm_gelu_group_norm_mean_relu | tests/compiletest/ui/fused_gemm_ext_kernel.rs | gemm_batch_norm_gelu_group_norm_mean_relu.py | PASS |
| gemm_batch_norm_scaling_softmax | tests/compiletest/ui/fused_gemm_ext_kernel.rs | gemm_batch_norm_scaling_softmax.py | PASS |
| gemm_log_sum_exp_leaky_relu_leaky_relu_gelu_gelu | tests/compiletest/ui/fused_gemm_ext_kernel.rs | gemm_log_sum_exp_leaky_relu_leaky_relu_gelu_gelu.py | PASS |
| gemm_sigmoid_sum_log_sum_exp | tests/compiletest/ui/fused_gemm_ext_kernel.rs | gemm_sigmoid_sum_log_sum_exp.py | PASS |
| gemm_subtract_global_avg_pool_log_sum_exp_gelu_residual_add | tests/compiletest/ui/fused_gemm_ext_kernel.rs | gemm_subtract_global_avg_pool_log_sum_exp_gelu_residual_add.py | PASS |
| matmul_avg_pool_gelu_scale_max | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_avg_pool_gelu_scale_max.py | PASS |
| matmul_batch_norm_bias_add_divide_swish | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_batch_norm_bias_add_divide_swish.py | PASS |
| matmul_dropout_mean_softmax | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_dropout_mean_softmax.py | PASS |
| matmul_scale_residual_add_clamp_log_sum_exp_mish | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_scale_residual_add_clamp_log_sum_exp_mish.py | PASS |
| matmul_scaling_residual_add | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_scaling_residual_add.py | PASS |
| matmul_sigmoid_sum | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_sigmoid_sum.py | PASS |
| matmul_subtract_multiply_relu | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_subtract_multiply_relu.py | PASS |
| matmul_sum_max_avg_pool_log_sum_exp_log_sum_exp | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_sum_max_avg_pool_log_sum_exp_log_sum_exp.py | PASS |
| matmul_swish_scaling | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_swish_scaling.py | PASS |
| matmul_swish_sum_group_norm | tests/compiletest/ui/fused_matmul_ext_kernel.rs | matmul_swish_sum_group_norm.py | PASS |
| bmm_instance_norm_sum_residual_add_multiply | tests/compiletest/ui/fused_matmul_ext_kernel.rs | bmm_instance_norm_sum_residual_add_multiply.py | PASS |
| fused_gemm_norm_gelu | tests/compiletest/ui/fused_matmul_norm_kernel.rs | | PASS |
| fused_gemm_norm_scale_softmax | tests/compiletest/ui/fused_matmul_norm_kernel.rs | | PASS |
| fused_gemm_scale_norm | tests/compiletest/ui/fused_matmul_norm_kernel.rs | | PASS |
| fused_gemm_norm_hardtanh | tests/compiletest/ui/fused_matmul_norm_kernel.rs | | PASS |
| fused_gemm_norm_swish_mul_swish | tests/compiletest/ui/fused_matmul_norm_kernel.rs | | PASS |
| fused_gemm_bias_hardtanh_mish_norm | tests/compiletest/ui/fused_matmul_norm_kernel.rs | | PASS |
| gemm_scale_batch_norm | tests/compiletest/ui/fused_matmul_norm_kernel.rs | gemm_scale_batch_norm.py | PASS |
| gemm_scale_batchnorm | tests/compiletest/ui/fused_matmul_norm_kernel.rs | gemm_scale_batchnorm.py | PASS |

Index (12 kernels)

Applicable vulnerability patterns: V2 (gather/scatter OOB), V3 (index calc overflow)

MKB reference: reference_kernels/index/

Loss (6 kernels)

Applicable vulnerability patterns: V1, V2, V6 (reduction sync)

MKB reference: reference_kernels/loss/

Math (5 kernels)

Applicable vulnerability patterns: V2 (cumulative bounds), V3 (offset overflow)

MKB reference: reference_kernels/math/

Matmul (23 kernels)

Applicable vulnerability patterns: V1 (type erasure f16/f32), V2 (tile bounds), V3 (dim overflow), V6 (cube sync)

MKB reference: reference_kernels/matmul/

| Kernel Function | Source File | MKB Reference | 910B3 Status |
|---|---|---|---|
| matmul | tests/compiletest/ui/matmul_kernel.rs | matmul.py | PASS |
| matmul_standard | tests/compiletest/ui/matmul_ops_kernel.rs | matmul_standard.py | PASS |
| matmul_square | tests/compiletest/ui/matmul_ops_kernel.rs | matmul_square.py | PASS |
| matmul_matvec | tests/compiletest/ui/matmul_ops_kernel.rs | matmul_matvec.py | PASS |
| matmul_large_k | tests/compiletest/ui/matmul_ops_kernel.rs | matmul_large_k.py | PASS |
| matmul_small_k | tests/compiletest/ui/matmul_ops_kernel.rs | matmul_small_k.py | PASS |
| matmul_irregular | tests/compiletest/ui/matmul_ops_kernel.rs | matmul_irregular.py | PASS |
| matmul_tall_skinny | tests/compiletest/ui/matmul_ops_kernel.rs | matmul_tall_skinny.py | PASS |
| matmul_transposed_a | tests/compiletest/ui/matmul_transpose_kernel.rs | | PASS |
| matmul_transposed_b | tests/compiletest/ui/matmul_transpose_kernel.rs | | PASS |
| matmul_transposed_both | tests/compiletest/ui/matmul_transpose_kernel.rs | | PASS |
| matmul_lower_triangular | tests/compiletest/ui/matmul_transpose_kernel.rs | | PASS |
| matmul_upper_triangular | tests/compiletest/ui/matmul_transpose_kernel.rs | | PASS |
| matmul_batched | tests/compiletest/ui/matmul_extended_kernel.rs | matmul_batched.py | PASS |
| matmul_symmetric | tests/compiletest/ui/matmul_extended_kernel.rs | matmul_symmetric.py | PASS |
| matmul_bias | tests/compiletest/ui/matmul_extended_kernel.rs | matmul_bias.py | PASS |
| matmul_scaled | tests/compiletest/ui/matmul_extended_kernel.rs | matmul_scaled.py | PASS |
| gemm_full | tests/compiletest/ui/matmul_extended_kernel.rs | gemm_full.py | PASS |
| matmul_wide | tests/compiletest/ui/matmul_extended_kernel.rs | matmul_wide.py | PASS |
| matmul_relu_matmul | tests/compiletest/ui/matmul_extended_kernel.rs | matmul_relu_matmul.py | PASS |
| matmul_accumulate | tests/compiletest/ui/matmul_extended_kernel.rs | matmul_accumulate.py | PASS |
| matmul_diag_scale | tests/compiletest/ui/matmul_extended_kernel.rs | matmul_diag_scale.py | PASS |
| outer_product | tests/compiletest/ui/matmul_extended_kernel.rs | | PASS |

Normalization (10 kernels)

Applicable vulnerability patterns: V1, V2, V6 (reduce-normalize sync)

MKB reference: reference_kernels/normalization/

Optimizer (6 kernels)

Applicable vulnerability patterns: V1, V2 (param bounds), V4 (in-place update UAF)

MKB reference: reference_kernels/optimizer/

Pooling (12 kernels)

Applicable vulnerability patterns: V2 (window OOB), V3 (stride overflow)

MKB reference: reference_kernels/pooling/

Reduce (5 kernels)

Applicable vulnerability patterns: V1, V2, V6 (reduction pipeline sync)

MKB reference: reference_kernels/reduce/

Resize (15 kernels)

Applicable vulnerability patterns: V2 (interpolation OOB), V3 (coordinate overflow)

MKB reference: reference_kernels/resize/

Tiled (16 kernels)

Applicable vulnerability patterns: V2 (tile boundary OOB), V6 (tile-boundary sync)

Multiblock (16 kernels)

Applicable vulnerability patterns: V2 (block partition OOB), V6 (cross-block sync)

F16 (14 kernels)

Applicable vulnerability patterns: V1 (f16/f32 type confusion)

Unary_math (8 kernels)

Applicable vulnerability patterns: V1, V2

Deployable Kernels (with host code)

| Kernel | Source File | Purpose |
|---|---|---|
| ? | examples/bench_softmax_tile/kernels/src/lib.rs | Deployable kernel |
| scale_f16 | examples/mha/kernels/src/lib.rs | Multi-head attention (f16 scale + softmax) |
| softmax_rows_f16 | examples/mha/kernels/src/lib.rs | Multi-head attention (f16 scale + softmax) |
| ? | examples/acl_vec_mul/kernels/src/lib.rs | Vector multiplication example |
| mul | examples/acl_vec_mul/kernels/src/lib.rs | Vector multiplication example |
| tile_softmax | examples/tile_softmax/kernels/src/lib.rs | Deployable kernel |
| ? | examples/tile_matmul/kernels/src/lib.rs | Deployable kernel |
| softmax | examples/bench_softmax_rs/kernels/src/lib.rs | Softmax benchmark (Rust) |
| add | examples/acl_rs_vec_add/kernels/src/lib.rs | Vector addition end-to-end example |
| test_store_const | examples/acl_softmax/kernels/src/lib.rs | Softmax with store/copy test kernels |
| test_copy | examples/acl_softmax/kernels/src/lib.rs | Softmax with store/copy test kernels |
| softmax | examples/acl_softmax/kernels/src/lib.rs | Softmax with store/copy test kernels |
| vec_add | examples/bench_vec_add_rs/kernels/src/lib.rs | Vector add benchmark (Rust) |
| matmul | examples/bench_matmul_rs/kernels/src/lib.rs | Matrix multiply benchmark (Rust) |

Memory Safety Case Studies

Each case pairs a vulnerable C++ kernel with a structurally safe Rust kernel.

| Case | Vulnerability | C++ File | Rust File |
|---|---|---|---|
| 1 | Type confusion (GM_ADDR type erasure) | vulnerable.cpp | safe.rs |
| 2 | Buffer overflow (unchecked indexing) | vulnerable.cpp | safe.rs |
| 3 | Use-after-free (FreeTensor then access) | vulnerable.cpp | safe.rs |
| 4 | Missing sync (forgotten pipe_barrier) | vulnerable.cpp | safe.rs |
| 5 | Double free (repeated FreeTensor) | vulnerable.cpp | safe.rs |
| 6 | Integer overflow (silent offset wrap) | vulnerable.cpp | safe.rs |
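Cases 3 and 5 share one mechanism: ownership plus Drop means device memory is freed exactly once, automatically. The following is a hedged sketch of that RAII pattern; `DeviceBuffer` is a stand-in type for illustration, not the actual ascend-rs tensor handle.

```rust
// Hypothetical RAII handle for device memory. In a real binding the struct
// would wrap a device pointer and Drop would call the aclrtFree equivalent.
struct DeviceBuffer {
    len: usize,
}

impl DeviceBuffer {
    fn alloc(len: usize) -> Self {
        DeviceBuffer { len }
    }
    fn len(&self) -> usize {
        self.len
    }
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        // Device free runs here, exactly once per buffer.
    }
}

// Taking the buffer by value consumes it; the caller cannot touch it again.
fn consume(buf: DeviceBuffer) -> usize {
    buf.len() // buf is dropped (freed) when this function returns
}

fn main() {
    let buf = DeviceBuffer::alloc(1024);
    let n = consume(buf);
    assert_eq!(n, 1024);
    // buf.len();            // compile error: value moved into `consume`
    // drop(buf); drop(buf); // double free is likewise rejected at compile time
}
```

Because there is no public free operation at all, the double-free of case 5 is not merely caught; it is unwritable in safe code.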

Performance Comparison (in progress)

| Kernel | ascend-rs Time | AscendC C++ Time | Ratio | Notes |
|---|---|---|---|---|
| softmax (256) | 0.077 ms | 0.078 ms | 0.99x | Zero overhead |
| softmax (16384) | 0.087 ms | 0.089 ms | 0.98x | Zero overhead |
| relu | | | | Pending |
| matmul | | | | Pending |
| layernorm | | | | Pending |
| conv2d | | | | Pending |

Performance benchmarking experiments are in progress. This table will be updated as results become available.


This appendix was auto-generated by bash scripts/generate_kernel_appendix.sh. Kernel counts: 486 compiletests + 19 deployable = 505 total.


Appendix F: Performance Benchmarks

This appendix provides an interactive comparison of AscendC C++ (hand-optimized reference kernels) versus ascend-rs (Rust-generated) kernel performance across different NPU targets.

Methodology

  • Wall-clock timing: clock_gettime(CLOCK_MONOTONIC) around kernel launch + aclrtSynchronizeStream
  • Iterations: 1 warmup + 10 timed, median reported
  • Compilation: Both C++ and Rust kernels compiled with bisheng at -O2
  • Ratio: Rust time / C++ time (< 1.0 = Rust is faster)
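The reported statistic can be made explicit. Assuming the per-iteration wall-clock samples are already collected (warmup discarded), the reported number for each kernel/size pair is the median of the 10 timed runs, and the ratio is the Rust median divided by the C++ median. A minimal sketch:

```rust
// Median of a sample set; used on the 10 timed iterations after the
// warmup run has been discarded.
fn median(samples: &mut [f64]) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2.0
    }
}

fn main() {
    // Hypothetical millisecond timings for one kernel/size pair.
    let mut rust_ms = vec![0.087, 0.088, 0.086, 0.087, 0.090, 0.087, 0.086, 0.088, 0.087, 0.087];
    let mut cpp_ms = vec![0.089, 0.090, 0.088, 0.089, 0.091, 0.089, 0.088, 0.090, 0.089, 0.089];
    // Ratio = Rust median / C++ median; values below 1.0 favor Rust.
    let ratio = median(&mut rust_ms) / median(&mut cpp_ms);
    assert!(ratio < 1.0);
}
```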

Interactive Results

Note: If the interactive table does not render (e.g., in PDF), see the static table below.

Static Summary

| Kernel | Size | Target | C++ (ms) | Rust (ms) | Ratio |
|---|---|---|---|---|---|
| relu | 256 | 310P | 0.078 | 0.075 | 0.96x |
| relu | 1024 | 310P | 0.075 | 0.076 | 1.01x |
| relu | 4096 | 310P | 0.075 | 0.076 | 1.01x |
| relu | 16384 | 310P | 0.083 | 0.083 | 1.00x |
| sigmoid | 256 | 310P | 0.075 | 0.075 | 1.00x |
| sigmoid | 1024 | 310P | 0.075 | 0.074 | 0.99x |
| sigmoid | 4096 | 310P | 0.077 | 0.077 | 1.00x |
| sigmoid | 16384 | 310P | 0.086 | 0.086 | 1.00x |
| softmax | 256 | 310P | 0.078 | 0.077 | 0.99x |
| softmax | 1024 | 310P | 0.077 | 0.076 | 0.99x |
| softmax | 4096 | 310P | 0.079 | 0.079 | 1.00x |
| softmax | 16384 | 310P | 0.089 | 0.087 | 0.98x |
| tanh | 256 | 310P | 0.075 | 0.077 | 1.03x |
| tanh | 1024 | 310P | 0.075 | 0.076 | 1.01x |
| tanh | 4096 | 310P | 0.076 | 0.078 | 1.03x |
| tanh | 16384 | 310P | 0.085 | 0.086 | 1.01x |
| gelu | 256 | 910B3 | 0.023 | 0.019 | 0.83x |
| gelu | 1024 | 910B3 | 0.022 | 0.019 | 0.86x |
| gelu | 4096 | 910B3 | 0.023 | 0.019 | 0.83x |
| gelu | 16384 | 910B3 | 0.024 | 0.023 | 0.96x |
| relu | 256 | 910B3 | 0.030 | 0.030 | 1.00x |
| relu | 1024 | 910B3 | 0.028 | 0.028 | 1.00x |
| relu | 4096 | 910B3 | 0.029 | 0.026 | 0.90x |
| relu | 16384 | 910B3 | 0.029 | 0.031 | 1.07x |
| sigmoid | 256 | 910B3 | 0.028 | 0.028 | 1.00x |
| sigmoid | 1024 | 910B3 | 0.028 | 0.024 | 0.86x |
| sigmoid | 4096 | 910B3 | 0.029 | 0.028 | 0.97x |
| sigmoid | 16384 | 910B3 | 0.029 | 0.030 | 1.03x |
| softmax | 256 | 910B3 | 0.031 | 0.032 | 1.03x |
| softmax | 1024 | 910B3 | 0.031 | 0.031 | 1.00x |
| softmax | 4096 | 910B3 | 0.021 | 0.021 | 1.00x |
| tanh | 256 | 910B3 | 0.029 | 0.030 | 1.03x |
| tanh | 1024 | 910B3 | 0.028 | 0.026 | 0.93x |
| tanh | 4096 | 910B3 | 0.028 | 0.028 | 1.00x |
| tanh | 16384 | 910B3 | 0.029 | 0.030 | 1.03x |
Benchmarks collected on Ascend 910B3 and 310P hardware. Auto-generated from kernels.db.


Appendix G: Tile API vs. Buffer API — A Comparison with FlashTile/PTO

Summary finding: The tile-based kernel API (ascend_std::tile) shrinks a 50-line softmax to 5 lines and eliminates all explicit pipe barrier management. For V-pipe workloads like softmax, PTO provides no runtime performance advantage over the buffer API — both target the same vector engine. The true performance case for PTO is cube-unit (M-pipe) kernels: pto.tmatmul drives L0A/L0B/L0C memory and the matrix multiplier, which is architecturally inaccessible through the buffer/vector API. A GEMM benchmark comparing mlir_to_pto against the buffer path is the correct experiment to demonstrate this advantage.


G.1 The Three Codegen Paths in ascend-rs

ascend-rs supports three distinct codegen paths for Rust NPU kernels. Each path targets a different level of the Ascend software stack, and all three share a common Rust frontend and MLIR intermediate stage:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         ascend-rs Toolchain                                 │
│                                                                             │
│  Rust kernel source (.rs)                                                   │
│         │                                                                   │
│         ▼                                                                   │
│  rustc + rustc_codegen_mlir  ←── custom codegen backend (melior/MLIR)      │
│         │                                                                   │
│         ▼                                                                   │
│  LLVM-dialect MLIR (.mlir)                                                  │
│         │                                                                   │
│   ┌─────┴──────────────────────────────┐                                   │
│   │                                    │                                   │
│   │  ACLRS_CODEGEN_PATH=cpp (default)  │  ACLRS_CODEGEN_PATH=pto          │
│   │                                    │                                   │
│   ▼                                    ▼                                   │
│  mlir_to_cpp.rs                       mlir_to_pto.rs                       │
│  (5,956 lines)                        (714 lines)                          │
│   │                                    │                                   │
│   ▼                                    ▼                                   │
│  AscendC C++ (.cpp)                   PTO-MLIR (.pto)                      │
│   │           │                              │                             │
│   │           │                        ┌─────┘                            │
│   │           │                        ▼                                   │
│   │           │               ptoas (PTO assembler)                        │
│   │           │               [Huawei internal tool]                       │
│   │           │                        │                                   │
│   │           └────────────────────────┤                                   │
│   │                                    ▼                                   │
│   │                          AscendC C++ (.cpp)                            │
│   │                                    │                                   │
│   └────────────────────────────────────┘                                   │
│                                    │                                        │
│                                    ▼                                        │
│               bisheng / ccec  (Huawei CCE compiler)                        │
│                                    │                                        │
│                                    ▼                                        │
│                    NPU binary (.o / .bin)                                   │
│                                    │                                        │
│                                    ▼                                        │
│                  KernelLoader + AclStream (ascend_rs host API)             │
│                                    │                                        │
│                                    ▼                                        │
│                     Ascend NPU hardware execution                           │
└─────────────────────────────────────────────────────────────────────────────┘

PyPTO (FlashTile) is a parallel Python-based codegen path that targets the same PTO assembler from a different frontend:

┌─────────────────────────────────────────────────────────────────────────────┐
│                  PyPTO / FlashTile Toolchain (for comparison)               │
│                                                                             │
│  Python DSL (FlashTile decorators)                                          │
│         │                                                                   │
│         ▼                                                                   │
│  to_pto_converter (Python bindings → MLIR Python API)                      │
│         │                                                                   │
│         ▼                                                                   │
│  PTO-MLIR (.pto)                                                            │
│         │                                                                   │
│         ▼                                                                   │
│  ptoas  →  AscendC C++  →  bisheng  →  NPU binary                         │
└─────────────────────────────────────────────────────────────────────────────┘

The ascend-rs PTO path and the PyPTO path share the same ptoas assembler and the same PTO-MLIR dialect. This means the two ecosystems are interoperable at the .pto boundary: a tile kernel described in either Rust or Python produces structurally identical intermediate representation.

The three ascend-rs paths differ in their target use cases:

| Path | Env var | Translator | Output | Status |
|------|---------|------------|--------|--------|
| Buffer API | ACLRS_CODEGEN_PATH=cpp (default) | mlir_to_cpp | AscendC C++ with TBuf, DataCopy, pipe_barrier | Production — verified on 310P and 910B2 |
| Tile→CPP | ACLRS_CODEGEN_PATH=cpp + tile intrinsics | mlir_to_cpp tile handlers | AscendC C++ with scalar GetValue/SetValue loops | Working — all 6 multi-row shapes pass correctness; ~10 Melem/s (scalar bottleneck) |
| Tile→PTO | ACLRS_CODEGEN_PATH=pto | mlir_to_pto | PTO-MLIR dialect for ptoas | Experimental — full softmax (trowmax → trowexpandsub → texp → trowsum → trowexpanddiv) verified through ptoas; blocked at bisheng step (CANN 8.5.0 / pto-inst.hpp incompatibility) |

The tile API path implements Phase 3 of the PTO/FlashTile integration plan, where PTO (Programmable Tile Operations) is a virtual ISA for Ascend NPUs, with ptoas being its assembler. FlashTile refers to the tile-level programming model exposed through the PTO ISA — tile loads, stores, and fused operations like tile.softmax — as distinct from the lower-level buffer/DMA model of AscendC.
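Selecting a codegen path is a matter of setting the environment variable from the table above before building. A small sketch; the cargo command line here is an illustrative assumption, not a documented invocation:

```shell
# Select the experimental tile -> mlir_to_pto -> ptoas path for this build.
export ACLRS_CODEGEN_PATH=pto
echo "codegen path: ${ACLRS_CODEGEN_PATH:-cpp}"

# Unset (or =cpp) falls back to the default mlir_to_cpp buffer-API path, e.g.:
# ACLRS_CODEGEN_PATH=cpp cargo build --release
# ACLRS_CODEGEN_PATH=pto cargo build --release
```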


G.2 The Usability Gap: Softmax as a Case Study

The same row-wise softmax computation requires very different amounts of code in each API:

Buffer API (mha/kernels/src/lib.rs, ~50 lines of kernel code):

#[ascend_std::aiv_kernel]
pub unsafe fn softmax_rows_f16(
    input: *const u16, output: *mut u16,
    row_len: *const u32, num_rows: *const u32,
) {
    let cols = *row_len;
    let rows = *num_rows;
    let buf_in   = ascend_std::ascend_buf_alloc(cols);
    let buf_out  = ascend_std::ascend_buf_alloc(cols);
    let buf_work = ascend_std::ascend_buf_alloc(cols);
    let buf_rwork = ascend_std::ascend_buf_alloc(cols);

    let mut row = 0u32;
    loop {
        if row >= rows { break; }
        let in_ptr  = input.wrapping_add((row * cols) as usize);
        let out_ptr = output.wrapping_add((row * cols) as usize);

        ascend_std::ascend_buf_load_f16(buf_in, in_ptr, cols);
        ascend_std::ascend_pipe_barrier();                          // barrier 1

        let max_val = ascend_std::ascend_reduce_max_f16(buf_rwork, buf_in, buf_work, cols);
        ascend_std::ascend_pipe_barrier();                          // barrier 2

        ascend_std::ascend_adds_f16(buf_out, buf_in, -max_val, cols);
        ascend_std::ascend_pipe_barrier();                          // barrier 3

        ascend_std::ascend_exp_f16(buf_out, buf_out, cols);
        ascend_std::ascend_pipe_barrier();                          // barrier 4

        let sum_val = ascend_std::ascend_reduce_sum_f16(buf_rwork, buf_out, buf_work, cols);
        ascend_std::ascend_pipe_barrier();                          // barrier 5

        ascend_std::ascend_muls_f16(buf_out, buf_out, 1.0f32 / sum_val, cols);
        ascend_std::ascend_pipe_barrier();                          // barrier 6

        ascend_std::ascend_buf_store_f16(out_ptr, buf_out, cols);
        ascend_std::ascend_pipe_barrier();                          // barrier 7

        row = row + 1;
    }
}

Tile API (tile_softmax/kernels/src/lib.rs, 5 lines of kernel logic):

#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax<const ROWS: usize, const COLS: usize>(
    input: *const f32,
    output: *mut f32,
) {
    let block_idx = ascend_std::get_block_idx() as usize;
    let offset = block_idx * ROWS * COLS;
    let t_in  = tile_load_f32::<ROWS, COLS>(input.add(offset));
    let t_out = tile_softmax_f32::<ROWS, COLS>(t_in);
    tile_store_f32::<ROWS, COLS>(output.add(offset), t_out);
}

The difference is stark: 7 explicit pipe_barrier() calls, 4 named buffer allocations, and a manual row loop in the buffer API vs. zero barriers, zero explicit buffers, and no loop in the tile API. The tile API codegen path — mlir_to_pto — automatically suppresses pipe_barrier calls because PTO manages pipeline synchronization implicitly.


G.3 The PTO Format: MLIR Dialect, Not Text Assembly

A critical finding emerged when ptoas was located on the 910c server. The actual .pto format consumed by ptoas is MLIR with a pto dialect — not a line-oriented text assembly.

The correct PTO format uses MLIR structured ops:

module {
  func.func @softmax_kernel(%arg0: !pto.ptr<f32>, %arg1: !pto.ptr<f32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c32 = arith.constant 32 : index
    %in_tv  = pto.make_tensor_view %arg0, shape = [%c32, %c32] strides = [%c32, %c1] : !pto.tensor_view<2xf32>
    %out_tv = pto.make_tensor_view %arg1, shape = [%c32, %c32] strides = [%c32, %c1] : !pto.tensor_view<2xf32>
    %in_pt  = pto.partition_view %in_tv,  offsets = [%c0, %c0], sizes = [%c32, %c32] : !pto.tensor_view<2xf32> -> !pto.partition_tensor_view<32x32xf32>
    %out_pt = pto.partition_view %out_tv, offsets = [%c0, %c0], sizes = [%c32, %c32] : !pto.tensor_view<2xf32> -> !pto.partition_tensor_view<32x32xf32>
    %buf_in  = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32, ...>
    %buf_out = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32, ...>
    pto.tload  ins(%in_pt  : ...) outs(%buf_in  : ...)
    pto.tsoftmax ins(%buf_in : ...) outs(%buf_out : ...)
    pto.tstore ins(%buf_out : ...) outs(%out_pt : ...)
    return
  }
}

When this is fed to ptoas, the tool lowers it through several MLIR passes (PTO Infer Mem Scope → PTO plan Mem → PTOToEmitC) and emits AscendC C++:

#include "common/pto_instr.hpp"
using namespace pto;
__global__ AICORE void softmax_kernel(__gm__ float* v1, __gm__ float* v2) {
  using T = float;
  // ... pto-generated AscendC vector ops ...
}

This means ptoas is a source-to-source compiler from PTO-MLIR to AscendC C++, not an assembler to machine code. The pipeline is:

PTO-MLIR (.pto file)  →  ptoas  →  AscendC C++ (.cpp)  →  bisheng  →  NPU binary (.o)

G.4 Comparison with FlashTile

FlashTile (PyPTO) is CANN’s tile-level operator programming framework. It exposes approximately 90 tile operations through a Python DSL that compiles via ptoas to AscendC C++. The ascend-rs tile API (ascend_std::tile) targets the same PTO ISA from the Rust side.

| Dimension | FlashTile/PyPTO | ascend-rs buffer API | ascend-rs tile→CPP | ascend-rs tile→PTO |
|-----------|-----------------|----------------------|--------------------|--------------------|
| Frontend language | Python DSL | Rust (no_std) | Rust (no_std) | Rust (no_std) |
| Tile shape encoding | Runtime Python objects | Runtime count args | Compile-time const generics | Compile-time const generics |
| Shape mismatch detection | Runtime error | Runtime (wrong result) | Compile error | Compile error |
| Barrier management | Implicit (PTO) | Explicit (7 per softmax) | Implicit (generated) | Implicit (PTO) |
| Memory safety | Python GC; no device-side safety | Rust ownership | Rust ownership | Rust ownership |
| Codegen path | Python → PTO-MLIR → ptoas → C++ | Rust → MLIR → mlir_to_cpp → C++ | Rust → MLIR → mlir_to_cpp (tile handlers) → C++ | Rust → MLIR → mlir_to_pto → PTO-MLIR → ptoas → C++ |
| MLIR optimization stage | None | No (pass-through) | No (pass-through) | Yes — MLIR passes before ptoas |
| ptoas required | Yes | No | No | Yes — same dependency |
| V-pipe (softmax, eltwise) | ~same as buffer API | Best (440–788 Melem/s) | ~10 Melem/s (scalar workaround) | ~same as buffer API |
| M-pipe (GEMM, matmul) | Full cube-unit via pto.tmatmul | Not accessible (V-pipe only) | Not accessible (V-pipe only) | Full cube-unit via pto.tmatmul |
| Current hardware status | CANN internal distribution | Production (310P + 910B2) | Working — all 6 softmax shapes pass | Experimental — mlir_to_pto.rs done; build integration pending |

The key structural advantage of the Rust approach over PyPTO is the compile-time shape system: Tile<16, 1024, f32> is a distinct type from Tile<1, 1024, f32>, and passing the wrong tile to tile_softmax_f32 is a type error caught by rustc before any code runs. In Python, tile shape mismatches are runtime errors.
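The const-generic mechanism can be sketched on the host side. This stand-in `Tile` type is illustrative only, not the real ascend_std::tile API; it shows how a shape mismatch becomes a type error rather than a wrong result:

```rust
use std::marker::PhantomData;

// Stand-in for a tile whose shape lives in the type, not in runtime args.
struct Tile<const ROWS: usize, const COLS: usize, T>(PhantomData<T>);

impl<const ROWS: usize, const COLS: usize, T> Tile<ROWS, COLS, T> {
    fn new() -> Self { Tile(PhantomData) }
    const fn shape(&self) -> (usize, usize) { (ROWS, COLS) }
}

// Shapes flow through the signature: the output shape equals the input shape.
fn tile_softmax_f32<const R: usize, const C: usize>(t: Tile<R, C, f32>) -> Tile<R, C, f32> {
    t
}

fn main() {
    let a: Tile<16, 1024, f32> = Tile::new();
    let out = tile_softmax_f32(a);       // OK: shapes match
    assert_eq!(out.shape(), (16, 1024));

    // let b: Tile<1, 1024, f32> = Tile::new();
    // let _bad: Tile<16, 1024, f32> = tile_softmax_f32(b);
    // ^ rejected by rustc: expected `Tile<16, 1024, f32>`,
    //   found `Tile<1, 1024, f32>` -- caught before any code runs.
}
```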

The key advantage of PyPTO is maturity: it ships with CANN and is tested against real hardware. ascend-rs’s tile path depends on ptoas, which is not yet publicly available.


G.5 Quantitative Summary

V-pipe workloads (softmax) — ergonomics

| Metric | Buffer API | Tile→CPP | Tile→PTO |
|--------|------------|----------|----------|
| Kernel source lines | ~50 | 5 | 5 |
| Explicit pipe_barrier calls | 7/row | 0 | 0 |
| Named buffer allocations | 4 | 0 | 0 |
| Multi-row correctness | 1D only | ✓ 6 shapes | expected |
| Shape safety | runtime | compile-time | compile-time |

V-pipe workloads (softmax) — performance on Ascend 910B2

| Size | Buffer API | Tile→CPP (scalar) | Tile→PTO (expected) |
|------|------------|-------------------|---------------------|
| 1×1,024 | 0.0085 ms | 0.109 ms | ~0.009 ms |
| 1×4,096 | 0.0093 ms | 0.419 ms | ~0.010 ms |
| 1×8,192 | 0.0104 ms | 0.831 ms | ~0.011 ms |
| Throughput | 440–788 Melem/s | ~9–10 Melem/s | ~440–788 Melem/s |
| Hardware | ✓ 910B2 | ✓ 910B2, 6 shapes | bisheng compat pending |
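The throughput row follows directly from the latency rows (throughput = elements / time). A quick sanity check with values from the table; the helper function is ad hoc:

```rust
// Convert an element count and a latency in milliseconds to Melem/s.
fn melem_per_s(elems: f64, ms: f64) -> f64 {
    elems / (ms * 1e-3) / 1e6
}

fn main() {
    // Buffer API: 1x4096 @ 0.0093 ms and 1x8192 @ 0.0104 ms
    let buf_4096 = melem_per_s(4096.0, 0.0093);
    let buf_8192 = melem_per_s(8192.0, 0.0104);
    println!("{:.0} Melem/s", buf_4096); // prints "440 Melem/s"
    println!("{:.0} Melem/s", buf_8192); // prints "788 Melem/s"

    // Tile->CPP scalar fallback: 1x4096 @ 0.419 ms
    let scalar_4096 = melem_per_s(4096.0, 0.419);
    println!("{:.1} Melem/s", scalar_4096); // prints "9.8 Melem/s"

    // The ~45x gap at this size sits inside the 40-80x range quoted in G.6.
    println!("gap: ~{:.0}x", buf_4096 / scalar_4096);
}
```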

M-pipe workloads (matrix multiply/GEMM)

| Metric | Buffer API | Tile→CPP | Tile→PTO |
|--------|------------|----------|----------|
| Cube unit accessible | No | No | Yes |
| mlir_to_pto handler | n/a | n/a | loc=mat/left/right/acc |
| Measured perf | n/a | ~0.17–0.27 GFlop/s | n/a |
| Peak theoretical | V-pipe only | V-pipe only | ~32 TFlop/s |
| Hardware-verified | No | ✓ scalar, 5 shapes | bisheng compat pending |

The M-pipe row is where PTO’s performance rationale is strongest: the 910B2 cube unit is architecturally separate from the V-pipe and orders of magnitude faster for matrix operations — and it is only reachable through PTO.
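"Orders of magnitude" can be made concrete from the two numbers in the table. A back-of-the-envelope sketch, taking the best measured scalar rate and the quoted theoretical peak at face value:

```rust
fn main() {
    // Best measured Tile->CPP scalar matmul rate, in GFlop/s (from the table).
    let scalar_gflops = 0.27;
    // ~32 TFlop/s cube-unit peak, expressed in GFlop/s.
    let cube_peak_gflops = 32_000.0;

    let gap = cube_peak_gflops / scalar_gflops;
    // Roughly 1.2e5: about five orders of magnitude between the scalar
    // fallback and the cube unit that only PTO can reach.
    println!("peak/scalar gap: ~{:.0}x", gap);
    assert!(gap > 1.0e5 && gap < 2.0e5);
}
```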


G.6 Current Status and Next Steps

What is done: mlir_to_pto.rs has been rewritten (950+ lines) to emit correct PTO-MLIR dialect ops (pto.make_tensor_view, pto.partition_view, pto.alloc_tile, pto.tload, pto.tstore, pto.tadd, pto.texp, pto.trowmax, pto.trowsum, pto.trowexpandsub, pto.trowexpanddiv). The ptoas binary accepts the generated .pto files and emits AscendC C++. 10/10 unit tests pass. The full softmax decomposition (trowmax → trowexpandsub → texp → trowsum → trowexpanddiv) is E2E verified through ptoas — all five ops are correctly compiled to TROWMAX/TROWEXPANDSUB/TEXP/TROWSUM/TROWEXPANDDIV AscendC C++. ptoas is now wired into KernelBuilder: ACLRS_CODEGEN_PATH=pto runs the full MLIR → PTO-MLIR (.pto) → ptoas → AscendC C++ (.cpp) → bisheng pipeline. SSA alias tracking (getelementptr, alloca store/load, bitcast) was added to mlir_to_pto.rs to resolve block-offset GM pointers generated by the real Rust→MLIR codegen.

translate_matmul() now emits correct cube-unit tile types — fixed. Previously the function used loc=vec tiles for all operands, which violates TMatmul.hpp static assertions in pto-isa. The corrected mlir_to_pto.rs now emits the full pipeline: pto.alloc_tile with loc=mat (CBUF staging), loc=left (L0A), loc=right (L0B), loc=acc (L0C, fractal=1024), followed by pto.tload GM→mat, pto.tmov mat→left/right (MTE1 pipeline), and pto.tmatmul left×right→acc. The output through ptoas is verified to emit the correct __ca__/__cb__/__cc__ AscendC buffer qualifiers.

What remains:

1. Compiler version gap. ptoas-generated C++ includes pto/pto-inst.hpp, which is incompatible with the Clang 15-based bisheng in CANN 8.5.0. The errors span multiple symbols: MrgSortExecutedNumList (missing from pto_instr.hpp forward declarations), copy_gm_to_ubuf_align_b32 (builtin not supported for the dav-c220 target feature set in Clang 15), and bfloat16_t (not defined in Clang 15 aicore mode). This is an upstream compatibility issue: pto-inst.hpp is designed for a newer bisheng. Resolution: upgrade to CANN 9.x, or request a Clang 15 compatibility shim from the pto-isa maintainers.

2. Hardware benchmark comparison. Once the compiler version gap is resolved, the efficiency question — whether ptoas-generated AscendC avoids the LocalTensor::operator[] sub-view issue that forces the scalar fallback in the mlir_to_cpp tile path — can be answered empirically on 910B2. Based on the data, PTO-generated code should achieve ~440–800 Melem/s instead of ~10 Melem/s, recovering the 40–80× gap currently left by the scalar fallback.