English | 中文版
3. Going Deeper: Writing NPU Kernels in Rust
Hello World demonstrated host-side safety. But ascend-rs has a bigger vision: using Rust on the device side too. This means writing NPU kernel code in Rust, not C++.
Let’s walk through a complete vector multiplication (vec_mul) example to demonstrate this.
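Before looking at the device code, it helps to fix the semantics of the operation itself. Here is a plain-Rust, host-side reference implementation of vec_mul (a sketch for illustration only — the name `vec_mul_ref` is mine, not part of ascend-rs). It uses wrapping u16 multiplication, which matches the verification logic in the host code later in this section:

```rust
/// Host-side reference for z[i] = x[i] * y[i] over u16,
/// using wrapping semantics to match the result check below.
fn vec_mul_ref(x: &[u16], y: &[u16]) -> Vec<u16> {
    x.iter()
        .zip(y)
        .map(|(a, b)| a.wrapping_mul(*b))
        .collect()
}

fn main() {
    let x: Vec<u16> = (0..16).collect();
    let y: Vec<u16> = (0..16).map(|v| v + 1).collect();
    let z = vec_mul_ref(&x, &y);
    assert_eq!(z[3], 12); // 3 * 4
    println!("{:?}", z);
}
```

The NPU version below computes exactly this, but splits the 16 elements across parallel blocks.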
3.1 The Rust Kernel
This is the Rust code that runs on the NPU:
// kernels/src/lib.rs
// Key: #![no_core] indicates a completely bare-metal environment
#![feature(no_core)]
#![no_std]
#![no_core]

/// Element-wise vector multiplication: z[i] = x[i] * y[i]
///
/// #[ascend_std::aiv_kernel] marks this function as an NPU kernel entry point
#[ascend_std::aiv_kernel]
pub unsafe fn mul(x: *const u16, y: *const u16, z: *mut u16) {
    unsafe {
        // Total elements = 16, divide work evenly across parallel blocks
        let block_size = 16usize / ascend_std::get_block_num();
        let start = ascend_std::get_block_idx() * block_size;
        let mut i = start;
        loop {
            // Multiply element-wise and write to output
            *z.wrapping_add(i) = *x.wrapping_add(i) * *y.wrapping_add(i);
            i = i + 1;
            if i == block_size + start {
                break;
            }
        }
    }
}
Several things worth noting about this code:
#![no_core] environment: The NPU has no operating system or standard library. ascend_std provides a minimal reimplementation of Rust’s core types (Copy, Clone, Add, Mul, etc.) so that Rust code can compile in a bare-metal environment.
#[ascend_std::aiv_kernel]: This attribute macro marks the function as an AIV (AI Vector core) kernel entry point. It expands to #[unsafe(no_mangle)] (so the host can look up the symbol by name) and #[ascend::aiv_kernel] (so the MLIR codegen backend recognizes it and adds the hacc.entry attribute).
NPU parallel model: Similar to CUDA’s block/thread model, the Ascend NPU uses blocks and sub-blocks to organize parallel computation. get_block_idx() and get_block_num() provide execution context so the kernel knows which data slice to process.
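To make the index arithmetic in the kernel concrete, here is the same partitioning computed in ordinary host Rust. The function name `block_range` is mine for illustration; its inputs stand in for the values get_block_num() and get_block_idx() would return on the device:

```rust
/// Computes the [start, end) element range a given block processes,
/// mirroring the arithmetic in the `mul` kernel above.
fn block_range(total: usize, block_num: usize, block_idx: usize) -> (usize, usize) {
    let block_size = total / block_num; // assumes total divides evenly, as in the example
    let start = block_idx * block_size;
    (start, start + block_size)
}

fn main() {
    // 16 elements, launched with block_dim = 2 (as in the host code below):
    assert_eq!(block_range(16, 2, 0), (0, 8));  // block 0 handles indices 0..8
    assert_eq!(block_range(16, 2, 1), (8, 16)); // block 1 handles indices 8..16
}
```

Note the kernel assumes the element count divides evenly by the block count; a production kernel would also have to handle the remainder.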
3.2 The Host Code
The host code handles data transfer, kernel loading, and result verification:
// src/main.rs
use ascend_rs::prelude::*;

fn main() -> anyhow::Result<()> {
    // ── Phase 1: Initialization ──
    let acl = Acl::new()?;
    let device = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream = AclStream::new(&context)?;

    // ── Phase 2: Data preparation ──
    let x_host = common::read_buf_from_file::<u16>("test_data/input_x.bin");
    let y_host = common::read_buf_from_file::<u16>("test_data/input_y.bin");

    // Allocate device memory with HugeFirst policy (prefer huge pages for TLB efficiency)
    let mut x_device = DeviceBuffer::from_slice_with_policy(
        x_host.as_slice(),
        AclrtMemMallocPolicy::HugeFirst,
    )?;
    let mut y_device = DeviceBuffer::from_slice_with_policy(
        y_host.as_slice(),
        AclrtMemMallocPolicy::HugeFirst,
    )?;
    let mut z_device = unsafe {
        DeviceBuffer::<u16>::uninitialized_with_policy(
            x_host.len(),
            AclrtMemMallocPolicy::HugeFirst,
        )?
    };

    // ── Phase 3: Kernel execution ──
    unsafe {
        // KernelLoader loads NPU binary from build.rs compilation artifacts
        let kernel_loader = KernelLoader::new()?;
        // Get kernel handle by symbol name "mul"
        let kernel = kernel_loader.get_kernel("mul")?;
        // Launch kernel with 2 parallel blocks
        let block_dim: u32 = 2;
        let mut args = [
            x_device.as_mut_ptr() as *mut _,
            y_device.as_mut_ptr() as *mut _,
            z_device.as_mut_ptr() as *mut _,
        ];
        kernel.launch(block_dim, &stream, &mut args)?;
    }

    // ── Phase 4: Synchronize and verify ──
    stream.synchronize()?;
    let res = z_device.to_host()?;
    for (idx, elem) in res.iter().enumerate() {
        let expected = x_host[idx].wrapping_mul(y_host[idx]);
        assert_eq!(*elem, expected);
    }
    Ok(())
}
3.3 The Build System
build.rs bridges the Rust toolchain and the CANN compiler:
// build.rs
use ascend_rs_builder::KernelBuilder;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("cargo:rerun-if-changed=kernels");
    ascend_rs_builder::add_ascend_link_args()?;
    let out_path = PathBuf::from(std::env::var("OUT_DIR").unwrap());
    let kernel = out_path.join("kernel.o");
    // Detecting that "kernels" is a directory triggers the Rust kernel compilation pipeline
    KernelBuilder::new("kernels").copy_to(&kernel).build()?;
    Ok(())
}
When KernelBuilder detects the input is a directory (containing Cargo.toml), it:
1. Runs `cargo build` targeting `nvptx64-nvidia-cuda`
2. Specifies `-Zcodegen-backend=rustc_codegen_mlir` for the custom codegen backend
3. The backend translates Rust MIR to MLIR
4. The `mlir_to_cpp` pass converts MLIR into C++ source with AscendC API calls (DMA, vector ops, pipe barriers)
5. Invokes `bisheng` (the CANN C++ compiler) to compile the generated C++ into an NPU binary (`.acl.o`)
Steps 4–5 are key: although CANN includes bishengir-compile (an MLIR-native compiler for the 910B), the production pipeline uses the mlir_to_cpp path for all targets (both 310P and 910B). This C++ codegen approach provides access to the full AscendC feature set: DMA operations via DataCopy, the TPipe infrastructure, and vector intrinsics. When the Rust kernel calls functions like ascend_reduce_max_f32, the mlir_to_cpp pass recognizes them in the MLIR and emits the corresponding AscendC vector operations (ReduceMax, Exp, etc.). All 522 tests that pass on 910B3 hardware go through this path.
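For instance, a device-side ascend_reduce_max_f32 call ends up as an AscendC ReduceMax operation. Semantically it is an ordinary maximum reduction; the sketch below shows that semantics in plain host Rust (the function name here is illustrative, not the real ascend_std intrinsic):

```rust
/// Reference semantics of a max reduction over f32 — what the AscendC
/// ReduceMax vector op computes (illustrative, not the real intrinsic).
fn reduce_max_f32(xs: &[f32]) -> f32 {
    // NEG_INFINITY is the identity element for max over floats
    xs.iter().copied().fold(f32::NEG_INFINITY, f32::max)
}

fn main() {
    let v = [1.5f32, -3.0, 7.25, 0.0];
    assert_eq!(reduce_max_f32(&v), 7.25);
}
```

On the NPU, the same reduction runs as a single hardware vector instruction rather than an element-by-element loop, which is why the pass maps the call directly onto the AscendC op.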