
12. Related Work: Rust on GPUs and NPUs, and Integration with the NVIDIA Stack

Summary: ascend-rs is not the first Rust project to target heterogeneous compute, and it is far from the only one claiming memory safety as a headline. This chapter situates ascend-rs against the active Rust-on-accelerator projects — rust-cuda, rust-gpu, krnl, cudarc, wgpu, and the recently announced OxiCUDA — and then explores a concrete integration path on the NVIDIA side. The mlir_to_gpu backend already exists in the tree (it emits CUDA C from the same tile MLIR that drives the Ascend path), so the question is no longer “can ascend-rs run a Rust kernel on an NVIDIA GPU?” but “what does the ecosystem get if the kernel, the host runtime, and the safety oracle are stitched together instead of each project reimplementing its own?”. Four integration opportunities are identified; one of them — running the Chapter 11 oracle on PTX emitted by a runtime-PTX project like OxiCUDA — is a genuinely novel joint contribution, since no Rust-GPU project today ships a compile-time safety oracle on the generated low-level IR.


12.1 The Landscape of Rust on Accelerators

Chapter 1 included a five-row table summarising the open-source landscape. That table is enough context for a background section but too coarse for an honest comparison, because it collapses very different designs onto a single “approach” cell. This section expands the matrix along the axes that actually matter: what each project replaces in the vendor stack, how kernels reach the device, whether there is any compile-time safety claim beyond Rust’s own type system, and what shape the host-side runtime takes.

| Project | Target HW | Authoring layer | What's replaced | Runtime model | Safety beyond Rust | Maturity |
|---|---|---|---|---|---|---|
| rust-cuda | NVIDIA GPU | Rust kernel (#[kernel]) | nvcc for kernels | AOT, NVVM IR → PTX | None beyond borrow checker | Rebooted after 3y dormancy |
| rust-gpu | Vulkan (any GPU) | Rust kernel | glslc / shader compilers | AOT, Rust → SPIR-V | None beyond borrow checker | Active |
| krnl | Vulkan (any GPU) | Rust kernel (macro) | Shader compiler + runtime | AOT, uses rust-gpu | Safe buffer/host API | Active |
| cudarc | NVIDIA GPU | C/C++ kernel (.cu) | CUDA C++ runtime API | JIT, runtime nvrtc | Safe driver/runtime bindings | Active, widely used |
| wgpu | Vulkan / Metal / D3D12 / WebGPU | WGSL / SPIR-V | Platform graphics API | Runtime | Safe wrapper API | Active |
| OxiCUDA | NVIDIA GPU (primary); Metal / Vulkan / ROCm / L0 backends | Rust AST → PTX data structures | cuBLAS / cuDNN / cuFFT / cuSPARSE / cuSOLVER / cuRAND + full SDK | JIT, runtime PTX emission | Safe API surface | v0.1, newly announced |
| ascend-rs | Ascend NPU (primary); 14 secondary vendor backends | Rust kernel (ascend_std tile/buffer APIs) | AscendC authoring in C++ | AOT, MLIR → AscendC → bisheng | Compile-time safety oracle on generated MLIR / PTO-MLIR | Chapters 5 & 10 — 500+ kernels compile, DeepSeek decode 180+ tok/s on 910B2 |

A few observations jump out of this matrix that the Chapter 1 table flattens.

First, only two projects attempt to replace anything in the vendor stack above the driver: OxiCUDA claims replacement of the entire CUDA user-space stack (cuBLAS through cuDNN); ascend-rs replaces the kernel authoring language for Ascend and builds its own compilation hub (ascend_compile) but still calls into ACL / CANN at runtime. The other five projects give you a safer way to express kernels but leave the vendor’s libraries in place.

Second, the runtime model is split cleanly down the middle. rust-cuda, rust-gpu, krnl, and ascend-rs all commit to AOT: the kernel is compiled at build time to a machine-code artifact. cudarc, OxiCUDA, and wgpu’s compute path all use runtime compilation (via nvrtc, runtime PTX emission, or shader compilation on first draw). AOT gives you reproducibility and lets you run a safety analysis before the binary is built; JIT gives you flexibility (shape specialisation, target-dependent tuning). Chapters 10 and 11 only make sense under the AOT model — you cannot run check_all on PTX that has not been emitted yet.

Third, and this is what makes ascend-rs’s position distinct, nobody else ships a compile-time safety oracle on the generated low-level IR. Every other project’s safety story is “Rust’s borrow checker plus a safe API wrapper around the vendor runtime.” That is a real contribution — CUDA C++ aliasing and UAF bugs are endemic — but it says nothing about the memory layout of the generated kernel. Chapters 10 and 11 say something about that layout: they run check_aliasing, check_capacity, check_dead_writes, check_slayout_consistency on the stage-2 plan ptoas produces. That is an analysis a vendor toolchain should ship itself but doesn’t; ascend-rs ships it on the side as a “Rust safety card” for kernels the vendor’s own toolchain accepts with rc=0.
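To make the shape of such a pass concrete, here is a minimal sketch of a capacity check over a list of placed tiles. The Tile struct, Space enum, and capacity figures below are illustrative, not the crate's actual types; the real pass consumes the stage-2 plan ptoas prints.

```rust
// A minimal sketch in the spirit of the oracle's check_capacity pass.
// `Tile`, `Space`, and the capacity figures are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Space {
    Ub, // per-core scratchpad
    L1,
}

#[derive(Clone, Copy)]
struct Tile {
    space: Space,
    offset: usize, // byte offset within the space
    bytes: usize,  // rows * cols * dtype size
}

/// Reject any plan in which a tile extends past the capacity of its
/// memory space — the bug class a vendor toolchain can accept with rc=0.
fn check_capacity(tiles: &[Tile], cap: impl Fn(Space) -> usize) -> Result<(), String> {
    for (i, t) in tiles.iter().enumerate() {
        if t.offset + t.bytes > cap(t.space) {
            return Err(format!("tile {i} overruns {:?}", t.space));
        }
    }
    Ok(())
}
```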

12.1.1 Where ascend-rs Is Behind the Others

An honest landscape section has to name the gaps. Three of them are visible on the matrix above.

Library coverage. OxiCUDA claims cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, and cuRAND equivalents. ascend-rs has an ascend_std standard library and a kernel inventory in Appendix E, but nothing equivalent to a full DNN library — our DNN story is “write the operator you need as a Rust tile kernel”, not “import ascend_dnn::conv2d”. For users whose mental model is “pick up the DNN library and go”, ascend-rs has more assembly required.

Ecosystem. rust-cuda, rust-gpu, wgpu, and cudarc all have years of crates-io presence, downstream users, and bug reports. ascend-rs is in development (Chapter 1’s status column says so) and lives in a private repo; the public yijunyu/ascend-rs is artifacts-only (see repository notes). “Widely used” is not a claim we can make yet.

Target breadth inside NVIDIA. rust-cuda / rust-gpu / OxiCUDA each target NVIDIA as a first-class citizen. ascend-rs has a mlir_to_gpu backend that emits CUDA C, but it’s a secondary codegen exit, not the focus. If your primary target is NVIDIA, the existing NVIDIA-first projects are ahead of us on NVIDIA-specific polish.

This chapter argues for integration, not competition, precisely because the axes above are largely orthogonal. OxiCUDA’s library coverage and ascend-rs’s safety oracle are not the same piece of work; they compose.


12.2 Axis-by-Axis Comparison with OxiCUDA

OxiCUDA is worth a dedicated section because, of the related projects, it is the one most often confused with what ascend-rs is doing — both have “replace the vendor stack” language in their summaries, and both ship Rust as the authoring surface. The confusion is shallow and worth clearing up.

12.2.1 Target Hardware

OxiCUDA is NVIDIA-first. The v0.1 announcement lists NVIDIA CUDA as the primary backend, with Metal / Vulkan / WebGPU / ROCm / Intel Level Zero as additional backends. The shared abstraction appears to be “a kernel and a runtime library”, with per-vendor backends implementing it.

ascend-rs is Ascend-first. CANN 8.5 on Ascend 910B2 / 310P is the primary target; the 14 mlir_to_*.rs files in crates/rustc_codegen_mlir/src/ are secondary codegen exits that share the ascend_tile_* MLIR dialect as a common IR. Those exits are (in alphabetical order): aie, bang, cpp, csl, gaudi, gpu, hexagon, linalg, msl, musa, nki, pto, spirv. For an NVIDIA user, mlir_to_gpu exists and emits CUDA C, but nobody is claiming it is as polished as a project that targets NVIDIA exclusively.

12.2.2 What Is Replaced

This is the axis where the two projects look similar in headline and diverge in detail.

OxiCUDA replaces the CUDA SDK runtime stack: you do not install nvcc, you do not install the CUDA toolkit, you do not write .cu files. The only runtime dependency left is libcuda.so (the driver). PTX is generated at runtime from Rust data structures and handed to the driver via the Driver API. The replacement is horizontal, across the user-space libraries.

ascend-rs replaces the kernel authoring language and ships a standalone compilation hub (ascend_compile), but does not replace CANN / ACL. The replacement is vertical, within the kernel-programming slice: your kernel source is Rust instead of AscendC C++, and rustc_codegen_mlir lowers it through MLIR to AscendC, then bisheng produces the NPU binary. CANN is still the runtime. We have not rewritten cuBLAS’s Ascend analogue (aclBLAS).

The difference matters for deployment. An OxiCUDA user’s NVIDIA machine ships with just the driver and the compiled Rust binary. An ascend-rs user’s 910B2 machine ships with CANN 8.5 installed, plus the Rust binary and the codegen artifact library. The first is more radical; the second is more conservative and lands faster on real vendor hardware.

12.2.3 Compilation Model

OxiCUDA generates PTX at runtime. There are no .cu files, no .o files on disk for the kernel path; the PTX string is built from Rust types inside the process, submitted to the Driver API, and the module handle goes in a cache. This is similar in spirit to cudarc plus nvrtc but cuts out the nvrtc dependency by emitting PTX directly.
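The mechanism can be sketched in a few lines. Nothing below is OxiCUDA's actual code — KernelDesc, emit_ptx, and cached_ptx are hypothetical names — but it shows the shape: a typed description rendered to a PTX string in-process, with a cache so each kernel variant is emitted once.

```rust
use std::collections::HashMap;

// Hypothetical sketch of runtime PTX emission: a typed kernel
// description is rendered to PTX text inside the process, no nvrtc
// and no .cu file involved.
struct KernelDesc {
    name: String,
    n_params: usize,
}

fn emit_ptx(k: &KernelDesc) -> String {
    let params: Vec<String> = (0..k.n_params)
        .map(|i| format!(".param .u64 p{i}"))
        .collect();
    format!(
        ".version 7.0\n.target sm_80\n.address_size 64\n.visible .entry {}({})\n{{\n    ret;\n}}\n",
        k.name,
        params.join(", ")
    )
}

// In a real runtime the cached value would be a driver module handle
// (from cuModuleLoadData), not the PTX text itself.
fn cached_ptx<'a>(cache: &'a mut HashMap<String, String>, k: &KernelDesc) -> &'a String {
    cache.entry(k.name.clone()).or_insert_with(|| emit_ptx(k))
}
```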

ascend-rs compiles AOT. On cargo build, rustc_codegen_mlir produces MLIR, mlir_to_cpp or mlir_to_pto produces C++ or PTO-MLIR, bisheng (via ascend_compile) produces NPU object code, and the result is linked into a kernels.so that the host loads with libloading at run time. Chapter 7 has the full diagram.

Neither model is strictly better. AOT lets us run a compile-time safety oracle; JIT / runtime-emit lets OxiCUDA specialise on shapes the Rust code only learns at run time. The natural synthesis — which section 12.3 picks up — is that an AOT safety analysis can be applied to PTX even if that PTX was emitted at run time, as long as there is a moment to freeze it before launch.

12.2.4 Safety Story

Both projects say “type-safe, memory-safe Rust” on the authoring side, and both deliver on that claim at the API surface. Rust’s borrow checker catches what it catches on the host.

ascend-rs goes one step further with the Chapter 10 and Chapter 11 oracle. Those chapters describe six passes that run on the stage-2 plan ptoas produces: check_aliasing, check_capacity, check_dead_writes, check_slayout_consistency, and two others. They catch bugs that ptoas itself accepts with rc=0 — cases where the vendor toolchain produces a binary that runs and silently corrupts data. That is a class of bug no “safe wrapper around the runtime” can catch, because the unsafety lives below the wrapper, in the memory layout of the generated kernel.

OxiCUDA’s v0.1 does not, as of writing, document a corresponding analysis on its emitted PTX. This is not a criticism — v0.1 is v0.1, and the project’s focus at that stage is library coverage, not analysis. It is an integration opportunity, covered in section 12.3.3.

12.2.5 Scope

OxiCUDA is broader at the framework level: its announcement names computation graphs, GPU training, inference, and reinforcement learning among its goals. A v0.1 with that ambition is “land the surface area, polish later.”

ascend-rs is narrower: we ship a compiler backend, a standard library, a HAL, safe runtime wrappers, and a validated kernel suite (DeepSeek-R1-Distill-Qwen-1.5B end-to-end decode at 180+ tok/s on 910B2, 500+ compiling kernels across MultiKernelBench categories — see Chapter 9). We do not ship a training loop or an RL framework. Someone could build those on top of the HAL, but that is not what we have shipped.

12.2.6 The One-Line Comparison

OxiCUDA: delete the CUDA SDK, ship PTX from Rust at runtime, cover the NVIDIA library surface.

ascend-rs: write NPU kernels in safe Rust, compile through MLIR to vendor toolchains, prove safety properties the vendor toolchain cannot.

They are not the same project and they are not in competition. The next section argues they can be joined at the seams.


12.3 Integration Opportunities on the NVIDIA Side

crates/rustc_codegen_mlir/src/mlir_to_gpu.rs has been in the tree since we started the multi-vendor exits; it is not hypothetical. Given the same ascend_tile_* MLIR that drives the Ascend path, it emits a .cu file compilable by nvcc -arch=sm_80 or clang++ --cuda-gpu-arch=sm_80. The mapping table at the top of that file covers the core ops — ascend_tile_load_f32, _store_f32, _add_f32, _sub_f32, _mul_f32, _exp_f32, _softmax_f32, _reduce_max_f32, _reduce_sum_f32, _scale_f32 — via straightforward CUDA kernel patterns, with matmul_f32 currently emitted as a cuBLAS-SGEMM placeholder comment.
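For a feel of what such a mapping involves, here is an illustrative emitter in the same spirit — tile op name in, CUDA C source out. The emitted pattern is a plausible sketch, not mlir_to_gpu's actual output.

```rust
// Illustrative only: mirrors the kind of table mlir_to_gpu keeps,
// mapping a tile op to a straightforward CUDA kernel pattern.
fn emit_cuda_kernel(op: &str, n: usize) -> Option<String> {
    let body = match op {
        "ascend_tile_add_f32" => "out[i] = a[i] + b[i];",
        "ascend_tile_sub_f32" => "out[i] = a[i] - b[i];",
        "ascend_tile_mul_f32" => "out[i] = a[i] * b[i];",
        "ascend_tile_exp_f32" => "out[i] = expf(a[i]);",
        // matmul_f32 is the documented gap: left to a cuBLAS call.
        _ => return None,
    };
    Some(format!(
        "__global__ void {op}(const float* a, const float* b, float* out) {{\n    int i = blockIdx.x * blockDim.x + threadIdx.x;\n    if (i < {n}) {{ {body} }}\n}}\n"
    ))
}
```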

Given that starting point, four integration opportunities present themselves. They are ordered by how concrete each is today.

12.3.1 Short-Term: mlir_to_gpu + cudarc Host Runtime

Today, mlir_to_gpu produces a .cu string. What that string gets fed to is the user’s problem. There is no ascend-rs-provided NVIDIA host runtime.

The short-term integration is to connect the .cu output to cudarc’s safe driver/runtime bindings:

  1. User writes the kernel in ascend_std tile API, exactly as for Ascend.
  2. rustc_codegen_mlir with ACLRS_CODEGEN_PATH=gpu emits .cu.
  3. nvcc (or clang++ --cuda-gpu-arch=sm_80) compiles to a .ptx or shared library.
  4. Host-side, a small ascend_hal CUDA backend (analogous to the existing cuda backend referenced in crates/ascend_hal/ — see CLAUDE.md) uses cudarc for device init, allocation, stream creation, and launch.

The user gets a single Rust source that runs on 910B2 by setting ACLRS_CODEGEN_PATH=pto and on NVIDIA by setting ACLRS_CODEGEN_PATH=gpu. The host side is cudarc’s problem — a crate with a maintained API, a test suite, and thousands of downstream users — instead of ours.
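The dispatch itself is simple enough to sketch. The Backend enum and select_backend function below are hypothetical; only the ACLRS_CODEGEN_PATH variable name and its pto / gpu values come from the tree.

```rust
// Hypothetical sketch of the dual-target dispatch: one Rust source,
// one environment variable, two codegen exits.
#[derive(Debug, PartialEq)]
enum Backend {
    Pto, // Ascend: tile MLIR -> PTO-MLIR -> bisheng
    Gpu, // NVIDIA: tile MLIR -> CUDA C -> nvcc
    Cpp, // generic C++ exit
}

fn select_backend(path: Option<&str>) -> Result<Backend, String> {
    match path {
        Some("pto") => Ok(Backend::Pto),
        Some("gpu") => Ok(Backend::Gpu),
        Some("cpp") | None => Ok(Backend::Cpp),
        Some(other) => Err(format!("unknown ACLRS_CODEGEN_PATH value: {other}")),
    }
}

// Call site, reading the environment:
// let backend = select_backend(std::env::var("ACLRS_CODEGEN_PATH").ok().as_deref());
```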

This is the smallest integration that produces a real end-to-end path, and it is what someone would prototype first. The blockers are prosaic: the mlir_to_gpu matmul path needs the cuBLAS call filled in (it’s a TODO comment today), and the ascend_hal CUDA backend needs more test coverage (cudarc dependency is already present but behind a feature gate). Both are engineering, not research.

12.3.2 Medium-Term: Evaluate Runtime-PTX Emission as an Alternative to nvcc

The nvcc dependency in step 3 above is the same kind of build-time dependency that CANN’s bisheng imposes on the Ascend side. It is defensible — both are the vendor’s official compiler — but it is also the biggest integration friction: CI machines need the toolkit, Docker images balloon, nightly builds break when the vendor upgrades.

OxiCUDA’s runtime-PTX emission is the natural alternative. If the mlir_to_gpu backend emitted PTX directly (via an mlir_to_ptx or by going through LLVM’s NVPTX target) instead of CUDA C, the nvcc dependency disappears and the deployment profile matches what OxiCUDA is already doing on its side.

Two ways to get there:

  1. Extend mlir_to_gpu to call LLVM’s NVPTX backend via mlir-sys, lowering the MLIR directly to PTX. This keeps the codegen entirely inside ascend-rs.
  2. Cooperate with OxiCUDA: ascend-rs lowers tile MLIR to an intermediate form (LLVM dialect, or OxiCUDA’s Rust-AST data structures) and OxiCUDA’s existing PTX emitter takes over.

Option 2 is worth considering because OxiCUDA has already solved the “generate valid PTX for modern architectures” problem and is maintaining that PTX emitter. Option 1 is more self-contained but means we carry the PTX-generation complexity.

There is no obvious winner between these two; the right answer depends on how stable OxiCUDA’s API becomes and how much of NVPTX’s surface area mlir_to_gpu actually needs. A probe — generate PTX for the softmax kernel both ways, diff the outputs, benchmark on a real NVIDIA GPU — is the natural next step.

12.3.3 Medium-Term: Run the Chapter 11 Oracle on PTX

This is the opportunity with the highest novelty. The Chapter 11 oracle’s passes — check_aliasing, check_capacity, check_dead_writes, check_slayout_consistency — are not logically Ascend-specific. They operate on a stage-2 plan: a flat list of tiles with (space, offset, rows, cols, dtype, blayout, slayout) tuples, plus a dependency graph over them. Nothing in check_aliasing knows it is running on PTO-MLIR rather than some other IR. The Ascend-specific parts are all in the parser (parse_stage2) that produces the plan from ptoas --print-after-all output.
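To show how little of the pass is Ascend-specific, here is a hypothetical reconstruction of the aliasing check over that tuple shape. The field names mirror the text; the oracle's real Plan struct may differ.

```rust
// Hypothetical reconstruction of the stage-2 plan shape described
// above; the real types in the oracle may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Space {
    Ub,
    L1,
}

#[derive(Clone, Copy)]
struct TileAlloc {
    space: Space,
    offset: usize,
    rows: usize,
    cols: usize,
    elem_bytes: usize, // from dtype
}

impl TileAlloc {
    fn bytes(&self) -> usize {
        self.rows * self.cols * self.elem_bytes
    }
}

/// The aliasing check reduces to a pairwise interval-overlap test on
/// [offset, offset + bytes) for tiles placed in the same space —
/// nothing here knows whether the plan came from PTO-MLIR or PTX.
fn check_aliasing(tiles: &[TileAlloc]) -> Result<(), String> {
    for (i, a) in tiles.iter().enumerate() {
        for (j, b) in tiles.iter().enumerate().skip(i + 1) {
            let overlap = a.space == b.space
                && a.offset < b.offset + b.bytes()
                && b.offset < a.offset + a.bytes();
            if overlap {
                return Err(format!("tiles {i} and {j} alias in {:?}", a.space));
            }
        }
    }
    Ok(())
}
```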

A parser that produces an equivalent stage-2 plan from PTX — from shared-memory allocations, from ld.global / st.global accesses, from warp-level shuffle patterns — would let the same six passes run on NVIDIA kernels. Every “my kernel runs but gives the wrong answer” bug that the Ascend oracle catches has an NVIDIA analogue (shared-memory bank conflicts, aliased __shared__ arrays, capacity-overshoots against a 48 KB or 100 KB per-SM limit).

Concretely:

  1. Write a parse_ptx_stage2 that emits the same Plan struct parse_stage2 emits today, but from PTX instead of PTO-MLIR.
  2. Run the existing check_all(plan) on the result.
  3. Wire it to an environment variable (mirroring ACLRS_PTO_SAFETY) — e.g. ACLRS_PTX_SAFETY=error / =warn / unset.
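A toy version of step 1 makes the claim tangible. The sketch below handles only .shared declarations plus a capacity pass — the real parse_ptx_stage2 would also model global loads and stores and the dependency graph — and every name in it is hypothetical.

```rust
// Toy sketch of the proposed parse_ptx_stage2: scan PTX text for
// shared-memory declarations of the form
//   .shared .align 4 .b8 name[N];
// and check the total against a per-SM budget. Allocation side only.
struct SharedAlloc {
    name: String,
    bytes: usize,
}

fn parse_shared_allocs(ptx: &str) -> Vec<SharedAlloc> {
    let mut out = Vec::new();
    for line in ptx.lines() {
        let line = line.trim();
        if !line.starts_with(".shared") {
            continue;
        }
        // Last token is the declarator, e.g. `smem[4096];`
        if let Some(decl) = line.split_whitespace().last() {
            if let Some((name, rest)) = decl.split_once('[') {
                if let Some(num) = rest.strip_suffix("];") {
                    if let Ok(bytes) = num.parse() {
                        out.push(SharedAlloc { name: name.to_string(), bytes });
                    }
                }
            }
        }
    }
    out
}

/// Capacity pass over the parsed allocations, against e.g. 48 * 1024.
fn check_shared_capacity(allocs: &[SharedAlloc], budget: usize) -> Result<(), String> {
    let total: usize = allocs.iter().map(|a| a.bytes).sum();
    if total > budget {
        Err(format!("shared memory {total} B exceeds budget {budget} B"))
    } else {
        Ok(())
    }
}
```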

The parser is the hard part; the checks are reused unchanged. This is a clean joint contribution for any PTX-emitting Rust-GPU project, OxiCUDA or otherwise. None of them ship a compile-time safety analysis on their generated PTX today.

12.3.4 Long-Term: A Shared Tile IR

At the top of the stack, ascend_tile_* intrinsics and whatever OxiCUDA (or rust-cuda, or rust-gpu) uses as a tile abstraction are solving the same problem: describe a rectangular tile of data and an operation over it, in a form that can be lowered to a specific vendor. We have 15 vendor backends in crates/rustc_codegen_mlir/src/mlir_to_*.rs; each of them consumes the same ascend_tile_* MLIR.

The long-term integration is to make that tile dialect a standalone artifact — an MLIR dialect and a reference lowering — and invite other projects to lower into it (OxiCUDA’s Rust AST → tile MLIR → PTX) or out of it (tile MLIR → SPIR-V → Vulkan, as mlir_to_spirv already does). At that point the picture is: one tile IR, N frontends, M backends, and a safety oracle that sits between the two.

This is the most speculative of the four opportunities and the one that most needs real user demand before it is worth the maintenance cost. It is included here because the first three opportunities naturally push in this direction — if you’re already sharing a host runtime (12.3.1), sharing a PTX emitter (12.3.2), and sharing a safety oracle (12.3.3), then sharing the IR that all of them lower from is just completing the picture.


12.4 What ascend-rs Would Ship Back to the Rust-GPU Ecosystem

If integration happens, the contributions flow in both directions. The parts of ascend-rs that are portable out of the Ascend context, in decreasing order of readiness:

  1. The safety oracle. The six passes are already decoupled from Ascend specifics at the pass level. A PTX-stage-2 parser unlocks them for NVIDIA.
  2. The ascend_compile compilation hub pattern. Today it dispatches to bisheng, but the three validation passes and the dual-flag infrastructure generalise: Chapter 7 describes a C++-to-binary hub, and CUDA C, SYCL, HIP, and AscendC all share the shape of that compilation pipeline. A multi-vendor compile crate factoring ascend_compile’s structure is a reasonable Rust crate on its own.
  3. The MLIR tile dialect. 15 vendor backends already share it. Detaching it from the ascend-rs repo is work but not research.
  4. The Rust-side kernel corpus. 500+ kernels across MultiKernelBench categories, covering everything from softmax to DeepSeek’s MLA attention, in safe tile-API Rust. For any project that wants a test suite, this is a starting point.

The parts that do not travel: the CANN-specific bits (pipe barriers, L0/L1/UB memory layout, SoC-version guards), the ACL runtime wrappers, the aiv_kernel macro’s AscendC-specific ABI, and anything that mentions 910B2. Those are ~30% of the codebase; the other ~70% is generically useful.


12.5 What This Chapter Is Not

It is worth stating what this chapter deliberately does not claim.

It does not claim ascend-rs has won a benchmark against OxiCUDA — they do not ship comparable benchmarks, and even if they did, they target different hardware.

It does not claim OxiCUDA is wrong to generate PTX at runtime. That is a legitimate design choice with real advantages; section 12.3.2 treats it as an integration opportunity, not a misfeature.

It does not claim the Chapter 11 oracle works on PTX today. Section 12.3.3 describes it as a medium-term opportunity for a reason: the PTX stage-2 parser has not been written. The claim is that the passes are portable, not that the parser is done.

It does not claim rust-cuda, rust-gpu, cudarc, or krnl are obsolete. Each of them occupies a distinct point in the design space (see 12.1), and for the use cases they were built for, they remain the right answer.

What this chapter does claim is narrow and, I hope, defensible: ascend-rs and the NVIDIA-side Rust projects are solving overlapping problems with complementary strengths, the integration surface is visible in today’s code (mlir_to_gpu + cudarc + the Chapter 11 oracle), and a serious effort to join them would produce a story stronger than any of the parts alone.