8. Next Steps: Roadmap and Vision
Current Status
ascend-rs is in active development:
- Host API: Alpha stage. ACL operations, memory management, kernel launching, BLAS, DVPP, profiling, and HCCL are implemented.
- Build tooling: Alpha stage. Supports compilation of both C++ and Rust kernels with automatic codegen path selection.
  - `ascend_compile` crate: Standalone kernel compilation library with C ABI, CLI, and Python bindings. Decouples bisheng invocation from rustc, enabling any C++ kernel generator to compile for Ascend NPU.
- Device runtime: 505 Rust NPU kernels (486 compiletests + 16 deployable + 6 tile) with complete 1:1 MultiKernelBench coverage across 17 categories; 413 tests pass NPU correctness verification on Ascend 910B3 (0 fail, 0 crash), including 37 matmul tests via aclnn operator composition, plus 6 memory safety case studies demonstrating structural advantages over AscendC C++.
- Benchmarks: Rust vector kernels match hand-optimized C++ performance (zero overhead) on softmax, activations, vec_add, and matmul.
Short-term Goals
Vector intrinsic coverage: The vector intrinsic API covers a comprehensive set of operations for f32 and f16:
| Category | Status | Details |
|---|---|---|
| Arithmetic | ✓ Implemented | `Add`, `Sub`, `Mul`, `Div`, `Min`, `Max` |
| Reductions | ✓ Implemented | `ReduceMax`, `ReduceMin`, `ReduceSum` |
| Unary math | ✓ Implemented | `Exp`, `Abs`, `Ln`, `Sqrt`, `Rsqrt`, `Reciprocal` |
| Scalar-vector | ✓ Implemented | `Adds`, `Muls`, `Maxs`, `Mins` (f32 and f16) |
| Activation functions | ✓ Implemented (16 activations) | `Relu`, `Sigmoid`, `Tanh`, `GELU`, `Softmax`, `ELU`, `Swish`, `Mish`, `SELU`, `Softplus`, `Softsign`, `HardSigmoid`, `HardSwish`, `Leaky ReLU`, `Log Softmax` |
| Composite operations | ✓ Implemented (17 composites in `kernel_ops.rs`) | `LayerNorm`, `RMSNorm`, L1/L2 Norm, MSE/Huber/Hinge Loss, Cosine Similarity, SGD Update, Reduce Mean/Prod |
| Cube engine | ✓ Implemented | `matmul_f16` via Mmad FFI (f16 inputs → f32 output) |
| Cube engine transpose | ✓ Implemented | `matmul_f16_transpose_b` with hardware L1→L0B transpose |
| Tiling and double-buffering | | Queue-based (`TQue`) pipeline for overlapping DMA and compute |
| Type-safe buffer handles | ✓ Implemented | `#[repr(transparent)]` newtype wrappers (`UbBuf`, `L1Buf`, `L0aBuf`, `L0bBuf`, `L0cBuf`) that prevent mixing buffer memory levels at compile time |
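The type-safe buffer handle idea can be sketched in plain Rust. This is an illustrative mock, not the real `ascend_std` types: the field layouts, the `copy_l1_to_ub` helper, and its signature are all assumptions made for the example, but the `#[repr(transparent)]` newtype pattern is the one described above.

```rust
// Sketch (not the real ascend_std API): give each memory level its own
// newtype so a routine written for one level cannot accept a buffer from
// another. #[repr(transparent)] keeps the wrapper ABI-identical to *mut u8.
#[repr(transparent)]
struct UbBuf(*mut u8); // Unified Buffer handle
#[repr(transparent)]
struct L1Buf(*mut u8); // L1 buffer handle

// Only accepts an L1 source and a UB destination; swapping the arguments
// is a type error at compile time, not a runtime NPU fault.
fn copy_l1_to_ub(src: &L1Buf, dst: &mut UbBuf, len: usize) {
    // A real implementation would issue a DMA; here we just copy bytes.
    unsafe { std::ptr::copy_nonoverlapping(src.0, dst.0, len) };
}

fn main() {
    let mut a = [1u8, 2, 3, 4];
    let mut b = [0u8; 4];
    let l1 = L1Buf(a.as_mut_ptr());
    let mut ub = UbBuf(b.as_mut_ptr());
    copy_l1_to_ub(&l1, &mut ub, 4);
    assert_eq!(b, [1, 2, 3, 4]);
    // copy_l1_to_ub(&ub, &mut l1, 4); // rejected: mismatched buffer types
}
```

The zero-cost part matters here: because the wrappers are `repr(transparent)` newtypes, the level distinction exists only in the type system and compiles away entirely.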
End-to-end neural network operator examples:
- Conv2D ✓ — Pre-built operator via `OpsBuilder`/atc, with host-side Model+Dataset execution and CPU reference verification
- Multi-Head Attention (MHA) ✓ — Host-orchestrated scaled dot-product attention pipeline: `Q*K^T` (HGEMM) → scale (Rust kernel) → row-wise softmax (Rust kernel with f16 reduce/exp/muls intrinsics) → `weights*V` (HGEMM)
- BLAS API improvement ✓ — `acl_blas_gemm_ex` alpha/beta changed from owned to borrowed (`&DeviceBox<T>`), enabling reuse across multiple GEMM calls in pipelines like MHA
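The MHA pipeline above can be expressed as a compact CPU reference, the same role the CPU verification plays for Conv2D. This f32 sketch is for illustration only; on the NPU the two matmuls run as HGEMMs in f16 and scale/softmax run as Rust kernels.

```rust
// CPU reference of the pipeline: Q·K^T → scale by 1/sqrt(d) →
// row-wise softmax → weights·V. All buffers are row-major slices.
fn softmax_rows(x: &mut [f32], rows: usize, cols: usize) {
    for r in 0..rows {
        let row = &mut x[r * cols..(r + 1) * cols];
        let max = row.iter().cloned().fold(f32::MIN, f32::max);
        let sum: f32 = row.iter().map(|v| (v - max).exp()).sum();
        for v in row.iter_mut() {
            *v = (*v - max).exp() / sum; // numerically stable softmax
        }
    }
}

/// q, k, v are s×d row-major; returns the s×d attention output.
fn attention(q: &[f32], k: &[f32], v: &[f32], s: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    // scores = scale * Q·K^T (s×s)
    let mut scores = vec![0.0f32; s * s];
    for i in 0..s {
        for j in 0..s {
            scores[i * s + j] =
                (0..d).map(|p| q[i * d + p] * k[j * d + p]).sum::<f32>() * scale;
        }
    }
    softmax_rows(&mut scores, s, s);
    // out = weights·V (s×d)
    let mut out = vec![0.0f32; s * d];
    for i in 0..s {
        for j in 0..d {
            out[i * d + j] = (0..s).map(|p| scores[i * s + p] * v[p * d + j]).sum();
        }
    }
    out
}

fn main() {
    // With a single query token, softmax over one score is exactly 1.0,
    // so the output equals V.
    let out = attention(&[1.0, 2.0], &[0.5, 0.5], &[3.0, 4.0], 1, 2);
    assert_eq!(out, vec![3.0, 4.0]);
}
```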
Device-side Rust language support: Core operators and codegen are complete:
- Operators: ✓ Implemented — `Add`, `Sub`, `Mul`, `Div`, `Rem`, bitwise ops (`BitAnd`, `BitOr`, `Shl`, `Shr`)
- Codegen: ✓ Implemented — signed/float remainder, float-integer conversions
- Type casting: ✓ Implemented — `Cast` codegen for f16↔f32 conversions
- Iterator combinators: `map`, `filter`, `fold`, `zip`, `enumerate`, etc.
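A host-side demo of the combinator style these goals target. In a device kernel the same `zip`/`map`/`fold` chain would operate over tile slices; plain slices stand in here so the snippet runs anywhere.

```rust
// Dot product written entirely with the iterator combinators listed above.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())               // pair elements of both inputs
        .map(|(x, y)| x * y)         // elementwise product
        .fold(0.0, |acc, p| acc + p) // reduce to a scalar
}

fn main() {
    assert_eq!(dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]), 32.0);
}
```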
Mid-term Goals: Ecosystem Integration
ascend_compile as the universal compilation backend: The standalone ascend_compile crate provides a single, validated compilation path for any tool that generates AscendC C++ kernels. It exposes four interfaces:
| Interface | Consumer | Use Case |
|---|---|---|
| Rust API | rustc_codegen_mlir | ascend-rs’s own MLIR→C++→binary pipeline |
| C ABI (`libascend_compile.so`) | Python via ctypes | Drop-in replacement for TileLang's libgen.py |
| CLI (`ascend-compile`) | Shell scripts, CI | Ad-hoc compilation and validation |
| Python wrapper (`ascend_compile.py`) | TileLang, Triton backends | Direct Python integration |
Key features that benefit all consumers:
- 3 validation passes before compilation: entry point check, DMA/sync barrier check (error on 310P, warning on 910B), buffer size vs. hardware limits
- Dual flag paths: `--cce-aicore-arch` for 310P/310B and `--npu-arch -xasc` for 910B (TileLang-compatible)
- Both object and shared library output: `-c -o out.o` or `-fPIC --shared -o out.so`
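The dual flag paths amount to a target-to-flags mapping. A minimal sketch, assuming a hypothetical `flag_path` helper: only the two flag strings come from this document; the function name, signature, and match arms are illustrative, not the `ascend_compile` API.

```rust
// Hypothetical helper: pick the bisheng flag path for a given SoC name.
// Flag strings are the two documented paths; everything else is assumed.
fn flag_path(soc: &str) -> Option<&'static str> {
    match soc {
        "310P" | "310B" => Some("--cce-aicore-arch"),
        "910B" => Some("--npu-arch -xasc"), // TileLang-compatible path
        _ => None, // unknown target: let validation report it
    }
}

fn main() {
    assert_eq!(flag_path("310P"), Some("--cce-aicore-arch"));
    assert_eq!(flag_path("910B"), Some("--npu-arch -xasc"));
    assert_eq!(flag_path("???"), None);
}
```

Centralizing this mapping is what lets every consumer (Rust API, C ABI, CLI, Python wrapper) get the correct flags without duplicating target knowledge.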
TileLang-Ascend integration: TileLang generates optimized AscendC C++ kernels from a Python DSL but relies on a bare subprocess.run(bisheng, ...) call with no validation. Replacing LibraryGenerator.compile_lib() with ascend_compile.compile_kernel() provides:
- Automatic target detection and correct flag selection
- Pre-compilation validation that catches common NPU bugs (missing sync barriers, buffer overflows)
- Consistent compilation across tools — the same flags ascend-rs uses for its own validated kernels
PyPTO integration: PyPTO (Parallel Tile Operations) is CANN’s high-level operator programming framework that compiles Python-level tensor operations through a ~90-instruction PTO virtual ISA down to AscendC C++ code. When PyPTO is released alongside the CANN framework, ascend_compile can serve as the compilation backend, and an ascend-rs interface to PyPTO would enable memory-safe static analysis of tile-level operators — catching buffer overflows, missing synchronization barriers, and incorrect DMA parameters at compile time that PyPTO currently validates only at code-generation time.
Triton-Ascend backend: Triton’s compiler pipeline produces target-specific IR that must be lowered to device binaries. A Triton backend for Ascend can use ascend_compile to handle the final AscendC C++ → NPU binary step, benefiting from the same validation and target abstraction.
PyTorch integration path: torch.compile with an Ascend backend could leverage ascend_compile through its C ABI to compile generated kernels without a Python→Rust dependency, using the same libascend_compile.so that TileLang uses.
Complete host API: All major CANN API modules now have safe Rust wrappers:
- Tensor descriptors ✓ — `TensorDesc`, `DataBuffer`, `Dataset` (28 methods)
- Model inference ✓ — `Model::from_file()`, `execute()`, `execute_async()`, `ModelDescription` (16 methods)
- Event management ✓ — `AclEvent` with record/sync/timing (8 methods)
- DVPP image preprocessing ✓ — `DvppChannel`, `PicDesc`, resize/crop/JPEG/PNG (42 methods)
- Profiling API ✓ — `ProfSession`, `ProfConfig`, `StepInfo`, `ProfStamp` (18 methods)
- HCCL distributed communication ✓ — AllReduce, AllGather, Broadcast, ReduceScatter, Send/Recv (17 methods)
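A runnable mock of the inference call pattern, using the wrapper names listed above (`Model::from_file`, `execute`, `Dataset`). The real ascend-rs types call into CANN; these stubs, the model filename, and the pass-through behavior are assumptions made purely to show the intended shape of the API.

```rust
// Mock types mirroring the documented wrapper surface; not the real API.
struct Dataset(Vec<f32>);
struct Model {
    path: String,
}

impl Model {
    fn from_file(path: &str) -> Result<Self, String> {
        // real impl: loads an .om model via CANN; mock just records the path
        Ok(Model { path: path.to_string() })
    }
    fn execute(&self, input: &Dataset) -> Result<Dataset, String> {
        // real impl: runs inference on the NPU; mock passes data through
        Ok(Dataset(input.0.clone()))
    }
}

fn main() {
    let model = Model::from_file("resnet50.om").unwrap(); // assumed filename
    let out = model.execute(&Dataset(vec![1.0, 2.0])).unwrap();
    assert_eq!(out.0, vec![1.0, 2.0]);
    assert_eq!(model.path, "resnet50.om");
}
```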
MLIR codegen backend improvements:
- Rust intrinsics ✓ — bit manipulation (ctlz/cttz/ctpop/bswap/bitreverse/rotate), float math (floor/ceil/round/trunc/copysign/fma), overflow arithmetic, saturating arithmetic
- Float constant support ✓ — proper MLIR attribute formatting with decimal points
- C++ codegen intrinsic translation ✓ — all LLVM intrinsics now mapped to GCC builtins and C math functions
- Correctness fixes ✓ — `raw_eq` (byte comparison), `discriminant_value` (enum match), `const_uint_big` (i128), `static_addr_of` (global symbols), `codegen_static` (initializer values)
- Debug info generation — not yet started
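The standard-library operations below lower to the LLVM intrinsics named above (ctlz, ctpop, fma, overflow/saturating arithmetic), which on the MLIR/C++ path are mapped to GCC builtins and C math functions. This host-side snippet shows the surface the codegen backend has to cover:

```rust
fn main() {
    assert_eq!(0x0Fu32.leading_zeros(), 28);    // llvm.ctlz
    assert_eq!(0xFFu32.count_ones(), 8);        // llvm.ctpop
    assert_eq!(200u8.saturating_add(100), 255); // saturating arithmetic
    assert_eq!(u8::MAX.checked_add(1), None);   // overflow arithmetic
    assert_eq!(2.0f64.mul_add(3.0, 1.0), 7.0);  // llvm.fma
}
```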
Long-term Vision
Ascend target specification — davinci-huawei-none: We have prepared a concrete Tier-3 target proposal for the Rust compiler. The target triple davinci-huawei-none follows established conventions (nvptx64-nvidia-cuda, amdgcn-amd-amdhsa) and defines ABI, calling conventions, and pointer sizes for the DaVinci NPU architecture. The target spec (upstream-tier3/compiler/rustc_target/src/spec/targets/davinci_huawei_none.rs) uses aarch64-unknown-none as the LLVM placeholder (since no DaVinci LLVM backend exists) and registers cfg(target_arch = "davinci") for conditional compilation. The upstream-tier3/ directory contains the complete submission package: target spec, platform-support documentation, patches for mod.rs/platform-support.md/bootstrap/sanity.rs, and community engagement materials (Zulip post, optional MCP draft, PR description). Our engagement plan: (1) post to Zulip #t-compiler/help for early feedback on the triplet name, (2) file an MCP if the novel MLIR codegen backend warrants compiler-team consensus, (3) open a draft PR to rust-lang/rust. Tier-3 targets have the lowest bar — no RFC, no CI, single-reviewer approval — and our in-tree changes contain no proprietary code.
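Once `cfg(target_arch = "davinci")` is registered as described above, crates can gate device-only code paths the same way CUDA and AMDGPU targets do today. A minimal illustration (the function bodies are placeholders; only the `cfg` key comes from the target spec):

```rust
// Device-side path: compiled only when targeting davinci-huawei-none.
#[cfg(target_arch = "davinci")]
fn backend() -> &'static str {
    "davinci"
}

// Fallback: compiled on every existing target.
#[cfg(not(target_arch = "davinci"))]
fn backend() -> &'static str {
    "host"
}

fn main() {
    // Running this snippet on an ordinary host target takes the fallback.
    assert_eq!(backend(), "host");
}
```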
Reducing the no_core burden: Maintaining a parallel core library reimplementation is a massive engineering effort. The long-term direction is to explore using -Zbuild-std=core with the MLIR backend to compile the Rust standard library source directly, rather than reimplementing by hand.
A unified Ascend compilation stack: The ascend_compile crate is the first step toward a unified compilation infrastructure where multiple frontends (Rust, Python DSLs, compiler IRs) share the same validated, target-aware backend. This mirrors the LLVM model — many frontends, one backend — but specialized for Ascend NPU hardware:
```mermaid
graph TD
    A1["Rust kernels"] --> F["AscendC C++ · common IR"]
    A2["TileLang (planned)"] -.-> F
    A3["Triton (planned)"] -.-> F
    A4["torch.compile (planned)"] -.-> F
    A5["PyPTO (planned)"] -.-> F
    A6["Future DSLs (planned)"] -.-> F
    F --> G["ascend_compile: validate → target flags → bisheng → binary"]
    G --> H["NPU Binary · .o / .so"]
```
Community Involvement
ascend-rs is currently in a private repository, pending an organizational decision on open-sourcing. Once released, it will welcome community participation. If you have Ascend NPU hardware and are interested in exploring memory-safe kernel programming, here are areas where contributions would be valuable:
- Add new vector intrinsics to `ascend_std`: follow the established pattern of `extern "C"` stubs + `mlir_to_cpp` handlers.
- Write more compiletest tests: as new features are added to `ascend_std`, corresponding compile tests should follow.
- Expand host API wrappers: the CANN SDK has many unwrapped APIs, each of which can be contributed independently.
- Try writing more complex Rust kernels: help discover gaps in the codegen backend and validate new intrinsics on NPU hardware.
- Integrate `ascend_compile` with your tool: if you work on TileLang, Triton, or other kernel compilers targeting Ascend, try replacing your compilation step with `ascend_compile` and report issues.