8. Next Steps: Roadmap and Vision

Current Status

ascend-rs is in active development:

  • Host API: Alpha stage. ACL operations, memory management, kernel launching, BLAS, DVPP, profiling, and HCCL are implemented.
  • Build tooling: Alpha stage. Supports compilation of both C++ and Rust kernels with automatic codegen path selection.
  • ascend_compile crate: Standalone kernel compilation library with C ABI, CLI, and Python bindings. Decouples bisheng invocation from rustc, enabling any C++ kernel generator to compile for Ascend NPU.
  • Device runtime: 505 Rust NPU kernels (486 compiletests + 16 deployable + 6 tile) with complete 1:1 MultiKernelBench coverage across 17 categories. 413 tests pass NPU correctness verification on Ascend 910B3 (0 failures, 0 crashes), including 37 matmul tests via aclnn operator composition, and 6 memory safety case studies demonstrate structural advantages over AscendC C++.
  • Benchmarks: Rust vector kernels match hand-optimized C++ performance (zero overhead) on softmax, activations, vec_add, and matmul.

Short-term Goals

Vector intrinsic coverage: The vector intrinsic API covers a comprehensive set of operations for f32 and f16:

  • Arithmetic: Add, Sub, Mul, Div, Min, Max ✓ Implemented
  • Reductions: ReduceMax, ReduceMin, ReduceSum ✓ Implemented
  • Unary math: Exp, Abs, Ln, Sqrt, Rsqrt, Reciprocal ✓ Implemented
  • Scalar-vector: Adds, Muls, Maxs, Mins (f32 and f16) ✓ Implemented
  • Activation functions: Relu, Sigmoid, Tanh, GELU, Softmax, ELU, Swish, Mish, SELU, Softplus, Softsign, HardSigmoid, HardSwish, Leaky ReLU, Log Softmax ✓ Implemented (16 activations)
  • Composite operations: LayerNorm, RMSNorm, L1/L2 Norm, MSE/Huber/Hinge Loss, Cosine Similarity, SGD Update, Reduce Mean/Prod ✓ Implemented (17 composites in kernel_ops.rs)
  • Cube engine: matmul_f16 via Mmad FFI (f16 inputs → f32 output) ✓ Implemented
  • Cube engine transpose: matmul_f16_transpose_b with hardware L1→L0B transpose ✓ Implemented
  • Tiling and double-buffering: Queue-based (TQue) pipeline for overlapping DMA and compute
  • Type-safe buffer handles: #[repr(transparent)] newtype wrappers (UbBuf, L1Buf, L0aBuf, L0bBuf, L0cBuf) that prevent mixing buffer memory levels at compile time ✓ Implemented
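The newtype idea behind these handles can be sketched as follows. This is a minimal illustration only: the names mirror the handles listed above, but the fields and the copy routine are assumptions, not the real ascend-rs definitions.

```rust
// Minimal sketch of the #[repr(transparent)] newtype pattern: each
// buffer level gets its own zero-cost wrapper around a raw pointer.
#[repr(transparent)]
pub struct UbBuf(*mut u8); // Unified Buffer handle (illustrative)

#[repr(transparent)]
pub struct L1Buf(*mut u8); // L1 buffer handle (illustrative)

// A routine typed against specific memory levels: passing a UbBuf
// where an L1Buf is expected is rejected at compile time.
pub fn copy_l1_to_ub(_src: &L1Buf, _dst: &mut UbBuf) {
    // On-device this would issue a DMA transfer.
}
```

Because of `#[repr(transparent)]`, each wrapper has exactly the layout of the raw pointer it wraps, so the type safety costs nothing at runtime.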

End-to-end neural network operator examples:

  • Conv2D ✓ — Pre-built operator via OpsBuilder/atc, with host-side Model+Dataset execution and CPU reference verification
  • Multi-Head Attention (MHA) ✓ — Host-orchestrated scaled dot-product attention pipeline: Q*K^T (HGEMM) → scale (Rust kernel) → row-wise softmax (Rust kernel with f16 reduce/exp/muls intrinsics) → weights*V (HGEMM)
  • BLAS API improvement ✓ — acl_blas_gemm_ex alpha/beta changed from owned to borrowed (&DeviceBox<T>), enabling reuse across multiple GEMM calls in pipelines like MHA
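Why the owned-to-borrowed change matters can be shown with a simplified stand-in. The `DeviceBox` and `gemm_ex` below are hypothetical miniatures; the real `acl_blas_gemm_ex` takes many more parameters and `DeviceBox<T>` wraps NPU memory, not a host value.

```rust
// Simplified stand-in for a device allocation (the real DeviceBox<T>
// owns NPU memory).
struct DeviceBox<T>(T);

// Taking alpha/beta by reference (&DeviceBox<T>) instead of by value
// lets one allocation be reused across every GEMM in a pipeline.
fn gemm_ex(alpha: &DeviceBox<f32>, beta: &DeviceBox<f32>) -> f32 {
    // Placeholder arithmetic; a real call would launch a GEMM.
    alpha.0 + beta.0
}

fn mha_pipeline() -> f32 {
    let alpha = DeviceBox(1.0f32);
    let beta = DeviceBox(0.0f32);
    // Q*K^T, then weights*V: same scalars, no re-allocation per call.
    gemm_ex(&alpha, &beta) + gemm_ex(&alpha, &beta)
}
```

With the previous owned signature, the second call would have required allocating and uploading alpha/beta again (or moving them out of scope after the first call).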

Device-side Rust language support: Core operators and codegen are complete:

  • Operators: Add, Sub, Mul, Div, Rem, bitwise ops (BitAnd, BitOr, Shl, Shr) ✓ Implemented
  • Codegen: Signed/float remainder, float-integer conversions ✓ Implemented
  • Type casting: Cast codegen for f16↔f32 conversions ✓ Implemented
  • Iterator combinators: map, filter, fold, zip, enumerate, etc.
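As a host-side illustration of the combinator style this enables in kernel bodies (device versions operate on NPU buffers rather than slices, and this exact function is not from ascend_std):

```rust
// Combinator-style elementwise kernel body: zip two inputs, map the
// arithmetic, and fold a reduction -- the code shapes the device
// backend now supports.
fn scaled_add_sum(a: &[f32], b: &[f32], scale: f32) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x + y) * scale)
        .fold(0.0, |acc, v| acc + v)
}
```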

Mid-term Goals: Ecosystem Integration

ascend_compile as the universal compilation backend: The standalone ascend_compile crate provides a single, validated compilation path for any tool that generates AscendC C++ kernels. It exposes four interfaces:

| Interface | Consumer | Use Case |
|---|---|---|
| Rust API | rustc_codegen_mlir | ascend-rs’s own MLIR→C++→binary pipeline |
| C ABI (libascend_compile.so) | Python via ctypes | Drop-in replacement for TileLang’s libgen.py |
| CLI (ascend-compile) | Shell scripts, CI | Ad-hoc compilation and validation |
| Python wrapper (ascend_compile.py) | TileLang, Triton backends | Direct Python integration |

Key features that benefit all consumers:

  • Three validation passes before compilation: an entry-point check, a DMA/sync-barrier check (error on 310P, warning on 910B), and a buffer-size check against hardware limits
  • Dual flag paths: --cce-aicore-arch for 310P/310B and --npu-arch -xasc for 910B (TileLang-compatible)
  • Both object and shared library output: -c -o out.o or -fPIC --shared -o out.so
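A sketch of what such a pre-compilation validation pass might look like. The check heuristics, symbol names, and the byte limit are invented for illustration; the real passes in ascend_compile inspect the kernel far more thoroughly.

```rust
// Hypothetical pre-compilation checks in the spirit of the three
// validation passes listed above. Heuristics and limits are made up.
fn validate_kernel(
    source: &str,
    ub_limit_bytes: usize,
    requested_ub: usize,
) -> Result<(), String> {
    // Pass 1: entry-point check.
    if !source.contains("extern \"C\"") {
        return Err("no extern \"C\" entry point found".into());
    }
    // Pass 2: DMA/sync-barrier check (error on 310P, warning on 910B).
    if source.contains("copy_gm_to_ub") && !source.contains("pipe_barrier") {
        return Err("DMA without a sync barrier".into());
    }
    // Pass 3: buffer size vs. hardware limits.
    if requested_ub > ub_limit_bytes {
        return Err(format!(
            "UB request {} exceeds limit {}",
            requested_ub, ub_limit_bytes
        ));
    }
    Ok(())
}
```

Catching these classes of bugs before invoking bisheng is what distinguishes this path from a bare subprocess call.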

TileLang-Ascend integration: TileLang generates optimized AscendC C++ kernels from a Python DSL but relies on a bare subprocess.run(bisheng, ...) call with no validation. Replacing LibraryGenerator.compile_lib() with ascend_compile.compile_kernel() provides:

  • Automatic target detection and correct flag selection
  • Pre-compilation validation that catches common NPU bugs (missing sync barriers, buffer overflows)
  • Consistent compilation across tools — the same flags ascend-rs uses for its own validated kernels

PyPTO integration: PyPTO (Parallel Tile Operations) is CANN’s high-level operator programming framework; it compiles Python-level tensor operations through a ~90-instruction PTO virtual ISA down to AscendC C++ code. When PyPTO is released alongside the CANN framework, ascend_compile can serve as its compilation backend. An ascend-rs interface to PyPTO would additionally enable memory-safe static analysis of tile-level operators, catching buffer overflows, missing synchronization barriers, and incorrect DMA parameters at compile time, whereas PyPTO currently validates them only at code-generation time.

Triton-Ascend backend: Triton’s compiler pipeline produces target-specific IR that must be lowered to device binaries. A Triton backend for Ascend can use ascend_compile to handle the final AscendC C++ → NPU binary step, benefiting from the same validation and target abstraction.

PyTorch integration path: torch.compile with an Ascend backend could leverage ascend_compile through its C ABI to compile generated kernels without a Python→Rust dependency, using the same libascend_compile.so that TileLang uses.

Complete host API: All major CANN API modules now have safe Rust wrappers:

  • Tensor descriptors ✓ — TensorDesc, DataBuffer, Dataset (28 methods)
  • Model inference ✓ — Model::from_file(), execute(), execute_async(), ModelDescription (16 methods)
  • Event management ✓ — AclEvent with record/sync/timing (8 methods)
  • DVPP image preprocessing ✓ — DvppChannel, PicDesc, resize/crop/JPEG/PNG (42 methods)
  • Profiling API ✓ — ProfSession, ProfConfig, StepInfo, ProfStamp (18 methods)
  • HCCL distributed communication ✓ — AllReduce, AllGather, Broadcast, ReduceScatter, Send/Recv (17 methods)

MLIR codegen backend improvements:

  • Rust intrinsics ✓ — bit manipulation (ctlz/cttz/ctpop/bswap/bitreverse/rotate), float math (floor/ceil/round/trunc/copysign/fma), overflow arithmetic, saturating arithmetic
  • Float constant support ✓ — proper MLIR attribute formatting with decimal points
  • C++ codegen intrinsic translation ✓ — all LLVM intrinsics now mapped to GCC builtins and C math functions
  • Correctness fixes ✓ — raw_eq (byte comparison), discriminant_value (enum match), const_uint_big (i128), static_addr_of (global symbols), codegen_static (initializer values)
  • Debug info generation (not yet started)
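On the Rust side, the lowered intrinsics correspond to ordinary standard-library calls. A few of the shapes involved (standard Rust; which exact forms the backend handles is described in the list above):

```rust
// Rust-surface forms of the lowered intrinsics: bit manipulation,
// overflow-aware and saturating arithmetic, and float rounding.
fn examples() -> (u32, (u8, bool), u8, f64) {
    let lz = 0x0000_00ffu32.leading_zeros(); // ctlz
    let of = 250u8.overflowing_add(10);      // overflow arithmetic
    let sat = 250u8.saturating_add(10);      // saturating arithmetic
    let fl = 2.7f64.floor();                 // float math
    (lz, of, sat, fl)
}
```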

Long-term Vision

Ascend target specification (davinci-huawei-none): We have prepared a concrete Tier-3 target proposal for the Rust compiler. The target triple davinci-huawei-none follows established conventions (nvptx64-nvidia-cuda, amdgcn-amd-amdhsa) and defines the ABI, calling conventions, and pointer sizes for the DaVinci NPU architecture. The target spec (upstream-tier3/compiler/rustc_target/src/spec/targets/davinci_huawei_none.rs) uses aarch64-unknown-none as the LLVM placeholder (since no DaVinci LLVM backend exists) and registers cfg(target_arch = "davinci") for conditional compilation. The upstream-tier3/ directory contains the complete submission package: the target spec, platform-support documentation, patches for mod.rs/platform-support.md/bootstrap/sanity.rs, and community engagement materials (Zulip post, optional MCP draft, PR description). Our engagement plan: (1) post to Zulip #t-compiler/help for early feedback on the triple name, (2) file an MCP if the novel MLIR codegen backend warrants compiler-team consensus, (3) open a draft PR to rust-lang/rust. Tier-3 targets have the lowest bar (no RFC, no CI, single-reviewer approval), and our in-tree changes contain no proprietary code.

Reducing the no_core burden: Maintaining a parallel core library reimplementation is a massive engineering effort. The long-term direction is to explore using -Zbuild-std=core with the MLIR backend to compile the Rust standard library source directly, rather than reimplementing by hand.

A unified Ascend compilation stack: The ascend_compile crate is the first step toward a unified compilation infrastructure where multiple frontends (Rust, Python DSLs, compiler IRs) share the same validated, target-aware backend. This mirrors the LLVM model — many frontends, one backend — but specialized for Ascend NPU hardware:

graph TD
    A1["Rust kernels"] --> F["AscendC C++ · common IR"]
    A2["TileLang (planned)"] -.-> F
    A3["Triton (planned)"] -.-> F
    A4["torch.compile (planned)"] -.-> F
    A5["PyPTO (planned)"] -.-> F
    A6["Future DSLs (planned)"] -.-> F
    F --> G["ascend_compile: validate → target flags → bisheng → binary"]
    G --> H["NPU Binary · .o / .so"]

Community Involvement

ascend-rs is currently in a private repository, pending an organizational decision on open-sourcing. Once released, it will welcome community participation. If you have Ascend NPU hardware and are interested in exploring memory-safe kernel programming, here are areas where contributions would be valuable:

  1. Add new vector intrinsics to ascend_std: Following the established pattern of extern "C" stubs + mlir_to_cpp handlers.
  2. Write more compiletest tests: As new features are added to ascend_std, corresponding compile tests should follow.
  3. Expand host API wrappers: The CANN SDK has many unwrapped APIs, each of which can be contributed independently.
  4. Try writing more complex Rust kernels: Help discover gaps in the codegen backend and validate new intrinsics on NPU hardware.
  5. Integrate ascend_compile with your tool: If you work on TileLang, Triton, or other kernel compilers targeting Ascend, try replacing your compilation step with ascend_compile and report issues.
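For item 1, the stub half of that pattern can be sketched like this. The symbol name and the host fallback body are hypothetical; in ascend_std, the corresponding mlir_to_cpp handler replaces the stub with the AscendC intrinsic during codegen.

```rust
// Hypothetical intrinsic stub: during codegen, the backend recognizes
// the symbol and emits the AscendC `Abs` builtin in its place. The
// body here is only a host-side fallback for illustration.
pub extern "C" fn ascend_vec_abs_f32_stub(x: f32) -> f32 {
    x.abs()
}
```

The matching mlir_to_cpp handler (not shown) is where the symbol-to-builtin mapping lives; adding a new intrinsic means adding both halves plus a compiletest.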