8. Next Steps: Roadmap and Vision
Current Status
ascend-rs has moved well past alpha in the areas covered by the preceding chapters. This roadmap focuses on what remains — the items the earlier chapters do not already demonstrate. Everything already demonstrated in Chapters 2–7, 9, 10, and 11 is treated as shipped and omitted here.
- Host API: Alpha-complete. ACL, memory, streams, events, HCCL, DVPP, profiling, and BLAS all have safe Rust wrappers.
- ascend_compile crate: Standalone compilation library with Rust API, C ABI, CLI, and Python bindings. It is the single path from AscendC C++ to NPU binary for every frontend in the stack.
- Device runtime: 1565 Rust NPU kernels (489 compiletests + 16 deployable), 413 passing NPU correctness on Ascend 910B3 across 17 MultiKernelBench categories.
- PyPTO / PTO-MLIR path: Integrated. Emitter (mlir_to_pto) → ptoas 0.26 → AscendC → bisheng. DeepSeek-R1-Distill-Qwen-1.5B end-to-end decode at 114–187 tok/s on 910B2 via this path (Chapter 10).
- PTO safety oracle: Shipped (Chapter 11). pto_to_rust catches PlanMemoryPass placement bugs that ptoas itself accepts with rc=0.
- Performance parity with hand-tuned AscendC: Achieved on softmax, activations, vec_add, and all four DeepSeek decode matmul shapes (Chapter 9, 10).
Short-term Goals
The short list of things not yet in tree:
- Tiling and double-buffering: Queue-based (TQue) pipeline API for overlapping DMA and compute. The PTO path already pipelines implicitly via PlanMemoryPass; this goal is the ascend_std buffer-API analogue.
- Iterator combinators: map, filter, fold, zip, enumerate on device-side Rust slices; currently usable but inefficiently lowered.
- Debug info generation: DWARF sections for NPU binaries so ccec-level diagnostics link back to Rust source.
- Qwen-7B / DeepSeek-V2-Lite model upgrade: 1.5B-distill is too weak a headline; 7B and 16B-MoE are the publishable stories (tracked in project_deepseek_model_upgrade_plan).
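To make the iterator-combinator goal concrete, here is a minimal sketch in plain host Rust (not ascend_std, and not taken from the ascend-rs tree). It shows an index-loop reduction next to the combinator form that should ideally lower to the same code. The function names are illustrative assumptions.

```rust
/// Index-loop form: the shape a hand-written kernel body usually takes.
fn dot_loop(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut acc = 0.0f32;
    for i in 0..n {
        acc += a[i] * b[i];
    }
    acc
}

/// Combinator form: zip + map + fold over the same slices.
fn dot_iter(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| x * y)
        .fold(0.0f32, |acc, v| acc + v)
}

/// filter: count the elements strictly above a threshold.
fn count_above(xs: &[f32], threshold: f32) -> usize {
    xs.iter().filter(|v| **v > threshold).count()
}

/// enumerate + fold: index and value of the first maximum, if any.
fn argmax(xs: &[f32]) -> Option<(usize, f32)> {
    xs.iter()
        .enumerate()
        .fold(None, |best: Option<(usize, f32)>, (i, &v)| match best {
            Some((_, bv)) if bv >= v => best,
            _ => Some((i, v)),
        })
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [0.5f32; 4];
    // Both forms accumulate in the same order, so the results match exactly.
    assert_eq!(dot_loop(&a, &b), dot_iter(&a, &b));
    println!("dot = {}", dot_iter(&a, &b));
    println!("count above 2.5 = {}", count_above(&a, 2.5));
    println!("argmax = {:?}", argmax(&a));
}
```

None of these forms allocate, which is why they are plausible candidates for device-side slices; whether a given backend lowers them as tightly as the index loop is exactly the open item above.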
Mid-term Goals: Ecosystem Integration
ascend_compile is designed as a single validated backend for every AscendC C++ producer. PyPTO is already plugged in; the remaining frontends are the mid-term work:
- TileLang → ascend_compile: TileLang currently calls bisheng via a bare subprocess.run with no validation. Replacing LibraryGenerator.compile_lib() with ascend_compile.compile_kernel() gives TileLang the same validation passes (entry-point, DMA/sync barrier, buffer-vs-cap) that ascend-rs uses for its own kernels.
- Triton → Ascend: A Triton backend for Ascend can use ascend_compile to handle the final AscendC C++ → NPU binary step, so the Triton team does not need to duplicate the target-flag / validation logic already in ascend_compile.
- PyTorch → Ascend: torch.compile with an Ascend backend can link against libascend_compile.so via C ABI: no Python-to-Rust dependency, and the same binary TileLang uses.
- PTO safety oracle → upstream ptoas: Chapter 11 listed six invariants the oracle enforces externally. Folding the first four (aliasing, capacity, op-constraint, matmul-bounds) into ptoas’s own VerifyAfterPlanMemoryPass would make them a first-class compiler guarantee rather than an opt-in external check.
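The value of a shared backend is that every frontend gets the same checks before bisheng is invoked. The sketch below illustrates two of the named passes (entry-point and buffer-vs-cap) in the simplest possible form. It is not the ascend_compile API: `KernelSource`, `ValidationError`, and `validate` are hypothetical names, the entry-point check is a crude textual match, and the capacity figures are placeholders rather than real hardware limits. The DMA/sync barrier pass is omitted.

```rust
#[derive(Debug, PartialEq)]
enum ValidationError {
    /// The requested kernel entry point does not appear in the source.
    MissingEntryPoint(String),
    /// The declared on-chip buffers do not fit in the given capacity.
    BufferOverCapacity { requested: usize, capacity: usize },
}

/// Hypothetical description of one AscendC C++ kernel handed to the backend.
struct KernelSource<'a> {
    code: &'a str,
    entry: &'a str,
    /// Sizes in bytes of the on-chip buffers the kernel declares.
    buffer_bytes: &'a [usize],
}

/// Runs the two checks in order and reports the first failure.
fn validate(src: &KernelSource, capacity_bytes: usize) -> Result<(), ValidationError> {
    // Entry-point check (textual, for illustration only): the source must
    // contain a __global__ marker and a declaration of the entry name.
    let decl = format!("{}(", src.entry);
    if !(src.code.contains("__global__") && src.code.contains(decl.as_str())) {
        return Err(ValidationError::MissingEntryPoint(src.entry.to_string()));
    }
    // Buffer-vs-cap check: total declared buffer size must fit on chip.
    let requested: usize = src.buffer_bytes.iter().sum();
    if requested > capacity_bytes {
        return Err(ValidationError::BufferOverCapacity {
            requested,
            capacity: capacity_bytes,
        });
    }
    Ok(())
}

fn main() {
    let code = "extern \"C\" __global__ __aicore__ void add_kernel(GM_ADDR x, GM_ADDR y) {}";
    let src = KernelSource {
        code,
        entry: "add_kernel",
        buffer_bytes: &[64 * 1024, 64 * 1024],
    };
    // Placeholder capacities, chosen only to exercise both outcomes.
    println!("{:?}", validate(&src, 192 * 1024));
    println!("{:?}", validate(&src, 96 * 1024));
}
```

A frontend that shells out to the compiler directly gets neither check; routing through one validated entry point is what makes the failures uniform across TileLang, Triton, and torch.compile.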
Long-term Vision
Ascend target specification (davinci-huawei-none): A concrete Tier-3 target proposal is ready for the Rust compiler. The target triple follows the nvptx64-nvidia-cuda / amdgcn-amd-amdhsa conventions and defines ABI, calling conventions, and pointer sizes for DaVinci. The spec at upstream-tier3/compiler/rustc_target/src/spec/targets/davinci_huawei_none.rs uses aarch64-unknown-none as the LLVM placeholder (no DaVinci LLVM backend exists yet) and registers cfg(target_arch = "davinci"). Engagement plan: (1) a Zulip #t-compiler/help post for early feedback on the triple, (2) an MCP if the MLIR codegen backend warrants compiler-team consensus, (3) a draft PR to rust-lang/rust. Tier-3 has the lowest bar: no RFC, no CI, single-reviewer approval.
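Registering cfg(target_arch = "davinci") matters because it lets shared code branch on the target at compile time. The following is a minimal sketch of that usage, assuming the cfg value named above; `execution_target` is a hypothetical helper. On any toolchain where the target is not registered, the condition is simply false and the host branch is taken.

```rust
/// Reports which side of the host/device split this code was compiled for.
/// The `allow` silences the check-cfg lint that newer toolchains emit for a
/// target_arch value they do not know about.
#[allow(unexpected_cfgs)]
fn execution_target() -> &'static str {
    if cfg!(target_arch = "davinci") {
        "davinci"
    } else {
        "host"
    }
}

fn main() {
    println!("compiled for: {}", execution_target());
}
```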
Reducing the no_core burden: A parallel core reimplementation is a heavy engineering tax. The direction is to explore -Zbuild-std=core with the MLIR backend and compile the standard library source directly rather than reimplement by hand.
A unified Ascend compilation stack: Chapter 7 showed ascend_compile as the IR hub today. The long-term picture closes the loop between frontends, the shared stage-2 plan, and the safety oracle — so every path into an NPU binary passes through the same validated pipeline and the same compile-time guarantees:
graph TD
A1["Rust kernels<br/>(shipped)"] ==> F
A5["PyPTO / PTO-MLIR<br/>mlir_to_pto → ptoas<br/>(shipped · Chapter 7,10)"] ==> F
A2["TileLang<br/>(planned)"] -.-> F
A3["Triton<br/>(planned)"] -.-> F
A4["torch.compile<br/>(planned)"] -.-> F
A6["Future DSLs"] -.-> F
F["AscendC C++<br/>common IR"] ==> O["pto_to_rust safety oracle<br/>(shipped · Chapter 11)<br/>aliasing · capacity · op-constraint<br/>matmul-bounds · dead-tile · linear-use"]
F ==> G["ascend_compile<br/>validate → target flags → bisheng"]
O -.->|"diagnostics on<br/>original .acl.pto"| A5
O -.->|"upstream candidates<br/>VerifyAfterPlanMemoryPass"| U["ptoas (future)"]
G ==> H["NPU Binary · .o / .so"]
H ==> D["DeepSeek e2e<br/>114–187 tok/s on 910B2<br/>(shipped · Chapter 10)"]
classDef shipped fill:#d4f5d4,stroke:#2b8a3e,stroke-width:2px
classDef planned fill:#f5f5f5,stroke:#adb5bd,stroke-dasharray:3 3
class A1,A5,F,G,O,H,D shipped
class A2,A3,A4,A6,U planned
Bold edges are paths already running in tree; dashed edges are planned. The diagram makes the one asymmetry explicit: today the oracle observes ptoas from outside. The dashed edge from the oracle to ptoas (future) is the upstream-integration arrow: once the first four oracle checks land inside ptoas’s VerifyAfterPlanMemoryPass, that part of the diagram collapses into a single node.
Community Involvement
ascend-rs is currently in a private repository pending an organizational decision on open-sourcing. Once released, these are the tractable contribution slots:
- Add new vector intrinsics to ascend_std: Follow the established pattern of extern "C" stubs + mlir_to_cpp handlers.
- Write more compiletest tests: As ascend_std grows, compile tests should follow.
- Expand host API wrappers: CANN has many unwrapped APIs; each is an independent contribution.
- Write more complex Rust kernels: Help discover gaps in the codegen backend and validate new intrinsics on NPU hardware.
- Integrate ascend_compile with your tool: If you work on TileLang, Triton, or another kernel compiler targeting Ascend, try replacing your compilation step with ascend_compile and report issues.
- Extend the PTO safety oracle: pto_to_rust is ~600 lines. Additional checks (loop-aware liveness to promote [linear-use] from warning to error, per-SoC DeviceSpec entries for 910C / 310P3) are self-contained PRs.
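For contributors sizing up the oracle work, the sketch below shows what a per-SoC capacity check and a pairwise aliasing check can look like. It is an illustration under stated assumptions, not the pto_to_rust implementation: the `DeviceSpec` fields, `Tile`, `Violation`, and `check_placement` are hypothetical, the byte capacity is a placeholder rather than a real SoC figure, and every tile is treated as live at the same time (no liveness analysis).

```rust
/// Hypothetical per-SoC description; real entries would carry more fields.
#[derive(Debug, Clone, Copy)]
struct DeviceSpec {
    name: &'static str,
    /// On-chip buffer capacity in bytes (placeholder value in this sketch).
    ub_bytes: usize,
}

/// A tile placed at a byte offset in the on-chip buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Tile {
    offset: usize,
    len: usize,
}

#[derive(Debug, PartialEq)]
enum Violation {
    /// Tile `tile` extends past the end of the buffer.
    Capacity { tile: usize },
    /// Tiles `a` and `b` occupy overlapping byte ranges.
    Aliasing { a: usize, b: usize },
}

/// Capacity check first, then a pairwise overlap check over all tiles.
fn check_placement(spec: &DeviceSpec, tiles: &[Tile]) -> Vec<Violation> {
    let mut out = Vec::new();
    for (i, t) in tiles.iter().enumerate() {
        if t.offset + t.len > spec.ub_bytes {
            out.push(Violation::Capacity { tile: i });
        }
    }
    for i in 0..tiles.len() {
        for j in (i + 1)..tiles.len() {
            let (a, b) = (tiles[i], tiles[j]);
            // Half-open ranges [offset, offset + len) overlap iff each
            // starts before the other ends.
            if a.offset < b.offset + b.len && b.offset < a.offset + a.len {
                out.push(Violation::Aliasing { a: i, b: j });
            }
        }
    }
    out
}

fn main() {
    let spec = DeviceSpec { name: "example-soc", ub_bytes: 1024 };
    let tiles = [
        Tile { offset: 0, len: 512 },
        Tile { offset: 256, len: 512 },
        Tile { offset: 768, len: 512 },
    ];
    println!("{}: {:?}", spec.name, check_placement(&spec, &tiles));
}
```

Adding a new SoC in this shape is one more `DeviceSpec` value plus tests, which is why such entries fit in a self-contained PR.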