13. Next Steps: Roadmap and Vision
Current Status
ascend-rs has moved well past alpha in the areas covered by the preceding chapters. This roadmap therefore covers only what remains: everything demonstrated in Chapters 2–7, 9, 10, 11, and 12 is treated as shipped and omitted here.
- Host API: Alpha-complete. ACL, memory, streams, events, HCCL, DVPP, profiling, and BLAS all have safe Rust wrappers.
- ascend_compile crate: Standalone compilation library with Rust API, C ABI, CLI, and Python bindings. It is the single path from AscendC C++ to NPU binary for every frontend in the stack (see §7.1.1 for the architecture).
- Device runtime: 1565 Rust NPU kernels (489 compiletests + 16 deployable), 413 passing NPU correctness on Ascend 910B3 across 17 MultiKernelBench categories.
- PyPTO / PTO-MLIR path: Integrated. Emitter (mlir_to_pto) → ptoas 0.26 → AscendC → bisheng. DeepSeek-R1-Distill-Qwen-1.5B end-to-end decode at 114–187 tok/s on 910B2 via this path (Chapter 10).
- Safety oracle: Shipped (Chapters 11 & 12). Six check_* passes on the stage-2 plan ptoas produces; Path A + Path C ingress for linalg kernels from third-party frontends; catches PlanMemoryPass placement bugs that ptoas itself accepts with rc=0.
What follows are three directions — not a task list. Each absorbs multiple threads that used to be tracked separately.
Direction 1: Close the Kernel-Authoring Loop — Double Buffering, Iterators, Debuginfo
The core MLIR backend is feature-complete for the ops Chapters 3–5 exercise: arithmetic, reductions, unary math, scalar-vector, sixteen activations, seventeen composite ops, cube-engine matmul (including hardware L1→L0B transpose), and type-safe buffer newtypes that prevent mixing UB/L1/L0 at compile time. What’s left is about expressiveness and developer experience rather than new ops.
- Queue-based pipelining (TQue). The existing codegen emits straight-line kernels and infers pipe_barrier automatically (§7 BufDepTracker). Moving to TQue-based double/triple-buffering lets DMA overlap with compute, which is the next performance headroom on memory-bound kernels. DeepSeek decode lm_head already exploits this through manual chunking (§9 chunk sweep); the compiler should do it for you.
- Iterator combinators in kernel code. map, filter, fold, zip, enumerate: the shapes a kernel author expects to write. These need matching codegen in mlir_to_cpp / mlir_to_pto; the ops themselves exist.
- Debug info. The MLIR backend currently emits no DWARF. Adding it unlocks gdb/lldb stepping inside generated AscendC, which is the missing piece for anyone debugging a kernel that “runs but gives the wrong answer” despite passing the oracle.
These are engineering, not research. Each is a bounded piece of work with a clear acceptance test: cargo run a kernel that uses .iter().map().sum() and see the generated .cpp use ReduceSum correctly; step through a kernel’s UB accesses in a debugger.
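The iterator acceptance test can be pictured with plain host Rust. The sketch below is not ascend-rs kernel code: it uses no ascend_std buffer types or kernel attributes, and the function names are hypothetical. It only shows the source shape a kernel author would want to write, next to the explicit loop it should be equivalent to.

```rust
/// Hypothetical kernel body written with iterator combinators.
/// Plain host Rust: no ascend_std types or kernel attributes appear here.
fn sum_of_squares(input: &[f32]) -> f32 {
    input.iter().map(|x| x * x).sum()
}

/// The same reduction as an explicit loop, used as the reference result.
fn sum_of_squares_loop(input: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in input {
        acc += x * x;
    }
    acc
}

fn main() {
    let data = [1.0f32, 2.0, 3.0, 4.0];
    println!("iterator: {}", sum_of_squares(&data));
    println!("loop:     {}", sum_of_squares_loop(&data));
}
```

On the device side, the test described above would pass when the generated .cpp for the iterator form contains a ReduceSum and produces the same value as the loop form.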
Direction 2: ascend_compile as the Universal Compilation Backend
The Chapter 7 architecture already names four frontends besides our own mlir_to_cpp / mlir_to_pto: TileLang, Triton-Ascend, torch.compile, and PyPTO. Each produces AscendC C++; each currently calls bisheng directly with inconsistent flags and no validation. The roadmap here is less about new code than about wiring:
- TileLang drives a subprocess.run(bisheng, ...) with no validation today; dropping in ascend_compile via its Python wrapper gives TileLang automatic target detection, the three validation passes, and flag parity with our own kernels.
- Triton-Ascend lowers its IR to AscendC; the last mile is identical for any C++ frontend.
- torch.compile with an Ascend backend can call libascend_compile.so via ctypes and skip the Python-to-Rust dependency entirely.
- PyPTO, when it ships alongside CANN, is the most natural customer: its ~90-instruction virtual ISA already lowers to AscendC, and running it through ascend_compile means the safety oracle can see the same plans.
The deliverable is not more backends — it is fewer bespoke compilation pipelines in the Ascend ecosystem. The LLVM picture applies: many frontends, one validated backend.
This also sets up Direction 3 by creating a common interception point for the safety oracle.
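The "many frontends, one validated backend" shape can be sketched in a few lines of Rust. Everything here is hypothetical: the type and function names are not the ascend_compile API, and the two checks are placeholders rather than the three validation passes mentioned above. The point is only that every frontend goes through one entry point, so any check added there applies to all of them.

```rust
// Hypothetical sketch: none of these names come from the ascend_compile API.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Frontend {
    AscendRs,
    TileLang,
    TritonAscend,
    TorchCompile,
    PyPto,
}

/// What every frontend hands over: AscendC C++ source plus a target name.
struct CompileRequest {
    frontend: Frontend,
    source: String,
    target: String,
}

/// A validation pass is modelled as a plain function over the request.
type Check = fn(&CompileRequest) -> Result<(), String>;

fn check_source_nonempty(req: &CompileRequest) -> Result<(), String> {
    if req.source.trim().is_empty() {
        return Err(format!("{:?}: empty AscendC source", req.frontend));
    }
    Ok(())
}

fn check_target_known(req: &CompileRequest) -> Result<(), String> {
    // Placeholder list: the two chips named in this chapter.
    const KNOWN: [&str; 2] = ["910B2", "910B3"];
    if KNOWN.contains(&req.target.as_str()) {
        Ok(())
    } else {
        Err(format!("{:?}: unknown target {}", req.frontend, req.target))
    }
}

/// Single entry point: every frontend passes through the same checks.
fn compile(req: &CompileRequest) -> Result<String, String> {
    let checks: [Check; 2] = [check_source_nonempty, check_target_known];
    for check in checks.iter() {
        check(req)?;
    }
    // A real backend would invoke the compiler here; this sketch only reports.
    Ok(format!(
        "{:?} -> {} ({} bytes validated)",
        req.frontend,
        req.target,
        req.source.len()
    ))
}

fn main() {
    let ok = CompileRequest {
        frontend: Frontend::TileLang,
        source: "// kernel".to_string(),
        target: "910B3".to_string(),
    };
    let bad = CompileRequest {
        frontend: Frontend::PyPto,
        source: "   ".to_string(),
        target: "910B3".to_string(),
    };
    println!("{:?}", compile(&ok));
    println!("{:?}", compile(&bad));
}
```

Because the check list lives inside compile rather than in each frontend, a safety-oracle pass could be appended to it without touching any caller.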
Direction 3: Widen the Safety Oracle’s Reach
Chapters 11 and 12 ship the oracle for PTO-MLIR and for ingested linalg. The natural next moves keep the passes and swap the parser:
- PTX (NVIDIA). Chapter 12 §12.3.3 describes this: the six check_* passes are not Ascend-specific; they operate on a stage-2 plan of (space, offset, rows, cols, dtype, blayout, slayout) tuples. A parse_ptx_stage2 would let them run on PTX emitted by mlir_to_gpu, by a runtime-PTX project like OxiCUDA, or by anything else. Shared-memory bank conflicts, aliased __shared__ arrays, capacity overshoots against a 48 KB or 100 KB per-SM limit all map onto the existing checks.
- Other vendors via the shared tile IR. crates/rustc_codegen_mlir/src/ has 15 mlir_to_* backends today (aie, bang, cpp, csl, gaudi, gpu, hexagon, linalg, msl, musa, nki, pto, spirv, …). Each lowers from the same ascend_tile_* dialect. A parser that reads that dialect directly, before any vendor-specific lowering, gives the oracle the earliest, cleanest shot at every target.
- Upstream to Rust. A concrete Tier-3 target spec (davinci-huawei-none) is prepared in upstream-tier3/: target triple, ABI, platform-support docs, patches to mod.rs / platform-support.md / bootstrap/sanity.rs, and the community materials (Zulip post, optional MCP draft, PR description). The engagement plan is (1) Zulip #t-compiler/help for the triple name, (2) MCP if the novel MLIR codegen warrants compiler-team consensus, (3) draft PR to rust-lang/rust. Tier-3 has the lowest bar (no RFC, no CI, single-reviewer approval), and our in-tree changes contain no proprietary code.
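To make the "keep the passes, swap the parser" idea concrete, here is a minimal sketch of two checks that only look at plan entries. It is an illustration under stated assumptions, not the oracle's code: the type and function names are hypothetical, the entry drops blayout and slayout from the tuple described above, dtype is reduced to a byte width, and the two functions are stand-ins rather than any of the six check_* passes. The 48 KB limit is the per-SM figure quoted in the PTX bullet.

```rust
/// Memory space of a tile. Hypothetical names for this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Space {
    Ub,
    L1,
    Shared,
}

/// Simplified plan entry: blayout and slayout are omitted,
/// and dtype is reduced to its width in bytes.
#[derive(Debug, Clone, Copy)]
struct PlanEntry {
    space: Space,
    offset: usize, // byte offset inside the space
    rows: usize,
    cols: usize,
    dtype_bytes: usize,
}

impl PlanEntry {
    /// One past the last byte this tile occupies.
    fn end(&self) -> usize {
        self.offset + self.rows * self.cols * self.dtype_bytes
    }
}

/// Index pairs of tiles in the same space whose byte ranges intersect.
fn find_overlaps(plan: &[PlanEntry]) -> Vec<(usize, usize)> {
    let mut hits = Vec::new();
    for i in 0..plan.len() {
        for j in (i + 1)..plan.len() {
            let (a, b) = (&plan[i], &plan[j]);
            if a.space == b.space && a.offset < b.end() && b.offset < a.end() {
                hits.push((i, j));
            }
        }
    }
    hits
}

/// Indices of tiles in `space` that extend past `capacity` bytes.
fn find_capacity_overshoots(plan: &[PlanEntry], space: Space, capacity: usize) -> Vec<usize> {
    plan.iter()
        .enumerate()
        .filter(|(_, e)| e.space == space && e.end() > capacity)
        .map(|(i, _)| i)
        .collect()
}

/// Three shared-memory tiles: 0 and 1 alias, 2 runs past 48 KB.
fn demo_plan() -> Vec<PlanEntry> {
    vec![
        PlanEntry { space: Space::Shared, offset: 0, rows: 64, cols: 64, dtype_bytes: 4 },
        PlanEntry { space: Space::Shared, offset: 8192, rows: 64, cols: 64, dtype_bytes: 2 },
        PlanEntry { space: Space::Shared, offset: 40960, rows: 64, cols: 64, dtype_bytes: 4 },
    ]
}

fn main() {
    let plan = demo_plan();
    println!("overlaps:   {:?}", find_overlaps(&plan));
    println!("overshoots: {:?}", find_capacity_overshoots(&plan, Space::Shared, 48 * 1024));
}
```

Neither function knows where the entries came from, which is the property that lets a new parser reuse existing checks unchanged.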
The long-term question lurking behind all three bullets: can the #![no_core] reimplementation in ascend_std eventually be replaced by -Zbuild-std=core on top of the upstream target? That would cut the biggest maintenance tax the project carries today.
Community Involvement
ascend-rs is pending an open-source release decision. Once public, the contribution surface will be:
- New ascend_std intrinsics: follow the extern "C" stub + mlir_to_cpp handler pattern.
- Kernel corpus: write real kernels, report codegen gaps.
- Host API coverage: CANN has more APIs than we wrap.
- Frontend integrations: if you work on TileLang, Triton, PyPTO, or torch.compile’s Ascend path, try replacing your compile step with ascend_compile and file the issues.
- Oracle parsers: write a stage-2 parser for another IR (PTX, SPIR-V, LLVM NVPTX), and the six check_* passes come for free.
check_*passes come for free.