8. Next Steps: Roadmap and Vision
Current Status
ascend-rs is in active development:
- Host API: Alpha stage. ACL operations, memory management, kernel launching, BLAS, DVPP, profiling, and HCCL are implemented.
- Build tooling: Alpha stage. Supports compilation of both C++ and Rust kernels with automatic codegen path selection.
  - `ascend_compile` crate: Standalone kernel compilation library with C ABI, CLI, and Python bindings. Decouples bisheng invocation from rustc, enabling any C++ kernel generator to compile for Ascend NPU.
- Device runtime: 505 Rust NPU kernels (486 compiletests + 16 deployable + 6 tile) with complete 1:1 MultiKernelBench coverage across 17 categories; 413 tests pass NPU correctness verification on Ascend 910B3 (0 fail, 0 crash), including 37 matmul tests via aclnn operator composition, plus 6 memory safety case studies demonstrating structural advantages over AscendC C++.
- Benchmarks: Rust vector kernels match hand-optimized C++ performance (zero overhead) on softmax, activations, vec_add, and matmul.
Short-term Goals
Vector intrinsic coverage: The vector intrinsic API covers a comprehensive set of operations for f32 and f16:
| Category | Status | Details |
|---|---|---|
| Arithmetic | ✓ Implemented | `Add`, `Sub`, `Mul`, `Div`, `Min`, `Max` |
| Reductions | ✓ Implemented | `ReduceMax`, `ReduceMin`, `ReduceSum` |
| Unary math | ✓ Implemented | `Exp`, `Abs`, `Ln`, `Sqrt`, `Rsqrt`, `Reciprocal` |
| Scalar-vector | ✓ Implemented | `Adds`, `Muls`, `Maxs`, `Mins` (f32 and f16) |
| Activation functions | ✓ Implemented (16 activations) | `Relu`, `Sigmoid`, `Tanh`, `GELU`, `Softmax`, `ELU`, `Swish`, `Mish`, `SELU`, `Softplus`, `Softsign`, `HardSigmoid`, `HardSwish`, `Leaky ReLU`, `Log Softmax` |
| Composite operations | ✓ Implemented (17 composites in `kernel_ops.rs`) | `LayerNorm`, `RMSNorm`, L1/L2 Norm, MSE/Huber/Hinge Loss, Cosine Similarity, SGD Update, Reduce Mean/Prod |
| Cube engine | ✓ Implemented | `matmul_f16` via Mmad FFI (f16 inputs → f32 output) |
| Cube engine transpose | ✓ Implemented | `matmul_f16_transpose_b` with hardware L1→L0B transpose |
| Tiling and double-buffering | | Queue-based (`TQue`) pipeline for overlapping DMA and compute |
| Type-safe buffer handles | ✓ Implemented | `#[repr(transparent)]` newtype wrappers (`UbBuf`, `L1Buf`, `L0aBuf`, `L0bBuf`, `L0cBuf`) that prevent mixing buffer memory levels at compile time |
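The type-safe buffer handle idea can be sketched in plain Rust. This is an illustrative mock, not the real `ascend_std` types: the field layouts, the `copy_l1_to_ub` helper, and its signature are all assumptions made for the example, but the `#[repr(transparent)]` newtype pattern is the one described above.

```rust
// Sketch (not the real ascend_std API): give each memory level its own
// newtype so a routine written for one level cannot accept a buffer from
// another. #[repr(transparent)] keeps the wrapper ABI-identical to *mut u8.
#[repr(transparent)]
struct UbBuf(*mut u8); // Unified Buffer handle
#[repr(transparent)]
struct L1Buf(*mut u8); // L1 buffer handle

// Only accepts an L1 source and a UB destination; swapping the arguments
// is a type error at compile time, not a runtime NPU fault.
fn copy_l1_to_ub(src: &L1Buf, dst: &mut UbBuf, len: usize) {
    // A real implementation would issue a DMA; here we just copy bytes.
    unsafe { std::ptr::copy_nonoverlapping(src.0, dst.0, len) };
}

fn main() {
    let mut a = [1u8, 2, 3, 4];
    let mut b = [0u8; 4];
    let l1 = L1Buf(a.as_mut_ptr());
    let mut ub = UbBuf(b.as_mut_ptr());
    copy_l1_to_ub(&l1, &mut ub, 4);
    assert_eq!(b, [1, 2, 3, 4]);
    // copy_l1_to_ub(&ub, &mut l1, 4); // rejected: mismatched buffer types
}
```

The zero-cost part matters here: because the wrappers are `repr(transparent)` newtypes, the level distinction exists only in the type system and compiles away entirely.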
End-to-end neural network operator examples:
- Conv2D ✓ — Pre-built operator via `OpsBuilder`/atc, with host-side Model+Dataset execution and CPU reference verification
- Multi-Head Attention (MHA) ✓ — Host-orchestrated scaled dot-product attention pipeline: `Q*K^T` (HGEMM) → scale (Rust kernel) → row-wise softmax (Rust kernel with f16 reduce/exp/muls intrinsics) → `weights*V` (HGEMM)
- BLAS API improvement ✓ — `acl_blas_gemm_ex` alpha/beta changed from owned to borrowed (`&DeviceBox<T>`), enabling reuse across multiple GEMM calls in pipelines like MHA
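The MHA pipeline above can be expressed as a compact CPU reference, the same role the CPU verification plays for Conv2D. This f32 sketch is for illustration only; on the NPU the two matmuls run as HGEMMs in f16 and scale/softmax run as Rust kernels.

```rust
// CPU reference of the pipeline: Q·K^T → scale by 1/sqrt(d) →
// row-wise softmax → weights·V. All buffers are row-major slices.
fn softmax_rows(x: &mut [f32], rows: usize, cols: usize) {
    for r in 0..rows {
        let row = &mut x[r * cols..(r + 1) * cols];
        let max = row.iter().cloned().fold(f32::MIN, f32::max);
        let sum: f32 = row.iter().map(|v| (v - max).exp()).sum();
        for v in row.iter_mut() {
            *v = (*v - max).exp() / sum; // numerically stable softmax
        }
    }
}

/// q, k, v are s×d row-major; returns the s×d attention output.
fn attention(q: &[f32], k: &[f32], v: &[f32], s: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    // scores = scale * Q·K^T (s×s)
    let mut scores = vec![0.0f32; s * s];
    for i in 0..s {
        for j in 0..s {
            scores[i * s + j] =
                (0..d).map(|p| q[i * d + p] * k[j * d + p]).sum::<f32>() * scale;
        }
    }
    softmax_rows(&mut scores, s, s);
    // out = weights·V (s×d)
    let mut out = vec![0.0f32; s * d];
    for i in 0..s {
        for j in 0..d {
            out[i * d + j] = (0..s).map(|p| scores[i * s + p] * v[p * d + j]).sum();
        }
    }
    out
}

fn main() {
    // With a single query token, softmax over one score is exactly 1.0,
    // so the output equals V.
    let out = attention(&[1.0, 2.0], &[0.5, 0.5], &[3.0, 4.0], 1, 2);
    assert_eq!(out, vec![3.0, 4.0]);
}
```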
Device-side Rust language support: Core operators and codegen are complete:
- Operators: ✓ Implemented — `Add`, `Sub`, `Mul`, `Div`, `Rem`, bitwise ops (`BitAnd`, `BitOr`, `Shl`, `Shr`)
- Codegen: ✓ Implemented — signed/float remainder, float-integer conversions
- Type casting: ✓ Implemented — `Cast` codegen for f16↔f32 conversions
- Iterator combinators: `map`, `filter`, `fold`, `zip`, `enumerate`, etc.
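A host-side demo of the combinator style these goals target. In a device kernel the same `zip`/`map`/`fold` chain would operate over tile slices; plain slices stand in here so the snippet runs anywhere.

```rust
// Dot product written entirely with the iterator combinators listed above.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())               // pair elements of both inputs
        .map(|(x, y)| x * y)         // elementwise product
        .fold(0.0, |acc, p| acc + p) // reduce to a scalar
}

fn main() {
    assert_eq!(dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]), 32.0);
}
```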
Mid-term Goals: Ecosystem Integration
ascend_compile as the universal compilation backend: The standalone ascend_compile crate provides a single, validated compilation path for any tool that generates AscendC C++ kernels. It exposes four interfaces:
| Interface | Consumer | Use Case |
|---|---|---|
| Rust API | rustc_codegen_mlir | ascend-rs’s own MLIR→C++→binary pipeline |
| C ABI (`libascend_compile.so`) | Python via ctypes | Drop-in replacement for TileLang's libgen.py |
| CLI (`ascend-compile`) | Shell scripts, CI | Ad-hoc compilation and validation |
| Python wrapper (`ascend_compile.py`) | TileLang, Triton backends | Direct Python integration |
Key features that benefit all consumers:
- 3 validation passes before compilation: entry point check, DMA/sync barrier check (error on 310P, warning on 910B), buffer size vs. hardware limits
- Dual flag paths: `--cce-aicore-arch` for 310P/310B and `--npu-arch -xasc` for 910B (TileLang-compatible)
- Both object and shared library output: `-c -o out.o` or `-fPIC --shared -o out.so`
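The dual flag paths amount to a target-to-flags mapping. A minimal sketch, assuming a hypothetical `flag_path` helper: only the two flag strings come from this document; the function name, signature, and match arms are illustrative, not the `ascend_compile` API.

```rust
// Hypothetical helper: pick the bisheng flag path for a given SoC name.
// Flag strings are the two documented paths; everything else is assumed.
fn flag_path(soc: &str) -> Option<&'static str> {
    match soc {
        "310P" | "310B" => Some("--cce-aicore-arch"),
        "910B" => Some("--npu-arch -xasc"), // TileLang-compatible path
        _ => None, // unknown target: let validation report it
    }
}

fn main() {
    assert_eq!(flag_path("310P"), Some("--cce-aicore-arch"));
    assert_eq!(flag_path("910B"), Some("--npu-arch -xasc"));
    assert_eq!(flag_path("???"), None);
}
```

Centralizing this mapping is what lets every consumer (Rust API, C ABI, CLI, Python wrapper) get the correct flags without duplicating target knowledge.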
TileLang-Ascend integration: TileLang generates optimized AscendC C++ kernels from a Python DSL but relies on a bare subprocess.run(bisheng, ...) call with no validation. Replacing LibraryGenerator.compile_lib() with ascend_compile.compile_kernel() provides:
- Automatic target detection and correct flag selection
- Pre-compilation validation that catches common NPU bugs (missing sync barriers, buffer overflows)
- Consistent compilation across tools — the same flags ascend-rs uses for its own validated kernels
PyPTO integration: PyPTO (Parallel Tile Operations) is CANN’s high-level operator programming framework that compiles Python-level tensor operations through a ~90-instruction PTO virtual ISA down to AscendC C++ code. When PyPTO is released alongside the CANN framework, ascend_compile can serve as the compilation backend, and an ascend-rs interface to PyPTO would enable memory-safe static analysis of tile-level operators — catching buffer overflows, missing synchronization barriers, and incorrect DMA parameters at compile time that PyPTO currently validates only at code-generation time.
Triton-Ascend backend: Triton’s compiler pipeline produces target-specific IR that must be lowered to device binaries. A Triton backend for Ascend can use ascend_compile to handle the final AscendC C++ → NPU binary step, benefiting from the same validation and target abstraction.
PyTorch integration path: torch.compile with an Ascend backend could leverage ascend_compile through its C ABI to compile generated kernels without a Python→Rust dependency, using the same libascend_compile.so that TileLang uses.
Complete host API: All major CANN API modules now have safe Rust wrappers:
- Tensor descriptors ✓ — `TensorDesc`, `DataBuffer`, `Dataset` (28 methods)
- Model inference ✓ — `Model::from_file()`, `execute()`, `execute_async()`, `ModelDescription` (16 methods)
- Event management ✓ — `AclEvent` with record/sync/timing (8 methods)
- DVPP image preprocessing ✓ — `DvppChannel`, `PicDesc`, resize/crop/JPEG/PNG (42 methods)
- Profiling API ✓ — `ProfSession`, `ProfConfig`, `StepInfo`, `ProfStamp` (18 methods)
- HCCL distributed communication ✓ — AllReduce, AllGather, Broadcast, ReduceScatter, Send/Recv (17 methods)
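A runnable mock of the inference call pattern, using the wrapper names listed above (`Model::from_file`, `execute`, `Dataset`). The real ascend-rs types call into CANN; these stubs, the model filename, and the pass-through behavior are assumptions made purely to show the intended shape of the API.

```rust
// Mock types mirroring the documented wrapper surface; not the real API.
struct Dataset(Vec<f32>);
struct Model {
    path: String,
}

impl Model {
    fn from_file(path: &str) -> Result<Self, String> {
        // real impl: loads an .om model via CANN; mock just records the path
        Ok(Model { path: path.to_string() })
    }
    fn execute(&self, input: &Dataset) -> Result<Dataset, String> {
        // real impl: runs inference on the NPU; mock passes data through
        Ok(Dataset(input.0.clone()))
    }
}

fn main() {
    let model = Model::from_file("resnet50.om").unwrap(); // assumed filename
    let out = model.execute(&Dataset(vec![1.0, 2.0])).unwrap();
    assert_eq!(out.0, vec![1.0, 2.0]);
    assert_eq!(model.path, "resnet50.om");
}
```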
MLIR codegen backend improvements:
- Rust intrinsics ✓ — bit manipulation (ctlz/cttz/ctpop/bswap/bitreverse/rotate), float math (floor/ceil/round/trunc/copysign/fma), overflow arithmetic, saturating arithmetic
- Float constant support ✓ — proper MLIR attribute formatting with decimal points
- C++ codegen intrinsic translation ✓ — all LLVM intrinsics now mapped to GCC builtins and C math functions
- Correctness fixes ✓ — `raw_eq` (byte comparison), `discriminant_value` (enum match), `const_uint_big` (i128), `static_addr_of` (global symbols), `codegen_static` (initializer values)
- Debug info generation — not yet started
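The standard-library operations below lower to the LLVM intrinsics named above (ctlz, ctpop, fma, overflow/saturating arithmetic), which on the MLIR/C++ path are mapped to GCC builtins and C math functions. This host-side snippet shows the surface the codegen backend has to cover:

```rust
fn main() {
    assert_eq!(0x0Fu32.leading_zeros(), 28);    // llvm.ctlz
    assert_eq!(0xFFu32.count_ones(), 8);        // llvm.ctpop
    assert_eq!(200u8.saturating_add(100), 255); // saturating arithmetic
    assert_eq!(u8::MAX.checked_add(1), None);   // overflow arithmetic
    assert_eq!(2.0f64.mul_add(3.0, 1.0), 7.0);  // llvm.fma
}
```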
Long-term Vision
Ascend target specification — davinci-huawei-none: We have prepared a concrete Tier-3 target proposal for the Rust compiler. The target triple davinci-huawei-none follows established conventions (nvptx64-nvidia-cuda, amdgcn-amd-amdhsa) and defines ABI, calling conventions, and pointer sizes for the DaVinci NPU architecture. The target spec (upstream-tier3/compiler/rustc_target/src/spec/targets/davinci_huawei_none.rs) uses aarch64-unknown-none as the LLVM placeholder (since no DaVinci LLVM backend exists) and registers cfg(target_arch = "davinci") for conditional compilation. The upstream-tier3/ directory contains the complete submission package: target spec, platform-support documentation, patches for mod.rs/platform-support.md/bootstrap/sanity.rs, and community engagement materials (Zulip post, optional MCP draft, PR description). Our engagement plan: (1) post to Zulip #t-compiler/help for early feedback on the triplet name, (2) file an MCP if the novel MLIR codegen backend warrants compiler-team consensus, (3) open a draft PR to rust-lang/rust. Tier-3 targets have the lowest bar — no RFC, no CI, single-reviewer approval — and our in-tree changes contain no proprietary code.
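Once `cfg(target_arch = "davinci")` is registered as described above, crates can gate device-only code paths the same way CUDA and AMDGPU targets do today. A minimal illustration (the function bodies are placeholders; only the `cfg` key comes from the target spec):

```rust
// Device-side path: compiled only when targeting davinci-huawei-none.
#[cfg(target_arch = "davinci")]
fn backend() -> &'static str {
    "davinci"
}

// Fallback: compiled on every existing target.
#[cfg(not(target_arch = "davinci"))]
fn backend() -> &'static str {
    "host"
}

fn main() {
    // Running this snippet on an ordinary host target takes the fallback.
    assert_eq!(backend(), "host");
}
```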
Reducing the no_core burden: Maintaining a parallel core library reimplementation is a massive engineering effort. The long-term direction is to explore using -Zbuild-std=core with the MLIR backend to compile the Rust standard library source directly, rather than reimplementing by hand.
A unified Ascend compilation stack: The ascend_compile crate is the first step toward a unified compilation infrastructure where multiple frontends (Rust, Python DSLs, compiler IRs) share the same validated, target-aware backend. This mirrors the LLVM model — many frontends, one backend — but specialized for Ascend NPU hardware:
```mermaid
graph TD
    A1["Rust kernels"] --> F["AscendC C++ · common IR"]
    A2["TileLang (planned)"] -.-> F
    A3["Triton (planned)"] -.-> F
    A4["torch.compile (planned)"] -.-> F
    A5["PyPTO (planned)"] -.-> F
    A6["Future DSLs (planned)"] -.-> F
    F --> G["ascend_compile: validate → target flags → bisheng → binary"]
    G --> H["NPU Binary · .o / .so"]
```
Community Involvement
ascend-rs is currently in a private repository, pending an organizational decision on open-sourcing. Once released, it will welcome community participation. If you have Ascend NPU hardware and are interested in exploring memory-safe kernel programming, here are areas where contributions would be valuable:
- Add new vector intrinsics to `ascend_std`: follow the established pattern of `extern "C"` stubs + `mlir_to_cpp` handlers.
- Write more compiletest tests: as new features are added to `ascend_std`, corresponding compile tests should follow.
- Expand host API wrappers: the CANN SDK has many unwrapped APIs, each of which can be contributed independently.
- Try writing more complex Rust kernels: help discover gaps in the codegen backend and validate new intrinsics on NPU hardware.
- Integrate `ascend_compile` with your tool: if you work on TileLang, Triton, or other kernel compilers targeting Ascend, try replacing your compilation step with `ascend_compile` and report issues.