8. Next Steps: Roadmap and Vision

Current Status

ascend-rs has moved well past alpha in the areas covered by the other chapters. This roadmap covers only what remains: anything already demonstrated in Chapters 2–7, 9, 10, and 11 is treated as shipped, summarized briefly below, and left out of the goals that follow.

  • Host API: Alpha-complete. ACL, memory, streams, events, HCCL, DVPP, profiling, and BLAS all have safe Rust wrappers.
  • ascend_compile crate: Standalone compilation library with Rust API, C ABI, CLI, and Python bindings — the single path from AscendC C++ to NPU binary for every frontend in the stack.
  • Device runtime: 1565 Rust NPU kernels (489 compiletests + 16 deployable), 413 passing NPU correctness on Ascend 910B3 across 17 MultiKernelBench categories.
  • PyPTO / PTO-MLIR path: Integrated. Emitter (mlir_to_pto) → ptoas 0.26 → AscendC → bisheng. DeepSeek-R1-Distill-Qwen-1.5B end-to-end decode at 114–187 tok/s on 910B2 via this path (Chapter 10).
  • PTO safety oracle: Shipped (Chapter 11). pto_to_rust catches PlanMemoryPass placement bugs that ptoas itself accepts with rc=0.
  • Performance parity with hand-tuned AscendC: Achieved on softmax, activations, vec_add, and all four DeepSeek decode matmul shapes (Chapters 9 and 10).

Short-term Goals

The short list of things not yet in tree:

  • Tiling and double-buffering: Queue-based (TQue) pipeline API for overlapping DMA and compute. The PTO path already pipelines implicitly via PlanMemoryPass; this goal is the ascend_std buffer-API analogue.
  • Iterator combinators: map, filter, fold, zip, enumerate on device-side Rust slices — currently usable but inefficiently lowered.
  • Debug info generation: DWARF sections for NPU binaries so ccec-level diagnostics link back to Rust source.
  • Qwen-7B / DeepSeek-V2-Lite model upgrade: 1.5B-distill is too weak a headline; 7B and 16B-MoE are the publishable stories (tracked in project_deepseek_model_upgrade_plan).
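The double-buffering goal in the first item can be pictured with a host-side sketch. This is not the ascend_std API: `load_tile`, `compute_tile`, and `double_buffered_scale` are hypothetical names, the "DMA" is an ordinary slice copy, and the code runs sequentially. It only shows the ping-pong schedule that a queue-based API would let the hardware overlap.

```rust
/// Hypothetical stand-in for a DMA copy from global memory into an on-chip tile buffer.
fn load_tile(src: &[f32], tile: usize, tile_len: usize, dst: &mut Vec<f32>) {
    let start = tile * tile_len;
    let end = (start + tile_len).min(src.len());
    dst.clear();
    dst.extend_from_slice(&src[start..end]);
}

/// Hypothetical stand-in for the vector compute stage: scales each element by 2.
fn compute_tile(buf: &[f32], out: &mut Vec<f32>) {
    out.extend(buf.iter().map(|x| x * 2.0));
}

/// Processes `src` tile by tile using two buffers that swap roles each step.
/// On real hardware the load of tile t+1 would be issued asynchronously and
/// overlap with the compute on tile t; here the two calls simply run in order.
pub fn double_buffered_scale(src: &[f32], tile_len: usize) -> Vec<f32> {
    assert!(tile_len > 0, "tile_len must be non-zero");
    let n_tiles = (src.len() + tile_len - 1) / tile_len;
    let mut out = Vec::with_capacity(src.len());
    let mut bufs: [Vec<f32>; 2] = [Vec::new(), Vec::new()];
    if n_tiles == 0 {
        return out;
    }

    // Prologue: fill the first buffer before the steady-state loop starts.
    load_tile(src, 0, tile_len, &mut bufs[0]);

    for t in 0..n_tiles {
        let cur = t % 2;
        let next = (t + 1) % 2;
        if t + 1 < n_tiles {
            // Prefetch the next tile into the buffer that is not being computed on.
            load_tile(src, t + 1, tile_len, &mut bufs[next]);
        }
        compute_tile(&bufs[cur], &mut out);
    }
    out
}
```

The point of the two-buffer layout is that the prefetch never writes into the buffer the compute stage is reading, which is the property a TQue-style API would have to preserve.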

Mid-term Goals: Ecosystem Integration

ascend_compile is designed as a single validated backend for every AscendC C++ producer. PyPTO is already plugged in; the remaining frontends are the mid-term work:

  • TileLang → ascend_compile: TileLang currently calls bisheng via a bare subprocess.run with no validation. Replacing LibraryGenerator.compile_lib() with ascend_compile.compile_kernel() gives TileLang the same validation passes (entry-point, DMA/sync barrier, buffer-vs-cap) that ascend-rs uses for its own kernels.
  • Triton → Ascend: A Triton backend for Ascend can use ascend_compile to handle the final AscendC C++ → NPU binary step, so the Triton team does not need to duplicate the target-flag / validation logic already in ascend_compile.
  • PyTorch → Ascend: torch.compile with an Ascend backend can link against libascend_compile.so via C ABI — no Python-to-Rust dependency, the same binary TileLang uses.
  • PTO safety oracle → upstream ptoas: Chapter 11 listed six invariants the oracle enforces externally. Folding the first four (aliasing, capacity, op-constraint, matmul-bounds) into ptoas’s own VerifyAfterPlanMemoryPass would make them a first-class compiler guarantee rather than an opt-in external check.
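To make the first two invariants in the last item concrete (aliasing and capacity), here is a minimal sketch of what such a post-placement check could look like. It is not taken from pto_to_rust or ptoas: `Placement`, `PlacementError`, and `verify_placements` are hypothetical, and the sketch assumes every tile is live at the same time, whereas a real verifier would also need liveness information.

```rust
/// Hypothetical record of one tile placed in an on-chip buffer.
#[derive(Debug, Clone, PartialEq)]
pub struct Placement {
    pub name: &'static str,
    pub offset: usize,
    pub len: usize,
}

/// Hypothetical error type covering the two checks in this sketch.
#[derive(Debug, PartialEq)]
pub enum PlacementError {
    /// The tile ends past the buffer capacity.
    Capacity { tile: &'static str, end: usize, cap: usize },
    /// Two tiles occupy overlapping byte ranges.
    Aliasing { a: &'static str, b: &'static str },
}

/// Checks capacity first, then overlap, assuming all tiles are live together.
pub fn verify_placements(tiles: &[Placement], cap: usize) -> Result<(), PlacementError> {
    for t in tiles {
        let end = t.offset.saturating_add(t.len);
        if end > cap {
            return Err(PlacementError::Capacity { tile: t.name, end, cap });
        }
    }

    // After sorting by offset, any overlapping pair implies that some
    // adjacent pair overlaps, so comparing neighbours is sufficient.
    let mut sorted: Vec<&Placement> = tiles.iter().collect();
    sorted.sort_by_key(|t| t.offset);
    for pair in sorted.windows(2) {
        let (a, b) = (pair[0], pair[1]);
        if a.offset + a.len > b.offset {
            return Err(PlacementError::Aliasing { a: a.name, b: b.name });
        }
    }
    Ok(())
}
```

Run as a pass after memory planning, a check of this shape turns a silent placement bug into a hard compile error instead of relying on an external tool.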

Long-term Vision

Ascend target specification (davinci-huawei-none): A concrete Tier-3 target proposal is ready for the Rust compiler. The target triple follows the nvptx64-nvidia-cuda / amdgcn-amd-amdhsa conventions and defines the ABI, calling conventions, and pointer sizes for DaVinci. The spec at upstream-tier3/compiler/rustc_target/src/spec/targets/davinci_huawei_none.rs uses aarch64-unknown-none as the LLVM placeholder (no DaVinci LLVM backend exists yet) and registers cfg(target_arch = "davinci"). Engagement plan: (1) a Zulip #t-compiler/help post for early feedback on the triple, (2) an MCP if the MLIR codegen backend warrants compiler-team consensus, (3) a draft PR to rust-lang/rust. Tier-3 has the lowest bar: no RFC, no CI, single-reviewer approval.
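Once the target registers cfg(target_arch = "davinci"), user code could select device and host paths at compile time. The sketch below is an assumption about how that might be used, not code from the repository: `selected_backend` and `vec_add` are hypothetical names, and on today's toolchains the cfg value is unknown, so only the host branch is ever compiled (a recent Cargo may also emit an `unexpected_cfgs` warning).

```rust
/// Hypothetical helper: names the code path selected at compile time.
/// `cfg!` expands to a constant `true` or `false`, so the unused branch
/// is trivially removed by the compiler.
pub fn selected_backend() -> &'static str {
    if cfg!(target_arch = "davinci") {
        "davinci"
    } else {
        "host"
    }
}

/// Host reference implementation of element-wise addition. It is compiled
/// only when NOT targeting DaVinci; a device build would supply its own
/// kernel under `#[cfg(target_arch = "davinci")]` instead.
#[cfg(not(target_arch = "davinci"))]
pub fn vec_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}
```

Keeping a host fallback behind the same function name is one way to let the same crate build and test on a development machine without NPU hardware.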

Reducing the no_core burden: A parallel core reimplementation is a heavy engineering tax. The direction is to explore -Zbuild-std=core with the MLIR backend and compile the standard library source directly rather than reimplement by hand.

A unified Ascend compilation stack: Chapter 7 showed ascend_compile as the IR hub today. The long-term picture closes the loop between frontends, the shared stage-2 plan, and the safety oracle — so every path into an NPU binary passes through the same validated pipeline and the same compile-time guarantees:

graph TD
    A1["Rust kernels<br/>(shipped)"] ==> F
    A5["PyPTO / PTO-MLIR<br/>mlir_to_pto → ptoas<br/>(shipped · Chapter 7,10)"] ==> F
    A2["TileLang<br/>(planned)"] -.-> F
    A3["Triton<br/>(planned)"] -.-> F
    A4["torch.compile<br/>(planned)"] -.-> F
    A6["Future DSLs"] -.-> F
    F["AscendC C++<br/>common IR"] ==> O["pto_to_rust safety oracle<br/>(shipped · Chapter 11)<br/>aliasing · capacity · op-constraint<br/>matmul-bounds · dead-tile · linear-use"]
    F ==> G["ascend_compile<br/>validate → target flags → bisheng"]
    O -.->|"diagnostics on<br/>original .acl.pto"| A5
    O -.->|"upstream candidates<br/>VerifyAfterPlanMemoryPass"| U["ptoas (future)"]
    G ==> H["NPU Binary · .o / .so"]
    H ==> D["DeepSeek e2e<br/>114–187 tok/s on 910B2<br/>(shipped · Chapter 10)"]
    classDef shipped fill:#d4f5d4,stroke:#2b8a3e,stroke-width:2px
    classDef planned fill:#f5f5f5,stroke:#adb5bd,stroke-dasharray:3 3
    class A1,A5,F,G,O,H,D shipped
    class A2,A3,A4,A6,U planned

Bold (thick) edges are paths already running in tree; dashed edges are planned. The diagram makes one asymmetry explicit: today the oracle observes ptoas from outside. The dashed edge from the oracle to ptoas (future) is the upstream-integration arrow: once the first four oracle checks land inside ptoas as VerifyAfterPlanMemoryPass, that part of the diagram collapses into a single node.
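The `validate` stage in the ascend_compile node of the diagram can be pictured as a gate that runs before any compiler is invoked. The sketch below is illustrative only and does not reflect the real ascend_compile API: `KernelUnit`, `ValidationError`, and `validate` are hypothetical, the entry-point check is a plain substring search, and the DMA/sync barrier pass is left out entirely.

```rust
/// Hypothetical error type for the two checks modelled here.
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    /// The expected entry-point symbol does not appear in the source.
    MissingEntryPoint(String),
    /// The declared buffers exceed the on-chip capacity.
    BufferOverCap { requested: usize, cap: usize },
}

/// Hypothetical summary of one generated AscendC C++ kernel.
pub struct KernelUnit<'a> {
    /// AscendC C++ source text produced by some frontend.
    pub source: &'a str,
    /// Name of the entry-point function the caller expects.
    pub entry: &'a str,
    /// Sizes in bytes of the on-chip buffers the kernel declares.
    pub buffer_bytes: &'a [usize],
}

/// Runs the entry-point and buffer-vs-cap checks; the compiler would be
/// invoked only if this returns `Ok(())`.
pub fn validate(unit: &KernelUnit<'_>, cap_bytes: usize) -> Result<(), ValidationError> {
    // Simplification: a real pass would parse the source rather than search it.
    if !unit.source.contains(unit.entry) {
        return Err(ValidationError::MissingEntryPoint(unit.entry.to_string()));
    }
    let requested: usize = unit.buffer_bytes.iter().sum();
    if requested > cap_bytes {
        return Err(ValidationError::BufferOverCap { requested, cap: cap_bytes });
    }
    Ok(())
}
```

Because every frontend hands over the same AscendC C++ text, a gate of this shape applies unchanged whether the source came from Rust kernels, PyPTO, or a future TileLang or Triton integration.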

Community Involvement

ascend-rs is currently in a private repository pending an organizational decision on open-sourcing. Once released, these are the tractable contribution slots:

  1. Add new vector intrinsics to ascend_std: Follow the established pattern of extern "C" stubs + mlir_to_cpp handlers.
  2. Write more compiletest tests: As ascend_std grows, compile tests should follow.
  3. Expand host API wrappers: CANN has many unwrapped APIs; each is an independent contribution.
  4. Write more complex Rust kernels: Help discover gaps in the codegen backend and validate new intrinsics on NPU hardware.
  5. Integrate ascend_compile with your tool: If you work on TileLang, Triton, or another kernel compiler targeting Ascend, try replacing your compilation step with ascend_compile and report issues.
  6. Extend the PTO safety oracle: pto_to_rust is ~600 lines. Additional checks (loop-aware liveness to promote [linear-use] from warning to error, per-SoC DeviceSpec entries for 910C / 310P3) are self-contained PRs.