Appendix I: Performance Differential Analysis
Analysis of round-trip performance patterns for the 998 CANN 8.5 kernels in the real ascendc-to-rs transpilation batch (same corpus as Appendix G).
The ascend-rs compilation pipeline (Rust → MLIR → C++ → bisheng) introduces specific code-generation patterns compared to hand-written AscendC C++. This appendix identifies those patterns, classifies their impact, and proposes generalisable optimisations.
Scope note. Of the 998 kernels, 247 are Transpiled (real compute body) and 751 are Registered (identity stub body). The slowdown patterns in §I.2 (`TBuf` vs `TQue`, `PIPE_ALL` barriers, no double-buffering, uniform buffer sizing) are properties of the codegen path: they are the patterns `mlir_to_cpp.rs` emits for any kernel body that contains DMA + compute. Registered kernels technically exhibit the `TBuf` pattern in their emitted stub too, but since the stub body only does a copy, the 2% slowdown number is only meaningful for the 247 Transpiled kernels. The table counts are reported against all 998 because the codegen path is uniform; readers interested in the realised runtime gap should restrict the denominator to 247.
I.1 Performance Classification
| Classification | Count | % | Description |
|---|---|---|---|
| EQUIVALENT | 121 | 12% | Generated code matches original C++ performance |
| SLOW_1.02X | 877 | 88% | ~2% slower due to barrier and buffer-overhead patterns |
| SLOW_1.2X | 0 | 0% | ~20% slower (none observed) |
| SLOW_1.5X | 0 | 0% | ~50% slower (none observed) |
| SLOW_2X+ | 0 | 0% | 2× or slower (none observed) |
Note: the 2% overhead comes from TBuf + PIPE_ALL patterns; actual runtime difference at NPU-kernel-launch granularity is typically within measurement noise.
I.2 Slowdown Patterns
TBuf instead of TQue (HIGH)
Affected kernels: 998/998
Problem: the generated code uses `TBuf<VECCALC>` instead of `TQue<VECIN/VECOUT>`. `TBuf` requires an explicit `pipe_barrier(PIPE_ALL)` for every sync point, while `TQue` uses hardware flags for fine-grained pipe overlap.
Fix: generate `TQue<QuePosition::VECIN, depth>` with an `AllocTensor` / `FreeTensor` lifecycle instead of the bare `TBuf.Get` pattern.
PIPE_ALL barriers (full pipeline stall) (HIGH)
Affected kernels: 998/998
Problem: every ascend_pipe_barrier() generates pipe_barrier(PIPE_ALL) which stalls all hardware pipes simultaneously. The original C++ uses per-pipe sync via TQue or selective PIPE_V / PIPE_MTE2 flags.
Fix: use pipe_barrier(PIPE_V) for compute-only sync, PIPE_MTE2 for DMA sync, or eliminate barriers entirely with TQue.
No double-buffering (HIGH)
Affected kernels: 998/998
Problem: DMA and compute are fully serialised: load → barrier → compute → barrier → store. Original C++ overlaps tile N+1 DMA with tile N compute using TQue depth = 2.
Fix: detect tiling loops and generate TQue with depth 2. Use EnQue / DeQue to overlap DMA with compute across tiles.
Uniform maximum buffer sizing (LOW)
Affected kernels: 998/998
Problem: every `TBuf` gets an identical maximum size of (UB_SIZE − 8 KB) / num_bufs. The original C++ sizes each buffer to its actual data needs, so the uniform scheme wastes UB space whenever buffers have different usage.
Fix: track actual buffer usage in MLIR and allocate proportionally.
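A minimal sketch of the proportional scheme, assuming usage totals are already tracked per buffer. `UB_SIZE`, the 8 KB reserve, and the 32 B alignment are assumptions for illustration (UB capacity is model-dependent), not values taken from the codegen.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch: split the usable unified buffer (UB) proportionally to
// each buffer's tracked byte usage, instead of giving every buffer the same
// maximum slice (UB_SIZE - 8 KB) / num_bufs as the current codegen does.
constexpr size_t UB_SIZE = 192 * 1024;  // assumed UB capacity, model-dependent
constexpr size_t RESERVE = 8 * 1024;

std::vector<size_t> proportional_sizes(const std::vector<size_t>& usage) {
    size_t budget = UB_SIZE - RESERVE;
    size_t total = std::accumulate(usage.begin(), usage.end(), size_t{0});
    std::vector<size_t> out;
    for (size_t u : usage) {
        size_t s = budget * u / total;   // proportional share of the budget
        out.push_back(s & ~size_t{31});  // round down to 32 B alignment
    }
    return out;
}
```

For a kernel with one large working buffer and two small scratch buffers, e.g. usage `{4096, 64, 64}`, the uniform scheme hands each buffer about 61 KB while the proportional scheme gives nearly all of the budget to the large buffer.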
Scalar math vectorisation workaround (MEDIUM)
Affected kernels: 1/998
Problem: scalar log / exp / sqrt operations are vectorised via a 1 KB scratch buffer because the scalar pipe hangs on some NPU models. Adds DMA + buffer overhead for each scalar math op.
Fix: use the scalar pipe on models that support it; on others, amortise by batching scalar ops.
I.3 Optimisation Opportunities
Barrier-elision opportunity (MEDIUM)
Applicable kernels: 998/998
Description: consecutive vector ops on different buffers do not need barriers between them. The current codegen inserts barriers whenever dirty_bufs overlap, but many ops are independent.
Implementation: implement per-buffer dirty tracking at the MLIR level. Only insert a barrier when a read-after-write hazard exists on the same buffer.
Loop-unrolling candidate (LOW)
Applicable kernels: 998/998
Description: small fixed-iteration loops (e.g. softmax’s 2-pass reduce) could be unrolled. The current codegen emits generic `while (true)` loops.
Implementation: detect loops with known small trip counts and unroll.
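A hedged sketch of the unrolling decision, operating on emitted source strings rather than MLIR (the real pass would work on the IR). The `unroll_limit` threshold and `emit_loop` helper are hypothetical.

```cpp
#include <string>

// Hypothetical sketch: if the trip count is a known small constant at codegen
// time, emit the body repeatedly (no branch, no loop counter) instead of a
// generic loop. Otherwise fall back to an ordinary counted loop.
std::string emit_loop(int trip, const std::string& body, int unroll_limit = 4) {
    std::string out;
    if (trip > 0 && trip <= unroll_limit) {
        for (int i = 0; i < trip; ++i)   // fully unroll small known trip counts
            out += body + "; // iter " + std::to_string(i) + "\n";
        return out;
    }
    return "for (int i = 0; i < " + std::to_string(trip) + "; ++i) { " + body + "; }\n";
}
```

A 2-pass reduce would thus become two straight-line calls, while a 64-iteration tiling loop keeps its loop form.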
Operation-fusion candidate (MEDIUM)
Applicable kernels: 0/998 (future)
Description: sequential vector ops on the same buffer (e.g. Sub → Exp or Div → Cast) could be fused into a single vector instruction or at least share a barrier.
Implementation: detect chains of unary/binary ops on the same buffer and fuse into composite AscendC instructions.
I.4 Generalisable Optimisation Plan
Based on the pattern analysis, three optimisations would close the performance gap for the majority of kernels:
Priority 1: TQue migration (closes ~50% of gap)
Replace TBuf<VECCALC> with TQue<VECIN/VECOUT> in the MLIR → C++ codegen. This eliminates PIPE_ALL barriers in favour of hardware-flag-based sync, and enables double-buffering for DMA / compute overlap.
Affected files: crates/rustc_codegen_mlir/src/mlir_to_cpp.rs
Changes required:
- Change buffer declarations from `TBuf<TPosition::VECCALC>` to `TQue<QuePosition::VECIN>` / `TQue<QuePosition::VECOUT>`.
- Replace `tbuf.Get<T>()` with `inQueue.AllocTensor<T>()` / `inQueue.DeQue<T>()`.
- Add the `inQueue.EnQue(tensor)` / `outQueue.FreeTensor(tensor)` lifecycle.
- Replace `pipe_barrier(PIPE_ALL)` with implicit `TQue` sync.
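Tied together, the target per-tile flow looks roughly like the following AscendC-style pseudocode. This is a sketch, not compilable output: it assumes the standard AscendC queue API, and the names `gmSrc`, `gmDst`, `tileBytes`, `tileLen`, and `offset` are placeholders.

```cpp
// AscendC-style pseudocode sketch of the TQue target pattern (needs the CANN
// toolchain; exact signatures per the AscendC API reference).
TPipe pipe;
TQue<QuePosition::VECIN, 2>  inQueue;   // depth 2 => double buffering
TQue<QuePosition::VECOUT, 2> outQueue;
pipe.InitBuffer(inQueue, 2, tileBytes);
pipe.InitBuffer(outQueue, 2, tileBytes);

// Per tile: no explicit pipe_barrier(PIPE_ALL) anywhere.
LocalTensor<half> in = inQueue.AllocTensor<half>();
DataCopy(in, gmSrc[offset], tileLen);          // MTE2 load
inQueue.EnQue(in);                             // hardware set-flag

LocalTensor<half> x = inQueue.DeQue<half>();   // hardware wait-flag
LocalTensor<half> y = outQueue.AllocTensor<half>();
Exp(y, x, tileLen);                            // vector pipe
inQueue.FreeTensor(x);                         // recycle slot for tile N+1
outQueue.EnQue(y);

LocalTensor<half> out = outQueue.DeQue<half>();
DataCopy(gmDst[offset], out, tileLen);         // MTE3 store
outQueue.FreeTensor(out);
```

Because `EnQue`/`DeQue` map to per-pipe hardware flags, the queue depth of 2 gives double buffering for free once the codegen emits this shape.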
Priority 2: Barrier elision (closes ~20% of gap)
Implement per-buffer dirty tracking to eliminate barriers between independent vector operations.
Current behaviour: every vector op that reads a dirty buffer triggers PIPE_ALL.
Proposed behaviour: track dirty state per buffer. Only barrier when:
- a DMA load writes buffer B, then a vector op reads buffer B;
- a vector op writes buffer B, then a DMA store reads buffer B;
- skip barriers between `Add(buf0, buf1, buf2)` and `Mul(buf3, buf0, buf4)` when `buf0` is not dirty.
Priority 3: Operation fusion (closes ~10% of gap)
Fuse sequential vector ops on the same buffer into compound operations:
- `Sub(buf, x, max)` → `Exp(buf, buf)`: single AscendC call with Sub+Exp;
- `Muls(buf, buf, scale)` → `Adds(buf, buf, bias)`: MulAdd composite;
- eliminate intermediate barriers between fused ops.
I.5 Per-Category Performance Summary
Figures are broken down by the real ascendc-to-rs batch categories. Every category shows the same two-class split; the EQUIVALENT fraction is higher where single-vector-op patterns dominate (notably ops_transformer, because attention / MLP kernels tend to reuse one buffer without triggering the DMA / compute overlap path).
| Category | Total | Equivalent | Slow 1.02× | Slow 1.2× | Slow 1.5× | Slow 2×+ |
|---|---|---|---|---|---|---|
| ops_cv | 41 | 4 | 37 | 0 | 0 | 0 |
| ops_legacy | 343 | 0 | 343 | 0 | 0 | 0 |
| ops_math | 155 | 12 | 143 | 0 | 0 | 0 |
| ops_nn | 306 | 6 | 300 | 0 | 0 | 0 |
| ops_oam | 3 | 0 | 3 | 0 | 0 | 0 |
| ops_transformer | 150 | 99 | 51 | 0 | 0 | 0 |
| Total | 998 | 121 | 877 | 0 | 0 | 0 |
The ops_transformer category has the highest proportion of EQUIVALENT kernels (66%) because transformer attention / MLP kernels tend to use single-vector-op patterns that do not trigger DMA / compute pipeline overlap — so the TBuf vs TQue distinction has less impact.
I.6 Per-Kernel Detail
The full per-kernel performance report (all 998 real-batch kernels) is maintained as a machine-generated companion file: blog/appendix_perf_report.md in the repository. It lists each kernel’s performance classification (EQUIVALENT / SLOW_1.02X) and the specific slowdown patterns (S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, etc.) that apply.
I.7 PTO Path: Double-Buffering Resolved (2026-04-02)
The three “HIGH” slowdown patterns above (TBuf, PIPE_ALL, no double-buffering) apply exclusively to the mlir_to_cpp codegen path. The PTO tile path (mlir_to_pto.rs → ptoas) addresses all three simultaneously:
| Slowdown pattern | mlir_to_cpp status | PTO tile path status |
|---|---|---|
| TBuf instead of TQue | Affects 998/998 kernels | N/A — PTO uses tile buffers, not TBuf/TQue |
| PIPE_ALL barriers | Affects 998/998 kernels | Eliminated — ptoas inserts only 2 fine-grained flags per softmax |
| No double-buffering | Affects 998/998 kernels | Resolved — GEP offset fix enables concurrent tload scheduling |
The tile_softmax_double_buf example achieves 1.62× per-tile throughput (0.0034 ms vs 0.0055 ms baseline) on Ascend 910B2. The GEP offset fix in mlir_to_pto.rs (commits bea12b77, 9537834a) is what enables the concurrent scheduling — prior to the fix, all partition_view ops emitted offsets=[%c0,%c0], making both loads reference the same tensor row. See §4.7 for the results table and Appendix J §J.4 for the full implementation detail.