
Appendix I: Performance Differential Analysis

Analysis of round-trip performance patterns for 998 CANN 8.5 kernels in the real ascendc-to-rs transpilation batch (same corpus as Appendix G).

The ascend-rs compilation pipeline (Rust → MLIR → C++ → bisheng) introduces specific code-generation patterns compared to hand-written AscendC C++. This appendix identifies those patterns, classifies their impact, and proposes generalisable optimisations.

Scope note. Of the 998 kernels, 247 are Transpiled (real compute body) and 751 are Registered (identity stub body). The slowdown patterns in §I.2 (TBuf vs TQue, PIPE_ALL barriers, no double-buffering, uniform buffer sizing) are properties of the codegen path — they are the patterns mlir_to_cpp.rs emits for any kernel body that contains DMA+compute. Registered kernels technically exhibit the TBuf pattern in their emitted stub too, but since the stub body only does a copy, the 2% slowdown number is only meaningful for the 247 Transpiled kernels. The table counts are reported against all 998 because the codegen path is uniform; readers interested in the realised runtime gap should restrict the denominator to 247.

I.1 Performance Classification

| Classification | Count | % | Description |
|---|---|---|---|
| EQUIVALENT | 121 | 12% | Generated code matches original C++ performance |
| SLOW_1.02X | 877 | 88% | ~2% slower due to barrier and buffer-overhead patterns |
| SLOW_1.2X | 0 | 0% | ~20% slower (none observed) |
| SLOW_1.5X | 0 | 0% | ~50% slower (none observed) |
| SLOW_2X+ | 0 | 0% | 2× or slower (none observed) |

Note: the 2% overhead comes from TBuf + PIPE_ALL patterns; actual runtime difference at NPU-kernel-launch granularity is typically within measurement noise.

I.2 Slowdown Patterns

TBuf instead of TQue (HIGH)

Affected kernels: 998/998

Problem: the generated code uses TBuf<VECCALC> instead of TQue<VECIN/VECOUT>. TBuf requires an explicit pipe_barrier(PIPE_ALL) for every sync point, while TQue uses hardware flags for fine-grained pipe overlap.

Fix: generate TQue<QuePosition::VECIN, depth> with the AllocTensor / FreeTensor lifecycle instead of the current TBuf.Get<T>() pattern.


PIPE_ALL barriers (full pipeline stall) (HIGH)

Affected kernels: 998/998

Problem: every ascend_pipe_barrier() generates pipe_barrier(PIPE_ALL) which stalls all hardware pipes simultaneously. The original C++ uses per-pipe sync via TQue or selective PIPE_V / PIPE_MTE2 flags.

Fix: use pipe_barrier(PIPE_V) for compute-only sync, PIPE_MTE2 for DMA sync, or eliminate barriers entirely with TQue.


No double-buffering (HIGH)

Affected kernels: 998/998

Problem: DMA and compute are fully serialised: load → barrier → compute → barrier → store. Original C++ overlaps tile N+1 DMA with tile N compute using TQue depth = 2.

Fix: detect tiling loops and generate TQue with depth 2. Use EnQue / DeQue to overlap DMA with compute across tiles.
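The gain from overlap can be illustrated with a toy cost model (a hedged sketch with made-up latencies, not measured NPU numbers; the function names are illustrative):

```rust
// Illustrative cost model for double-buffering. Latencies are
// hypothetical cycle counts, not real Ascend pipeline numbers.

/// Total cost when load -> barrier -> compute -> barrier -> store
/// is fully serialised, as the current TBuf codegen emits.
fn serialised(tiles: u64, load: u64, compute: u64, store: u64) -> u64 {
    tiles * (load + compute + store)
}

/// Total cost when tile N+1's DMA load overlaps tile N's compute
/// (TQue depth = 2). Stores are left serialised for simplicity.
fn double_buffered(tiles: u64, load: u64, compute: u64, store: u64) -> u64 {
    if tiles == 0 {
        return 0;
    }
    // The first load fills the pipe; afterwards each tile costs the
    // slower of (next load, current compute), plus its store.
    load + (tiles - 1) * load.max(compute) + compute + tiles * store
}

fn main() {
    // With load == compute, overlap hides most of the load latency.
    let s = serialised(8, 100, 100, 50); // 8 * 250 = 2000
    let d = double_buffered(8, 100, 100, 50); // 100 + 700 + 100 + 400 = 1300
    assert!(d < s);
    println!("serialised: {s}, double-buffered: {d}");
}
```

With equal load and compute cost, the depth-2 queue roughly halves the load+compute portion of the loop; only the first load is paid at full latency.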


Uniform maximum buffer sizing (LOW)

Affected kernels: 998/998

Problem: all TBuf buffers get an identical maximum size = (UB_SIZE - 8 KB) / num_bufs. Original C++ sizes each buffer to its actual data needs. Wastes UB space when buffers have different usage.

Fix: track actual buffer usage in MLIR and allocate proportionally.
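A minimal sketch of the two sizing policies (the 192 KB UB capacity, the usage figures, and the helper names are illustrative assumptions, not values from the codegen):

```rust
/// Uniform sizing as currently emitted: every buffer gets
/// (UB_SIZE - reserve) / num_bufs regardless of need.
fn uniform_sizes(ub_bytes: usize, reserve: usize, num_bufs: usize) -> Vec<usize> {
    vec![(ub_bytes - reserve) / num_bufs; num_bufs]
}

/// Proportional sizing: split the same budget according to each
/// buffer's tracked usage (hypothetical MLIR-side byte counts).
fn proportional_sizes(ub_bytes: usize, reserve: usize, usage: &[usize]) -> Vec<usize> {
    let budget = ub_bytes - reserve;
    let total: usize = usage.iter().sum();
    usage.iter().map(|u| budget * u / total).collect()
}

fn main() {
    const UB: usize = 192 * 1024;    // illustrative UB capacity
    const RESERVE: usize = 8 * 1024; // the 8 KB reserve from the text
    // One large buffer and two small ones: tracked demand in bytes.
    let usage = [96 * 1024, 16 * 1024, 16 * 1024];
    let uni = uniform_sizes(UB, RESERVE, usage.len());
    let prop = proportional_sizes(UB, RESERVE, &usage);
    // Uniform sizing under-provisions the large buffer...
    assert!(uni[0] < usage[0]);
    // ...while proportional sizing covers all three demands.
    assert!(prop.iter().zip(&usage).all(|(p, u)| p >= u));
}
```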


Scalar math vectorisation workaround (MEDIUM)

Affected kernels: 1/998

Problem: scalar log / exp / sqrt operations are vectorised via a 1 KB scratch buffer because the scalar pipe hangs on some NPU models. Adds DMA + buffer overhead for each scalar math op.

Fix: use the scalar pipe on models that support it; on others, amortise by batching scalar ops.


I.3 Optimisation Opportunities

Barrier-elision opportunity (MEDIUM)

Applicable kernels: 998/998

Description: consecutive vector ops on different buffers do not need barriers between them. The current codegen inserts barriers whenever dirty_bufs overlap, but many ops are independent.

Implementation: implement per-buffer dirty tracking at the MLIR level. Only insert a barrier when a read-after-write hazard exists on the same buffer.


Loop-unrolling candidate (LOW)

Applicable kernels: 998/998

Description: small fixed-iteration loops (e.g. softmax’s 2-pass reduce) could be unrolled. The current codegen emits generic while (true) loops.

Implementation: detect loops with known small trip counts and unroll.
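The detection itself is small; a hedged sketch with a hypothetical IR type and an illustrative unroll threshold:

```rust
/// Hypothetical IR op for the sketch: a loop with an optional
/// compile-time trip count (None = emitted as `while (true)`).
struct Loop {
    trip_count: Option<u32>,
}

/// Illustrative threshold, not a value from the real codegen.
const UNROLL_LIMIT: u32 = 4;

/// Decide how many straight-line copies of the body to emit:
/// known small trip counts are fully unrolled, everything else
/// stays a generic loop.
fn unroll_factor(l: &Loop) -> u32 {
    match l.trip_count {
        Some(n) if n <= UNROLL_LIMIT => n, // emit n copies of the body
        _ => 1,                            // keep the loop as-is
    }
}

fn main() {
    // softmax's 2-pass reduce: known trip count 2 -> fully unrolled.
    assert_eq!(unroll_factor(&Loop { trip_count: Some(2) }), 2);
    // Unknown trip count -> left as a while(true) loop.
    assert_eq!(unroll_factor(&Loop { trip_count: None }), 1);
}
```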


Operation-fusion candidate (MEDIUM)

Applicable kernels: 0/998 (future)

Description: sequential vector ops on the same buffer (e.g. SubExp or DivCast) could be fused into a single vector instruction or at least share a barrier.

Implementation: detect chains of unary/binary ops on the same buffer and fuse into composite AscendC instructions.


I.4 Generalisable Optimisation Plan

Based on the pattern analysis, three optimisations would close the performance gap for the majority of kernels:

Priority 1: TQue migration (closes ~50% of gap)

Replace TBuf<VECCALC> with TQue<VECIN/VECOUT> in the MLIR → C++ codegen. This eliminates PIPE_ALL barriers in favour of hardware-flag-based sync, and enables double-buffering for DMA / compute overlap.

Affected files: crates/rustc_codegen_mlir/src/mlir_to_cpp.rs

Changes required:

  1. Change buffer declarations from TBuf<TPosition::VECCALC> to TQue<QuePosition::VECIN> / TQue<QuePosition::VECOUT>.
  2. Replace tbuf.Get<T>() with inQueue.AllocTensor<T>() / inQueue.DeQue<T>().
  3. Add inQueue.EnQue(tensor) / outQueue.FreeTensor(tensor) lifecycle.
  4. Replace pipe_barrier(PIPE_ALL) with implicit TQue sync.
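Step 1 can be sketched as an emission switch in the codegen (a hedged sketch: the function name and signature are illustrative, not the actual mlir_to_cpp.rs API):

```rust
/// Emit a C++ buffer declaration for a kernel member, choosing
/// between the current TBuf pattern and the proposed TQue pattern.
/// Names and signature are illustrative, not the real emitter's API.
fn emit_buffer_decl(name: &str, use_tque: bool, is_input: bool, depth: u32) -> String {
    if use_tque {
        // Proposed: queue position picked by data direction,
        // depth = 2 enables double-buffering.
        let pos = if is_input { "VECIN" } else { "VECOUT" };
        format!("TQue<QuePosition::{pos}, {depth}> {name};")
    } else {
        // Current: a raw scratch buffer requiring PIPE_ALL barriers.
        format!("TBuf<TPosition::VECCALC> {name};")
    }
}

fn main() {
    assert_eq!(
        emit_buffer_decl("inQueueX", true, true, 2),
        "TQue<QuePosition::VECIN, 2> inQueueX;"
    );
    assert_eq!(
        emit_buffer_decl("bufX", false, true, 1),
        "TBuf<TPosition::VECCALC> bufX;"
    );
}
```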

Priority 2: Barrier elision (closes ~20% of gap)

Implement per-buffer dirty tracking to eliminate barriers between independent vector operations.

Current behaviour: every vector op that reads a dirty buffer triggers PIPE_ALL.

Proposed behaviour: track dirty state per buffer. Only insert a barrier when:

  • a DMA load writes buffer B, then a vector op reads buffer B;
  • a vector op writes buffer B, then a DMA store reads buffer B.

Barriers between independent ops are skipped, e.g. between Add(buf0, buf1, buf2) and Mul(buf3, buf0, buf4) when buf0 is not dirty.
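The tracking rule can be sketched as a small pass (a hedged sketch: buffer ids, pipe names, and the visit_op helper are all illustrative, and the real pass would operate on MLIR values):

```rust
use std::collections::HashMap;

/// Hardware pipes relevant to the sketch: DMA-in, vector, DMA-out.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Pipe {
    Mte2Dma,
    Vector,
    Mte3Dma,
}

/// Tracks, per buffer id, which pipe last wrote it. A barrier is only
/// needed for a cross-pipe read-after-write hazard; ops within one
/// pipe execute in order and need no explicit sync.
struct BarrierTracker {
    last_writer: HashMap<u32, Pipe>,
}

impl BarrierTracker {
    fn new() -> Self {
        Self { last_writer: HashMap::new() }
    }

    /// Record an op on `pipe` that reads `reads` and writes `writes`.
    /// Returns true if a barrier must be inserted before it.
    fn visit_op(&mut self, pipe: Pipe, reads: &[u32], writes: &[u32]) -> bool {
        let hazard = reads
            .iter()
            .any(|b| self.last_writer.get(b).is_some_and(|&w| w != pipe));
        if hazard {
            self.last_writer.clear(); // a barrier syncs all pending writes
        }
        for &b in writes {
            self.last_writer.insert(b, pipe);
        }
        hazard
    }
}

fn main() {
    let mut t = BarrierTracker::new();
    // DMA load writes buf0, then a vector op reads it: barrier needed.
    assert!(!t.visit_op(Pipe::Mte2Dma, &[], &[0]));
    assert!(t.visit_op(Pipe::Vector, &[0], &[1]));
    // A vector op on unrelated buffers: no barrier between them.
    assert!(!t.visit_op(Pipe::Vector, &[2], &[3]));
    // A DMA store reading the vector-written buf1: barrier needed.
    assert!(t.visit_op(Pipe::Mte3Dma, &[1], &[]));
}
```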

Priority 3: Operation fusion (closes ~10% of gap)

Fuse sequential vector ops on the same buffer into compound operations:

  • Sub(buf, x, max) followed by Exp(buf, buf) → single AscendC call with Sub+Exp;
  • Muls(buf, buf, scale) followed by Adds(buf, buf, bias) → MulAdd composite;
  • eliminate intermediate barriers between fused ops.
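The pair-detection logic can be sketched as follows (a hedged sketch: the op representation and the composite naming are illustrative, destination is the first operand in AscendC style):

```rust
/// Minimal vector-op representation for the sketch; `dst` and `src`
/// are buffer ids, with the destination first, AscendC style.
struct VecOp {
    name: &'static str,
    dst: u32,
    src: u32,
}

/// Fuse adjacent pairs where the second op consumes the first op's
/// destination in place, e.g. Sub(buf, ..) followed by Exp(buf, buf)
/// becomes a single composite "Sub+Exp" instruction.
fn fuse(ops: &[VecOp]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < ops.len() {
        if i + 1 < ops.len()
            && ops[i + 1].src == ops[i].dst
            && ops[i + 1].dst == ops[i].dst
        {
            out.push(format!("{}+{}", ops[i].name, ops[i + 1].name));
            i += 2; // consumed the fused pair
        } else {
            out.push(ops[i].name.to_string());
            i += 1;
        }
    }
    out
}

fn main() {
    let ops = [
        VecOp { name: "Sub", dst: 0, src: 1 }, // Sub(buf0, buf1, max)
        VecOp { name: "Exp", dst: 0, src: 0 }, // Exp(buf0, buf0) in place
        VecOp { name: "Div", dst: 2, src: 0 }, // different dst: not fused
    ];
    assert_eq!(fuse(&ops), vec!["Sub+Exp".to_string(), "Div".to_string()]);
}
```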

I.5 Per-Category Performance Summary

Broken down by real ascendc-to-rs batch category. Every category shows the same two-class split; the EQUIVALENT fraction is highest where single-vector-op patterns dominate (notably ops_transformer, because attention / MLP kernels tend to reuse one buffer without triggering the DMA / compute overlap path).

| Category | Total | Equivalent | Slow 1.02× | Slow 1.2× | Slow 1.5× | Slow 2×+ |
|---|---|---|---|---|---|---|
| ops_cv | 41 | 4 | 37 | 0 | 0 | 0 |
| ops_legacy | 343 | 0 | 343 | 0 | 0 | 0 |
| ops_math | 155 | 12 | 143 | 0 | 0 | 0 |
| ops_nn | 306 | 6 | 300 | 0 | 0 | 0 |
| ops_oam | 3 | 0 | 3 | 0 | 0 | 0 |
| ops_transformer | 150 | 99 | 51 | 0 | 0 | 0 |
| Total | 998 | 121 | 877 | 0 | 0 | 0 |

The ops_transformer category has the highest proportion of EQUIVALENT kernels (66%) because transformer attention / MLP kernels tend to use single-vector-op patterns that do not trigger DMA / compute pipeline overlap — so the TBuf vs TQue distinction has less impact.

I.6 Per-Kernel Detail

The full per-kernel performance report (all 998 real-batch kernels) is maintained as a machine-generated companion file: blog/appendix_perf_report.md in the repository. It lists each kernel’s performance classification (EQUIVALENT / SLOW_1.02X) and the specific slowdown patterns (S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, etc.) that apply.

I.7 PTO Path: Double-Buffering Resolved (2026-04-02)

The three “HIGH” slowdown patterns above (TBuf, PIPE_ALL, no double-buffering) apply exclusively to the mlir_to_cpp codegen path. The PTO tile path (mlir_to_pto.rs → ptoas) addresses all three simultaneously:

| Slowdown pattern | mlir_to_cpp status | PTO tile path status |
|---|---|---|
| TBuf instead of TQue | Affects 998/998 kernels | N/A — PTO uses tile buffers, not TBuf/TQue |
| PIPE_ALL barriers | Affects 998/998 kernels | Eliminated — ptoas inserts only 2 fine-grained flags per softmax |
| No double-buffering | Affects 998/998 kernels | Resolved — GEP offset fix enables concurrent tload scheduling |

The tile_softmax_double_buf example achieves 1.62× per-tile throughput (0.0034 ms vs 0.0055 ms baseline) on Ascend 910B2. The GEP offset fix in mlir_to_pto.rs (commits bea12b77, 9537834a) is what enables the concurrent scheduling — prior to the fix, all partition_view ops emitted offsets=[%c0,%c0], making both loads reference the same tensor row. See §4.7 for the results table and Appendix J §J.4 for the full implementation detail.