Appendix F: Performance Benchmarks
This appendix provides an interactive comparison of kernel performance between AscendC C++ (hand-optimized reference kernels) and ascend-rs (Rust-generated kernels) on two NPU targets: Ascend 310P and Ascend 910B3.
Methodology
- Wall-clock timing: `clock_gettime(CLOCK_MONOTONIC)` around kernel launch + `aclrtSynchronizeStream`
- Iterations: 1 warmup + 10 timed, median reported
- Compilation: Both C++ and Rust kernels compiled with `bisheng` at `-O2`
- Ratio: Rust time / C++ time (< 1.0 = Rust is faster)
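The measurement procedure above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark harness: the helper names (`now_ms`, `median_ms`, `time_kernel_ms`, `speed_ratio`) are hypothetical, and the `launch_and_sync` callback is a stand-in for the real kernel launch followed by `aclrtSynchronizeStream`. The source does not say how the median of 10 samples is taken; the sketch assumes the mean of the two middle values.

```cpp
#include <time.h>      // clock_gettime, CLOCK_MONOTONIC (POSIX)
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Current monotonic time in milliseconds.
static double now_ms() {
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<double>(ts.tv_sec) * 1e3 +
           static_cast<double>(ts.tv_nsec) / 1e6;
}

// Median of the samples. Assumption: for an even count, the mean of the
// two middle values is used.
static double median_ms(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Hypothetical harness: `launch_and_sync` stands in for "launch the kernel,
// then wait on the stream". Warmup runs are executed but not recorded.
static double time_kernel_ms(const std::function<void()>& launch_and_sync,
                             int warmup = 1, int timed = 10) {
    assert(timed > 0);
    for (int i = 0; i < warmup; ++i) launch_and_sync();

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(timed));
    for (int i = 0; i < timed; ++i) {
        const double t0 = now_ms();
        launch_and_sync();
        samples.push_back(now_ms() - t0);
    }
    return median_ms(samples);
}

// Ratio as defined above: Rust time / C++ time (< 1.0 means Rust is faster).
static double speed_ratio(double rust_ms, double cpp_ms) {
    return rust_ms / cpp_ms;
}
```

With this shape, each table row would correspond to two calls to `time_kernel_ms` (one per implementation) followed by `speed_ratio` on the two medians. Because the timed region includes launch and synchronization, the result reflects end-to-end latency rather than on-device compute time alone.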
Interactive Results
Note: If the interactive table does not render (e.g., in PDF), see the static table below.
Static Summary
| Kernel | Size | Target | C++ (ms) | Rust (ms) | Ratio |
|---|---|---|---|---|---|
| relu | 256 | 310P | 0.078 | 0.075 | 0.96x |
| relu | 1024 | 310P | 0.075 | 0.076 | 1.01x |
| relu | 4096 | 310P | 0.075 | 0.076 | 1.01x |
| relu | 16384 | 310P | 0.083 | 0.083 | 1.00x |
| sigmoid | 256 | 310P | 0.075 | 0.075 | 1.00x |
| sigmoid | 1024 | 310P | 0.075 | 0.074 | 0.99x |
| sigmoid | 4096 | 310P | 0.077 | 0.077 | 1.00x |
| sigmoid | 16384 | 310P | 0.086 | 0.086 | 1.00x |
| softmax | 256 | 310P | 0.078 | 0.077 | 0.99x |
| softmax | 1024 | 310P | 0.077 | 0.076 | 0.99x |
| softmax | 4096 | 310P | 0.079 | 0.079 | 1.00x |
| softmax | 16384 | 310P | 0.089 | 0.087 | 0.98x |
| tanh | 256 | 310P | 0.075 | 0.077 | 1.03x |
| tanh | 1024 | 310P | 0.075 | 0.076 | 1.01x |
| tanh | 4096 | 310P | 0.076 | 0.078 | 1.03x |
| tanh | 16384 | 310P | 0.085 | 0.086 | 1.01x |
| gelu | 256 | 910B3 | 0.023 | 0.019 | 0.83x |
| gelu | 1024 | 910B3 | 0.022 | 0.019 | 0.86x |
| gelu | 4096 | 910B3 | 0.023 | 0.019 | 0.83x |
| gelu | 16384 | 910B3 | 0.024 | 0.023 | 0.96x |
| relu | 256 | 910B3 | 0.030 | 0.030 | 1.00x |
| relu | 1024 | 910B3 | 0.028 | 0.028 | 1.00x |
| relu | 4096 | 910B3 | 0.029 | 0.026 | 0.90x |
| relu | 16384 | 910B3 | 0.029 | 0.031 | 1.07x |
| sigmoid | 256 | 910B3 | 0.028 | 0.028 | 1.00x |
| sigmoid | 1024 | 910B3 | 0.028 | 0.024 | 0.86x |
| sigmoid | 4096 | 910B3 | 0.029 | 0.028 | 0.97x |
| sigmoid | 16384 | 910B3 | 0.029 | 0.030 | 1.03x |
| softmax | 256 | 910B3 | 0.031 | 0.032 | 1.03x |
| softmax | 1024 | 910B3 | 0.031 | 0.031 | 1.00x |
| softmax | 4096 | 910B3 | 0.021 | 0.021 | 1.00x |
| tanh | 256 | 910B3 | 0.029 | 0.030 | 1.03x |
| tanh | 1024 | 910B3 | 0.028 | 0.026 | 0.93x |
| tanh | 4096 | 910B3 | 0.028 | 0.028 | 1.00x |
| tanh | 16384 | 910B3 | 0.029 | 0.030 | 1.03x |
Benchmarks collected on Ascend 910B3 and 310P hardware. Auto-generated from `kernels.db`.