10. DeepSeek Inference: A Cross-Platform Kernel Benchmark Suite
Summary: Softmax and GEMM are useful microbenchmarks, but a real inference workload is the only honest test of a kernel toolchain. We packaged the 13 kernels needed for a full DeepSeek-R1-Distill-Qwen-1.5B decode step as a portable suite, ran the Rust source through
mlir_to_msl, and measured the result on Apple silicon. The generated Metal kernels reach 91.7 tok/s on M2 Max (60% of the 400 GB/s memory-bandwidth ceiling) and 33–35 tok/s on M4, beating Apple’s hand-tuned MLX runtime on decode. The same Rust source targets nine other backends; this chapter documents the suite so it can be reproduced on any of them.
10.1 Why DeepSeek?
DeepSeek-R1-Distill-Qwen-1.5B is small enough to fit in 8 GB of unified memory, large enough to be bandwidth-bound on every realistic accelerator, and architecturally representative of the modern transformer family:
- Grouped-query attention (GQA) — 12 Q-heads share 2 KV-heads.
- SwiGLU MLP — three matmuls per layer, fusable into one kernel.
- RMSNorm — replaces LayerNorm everywhere.
- Rotary position embeddings — applied in-place to Q and K.
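For concreteness, the GQA head sharing can be sketched as a tiny helper. The contiguous six-to-one grouping below is an assumption for illustration; the suite's attention_decode_gqa kernel defines the actual mapping:

```rust
// GQA sketch: 12 query heads share 2 KV heads, so each group of
// NH / NKV = 6 consecutive Q-heads reads the same KV head.
// (Contiguous grouping is assumed here for illustration.)
const NH: usize = 12; // query heads
const NKV: usize = 2; // key/value heads

fn kv_head_for(q_head: usize) -> usize {
    q_head / (NH / NKV)
}

fn main() {
    assert_eq!(kv_head_for(0), 0); // Q-heads 0..=5 -> KV head 0
    assert_eq!(kv_head_for(5), 0);
    assert_eq!(kv_head_for(6), 1); // Q-heads 6..=11 -> KV head 1
    assert_eq!(kv_head_for(11), 1);
    println!("GQA mapping ok");
}
```

The payoff of this sharing is bandwidth: decode reads 2 KV heads from the cache instead of 12, which is exactly what a bandwidth-bound workload wants.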
Per token, decode reads ≈ 2.6 GB of weights across 28 layers. That makes it a bandwidth benchmark, not a FLOPs benchmark. The hardware ceiling is bandwidth ÷ bytes_per_token:
| Device | Memory bandwidth | Theoretical max tok/s |
|---|---|---|
| Apple M2 Max | 400 GB/s | 154 |
| Apple M4 | 120 GB/s | 46 |
| Apple M4 Pro | 273 GB/s | 105 |
| NVIDIA H100 SXM | 3,350 GB/s | 1,288 |
| NVIDIA RTX 4090 | 1,008 GB/s | 388 |
| AWS Trainium2 | 2,800 GB/s | 1,077 |
| Huawei Ascend 910B2 | 1,228 GB/s | 472 |
| Cambricon MLU590 | 1,228 GB/s | 472 |
Any kernel that reaches 60% of this ceiling is competitive with hand-tuned production code; 80% is the practical target for a well-optimized memory-bound kernel.
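The ceiling column is just bandwidth divided by per-token traffic. A minimal sketch, assuming the ≈ 2.6 GB/token figure above:

```rust
// Theoretical decode ceiling: memory bandwidth / bytes moved per token.
fn ceiling_tok_s(bandwidth_bytes_per_s: f64, bytes_per_token: f64) -> f64 {
    bandwidth_bytes_per_s / bytes_per_token
}

fn main() {
    let bytes_per_token = 2.6e9; // ~2.6 GB of weights read per decoded token
    for (name, bw) in [("M2 Max", 400e9), ("M4", 120e9), ("H100 SXM", 3350e9)] {
        // Matches the table: 154, 46, and 1288 tok/s respectively.
        println!("{name}: {:.0} tok/s", ceiling_tok_s(bw, bytes_per_token));
    }
}
```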
10.2 The 13-Kernel Suite
A full transformer layer in decode mode reduces to 8 dispatches plus 5 model-level kernels (embedding, two RMSNorm variants, RoPE, argmax). The complete list, with shapes for the 1.5B model (D=1536, NH=12, NKV=2, DH=128, INTER=8960, VOCAB=151936):
| # | Kernel | Op | Input → Output shape |
|---|---|---|---|
| 1 | rms_norm_1536 | RMSNorm + γ scale | (1, D) → (1, D) |
| 2 | embedding_lookup | gather row from table | (VOCAB, D), (1,) → (1, D) |
| 3 | q_proj_matvec | matvec + bias | (1, D) → (1, NH·DH) |
| 4 | kv_proj_matvec | fused K + V matvec + bias | (1, D) → (1, NKV·DH) × 2 |
| 5 | rope_q_decode | RoPE on Q heads, in place | (NH, DH) → (NH, DH) |
| 6 | rope_k_decode | RoPE on K heads, in place | (NKV, DH) → (NKV, DH) |
| 7 | attention_decode_gqa | GQA attention with KV cache | (NH, DH) + KV cache → (NH, DH) |
| 8 | o_proj_residual | O-projection + residual add | (1, NH·DH) → (1, D) |
| 9 | mlp_gate_up_silu | fused gate + up + silu·mul | (1, D) → (1, INTER) |
| 10 | down_proj_residual | down-projection + residual add | (1, INTER) → (1, D) |
| 11 | silu_mul_fused | standalone SwiGLU | (1, INTER) × 2 → (1, INTER) |
| 12 | residual_add | elementwise add | (1, D) × 2 → (1, D) |
| 13 | argmax_greedy | argmax over logits | (1, VOCAB) → (1, 1) u32 |
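The ≈ 2.6 GB/token figure from §10.1 can be recovered from these shapes. A sketch under two assumptions: bf16 weights throughout, and the embedding table excluded because decode gathers only one row of it per token (biases are omitted as negligible):

```rust
// Per-token weight traffic for decode, from the 1.5B model shapes.
const D: usize = 1536;
const NH: usize = 12;
const NKV: usize = 2;
const DH: usize = 128;
const INTER: usize = 8960;
const LAYERS: usize = 28;

fn weight_params_per_layer() -> usize {
    D * NH * DH          // q_proj
    + 2 * D * NKV * DH   // k_proj + v_proj (fused kernel, two matrices)
    + NH * DH * D        // o_proj
    + 3 * D * INTER      // gate, up, and down projections
    + 2 * D              // two RMSNorm gamma vectors
}

fn main() {
    let bytes = LAYERS * weight_params_per_layer() * 2; // bf16 = 2 bytes/param
    println!("{:.2} GB read per decoded token", bytes as f64 / 1e9); // ~2.62 GB
}
```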
The full Rust source is at crates/deepseek_metal/src/tile_kernels.rs, expressed against the safe tile.rs view API:
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn rms_norm_1536(input: *const f32, gamma: *const f32, output: *mut f32) {
    let ctx = unsafe { GmDeviceCtx::new() };
    let in_v = unsafe { ctx.view::<1, D, f32>(input) };
    let g_v = unsafe { ctx.view::<1, D, f32>(gamma) };
    let out_v = unsafe { ctx.view_mut::<1, D, f32>(output) };
    let x = tile_load_view_f32(&in_v);
    let g = tile_load_view_f32(&g_v);
    let normed = safe::tile_rms_norm_f32::<1, D>(x, 1e-6);
    let out = safe::tile_mul_f32::<1, D>(normed, g);
    tile_store_view_f32(&out_v, out);
}
```
The same source compiles to all ten mlir_to_<target> backends. Per-target reference kernels are checked in under benchmarks/deepseek_tile_kernels/templates/<target>/.
10.3 Apple M2 Max — Headline Result
Hardware: Apple M2 Max, 12-core CPU, 38-core GPU, 400 GB/s unified memory bandwidth, macOS 14.5, Metal 3.1.
Setup: 28-layer DeepSeek-R1-Distill-Qwen-1.5B, bf16 weights uploaded directly to GPU as Metal bfloat. Single Metal command buffer per forward pass. Repetition penalty 1.3, temperature 0.0 (greedy).
| Implementation | Decode tok/s | % of peak (154) |
|---|---|---|
| ascend-rs (Rust → MSL) | 91.7 | 60% |
| MLX 0.29.1 (Apple, hand-tuned) | ≈ 88 | 57% |
The Rust-source kernels, after passing through rustc_codegen_mlir → mlir_to_msl, outperform Apple’s hand-tuned MLX on decode. Decode is the dominant cost in a typical inference session (one prompt, hundreds of generated tokens), so this is the number that matters for end-user latency.
How that 91.7 was reached
Optimization rounds on M2 Max (each step measured against the previous):
| Step | tok/s | Δ |
|---|---|---|
| Baseline (templates as committed) | 90.3 | — |
| attention_decode_v4 (TG-mem Q cache + float4) | 91.3 | +1.0 |
| Token-buffer hoist out of inner loop | 91.7 | +0.4 |
| Final | 91.7 | +1.4 |
Two attempted optimisations were measured and rolled back because they regressed:
| Attempted | tok/s | Δ |
|---|---|---|
| matvec_f16_cached (manual A-cache) | 85.1 | −5.2 (revert) |
| Fused RMSNorm + next matvec | 78.7 | −13 (revert) |
The lessons are documented in crates/deepseek_metal/templates/ and in the optimization log; the short version is that the Apple GPU’s L1/L2 already caches reused activations, so manual threadgroup caching only helps when (a) the data doesn’t fit in cache and (b) the per-thread compute is large enough to amortize the barrier. For decode matvec with K = 1536 (6 KB), neither holds.
10.4 Apple M4 — Smaller-Memory Result
Hardware: Apple M4, 4P+6E CPU, 10-core GPU, 120 GB/s memory bandwidth, macOS 14.5.
| Implementation | Decode tok/s | Prefill tok/s |
|---|---|---|
| ascend-rs (Rust → MSL) | 33–35 | 9.3 |
| MLX 0.29.1 | 32 | 72 |
The M4 result confirms the M2 Max story for decode: the codegen path beats MLX (33–35 vs 32). Prefill is a different story — MLX uses Apple’s simdgroup_matrix_multiply primitive, which fits prefill’s compute-bound profile (large matmuls, M ≫ 1) very well. The ascend-rs prefill path uses a tiled matmul kernel that hits 9.3 tok/s; closing the prefill gap is in scope for the next iteration (templates/matmul_simd.metal is the in-progress replacement).
10.5 Where the Time Goes — Per-Kernel Breakdown
For one decoded token on M2 Max (28 layers × 8 dispatches + 5 model-level dispatches = 229 kernel launches):
| Kernel class | Per-token time (ms) | % of decode |
|---|---|---|
| Q/K/V/O matvecs | 4.3 | 39% |
| Gate + up + silu (MLP) | 3.1 | 28% |
| Down-projection | 2.1 | 19% |
| Attention (decode v4) | 0.8 | 7% |
| RMSNorm × 2/layer | 0.4 | 4% |
| RoPE Q + K | 0.2 | 2% |
| Argmax over vocab | 0.1 | 1% |
| Total | 11.0 | 100% |
The seven matvec/MLP kernels — items 3, 4, 8, 9, 10 from the suite in §10.2 — account for 86% of decode time. Optimisation effort returns the most when spent on those kernels, which is why all the wins listed in §10.3 targeted the matvec / attention path. Norms and RoPE together cost less than 1 ms per token; fusing them away (as we tried) saves no measurable bandwidth and adds compute.
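The 11.0 ms total is consistent with the 60%-of-peak headline; a quick cross-check under the ≈ 2.6 GB/token assumption:

```rust
// Cross-check: per-token time -> achieved bandwidth -> fraction of peak.
fn utilization(ms_per_token: f64, bytes_per_token: f64, peak_bw: f64) -> f64 {
    bytes_per_token * (1000.0 / ms_per_token) / peak_bw
}

fn main() {
    // M2 Max figures from the table above: 11.0 ms/token, 400 GB/s peak.
    let u = utilization(11.0, 2.6e9, 400e9); // ~0.59, i.e. ~60% of peak
    println!("{:.1} tok/s, {:.0}% of peak bandwidth", 1000.0 / 11.0, u * 100.0);
}
```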
10.6 Cross-Vendor Status
The same Rust source under crates/deepseek_metal/src/tile_kernels.rs is the input to all ten codegen backends. As of this writing:
| Backend | Target | Suite compiles | End-to-end run | Notes |
|---|---|---|---|---|
| mlir_to_msl | Apple M-series GPU (Metal) | yes | yes | 91.7 tok/s on M2 Max |
| mlir_to_gpu | NVIDIA (CUDA) | yes | pending | Uses cudarc runtime |
| mlir_to_musa | Moore Threads MTT S4000 | yes | pending | Source-level CUDA compatible |
| mlir_to_cpp | Huawei Ascend 910B (V-pipe) | yes | partial | Cube ops route through PTO |
| mlir_to_pto | Huawei Ascend 910B (cube) | yes | pending | ptoas shim awaits CANN 9.x |
| mlir_to_nki | AWS Trainium / Trainium2 | yes | pending | Emits NKI Python |
| mlir_to_aie | AMD Ryzen AI (AIE2P) | yes | pending | IRON Python via aiecc.py |
| mlir_to_bang | Cambricon MLU370/590 | yes | pending | Explicit DMA model |
| mlir_to_gaudi | Intel Gaudi 2/3 | yes | pending | TPC-C, 256-wide SIMD |
| mlir_to_spirv | Vulkan / Metal (SPIR-V) | yes | pending | Compute shaders |
“Compiles” means the kernel goes through mlir_to_<target> and the vendor’s compiler accepts the output. “End-to-end run” means it produces correct logits on real hardware against a known-good reference.
The set of “pending” entries is not a measure of how far each backend has to go — it is a measure of how much hardware time we have allocated to driving the harness on each rig. The codegen surface for all ten is complete and unit-tested under crates/mlir_to_<target>_tests/.
10.7 Reproducing the Apple Result
```sh
# Clone the public artifact + benchmark repo.
git clone https://github.com/yijunyu/ascend-rs
cd ascend-rs

# On a Mac with Xcode command-line tools and a Hugging Face token in env:
cargo run --release -p deepseek_metal -- \
    --prompt "The capital of France is" \
    --max-tokens 128
```
The first run downloads DeepSeek-R1-Distill-Qwen-1.5B from Hugging Face (≈ 3 GB) and caches it at ~/.cache/huggingface/. Subsequent runs print:
```
Loaded DeepSeek-R1-Distill-Qwen-1.5B on Metal
Prefill: 0.23s (26.1 tok/s)
[generated text]
Generated 128 tokens in 1.40s (91.43 tok/s)
```
The MLX baseline used for comparison:
```sh
pip install mlx mlx-lm
python -m mlx_lm.generate \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --prompt "The capital of France is" \
    --max-tokens 128
```
Both runs use the same model weights and the same prompt; only the kernel implementation differs.
10.8 Why a Suite, Not a Single Kernel
Single-kernel benchmarks (softmax, GEMM, RMSNorm in isolation) are useful for diagnosing a specific bottleneck, but they systematically over-report the value of optimisations that don’t compose:
- Caching activations is a clear win on a standalone matvec benchmark and a clear loss inside a transformer layer where the cache is already warm from the previous matvec.
- Fusing RMSNorm into the next matvec wins on a fused-kernel microbenchmark and loses inside a real layer where the same norm output is consumed by three matvecs (Q, K, V).
- A “fast attention” kernel that ignores the KV cache is irrelevant; in decode, the KV cache is the attention input.
A 13-kernel suite tied to a real model is the smallest benchmark that catches these mistakes. It also lets vendors compare backends honestly: every one of the ten backends sees the same Rust source, the same shapes, and the same memory-traffic budget.
10.9 Key Takeaways
- The Rust-to-Metal codegen path matches or beats hand-tuned MLX on decode. 91.7 tok/s on M2 Max (vs ≈ 88 for MLX) and 33–35 tok/s on M4 (vs 32 for MLX) demonstrate that a memory-safe kernel toolchain does not give up performance on the path that matters most for interactive inference.
- Decode is bandwidth-bound; the suite hits 60% of peak. The remaining 40% is split between dispatch overhead (≈ 229 launches per token) and matmul kernels that are not yet using Apple’s simdgroup_matrix_multiply primitive. Both have known fixes.
- Microbenchmarks lie about full-pipeline performance. Two optimisations measured in isolation as wins (caching, fusion) regressed the full decode path by 5–13 tok/s. Suite-level measurement is the only way to catch this.
- One Rust source, ten backends. The same tile_kernels.rs compiles through mlir_to_<target> for Metal, CUDA, MUSA, AscendC, PTO, NKI, AIE, BANG, Gaudi, and SPIR-V. Apple is the first backend to be measured end-to-end at production fidelity; the rest have the codegen surface ready and are blocked only on hardware time.