
10. DeepSeek Inference: A Cross-Platform Kernel Benchmark Suite

Summary: Softmax and GEMM are useful microbenchmarks, but a real inference workload is the only honest test of a kernel toolchain. We packaged the 13 kernels needed for a full DeepSeek-R1-Distill-Qwen-1.5B decode step as a portable suite, ran the Rust source through mlir_to_msl, and measured the result on Apple silicon. The generated Metal kernels reach 91.7 tok/s on M2 Max (60% of the 400 GB/s memory-bandwidth ceiling) and 33–35 tok/s on M4, beating Apple’s hand-tuned MLX runtime on decode. The same Rust source targets nine other backends; this chapter documents the suite so it can be reproduced on any of them.


10.1 Why DeepSeek?

DeepSeek-R1-Distill-Qwen-1.5B is small enough to fit in 8 GB of unified memory, large enough to be bandwidth-bound on every realistic accelerator, and architecturally representative of the modern transformer family:

  • Grouped-query attention (GQA) — 12 Q-heads share 2 KV-heads.
  • SwiGLU MLP — three matmuls per layer, fusable into one kernel.
  • RMSNorm — replaces LayerNorm everywhere.
  • Rotary position embeddings — applied in-place to Q and K.

Per token, decode reads ≈ 2.6 GB of weights across 28 layers. That makes it a bandwidth benchmark, not a FLOPs benchmark. The hardware ceiling is bandwidth ÷ bytes_per_token:
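That ceiling can be spelled out in a few lines. A minimal sketch, assuming the ≈ 2.6 GB-per-token figure above:

```rust
// Sketch: the decode ceiling for a bandwidth-bound model is simply
// memory bandwidth divided by the bytes read per generated token.
fn ceiling_tok_per_s(bandwidth_gb_per_s: f64, gb_per_token: f64) -> f64 {
    bandwidth_gb_per_s / gb_per_token
}

fn main() {
    let gb_per_token = 2.6; // ≈ weight bytes read per decoded token
    for (name, bw) in [("M2 Max", 400.0), ("M4", 120.0), ("H100 SXM", 3350.0)] {
        // Prints the theoretical maximum, rounded to whole tokens/s.
        println!("{name}: {:.0} tok/s ceiling", ceiling_tok_per_s(bw, gb_per_token));
    }
}
```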

Device                Memory bandwidth   Theoretical max tok/s
Apple M2 Max          400 GB/s           154
Apple M4              120 GB/s           46
Apple M4 Pro          273 GB/s           105
NVIDIA H100 SXM       3,350 GB/s         1,288
NVIDIA RTX 4090       1,008 GB/s         388
AWS Trainium2         2,800 GB/s         1,077
Huawei Ascend 910B2   1,228 GB/s         472
Cambricon MLU590      1,228 GB/s         472

Any kernel that reaches 60% of this ceiling is competitive with hand-tuned production code; 80% is the practical goal for a memory-bound kernel.


10.2 The 13-Kernel Suite

A full transformer layer in decode mode reduces to 8 dispatches plus 5 model-level kernels (embedding, two RMSNorm variants, RoPE, argmax). The complete list, with shapes for the 1.5B model (D=1536, NH=12, NKV=2, DH=128, INTER=8960, VOCAB=151936):

 #  Kernel                Op                              Input → Output shape
 1  rms_norm_1536         RMSNorm + γ scale               (1, D) → (1, D)
 2  embedding_lookup      gather row from table           (VOCAB, D), (1,) → (1, D)
 3  q_proj_matvec         matvec + bias                   (1, D) → (1, NH·DH)
 4  kv_proj_matvec        fused K + V matvec + bias       (1, D) → (1, NKV·DH) × 2
 5  rope_q_decode         RoPE on Q heads, in place       (NH, DH) → (NH, DH)
 6  rope_k_decode         RoPE on K heads, in place       (NKV, DH) → (NKV, DH)
 7  attention_decode_gqa  GQA attention with KV cache     (NH, DH) + KV cache → (NH, DH)
 8  o_proj_residual       O-projection + residual add     (1, NH·DH) → (1, D)
 9  mlp_gate_up_silu      fused gate + up + silu·mul      (1, D) → (1, INTER)
10  down_proj_residual    down-projection + residual add  (1, INTER) → (1, D)
11  silu_mul_fused        standalone SwiGLU               (1, INTER) × 2 → (1, INTER)
12  residual_add          elementwise add                 (1, D) × 2 → (1, D)
13  argmax_greedy         argmax over logits              (1, VOCAB) → (1, 1) u32
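As a sanity check, the ≈ 2.6 GB-per-token figure from §10.1 follows directly from these shapes. A sketch of the tally (bf16 weights; biases, norm vectors, and the output embedding are omitted as they are not part of the quoted per-layer figure):

```rust
// Sketch: per-token weight traffic implied by the 1.5B model shapes.
const D: u64 = 1536;
const NH: u64 = 12;
const NKV: u64 = 2;
const DH: u64 = 128;
const INTER: u64 = 8960;
const LAYERS: u64 = 28;

fn main() {
    let per_layer = D * (NH * DH)        // q_proj
        + D * (2 * NKV * DH)             // fused k_proj + v_proj
        + (NH * DH) * D                  // o_proj
        + 2 * D * INTER                  // gate_proj + up_proj
        + INTER * D;                     // down_proj
    let bytes = per_layer * LAYERS * 2;  // bf16 = 2 bytes per weight
    println!("{:.2} GB per token", bytes as f64 / 1e9); // ≈ 2.62 GB
}
```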

The full Rust source is at crates/deepseek_metal/src/tile_kernels.rs, expressed against the safe tile.rs view API:

#[ascend_std::aiv_kernel]
pub unsafe fn rms_norm_1536(input: *const f32, gamma: *const f32, output: *mut f32) {
    // Wrap the raw device pointers in bounds-checked (1, D) views.
    let ctx = unsafe { GmDeviceCtx::new() };
    let in_v   = unsafe { ctx.view::<1, D, f32>(input) };
    let g_v    = unsafe { ctx.view::<1, D, f32>(gamma) };
    let out_v  = unsafe { ctx.view_mut::<1, D, f32>(output) };

    // RMSNorm with eps = 1e-6, followed by the elementwise γ scale.
    let x      = tile_load_view_f32(&in_v);
    let g      = tile_load_view_f32(&g_v);
    let normed = safe::tile_rms_norm_f32::<1, D>(x, 1e-6);
    let out    = safe::tile_mul_f32::<1, D>(normed, g);
    tile_store_view_f32(&out_v, out);
}

The same source compiles to all ten mlir_to_<target> backends. Per-target reference kernels are checked in under benchmarks/deepseek_tile_kernels/templates/<target>/.


10.3 Apple M2 Max — Headline Result

Hardware: Apple M2 Max, 12-core CPU, 38-core GPU, 400 GB/s unified memory bandwidth, macOS 14.5, Metal 3.1.

Setup: 28-layer DeepSeek-R1-Distill-Qwen-1.5B, bf16 weights uploaded directly to GPU as Metal bfloat. Single Metal command buffer per forward pass. Repetition penalty 1.3, temperature 0.0 (greedy).

Implementation                    Decode tok/s   % of peak (154)
ascend-rs (Rust → MSL)            91.7           60%
MLX 0.29.1 (Apple, hand-tuned)    ≈ 88           57%

The Rust-source kernels, after passing through rustc_codegen_mlir → mlir_to_msl, outperform Apple’s hand-tuned MLX on decode. Decode is the dominant cost in a typical inference session (one prompt, hundreds of generated tokens), so this is the number that matters for end-user latency.

How that 91.7 was reached

Optimization rounds on M2 Max (each step measured against the previous):

Step                                             tok/s   Δ
Baseline (templates as committed)                90.3    —
attention_decode_v4 (TG-mem Q cache + float4)    91.3    +1.0
Token-buffer hoist out of inner loop             91.7    +0.4
Final                                            91.7    +1.4

Two attempted optimisations were measured and rolled back because they regressed:

Attempted                            tok/s   Δ
matvec_f16_cached (manual A-cache)   85.1    −5.2 (reverted)
Fused RMSNorm + next matvec          78.7    −13.0 (reverted)

The lessons are documented in crates/deepseek_metal/templates/ and in the optimization log; the short version is that the Apple GPU’s L1/L2 already caches reused activations, so manual threadgroup caching only helps when (a) the data doesn’t fit in cache and (b) the per-thread compute is large enough to amortize the barrier. For decode matvec with K = 1536 (6 KB), neither holds.
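The two conditions can be phrased as a back-of-envelope check. This is an illustrative sketch, not code from the repo; both threshold constants are assumptions chosen for illustration, not measured Apple GPU parameters:

```rust
// Sketch of the two conditions from the text. Both constants are
// assumed illustrative figures, not vendor-documented values.
fn manual_caching_worth_it(working_set_bytes: usize, flops_per_thread: usize) -> bool {
    const ASSUMED_CACHE_BYTES: usize = 32 * 1024;        // illustrative only
    const MIN_FLOPS_TO_AMORTIZE_BARRIER: usize = 512;    // illustrative only
    // (a) the data must not already fit in cache, AND
    // (b) per-thread compute must amortize the threadgroup barrier.
    working_set_bytes > ASSUMED_CACHE_BYTES && flops_per_thread > MIN_FLOPS_TO_AMORTIZE_BARRIER
}

fn main() {
    // Decode matvec: K = 1536 f32 activations = 6 KB. It fits in cache,
    // so a manual threadgroup copy only adds barrier cost.
    let k = 1536;
    println!("{}", manual_caching_worth_it(k * 4, k / 8)); // false
}
```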


10.4 Apple M4 — Smaller-Memory Result

Hardware: Apple M4, 4P+6E CPU, 10-core GPU, 120 GB/s memory bandwidth, macOS 14.5.

Implementation            Decode tok/s   Prefill tok/s
ascend-rs (Rust → MSL)    33–35          9.3
MLX 0.29.1                32             72

The M4 result confirms the M2 Max story for decode: the codegen path beats MLX (33–35 vs 32). Prefill is a different story — MLX uses Apple’s simdgroup_matrix_multiply primitive, which fits prefill’s compute-bound profile (large matmuls, M ≫ 1) very well. The ascend-rs prefill path uses a tiled matmul kernel that hits 9.3 tok/s; closing the prefill gap is in scope for the next iteration (templates/matmul_simd.metal is the in-progress replacement).
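The decode/prefill split comes down to arithmetic intensity: a matmul with M rows performs 2·M·K·N FLOPs against 2·K·N bytes of bf16 weights, i.e. M FLOPs per weight byte. A sketch of that arithmetic (the prompt length of 512 is an illustrative assumption):

```rust
// Sketch: FLOPs per bf16 weight byte for an (M, K) × (K, N) matmul.
// Decode has M = 1 (bandwidth-bound); prefill has M = prompt length
// (compute-bound), which is where simdgroup matrix hardware pays off.
fn flops_per_weight_byte(m: u64, k: u64, n: u64) -> f64 {
    let flops = 2 * m * k * n;        // one multiply-accumulate = 2 FLOPs
    let weight_bytes = 2 * k * n;     // bf16 weights, each read once
    flops as f64 / weight_bytes as f64
}

fn main() {
    println!("decode:  {} FLOP/byte", flops_per_weight_byte(1, 1536, 1536));
    println!("prefill: {} FLOP/byte", flops_per_weight_byte(512, 1536, 1536));
}
```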


10.5 Where the Time Goes — Per-Kernel Breakdown

For one decoded token on M2 Max (28 layers × 8 dispatches + 5 model-level dispatches = 229 kernel launches):

Kernel class              Per-token time (ms)   % of decode
Q/K/V/O matvecs           4.3                   39%
Gate + up + silu (MLP)    3.1                   28%
Down-projection           2.1                   19%
Attention (decode v4)     0.8                   7%
RMSNorm × 2/layer         0.4                   4%
RoPE Q + K                0.2                   2%
Argmax over vocab         0.1                   1%
Total                     11.0                  100%
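The totals can be cross-checked in a few lines. A sketch; the ≈ 90.9 tok/s implied by the rounded per-class times is consistent with the measured 91.7:

```rust
// Sketch: cross-check of the per-token breakdown above.
fn main() {
    let launches = 28 * 8 + 5; // per-layer dispatches + model-level dispatches
    let per_token_ms: f64 = 4.3 + 3.1 + 2.1 + 0.8 + 0.4 + 0.2 + 0.1;
    println!(
        "{launches} launches, {:.1} ms/token, {:.1} tok/s",
        per_token_ms,
        1000.0 / per_token_ms
    );
}
```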

The matvec/MLP kernels — items 3, 4, 8, 9, 10 from the suite in §10.2, which between them read all seven weight matrices (Q, K, V, O, gate, up, down) — account for 86% of decode time. Optimisation effort returns the most when spent on those kernels, which is why all the wins listed in §10.3 targeted the matvec / attention path. Norms and RoPE together cost less than 1 ms per token; fusing them away (as we tried) saves no measurable bandwidth and adds compute.


10.6 Cross-Vendor Status

The same Rust source under crates/deepseek_metal/src/tile_kernels.rs is the input to all ten codegen backends. As of this writing:

Backend        Target                        Suite compiles   End-to-end run   Notes
mlir_to_msl    Apple M-series GPU (Metal)    yes              yes              91.7 tok/s on M2 Max
mlir_to_gpu    NVIDIA (CUDA)                 yes              pending          Uses cudarc runtime
mlir_to_musa   Moore Threads MTT S4000       yes              pending          Source-level CUDA compatible
mlir_to_cpp    Huawei Ascend 910B (V-pipe)   yes              partial          Cube ops route through PTO
mlir_to_pto    Huawei Ascend 910B (cube)     yes              pending          ptoas shim awaits CANN 9.x
mlir_to_nki    AWS Trainium / Trainium2      yes              pending          Emits NKI Python
mlir_to_aie    AMD Ryzen AI (AIE2P)          yes              pending          IRON Python via aiecc.py
mlir_to_bang   Cambricon MLU370/590          yes              pending          Explicit DMA model
mlir_to_gaudi  Intel Gaudi 2/3               yes              pending          TPC-C, 256-wide SIMD
mlir_to_spirv  Vulkan / Metal (SPIR-V)       yes              pending          Compute shaders

“Compiles” means the kernel goes through mlir_to_<target> and the vendor’s compiler accepts the output. “End-to-end run” means it produces correct logits on real hardware against a known-good reference.

The set of “pending” entries is not a measure of how far each backend has to go — it is a measure of how much hardware time we have allocated to driving the harness on each rig. The codegen surface for all ten is complete and unit-tested under crates/mlir_to_<target>_tests/.


10.7 Reproducing the Apple Result

# Clone the public artifact + benchmark repo.
git clone https://github.com/yijunyu/ascend-rs
cd ascend-rs

# On a Mac with Xcode command-line tools and a Hugging Face token in env:
cargo run --release -p deepseek_metal -- \
    --prompt "The capital of France is" \
    --max-tokens 128

The first run downloads DeepSeek-R1-Distill-Qwen-1.5B from Hugging Face (≈ 3 GB) and caches it at ~/.cache/huggingface/. Subsequent runs print:

Loaded DeepSeek-R1-Distill-Qwen-1.5B on Metal
Prefill: 0.23s (26.1 tok/s)
[generated text]
Generated 128 tokens in 1.40s (91.43 tok/s)
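The printed decode rate is simply generated tokens over wall time, so the log is easy to sanity-check (a sketch):

```rust
// Sketch: recompute the decode rate printed in the log above.
fn main() {
    let tokens = 128.0_f64;
    let seconds = 1.40_f64;
    println!("{:.2} tok/s", tokens / seconds); // 91.43
}
```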

The MLX baseline used for comparison:

pip install mlx mlx-lm
python -m mlx_lm.generate \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --prompt "The capital of France is" \
    --max-tokens 128

Both runs use the same model weights and the same prompt; only the kernel implementation differs.


10.8 Why a Suite, Not a Single Kernel

Single-kernel benchmarks (softmax, GEMM, RMSNorm in isolation) are useful for diagnosing a specific bottleneck, but they systematically over-report the value of optimisations that don’t compose:

  • Caching activations is a clear win on a standalone matvec benchmark and a clear loss inside a transformer layer where the cache is already warm from the previous matvec.
  • Fusing RMSNorm into the next matvec wins on a fused-kernel microbenchmark and loses inside a real layer where the same norm output is consumed by three matvecs (Q, K, V).
  • A “fast attention” kernel that ignores the KV cache is irrelevant; in decode, the KV cache is the attention input.

A 13-kernel suite tied to a real model is the smallest benchmark that catches these mistakes. It also lets vendors compare backends honestly: every one of the ten backends sees the same Rust source, the same shapes, and the same memory-traffic budget.


10.9 Key Takeaways

  1. The Rust-to-Metal codegen path matches or beats hand-tuned MLX on decode. 91.7 tok/s on M2 Max (vs ≈ 88 for MLX) and 33–35 tok/s on M4 (vs 32 for MLX) demonstrate that a memory-safe kernel toolchain does not give up performance on the path that matters most for interactive inference.

  2. Decode is bandwidth-bound; the suite hits 60% of peak. The remaining 40% is split between dispatch overhead (≈ 229 launches per token) and matmul kernels that are not yet using Apple’s simdgroup_matrix_multiply primitive. Both have known fixes.

  3. Microbenchmarks lie about full-pipeline performance. Two optimisations measured in isolation as wins (caching, fusion) regressed the full decode path by 5–13 tok/s. Suite-level measurement is the only way to catch this.

  4. One Rust source, ten backends. The same tile_kernels.rs compiles through mlir_to_<target> for Metal, CUDA, MUSA, AscendC, PTO, NKI, AIE, BANG, Gaudi, and SPIR-V. Apple is the first backend to be measured end-to-end at production fidelity; the rest have the codegen surface ready and are blocked only on hardware time.