English | 中文版
9. DeepSeek Inference: A Cross-Platform Kernel Benchmark Suite
Summary: Softmax and GEMM are useful microbenchmarks, but a real inference workload is the only honest test of a kernel toolchain. We packaged the 13 kernels needed for a full DeepSeek-R1-Distill-Qwen-1.5B decode step as a portable suite and measured the same Rust source on four production accelerators. Headline result: 168.9 tok/s on Ascend 910B2 via the joint
mlir_to_cpp + mlir_to_pto path (2.47× the aclnn-only baseline, 45.6× the 3.7 tok/s CPU reference). Cross-vendor cross-validation: 162.9 tok/s on Google TPU v2-8 via emitted Pallas, 91.7 tok/s on Apple M2 Max via emitted Metal (beating Apple’s hand-tuned MLX on decode), and 53.7 tok/s on NVIDIA T4 via emitted CUDA — all from the same 13-kernel Rust source. The rest of the chapter documents the suite, the per-platform results, the two backends where end-to-end tok/s is not reported (AWS Trainium NKI and Vulkan/SPIR-V) and why, and how to reproduce any of the numbers above.
9.1 Why DeepSeek?
DeepSeek-R1-Distill-Qwen-1.5B is small enough to fit in 8 GB of unified memory, large enough to be bandwidth-bound on every realistic accelerator, and architecturally representative of the modern transformer family:
- Grouped-query attention (GQA) — 12 Q-heads share 2 KV-heads.
- SwiGLU MLP — three matmuls per layer, fusable into one kernel.
- RMSNorm — replaces LayerNorm everywhere.
- Rotary position embeddings — applied in-place to Q and K.
Per token, decode reads ≈ 2.6 GB of weights across 28 layers. That makes it a bandwidth benchmark, not a FLOPs benchmark. The hardware ceiling is bandwidth ÷ bytes_per_token:
| Device | Memory bandwidth | Theoretical max tok/s |
|---|---|---|
| Apple M2 Max | 400 GB/s | 154 |
| Apple M4 | 120 GB/s | 46 |
| Apple M4 Pro | 273 GB/s | 105 |
| NVIDIA H100 SXM | 3,350 GB/s | 1,288 |
| NVIDIA RTX 4090 | 1,008 GB/s | 388 |
| NVIDIA Tesla T4 | 320 GB/s | 123 |
| AWS Trainium2 | 2,800 GB/s | 1,077 |
| Google TPU v2-8 | 600 GB/s | 231 |
| Huawei Ascend 910B2 | 1,228 GB/s | 472 |
| Cambricon MLU590 | 1,228 GB/s | 472 |
Any kernel that reaches 60% of this ceiling is competitive with hand-tuned production code; 80% is the realistic goal for a memory-bound kernel. CPU reference throughput on the same model is 3.7 tok/s — the floor every accelerator path has to clear.
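The ceiling column in the table above is a one-line roofline division. A minimal sketch, using the ≈ 2.6 GB/token figure derived above (the helper name is ours, not part of the suite):

```rust
/// Bandwidth-bound decode ceiling: tok/s = memory bandwidth in bytes/s
/// divided by bytes of weights read per decoded token.
fn max_tok_per_s(bandwidth_gb_s: f64, bytes_per_token_gb: f64) -> f64 {
    bandwidth_gb_s / bytes_per_token_gb
}

fn main() {
    let bpt = 2.6; // ≈ 2.6 GB of f16 weights read per token across 28 layers
    for (name, bw) in [
        ("Apple M2 Max", 400.0),
        ("NVIDIA Tesla T4", 320.0),
        ("Huawei Ascend 910B2", 1228.0),
    ] {
        // Prints 154, 123, and 472 — matching the table rows above.
        println!("{name}: {:.0} tok/s ceiling", max_tok_per_s(bw, bpt));
    }
}
```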
9.2 The 13-Kernel Suite
A full transformer layer in decode mode reduces to 8 dispatches plus 5 model-level kernels (embedding, two RMSNorm variants, RoPE, argmax). The complete list, with shapes for the 1.5B model (D=1536, NH=12, NKV=2, DH=128, INTER=8960, VOCAB=151936):
| # | Kernel | Op | Input → Output shape |
|---|---|---|---|
| 1 | rms_norm_1536 | RMSNorm + γ scale | (1, D) → (1, D) |
| 2 | embedding_lookup | gather row from table | (VOCAB, D), (1,) → (1, D) |
| 3 | q_proj_matvec | matvec + bias | (1, D) → (1, NH·DH) |
| 4 | kv_proj_matvec | fused K + V matvec + bias | (1, D) → (1, NKV·DH) × 2 |
| 5 | rope_q_decode | RoPE on Q heads, in place | (NH, DH) → (NH, DH) |
| 6 | rope_k_decode | RoPE on K heads, in place | (NKV, DH) → (NKV, DH) |
| 7 | attention_decode_gqa | GQA attention with KV cache | (NH, DH) + KV cache → (NH, DH) |
| 8 | o_proj_residual | O-projection + residual add | (1, NH·DH) → (1, D) |
| 9 | mlp_gate_up_silu | fused gate + up + silu·mul | (1, D) → (1, INTER) |
| 10 | down_proj_residual | down-projection + residual add | (1, INTER) → (1, D) |
| 11 | silu_mul_fused | standalone SwiGLU | (1, INTER) × 2 → (1, INTER) |
| 12 | residual_add | elementwise add | (1, D) × 2 → (1, D) |
| 13 | argmax_greedy | argmax over logits | (1, VOCAB) → (1, 1) u32 |
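For reference, the semantics of item 11 (silu_mul_fused) fit in a few lines of scalar Rust. This is a hedged host-side sketch assuming the standard SiLU definition silu(x) = x·σ(x); the on-device kernel is generated from tile_kernels.rs, not from this code:

```rust
// Scalar reference for silu_mul_fused (suite item 11): out = silu(gate) * up.
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

fn silu_mul(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}

fn main() {
    // silu(0) = 0, so a zero gate kills the up-projection contribution.
    let out = silu_mul(&[0.0, 1.0], &[2.0, 3.0]);
    println!("{out:?}");
}
```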
The full Rust source is at crates/deepseek_metal/src/tile_kernels.rs, expressed against the safe tile.rs view API:
#[ascend_std::aiv_kernel]
pub unsafe fn rms_norm_1536(input: *const f32, gamma: *const f32, output: *mut f32) {
    let ctx = unsafe { GmDeviceCtx::new() };
    let in_v = unsafe { ctx.view::<1, D, f32>(input) };
    let g_v = unsafe { ctx.view::<1, D, f32>(gamma) };
    let out_v = unsafe { ctx.view_mut::<1, D, f32>(output) };
    let x = tile_load_view_f32(&in_v);
    let g = tile_load_view_f32(&g_v);
    let normed = safe::tile_rms_norm_f32::<1, D>(x, 1e-6);
    let out = safe::tile_mul_f32::<1, D>(normed, g);
    tile_store_view_f32(&out_v, out);
}
The same source compiles to every mlir_to_<target> backend. Per-target reference kernels are checked in under benchmarks/deepseek_tile_kernels/templates/<target>/.
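A host-side scalar reference for the same kernel is useful for parity checks. This sketch assumes the conventional RMSNorm, x / √(mean(x²) + ε) · γ, with the 1e-6 epsilon visible in the tile source; it is not the tile-API implementation itself:

```rust
// Scalar reference for rms_norm_1536: normalise the row by its
// root-mean-square, then scale elementwise by gamma.
fn rms_norm(x: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gamma).map(|(&v, &g)| v * inv_rms * g).collect()
}

fn main() {
    // A constant row has RMS equal to its magnitude, so it normalises to ≈ 1.
    let out = rms_norm(&[2.0, 2.0, 2.0, 2.0], &[1.0; 4], 1e-6);
    println!("{out:?}");
}
```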
9.3 Ascend 910B2 — Headline Result
Hardware: Huawei Ascend 910B2, CANN 8.5.0, bisheng compiler, joint mlir_to_cpp + mlir_to_pto codegen path.
Setup: 28-layer DeepSeek-R1-Distill-Qwen-1.5B, f16 weights, single ACL stream per forward pass. The decode path uses cpp-tile kernels for RMSNorm / RoPE / SiLU, PTO cube matmul for the per-layer f16 projections, and cached-executor aclnnIncreFlashAttention for attention.
| Implementation | Decode tok/s | Speedup |
|---|---|---|
| CPU reference (float) | 3.7 | 1.00× |
| aclnn-only baseline | 68.3 | 18.5× |
| ascend-rs (joint mlir_to_cpp + mlir_to_pto) | 168.9 | 45.6× (2.47× vs. aclnn) |
How that 168.9 was reached
The sequence of optimisations applied on 910B2, each measured against the previous step:
| Step | tok/s | Δ |
|---|---|---|
| aclnn-only baseline (f16 matmuls via aclnnMatmul) | 68.3 | — |
| f16 PTO matmuls for all per-layer Q/K/V/O/gate/up/down projections | 114.5 | +46.2 |
| Host-side B-repack lm_head on PTO | 149.4 | +34.9 |
| Fused kv-proj and gate-up weight concatenation (single matmul per pair) | 151.6 | +2.2 |
| Custom cpp-tile residual_add_rms_norm (4.4 µs vs aclnn fused 27 µs) | 157.5 | +5.9 |
| Cached-executor aclnnIncreFlashAttention (38 µs vs plain 61 µs) | 168.0 | +10.5 |
| Misc: lm_head chunk sweep, QKV fusion, attention_1head_cpp via vec matvec | 168.9 | +0.9 |
The two custom kernels that contributed most (residual_add_rms_norm cpp-tile fused, and the f16 PTO matmul blocking) are both generated by rustc_codegen_mlir from plain Rust tile-API source — no hand-written AscendC. Detailed per-op timings are in Appendix I.
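The kv-proj / gate-up fusion step in the table above rests on a simple identity: stacking two weight matrices row-wise turns two matvecs over the same input into one dispatch. A minimal sketch with toy dense f32 weights (the real path fuses the f16 PTO matmuls; the names here are illustrative):

```rust
// One matvec over the row-wise concatenation [Wk; Wv] equals the
// concatenation of the two separate matvecs — the identity behind the
// fused kv-proj (and gate-up) step.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    let x = vec![1.0, 2.0];
    let wk = vec![vec![1.0, 0.0]]; // toy K projection
    let wv = vec![vec![0.0, 1.0]]; // toy V projection
    let fused: Vec<Vec<f32>> = wk.iter().chain(wv.iter()).cloned().collect();

    let separate = [matvec(&wk, &x), matvec(&wv, &x)].concat();
    assert_eq!(matvec(&fused, &x), separate); // one dispatch, same result
    println!("fused kv matvec = {:?}", matvec(&fused, &x));
}
```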
Same binary on 910C
The same kernel source rebuilds for Ascend 910C (cube-only), with the ptoas --cce-fatobj-link path handling the matmul side. On 910C, 98.4% of per-layer time runs on the NPU and 1.6% on the CPU — the only kernel still on host is RMSNorm, because the 910C cube unit does not speed it up (the kernel is memory-bound and the DMA copy dominates). End-to-end tok/s on 910C is not reported pending longer correctness validation across all 28 layers on a stable 910C chip allocation.
9.4 Google TPU v2-8 (Colab) — 162.9 tok/s
Hardware: Google Colab v2-8 (Cloud TPU, 8 cores × 8 MiB MXU, 600 GB/s HBM), mlir_to_tpu codegen emitting JAX Pallas.
Setup: rms_norm and rope_inplace via emitted Pallas kernels; GQA attention via emitted Pallas; matvec split by memory tier — Pallas for q/k/v/o projections (small, VMEM-friendly shapes) and XLA jnp.dot for gate/up/down/lm_head (large, benefit from XLA’s HBM-staging).
| Implementation | Decode tok/s | Parity with HF |
|---|---|---|
| ascend-rs (Rust → Pallas) | 162.9 | 16/16 greedy |
| Native JAX baseline (same shapes) | ≈ 166 | 16/16 |
The generated Pallas kernels reach 0.98× the native JAX baseline averaged over all per-op head-to-head measurements. Greedy-token parity was confirmed end-to-end: 16 out of 16 generated tokens match the HuggingFace reference implementation byte-for-byte. The TPU result is the most important cross-vendor cross-validation in the suite: it shows that a backend with no C++ exit at all — Pallas goes straight from a Python DSL to XLA — produces competitive output from the same Rust source that targets AscendC.
9.5 Apple M2 Max — 91.7 tok/s (Beats Hand-Tuned MLX)
Hardware: Apple M2 Max, 12-core CPU, 38-core GPU, 400 GB/s unified memory bandwidth, macOS 14.5, Metal 3.1.
Setup: 28-layer DeepSeek-R1-Distill-Qwen-1.5B, bf16 weights uploaded directly to GPU as Metal bfloat. Single Metal command buffer per forward pass. Repetition penalty 1.3, temperature 0.0 (greedy).
| Implementation | Decode tok/s | % of peak (154) |
|---|---|---|
| ascend-rs (Rust → MSL) | 91.7 | 60% |
| MLX 0.29.1 (Apple, hand-tuned) | ≈ 88 | 57% |
The Rust-source kernels, after passing through rustc_codegen_mlir → mlir_to_msl, outperform Apple’s hand-tuned MLX on decode. Decode is the dominant cost in a typical inference session (one prompt, hundreds of generated tokens), so this is the number that matters for end-user latency.
Apple M4 (4P+6E CPU, 10-core GPU, 120 GB/s): decode 33–35 tok/s vs MLX's 32 tok/s — the Metal codegen path beats MLX on this smaller part as well, but prefill (9.3 tok/s vs MLX's 72) is still gated on rewriting the prefill matmul to use simdgroup_matrix_multiply.
How that 91.7 was reached
Optimization rounds on M2 Max (each step measured against the previous):
| Step | tok/s | Δ |
|---|---|---|
| Baseline (templates as committed) | 90.3 | — |
| attention_decode_v4 (TG-mem Q cache + float4) | 91.3 | +1.0 |
| Token-buffer hoist out of inner loop | 91.7 | +0.4 |
| Final | 91.7 | +1.4 |
Two attempted optimisations were measured and rolled back because they regressed:
| Attempted | tok/s | Δ |
|---|---|---|
| matvec_f16_cached (manual A-cache) | 85.1 | −5.2 (revert) |
| Fused RMSNorm + next matvec | 78.7 | −13 (revert) |
The Apple GPU’s L1/L2 already caches reused activations, so manual threadgroup caching only helps when (a) the data doesn’t fit in cache and (b) the per-thread compute is large enough to amortize the barrier. For decode matvec with K = 1536 (6 KB), neither holds.
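The arithmetic behind the rollback is worth making explicit: per decode matvec, the activation row is tiny and reused, while the weight matrix is streamed exactly once and dwarfs any cache. A back-of-envelope sketch (f32 activations, f16 weights, shapes from §9.2; the printed figures are just this arithmetic, not a measurement):

```rust
fn main() {
    const K: usize = 1536; // model dim D

    // The activation row: 1536 × 4 bytes = 6 KB, trivially resident in the
    // GPU's own L1/L2 — manual threadgroup caching buys nothing.
    let activation_bytes = K * std::mem::size_of::<f32>();
    println!("activation row: {} KB", activation_bytes / 1024);

    // The f16 weight matrix for one D×D projection: 1536 × 1536 × 2 bytes
    // ≈ 4.5 MB, streamed once per token and never reusable across tokens.
    let weight_bytes = K * K * 2;
    println!("weight matrix:  {} KB", weight_bytes / 1024);
}
```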
9.6 NVIDIA Tesla T4 (Colab) — 53.7 tok/s
Hardware: NVIDIA Tesla T4 on Google Colab, 320 GB/s HBM2, CUDA 12.1, mlir_to_gpu codegen emitting CUDA C, compiled with nvcc -arch=sm_75 -O3.
Setup: emitted rms_norm_1536, matvec_f16 (with _bias and _add variants for the fused cases), and GQA attention_decode_gqa drive the decode loop; host-side Python glue for weight loading and tokenization.
| Implementation | Decode tok/s |
|---|---|
| ascend-rs (Rust → CUDA) | 53.7 |
| Theoretical peak at 320 GB/s | 123 |
53.7 tok/s is 44% of the T4’s theoretical bandwidth ceiling. The remaining gap is split between sub-optimal matvec tiling (the mlir_to_gpu path uses one-element-per-thread today, not warp-striped) and matmul_f32 still routing through cuBLAS as a placeholder. Both are tracked in Chapter 13 §12.3.1 as the short-term mlir_to_gpu + cudarc integration work.
Per-token kernel parity with the Ascend result: all 13 kernels compile; the emitted .cu source is 2,001 LOC generated from the same 13-kernel tile_kernels.rs.
9.7 Where the Time Goes — Per-Kernel Breakdown (M2 Max)
For one decoded token on M2 Max (28 layers × 8 dispatches + 5 model-level dispatches = 229 kernel launches):
| Kernel class | Per-token time (ms) | % of decode |
|---|---|---|
| Q/K/V/O matvecs | 4.3 | 39% |
| Gate + up + silu (MLP) | 3.1 | 28% |
| Down-projection | 2.1 | 19% |
| Attention (decode v4) | 0.8 | 7% |
| RMSNorm × 2/layer | 0.4 | 4% |
| RoPE Q + K | 0.2 | 2% |
| Argmax over vocab | 0.1 | 1% |
| Total | 11.0 | 100% |
The seven projection matvecs (Q, K, V, O, gate, up, down) — packed into suite items 3, 4, 8, 9, and 10 from §9.2 — account for 86% of decode time. Optimisation effort returns the most when spent on those kernels, which is why all the wins listed in §9.5 targeted the matvec / attention path. Norms and RoPE together cost less than 1 ms per token; fusing them away (as we tried) saves no measurable bandwidth and adds compute.
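Both headline figures in this section reduce to two lines of arithmetic, reproduced here as a sanity check:

```rust
fn main() {
    // 28 layers × 8 per-layer dispatches + 5 model-level kernels.
    let launches = 28 * 8 + 5;
    println!("kernel launches per token: {launches}"); // 229

    // matvec/MLP share of the 11.0 ms per-token budget
    // (Q/K/V/O matvecs + gate/up/silu + down-projection rows above).
    let share = (4.3 + 3.1 + 2.1) / 11.0_f64;
    println!("matvec/MLP share: {:.0}%", share * 100.0); // 86%
}
```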
9.8 Cross-Vendor Status
The same 13-kernel Rust source is the input to every mlir_to_<target> backend. Current measured end-to-end status (numbers from Table 2 of the companion paper):
| Backend | Target | LOC | Decode tok/s |
|---|---|---|---|
| mlir_to_cpp + mlir_to_pto | Ascend 910B2 (joint) | 11,383 + 4,955 | 168.9 |
| mlir_to_tpu | Google TPU v2-8 (Pallas) | 1,645 | 162.9 |
| mlir_to_msl | Apple M2 Max (Metal) | 1,730 | 91.7 |
| mlir_to_gpu | NVIDIA T4 (CUDA) | 2,001 | 53.7 |
| mlir_to_nki | AWS Trainium (trn1.2xlarge) | 1,872 | see note below |
| mlir_to_spirv | Vulkan (any GPU) | 1,571 | see note below |
NKI (AWS Trainium). All six emitted kernels compile and run (rms_norm_1536, matvec_f16 / _bias / _add, gate_up_silu, GQA attention). End-to-end tok/s is not reported because @nki.jit uses eager dispatch with no cross-call kernel caching — each of the 370+ kernel dispatches per decoded token incurs ≈ 10 s of setup overhead. A compiled torch-neuronx graph wrapper would fold this into a single graph dispatch; that’s future work, not a codegen gap.
Vulkan (SPIR-V). End-to-end decode requires an adapter exposing the shader-f16 feature. The only hardware we have access to that both supports SPIR-V and runs Colab notebooks is the T4, and Colab’s T4 only exposes Mesa llvmpipe (a CPU rasterizer) through Vulkan — which would time out the decode loop. Per-kernel softmax on Apple M2 Max via the Vulkan backend reaches 90× CPU speedup (see Appendix I).
For the remaining backends in the tree (mlir_to_musa, mlir_to_aie, mlir_to_bang, mlir_to_gaudi, mlir_to_csl, mlir_to_hexagon, mlir_to_linalg), the 13-kernel suite compiles cleanly; on-device decode measurement is blocked only on hardware time allocation for each rig.
9.9 Reproducing the Results
Apple M2 Max / M4:
git clone https://github.com/yijunyu/ascend-rs
cd ascend-rs
cargo run --release -p deepseek_metal -- \
--prompt "The capital of France is" \
--max-tokens 128
The first run downloads DeepSeek-R1-Distill-Qwen-1.5B from Hugging Face (≈ 3 GB) and caches it at ~/.cache/huggingface/. Subsequent runs print:
Loaded DeepSeek-R1-Distill-Qwen-1.5B on Metal
Prefill: 0.23s (26.1 tok/s)
[generated text]
Generated 128 tokens in 1.40s (91.43 tok/s)
MLX baseline for comparison:
pip install mlx mlx-lm
python -m mlx_lm.generate \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--prompt "The capital of France is" \
--max-tokens 128
Ascend 910B2 (requires CANN 8.5.0 and hardware access):
source /usr/local/Ascend/cann-8.5.0/set_env.sh
export ACLRS_SOC_VERSION=Ascend910B2
cargo run --release -p deepseek_e2e -- --max-tokens 128
TPU v2-8 (Colab) and NVIDIA T4 (Colab): notebooks are under benchmarks/deepseek_tile_kernels/notebooks/. Each notebook pulls the emitted mlir_to_<target> output from the repo and runs the decode loop against the same prompt set. All reproducible runs are logged as CSVs to the pu-rs.org open leaderboard (3,924 data points across all backends and targets as of 2026-04-23).
9.10 Why a Suite, Not a Single Kernel
Single-kernel benchmarks (softmax, GEMM, RMSNorm in isolation) are useful for diagnosing a specific bottleneck, but they systematically over-report the value of optimisations that don’t compose:
- Caching activations is a clear win on a standalone matvec benchmark and a clear loss inside a transformer layer where the cache is already warm from the previous matvec (§9.5).
- Fusing RMSNorm into the next matvec wins on a fused-kernel microbenchmark and loses inside a real layer where the same norm output is consumed by three matvecs (Q, K, V).
- A “fast attention” kernel that ignores the KV cache is irrelevant; in decode, the KV cache is the attention input.
A 13-kernel suite tied to a real model is the smallest benchmark that catches these mistakes. It also lets vendors compare backends honestly: every backend in §9.8 sees the same Rust source, the same shapes, and the same memory-traffic budget.
9.11 Key Takeaways
- One Rust source, four production accelerators measured end-to-end. 168.9 tok/s on Ascend 910B2, 162.9 on Google TPU v2-8, 91.7 on Apple M2 Max, 53.7 on NVIDIA T4 — all from the same 13-kernel tile_kernels.rs, compiled through different mlir_to_<target> backends. The backends range from 1,571 LOC (mlir_to_spirv) to 11,383 LOC (mlir_to_cpp), so targeting a new vendor is a bounded engineering exercise, not a research project.
- 45.6× over CPU reference on 910B2, 2.47× over the aclnn-only baseline. The Ascend path demonstrates that a safety-first Rust kernel toolchain does not give up performance: the headline is set by a compiler-generated kernel pipeline, not by hand-written AscendC.
- The Metal codegen path beats hand-tuned MLX on decode. 91.7 vs ≈ 88 tok/s on M2 Max and 33–35 vs 32 on M4. Apple’s engineers hand-tuned MLX against Apple’s own hardware; ascend-rs produces competitive output from Rust source written for a different vendor.
- TPU Pallas cross-validation at 0.98× native JAX, 16/16 greedy-token parity with HF. The cleanest evidence that the Rust → MLIR → Pallas path is producing sound kernels, not numerically approximate ones.
- Microbenchmarks lie about full-pipeline performance. Two optimisations measured in isolation as wins (caching, fusion) regressed the full decode path by 5–13 tok/s on M2 Max. Suite-level measurement is the only way to catch this.