AI Chips ⭐ Notable

Inference is getting faster and cheaper: GLM-5.2 on AMD MI355X

Saturday, July 4, 2026 • Source: Wafer AI

What happened

TensorWave engineers published a benchmark of GLM-5.2 inference on AMD MI355X GPUs, using MXFP4 quantization via AMD Quark. The results are competitive with NVIDIA's flagship B200.

Context and impact

NVIDIA has long benefited from the CUDA ecosystem, a moat that discourages hardware migration. This benchmark suggests the barrier is cracking: on this workload, AMD reaches 80% of NVIDIA B200 inference performance at less than half the cost, without requiring custom kernel development, historically the key deterrent.

Details

Aggregate throughput: 2,626 tok/s/node at 2.4 req/s
Single-stream: 213 tok/s (10k input / 1.5k output)
Comparison: 80% of NVIDIA B200 performance, at less than half the cost
Quantization: MXFP4 via AMD Quark
Framework: SGLang (with speculative decoding support fixed)
Takeaway: the CUDA moat is eroding; no custom kernels needed
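
The comparison figures above imply a performance-per-dollar ratio, which a short calculation makes explicit. This is a minimal sketch, not part of the original benchmark. The function name is hypothetical, and the cost ratio of 0.5 is an assumption: it is the boundary of "over 2× lower cost", so the result is a lower bound on the advantage.

```python
def relative_perf_per_dollar(relative_perf: float, relative_cost: float) -> float:
    """Performance-per-dollar of one system relative to a baseline.

    relative_perf: throughput as a fraction of the baseline (0.80 = 80%).
    relative_cost: cost as a fraction of the baseline (0.50 = half the cost).
    """
    if relative_cost <= 0:
        raise ValueError("relative_cost must be positive")
    return relative_perf / relative_cost


# Figures from the summary: MI355X at 80% of B200 performance.
# Assumption: cost ratio of exactly 0.5, the boundary of "over 2x lower cost".
mi355x_vs_b200 = relative_perf_per_dollar(relative_perf=0.80, relative_cost=0.50)
print(f"MI355X perf per dollar vs B200: at least {mi355x_vs_b200:.1f}x")
```

Under these assumptions the MI355X delivers at least 1.6 times the B200's inference throughput per dollar on this workload. Any cost ratio below 0.5 pushes the figure higher.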
