NVIDIA DGX Spark · Grace-Blackwell · v2 Deep Eval

Benchmarks by cluster size

10 models · 19 configs · 49 scenarios · 10 domains · verifiable grading — no LLM-judge. Every animation is model-generated.

TrueScore leaderboard

Ranked by think-OFF TrueScore. Δ = OFF minus ON (positive = overthinking hurts). v2 weights: Q40% Cal25% Rel15% Eff5% Resp15%.

#ModelSparksOFFONΔLatency
1
Qwopus 3.6-27B MTP
llama.cpp · MTP2 · Q4_K_M GGUF
94.9 73.7 +21.2 2.48s
2
Qwable-5-27B-Coder HF
llama.cpp · Q4_K_M GGUF
92.5 59.3 +33.2 2.44s
3
AEON Ultimate 27B
vLLM · NVFP4 · DFlash
87.4 59.4 +28.0 1.70s
4
Huihui Qwen3.6-35B-A3B HF
llama.cpp · Q4_K_M GGUF
87.0 76.6 +10.4 1.03s
5
Nemotron-3-Nano-Omni-30B A3B
llama.cpp · Q4_K_M GGUF
86.9 79.2 +7.7 0.93s
6
HauhauCS Qwen3.6-35B-A3B HF
llama.cpp · Q4_K_M GGUF
85.8 60.3 +25.5 1.01s
7
DeepSeek V4 Flash
vLLM · sparse-MLA · 200G RoCE
85.6 86.0 -0.4 3.34s
8
Bytkim Qwen3.6-27B-MTP-pi-tune HF
llama.cpp · Q4_K_M GGUF
85.1 67.3 +17.8 2.42s
9
StepFun 3.7 Flash
llama.cpp · Q3_K_L GGUF
70.5 70.4 +0.1 8.60s
10
Qwythos 9B
vLLM · Claude Mythos 5
68.0 8.05s

💡 Overthinking is real. Only DeepSeek V4 Flash improves with thinking ON (86.0 > 85.6). Worst collapse: Qwable −33.2 pts, AEON −28.0 pts, HauhauCS −25.5 pts. Zhou et al. 2026

Single DGX Spark
128 GB unified · llama.cpp MTP or vLLM
ModelModeTrueScoreQCalLatencyCtx
Qwopus 3.6-27B MTP
llama.cpp · MTP2 · Q4_K_M GGUF
OFF 94.9 921002.48s 256K
ON 73.7 787110.20s
Qwable-5-27B-Coder
llama.cpp · Q4_K_M GGUF · HF
OFF 92.5 92902.44s 256K
ON 59.3 516615.10s
AEON Ultimate 27B
vLLM · NVFP4 · DFlash
OFF 87.4 91601.70s 256K
ON 59.4 61539.20s
Huihui Qwen3.6-35B-A3B
llama.cpp · Q4_K_M GGUF · HF
OFF 87.0 89711.03s 256K
ON 76.6 74754.04s
Nemotron-3-Nano-Omni-30B A3B
llama.cpp · Q4_K_M GGUF
OFF 86.9 92630.93s 256K
ON 79.2 85743.00s
HauhauCS Qwen3.6-35B-A3B
llama.cpp · Q4_K_M GGUF · HF
OFF 85.8 85711.01s 256K
ON 60.3 58465.93s
Bytkim Qwen3.6-27B-MTP-pi-tune
llama.cpp · Q4_K_M GGUF · HF
OFF 85.1 91632.42s 256K
ON 67.3 627611.34s
StepFun 3.7 Flash
llama.cpp · Q3_K_L GGUF
OFF 70.5 73768.60s 128K
ON 70.4 67769.63s
Qwythos 9B
vLLM · Claude Mythos 5
OFF 68.0 69418.05s 1M
Dual DGX Spark
256 GB unified · 200G RoCE · vLLM TP=2
ModelModeTrueScoreQCalLatencyCtx
DeepSeek V4 Flash
vLLM · sparse-MLA · 200G RoCE
OFF 85.6 93953.34s 1M
ON 86.0 94953.30s

Model-generated animations

Each model was asked to generate self-contained HTML canvas animations — solar system, spiral galaxy, DNA helix. These are raw model outputs, unedited. Click to open full-screen.

Qwopus 3.6-27B MTP 94.9

Solar Systemthink off open ↗
Spiral Galaxythink off open ↗
Solar Systemthink on open ↗
Spiral Galaxythink on open ↗

Qwable-5-27B-Coder 92.5

Solar Systemthink off open ↗
Spiral Galaxythink off open ↗
DNA Helixthink off open ↗

AEON Ultimate 27B 87.4

Solar Systemthink off open ↗
Spiral Galaxythink off open ↗

Bytkim Qwen3.6-27B-MTP-pi-tune 85.1

Solar Systemthink off open ↗
Spiral Galaxythink off open ↗
DNA Helixthink off open ↗

Nemotron-3-Nano-Omni-30B A3B 86.9

Solar Systemthink off open ↗
Spiral Galaxythink off open ↗

DeepSeek V4 Flash 86.0

Solar Systemthink on open ↗

Qwythos 9B 68.0

Solar Systemthink off open ↗
Spiral Galaxythink off open ↗

StepFun 3.7 Flash 70.5

Solar Systemnative open ↗
Spiral Galaxynative open ↗

The methodology

Most LLM leaderboards saturate — every decent model clusters at 88–97 and the ranking stops meaning anything. This eval is built to discriminate and to be verifiable:

  • 10 domains — tool-use, instruction-following, structured output, coding, reasoning, long-context, faithfulness, visual generation (capability) + safety, robustness (calibration).
  • 49 scenarios — 42 base + 7 hard-tier agent-grade (multi-step tool chains, prompt injection resistance, nested JSON, large toolset selection).
  • Deterministic grading — unit tests, JSON-schema validation, exact tool-argument matching, needle retrieval. No LLM-judge.
  • Reliability — every scenario runs K=2 times; consistency is scored.
  • Calibration — penalizes both over- and under-refusal. Uncensored models score lower by design.
  • Efficiency — useful-token ratio, penalizing overthinking.

TrueScore formula

TrueScore = 0.40·Quality + 0.25·Calibration + 0.15·Reliability + 0.05·Efficiency + 0.15·Responsiveness

Empty components are excluded and renormalized. v2 weights (Jun 2026) raised Calibration from 20→25% and lowered Efficiency from 10→5% based on community feedback — badly calibrated models that fabricate tool args or miss refusals should cost more, and large models on multi-Spark TP shouldn't be unfairly penalized for latency when quality is high.

The hardware

  • 4× NVIDIA DGX Spark (GB10 Grace-Blackwell, 128 GB unified each), 10GbE backplane + 200G direct-connect RoCE.
  • 1× Spark: Qwopus 3.6-27B (94.9), Qwable-5-27B-Coder (92.5), AEON Ultimate 27B (87.4), Huihui/HauhauCS Qwen3.6-35B-A3B (uncensored), Nemotron-3-Nano-Omni-30B A3B, Bytkim 27B, StepFun 3.7 Flash, Qwythos 9B.
  • 2× Sparks: DeepSeek V4 Flash (sparse-MLA, 1M ctx) — dual-node vLLM TP=2 over 200G RoCE.
  • 4× Sparks: The 4th node is racked and waiting — ready for the full 512 GB quad.