TrueScore leaderboard
Ranked by think-OFF TrueScore. Δ = OFF minus ON (positive = overthinking hurts). v2 weights: Q40% Cal25% Rel15% Eff5% Resp15%.
| # | Model | Sparks | OFF | ON | Δ | Latency |
|---|---|---|---|---|---|---|
| 1 | Qwopus 3.6-27B MTP llama.cpp · MTP2 · Q4_K_M GGUF |
1× | 94.9 | 73.7 | +21.2 | 2.48s |
| 2 | Qwable-5-27B-Coder HF llama.cpp · Q4_K_M GGUF |
1× | 92.5 | 59.3 | +33.2 | 2.44s |
| 3 | AEON Ultimate 27B vLLM · NVFP4 · DFlash |
1× | 87.4 | 59.4 | +28.0 | 1.70s |
| 4 | Huihui Qwen3.6-35B-A3B HF llama.cpp · Q4_K_M GGUF |
1× | 87.0 | 76.6 | +10.4 | 1.03s |
| 5 | Nemotron-3-Nano-Omni-30B A3B llama.cpp · Q4_K_M GGUF |
1× | 86.9 | 79.2 | +7.7 | 0.93s |
| 6 | HauhauCS Qwen3.6-35B-A3B HF llama.cpp · Q4_K_M GGUF |
1× | 85.8 | 60.3 | +25.5 | 1.01s |
| 7 | DeepSeek V4 Flash vLLM · sparse-MLA · 200G RoCE |
2× | 85.6 | 86.0 | -0.4 | 3.34s |
| 8 | Bytkim Qwen3.6-27B-MTP-pi-tune HF llama.cpp · Q4_K_M GGUF |
1× | 85.1 | 67.3 | +17.8 | 2.42s |
| 9 | StepFun 3.7 Flash llama.cpp · Q3_K_L GGUF |
1× | 70.5 | 70.4 | +0.1 | 8.60s |
| 10 | Qwythos 9B vLLM · Claude Mythos 5 |
1× | 68.0 | — | — | 8.05s |
💡 Overthinking is real. Only DeepSeek V4 Flash improves with thinking ON (86.0 > 85.6). Worst collapse: Qwable −33.2 pts, AEON −28.0 pts, HauhauCS −25.5 pts. Zhou et al. 2026
| Model | Mode | TrueScore | Q | Cal | Latency | Ctx |
|---|---|---|---|---|---|---|
Qwopus 3.6-27B MTP llama.cpp · MTP2 · Q4_K_M GGUF |
OFF | 94.9 | 92 | 100 | 2.48s | 256K |
| ON | 73.7 | 78 | 71 | 10.20s | ||
Qwable-5-27B-Coder llama.cpp · Q4_K_M GGUF · HF |
OFF | 92.5 | 92 | 90 | 2.44s | 256K |
| ON | 59.3 | 51 | 66 | 15.10s | ||
AEON Ultimate 27B vLLM · NVFP4 · DFlash |
OFF | 87.4 | 91 | 60 | 1.70s | 256K |
| ON | 59.4 | 61 | 53 | 9.20s | ||
Huihui Qwen3.6-35B-A3B llama.cpp · Q4_K_M GGUF · HF |
OFF | 87.0 | 89 | 71 | 1.03s | 256K |
| ON | 76.6 | 74 | 75 | 4.04s | ||
Nemotron-3-Nano-Omni-30B A3B llama.cpp · Q4_K_M GGUF |
OFF | 86.9 | 92 | 63 | 0.93s | 256K |
| ON | 79.2 | 85 | 74 | 3.00s | ||
HauhauCS Qwen3.6-35B-A3B llama.cpp · Q4_K_M GGUF · HF |
OFF | 85.8 | 85 | 71 | 1.01s | 256K |
| ON | 60.3 | 58 | 46 | 5.93s | ||
Bytkim Qwen3.6-27B-MTP-pi-tune llama.cpp · Q4_K_M GGUF · HF |
OFF | 85.1 | 91 | 63 | 2.42s | 256K |
| ON | 67.3 | 62 | 76 | 11.34s | ||
StepFun 3.7 Flash llama.cpp · Q3_K_L GGUF |
OFF | 70.5 | 73 | 76 | 8.60s | 128K |
| ON | 70.4 | 67 | 76 | 9.63s | ||
Qwythos 9B vLLM · Claude Mythos 5 |
OFF | 68.0 | 69 | 41 | 8.05s | 1M |
| Model | Mode | TrueScore | Q | Cal | Latency | Ctx |
|---|---|---|---|---|---|---|
DeepSeek V4 Flash vLLM · sparse-MLA · 200G RoCE |
OFF | 85.6 | 93 | 95 | 3.34s | 1M |
| ON | 86.0 | 94 | 95 | 3.30s |
Model-generated animations
Each model was asked to generate self-contained HTML canvas animations — solar system, spiral galaxy, DNA helix. These are raw model outputs, unedited. Click to open full-screen.
Qwopus 3.6-27B MTP 94.9
Qwable-5-27B-Coder 92.5
Bytkim Qwen3.6-27B-MTP-pi-tune 85.1
DeepSeek V4 Flash 86.0
The methodology
Most LLM leaderboards saturate — every decent model clusters at 88–97 and the ranking stops meaning anything. This eval is built to discriminate and to be verifiable:
- 10 domains — tool-use, instruction-following, structured output, coding, reasoning, long-context, faithfulness, visual generation (capability) + safety, robustness (calibration).
- 49 scenarios — 42 base + 7 hard-tier agent-grade (multi-step tool chains, prompt injection resistance, nested JSON, large toolset selection).
- Deterministic grading — unit tests, JSON-schema validation, exact tool-argument matching, needle retrieval. No LLM-judge.
- Reliability — every scenario runs K=2 times; consistency is scored.
- Calibration — penalizes both over- and under-refusal. Uncensored models score lower by design.
- Efficiency — useful-token ratio, penalizing overthinking.
TrueScore formula
TrueScore = 0.40·Quality + 0.25·Calibration + 0.15·Reliability + 0.05·Efficiency + 0.15·Responsiveness
Empty components are excluded and renormalized. v2 weights (Jun 2026) raised Calibration from 20→25% and lowered Efficiency from 10→5% based on community feedback — badly calibrated models that fabricate tool args or miss refusals should cost more, and large models on multi-Spark TP shouldn't be unfairly penalized for latency when quality is high.
The hardware
- 4× NVIDIA DGX Spark (GB10 Grace-Blackwell, 128 GB unified each), 10GbE backplane + 200G direct-connect RoCE.
- 1× Spark: Qwopus 3.6-27B (94.9), Qwable-5-27B-Coder (92.5), AEON Ultimate 27B (87.4), Huihui/HauhauCS Qwen3.6-35B-A3B (uncensored), Nemotron-3-Nano-Omni-30B A3B, Bytkim 27B, StepFun 3.7 Flash, Qwythos 9B.
- 2× Sparks: DeepSeek V4 Flash (sparse-MLA, 1M ctx) — dual-node vLLM TP=2 over 200G RoCE.
- 4× Sparks: The 4th node is racked and waiting — ready for the full 512 GB quad.