This article reconstructs a unified benchmark across 9 local large language models, focusing on four dimensions: logical reasoning, code generation, response latency, and execution stability. It helps developers quickly choose the right local model for deployment. The results show that Gemma-4-31B-IT-Uncensored is the strongest overall, SuperGemma4-26B-Uncensored is the fastest, and Qwen3.6-27B delivers exceptional reasoning but also the highest latency. Keywords: local LLM benchmark, code generation, reasoning performance.
Technical specifications are summarized in a quick snapshot
| Parameter | Details |
|---|---|
| Test scope | Horizontal benchmark of 9 local LLMs |
| Model quantization | Q4_K_M |
| Hardware environment | RTX 4090 + 64GB DDR5 + i9-13900K |
| Test protocol | Unified parameters, single-sample runs, no LLM-as-a-judge |
| Benchmark suite | GSM8K, BBH, HumanEval+, MBPP+ |
| Scoring dimensions | Reasoning, code, latency, execution failure rate |
| Core dependencies | Inference framework not disclosed in the original article; scoring relies on GSM8K / BBH / HumanEval+ / MBPP+ |
| Data source | Measured results published by blogger fengzeng |
This benchmark uses unified hardware and a fixed scoring methodology
This test covers 9 popular local models. The goal is not to provide subjective impressions, but to offer a reproducible quantitative comparison. The author used the Q4_K_M quantized version for every model to minimize distortion from quantization differences.
The hardware setup includes an RTX 4090, 64GB of DDR5 memory, and an Intel Core i9-13900K. That makes the benchmark more representative of a high-end personal workstation than a multi-GPU server environment, which gives it strong practical value for local deployment users.
Image notes: the original post includes screenshots of the test machine (GPU, memory, and system panels). Their role is to confirm that every model ran on the same local hardware baseline rather than in a hosted cloud environment, which is what makes the latency and failure-rate comparisons meaningful.
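The original article does not disclose which local inference framework was used, so the following is only a sketch of how a Q4_K_M model is typically loaded on a single-GPU workstation, assuming a llama-cpp-python runtime and a hypothetical GGUF filename:
# Assumption: llama-cpp-python with a hypothetical Q4_K_M GGUF file; the
# original post does not name the actual runtime.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-31b-it-uncensored.Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=-1,  # offload all layers to the single RTX 4090
    n_ctx=4096,
)
out = llm("Solve: 17 * 23 = ?", temperature=0.0, top_p=1.0, max_tokens=64)
print(out["choices"][0]["text"])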
Unified parameters ensure a fair horizontal comparison
The benchmark fixes generation parameters at temperature=0.0 and top_p=1.0, with a single sample per prompt. Reasoning tasks are scored by exact match, while code tasks are scored by execution results and test pass rate. This setup is better suited to comparing stable output quality than best-case performance.
# Example of the unified scoring logic (per-benchmark scores are on a 0-1 scale)
def aggregate_score(gsm8k_score, bbh_score, humaneval_score, mbpp_score):
    logic_score = (gsm8k_score + bbh_score) / 2      # reasoning score: average of the two reasoning benchmarks
    code_score = (humaneval_score + mbpp_score) / 2  # code score: average of the two coding benchmarks
    return (logic_score + code_score) / 2            # final score: average of reasoning and code
This code captures the core scoring method used in the original benchmark.
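As a rough sketch of what the two grading rules look like in practice (an assumption about the harness, not the author's actual code):
# Illustrative grading helpers, not the original evaluation harness.
def exact_match(prediction: str, reference: str) -> bool:
    # GSM8K / BBH answers are scored by exact match after whitespace stripping.
    return prediction.strip() == reference.strip()

def passes_tests(candidate_code: str, test_code: str) -> bool:
    # HumanEval+ / MBPP+ completions are scored by executing the tests;
    # any exception counts as an execution failure.
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run the accompanying test cases
        return True
    except Exception:
        return False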
The overall results show that the best model is not always the best fit for every scenario
By total score, Gemma-4-31B-IT-Uncensored leads decisively at 0.9750. It is the only model with almost no obvious weakness across reasoning, coding, and stability. Its average latency is 17.64 seconds, which also keeps it out of the “high score but painfully slow” category.
Qwen3.6-27B ties for first place in reasoning, scoring 0.95 on both GSM8K and BBH, but its average latency reaches 149.94 seconds, making it the slowest model in the benchmark. This highlights a practical reality: excellent reasoning quality does not necessarily translate into efficient interaction.
The main differences among top-tier models show up in speed and stability
In the next tier, Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 posts especially strong code results and GSM8K performance, making it a good fit for code-heavy and math-oriented workloads. SuperGemma4-26B-Uncensored scores slightly lower overall, but its 4.90-second average latency makes it a highly cost-effective choice for interactive use cases.
models = {
"Gemma-4-31B-IT-Uncensored": {"total": 0.9750, "latency": 17.64},
"Qwen3.6-27B": {"total": 0.9000, "latency": 149.94},
"SuperGemma4-26B-Uncensored": {"total": 0.9125, "latency": 4.90},
}
best_overall = max(models, key=lambda x: models[x]["total"]) # Select the model with the highest total score
fastest = min(models, key=lambda x: models[x]["latency"]) # Select the model with the lowest latency
This example shows that the best overall model and the fastest model are often not the same system.
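One illustrative way to make that trade-off explicit, though it is not a metric reported in the original benchmark, is to normalize the total score by latency, reusing the models dictionary above:
# Illustrative only: score-per-second is not reported in the original post,
# but it makes the quality/speed trade-off concrete.
value_per_second = {name: m["total"] / m["latency"] for name, m in models.items()}
for name, v in sorted(value_per_second.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {v:.3f}")  # SuperGemma4-26B-Uncensored comes out on top by this measure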
Reasoning ability and coding ability diverge in meaningful ways
On the reasoning dimension, Gemma-4-31B-IT-Uncensored and Qwen3.6-27B tie for first place, showing stronger consistency on complex reasoning tasks. Although Qwen3.5-27B achieves a perfect GSM8K score, its BBH score is only 0.70, which suggests that strength in math problems does not automatically imply strength in broader complex reasoning.
On the coding dimension, Gemma-4-31B-IT-Uncensored, Qwen3.5-27B, and Qwen3-Coder-Next all achieve perfect code scores. Qwen3-Coder-Next is especially suitable as a specialized programming model, but with a BBH score of only 0.30, it is not a strong choice as a general-purpose primary model.
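The divergence is easy to see when the per-benchmark figures quoted above are placed side by side; the snippet below uses only the numbers mentioned in this section, not the full score table from the original post:
# Only the per-benchmark figures quoted above are included here.
quoted_scores = {
    "Qwen3.5-27B": {"bbh": 0.70, "code": 1.00},
    "Qwen3-Coder-Next": {"bbh": 0.30, "code": 1.00},
}
for name, s in quoted_scores.items():
    gap = s["code"] - s["bbh"]
    print(f"{name}: code minus BBH gap = {gap:.2f}")  # a large gap signals a specialist, not a generalist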
Failure rate is a better usability signal than a single high score
One critical finding is that although SuperGemma4-26B-Abliterated-Multimodal is fast, it passes only 1 HumanEval+ task, with an execution failure rate of 0.90 and an overall failure rate of 0.50. This is not occasional noise. It points to a systematic defect in code generation.
For that reason, developers should not evaluate models only by how many questions they answer correctly. They must also consider whether the outputs run reliably. In code agents, automated repair, and batch generation workflows, failure rate often matters more than average score.
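A minimal sketch of the distinction, reconstructing per-task outcomes from the SuperGemma4-26B-Abliterated-Multimodal figures quoted above (the exact task count is assumed for illustration):
# Reconstructed per-task outcomes: 1 passing task and a 0.90 execution
# failure rate imply roughly 9 non-executing outputs out of 10.
outcomes = ["error"] * 9 + ["pass"]

execution_failure_rate = outcomes.count("error") / len(outcomes)  # 0.90
pass_rate = outcomes.count("pass") / len(outcomes)                # 0.10
print(execution_failure_rate, pass_rate)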
Practical model selection should be driven by deployment scenarios
If you plan to maintain only one local model over the long term, Gemma-4-31B-IT-Uncensored is the safest choice. It balances reasoning, coding, speed, and stability, making it a strong all-purpose primary model.
If low-latency interaction matters most—for example, IDE assistance, conversational Q&A, or rapid script generation—SuperGemma4-26B-Uncensored provides stronger practical value. It is not the top-scoring model overall, but its speed advantage is enough to cover many everyday scenarios.
A simple scenario-based decision template can be reused directly
def choose_model(priority: str) -> str:
    if priority == "overall capability":
        return "Gemma-4-31B-IT-Uncensored"  # Well balanced across all dimensions and suitable as a primary model
    if priority == "response speed":
        return "SuperGemma4-26B-Uncensored"  # Lowest latency and ideal for frequent interaction
    if priority == "logical reasoning":
        return "Qwen3.6-27B"  # Very strong reasoning, but you must accept high latency
    return "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"  # Strong in code and mathematics
This code compresses the article’s conclusions into an actionable model selection rule.
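A quick usage check of the template:
print(choose_model("response speed"))  # -> SuperGemma4-26B-Uncensored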
The final takeaway can be reduced to three key statements
First, Gemma-4-31B-IT-Uncensored is the most compelling all-around model in this benchmark. Second, SuperGemma4-26B-Uncensored is the best option for latency-sensitive scenarios. Third, Qwen3.6-27B is limited not by quality, but by excessive latency.
For enterprise teams and individual developers alike, the real value of this benchmark is not that it names a single champion. It reveals the real trade-offs among reasoning quality, coding ability, speed, and stability.
FAQ
Q1: If I can deploy only one local model, which one should I choose first?
Choose Gemma-4-31B-IT-Uncensored first. It posts a total score of 0.9750, a perfect code score, a reasoning score of 0.95, and an execution failure rate of 0, giving it the lowest overall risk.
Q2: Why is Qwen3.6-27B strong at reasoning but still not ideal for daily use?
Because its average latency reaches 149.94 seconds, far higher than the other models. In scenarios that require frequent interaction, the waiting cost grows quickly.
Q3: Which model is the least recommended for coding tasks?
SuperGemma4-26B-Abliterated-Multimodal is the least recommended. Its HumanEval+ score is only 0.10, and its execution failure rate reaches 90%, indicating a clear systematic weakness.
Core Summary: This benchmark evaluates 9 Q4_K_M local models on an RTX 4090, 64GB DDR5, and i9-13900K setup. It covers GSM8K, BBH, HumanEval+, and MBPP+, and ranks the models across four dimensions: logical reasoning, code generation, response latency, and execution stability, followed by practical deployment recommendations.