This article examines the differences between GLM-5.1 API thinking modes, validating how default calls, disabled, budget, and adaptive affect result quality and latency, with Anthropic’s thinking configuration evolution as a reference point. The core issue is straightforward: when thinking is not enabled by default, the model is more likely to make basic reasoning mistakes and to distort summary results.
Keywords: GLM-5.1, thinking, API testing
Technical Specification Snapshot
| Parameter | Details |
|---|---|
| Subject | GLM-5.1 API thinking mode testing |
| Language | Chinese |
| Protocol Focus | Anthropic-compatible thinking configuration |
| Test Dimensions | Default call, disabled, budget, adaptive, effort |
| GitHub Stars | Not provided in the source |
| Core Dependencies | API invocation framework, test question set, result aggregation logic |
This test shows that GLM-5.1 capability variance is strongly tied to the thinking switch
The core finding from the original record is direct: under the default API call, GLM-5.1 makes basic numerical judgment errors. It can even fail on low-level tasks such as counting and magnitude comparison. This is not an isolated anomaly. It is a systematic performance shift caused by configuration differences.
AI Visual Insight: This screenshot shows a very basic counting or enumeration test case. The model still returns an incorrect answer even though the task does not require complex knowledge retrieval, which suggests the failure is more likely caused by an unexpanded reasoning chain than by missing knowledge.
AI Visual Insight: This image corresponds to a magnitude comparison question, a task with an extremely short reasoning path. The model still fails, which strongly suggests that the default configuration may compress the reasoning budget or the thinking workflow.
More importantly, the same model can answer correctly in a Claude Code environment. That implies the issue may not be in the model weights themselves. It is more likely caused by API parameters, the protocol compatibility layer, or the default reasoning strategy.
A minimal test setup can reproduce the conclusion directly
import time

from openai import OpenAI

client = OpenAI(base_url="https://api.example.com", api_key="YOUR_KEY")

modes = [
    {"name": "default", "extra": {}},
    {"name": "disabled", "extra": {"thinking": {"type": "disabled"}}},
    {"name": "budget", "extra": {"thinking": {"type": "enabled", "budget_tokens": 4096}}},
    {"name": "adaptive", "extra": {"thinking": {"type": "adaptive"}, "output_config": {"effort": "high"}}},
]

for mode in modes:
    start = time.time()
    resp = client.chat.completions.create(
        model="glm-5.1",
        # Prompt: "Compare 9.11 and 9.8 and explain the reasoning"
        messages=[{"role": "user", "content": "比较 9.11 和 9.8 的大小,并说明理由"}],
        extra_body=mode["extra"],  # Inject the thinking configuration; non-standard fields must go through extra_body
    )
    cost = time.time() - start
    print(mode["name"], cost, resp.choices[0].message.content)  # Output latency and answer
This code validates answer quality and response time differences across the default configuration and multiple thinking modes.
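Before comparing modes, it also helps to keep a per-question verdict record that the later consistency check can consume. The sketch below is a minimal illustration rather than part of the original test harness; the question id and the substring-based correctness check are assumptions made for brevity.
def record_verdict(raw_answers, qid, answer_text, expected_substring):
    # Mark the answer correct if the expected phrasing appears in the reply.
    raw_answers[qid] = {
        "answer": answer_text,
        "is_correct": expected_substring in answer_text,
    }

raw_answers = {}
record_verdict(raw_answers, "q1-compare-9.11-9.8", "9.8 is larger than 9.11", "9.8 is larger")
The resulting dictionary has the same shape as the raw_answers argument used by the consistency check later in this article.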
The test process also exposed secondary distortion in the result aggregation layer
The author used a second-stage test in which the model analyzed its own mistakes. That is a highly valuable design choice. In real-world systems, a model does more than answer questions. It also generates test reports, scoring conclusions, and summary digests.
AI Visual Insight: This image shows the model trying to actively retrieve documentation or external information to confirm configuration compatibility, but the search path drifts away from the goal, which indicates unstable direction selection during task decomposition.
AI Visual Insight: This screenshot shows the model’s final configuration summary. On the surface, it looks complete and well-structured, but later verification showed that it did not match the actual answer records. This highlights the risk that a summary may look correct while failing factual consistency.
AI Visual Insight: This image reveals a conflict between test records and the final summary report: the original answer was wrong, but the summary marked it as correct. That means the problem is not limited to model responses. It also affects scoring mappings, evidence references, and result archival pipelines.
The key signal here is that GLM-5.1 may not only answer incorrectly in the default state, but may also rewrite facts during summarization. For evaluation platforms, agent workflows, and automated acceptance pipelines, this is more dangerous than a single wrong answer.
It is safer to validate raw records and summary reports separately
def check_summary(raw_answers, summary):
    """Return every question where the summarized verdict contradicts the raw verdict."""
    mismatches = []
    for qid, raw in raw_answers.items():
        final = summary.get(qid)
        if final != raw["is_correct"]:  # Compare the raw verdict with the summarized verdict
            mismatches.append({"qid": qid, "raw": raw["is_correct"], "summary": final})
    return mismatches
This logic detects consistency defects where the raw result is wrong but the summary claims it is correct.
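A minimal usage sketch, with hypothetical question ids and verdicts, shows how the mismatch report surfaces a summary that contradicts the raw record:
raw = {
    "q1": {"is_correct": False},  # the raw answer was wrong
    "q2": {"is_correct": True},
}
summary = {"q1": True, "q2": True}  # the summary claims q1 was answered correctly

print(check_summary(raw, summary))
# [{'qid': 'q1', 'raw': False, 'summary': True}]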
Anthropic’s thinking configuration has already formed a clear generational split
Another major value of the source article is its breakdown of Anthropic’s old and new thinking syntax. The core trend is clear: newer models are moving toward adaptive, while older models still depend on budget_tokens. In other words, thinking has shifted from manual quotas to system scheduling plus intensity control.
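As an illustration of that shift, the two request bodies below contrast the older explicit-budget syntax with the newer adaptive-plus-effort style. The model names and token counts are placeholders chosen for the sketch, not vendor recommendations.
# Older style: the caller hands out a fixed reasoning quota.
legacy_request = {
    "model": "claude-3-7-sonnet",  # placeholder model name
    "max_tokens": 16000,
    "thinking": {"type": "enabled", "budget_tokens": 4096},
}

# Newer style: the system schedules thinking; the caller only sets intensity.
adaptive_request = {
    "model": "claude-opus-4-7",
    "max_tokens": 64000,
    "thinking": {"type": "adaptive"},
    "output_config": {"effort": "xhigh"},
}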
New-generation models emphasize adaptive and effort working together
For Opus 4.7+, the recommended mode is thinking.type: adaptive, combined with output_config.effort to control intensity. xhigh is explicitly recommended for coding and agent scenarios, which shows that vendors now treat reasoning intensity as a quality lever rather than a secondary parameter.
{
"model": "claude-opus-4-7",
"max_tokens": 64000,
"thinking": { "type": "adaptive" },
"output_config": { "effort": "xhigh" },
"messages": [
{ "role": "user", "content": "分析测试报告中的错误来源" }
]
}
This configuration shows the mainstream way that newer Anthropic models manage reasoning intensity through adaptive + effort.
Documentation gaps remain for domestic model compatibility, but the benchmark results are already clear
The original test shows that GLM-5.1 can return results in all three modes: no thinking, budget, and adaptive. The real quality gap does not come from whether the API call succeeds. It comes from whether thinking is actually enabled by default.
AI Visual Insight: This image shows side-by-side test results across multiple thinking modes. Answers are clearly more stable under budget and adaptive, while errors cluster more heavily in the default no-thinking mode, creating a clear performance hierarchy.
AI Visual Insight: This screenshot shows latency differences under different thinking intensities or budgets, but the gap does not map cleanly to quality gains. That suggests server-side load and routing strategy may also affect surface-level performance metrics.
AI Visual Insight: This image shows that disabled can explicitly turn thinking off, which means GLM-5.1 at least acknowledges the existence of a thinking switch at the API layer and allows callers to control the reasoning strategy.
AI Visual Insight: This screenshot further shows that after thinking is disabled, the model remains relatively weak on simple tasks such as character counting, exposing a capability gap between the default fast path and the full reasoning path.
AI Visual Insight: This image reflects how other domestic models behave differently when thinking is disabled. In particular, some models may not support fully disabling the reasoning flow, which shows that vendors do not define thinking in a consistent way.
The conclusion can be reduced to three points. First, GLM-5.1 most likely does not think by default. Second, performance recovers significantly once thinking is explicitly enabled. Third, latency may increase by several times, which suggests that the quality gain comes from higher reasoning cost.
A safer production configuration example
{
"model": "glm-5.1",
"thinking": { "type": "adaptive" },
"output_config": { "effort": "high" },
"messages": [
{ "role": "user", "content": "回答测试集并输出逐题判定依据" }
]
}
This configuration is better suited for production tasks that are sensitive to accuracy and explainability, helping avoid the default no-thinking path.
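If you call GLM-5.1 through an OpenAI-compatible SDK as in the earlier test script, the same configuration can be passed via extra_body. This is a sketch under that assumption; vendor-specific parameter names may differ.
resp = client.chat.completions.create(
    model="glm-5.1",
    # Prompt: "Answer the test set and output the per-question verdict rationale"
    messages=[{"role": "user", "content": "回答测试集并输出逐题判定依据"}],
    # Non-standard fields go through extra_body on OpenAI-compatible SDKs.
    extra_body={
        "thinking": {"type": "adaptive"},
        "output_config": {"effort": "high"},
    },
)
print(resp.choices[0].message.content)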
The final judgment is that disabling thinking by default creates a misleading impression of capability
The most valuable point of this test is not that it proves a model is strong or weak. It reveals a broader issue: model capability often depends on reasoning configuration rather than on the model name alone. If you ignore the thinking switch during evaluation, your conclusions will likely be distorted.
For developers, three questions matter most: whether thinking is enabled by default, whether the summary layer faithfully cites raw results, and how much latency and token cost the quality gain requires. Only when you include all three in baseline tests are you evaluating an LLM API correctly.
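As a closing sketch, the three questions can be folded into a single baseline report. The structure below reuses check_summary from earlier; the mode keys, latency ratio threshold, and helper name are assumptions, not values from the original test.
def baseline_report(default_latency, thinking_latency, raw_answers, summary,
                    max_latency_ratio=5.0):
    # 1. Does the default call underperform the explicit-thinking call?
    default_acc = sum(a["is_correct"] for a in raw_answers["default"].values())
    thinking_acc = sum(a["is_correct"] for a in raw_answers["thinking"].values())
    # 2. Does the summary layer faithfully cite the raw records?
    mismatches = check_summary(raw_answers["thinking"], summary)
    # 3. How much latency does the quality gain cost?
    latency_ratio = thinking_latency / default_latency
    return {
        "default_vs_thinking_accuracy": (default_acc, thinking_acc),
        "summary_mismatches": mismatches,
        "latency_ratio": latency_ratio,
        "latency_within_budget": latency_ratio <= max_latency_ratio,
    }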
FAQ
Q1: Why does the default GLM-5.1 API call seem to make the model “less intelligent”?
A1: Based on the test results, the main reason is that thinking is probably not enabled in the default state. Without an expanded reasoning chain, the model can fail even on basic tasks such as counting, comparison, and character statistics.
Q2: Is GLM-5.1 compatible with Anthropic-style thinking configuration?
A2: The benchmark suggests reasonably strong compatibility. disabled, enabled + budget_tokens, and adaptive all return results. However, the documentation is not clear enough, so developers should still run their own regression tests.
Q3: What is the biggest concern after enabling thinking mode?
A3: First, response time can increase significantly, potentially by several multiples. Second, even when the answer itself is correct, the summary layer may still mismatch the facts, so you must preserve raw records and validate consistency.
Core Summary: Based on the original test records, this article reconstructs and analyzes the performance differences of GLM-5.1 across default API calls and disabled, budget, and adaptive thinking modes. It also compares the evolution of Anthropic thinking configuration with domestic model compatibility, highlighting that not enabling thinking by default can significantly reduce answer stability.