grube.ai

Thinking in LLMs

If you've used a modern LLM API in the last year, you've seen this parameter: reasoning_effort: "low" | "medium" | "high". Or thinking: { budget_tokens: 8000 }. Or thinkingLevel: "high". The names differ across vendors. The semantics differ too.

Three things get mashed together here: chain-of-thought prompting, reasoning models, and the thinking-effort parameter. They are related but not the same thing, and using the wrong one in the wrong place is how you waste tokens or hurt accuracy.

Chain-of-thought is a prompt, not a model

Chain-of-thought (CoT) prompting came from a 2022 paper at Google Brain. A normal few-shot prompt shows the model some examples of input followed by answer. CoT inserts reasoning steps in between, so each example reads "input, reasoning, answer." On the real question, the model copies the pattern and writes out its own reasoning before the final answer. It all happens in a single LLM call: the reasoning and the answer come out of the same generation.

Same examples, one extra line

01 Standard

Prompt

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: 11
Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. How many apples do they have now?
A:

Model output (terse / wrong)

A: 27

02 Chain-of-thought

Prompt

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Reasoning: 5 balls + 2 cans * 3 balls = 5 + 6 = 11.
A: 11
Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. How many apples do they have now?
Reasoning:

Model output (reasons / right)

Reasoning: 23 - 20 = 3 left. 3 + 6 = 9.
A: 9

The only difference: one extra line per example. The model copies the pattern.

Why does this help? Writing out steps gives the model more room to work. Every reasoning token is another chance to catch a mistake before committing to an answer. The model also picks up the format from your examples and reuses it on new problems.

You can also skip the examples entirely. Just appending "let's think step by step" to a prompt is often enough to get the model to write out its reasoning before answering.
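Mechanically, zero-shot CoT is nothing more than string concatenation. A minimal sketch (the helper name is mine, not from any SDK):

```typescript
// Zero-shot CoT: append the trigger phrase so the model writes out its
// reasoning before committing to a final answer. Useful on
// non-reasoning models; no few-shot examples required.
function withStepByStep(question: string): string {
  return `${question}\n\nLet's think step by step.`;
}

const prompt = withStepByStep(
  "The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. How many apples do they have now?"
);
```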

CoT was a prompting technique. The model itself stayed the same: no retraining, no new API. You wrote the prompt differently and got better answers on reasoning tasks. That was the state of the art until reasoning models showed up.

Reasoning models internalize the trick

In September 2024, OpenAI released o1. The framing in the launch post was direct: "Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem."

The new part was how it got there. o1 was trained with large-scale reinforcement learning to produce its own chain of thought. No prompting tricks needed. The model generates a long internal monologue: it tries things, notices mistakes, backtracks, picks a different approach. These hidden tokens are billed as output and surface in usage.output_tokens_details.reasoning_tokens. You pay for them whether you read them or not.
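Because reasoning tokens are billed as output, any cost estimate has to include them even though you never see them. A sketch with illustrative prices (not any provider's real rates):

```typescript
// Shape of the usage object described above: reasoning tokens are a
// subset of output_tokens and are billed at the output rate.
interface Usage {
  input_tokens: number;
  output_tokens: number; // includes hidden reasoning tokens
  output_tokens_details: { reasoning_tokens: number };
}

// Illustrative prices per million tokens -- made up for the sketch.
const INPUT_PER_M = 2.0;
const OUTPUT_PER_M = 8.0;

function estimateCost(usage: Usage): number {
  return (
    (usage.input_tokens / 1e6) * INPUT_PER_M +
    (usage.output_tokens / 1e6) * OUTPUT_PER_M
  );
}

// A request where most of the output was hidden reasoning: you pay for
// all 9,000 output tokens even though only 1,000 were visible.
const usage: Usage = {
  input_tokens: 1_000,
  output_tokens: 9_000,
  output_tokens_details: { reasoning_tokens: 8_000 },
};
```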

The R1 paper (January 2025) showed you can teach this behavior using only reinforcement learning, without anyone writing example reasoning chains by hand. Take problems that have a checkable answer (math and code), let the model try them, reward it when the final answer is right, and reasoning behavior shows up on its own. The R1-Zero variant taught itself to double-check and backtrack during training. The paper calls this the "aha moment": at some point during training, the model started writing things like "Wait, let me re-evaluate this step."

The other thing OpenAI showed with o1 is that the longer the model is allowed to think before answering, the more accurate its answer is, and that relationship holds smoothly across a wide range. More thinking time means more time waiting for the response and more output tokens, but you get a better answer. Every reasoning-effort parameter that exists today is a way to control that tradeoff: pay more for accuracy, or save on cost and speed.

So now there are two model categories:

  • Non-reasoning models (GPT-4.1, Gemini 2.0 Flash, smaller open-source models like Mistral Small) that respond directly. CoT prompting still helps them.
  • Reasoning models (GPT-5, o3, Claude Opus 4.7, Gemini 3 Pro, DeepSeek R1, Grok 4) that produce a long internal chain of thought before answering. CoT prompting may hurt them.

That second point is not a typo. The official OpenAI guidance says to avoid "think step by step" because reasoning models do that internally, and explicit CoT prompts can interfere with it. The phrase that used to make models reason in 2022 now gets in the way of the reasoning the model already does on its own.
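If you route requests across both model categories, it's worth encoding that rule once. A minimal sketch, assuming a hand-maintained (and here purely illustrative) list of reasoning models:

```typescript
// Which models reason internally -- illustrative, not exhaustive.
const REASONING_MODELS = new Set(["o3", "gpt-5", "deepseek-reasoner"]);

function preparePrompt(model: string, question: string): string {
  // Reasoning models do CoT internally; an explicit trigger can
  // interfere with it, so send the question as-is.
  if (REASONING_MODELS.has(model)) return question;
  // Non-reasoning models still benefit from the 2022-era trigger.
  return `${question}\n\nLet's think step by step.`;
}
```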

What the effort parameter actually controls

The thinking-effort parameter is something you set per request. It tells the model how aggressively to spend its budget for hidden reasoning tokens before producing a visible answer.

Concretely, the parameter controls how many tokens of internal monologue the model generates, how long the user waits for the first visible token, and how many output tokens you get billed for. It does not control whether the model is "smart" (that is the model choice), whether the model uses CoT (reasoning models always do), or whether reasoning is shown (that is a separate display flag).

Each provider implements it differently.

OpenAI: reasoning_effort

As of GPT-5.5 the levels are none, minimal, low, medium, high, and xhigh. Default is medium. The o-series (o1, o3, o4-mini) supports low | medium | high only. minimal is only available on GPT-5 family models and tells the model to skip thinking almost entirely and respond directly, the way a non-reasoning model would.

import OpenAI from "openai";
const client = new OpenAI();

const response = await client.responses.create({
  model: "gpt-5",
  input: "What's the most efficient sort for nearly-sorted data?",
  reasoning: { effort: "medium", summary: "auto" },
});

console.log(response.output_text);
console.log(response.usage.output_tokens_details.reasoning_tokens);

The raw chain of thought is not exposed; you only get a model-generated summary. Reasoning tokens take up context-window space, so reserve at least 25,000 tokens of headroom when you start experimenting.
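One way to respect that headroom is to budget it explicitly when sizing a request. A back-of-envelope helper (the function and its default are mine, not part of the SDK):

```typescript
// Reasoning tokens compete with the prompt and the visible answer for
// context-window space. Compute how much visible output is left after
// reserving room for hidden reasoning.
function maxVisibleOutput(
  contextWindow: number,
  promptTokens: number,
  reasoningHeadroom = 25_000 // the suggested minimum while experimenting
): number {
  return Math.max(0, contextWindow - promptTokens - reasoningHeadroom);
}
```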

If you're building an agent that calls tools across multiple API requests, OpenAI gives you reasoning.encrypted_content: an encrypted blob the model can decode internally to remember what it was thinking on the previous turn. Pass it back into each follow-up request. Skip it and the model goes into the next turn without any memory of its prior reasoning, which leads to worse tool decisions.
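The pass-back itself is just payload plumbing: the previous turn's output items, encrypted reasoning included, go in ahead of the tool result. A simplified sketch of that construction (item shapes are pared down for illustration, not the full API types):

```typescript
// Simplified stand-in for Responses API items: a reasoning item
// carrying encrypted_content, a function_call, a function_call_output.
type Item = { type: string; [key: string]: unknown };

// Build the next request's input: replay the previous output (so the
// model can decode its own encrypted reasoning) and append the result
// of the tool call it asked for.
function buildNextInput(
  prevOutput: Item[],
  toolCallId: string,
  toolResult: string
): Item[] {
  return [
    ...prevOutput,
    { type: "function_call_output", call_id: toolCallId, output: toolResult },
  ];
}
```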

GPT-5 also added a separate verbosity parameter for output length. Thinking depth and answer length are now decoupled. You can think hard and answer briefly, or barely think and ramble.

Anthropic: budget_tokens, then effort

The first version, extended thinking, used budget_tokens: a hard cap on how many tokens Claude could spend on internal reasoning. You set an integer like 10000 and that is your ceiling.

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 16000,
  thinking: { type: "enabled", budget_tokens: 10000 },
  messages: [
    { role: "user", content: "Are there an infinite number of primes p where p mod 4 == 3?" },
  ],
});

Thinking blocks come back in the response with a cryptographic signature attached. If the model used a tool and you want to continue the conversation on the next turn, you have to send the original thinking block back unchanged along with the tool result. Otherwise the model can't verify its prior thinking, and the reasoning chain breaks.

const first = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 16000,
  thinking: { type: "enabled", budget_tokens: 10000 },
  tools: [weatherTool],
  messages: [{ role: "user", content: "What's the weather in Paris?" }],
});

const thinkingBlock = first.content.find((b) => b.type === "thinking");
const toolUseBlock = first.content.find((b) => b.type === "tool_use");

const next = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 16000,
  thinking: { type: "enabled", budget_tokens: 10000 },
  tools: [weatherTool],
  messages: [
    { role: "user", content: "What's the weather in Paris?" },
    { role: "assistant", content: [thinkingBlock, toolUseBlock] },
    {
      role: "user",
      content: [
        { type: "tool_result", tool_use_id: toolUseBlock.id, content: "88F" },
      ],
    },
  ],
});

There is also interleaved thinking, where the model reasons between tool calls instead of all up front.

On Opus 4.6 and Sonnet 4.6, an effort parameter (low | medium | high | max) supersedes budget_tokens. Opus 4.7 adds xhigh above high and drops budget_tokens entirely. The new effort is broader than OpenAI's: it controls all output tokens (thinking, text, and tool calls), not just the reasoning. Lower effort means fewer tool calls, less internal thinking, and shorter explanations. For simple tasks that's exactly what you want. For complex tasks where the model needs to explore (multiple tool calls, careful planning), low effort can cut corners that matter.

On the latest Anthropic models (Opus 4.7, Sonnet 4.6), thinking is adaptive. At high or max, Claude almost always thinks before answering. At lower effort levels it may skip thinking on simple problems but still thinks on hard ones.

Google Gemini: thinkingBudget and thinkingLevel

Gemini 2.5 introduced thinkingBudget, an integer along the lines of Anthropic's old approach. 0 disables thinking (where supported), -1 is dynamic, and the cap depends on the model variant (24,576 on Flash, 32,768 on Pro). Gemini 2.5 Pro can't disable thinking at all.
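Those per-variant rules are easy to get wrong when you support more than one model. A sketch that encodes them as a validation helper (the helper and its cap table are mine, not SDK behavior):

```typescript
// thinkingBudget caps per Gemini 2.5 variant, per the rules above.
const BUDGET_CAP: Record<string, number> = {
  "gemini-2.5-flash": 24_576,
  "gemini-2.5-pro": 32_768,
};

function normalizeBudget(model: string, budget: number): number {
  if (budget === -1) return -1; // dynamic thinking
  if (budget === 0 && model === "gemini-2.5-pro") {
    // Pro can't disable thinking at all.
    throw new Error("Gemini 2.5 Pro cannot disable thinking");
  }
  // Clamp to the variant's cap; pass through unknown models unchanged.
  return Math.min(budget, BUDGET_CAP[model] ?? budget);
}
```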

Gemini 3 switched to thinkingLevel (minimal | low | medium | high), closer to OpenAI's enum. Both parameters are still accepted on Gemini 3 for backwards compatibility, but mixing thinkingBudget with Gemini 3 Pro can produce unexpected behavior.

import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({});

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: "Solve the logic puzzle...",
  config: {
    thinkingConfig: { thinkingLevel: "low" },
  },
});

Thought summaries are a separate flag (includeThoughts: true), unrelated to the level or budget.

DeepSeek and Grok

DeepSeek R1 has no effort parameter. It always reasons. The API exposes a separate reasoning_content field so you can read the full chain of thought (no summarization, no encryption), and forbids temperature, top_p, presence_penalty, frequency_penalty, and logprobs on the reasoner endpoint. You take it as it comes.

Grok 3 Think reasons automatically with nothing to configure. The Grok 4 multi-agent variant exposes a reasoning_effort that maps to number of agents rather than depth per agent: low and medium give you 4 agents, high and xhigh give you 16.
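That mapping, written out as a lookup (an illustrative helper, not part of any xAI SDK):

```typescript
// Grok 4 multi-agent variant: reasoning_effort selects how many
// parallel agents run, not how deeply each one thinks.
type GrokEffort = "low" | "medium" | "high" | "xhigh";

function agentCount(effort: GrokEffort): number {
  return effort === "low" || effort === "medium" ? 4 : 16;
}
```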

The chain of thought can mislead you

One more thing. The chain of thought is not a window into what the model is "really" doing.

A 2025 study tested whether reasoning models faithfully report their reasoning. Researchers injected subtle hints into prompts and checked whether the model's chain of thought disclosed using them. Claude 3.7 Sonnet mentioned hints 25% of the time. DeepSeek R1, 39%. For "concerning" hints (e.g. unauthorized access), Claude was faithful 41% of the time, R1 only 19%. The unfaithful chains weren't shorter; they were longer on average. The model was actively constructing alternative justifications.

There are other ways the chain can mislead you. The model might commit to an answer first and write the reasoning afterwards. The reasoning might not actually drive the answer at all, just be tokens that look like steps. Or the chain might encode information in ways a human can't read. The original CoT paper said one of its strengths was that you could read the model's reasoning and see how it got somewhere. That's still partly true, but you can't treat what the model writes as a reliable record of what it's actually doing inside, especially when it's being optimized.

Use the chain of thought for what it's good at: more thinking time on hard problems, better answers, and a rough idea of what the model considered when something goes wrong. Don't use it as proof the model is doing what it says.
