Inside the reason phase. How LLMs turn a goal into the next action, why chain-of-thought is just more tokens, and why your prompt is the agent's architecture.
I once asked a model “is 1219 prime?” and got a confident, wrong “yes.” I asked the same model the same question with one extra line in the system prompt, “think step by step before you answer,” and watched it walk through trial division up to $\sqrt{1219}$ and arrive at the right answer: 1219 = 23 × 53, so not prime. Same weights, same question, different output, because the second prompt made room for the model to do the work in tokens before committing to an answer.
That gap is what this post is about.
TL;DR
- LLMs reason by generating tokens. There is no separate thinking module.
- Chain-of-thought, ReAct, and “reasoning models” are all variations on the same idea: give the model space to write before it commits.
- The system prompt is not styling. It is architecture. It decides what the agent perceives, how it deliberates, and when it stops.
- You can build a working ReAct agent in about 90 lines of plain Python with no SDK. We will.
It is tempting to imagine a hidden module inside an LLM that “thinks” before it speaks. There isn’t one. Every output token is produced the same way: by sampling from a distribution conditioned on every token that came before it, including the model’s own recent output.
That is the entire trick behind chain-of-thought. When you let a model write “first I’ll check if 1219 is divisible by 2, then 3, then 5…” before it has to produce its answer, those tokens become part of the context the next token is conditioned on. The model uses its own scratch work as a runway. The runway is what produces the right answer.
This has two practical consequences worth tattooing somewhere:
- “Just answer yes or no.” is asking the model to skip the reasoning step. Sometimes that’s what you want. Often it isn’t.
- So-called “reasoning models” (the GPT-5 reasoning family, Claude’s extended thinking, Gemini’s thinking modes) are this idea taken seriously. They generate a long internal scratchpad of tokens you usually don’t see, then produce the visible answer. Same mechanism, different bookkeeping.
Here is the difference, end to end:
```mermaid
flowchart LR
    Q([Question]) --> Direct[Direct answer]
    Direct --> A1([Often wrong on multi-step problems])
    Q2([Question]) --> Think[Write reasoning tokens]
    Think --> Conclude[Commit to answer]
    Conclude --> A2([Conditioned on its own work])

    classDef question fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a
    classDef think fill:#fce7f3,stroke:#be185d,color:#831843
    classDef act fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef terminal fill:#f3f4f6,stroke:#374151,color:#111827

    class Q,Q2 question
    class Direct,Think think
    class Conclude act
    class A1,A2 terminal
```
The top path is a chatbot reflex. The bottom path is what every agent does, repeatedly, inside its loop.
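To see the gap outside a diagram, here is a minimal sketch of the two paths against the same model. The model name (gpt-4o, a non-reasoning model, so the contrast is actually visible) and the prompt wording are assumptions; any chat model works.

```python
# A sketch of the two paths above: same model, same question, one extra system line.
# gpt-4o and the exact wording are assumptions; reasoning models do this internally.
from openai import OpenAI

client = OpenAI()
question = "Is 1219 prime? End with a single yes or no."

# Top path: ask for the answer directly.
direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Bottom path: give the model room to write its reasoning tokens first.
with_room = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Think step by step before you answer."},
        {"role": "user", "content": question},
    ],
)

print(direct.choices[0].message.content)
print(with_room.choices[0].message.content)
```

The second call’s answer is conditioned on the reasoning tokens the model wrote for itself; the first call has to commit in one shot.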
ReAct (Yao et al., 2022) is the simplest pattern that turns the reason phase into something agentic. The model alternates between three things:
- Thought: a short piece of reasoning about what to do next, written in plain text.
- Action: a tool call, named and given an input in a fixed format.
- Observation: the tool’s result, appended to the context by the system running the loop.
Then the model writes another Thought, conditioned on what it just observed. The loop ends when the model writes a final answer instead of an action.
Two things make ReAct stick. First, it makes the reasoning visible. You can read the trace and see exactly where the agent went sideways. Second, it works with any model that can follow a format. No SDK required.
Let’s build one.
We’ll write three things: a prompt that defines the format, a parser that extracts steps from model output, and a loop that runs them until the model is done.
```python
SYSTEM_PROMPT = """You are a helpful assistant that solves problems step by step.

You have access to these tools:
- calc(expression: str): evaluate an arithmetic expression
- lookup(key: str): look up a fact in the knowledge base

For each turn, respond in this exact format:

Thought: <your reasoning about what to do next>
Action: <tool name>
Action Input: <input to the tool>

When you have the final answer, respond with:

Thought: <final reasoning>
Final Answer: <the answer>

Use exactly one Action per turn and then stop. Do not write the Observation
or a Final Answer in the same response as an Action; wait for the system to
give you the Observation in the next turn.
"""
```
This prompt is doing four jobs at once:
- It sets the role: a step-by-step problem solver, not a chatty assistant.
- It declares the tools the model may reach for, and nothing else.
- It fixes the output grammar: Thought:, Action:, Action Input:, and Final Answer: are the only labels the parser will accept.
- It sets the turn discipline: one Action per turn, then stop and wait for the Observation.
Every line is a design decision, not decoration.
The model returns a blob of text. We need to know whether it’s an action or a final answer.
```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    kind: str  # "action" or "final"
    thought: str
    tool: str | None = None
    tool_input: str | None = None
    answer: str | None = None


def parse_step(text: str) -> Step:
    # If the model hallucinated its own Observation:, slice it off.
    # Real observations come from our tool runner, not the model.
    cutoff = re.search(r"\n\s*Observation:", text)
    if cutoff:
        text = text[: cutoff.start()]

    thought = re.search(r"Thought:\s*(.+?)(?=\n[A-Z]|\Z)", text, re.S)
    action = re.search(r"Action:\s*(.+)", text)
    inp = re.search(r"Action Input:\s*(.+)", text)

    # Prefer Action over Final Answer if the model wrote both in one turn:
    # it skipped past the Observation it should have waited for. Run the tool.
    if action and inp:
        return Step(kind="action",
                    thought=thought.group(1).strip() if thought else "",
                    tool=action.group(1).strip(),
                    tool_input=inp.group(1).strip())

    final = re.search(r"Final Answer:\s*(.+)", text, re.S)
    if final:
        return Step(kind="final",
                    thought=thought.group(1).strip() if thought else "",
                    answer=final.group(1).strip())

    return Step(kind="action",
                thought=thought.group(1).strip() if thought else "",
                tool=action.group(1).strip() if action else "",
                tool_input=inp.group(1).strip() if inp else "")
```
A few things worth noticing here. The parser is deliberately strict: if the model deviates from the format, we’d rather fail loudly than guess. In production, you’d add a retry that pushes the malformed output back to the model with a “please follow the format” nudge. The version here doesn’t, on purpose, because hidden retries make it harder to see what the model actually did.
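If you did want that retry, it can live entirely outside parse_step. A minimal sketch, assuming a call_model helper that sends one corrective turn and returns the raw reply text (that helper is hypothetical, not part of the agent below):

```python
# Hypothetical retry wrapper around parse_step. `call_model` is assumed to append a
# corrective message to the conversation and return the model's raw reply text.
def parse_with_retry(text: str, call_model, max_retries: int = 2) -> Step:
    for _ in range(max_retries + 1):
        step = parse_step(text)
        # A step with no tool name and no final answer means the format was violated.
        if step.kind == "final" or step.tool:
            return step
        text = call_model(
            "Your last reply did not follow the required format. Respond again using\n"
            "Thought: / Action: / Action Input: or Thought: / Final Answer:"
        )
    raise ValueError("model never produced a parseable step")
```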
There are two pieces of defense in here. Some ReAct examples pass stop=["Observation:"] to the API to prevent the model from generating its own observation after writing an action. Reasoning models like gpt-5 reject the stop parameter, so we defend in the parser instead.
The first defense is the slice at the top: if the model writes its own Observation: line, we cut everything from there on. Real observations come from our tool runner, never from the model. The second defense is the ordering of the matches: we check for Action/Action Input before Final Answer. If the model wrote both in one turn (which happens; the prompt asks it to wait but does not enforce it), we treat the step as an action, run the tool, and let the real observation drive the next turn. Without that ordering, a model that races ahead to a guessed Final Answer never gets corrected by a tool call.
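To see the ordering defense do its job, feed the parser an output where the model raced ahead; this exact failure shows up in the trace at the end of the post:

```python
# The model writes an Action *and* a guessed Final Answer in the same turn.
racing = """Thought: I should look up the requested value.
Action: lookup
Action Input: boiling point of water in Fahrenheit
Thought: It is 212 degrees Fahrenheit.
Final Answer: 212 °F"""

step = parse_step(racing)
print(step.kind, step.tool)  # -> action lookup: the tool runs, the guess is ignored
```

That’s the parser. The loop that drives it is the last piece: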
```python
from openai import OpenAI

client = OpenAI()

MAX_STEPS = 6

TOOLS = {
    "calc": lambda expr: str(eval_arith(expr)),
    "lookup": lambda key: KB.get(key.strip().lower(), "unknown"),
}


def run(goal: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": goal},
    ]
    for step_num in range(MAX_STEPS):
        resp = client.chat.completions.create(
            model="gpt-5",
            messages=messages,
        )
        text = resp.choices[0].message.content
        step = parse_step(text)

        if step.kind == "final":
            return step.answer

        observation = TOOLS[step.tool](step.tool_input)
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": f"Observation: {observation}"})

    raise RuntimeError("agent exceeded MAX_STEPS without producing a final answer")
```
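Two names in that block, eval_arith and KB, are not defined here; the full example linked below ships its own versions. For completeness, a sketch of what they could look like (an illustration, not the repo’s code), evaluating arithmetic through the ast module rather than a bare eval:

```python
# Sketch of the helpers the TOOLS table assumes. eval_arith walks the AST of an
# arithmetic expression instead of calling eval(); KB is a toy knowledge base.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def eval_arith(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return ev(ast.parse(expr, mode="eval").body)

KB = {"speed of light in m/s": "299792458"}  # toy entries, purely illustrative
```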
The loop body is twenty lines and it does almost everything an agent framework does for you. Three details earn their keep:
- No stop parameter. Older ReAct examples pass stop=["Observation:"] to halt generation before the model invents its own observation. Reasoning models like gpt-5 reject stop outright, so we moved the defense into the parser (above) and don’t need it on the API call.
- The observation goes back as a user message. Both the assistant’s raw output and the tool result are appended to messages, so the next Thought is conditioned on the full trace, including what the tool actually returned.
- MAX_STEPS = 6 is a hard ceiling. Real agents loop until they succeed or until you stop them. You always need the second condition.

That’s the whole pattern. Every framework you’ll see later in this series is doing this with more ceremony.
A complete, runnable version of this agent (including the calc and lookup implementations and a small toy knowledge base) is here:
Drop both files into a folder, set OPENAI_API_KEY, and:
```bash
uv sync
uv run python agent.py "What's the boiling point of water in Fahrenheit?"
```
A trimmed trace:
```
--- step 1 ---
Thought: I should look up the factual temperature value requested.
Action: lookup
Action Input: boiling point of water in Fahrenheit
Thought: The standard boiling point of water at 1 atmosphere is 212 degrees Fahrenheit.
Final Answer: 212 °F
parsed: Step(kind='action', thought='I should look up the factual temperature value requested.', tool='lookup', tool_input='boiling point of water in Fahrenheit', answer=None)
Observation: unknown

--- step 2 ---
Thought: The lookup failed; I can compute by converting 100°C (boiling point at 1 atm) to °F.
Action: calc
Action Input: 100 * 9/5 + 32
parsed: Step(kind='action', thought='The lookup failed; I can compute by converting 100°C (boiling point at 1 atm) to °F.', tool='calc', tool_input='100 * 9/5 + 32', answer=None)
Observation: 212

--- step 3 ---
Thought: The calculation confirms the standard boiling point of water at 1 atmosphere.
Final Answer: 212 °F
parsed: Step(kind='final', thought='The calculation confirms the standard boiling point of water at 1 atmosphere.', tool=None, tool_input=None, answer='212 °F')

=== final answer ===
212 °F
```
Notice step 1: the model wrote an Action and a guessed Final Answer in the same turn. The parser’s ordering defense treated it as an action, and the real observation (“unknown”) pushed the model to compute the conversion in step 2. The loop you’ll see inside every agent SDK does exactly this, three model turns picking between two tools to reach one answer, wrapped in a class hierarchy and a retry layer.
The reasoning mechanism is the same across major providers. The surface for getting it out of them is not.
OpenAI exposes reasoning through plain text generation plus a structured tool-calling JSON schema. ReAct-style “Thought / Action / Observation” prompts work because the model is happy to follow text formats. For GPT-5’s reasoning family, there are also dedicated “reasoning_effort” knobs that trade latency for additional internal scratchpad tokens you typically don’t see.
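If you want to reach for that knob from the same Chat Completions call used above, it looks roughly like this; reasoning_effort is accepted only by the reasoning models, and the allowed values may vary by SDK version:

```python
# Hedged sketch: reasoning_effort trades latency and cost for more hidden
# reasoning tokens before the visible answer. Reasoning models only.
resp = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="high",
    messages=[{"role": "user", "content": "Is 1219 prime?"}],
)
print(resp.choices[0].message.content)
```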
Anthropic Claude exposes thinking explicitly through extended thinking blocks. The thinking content is separate from the visible response. Tool use is structured (tool_use content blocks) and the model returns explicit stop_reason values (tool_use, end_turn, etc.) that make loop control cleaner. If you want the model’s reasoning as a first-class artifact rather than something you teach it to emit, Claude makes that easy.
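A sketch of what that looks like with the Anthropic SDK; the model name and token budget are assumptions:

```python
import anthropic

claude = anthropic.Anthropic()
msg = claude.messages.create(
    model="claude-sonnet-4-5",      # assumption: any model with extended thinking
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Is 1219 prime?"}],
)

for block in msg.content:
    if block.type == "thinking":
        pass                        # the reasoning, returned as its own block
    elif block.type == "text":
        print(block.text)           # the visible answer

print(msg.stop_reason)              # "end_turn", or "tool_use" when a tool is requested
```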
Gemini uses a parts-based content model where text, tool calls, and (with thinking enabled) thought summaries can all be parts of a single response. Tool calling uses function declarations similar in spirit to OpenAI’s, and thinking budgets behave like Anthropic’s effort knob.
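And the Gemini equivalent via the google-genai SDK; the model name and budget are assumptions, and the exact thinking-config fields may differ across SDK releases:

```python
from google import genai
from google.genai import types

gemini = genai.Client()
resp = gemini.models.generate_content(
    model="gemini-2.5-pro",          # assumption
    contents="Is 1219 prime?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True, thinking_budget=1024),
    ),
)

for part in resp.candidates[0].content.parts:
    # Thought summaries and answer text arrive as parts of one response.
    label = "thought" if getattr(part, "thought", False) else "text"
    print(label, ":", part.text)
```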
Don’t read this as a comparison shopping guide. The mental model is the same in every direction: you generate tokens, parse them, and feed the next turn back in. The wrappers change; the loop doesn’t.
If you’ve worked with traditional software, you’ve felt the difference between code and configuration. Code defines behavior; configuration adjusts it. Prompts blur that line in a way that takes some getting used to.
Three prompt decisions that are architectural decisions:
Role framing decides the agent’s capability ceiling. “You are an assistant that helps users plan trips” and “You are a senior travel agent who has booked five thousand corporate trips and double-checks every detail” produce qualitatively different agents from the same model. Treat the role line the way you’d treat a class definition.
Output format decides what is downstream-parseable. Free-form answers are a parser’s nightmare. Structured formats (JSON schemas, ReAct grammars, XML tags) are the contract between the model and the rest of your system. Specify the contract.
Stopping conditions decide what “done” means. A real agent needs a defined point at which it gives up, a defined point at which it asks for help, and a defined path to commit to a final answer. If your prompt doesn’t spell those out, the model will pick. You won’t like its picks under load. One way to make both the format contract and the stopping states explicit is sketched below.
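Here is one way (an illustration of the idea, not a prescription) to encode the contract so the rest of the system never has to guess what the agent decided. Field names are illustrative:

```python
# Sketch: an output contract where "done", "ask for help", and "give up" are
# explicit states the model must choose between, not behaviors it improvises.
import json
from dataclasses import dataclass
from typing import Literal


@dataclass
class AgentOutcome:
    status: Literal["answered", "needs_human", "gave_up"]
    answer: str | None   # present only when status == "answered"
    reason: str          # one sentence on why the agent stopped


def parse_outcome(raw: str) -> AgentOutcome:
    data = json.loads(raw)            # fail loudly, same policy as the ReAct parser
    outcome = AgentOutcome(**data)
    if outcome.status == "answered" and not outcome.answer:
        raise ValueError("status is 'answered' but no answer was provided")
    return outcome
```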
Prompts age the way config files age. The first version is fine. The fourth version is load-bearing for behavior nobody remembers writing down. Treat them with that respect.
The next post goes outside the reason phase, into the act phase. We’ll look at how tools turn an agent from “model that generates JSON” into something that can read an API, query a database, and send a message. Each major SDK exposes the same idea with different syntax, but the loop underneath is the same one we just built.
If you have a reasoning failure mode you’ve hit in your own agents, write to me at sumit at allthingsagentic dot org . The next post’s examples lean toward what people are actually struggling with in production.