As Artificial Intelligence (AI) research has matured as a field, two things have become abundantly clear:
- Enabling the large language model (LLM) to generate multiple reasoning paths and selecting the most consistent response outperforms using the initial answer generated. This is called self-consistency.
- Reasoning with LLMs involves more than just text. You can delegate work to tools (search, code, calculators) and use Chain of Thought (CoT) to sequence these actions. This underpins PAL and ReAct (more on these later).
The "reasoning models" we have today (like OpenAI's o1 series) formalise this idea clearly: spend more compute time and resources to allow the LLM to think, before answering or solving the question. These models are trained to plan, reason, and correct themselves during that thinking phase. This is called Chain of Thought prompting. In this article, we look at this prompting technique and how it is key to shaping smarter AI Agents.
What is “Chain-of-Thought”?
Chain of Thought is a prompt engineering strategy that instructs a model to show its reasoning steps before providing the final answer. This approach increases LLMs' accuracy by emulating human problem-solving. Think of how you might tackle an ambiguous question: you break it into smaller, focused questions that guide your reasoning, and because those questions are linked, they steer you toward a solution. In CoT, your prompt leads the AI to construct its response for the task methodically. The output flows logically through connected questions and intermediate steps to the answer.
One reason CoT is moving the needle on math, logic, and multi-step reasoning is that when a model uses it, it stops guessing and starts decomposing. The CoT effect was first demonstrated in 2022 with few-shot exemplars that showed step-by-step reasoning. The simpler zero-shot trick of appending a phrase like "Let's think step by step" came later and enabled CoT reasoning without needing any examples.
So instead of a bare answer, prompting a model with CoT produces intermediate reasoning steps that make the final answer more reliable.
Standard prompting vs. CoT prompting according to Yao et al. (2023)
Why agents need CoT
Agents work in loops: they observe → think → act → repeat. Without an explicit thinking stage, they hallucinate or lock onto the first plan that seems good enough. CoT prompting comes in very handy here. It significantly improves the reasoning capabilities of the language model, which leads to more accurate and reliable outputs for complex tasks. Without CoT, LLMs tend to shortcut to an answer. With CoT, they mirror human-like reasoning by breaking intricate problems into manageable steps.
What CoT gives you
With CoT, LLMs can:
- Improve Problem Solving: LLMs decompose complex problems into manageable chunks. These intermediate steps increase reasoning accuracy, especially on multi-step tasks.
- Be Versatile: CoT has been applied successfully to a wide range of tasks that require reasoning, which makes it useful across many applications and models.
- Deliberate Exploration: The chain of thought technique encourages exploring multiple branches of a problem and allows backtracking when needed (Tree-of-Thoughts).
- Improved Transparency: Because the model documents its thought process as it solves a problem, CoT makes it easier than perhaps any other prompting technique to understand and debug reasoning paths when they go wrong.
- Fewer Errors: Guiding models through a structured thought process helps reduce logical errors, and removing those errors considerably improves the overall accuracy of the model's responses.
- Tool Discipline: CoT interleaves reasoning with actions (browse, run code) instead of free-wheeling text responses.
- Cost Efficiency: This one matters because improving model performance can be expensive, especially with fine-tuning (still a common approach in the industry). CoT is easy to implement and, as a result, very cost-effective.
CoT Areas of Application
I have a theory that CoT is probably one of the most important prompting techniques for AI because of the ripple effect it causes when done right. Take mathematics and arithmetic, for instance. These require solid logical thinking, and with CoT applied well, many mathematical problems become solvable. Once maths has a defined, logical problem-solving procedure, a lot of other technical fields become easier to debug and improve as a result.
Other everyday areas where CoT can be applied include:
- Complex decision-making scenarios.
- Fields that require multi-step reasoning, like finance, physics, and statistics.
- Explaining concepts.
- Planning tasks.
- Debugging code step-by-step.
LLMs are already trained on large amounts of data, which gives them extensive knowledge. What CoT prompting does is leverage this knowledge to push the boundaries of AI problem-solving capabilities.
The core techniques (and when to use them)
1) Zero-shot CoT
This technique (Kojima et al. 2022) involves using the instruction "Let's think step by step" in your prompt to trigger the model's reasoning process, without providing it with any reasoning chains. By appending that instruction, the LLM is given autonomy to break down a problem and use its own step-by-step thinking process to arrive at a more accurate and robust answer.
Kojima et al. (2022)
Here's a simple problem to expand on this idea:
Prompt:
I bought 10 apples when I went to the market. I gave 2 apples to my brother and 2 to my friend. I then went out and bought 5 more apples and ate 1. How many apples are remaining with me?
Output:
11 apples
This answer is incorrect. Let's try again with the Zero-shot CoT instruction appended.
Prompt:
I bought 10 apples when I went to the market. I gave 2 apples to my brother and 2 to my friend. I then went out and bought 5 more apples and ate 1. How many apples are remaining with me? Let's think step by step.
Output:
First, you start with 10 apples. You gave 2 apples to your brother and 2 to your friend, so you had 6 left. Then you bought 5 more apples, so now you have 11 apples. Finally, you ate 1 apple, so you are left with 10 apples.
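Triggering the same behaviour in code only requires appending the phrase to the question before sending it to the model. Here's a minimal sketch using the OpenAI Python SDK; the model name and the helper function are illustrative choices, not part of the original example.

Zero-shot CoT call (Python sketch)

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def zero_shot_cot(question: str, model: str = "gpt-4o-mini") -> str:
    # Appending the trigger phrase is all that zero-shot CoT requires.
    prompt = f"{question}\nLet's think step by step."
    response = client.chat.completions.create(
        model=model,  # assumed model name; use whichever model you have access to
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(zero_shot_cot(
    "I bought 10 apples, gave 2 to my brother and 2 to my friend, "
    "bought 5 more, then ate 1. How many apples remain?"
))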
Use Zero-shot: When you want a fast uplift on non-trivial tasks with minimal setup.
2) Few-shot CoT
This technique provides an LLM with a small number of examples (shots) together with their step-by-step reasoning. LLMs still fall short on more complex tasks when the Zero-shot technique is applied; providing demonstrations in the prompt enables in-context learning and steers the model towards better performance. These demonstrations then condition the model for the subsequent 'shot' where it has to generate a response. According to Touvron et al. (2023), few-shot properties first appeared when models were scaled to a sufficient size.
The example below, as presented in Brown et al. 2020, demonstrates few-shot prompting.
Prompt:
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is: We were traveling in Africa and we saw these very cute whatpus.
To do a "farduddle" means to jump up and down really fast.
An example of a sentence that uses the word farduddle is:
Output:
The children began to farduddle with excitement when they heard the ice cream truck coming down the street.
The task in the example is to correctly use a new word in a sentence, and we can observe that the model learned how to perform the task by being provided with just one example (i.e., 1-shot). More difficult tasks might require experimenting with an increased number of demonstrations (e.g., 2-shot, 7-shot, 10-shot, etc.).
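For reasoning tasks, Few-shot CoT adds the worked reasoning to each exemplar rather than just the answer. Here's a minimal sketch; the exemplars and the final question are illustrative, not taken from the papers above.

Few-shot CoT prompt (Python sketch)

FEW_SHOT_COT = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many apples do they have?
A: They started with 23 apples, used 20, so 23 - 20 = 3. They bought 6 more, so 3 + 6 = 9. The answer is 9.

Q: {question}
A:"""

# Each exemplar shows the reasoning, not just the answer, so the model imitates the steps.
prompt = FEW_SHOT_COT.format(
    question="I had 10 apples, gave away 4, bought 5 more, and ate 1. How many are left?"
)
# Send `prompt` to your model of choice, e.g. with the client from the zero-shot sketch above.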
Use Few-shot: When structure matters. You can show one or two worked examples with the format you desire.
3) Self-consistency (CoT & sampling)
Proposed by Wang et al. (2022), this technique generates multiple reasoning paths through Few-shot CoT and then picks the most consistent answer. Self-consistency replaces the greedy decoding used in standard CoT prompting, where the model commits to the first answer that looks easiest. It improves CoT prompting performance on tasks that involve arithmetic or commonsense reasoning.
According to the paper, this method contains three steps:
- Prompting a language model using chain-of-thought (CoT) prompting
- Replacing the “greedy decode” in CoT prompting by sampling from the model’s decoder to generate a diverse set of reasoning paths, and
- Marginalizing the reasoning paths and choosing the most consistent answer in the final answer set.
Wang et al. (2022)
Self-consistency voting (pseudo-JS)
async function solveWithSC(prompt, samples = 5) {
  // Sample several independent CoT completions at a non-zero temperature.
  const runs = await Promise.all([...Array(samples)].map(() =>
    callModel({ prompt, temperature: 0.7 }) // each returns { thought, answer }
  ));
  // Tally the final answers across runs.
  const tally = new Map();
  for (const r of runs) tally.set(r.answer, (tally.get(r.answer) || 0) + 1);
  // Keep the majority answer and one representative rationale.
  const [bestAnswer] = [...tally.entries()].sort((a, b) => b[1] - a[1])[0];
  const representative = runs.find(r => r.answer === bestAnswer);
  return { answer: bestAnswer, rationale: representative.thought.slice(0, 400) };
}
Why this works: You marginalize over different reasoning paths and keep the majority answer.
Use self-consistency: When correctness matters more than latency.
4) Program-Aided Language (PAL)
Instead of relying on text reasoning, LLMs can write code snippets to handle calculations. This is the premise on which Program-Aided Language models are based. The method uses LLMs to read natural language problems and generate programs as the intermediate reasoning steps (Gao et al., 2022). This is different from chain-of-thought prompting because it offloads the solution step to a programmatic runtime like a Python interpreter, instead of the typical free-form text of CoT.
Gao et al. (2022)
Let's look at how this model generates Python code as its reasoning step, before executing it to solve a Math word problem.
import openai
import os
from dotenv import load_dotenv
from langchain.llms import OpenAI
# Load environment variables
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
# Setup model instance
llm = OpenAI(model_name="text-davinci-003", temperature=0)
# Define the problem
question = "If Alice has 12 apples and gives 3 to Bob, then buys 5 more, how many apples does she have?"
# Program-aided prompt with exemplars
PAL_PROMPT = """
# Q: Tom has 10 books. He buys 4 more and then gives 2 away. How many books does he have?
books = 10
books += 4
books -= 2
books
# Q: Sarah baked 20 cookies. She ate 5 and then baked 10 more. How many cookies does she have now?
cookies = 20
cookies -= 5
cookies += 10
cookies
# Q: {question}
""".strip() + "\n"
# Get the LLM output (Python code snippet)
llm_out = llm(PAL_PROMPT.format(question=question))
print("Generated code:\n", llm_out)
# Execute the generated Python code
local_vars = {}
exec(llm_out, {}, local_vars)
print("Answer:", list(local_vars.values())[-1])
What happened here:
- Natural language is converted into short Python programs.
- The model generated Python code for the new problem (Alice's apples).
- That code is executed directly in Python (exec) to get the final answer.
Output:
Generated code:
apples = 12
apples -= 3
apples += 5
apples
Answer: 14
Use PAL: When arithmetic or procedural steps are error-prone in natural language.
5) ReAct (Reason + Act)
This is a CoT framework that alternates between the reasoning ("thought") and the action-taking ("tool use") process in AI models. Introduced in Yao et al. (2022), the model generates reasoning traces that prompt, track, and update action plans. With ReAct, the LLM interacts with external tools to collect extra information that it uses as a guide towards more reliable responses.
ReAct improves the interpretability and trustworthiness of LLMs. The best results have been observed when ReAct is combined with CoT, using both internal knowledge and external information (e.g., Wikipedia) obtained during reasoning. The figure is from Yao et al. (2022), demonstrating ReAct and the steps involved in performing question answering.
The example above shows that the model doesn't just hallucinate. It reasons out loud, calls tools when needed, and then ties everything back into a final answer.
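Here's a minimal sketch of that loop in Python, assuming a hypothetical call_model helper and a single search tool; the Thought/Action/Observation format follows the paper's convention, but the helper names and the parsing are illustrative.

ReAct loop (Python sketch)

import re

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; wire this to your provider of choice."""
    raise NotImplementedError

def search(query: str) -> str:
    """Hypothetical search tool (e.g., a Wikipedia lookup); replace with a real API."""
    raise NotImplementedError

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for its next Thought and Action, given everything so far.
        step = call_model(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"
        # Finish[...] means the model has tied its reasoning into a final answer.
        done = re.search(r"Action: Finish\[(.*)\]", step)
        if done:
            return done.group(1)
        # Otherwise execute the requested tool call and feed the observation back in.
        action = re.search(r"Action: Search\[(.*)\]", step)
        if action:
            transcript += f"Observation: {search(action.group(1))}\n"
    return "No answer found within the step budget"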
Use ReAct: When the agent must both think and use tools (search, DB, APIs) in the same flow.
6) Tree-of-Thoughts (ToT)
Proposed by Yao et al. (2023) and Long (2023), Tree-of-Thoughts is a framework designed to model the human approach to problem-solving: it lets LLMs explore multiple candidate solutions in a structured way, like the branches of a tree. The ToT approach enables an LLM to deliberately evaluate its own progress towards solving a problem. This ability to generate and evaluate thoughts is then combined with search algorithms so the model can look ahead and backtrack systematically.
Yao et al. (2023)
The ToT framework is composed of key components like:
- Thought decomposition: Breaks the problem into smaller chunks (thoughts) that are pieced together to form a solution. How to decompose depends on the nature of the problem, but each thought should be significant enough to evaluate yet small enough to generate reliably.
- State evaluation: Generated thoughts are evaluated to check progress towards a solution. There are two strategies: value and vote. Value assigns each state a scalar rating (e.g., a number) or a classification (e.g., sure, likely, impossible). Vote compares different candidate solutions and selects the most promising one.
- Thought generation: This step determines how thoughts are produced. Sampling and proposing are the two techniques. The former generates several thoughts independently from the same prompt, as in creative writing, where multiple independent plot ideas can be generated. The latter proposes thoughts sequentially, each built on the previous one, which helps to avoid duplication.
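To make the loop over these components concrete, here's a minimal breadth-first sketch; propose_thoughts and score_state stand in for LLM calls and are assumptions, not the paper's reference implementation.

Tree-of-Thoughts search (Python sketch)

def propose_thoughts(state: str, k: int = 3) -> list[str]:
    """Hypothetical LLM call: propose k candidate next thoughts for a partial solution."""
    raise NotImplementedError

def score_state(state: str) -> float:
    """Hypothetical LLM call: rate how promising a partial solution looks (0 to 1)."""
    raise NotImplementedError

def tree_of_thoughts(problem: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [problem]                 # current set of partial solutions
    for _ in range(depth):               # one level of the tree per decomposition step
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(state):        # thought generation (propose)
                candidates.append(state + "\n" + thought)
        # State evaluation (value): keep only the most promising branches; pruning the rest
        # is where the backtracking behaviour comes from.
        frontier = sorted(candidates, key=score_state, reverse=True)[:beam]
    return frontier[0]                   # best chain of thoughts found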
Use ToT: When you need deliberate exploration and backtracking (planning, puzzles, creative search). Branch multiple partial thoughts, score them, expand on the promising ones, and stop when a good solution emerges.
Failure modes (and fixes)
- Verbose rambles: You can cap rationale length or ask for bullet-point steps.
- Tool-spam: Penalize unnecessary actions in your prompts or reward model states that reach correct answers with fewer calls.
- Shallow CoT: Switch to few-shot exemplars for structure if the chain repeats the question.
- Slow answers: Use CoT only after a difficulty check or use a smaller N for self-consistency.
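One way to act on the "slow answers" fix is to gate the expensive paths behind a cheap difficulty check. A rough sketch with assumed helper functions (none of these come from a specific library):

Difficulty gate (Python sketch)

def classify_difficulty(question: str) -> str:
    """Hypothetical cheap check (a heuristic or a small model) returning 'easy' or 'hard'."""
    return "hard" if len(question.split()) > 25 else "easy"

def ask_directly(question: str) -> str:
    """Single CoT-free model call; wire to your provider."""
    raise NotImplementedError

def self_consistency(question: str, samples: int) -> str:
    """CoT with a small number of sampled paths plus a majority vote; wire to your provider."""
    raise NotImplementedError

def answer(question: str) -> str:
    # Easy questions get one direct call; hard ones get CoT with a small self-consistency N.
    if classify_difficulty(question) == "easy":
        return ask_directly(question)
    return self_consistency(question, samples=3)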
What to measure in your own tests
- Monitor the accuracy on your real tasks (not just benchmarks).
- Look out for latency and token cost with and without CoT.
- Monitor tool correctness. For instance, did ReAct choose the right tool at the right time?
- Stability: Is there variance across runs (reduced by self-consistency)?
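A small harness makes those comparisons concrete. Here's a sketch, assuming you have a labelled set of your own tasks and two answering functions to compare (token cost can be read from your provider's API responses):

Accuracy and latency comparison (Python sketch)

import time

def evaluate(answer_fn, dataset):
    """dataset: list of (question, expected_answer) pairs drawn from your real tasks."""
    correct, total_latency = 0, 0.0
    for question, expected in dataset:
        start = time.perf_counter()
        prediction = answer_fn(question)
        total_latency += time.perf_counter() - start
        correct += int(prediction.strip() == expected)
    return {
        "accuracy": correct / len(dataset),
        "avg_latency_s": total_latency / len(dataset),
    }

# Run the same tasks with and without CoT and compare the two result dictionaries:
# baseline = evaluate(ask_directly, my_tasks)
# with_cot = evaluate(zero_shot_cot, my_tasks)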
Final Thoughts
More models are being trained to think before they speak, not just to predict faster. This is evidenced by the longer thinking time in newer models. OpenAI's o1 family and the latest GPT-5 are good examples: they are designed with reinforcement learning that optimizes internal reasoning, with explicit "thinking time" trade-offs. This is a clear shift from "bigger models" to "better thought."
In newer models, features matter less than outcomes, and CoT is valuable because it's powerful enough to change outcomes. It produces better plans, sharper tool use, and more stable answers. Be intentional about how you use it, measure it, and keep it lean.
References
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models (2023).
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa. Large Language Models are Zero-Shot Reasoners (2022).
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models (2023).
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. Language Models are Few-Shot Learners (2020).
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models (2022).
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig. PAL: Program-aided Language Models (2022).
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models (2022).