Method to improve reasoning in LLM
Inference-time compute scaling
Inference-time compute scaling (also often called inference compute scaling, test-time scaling, or other variations) includes methods that improve model reasoning capabilities at inference time (when a user prompts the model) without training or modifying the underlying model weights. The core idea is to trade off increased computational resources for improved performance, which helps make even fixed models more capable through techniques such as chain-of-thought reasoning, and various sampling procedures
Reinforcement learning (RL) refers to training
Methods that improve a model’s reasoning capabilities by encouraging it to take actions that lead to high reward signals. These rewards can be broad, such as task success or heuristic scores, or they can be narrowly defined and verifiable, such as correct answers in math problems or coding tasks. Unlike scaling compute at inference time, which can improve reasoning performance without modifying the model, RL updates the model’s weights during training. This enables the model to learn and refine reasoning strategies through trial and error, based on the feedback it receives from the environment.
Supervised fine-tuning and model distillation.
Distillation involves transferring complex reasoning patterns learned by powerful, larger models into smaller or more efficient models. Within the context of LLMs, this typically means performing supervised fine-tuning (SFT) using high- quality labeled instruction datasets generated by a larger, more capable model. This technique is commonly referred to as knowledge distillation or simply distillation in LLM literature. However, it’s important to note that this differs slightly from traditional knowledge distillation in deep learning, where a smaller (“student”) model typically learns from both the outputs and the logits produced by a larger (“teacher”) model
Reinforcement Learning
Learn what to do by trying actions, observing consequences, and receiving rewards, with the goal of maximizing long-term reward.
Agent
The learner and decision-maker.
- Chooses actions
- Learns from experience
- Has a strategy (policy)
Environment
Everything outside the agent.
- Reacts to actions
- Changes state
- Gives rewards
Think: the world
State (S)
A snapshot of the environment at a moment.
Key idea:
A state contains all information needed to decide the next action
Example:
- Chess: board position
- Robot: joint angles + velocity
- Game: player position, enemies, score
If state is incomplete, learning becomes harder (POMDPs).
Action (A)
What the agent can do.
- Discrete: up/down/left/right
- Continuous: steering angle, torque
Actions change the state.
Reward (R)
A scalar feedback signal.
- Positive → good
- Negative → bad
- Zero → neutral
Critical insight:
The reward does NOT tell you how to act — only how good the outcome was
This is what makes RL hard.
Policy (π)
The agent’s behavior rule.
Formally:
π(a | s) = probability of taking action a in state s
Policy answers:
“If I see this state, what should I do?”
Policy can be:
- Deterministic
- Stochastic
“Just pick the action with the highest predicted reward.”
But that’s wrong.
Why?
- Some actions look bad now but lead to better futures
- Some actions look good but trap you
So the agent must balance:
- Exploitation → use what you know
- Exploration → try new things
This tradeoff is the heart of RL.
Markov decision process
The “Memoryless” Hack: A critical concept here is the Markov Property, which assumes the future depends only on the present state
If a system that is markov only if the future depends only on the present state not past etc.
- states (know where you are)
- actions (know what you can do )
- Reward (know if it good or bad)
- transitions (know what happens next) transition probablity
Discount Rate
Episodic and Continuous Tasks
Episodic Tasks are the tasks that have a final state. The task begins with a start state and ends with a final state, which is called an episode.
e.g. When you play a car racing game, the race starts with some initial state and when you complete the race, it ends at the final state. This complete race becomes an episode which is independent of another race/episode.
Continuous Tasks are the tasks that do not have a final state. In short, these tasks never end.
e.g. an automated stock trading agent will continuously interact with the stock market and make buy or sell decisions based on the current market conditions.
So we have agent which will start looking for gain the reward but for a task could potentially go on forever, an agent might achieve an “infinite score,“which is impossible to optimize. without knowing our target it hard to solve.
To solve this we introduce Discount Rate (γ)
This is a number between 0 and 1. A reward received now is worth more than a reward received later
Policy
A policy is a commitment to how the agent will behave in every possible situation.
Formally:
Two types:
Deterministic
Always the same move.
Stochastic
\pi(a \mid s) \in $$0,1$$, \quad \sum_a \pi(a \mid s) = 1Useful when:
- Exploration matters
- The environment is adversarial
- Randomization is optimal
Key insight:
The policy is not what evaluates actions. It is what chooses actions.
Evaluation comes next.
Why values must exist at all
Suppose you only had a policy.
Ask:
“Why is this action good?”
The policy can’t answer.
So you invent a function that predicts long-term return.
That function is the value function.
State-value V(s): “How good is it to be here?”
Definition:
Meaning:
If I start in state (s) and follow policy π forever, how much reward do I expect?
This tells you:
- Whether a state is safe or dangerous
- Whether being here is promising or hopeless
But it does not tell you:
- Which action to take now
That’s a fatal limitation.
Action-value Q(s, a): “What if I do THIS?”
So you go one level deeper.
Interpretation:
If I’m in state (s), take action (a) now, and then follow π forever, how good is that?
This is a counterfactual evaluation:
- “What happens if I choose this move?”
This is why Q-values feel powerful.
The key relationship (this is the breakthrough)
The relationship between V and Q is:
Meaning:
A state is valuable if the actions you tend to take from it are valuable.
And the optimal versions satisfy:
This equation is the heart of control in RL.
Bellman Equation
The Bellman Equation is the “master formula” that connects the present to the future. It states that the value of your current position is the immediate reward plus the discounted value of where you land next
action space
- continius space (infinite no of step self drive car)
- diseceste space
Task
- epdisodic task
- continous task
ploicy
- deterministic
- stochastic policy
- Exploration and exploitation
State value function action value function
Apporach to reinforncement
- Model based
- Model free based
- policy based
- reinforce
- ppo
- tRPRP
- value based
- Q learning
- deep q learning
- double deep q learning
policy gradients actor critic proximal policy optimization
monte carlo temporal difference
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
As AI systems like large language models (LLMs) become more capable, they increasingly generate something that looks like logical, step-by-step reasoning — e.g., chain-of-thought outputs, careful explanations, iterative problem solving. But these appearances haven’t been rigorously measured against a clear notion of complexity, structure, and true logical computation. The Apple researchers ask:
Are these “thinking” behaviors actual reasoning — or just convincing patterns that fall apart as problems get harder?
To answer that, they define a research setup where they can precisely control complexity and measure not only final answers, but also what the model does internally — its reasoning trace — in a systematic, reproducible way.
To study reasoning rigorously, the researchers chose controllable logical puzzles such as:
- Tower of Hanoi
- River Crossing
- Checker jumping
- Blocks world reallocations
These tasks are classical logic problems where each state and move has precise rules. This lets them dial up complexity cleanly (e.g., increasing the number of disks, board size, etc.) and observe how performance changes.
This is crucial: unlike messy real-world benchmarks with noise and data leakage, these controlled environments let you measure true algorithmic reasoning instead of pattern matching.
When problem complexity increases, model behavior follows three distinct phases:
- Low Complexity Tasks:
Simple problems where standard LLMs outperform reasoning models. Here, reasoning overhead (chain-of-thought tokens) hurts performance compared to just answering. - Medium Complexity Tasks:
This is the “sweet spot” where LRMs do better. The structured reasoning trace gives them an edge over standard LLMs. - High Complexity Tasks:
Both models collapse completely — accuracy falls to near or exactly zero. And surprisingly, the LRM spends fewer tokens reasoning as problems get harder instead of more.
This suggests that when the reasoning load exceeds a certain threshold, the model doesn’t “think harder,” it gives up. That’s why the paper is called the “illusion” of thinking: it looks like reasoning, but as problems become truly complex, the models stop reasoning entirely
Anthropic research
When a modern reasoning-style AI model shows its work (via a “chain-of-thought”), is that displayed reasoning faithful to what the model actually used internally to produce its answer?
- They provide a question that has a correct answer.
- They create a perturbed version of the same question that includes a hint — something hinting at a specific answer.
- They check two things:
- Whether the model actually changes its answer when given the hint (showing it used the hint internally),
- And — crucially — whether the CoT text actually mentions the hint.
If the model changes its answer because of the hint but fails to mention it in its reasoning, that’s evidence the CoT was not faithful to its internal decision process.
The study finds that:
-
When prompted with a hint that does change the model’s answer, the reasoning text mentions or acknowledges that hint in less than ~20% of cases on average.
-
That means in the majority of situations where the model used the hint to influence its answer, it did not explicitly reflect that use in its reasoning.
This is the central finding: the CoT often omits critical causal information about how the model reached the final answer.
What Is Happening Internally
Inside a transformer:
Each layer performs:
During reasoning tasks:
Observed Patterns
- Certain attention heads specialize in:
- Tracking numbers
- Tracking logical operators
- Linking intermediate steps
- Some neurons activate consistently for:
- Step boundaries
- Deduction transitions
- Final-answer consolidation
This suggests that reasoning is not symbolic execution.
It is distributed computation across many neurons.
There is no explicit “scratchpad.”
The scratchpad is the token sequence itself.
The model writes intermediate states into text,then reads them back through attention.
That’s the key insight.
- Reasoning is externalized computation.
- The sequence becomes memory.
If you force the model to generate intermediate steps, you:
- Expand the search space
- Allow internal verification
- Reduce premature collapse to incorrect answers
Instead of jumping directly to a token distribution for the final answer,
the model traverses intermediate high-probability reasoning states.