Methods to improve reasoning in LLMs

Inference-time compute scaling

Inference-time compute scaling (also called inference compute scaling, test-time scaling, or other variations) covers methods that improve a model's reasoning capabilities at inference time (when a user prompts the model) without training or modifying the underlying model weights. The core idea is to trade increased computational resources for improved performance, which makes even a fixed model more capable through techniques such as chain-of-thought reasoning and various sampling procedures.
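One common sampling procedure is self-consistency: sample several reasoning paths and majority-vote over the final answers. A minimal sketch, where `sample_answer` is a hypothetical stub standing in for an actual stochastic LLM call:

```python
from collections import Counter
import random

def sample_answer(question, rng):
    # Hypothetical stand-in for one sampled chain-of-thought from an LLM;
    # here it just returns the right answer 70% of the time.
    return "42" if rng.random() < 0.7 else "41"

def self_consistency(question, n_samples=20, seed=0):
    """Spend more inference compute: sample many reasoning paths,
    then majority-vote over the final answers."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples

ans, agreement = self_consistency("What is 6 * 7?")
```

The model weights never change; only the number of samples (compute) is scaled up.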

Reinforcement learning (RL)

RL refers to training methods that improve a model's reasoning capabilities by encouraging it to take actions that lead to high reward signals. These rewards can be broad, such as task success or heuristic scores, or narrowly defined and verifiable, such as correct answers to math problems or coding tasks. Unlike scaling compute at inference time, which can improve reasoning performance without modifying the model, RL updates the model's weights during training. This enables the model to learn and refine reasoning strategies through trial and error, based on the feedback it receives from the environment.

Supervised fine-tuning and model distillation

Distillation involves transferring complex reasoning patterns learned by powerful, larger models into smaller or more efficient models. Within the context of LLMs, this typically means performing supervised fine-tuning (SFT) using high-quality labeled instruction datasets generated by a larger, more capable model. This technique is commonly referred to as knowledge distillation, or simply distillation, in the LLM literature. However, it's important to note that this differs slightly from traditional knowledge distillation in deep learning, where a smaller ("student") model typically learns from both the outputs and the logits produced by a larger ("teacher") model.
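The traditional logit-based variant can be sketched as a KL divergence between temperature-softened teacher and student distributions. A minimal NumPy sketch (not any particular library's API):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions.

    The T**2 factor keeps gradient magnitudes comparable
    across temperatures (as in classic knowledge distillation).
    """
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T**2)

teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[0.0, 0.0, 0.0]])
loss = distillation_loss(student, teacher)
```

In the LLM-style SFT variant, only the teacher's generated text (hard labels) is used; the loss above is the extra signal that logit access would add.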

Reinforcement Learning

Learn what to do by trying actions, observing consequences, and receiving rewards, with the goal of maximizing long-term reward.

Agent

The learner and decision-maker.

  • Chooses actions
  • Learns from experience
  • Has a strategy (policy)

Environment

Everything outside the agent.

  • Reacts to actions
  • Changes state
  • Gives rewards

Think: the world

State (S)

A snapshot of the environment at a moment.

Key idea:

A state contains all information needed to decide the next action

Example:

  • Chess: board position
  • Robot: joint angles + velocity
  • Game: player position, enemies, score

If state is incomplete, learning becomes harder (POMDPs).

Action (A)

What the agent can do.

  • Discrete: up/down/left/right
  • Continuous: steering angle, torque

Actions change the state.

Reward (R)

A scalar feedback signal.

  • Positive → good
  • Negative → bad
  • Zero → neutral

Critical insight:

The reward does NOT tell you how to act — only how good the outcome was

This is what makes RL hard.

Policy (π)

The agent’s behavior rule.

Formally:

π(a | s) = probability of taking action a in state s

Policy answers:

“If I see this state, what should I do?”

Policy can be:

  • Deterministic
  • Stochastic

“Just pick the action with the highest predicted reward.”

But that’s wrong.

Why?

  • Some actions look bad now but lead to better futures
  • Some actions look good but trap you

So the agent must balance:

  • Exploitation → use what you know
  • Exploration → try new things

This tradeoff is the heart of RL.
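The standard minimal way to implement this tradeoff is epsilon-greedy action selection, a sketch:

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon, explore (uniform random action);
    otherwise exploit (pick the action with the highest estimate)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                          # explore
    return max(range(len(q_values)), key=q_values.__getitem__)       # exploit

rng = random.Random(0)
q = [1.0, 5.0, 2.0]                 # current value estimates for 3 actions
actions = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
```

Mostly the agent exploits action 1, but it still occasionally tries the others, so it can discover if its estimates are wrong.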

Markov decision process

The "Memoryless" Property: a critical concept here is the Markov Property, which assumes the future depends only on the present state.

A system is Markov only if the future depends solely on the present state, not on the history of past states. An MDP consists of:

  • States (know where you are)
  • Actions (know what you can do)
  • Rewards (know whether an outcome is good or bad)
  • Transitions (know what happens next, given by transition probabilities)
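The four components above can be written down directly for a toy MDP. A minimal sketch with made-up states ("cool"/"overheated") and actions:

```python
import random

# transitions[(state, action)] = list of (next_state, probability, reward)
transitions = {
    ("cool", "slow"):       [("cool", 1.0, 1.0)],
    ("cool", "fast"):       [("cool", 0.5, 2.0), ("overheated", 0.5, -10.0)],
    ("overheated", "slow"): [("overheated", 1.0, 0.0)],
    ("overheated", "fast"): [("overheated", 1.0, 0.0)],
}
STATES = ["cool", "overheated"]
ACTIONS = ["slow", "fast"]

def step(state, action, rng):
    """Sample (next_state, reward) from the transition distribution."""
    outcomes = transitions[(state, action)]
    r = rng.random()
    cumulative = 0.0
    for next_state, prob, reward in outcomes:
        cumulative += prob
        if r < cumulative:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][2]  # guard against float rounding

rng = random.Random(0)
s, rew = step("cool", "fast", rng)
```

Note the Markov property is baked in: `step` looks only at the current `(state, action)`, never at history.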

Discount Rate

Episodic and Continuous Tasks

Episodic Tasks are tasks that have a final state. The task begins in a start state and ends in a final state; this complete sequence is called an episode.

e.g. When you play a car racing game, the race starts with some initial state and when you complete the race, it ends at the final state. This complete race becomes an episode which is independent of another race/episode.

Continuous Tasks are the tasks that do not have a final state. In short, these tasks never end.

e.g. an automated stock trading agent will continuously interact with the stock market and make buy or sell decisions based on the current market conditions.

So the agent sets out to maximize reward, but for a task that could potentially go on forever, it might accumulate an "infinite score," which is impossible to optimize. Without a well-defined target, the problem is hard to solve.

To solve this we introduce Discount Rate (γ)

This is a number between 0 and 1. A reward received now is worth more than the same reward received later: the discounted return is G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_k γ^k R_{t+k+1}, which stays finite whenever γ < 1 and rewards are bounded.
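The discounted return is easy to compute by folding from the end of the reward sequence, a sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
    computed right-to-left: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# The same +1 reward is worth less the later it arrives:
now = discounted_return([1.0])              # 1.0
later = discounted_return([0.0, 0.0, 1.0])  # 0.9**2 = 0.81
```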

Policy

A policy is a commitment to how the agent will behave in every possible situation.

Formally, a deterministic policy maps each state to one action, a = π(s); a stochastic policy maps each state to a probability distribution over actions, π(a | s).

Two types:

Deterministic

Always the same move.

Stochastic

π(a | s) ∈ [0, 1],  Σ_a π(a | s) = 1

Useful when:

  • Exploration matters
  • The environment is adversarial
  • Randomization is optimal
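Both policy types can be sketched in a few lines; a stochastic policy is just a normalized distribution over actions (here built with a softmax over arbitrary preference scores, an illustrative choice):

```python
import math

def softmax(prefs):
    m = max(prefs)                            # numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_policy(prefs):
    """pi(a|s): probabilities in [0, 1] that sum to 1."""
    return softmax(prefs)

def deterministic_policy(prefs):
    """pi(s): always the single highest-preference action."""
    return max(range(len(prefs)), key=prefs.__getitem__)

probs = stochastic_policy([2.0, 1.0, 0.0])
```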

Key insight:

The policy is not what evaluates actions. It is what chooses actions.

Evaluation comes next.

Why values must exist at all

Suppose you only had a policy.

Ask:

“Why is this action good?”

The policy can’t answer.

So you invent a function that predicts long-term return.

That function is the value function.

State-value V(s): “How good is it to be here?”

Definition:

V^π(s) = E_π[ Σ_{k≥0} γ^k R_{t+k+1} | S_t = s ]

Meaning:

If I start in state (s) and follow policy π forever, how much reward do I expect?

This tells you:

  • Whether a state is safe or dangerous
  • Whether being here is promising or hopeless

But it does not tell you:

  • Which action to take now

That’s a fatal limitation.

Action-value Q(s, a): “What if I do THIS?”

So you go one level deeper:

Q^π(s, a) = E_π[ Σ_{k≥0} γ^k R_{t+k+1} | S_t = s, A_t = a ]

Interpretation:

If I’m in state (s), take action (a) now, and then follow π forever, how good is that?

This is a counterfactual evaluation:

  • “What happens if I choose this move?”

This is why Q-values feel powerful.

The key relationship (this is the breakthrough)

The relationship between V and Q is:

V^π(s) = Σ_a π(a | s) Q^π(s, a)

Meaning:

A state is valuable if the actions you tend to take from it are valuable.

And the optimal versions satisfy:

V*(s) = max_a Q*(s, a)

This equation is the heart of control in RL.
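The V–Q relationship is a one-liner in code, a sketch:

```python
def state_value_from_q(policy, q_values):
    """V(s) = sum_a pi(a|s) * Q(s, a): expected action value under pi."""
    return sum(p * q for p, q in zip(policy, q_values))

def optimal_state_value(q_values):
    """V*(s) = max_a Q*(s, a): the best action's value."""
    return max(q_values)

q = [1.0, 4.0, 2.0]   # Q(s, a) for three actions
pi = [0.2, 0.5, 0.3]  # pi(a | s)
v = state_value_from_q(pi, q)  # 0.2*1 + 0.5*4 + 0.3*2 = 2.8
```

Note V under the policy (2.8) is below V* (4.0); control means closing that gap by shifting π toward the argmax of Q.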

Bellman Equation

The Bellman Equation is the "master formula" that connects the present to the future. It states that the value of your current position is the immediate reward plus the discounted value of where you land next:

V^π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V^π(s') ]
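Applying the optimality version of this backup repeatedly is value iteration. A self-contained sketch on a made-up two-state MDP:

```python
# transitions[(state, action)] = list of (next_state, probability, reward)
transitions = {
    ("A", "stay"): [("A", 1.0, 0.0)],
    ("A", "go"):   [("B", 1.0, 1.0)],
    ("B", "stay"): [("B", 1.0, 2.0)],
    ("B", "go"):   [("A", 1.0, 0.0)],
}
states, actions, gamma = ["A", "B"], ["stay", "go"], 0.9

def value_iteration(n_sweeps=200):
    """Repeated Bellman optimality backup:
    V(s) <- max_a sum_{s'} P(s'|s,a) [R + gamma * V(s')]."""
    v = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        v = {
            s: max(
                sum(p * (r + gamma * v[s2]) for s2, p, r in transitions[(s, a)])
                for a in actions
            )
            for s in states
        }
    return v

v = value_iteration()
# Staying in B forever yields 2/(1-0.9) = 20; A's best plan is go-then-stay: 1 + 0.9*20 = 19.
```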

Action space

  • Continuous space (infinitely many values, e.g., the steering angle of a self-driving car)
  • Discrete space (a finite set of moves)

Task

  • Episodic task
  • Continuous task

Policy

  • deterministic
  • stochastic policy
  • Exploration and exploitation

State value function, action value function

Approaches to reinforcement learning

  • Model-based
  • Model-free
  • Policy-based
    • REINFORCE
    • PPO
    • TRPO
  • Value-based
    • Q-learning
    • Deep Q-learning
    • Double deep Q-learning

Policy gradients, actor-critic, proximal policy optimization (PPO)

Monte Carlo and temporal-difference (TD) learning
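The value-based family above can be summarized by the tabular Q-learning update, which is a TD step toward the Bellman optimality target. A minimal sketch:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    The bracketed term is the temporal-difference error."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
actions = [0, 1]
# One observed transition: in state "s0", action 1 gave reward 1.0, landing in "s1".
q_learning_update(Q, "s0", 1, 1.0, "s1", actions)
```

Because the update bootstraps from `max Q(s', ·)` rather than a full sampled return, it is a TD method, not Monte Carlo.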

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

As AI systems like large language models (LLMs) become more capable, they increasingly generate something that looks like logical, step-by-step reasoning — e.g., chain-of-thought outputs, careful explanations, iterative problem solving. But these appearances haven’t been rigorously measured against a clear notion of complexity, structure, and true logical computation. The Apple researchers ask:

Are these “thinking” behaviors actual reasoning — or just convincing patterns that fall apart as problems get harder?

To answer that, they define a research setup where they can precisely control complexity and measure not only final answers, but also what the model does internally — its reasoning trace — in a systematic, reproducible way.

To study reasoning rigorously, the researchers chose controllable logical puzzles such as:

  • Tower of Hanoi
  • River Crossing
  • Checker jumping
  • Blocks World rearrangements

These tasks are classical logic problems where each state and move has precise rules. This lets them dial up complexity cleanly (e.g., increasing the number of disks, board size, etc.) and observe how performance changes.

This is crucial: unlike messy real-world benchmarks with noise and data leakage, these controlled environments let you measure true algorithmic reasoning instead of pattern matching.
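To see why these puzzles give a clean complexity dial: the optimal Tower of Hanoi solution grows exponentially with the number of disks (2^n − 1 moves), so each added disk roughly doubles the required reasoning. A short sketch:

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Optimal move sequence for n disks: move n-1 to aux,
    move the largest disk, then move n-1 from aux onto it."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

lengths = [len(hanoi_moves(n)) for n in range(1, 8)]  # 1, 3, 7, 15, 31, 63, 127
```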

When problem complexity increases, model behavior follows three distinct phases:

  • Low Complexity Tasks:
    Simple problems where standard LLMs outperform reasoning models. Here, reasoning overhead (chain-of-thought tokens) hurts performance compared to just answering.
  • Medium Complexity Tasks:
    This is the “sweet spot” where LRMs do better. The structured reasoning trace gives them an edge over standard LLMs.
  • High Complexity Tasks:
    Both models collapse completely — accuracy falls to near or exactly zero. And surprisingly, the LRM spends fewer tokens reasoning as problems get harder instead of more.

This suggests that when the reasoning load exceeds a certain threshold, the model doesn't "think harder"; it gives up. That's why the paper is called the "illusion" of thinking: it looks like reasoning, but as problems become truly complex, the models stop reasoning entirely.

Anthropic research

When a modern reasoning-style AI model shows its work (via a “chain-of-thought”), is that displayed reasoning faithful to what the model actually used internally to produce its answer?

  • They provide a question that has a correct answer.
  • They create a perturbed version of the same question that includes a hint — something hinting at a specific answer.
  • They check two things:
    • Whether the model actually changes its answer when given the hint (showing it used the hint internally),
    • And — crucially — whether the CoT text actually mentions the hint.

If the model changes its answer because of the hint but fails to mention it in its reasoning, that’s evidence the CoT was not faithful to its internal decision process.

The study finds that:

  • When prompted with a hint that does change the model’s answer, the reasoning text mentions or acknowledges that hint in less than ~20% of cases on average.

  • That means in the majority of situations where the model used the hint to influence its answer, it did not explicitly reflect that use in its reasoning.

This is the central finding: the CoT often omits critical causal information about how the model reached the final answer.
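The metric itself is simple to state in code. A toy sketch with hypothetical per-case records (the field names are made up for illustration):

```python
# Each record: did the hint flip the model's answer, and did the
# chain-of-thought text acknowledge using the hint?
cases = [
    {"answer_changed": True,  "cot_mentions_hint": False},
    {"answer_changed": True,  "cot_mentions_hint": True},
    {"answer_changed": True,  "cot_mentions_hint": False},
    {"answer_changed": False, "cot_mentions_hint": False},
    {"answer_changed": True,  "cot_mentions_hint": False},
]

def faithfulness_rate(cases):
    """Among cases where the hint causally changed the answer,
    the fraction whose CoT admits using it."""
    used = [c for c in cases if c["answer_changed"]]
    if not used:
        return None
    return sum(c["cot_mentions_hint"] for c in used) / len(used)

rate = faithfulness_rate(cases)  # 1 of 4 hint-driven cases -> 0.25
```

Only the hint-driven cases count in the denominator; cases where the hint had no causal effect tell us nothing about faithfulness.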

What Is Happening Internally

Inside a transformer:

Each layer performs attention (which routes information between token positions) followed by an MLP block (which transforms each position's representation).

During reasoning tasks:

Observed Patterns

  1. Certain attention heads specialize in:
    • Tracking numbers
    • Tracking logical operators
    • Linking intermediate steps
  2. Some neurons activate consistently for:
    • Step boundaries
    • Deduction transitions
    • Final-answer consolidation

This suggests that reasoning is not symbolic execution.

It is distributed computation across many neurons.

There is no explicit “scratchpad.”

The scratchpad is the token sequence itself.

The model writes intermediate states into text, then reads them back through attention.

That’s the key insight.

  • Reasoning is externalized computation.
  • The sequence becomes memory.

If you force the model to generate intermediate steps, you:

  • Expand the search space
  • Allow internal verification
  • Reduce premature collapse to incorrect answers

Instead of jumping directly to a token distribution for the final answer,
the model traverses intermediate high-probability reasoning states.

Resources