Goyalayus


This is a cleaned and expanded version of my reinforcement learning outline. The original Notion page had the spine: Bellman equations, Monte Carlo, TD learning, Q-learning, DQN, policy gradients, actor-critic, PPO, RLHF, and GRPO. This page fills in the actual connective tissue.

Reinforcement Learning

Reinforcement learning is the setting where an agent learns by interacting with an environment.

At every step:

The environment is in a state $s_{t}$ .
The agent chooses an action $a_{t}$ .
The environment returns a reward $r_{t}$ and moves to a new state $s_{t + 1}$ .
The agent updates its behavior so future actions get better long-term reward.

The important phrase is long-term reward. RL is not just supervised learning with rewards. In supervised learning, the correct answer is usually attached to each example. In RL, the feedback can be delayed, sparse, noisy, and partly caused by actions taken many steps earlier.

A compact objective is:

G_{t} = r_{t} + γ r_{t + 1} + γ^{2} r_{t + 2} + \dots

where $G_{t}$ is the return from time $t$ and $γ$ is the discount factor. If $γ$ is close to 0, the agent is myopic. If $γ$ is close to 1, the agent cares about far future rewards.
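As a small illustration of the return formula above, here is a sketch that computes $G_{t}$ for every step of a finished episode. The function name `discounted_returns` and the example rewards are made up for illustration. It uses the identity $G_{t} = r_{t} + γ G_{t + 1}$, which follows directly from the sum.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] for one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A reward of 1 arrives only at the last step.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

With a small $γ$ the early steps see almost none of the final reward, which is the myopic case described above.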

The MDP View

Most basic RL assumes a Markov Decision Process.

An MDP has:

states $S$
actions $A$
transition dynamics $P (s^{'} ∣ s, a)$
reward function $R (s, a)$ or $R (s, a, s^{'})$
discount factor $γ$

The Markov assumption says the current state contains all information needed for predicting the next transition. The future depends on the present, not on the entire history directly.

In practice, this assumption is often false or approximate. LLM agents, robots, trading systems, games with hidden information, and recommendation systems often observe only part of the state. But the MDP view is still the clean base layer.

Policies, Values, and Q-Values

A policy tells the agent how to act.

A deterministic policy is:

a = π (s)

A stochastic policy is:

π (a ∣ s)

meaning the probability of taking action $a$ in state $s$ .

The value function measures how good a state is under a policy:

V^{π} (s) = E_{π} [G_{t} ∣ s_{t} = s]

The action-value function measures how good it is to take action $a$ in state $s$ , then follow policy $π$ :

Q^{π} (s, a) = E_{π} [G_{t} ∣ s_{t} = s, a_{t} = a]

These two functions are the core abstractions. A lot of RL is basically different ways of estimating $V$ , estimating $Q$ , or directly improving $π$ .

Bellman Equations

The Bellman equation is the recursion that makes RL work.

The value of a state is the immediate reward plus the discounted value of the next state:

V^{π} (s) = E_{a \sim π, s^{'} \sim P} [r (s, a) + γ V^{π} (s^{'})]

For Q-values:

Q^{π} (s, a) = E_{s^{'} \sim P} [r (s, a) + γ E_{a^{'} \sim π} Q^{π} (s^{'}, a^{'})]

For the optimal policy, the agent chooses the best next action:

Q^{*} (s, a) = E_{s^{'} \sim P} [r (s, a) + γ \max_{a^{'}} Q^{*} (s^{'}, a^{'})]

This one equation explains Q-learning. Estimate $Q^{*}$ well enough, then act greedily with respect to it.
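When the transition model is known, the optimality equation can be applied repeatedly as an update until the values stop changing. The sketch below does that on a tiny made-up MDP. The state names, action names, rewards, and the `transitions` layout are all assumptions for illustration, and rewards are attached to outcomes, as in the $R (s, a, s^{'})$ form.

```python
# Hypothetical MDP: transitions[state][action] = [(prob, next_state, reward), ...]
# The "end" state is terminal and has no actions.
transitions = {
    "start": {
        "safe": [(1.0, "end", 1.0)],
        "risky": [(0.5, "mid", 0.0), (0.5, "end", 0.0)],
    },
    "mid": {"go": [(1.0, "end", 4.0)]},
    "end": {},
}


def q_value_iteration(transitions, gamma, sweeps=50):
    """Repeatedly apply the Bellman optimality backup to every (s, a)."""
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(sweeps):
        new_q = {}
        for s, acts in transitions.items():
            new_q[s] = {}
            for a, outcomes in acts.items():
                total = 0.0
                for prob, s_next, reward in outcomes:
                    # Terminal states have no actions, so their value is 0.
                    best_next = max(q[s_next].values(), default=0.0)
                    total += prob * (reward + gamma * best_next)
                new_q[s][a] = total
        q = new_q
    return q


q = q_value_iteration(transitions, gamma=0.9)
greedy = {s: max(acts, key=acts.get) for s, acts in q.items() if acts}
print(q["start"], greedy)
```

Here "risky" pays nothing immediately but ends up with the higher Q-value in "start", because half the time it leads to the state where the reward of 4 is available.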

Tic-Tac-Toe With Bellman Equations

Tic-tac-toe is a nice small example because the full state space is manageable.

A state is the board. An action is placing your mark in an empty square. Rewards can be:

+1 for win
0 for draw
-1 for loss

Suppose a position can lead to three possible next states. The value of the current state is not guessed independently. It is backed up from the values of next states.

If one move immediately wins, its Q-value should become close to +1. If another move lets the opponent force a win, its Q-value should become negative. Learning means propagating final win/loss information backward through earlier board states.

That backward propagation is the Bellman idea.
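The backup can be made concrete with a short sketch. It makes several assumptions that are not in the text above: the board is a 9-character string with `.` for empty squares, there is no discounting, and the opponent plays optimally, so the "next state value" is the negated value from the opponent's point of view. This is exhaustive planning with a known model, not learning from experience.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def other(player):
    return "O" if player == "X" else "X"


@lru_cache(maxsize=None)
def state_value(board, player):
    """Value of `board` for `player`, who is about to move: +1, 0, or -1."""
    won = winner(board)
    if won is not None:
        return 1 if won == player else -1
    if "." not in board:
        return 0  # full board with no line: draw
    # Back up from successors: the state is worth its best action.
    return max(q_values(board, player).values())


def q_values(board, player):
    """Q-value of each empty square for `player`."""
    q = {}
    for i, cell in enumerate(board):
        if cell == ".":
            next_board = board[:i] + player + board[i + 1:]
            # The opponent moves next, so their value counts against us.
            q[i] = -state_value(next_board, other(player))
    return q


# X holds squares 0 and 1, O holds 3 and 4, and X is to move.
print(q_values("XX.OO....", "X"))
```

In this position square 2 wins on the spot, so its Q-value is +1. Square 6 ignores O's threat on the middle row, so its Q-value is -1.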

Model-Based vs Model-Free RL

In model-based RL, the agent learns or uses a model of the environment:

P (s^{'} ∣ s, a), R (s, a)

Then it can plan by simulating outcomes. Chess engines and many robotics systems lean on this idea.

In model-free RL, the agent does not explicitly learn the transition model. It learns values or policies directly from experience.

Model-based methods can be sample efficient because they reuse the model for planning. But learned models can be wrong, and planning through a wrong model compounds errors.

Model-free methods are often simpler and can scale well, but they usually need a lot of experience.

Monte Carlo Learning

Monte Carlo methods learn from complete episodes.

You run an episode until it ends, observe the final return, then update the value estimates for states/actions visited in that episode.

For a state $s$ visited at time $t$ :

V (s) \leftarrow V (s) + α (G_{t} - V (s))

The term $G_{t} - V (s)$ is the error between what actually happened and what you expected.

Monte Carlo is conceptually simple because it does not bootstrap. It waits for the real return. The downside is that it needs episodes to finish, and the return can have high variance.
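A minimal sketch of the update, assuming an episode is stored as a list of `(state, reward)` pairs and that every visit to a state triggers an update (the every-visit variant). The function name and the episode format are illustrative choices.

```python
def mc_update(V, episode, alpha, gamma):
    """Every-visit Monte Carlo update of the value table V (a dict).

    episode: list of (s_t, r_t) pairs from a finished episode.
    """
    G = 0.0
    # Walk backward so G holds the full return from each step onward.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        old = V.get(state, 0.0)
        V[state] = old + alpha * (G - old)
    return V


V = mc_update({}, [("A", 0.0), ("B", 1.0)], alpha=0.5, gamma=1.0)
print(V)  # {'B': 0.5, 'A': 0.5}
```

Nothing can be updated until the episode is over, because $G_{t}$ is not known before then.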

Temporal Difference Learning

TD learning updates before the episode ends.

Instead of waiting for the full return, it uses a one-step target:

r_{t} + γ V (s_{t + 1})

The TD update is:

V (s_{t}) \leftarrow V (s_{t}) + α [r_{t} + γ V (s_{t + 1}) - V (s_{t})]

The bracketed part is the TD error:

δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t})

TD learns from incomplete experience by bootstrapping from its own estimates.

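Monte Carlo targets are unbiased but have high variance, because the full return depends on everything that happens until the episode ends. TD targets have lower variance but are biased, because they trust the current value estimate of the next state.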
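The TD update is small enough to write out directly. This sketch assumes a dict-based value table and adds a `done` flag so that terminal states are treated as having value zero. That flag is an addition for illustration and is not part of the formula above.

```python
def td0_update(V, s, r, s_next, done, alpha, gamma):
    """Apply one TD(0) update to V (a dict) and return the TD error."""
    # Bootstrap from the current estimate of the next state,
    # unless the episode ended and there is nothing to bootstrap from.
    target = r if done else r + gamma * V.get(s_next, 0.0)
    delta = target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta


V = {"A": 0.0, "B": 0.5}
delta = td0_update(V, "A", r=0.0, s_next="B", done=False, alpha=0.5, gamma=1.0)
print(delta, V["A"])  # 0.5 0.25
```

The update for "A" happened without any reward being observed. The only signal came from the existing estimate for "B", which is the bootstrapping described above.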

Q-Learning

Q-learning is off-policy TD control.

The update is:

Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r_{t} + γ \max_{a^{'}} Q (s_{t + 1}, a^{'}) - Q (s_{t}, a_{t})]

The key part is $\max_{a^{'}} Q (s_{t + 1}, a^{'})$. Even if the agent explored and took a random action, the update assumes the future will follow the best known action.

That makes Q-learning off-policy: it can learn the greedy policy while behaving with an exploratory policy.

The usual behavior policy is epsilon-greedy:

with probability $ϵ$ , take a random action
otherwise, take the action with highest Q-value

This handles exploration while still improving toward exploitation.
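Putting the update and the behavior policy together, here is a sketch of tabular Q-learning on a made-up five-state chain. The environment, the hyperparameters, and the random tie-breaking are all assumptions chosen to keep the example small.

```python
import random

N_STATES, GOAL = 5, 4  # hypothetical chain: states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain dynamics: reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice([LEFT, RIGHT])
    best = max(Q[state])
    # Break ties randomly so an untrained agent does not always go left.
    return rng.choice([a for a in (LEFT, RIGHT) if Q[state][a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the best next action,
            # whatever the behavior policy actually does next.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = train()
greedy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)
```

After training, the greedy action in every non-terminal state should be `RIGHT`, even though the agent kept taking random actions throughout.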

Value Estimation

Value estimation is hard because values are moving targets.

If $V (s)$ depends on $V (s^{'})$ , and $V (s^{'})$ is also being learned, then the learning target changes as the model changes. This is why RL often feels less stable than supervised learning.

In supervised learning, the label is usually fixed. In TD learning, the label is partly produced by the current model itself.

That is powerful, but dangerous.

DQN

Deep Q-Networks replace the Q-table with a neural network:

Q_{θ} (s, a)

This lets Q-learning work with huge state spaces like images or long vectors.

But vanilla neural Q-learning is unstable because:

Consecutive experiences are highly correlated.
The target depends on the same network being trained.
A small Q overestimate can be reinforced repeatedly.

DQN relies on two stabilizers.

Experience Replay

Store transitions in a replay buffer:

(s_{t}, a_{t}, r_{t}, s_{t + 1}, \text{done})

Then train on random minibatches from the buffer. This breaks correlation between adjacent samples and improves data reuse.
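A replay buffer can be very small. This is a minimal sketch with uniform sampling and a fixed capacity. The class name and method names are illustrative, not taken from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = self.rng.sample(list(self.buffer), batch_size)
        # Transpose into (states, actions, rewards, next_states, dones).
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
states, actions, rewards, next_states, dones = buf.sample(2)
print(len(buf), states)
```

Only the three most recent transitions survive here, and the sampled states come back in random order rather than in the order they were collected.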

Target Network

Use a separate network for the target:

y = r + γ \max_{a^{'}} Q_{θ^{-}} (s^{'}, a^{'})

The online network $Q_{θ}$ is updated frequently. The target network $Q_{θ^{-}}$ is updated slowly or periodically. This makes the target less chaotic.
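The two update styles mentioned above, periodic and slow, can be sketched as follows. Parameters are represented as a flat list of floats purely for illustration, and the names `hard_update`, `soft_update`, and `tau` are assumptions. A real network would apply the same rule to each weight tensor.

```python
def hard_update(target_params, online_params):
    """Periodic update: copy the online weights into the target network."""
    target_params[:] = online_params


def soft_update(target_params, online_params, tau):
    """Slow update: move each target weight a fraction tau toward online."""
    for i, w in enumerate(online_params):
        target_params[i] = (1.0 - tau) * target_params[i] + tau * w


online = [1.0, 2.0]
target = [0.0, 0.0]

soft_update(target, online, tau=0.5)
print(target)  # [0.5, 1.0]

hard_update(target, online)
print(target)  # [1.0, 2.0]
```

Either way, the target network lags behind the online network, so the regression target changes more slowly than the weights being trained.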

Stable DQN

Stability improvements include:

target networks
replay buffers
reward clipping
gradient clipping
Double DQN
dueling networks
prioritized replay

Double DQN addresses overestimation bias. Vanilla DQN uses the same network, through a single max, both to select the next action and to evaluate it, so an action whose value is overestimated is also the one most likely to be picked. Double DQN selects with the online network but evaluates with the target network:

y = r + γ Q_{θ^{-}} (s^{'}, \arg\max_{a^{'}} Q_{θ} (s^{'}, a^{'}))

This small change often matters a lot.
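The difference between the two targets is easiest to see side by side. In this sketch the Q-values for the next state are passed in as plain lists, one entry per action. The numbers and function names are made up.

```python
def dqn_target(reward, done, q_next_target, gamma):
    """Vanilla DQN: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, done, q_next_online, q_next_target, gamma):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return reward + gamma * q_next_target[best_action]


q_online = [1.0, 2.0, 0.5]  # online network prefers action 1
q_target = [3.0, 1.0, 0.0]  # target network has a high estimate for action 0

print(dqn_target(0.0, False, q_target, gamma=1.0))                   # 3.0
print(double_dqn_target(0.0, False, q_online, q_target, gamma=1.0))  # 1.0
```

If the 3.0 is an overestimate, vanilla DQN copies it straight into the target. Double DQN only uses it when the online network independently agrees that action 0 is best.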

Policy Gradients

Value-based methods learn how good actions are. Policy-gradient methods directly learn the policy.

The objective is expected return:

J (θ) = E_{τ \sim π_{θ}} [R (τ)]

The policy-gradient theorem gives:

\nabla_{θ} J (θ) = E_{τ \sim π_{θ}} [\sum_{t} \nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) G_{t}]

Intuition: increase the probability of actions that led to good returns, decrease the probability of actions that led to bad returns.

REINFORCE

REINFORCE is the simplest policy-gradient algorithm.

Sample trajectories from the current policy.
Compute returns.
Update policy using:

\nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) G_{t}

The problem is variance. A trajectory can get high reward for reasons unrelated to a specific action. So updates can be noisy.

A baseline reduces variance:

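\nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) (G_{t} - b (s_{t}))

If the baseline is $V (s_{t})$, then $G_{t} - V (s_{t})$ is a sampled estimate of the advantage. Subtracting a baseline that depends only on the state leaves the expected gradient unchanged, so it reduces variance without adding bias.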
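Here is a sketch of REINFORCE with a baseline on the smallest possible problem: a two-armed bandit where each episode is a single action. The setup is an assumption for illustration. The policy is a softmax over two logits, the rewards are deterministic, and the baseline is a running average of observed returns rather than a learned $V$. For a softmax policy, the gradient of $\log π (a)$ with respect to logit $k$ is $1 [k = a] - π (k)$.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(steps=2000, lr=0.2, baseline_rate=0.1, seed=0):
    rng = random.Random(seed)
    arm_rewards = [0.0, 1.0]  # hypothetical: arm 1 is strictly better
    theta = [0.0, 0.0]        # policy logits
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs)[0]
        G = arm_rewards[action]  # one-step episode, so the return is the reward
        advantage = G - baseline
        for k in range(2):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * grad_log_pi * advantage
        # Running average of returns serves as the baseline b.
        baseline += baseline_rate * (G - baseline)
    return softmax(theta)


print(reinforce_bandit())
```

The policy should end up putting most of its probability on arm 1. Pulling arm 0 gives a return below the baseline, so its probability is pushed down, and pulling arm 1 gives a return above it.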

Advantage

The advantage tells us whether an action was better than expected:

A (s, a) = Q (s, a) - V (s)

If $A > 0$, the action was better than what the policy does on average in that state. If $A < 0$, it was worse.

This is better than using raw return because some states are naturally good or bad. Advantage centers the learning signal around expectation.
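A short numeric sketch makes the centering visible. It uses the standard relation $V (s) = \sum_{a} π (a ∣ s) Q (s, a)$, which is consistent with the definitions earlier in this page. The Q-values and probabilities are made up.

```python
def advantages(q_values, policy_probs):
    """Return (V, [A(s, a) for each action]) for a single state."""
    # V is the policy-weighted average of the Q-values.
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return v, [q - v for q in q_values]


v, adv = advantages(q_values=[1.0, 0.0, -1.0], policy_probs=[0.5, 0.25, 0.25])
print(v, adv)  # 0.25 [0.75, -0.25, -1.25]
```

The second action has a Q-value of 0, which looks neutral in isolation, but its advantage is negative because the policy does better than that on average in this state. The advantages also average to zero under the policy.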

Actor-Critic

Actor-critic methods combine policy learning and value learning.

The actor is the policy $π_{θ} (a ∣ s)$ .
The critic estimates $V_{ϕ} (s)$ or $Q_{ϕ} (s, a)$ .

The critic provides the advantage estimate. The actor updates the policy using that advantage.

A common actor update is:

\nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) A_{t}

The critic is trained to reduce value prediction error:

(V_{ϕ} (s_{t}) - \text{target})^{2}

Actor-critic is usually more sample efficient than REINFORCE because the critic gives a lower-variance learning signal, but it introduces stability issues because two learned systems now depend on each other.
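One way to make this concrete is a single tabular actor-critic step. The sketch assumes a softmax actor with one list of logits per state, a dict-based critic, a one-step TD target for the critic, and the TD error as the advantage estimate. These are common choices, but they are assumptions here, and the function and argument names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma, actor_lr, critic_lr):
    """Update critic V and actor logits theta in place from one transition."""
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]  # used as the advantage estimate A_t

    # Critic: step toward the target (gradient of the squared error,
    # with the target treated as a constant).
    V[s] += critic_lr * td_error

    # Actor: policy-gradient step weighted by the advantage estimate.
    probs = softmax(theta[s])
    for k in range(len(theta[s])):
        grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
        theta[s][k] += actor_lr * grad_log_pi * td_error
    return td_error


theta = {0: [0.0, 0.0]}
V = {0: 0.0, 1: 0.0}
actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=True,
                  gamma=0.9, actor_lr=1.0, critic_lr=0.5)
print(V[0], theta[0])  # 0.5 [-0.5, 0.5]
```

The actor's update is only as good as `td_error`, which comes entirely from the critic. That dependency is where the stability problems start.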

Stability Issues in Actor-Critic

Actor-critic can fail because:

The critic is wrong, so the actor is optimized in the wrong direction.
The actor changes the data distribution, making the critic stale.
Large policy updates destroy useful behavior.
Advantage estimates can be noisy.

This is similar in spirit to DQN instability: the learning target is moving, and the model’s own predictions shape future training.

PPO

Proximal Policy Optimization tries to prevent destructive policy updates.

It compares the new policy to the old policy using the probability ratio:

r_{t} (θ) = \frac{π_{θ} (a_{t} ∣ s_{t})}{π_{θ_{\text{old}}} (a_{t} ∣ s_{t})}

The clipped PPO objective is:

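L^{\text{CLIP}} (θ) = E [\min (r_{t} (θ) A_{t}, \text{clip} (r_{t} (θ), 1 - ϵ, 1 + ϵ) A_{t})]

The clipping removes the incentive to move the new policy too far from the policy that generated the data. Once the ratio leaves $[1 - ϵ, 1 + ϵ]$ in the direction the advantage favors, the objective stops rewarding further change. Note that $r_{t} (θ)$ here is the probability ratio, not the reward $r_{t}$.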
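The clipped objective is easy to evaluate by hand for a few cases. This sketch takes per-sample log-probabilities under the new and old policies, which is how the ratio is usually computed in practice. The function name and the default `clip_eps=0.2` are assumptions.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the min makes the objective a pessimistic bound.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


up = math.log(1.5)    # new policy made the action 1.5x more likely
down = math.log(0.5)  # new policy made the action half as likely

print(ppo_clip_objective([up], [0.0], [1.0]))     # about 1.2: gain is capped
print(ppo_clip_objective([up], [0.0], [-1.0]))    # about -1.5: not capped
print(ppo_clip_objective([down], [0.0], [-1.0]))  # about -0.8: gain is capped
```

The asymmetry is deliberate. Moving in the direction the advantage favors stops paying off beyond the clip range, while a move that makes things worse is still fully penalized.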

PPO became popular because it is relatively simple and robust. It is not magic. It still depends heavily on reward design, advantage estimation, batch size, KL control, and data quality.

RLHF

RLHF means Reinforcement Learning from Human Feedback.

The common pipeline is:

Start with a pretrained language model.
Supervised fine-tune it on demonstrations or instruction data.
Collect human preferences between model outputs.
Train a reward model to predict those preferences.
Optimize the language model against the reward model, often with PPO.

The reward model replaces the environment reward. Instead of “win the game,” the model receives reward for outputs humans prefer.
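The reward-model step needs a loss that turns pairwise preferences into a scalar score. A common choice, assumed here rather than taken from the pipeline above, is the Bradley-Terry pairwise loss $- \log σ (r_{\text{chosen}} - r_{\text{rejected}})$. The sketch below computes it for a single pair of scalar scores.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(preference_loss(2.0, 0.0))  # small: the preferred output scores higher
print(preference_loss(0.0, 0.0))  # log(2): the model cannot tell them apart
print(preference_loss(0.0, 2.0))  # large: the model ranks the pair backward
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary. That is one reason the later policy optimization step has to be controlled carefully.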

But this introduces reward-model hacking. The policy can learn outputs that score well under the reward model without actually being better.

This is the same old RL problem in a language-model costume: any reward that can be exploited will be exploited.

GRPO

GRPO, or Group Relative Policy Optimization, is used in recent reasoning-model training pipelines.

Instead of relying on a separate value model as critic, GRPO samples a group of outputs for the same prompt and compares them relative to each other.

For one prompt, sample multiple completions:

y_{1}, y_{2}, \dots, y_{G}

Each gets a reward:

r_{1}, r_{2}, \dots, r_{G}

Then normalize within the group:

A_{i} = \frac{r_{i} - \text{mean} (r)}{\text{std} (r)}

So the learning signal is not “was this absolutely good?” but “was this better than other samples from the same model on the same prompt?”
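The normalization is a few lines of code. This sketch uses the population standard deviation and adds a small `eps` to the denominator so that a group with identical rewards does not divide by zero. Both choices are assumptions, since the formula above does not specify them.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt: two correct, two wrong.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # close to [1, -1, -1, 1]

# Every completion got the same reward: no learning signal at all.
print(group_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

The second case is worth noticing. If every sample in the group succeeds, or every sample fails, all advantages are zero and that prompt contributes nothing to the update.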

This is useful for reasoning tasks because you can sample many attempts, reward correct ones, and push the model toward the trajectories that worked.

But there is a catch: if the model is not actually reasoning and is just guessing, then GRPO can reinforce lucky guesses. This matches the failure mode from the Wordle training run: the model can appear to reason in its trace while the actual answer is still a guess.

Sparse Rewards and Reward Hacking

Sparse rewards mean the model receives useful signal rarely.

In Wordle, rewarding only the correct final answer is too sparse. Most trajectories fail, so the model has little information about what part of the behavior was better.

Shaped rewards help:

reward valid format
reward using allowed words
reward respecting previous feedback
reward green/yellow letter consistency
punish repeated invalid guesses

But reward shaping creates loopholes. If you reward “contains common letters,” the model may spam common letters. If you reward matching history, it may copy words from history. If you reward format too much, it may over-optimize formatting instead of solving.
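To make the loophole problem concrete, here is a hypothetical shaped reward for a single Wordle guess. The function, the weights, and the word list are all invented for illustration and are not the reward used in any actual training run. It covers only three of the shaping terms (format, allowed words, repeats) plus the final-answer reward.

```python
def shaped_reward(guess, history, allowed_words, target):
    """Hypothetical shaped reward for one guess; all weights are illustrative."""
    # Invalid format is penalized and earns nothing else.
    if not (len(guess) == 5 and guess.isalpha() and guess.islower()):
        return -0.5
    reward = 0.1  # valid format
    if guess in allowed_words:
        reward += 0.1  # real, allowed word
    if guess in history:
        reward -= 0.3  # repeating an earlier guess
    if guess == target:
        reward += 1.0  # solved
    return reward


allowed = {"crane", "slate", "pride"}
print(shaped_reward("crane", [], allowed, target="pride"))         # about 0.2
print(shaped_reward("crane", ["crane"], allowed, target="pride"))  # about -0.1
print(shaped_reward("pride", ["crane"], allowed, target="pride"))  # about 1.2
print(shaped_reward("Pride!", [], allowed, target="pride"))        # -0.5
```

Even this small function has an exploit. A policy that cycles through different allowed words without ever using the feedback collects 0.2 per guess, and with enough guesses that can rival the reward for actually solving the puzzle.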

The core lesson: reward design is programming the game the model will actually play.

Practical Mental Model

A useful map:

Bellman equations: recursive structure of value.
Monte Carlo: learn from complete returns.
TD learning: learn from bootstrapped one-step targets.
Q-learning: learn action values off-policy.
DQN: Q-learning with neural networks.
REINFORCE: direct policy gradient from sampled returns.
Actor-critic: policy gradient plus learned value baseline.
PPO: actor-critic with constrained policy updates.
RLHF: optimize language models with learned human preference rewards.
GRPO: compare multiple sampled completions within a prompt group.

What Matters in Real Training

The theory is clean. Real training is mostly about the ugly details:

Is the reward actually aligned with the behavior you want?
Is the reward too sparse?
Can the environment be hacked?
Are rollouts diverse enough?
Is the base model capable of the skill before RL?
Is the batch size large enough for stable learning?
Are you measuring pass@k, maj@k, or something else?
Are you improving reasoning or just selecting lucky samples?
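The two metrics named in the list can be sketched briefly. For pass@k this uses the standard unbiased estimator from $n$ samples with $c$ correct, $1 - \binom{n - c}{k} / \binom{n}{k}$. For maj@k it takes a plain majority vote over the first $k$ answers. The function names and the tie-breaking rule (the answer seen first wins) are assumptions.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Estimated probability that at least one of k samples is correct,
    given that c of n generated samples were correct."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers, correct, k):
    """1 if the most common answer among the first k samples is correct."""
    counts = Counter(answers[:k])
    majority_answer, _ = counts.most_common(1)[0]
    return 1 if majority_answer == correct else 0


print(pass_at_k(n=10, c=1, k=1))   # about 0.1
print(pass_at_k(n=10, c=1, k=10))  # 1.0
print(maj_at_k(["7", "9", "9", "7", "9"], correct="7", k=5))  # 0
```

In the last example two of the five samples are correct, so pass@5 would count the problem as solved while maj@5 does not. Which metric moves after RL says a lot about whether the model improved or the sampling just got luckier.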

For language models, the base model matters a lot. RL is better at amplifying behaviors already present than creating a totally missing capability from nothing.

If the SFT model has almost zero chance of solving a task, RL may just teach it to exploit the reward or produce more confident nonsense. If the SFT model sometimes solves the task, RL can shift probability mass toward successful behavior.

That is the practical bridge from RL theory to post-training.