Reinforcement Learning Explained
Reinforcement learning (RL) teaches an agent to act well in an environment by trial and error. AlphaGo, Atari-playing agents, ChatGPT's fine-tuning (RLHF), and robotic arms all use some form of RL. At the core are just two equations: the Bellman equation and the Q-learning update rule.
Agent–Environment Loop
At every time step, the agent observes the state of the environment, chooses an action, and receives a reward and the next state.
- State sₜ: A complete description of the situation (chess board positions, robot joint angles, pixel array).
- Action aₜ: What the agent does (move left, apply torque, write a word).
- Reward rₜ: A scalar signal — +1 for winning, −1 for losing, 0 otherwise, or some continuous value.
- Policy π(s): The agent's strategy — a function mapping state to action.
The goal is to find the policy π* that maximises the expected cumulative reward over time.
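As a rough illustration of the loop described above, the sketch below runs one episode in a hypothetical five-cell corridor. The `Corridor` class, its `reset`/`step` methods, and the `always_right` policy are invented for this example and do not come from any library.

```python
class Corridor:
    """Hypothetical toy environment: cells 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def always_right(state):
    """A fixed policy: a function mapping state to action."""
    return +1


def run_episode(env, policy, max_steps=100):
    state = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(Corridor(), always_right)
```

Under these assumptions the agent reaches the goal in four steps and collects a total reward of 1.0.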
Markov Decision Processes
The formal framework for RL is the Markov Decision Process (MDP), defined by a tuple (S, A, P, R, γ):
- S: State space
- A: Action space
- P(s'|s,a): Transition probability — probability of ending in state s' after taking action a in state s
- R(s,a): Reward function
- γ (gamma): Discount factor, 0 ≤ γ < 1
The Markov property says the next state depends only on the current state and action, not on the history. In practice, the agent often doesn't know P or R — it must learn from experience.
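To make the tuple concrete, here is a minimal sketch of a two-state MDP written as plain Python dictionaries. The state names, action names, probabilities, and rewards are all made up for illustration.

```python
import random

# Hypothetical two-state MDP (S, A, P, R, gamma); all numbers are illustrative.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
GAMMA = 0.9

# P[(s, a)] maps each possible next state s' to P(s'|s, a)
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"): 0.0,
    ("s1", "stay"): 1.0,
    ("s1", "go"): 0.0,
}


def sample_next_state(state, action, rng):
    """Draw s' from P(.|s, a). It depends only on (s, a): the Markov property."""
    dist = P[(state, action)]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    return rng.choices(next_states, weights=weights)[0]


rng = random.Random(0)
next_state = sample_next_state("s0", "go", rng)
```

A learning agent would typically not see `P` or `R` directly; it would only observe samples such as `next_state` and the reward that comes with them.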
Returns and Discounting
The return Gₜ is the total reward from time t onwards. We don't weight future rewards equally: a reward now is better than the same reward far in the future. The discounted return is:
$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
With γ = 0.99, a reward 100 steps away is worth only 0.99¹⁰⁰ ≈ 0.37 of a reward right now. γ = 0 means the agent is completely myopic; γ → 1 means it plans very far ahead. Typical values: 0.95–0.999.
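The discounted return of a finite reward sequence can be computed with a short backward pass. The function name `discounted_return` is chosen for this sketch only.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    g = 0.0
    # Work backwards so each step is just G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
short_return = discounted_return([1.0, 1.0, 1.0], 0.5)

# A single reward of 1 that arrives 100 steps in the future, with gamma = 0.99
delayed_return = discounted_return([0.0] * 100 + [1.0], 0.99)
```

Here `short_return` is 1.75, and `delayed_return` comes out at roughly 0.37, matching the figure quoted above.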
Value Functions and Policy
The state-value function V^π(s) is the expected return when starting from state s and following policy π:
$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right]$$
The action-value function Q^π(s, a) is the expected return when taking action a in state s, then following π:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s,\ a_t = a \right]$$
If you know the optimal Q-values Q*(s,a), the optimal policy is simple: always pick the action with the highest Q-value, π*(s) = argmaxₐ Q*(s,a).
The Bellman Equation
The Bellman optimality equation for Q* expresses the recursive relationship between Q-values:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$
This says: the value of (state s, action a) is the immediate reward plus the discounted value of the best action from the next state, averaged over the possible next states s'. For γ < 1, this self-consistency condition uniquely determines Q*.
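When P and R are known, the Bellman optimality equation can be solved by applying it repeatedly as an update until the Q-values stop changing (often called Q-value iteration). The sketch below does this for a made-up two-state MDP; the states, actions, and numbers are assumptions for illustration only.

```python
# Hypothetical two-state MDP; all numbers are illustrative.
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 0.8, "s1": 0.2},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}


def q_value_iteration(tol=1e-10):
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    while True:
        new_Q = {}
        for s in STATES:
            for a in ACTIONS:
                # Expected value of acting greedily from the next state
                expected_best_next = sum(
                    prob * max(Q[(s2, a2)] for a2 in ACTIONS)
                    for s2, prob in P[(s, a)].items()
                )
                new_Q[(s, a)] = R[(s, a)] + GAMMA * expected_best_next
        delta = max(abs(new_Q[k] - Q[k]) for k in Q)
        Q = new_Q
        if delta < tol:  # Q now (almost) satisfies the Bellman equation
            return Q


Q_star = q_value_iteration()
greedy_policy = {s: max(ACTIONS, key=lambda a: Q_star[(s, a)]) for s in STATES}
```

In this toy problem the greedy policy is to `go` from `s0` and `stay` in `s1`, where staying earns a reward of 1 at every step, so Q*(s1, stay) = 1 / (1 − 0.9) = 10.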
Q-Learning
Q-learning is a model-free algorithm that uses sampled experience to converge to Q* without knowing the environment's transition probabilities. The update rule, applied after each (s, a, r, s') transition:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
The part in brackets is the TD error (temporal difference error) — how wrong the current Q-value is relative to the Bellman target. α is the learning rate.
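A single Q-learning update fits in a few lines. In this sketch the Q-table is a `defaultdict` keyed by (state, action); the function name, the `done` flag, and the convention that a terminal state contributes zero future value are assumptions of the example.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-learning update in place and return the TD error."""
    # Assumption: no future value is bootstrapped from a terminal state
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next   # the Bellman target
    td_error = td_target - Q[(s, a)]    # how far off the current estimate is
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)
Q[("s1", "right")] = 1.0
error = q_learning_update(Q, "s0", "right", 0.0, "s1",
                          actions=["left", "right"], alpha=0.5, gamma=0.9)
```

With these numbers the target is 0 + 0.9 × 1.0 = 0.9, so the TD error is 0.9 and Q(s0, right) moves halfway there, to 0.45.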
For small discrete state/action spaces, Q-values are stored in a table. Example for a simple 2×2 grid with 4 movement actions (three of the four states shown):
| State | Left | Right | Up | Down |
|---|---|---|---|---|
| s₀ | 0.0 | 0.8 | 0.2 | 0.1 |
| s₁ | 0.3 | 0.1 | 0.9 | 0.4 |
| s₂ | 1.0 | 0.0 | 0.6 | 0.2 |
The agent picks the highest-Q action in each state: Right in s₀, Up in s₁, and Left in s₂. For any finite MDP, Q-learning converges to the optimal values provided every state-action pair keeps being visited and the learning rate is decayed appropriately.
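The sketch below trains a Q-table on a hypothetical five-cell corridor (goal at the right end, reward +1 on arrival). The environment, hyperparameters, and the purely random behaviour policy are assumptions of this example; random behaviour is acceptable here because Q-learning learns about the greedy policy regardless of how actions are chosen, as long as every pair gets visited.

```python
import random

N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = [-1, +1]  # left, right


def step(state, action):
    """Deterministic corridor dynamics with walls at both ends."""
    next_state = min(max(state + action, 0), GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # explore uniformly at random
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            td_error = reward + gamma * best_next - Q[(state, action)]
            Q[(state, action)] += alpha * td_error
            state = next_state
    return Q


Q = train()
# Greedy action in every non-terminal state
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

After training, the greedy action is +1 (right) in every non-terminal cell, and Q(3, right) is close to 1 because that move reaches the goal directly.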
Exploration vs Exploitation
A pure greedy agent always picks the highest-Q action. But what if the Q-values are wrong early in training? It might miss better alternatives. The agent needs to explore.
ε-greedy
With probability ε, take a random action; otherwise take the greedy action. ε is typically annealed from 1.0 → 0.05 over training.
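A minimal sketch of ε-greedy action selection with a linear annealing schedule follows. The function names and the `decay_steps` parameter are assumptions of this example; other schedules (for instance exponential decay) are also common.

```python
import random


def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it there."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of the current state."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_row = [0.0, 0.8, 0.2, 0.1]  # e.g. Left, Right, Up, Down
greedy_action = epsilon_greedy(q_row, epsilon=0.0)
explore_action = epsilon_greedy(q_row, epsilon=1.0, rng=random.Random(0))
```

With `epsilon=0.0` the call always returns index 1 (the highest Q-value); with `epsilon=1.0` it returns a uniformly random index.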
Deep Q-Networks (DQN)
For large or continuous state spaces (e.g., raw pixels from an Atari game), a Q-table has too many entries to store. A Deep Q-Network (DQN) replaces the table with a neural network: Q(s, a; θ) ≈ Q*(s, a).
The network takes the state as input and outputs one Q-value for each possible action. In DeepMind's 2015 DQN paper, the input was four stacked 84×84 grayscale frames; a convolutional network followed by two fully-connected layers output one Q-value per valid action, between 4 and 18 depending on the game (18 being the full set of Atari joystick and button combinations).
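To see how a stack of 84×84 frames shrinks to a vector of Q-values, the sketch below walks tensor shapes through a DQN-style network using plain arithmetic. The filter counts, kernel sizes, strides, and hidden width are the values commonly quoted for the 2015 architecture; they are not stated in this article, so treat them as assumptions.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


# (filters, kernel, stride) per conv layer: assumed, see the note above
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def dqn_shapes(input_size=84, in_channels=4, hidden=512, n_actions=18):
    """Return the activation shape after each layer (no actual network is built)."""
    shapes = [(in_channels, input_size, input_size)]
    size = input_size
    for filters, kernel, stride in CONV_LAYERS:
        size = conv_out(size, kernel, stride)
        shapes.append((filters, size, size))
    flat = shapes[-1][0] * size * size
    shapes.append((flat,))       # flattened conv features
    shapes.append((hidden,))     # first fully-connected layer
    shapes.append((n_actions,))  # second fully-connected layer: one Q-value per action
    return shapes


shapes = dqn_shapes()
```

Under these assumptions the spatial size goes 84 → 20 → 9 → 7, the flattened feature vector has 64 × 7 × 7 = 3136 entries, and the final layer emits 18 Q-values.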
Two stability tricks DQN relies on
- Replay buffer: Store past transitions in a buffer and sample mini-batches randomly during training. Breaks temporal correlations that destabilise gradient descent.
- Target network: A second copy of the network with frozen weights, used to compute the Bellman target. Its weights are copied from the online network every N steps, which prevents the target from moving on every update.
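Both tricks can be sketched without a deep-learning library. Below, `ReplayBuffer` is a fixed-size store with uniform sampling, and `maybe_sync_target` copies the online parameters into the target every `sync_every` steps. The class and function names, and the use of a plain dictionary to stand in for network weights, are assumptions of this example.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def maybe_sync_target(online_params, target_params, step, sync_every):
    """Return a fresh copy of the online parameters every sync_every steps."""
    if step % sync_every == 0:
        return copy.deepcopy(online_params)
    return target_params  # otherwise the target stays frozen


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push(t, 0, 0.0, t + 1, False)
batch = buffer.sample(2)

online = {"w": [1.0, 2.0]}
target = {"w": [0.0, 0.0]}
target_at_5 = maybe_sync_target(online, target, step=5, sync_every=10)
target_at_10 = maybe_sync_target(online, target, step=10, sync_every=10)
```

In a real DQN the parameters would be network weights and each sampled batch would feed a gradient step; here only the bookkeeping is shown.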
Beyond Q-Learning
- Policy Gradient (REINFORCE): Directly optimise the policy parameters by following the gradient of expected return. Works for continuous action spaces.
- Actor-Critic (A3C, SAC, PPO): Combine a policy network (actor) with a value network (critic). PPO is the workhorse of RLHF for language model fine-tuning.
- AlphaZero: Uses Monte Carlo Tree Search (MCTS) guided by a neural network that estimates both value and policy. It uses no hand-crafted features and learns entirely from self-play, given only the rules of the game.
- Model-based RL (MuZero, Dreamer): The agent learns a model of the environment and plans within that model, which typically improves sample efficiency.
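To give a flavour of the policy-gradient entry in the list above, here is a minimal REINFORCE sketch on a hypothetical two-armed bandit, the simplest possible setting: one state and one-step episodes. The softmax parameterisation, the reward values, and all names are assumptions of this example, and no baseline or other variance reduction is used.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Learn softmax action preferences theta by following reward * grad log pi."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = arm_rewards[action]
        for i in range(len(theta)):
            # d/d theta_i of log pi(action) for a softmax policy
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * reward * grad_log_pi
    return softmax(theta)


# Arm 0 always pays 0, arm 1 always pays 1
final_probs = reinforce_bandit([0.0, 1.0])
```

The policy itself is what gets adjusted here; no Q-table is involved. After training, nearly all probability mass sits on the rewarding arm.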