
📅 March 2026 ⏱ ~9 min read 🟡 Intermediate

Reinforcement Learning Explained

Reinforcement learning (RL) teaches an agent to act well in an environment by trial and error. AlphaGo, Atari-playing agents, ChatGPT's fine-tuning (RLHF), and robotic arms all use some form of RL. At the core of the methods covered here are two ideas: the Bellman equation and the Q-learning update rule.

Agent–Environment Loop

At every time step, the agent observes the state of the environment, chooses an action, and receives a reward and the next state.

Agent → aₜ → Environment → sₜ₊₁, rₜ → Agent

State sₜ: A complete description of the situation (chess board positions, robot joint angles, pixel array).
Action aₜ: What the agent does (move left, apply torque, write a word).
Reward rₜ: A scalar signal — +1 for winning, −1 for losing, 0 otherwise, or some continuous value.
Policy π(s): The agent's strategy, a function mapping each state to an action (or, for a stochastic policy, to a probability distribution over actions).

The goal is to find the policy π* that maximises the expected total reward accumulated over time.
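The loop described above can be sketched in a few lines of Python. The `CorridorEnv` class, its reward scheme, and the `run_episode` helper are hypothetical stand-ins invented for illustration; they do not come from any library.

```python
class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = left, 1 = right; the walls clamp the position
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0   # reward only on reaching the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one episode: observe state, choose action, receive reward and next state."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action = policy(state)                   # the agent chooses a_t
        state, reward, done = env.step(action)   # the environment returns s_{t+1}, r_t
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps
```

For example, `run_episode(CorridorEnv(), lambda s: 1)` runs a policy that always moves right until it reaches the goal.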

Markov Decision Processes

The formal framework for RL is the Markov Decision Process (MDP), defined by a tuple (S, A, P, R, γ):

S: State space
A: Action space
P(s'|s,a): Transition probability — probability of ending in state s' after taking action a in state s
R(s,a): Reward function
γ (gamma): Discount factor, 0 ≤ γ < 1

The Markov property says the next state depends only on the current state and action, not on the history. In practice, the agent often doesn't know P or R — it must learn from experience.
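One way to make the tuple concrete is to write a tiny MDP down as plain data. The two-state example below is hypothetical (the state names, action names, probabilities, and rewards are all invented for illustration). `sample_transition` shows the Markov property in code: it needs only the current state and action.

```python
import random

# Hypothetical two-state MDP. P[(s, a)] lists (probability, next_state, reward).
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9
P = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    ("hot", "slow"): [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
    ("hot", "fast"): [(1.0, "hot", -10.0)],
}


def sample_transition(state, action, rng=random):
    """Draw (next_state, reward) from P using only the current state and action."""
    outcomes = P[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

A model-free agent never sees `P` directly; it only observes samples like those returned by `sample_transition`.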

Returns and Discounting

The return Gₜ is the total reward from time t onwards. We don't weight future rewards equally — a reward now is better than the same reward far in the future. The discounted return is:

Gₜ = rₜ + γ·rₜ₊₁ + γ²·rₜ₊₂ + ... = Σ_{k=0}^{∞} γᵏ rₜ₊ₖ

With γ = 0.99, a reward 100 steps away is worth only 0.99¹⁰⁰ ≈ 0.37 of a reward right now. γ = 0 means the agent is completely myopic; γ → 1 means it plans very far ahead. Typical values: 0.95–0.999.
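As a quick check of the arithmetic, here is a minimal sketch that computes the discounted return of a finite reward sequence. The function name `discounted_return` is an assumption made for this example.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite list."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A single reward of 1 that arrives 100 steps in the future, with gamma = 0.99
delayed = discounted_return([0.0] * 100 + [1.0], gamma=0.99)
```

Here `delayed` equals 0.99¹⁰⁰, which is the ≈ 0.37 figure quoted above.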

Value Functions and Policy

The state-value function V^π(s) is the expected return when starting from state s and following policy π:

V^π(s) = 𝔼_π[Gₜ | sₜ = s]

The action-value function Q^π(s, a) is the expected return when taking action a in state s and following π afterwards:

Q^π(s, a) = 𝔼_π[Gₜ | sₜ = s, aₜ = a]

If you know the optimal Q-values Q*(s,a), the optimal policy follows directly: always pick the action with the highest Q-value, π*(s) = argmax_a Q*(s,a).
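Reading a policy off a set of Q-values takes one line per state. The numbers below are hypothetical and chosen only for illustration; if they were the optimal values, `greedy_policy` would return π* and `state_values` would return V*(s) = max_a Q*(s, a).

```python
# Hypothetical Q-values for two states and three actions (illustrative numbers only).
Q = {
    "s0": {"left": 0.1, "stay": 0.4, "right": 0.7},
    "s1": {"left": 0.9, "stay": 0.2, "right": 0.3},
}


def greedy_policy(Q):
    """pi(s) = argmax_a Q(s, a): pick the highest-valued action in each state."""
    return {s: max(q_row, key=q_row.get) for s, q_row in Q.items()}


def state_values(Q):
    """V(s) = max_a Q(s, a): the value of acting greedily from each state."""
    return {s: max(q_row.values()) for s, q_row in Q.items()}


policy = greedy_policy(Q)
values = state_values(Q)
```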

The Bellman Equation

The Bellman optimality equation for Q* expresses the recursive relationship between Q-values:

Q*(s, a) = 𝔼[ r + γ · max_a' Q*(s', a') ]

This says: the value of (state s, action a) is the immediate reward plus the discounted value of the best action from the next state. This self-consistency condition uniquely determines Q*.

Why it matters: The Bellman equation turns the problem of finding the optimal policy into a fixed-point iteration. When the model (P and R) is known and γ < 1, the Bellman update is a contraction, so starting from any Q-values and applying it repeatedly converges to Q*.
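The fixed-point idea can be sketched as Q-value iteration on a tiny MDP whose model is known. The two-state MDP below is hypothetical, and the representation `P[(s, a)] = [(probability, next_state, reward), ...]` is an assumption made for this example.

```python
def q_value_iteration(P, states, actions, gamma, iterations=500):
    """Repeatedly apply the Bellman optimality update to every (state, action) pair."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iterations):
        # Synchronous sweep: the right-hand side reads the previous Q throughout
        Q = {
            (s, a): sum(
                p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                for p, s2, r in P[(s, a)]
            )
            for s in states
            for a in actions
        }
    return Q


# Hypothetical MDP: staying in state 1 pays 1 per step, everything else pays 0.
states, actions = [0, 1], ["stay", "move"]
P = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(1.0, 1, 0.0)],
    (1, "stay"): [(1.0, 1, 1.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}
Q_star = q_value_iteration(P, states, actions, gamma=0.9)
```

With γ = 0.9, staying in state 1 forever is worth 1 / (1 − 0.9) = 10, so `Q_star[(1, "stay")]` approaches 10 and `Q_star[(0, "move")]` approaches 0.9 × 10 = 9.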

Q-Learning

Q-learning is a model-free algorithm that uses sampled experience to converge to Q* without knowing the environment's transition probabilities. The update rule, applied after each (s, a, r, s') transition:

Q(s,a) ← Q(s,a) + α · [ r + γ · max_a'Q(s',a') − Q(s,a) ]

The part in brackets is the TD error (temporal difference error) — how wrong the current Q-value is relative to the Bellman target. α is the learning rate.
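A single update can be traced by hand. The sketch below applies the rule to one transition; the `q_update` name and the `done` flag are assumptions added for this example (the flag treats terminal states as having zero future value).

```python
def q_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one Q-learning update in place and return the TD error."""
    best_next = 0.0 if done else max(Q[s_next])   # max over a' of Q(s', a')
    td_error = r + gamma * best_next - Q[s][a]    # Bellman target minus current estimate
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical table: two states, two actions each
Q = {0: [0.5, 0.0], 1: [2.0, 1.0]}
td = q_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
```

Here the target is 1 + 0.9 × 2.0 = 2.8, the TD error is 2.8 − 0.5 = 2.3, and Q(0, 0) moves from 0.5 to 0.5 + 0.1 × 2.3 = 0.73.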

For small discrete state/action spaces, Q-values are stored in a table. Example rows for a simple 2×2 grid with 4 movement actions (three of the four states shown):

State	Left	Right	Up	Down
s₀	0.0	0.8	0.2	0.1
s₁	0.3	0.1	0.9	0.4
s₂	1.0	0.0	0.6	0.2

The greedy agent picks the highest-Q action in each row: Right in s₀, Up in s₁, Left in s₂. For any finite MDP, Q-learning converges to the optimal values provided every state-action pair keeps being visited and the learning rate α decays appropriately.
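Putting the pieces together, here is a minimal sketch of tabular Q-learning on a hypothetical five-cell corridor (reward 1 for reaching the rightmost cell). The environment, the hyperparameters, and the random tie-breaking in `greedy` are all assumptions chosen for illustration.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = [0, 1]        # 0 = left, 1 = right


def step(state, action):
    """Deterministic corridor dynamics; reward 1 only on reaching the goal."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def greedy(q_row, rng):
    """Highest-Q action, breaking ties at random so early episodes still explore."""
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy
            a = rng.choice(ACTIONS) if rng.random() < epsilon else greedy(Q[s], rng)
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])   # Q-learning update
            s = s2
    return Q


Q = train()
```

After training, moving right has the higher Q-value in every non-goal state, and the values fall off by a factor of γ per step of distance from the goal.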

Exploration vs Exploitation

A pure greedy agent always picks the highest-Q action. But what if the Q-values are wrong early in training? It might miss better alternatives. The agent needs to explore.

ε-greedy

With probability ε, take a random action; otherwise take the greedy action. ε is typically annealed from 1.0 → 0.05 over training.

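import random

def select_action(Q, state, epsilon):
    """Epsilon-greedy selection. Q[state] is a list of Q-values, one per action index."""
    n_actions = len(Q[state])
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: uniform random action
    # exploit: index of the action with the highest Q-value
    return max(range(n_actions), key=lambda a: Q[state][a])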
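The annealing of ε can be sketched as a simple linear schedule. The function name and the `decay_steps` value are assumptions for this example; exponential schedules are another common choice.

```python
def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`, then hold at `end`."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


# Epsilon at the start, midway through, and after the schedule has finished
schedule = [linear_epsilon(t) for t in (0, 5_000, 20_000)]
```

Early in training the agent acts almost entirely at random; once the schedule finishes, it explores on only about 5% of steps.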

Deep Q-Networks (DQN)

For large or continuous state spaces (e.g., raw pixels from an Atari game), a Q-table has too many entries to store. A Deep Q-Network (DQN) replaces the table with a neural network: Q(s, a; θ) ≈ Q*(s, a).

The network takes the state as input and outputs one Q-value for each possible action. In DeepMind's 2015 DQN paper, the input was four stacked 84×84 grayscale frames; a CNN followed by two fully-connected layers output one Q-value per valid action, between 4 and 18 depending on the game (18 being the full set of Atari joystick and button combinations).

Two stability tricks DQN relies on

Replay buffer: Store past transitions in a buffer and sample mini-batches randomly during training. Breaks temporal correlations that destabilise gradient descent.
Target network: A second copy of the network with frozen weights used to compute the Bellman target. Updated every N steps. Prevents the target from moving every step.
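Both tricks are small in code. The sketch below is an illustration only: the `ReplayBuffer` class and the `maybe_sync` helper are hypothetical names, and the network weights are represented as a plain dictionary rather than a real neural network.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal order of the transitions
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def maybe_sync(step, online_weights, target_weights, sync_every=1000):
    """Return a frozen copy of the online weights every `sync_every` steps, else keep the target."""
    if step % sync_every == 0:
        return copy.deepcopy(online_weights)
    return target_weights
```

Between syncs the target weights stay fixed, so the Bellman target changes only when `maybe_sync` copies the online weights across.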

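# DQN update (PyTorch sketch; the batch layout and the `done` flag are assumptions)
import torch

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch  # a: int64 action indices; done: 1.0 where s_next is terminal
    with torch.no_grad():  # the target is treated as a constant
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    prediction = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for the taken actions
    loss = torch.nn.functional.mse_loss(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()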

Beyond Q-Learning

Policy Gradient (REINFORCE): Directly optimise the policy parameters by following the gradient of expected return. Works for continuous action spaces.
Actor-Critic (A3C, SAC, PPO): Combine a policy network (actor) with a value network (critic). PPO is the workhorse of RLHF for language model fine-tuning.
AlphaZero: Uses Monte Carlo Tree Search (MCTS) guided by a neural network that estimates both value and policy. It uses no hand-crafted features and learns entirely from self-play, given only the rules of the game.
Model-based RL (MuZero, Dreamer): The agent learns a model of the environment and plans within that model, which often improves sample efficiency.
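To give a flavour of the policy-gradient idea from the list above, here is a minimal REINFORCE sketch on a hypothetical two-armed bandit with a softmax policy. The reward values, learning rate, and function names are all assumptions chosen for illustration, and the sketch omits refinements such as baselines.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)   # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards=(0.0, 1.0), steps=2000, lr=0.1, seed=0):
    """REINFORCE with one preference per arm: theta += lr * G * grad log pi(action)."""
    rng = random.Random(seed)
    theta = [0.0 for _ in rewards]
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        G = rewards[action]   # a one-step episode, so the return is the reward
        for a in range(len(theta)):
            # Gradient of log softmax: 1[a == action] - pi(a)
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += lr * G * grad_log
    return softmax(theta)


final_probs = reinforce_bandit()
```

No Q-table is involved: the policy parameters are adjusted directly, and probability mass shifts towards the arm that pays out.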
