How DNA Works — From Double Helix to Protein

Structure of DNA

DNA (deoxyribonucleic acid) is a polymer — a long chain of repeating units called nucleotides. Each nucleotide consists of three parts:

A deoxyribose sugar (5-carbon)
A phosphate group
One of four nitrogenous bases: Adenine (A), Thymine (T), Guanine (G), Cytosine (C)

Nucleotides link together via phosphodiester bonds between the phosphate of one and the sugar of the next, forming the "backbone" of one strand. Two antiparallel strands coil around each other, held together by hydrogen bonds between the bases. This is the famous double helix, first proposed by Watson and Crick in 1953, drawing on Rosalind Franklin's X-ray crystallography data.

Double helix dimensions: One full helical turn spans about 10 base pairs (bp) and about 3.4 nm in length, a rise of ~0.34 nm per base pair. The helix diameter is ~2 nm. A human genome contains ~3.2 billion base pairs per haploid set; fully stretched out, one set would span roughly 1 m.
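The ~1 m figure follows directly from the helix dimensions. A minimal sketch of the arithmetic, using only the numbers quoted above (the function and constant names are illustrative):

```python
# Figures quoted in the text above.
BASE_PAIRS_PER_TURN = 10
TURN_LENGTH_NM = 3.4
HAPLOID_GENOME_BP = 3.2e9


def stretched_length_m(base_pairs: float) -> float:
    """Length of fully extended double-helical DNA, in metres."""
    rise_per_bp_nm = TURN_LENGTH_NM / BASE_PAIRS_PER_TURN  # 0.34 nm per base pair
    return base_pairs * rise_per_bp_nm * 1e-9  # convert nm to m


print(f"{stretched_length_m(HAPLOID_GENOME_BP):.2f} m")  # 1.09 m
```

A diploid human cell carries two such sets, so roughly 2 m of DNA is packed into each nucleus.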

The Base-Pairing Rules

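The two DNA strands are complementary: the sequence of one strand determines the sequence of the other, via strict base-pairing rules. The first evidence came from Erwin Chargaff's measurements in 1950, which showed that in any DNA sample the amount of A equals the amount of T, and the amount of G equals the amount of C: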

A — T (Adenine–Thymine)

Paired by 2 hydrogen bonds. Adenine (purine, double-ring) pairs only with Thymine (pyrimidine, single-ring).

G — C (Guanine–Cytosine)

Paired by 3 hydrogen bonds. Guanine (purine) pairs only with Cytosine (pyrimidine). This extra bond makes G-C pairs stronger.

Because the base-pairing rules are so rigid, knowing the sequence of one strand tells you the sequence of the other exactly. This property is what makes DNA replication and information transfer possible.
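Because the pairing rules are a fixed one-to-one mapping, deriving the partner strand is a simple lookup. The sketch below is one way to express that in Python; the function names are illustrative, and the hydrogen-bond counts are the 2 and 3 given above.

```python
# Watson-Crick partners and hydrogen bonds per pair, as stated above.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5'->3' like the input."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds holding this strand to its partner."""
    return sum(H_BONDS[base] for base in strand.upper())


strand = "ATGCGTACC"
print(reverse_complement(strand))  # GGTACGCAT
print(hydrogen_bonds(strand))      # 23
```

The strand is reversed before complementing because the two strands are antiparallel: the partner read 5'→3' runs in the opposite direction.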

DNA Replication

Before a cell divides, it must copy all of its DNA so each daughter cell gets a complete genome. DNA replication is semi-conservative: each new double helix consists of one original strand and one newly synthesised strand.

1. Unwinding: Helicase

The enzyme helicase breaks the hydrogen bonds between base pairs, unzipping the double helix at a "replication fork". Energy comes from ATP hydrolysis.

2. Priming: Primase

Primase synthesises a short RNA primer (~10 nucleotides) that provides the 3'-OH end DNA polymerase needs to start building.

3. Synthesis: DNA Polymerase III

DNA Pol III (the main replicative polymerase in bacteria such as E. coli) reads the template strand 3'→5' and adds complementary nucleotides 5'→3' at about 1,000 bases per second. The leading strand is synthesised continuously; the lagging strand is synthesised discontinuously, in short Okazaki fragments.

4. Sealing: DNA Ligase

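The RNA primers are removed and replaced with DNA (by DNA polymerase I in bacteria); ligase then seals the remaining nicks, joining the Okazaki fragments into a continuous strand. Error rate: ~1 mistake per 10⁹–10¹⁰ base pairs, thanks to polymerase proofreading and mismatch repair.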
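The semi-conservative outcome of these steps can be illustrated with a toy model. This is only a sketch: it ignores primers, Okazaki fragments and strand direction, and simply builds a new partner on each parental strand. The names and the duplex representation are assumptions made for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Base-by-base partner of a strand (position i pairs with position i)."""
    return "".join(COMPLEMENT[base] for base in strand)


def replicate(duplex: tuple[str, str]) -> list[tuple[str, str]]:
    """Unzip a duplex and build a new partner on each old strand.

    A duplex is (top, bottom), with bottom written 3'->5' so that
    position i of one strand pairs with position i of the other.
    """
    top, bottom = duplex
    # Each daughter keeps one parental strand and gains one new strand.
    return [(top, complement(top)), (complement(bottom), bottom)]


parent = ("ATGCGT", "TACGCA")
daughters = replicate(parent)
print(daughters)  # two duplexes, each identical in sequence to the parent
```

Both daughters carry the parental sequence, yet each contains exactly one of the original strands, which is what "semi-conservative" means.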

Speed of replication: E. coli's DNA polymerase adds ~1,000 nucleotides/second and copies its 4.6 million bp genome in about 40 minutes from a single origin. Human replication forks are slower (roughly 50 nucleotides/second), so the 3.2 billion bp human genome relies on ~30,000 replication origins, which fire in waves rather than all at once. Even so, copying the genome takes about 8 hours.
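As a rough check on these figures, the sketch below divides genome size by total synthesis rate. It assumes each origin launches two forks travelling in opposite directions (not stated above) and that every fork runs at full speed the whole time; the function name is illustrative.

```python
def replication_minutes(genome_bp: float, forks: int, nt_per_second: float) -> float:
    """Idealised copy time: all forks run in parallel at a constant rate."""
    return genome_bp / (forks * nt_per_second) / 60


# E. coli: one origin, assumed to give two forks at ~1,000 nt/s each.
ecoli = replication_minutes(4.6e6, forks=2, nt_per_second=1000)
print(f"E. coli: ~{ecoli:.0f} min")  # ~38 min, close to the quoted 40

# Hypothetical: the human genome copied from a single origin at the same rate.
human_single_origin_days = replication_minutes(3.2e9, forks=2, nt_per_second=1000) / (60 * 24)
print(f"Human, one origin: ~{human_single_origin_days:.1f} days")  # ~18.5 days
```

The second number shows why a genome of this size cannot be copied from one origin in a single cell cycle: thousands of origins are needed to bring the time down to hours.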

Transcription — DNA to mRNA

Cells don't use DNA directly to make proteins. Instead, the relevant section of DNA is first copied into a single-stranded messenger RNA (mRNA) molecule — a process called transcription, performed by RNA polymerase.

RNA differs chemically from DNA in two ways: its sugar is ribose (not deoxyribose), and it uses Uracil (U) in place of Thymine (T). Uracil pairs with Adenine.

DNA template:    3'–TACGCATGG–5'
                 ↓ RNA Polymerase
mRNA transcript: 5'–AUGCGUACC–3'
Rule (template → mRNA): T → A, A → U, G → C, C → G
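The template-to-mRNA rule is another fixed lookup. A minimal sketch, assuming the template strand is supplied 3'→5' as in the example above (the function name is illustrative):

```python
# Template base -> mRNA base, following the rule above.
DNA_TO_RNA = {"T": "A", "A": "U", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """Take a template strand written 3'->5'; return the mRNA written 5'->3'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5.upper())


print(transcribe("TACGCATGG"))  # AUGCGUACC
```

No reversal is needed here because the template is already written in the direction the polymerase reads it.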

In eukaryotes (animals, plants, fungi), the pre-mRNA is processed in the nucleus: introns (non-coding segments) are spliced out, a 5'-cap and poly-A tail are added, and the mature mRNA exits to the cytoplasm.
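Splicing and tailing can be sketched as simple string operations. This is a toy model under stated assumptions: intron positions are given explicitly as 0-based, end-exclusive spans (real spliceosomes recognise sequence signals), the tail is far shorter than a real poly-A tail, and the 5'-cap is not modelled because it is a modified nucleotide rather than a plain base. All names are illustrative.

```python
def process_pre_mrna(pre_mrna: str, introns: list[tuple[int, int]], tail_length: int = 5) -> str:
    """Remove introns and append a poly-A tail to the joined exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons) + "A" * tail_length


EXON_1 = "AUGGCC"
INTRON = "GUAAGUUUUCAG"  # made-up intron for the example
EXON_2 = "GAAUAA"

pre_mrna = EXON_1 + INTRON + EXON_2
intron_span = (len(EXON_1), len(EXON_1) + len(INTRON))
print(process_pre_mrna(pre_mrna, [intron_span]))  # AUGGCCGAAUAAAAAAA
```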

Translation — mRNA to Protein

Ribosomes read the mRNA strand in groups of three nucleotides called codons. Each codon specifies one amino acid, or a start or stop signal. This mapping is the genetic code.

Codon (mRNA)	Amino acid	Note
AUG	Methionine	Start codon — translation begins here
UUU / UUC	Phenylalanine
GAA / GAG	Glutamic acid
GGU / GGC / GGA / GGG	Glycine	Four synonymous codons
CCU / CCC / CCA / CCG	Proline
UAA / UAG / UGA	(stop)	Terminates translation

Transfer RNA (tRNA) molecules carry the correct amino acid and have an anticodon that base-pairs with the mRNA codon. The ribosome catalyses the formation of peptide bonds between successive amino acids, building the polypeptide chain which then folds into a functional protein.

1. Initiation

Ribosome assembles on mRNA at the start codon (AUG). First tRNA (carrying Methionine) docks in the P-site.

2. Elongation

Next tRNA enters the A-site. Peptide bond forms between amino acids. Ribosome translocates one codon. Uncharged tRNA exits. The cycle repeats, adding up to ~20 amino acids/second in bacteria.

3. Termination

Stop codon is reached. Release factor triggers hydrolysis of the last tRNA–polypeptide bond. The ribosome disassembles. The protein chain then folds spontaneously (often with chaperone help).
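The three stages map onto a short loop. The sketch below is a simplification: it starts at the first AUG it finds, and its codon table contains only the codons listed in the table above, so any other codon is reported as "???". The function name is illustrative.

```python
# Partial genetic code: only the codons listed in the table above.
CODON_TABLE = {
    "AUG": "Met",
    "UUU": "Phe", "UUC": "Phe",
    "GAA": "Glu", "GAG": "Glu",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "CCU": "Pro", "CCC": "Pro", "CCA": "Pro", "CCG": "Pro",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def translate(mrna: str) -> list[str]:
    """Read codons from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")  # initiation: locate the start codon
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):  # elongation: one codon per step
        residue = CODON_TABLE.get(mrna[i:i + 3], "???")
        if residue == "STOP":  # termination
            break
        peptide.append(residue)
    return peptide


print(translate("GGAUGUUUGAAGGCCCGUAAGG"))  # ['Met', 'Phe', 'Glu', 'Gly', 'Pro']
```

A complete implementation would use all 64 codons; the lookup structure stays the same.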

Genes and the Genome

A gene is a sequence of DNA that encodes a functional molecule — usually a protein, sometimes a functional RNA. The complete set of DNA in an organism is its genome.

Key numbers for the human genome:

3.2 billion base pairs (per haploid set)
~20,000 protein-coding genes — only ~1.5% of the total DNA
~48% transposable elements (mobile DNA sequences)
~8% regulatory and structural non-coding RNA genes
The remaining ~40%: introns, repetitive sequences, and regions with still-unclear functions

"Junk DNA" isn't junk: Non-coding DNA was long dismissed as "junk", a term coined by Susumu Ohno in 1972. Follow-up projects to the Human Genome Project, such as ENCODE (2012), reported that about 80% of the genome shows some biochemical activity, including regulatory elements, chromatin structure anchors, and long non-coding RNAs that influence gene expression. How much of that activity is functionally important is still debated.

Mutations

A mutation is a permanent change in the DNA sequence. Mutations are the raw material of evolution — without them, all life would be genetically identical. Most are neutral; some are harmful; rare ones are beneficial.

Types of mutation

Substitution: One base swapped for another. A synonymous (silent) substitution leaves the amino acid unchanged, thanks to codon redundancy. A missense substitution changes the amino acid. A nonsense substitution creates a premature stop codon.
Insertion / Deletion (indel): One or more bases added or removed. If not in multiples of 3, causes a frameshift mutation that scrambles all downstream codons — usually catastrophic.
Chromosomal rearrangements: Large segments duplicated, inverted, or moved to a different chromosome.

Original: ...AAU GAG CCG UGA...
             Asn Glu Pro STOP
Substitution (G→A at position 4, the first base of the second codon):
Mutant:   ...AAU AAG CCG UGA...
             Asn Lys Pro STOP  ← one amino acid changed
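The substitution categories and the frameshift effect can both be checked mechanically. A minimal sketch, using a small codon table that covers only the codons needed for these examples (function names are illustrative):

```python
# Small subset of the genetic code, enough for the examples below.
CODON_TABLE = {
    "AAU": "Asn", "GAG": "Glu", "GAA": "Glu", "AAG": "Lys", "CCG": "Pro",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def classify_substitution(codon: str, mutant: str) -> str:
    """Label a single-codon change as synonymous, missense or nonsense."""
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "synonymous"
    if after == "STOP":
        return "nonsense"
    return "missense"


def read_codons(mrna: str) -> list[str]:
    """Split a sequence into complete codons, dropping any leftover bases."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


print(classify_substitution("GAG", "AAG"))  # missense (Glu -> Lys)
print(classify_substitution("GAG", "GAA"))  # synonymous (Glu -> Glu)
print(classify_substitution("GAG", "UAG"))  # nonsense (Glu -> STOP)

original = "AAUGAGCCGUGA"
deleted = original[:3] + original[4:]  # delete one base: a frameshift
print(read_codons(original))  # ['AAU', 'GAG', 'CCG', 'UGA']
print(read_codons(deleted))   # ['AAU', 'AGC', 'CGU']
```

After the single-base deletion, every codon downstream of the change is different and the original stop codon is no longer read in frame.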

CRISPR — Editing the Code

CRISPR-Cas9 (Clustered Regularly Interspaced Short Palindromic Repeats) is a molecular tool repurposed from bacterial immune systems that allows scientists to cut DNA at a precise location and edit the sequence. Jennifer Doudna and Emmanuelle Charpentier were awarded the 2020 Nobel Prize in Chemistry for its development as a gene-editing tool.

A guide RNA (gRNA) is designed to match the target DNA sequence. Guided by the gRNA, the Cas9 protein finds the matching sequence in the genome and cuts both strands of the double helix. The cell then repairs the break using one of two pathways:

NHEJ (non-homologous end joining): error-prone repair that typically disrupts (knocks out) the gene.
HDR (homology-directed repair): if a repair template is supplied, the gene can be precisely corrected or replaced.
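Target finding can be sketched as a pattern search. One detail not covered above: the commonly used Cas9 from Streptococcus pyogenes only cuts where the match is immediately followed by a short "NGG" motif (the PAM), and it cuts about 3 bp upstream of that motif. The sketch below assumes that behaviour, searches one strand only, and requires a perfect match; the sequences and function name are made up for illustration.

```python
import re


def find_cas9_sites(genome: str, guide: str) -> list[int]:
    """Return predicted cut positions (0-based) for exact guide matches
    that are immediately followed by an NGG PAM on the given strand."""
    # Lookahead so that overlapping candidate sites are all reported.
    pattern = re.compile(f"(?={re.escape(guide)}[ACGT]GG)")
    sites = []
    for match in pattern.finditer(genome):
        pam_start = match.start() + len(guide)
        sites.append(pam_start - 3)  # cut ~3 bp upstream of the PAM
    return sites


GUIDE = "GACGTTACCGGATTCAGCTA"  # hypothetical 20-nt target sequence
# First copy is followed by TGG (a valid PAM); second copy is not.
GENOME = "TTT" + GUIDE + "TGG" + "AAAC" + GUIDE + "TTT"

print(find_cas9_sites(GENOME, GUIDE))  # [20]
```

Only the first copy of the target is reported, because the second lacks the PAM. Real guide-design tools also scan the opposite strand and score near-matches to estimate off-target cutting.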

Medical applications (2025–26): The first CRISPR therapy, Casgevy (exagamglogene autotemcel), was approved for sickle cell disease and β-thalassaemia in the UK in 2023. US approval followed for sickle cell disease in December 2023 and for β-thalassaemia in January 2024. Dozens more trials are underway for hereditary blindness, cancers, and HIV.

Try It Yourself

The cellular automata simulation shows how complex self-replicating patterns can emerge from simple binary rules — a beautiful analogy for how genetic information unfolds:

🎮 Game of Life — Self-Replicating Patterns →

Reaction-diffusion models the kind of chemical signalling that controls gene expression in developing embryos (Turing patterns):

🧪 Reaction-Diffusion Simulation →