Psychoacoustics: How We Hear Sound

A 440 Hz vibration moves through the air, enters your ear canal, vibrates the eardrum, moves three tiny bones, shifts fluid in a coiled tube, bends 15,000 hair cells, and triggers a neural code that your brain interprets as "the note A". Psychoacoustics is the science connecting this physical process to subjective perception — and its findings power everything from MP3 compression to concert hall design.

1. The Ear as Spectrum Analyser

The cochlea performs a biological Fourier-like frequency analysis. Its basilar membrane is a tapered structure: wide and flexible at the apex (responds to low frequencies), narrow and stiff at the base (responds to high frequencies):

Tonotopic map (characteristic frequencies along the basilar membrane):
  Base: ~20,000 Hz
  Middle: ~1,000 Hz
  Apex: ~20 Hz
  Spacing is approximately logarithmic: each octave occupies roughly equal membrane length. Total membrane ~35 mm ÷ ~10 octaves → ~3.5 mm per octave.

Inner hair cells (IHC): primary auditory receptors (~3,500 per cochlea)
  Deflection of stereocilia → tip links open mechanotransduction channels (K⁺ and Ca²⁺ influx) → receptor potential → glutamate release → spiral ganglion neuron firing
  Each IHC contacts ~10-15 afferent fibres → high-fidelity encoding

Outer hair cells (OHC): amplifier cells (~12,000 per cochlea)
  Prestin protein in the lateral wall → electromotility (measured at rates up to ~70 kHz)
  Active process → amplifies basilar membrane motion by ~40 dB (100× in amplitude)
  Typically lost first in noise-induced hearing loss → sensitivity and frequency resolution decline together
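The log-spaced approximation above can be turned into a small frequency-to-place calculator. This is a minimal sketch that assumes the idealised figures quoted in the text (35 mm membrane, 3.5 mm per octave, 20 kHz at the base); the function names are illustrative, and real cochleae deviate from a pure logarithmic map, especially near the apex.

```python
import math

MM_PER_OCTAVE = 3.5       # equal membrane length per octave (idealised)
BASE_FREQ_HZ = 20_000.0   # characteristic frequency at the base (assumed)


def place_from_frequency(freq_hz: float) -> float:
    """Distance from the base in mm for a given characteristic frequency."""
    octaves_below_base = math.log2(BASE_FREQ_HZ / freq_hz)
    return MM_PER_OCTAVE * octaves_below_base


def frequency_from_place(distance_mm: float) -> float:
    """Characteristic frequency in Hz at a given distance from the base."""
    return BASE_FREQ_HZ / 2 ** (distance_mm / MM_PER_OCTAVE)


for f in (20_000, 1_000, 20):
    print(f"{f:>6} Hz -> {place_from_frequency(f):5.1f} mm from base")
```

Under these assumptions 1 kHz lands about 15 mm from the base and 20 Hz just under 35 mm, consistent with the base/middle/apex values listed above.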

2. Loudness Perception

Sound pressure level (dB SPL): L_p = 20 · log₁₀(p / p_ref), where p_ref = 20 μPa (approximately the threshold of hearing at 1 kHz)

Key levels:
  0 dB SPL: threshold of hearing at 1 kHz
  20 dB SPL: whisper
  60 dB SPL: normal conversation
  85 dB SPL: risk of noise-induced hearing loss (NIHL) with prolonged exposure
  120 dB SPL: threshold of pain
  194 dB SPL: theoretical maximum for an undistorted sound wave in air (overpressure = atmospheric pressure)

Fletcher-Munson equal-loudness contours (1933), later revised and standardised as ISO 226:
  At 1 kHz: loudness level (phon) = dB SPL by definition.
  At other frequencies, particularly low ones, more dB are needed to achieve the same perceived loudness.
  At 1 kHz (threshold): 0 dB SPL = 0 phon
  At 100 Hz (threshold): roughly 25-27 dB SPL is required to reach threshold (ISO 226:2003) → we are much less sensitive to low frequencies at low volumes.
  Practical consequence: bass boosting at low listening levels (the "loudness" button on amplifiers) compensates for reduced sensitivity at low frequencies.

Sone scale (perceived loudness magnitude):
  1 sone = loudness of a 1 kHz tone at 40 dB SPL (40 phon)
  Each +10 phon doubles the loudness in sones.
  Approximation (valid above about 40 phon): S = 2^((P-40)/10), where P = loudness level in phons
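The two formulas in this section are easy to check numerically. The sketch below implements the dB SPL definition and the phon-to-sone approximation as given above; the function names are illustrative, and the sone formula is only a reasonable approximation above roughly 40 phon.

```python
import math

P_REF_PA = 20e-6  # reference pressure: 20 micropascals


def spl_db(p_pa: float) -> float:
    """Sound pressure level in dB SPL for a pressure in pascals."""
    return 20 * math.log10(p_pa / P_REF_PA)


def pressure_from_spl(level_db: float) -> float:
    """Pressure in pascals for a level in dB SPL (inverse of spl_db)."""
    return P_REF_PA * 10 ** (level_db / 20)


def sones_from_phons(phons: float) -> float:
    """Perceived loudness in sones: S = 2^((P - 40) / 10)."""
    return 2 ** ((phons - 40) / 10)


# Atmospheric pressure (101,325 Pa) used as the pressure value gives the
# ~194 dB SPL ceiling quoted in the list of key levels.
print(round(spl_db(101_325), 1))
print(sones_from_phons(40), sones_from_phons(50), sones_from_phons(60))
```

Going from 40 to 60 phon quadruples the loudness in sones while the pressure rises only tenfold, which illustrates how compressive loudness perception is.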

3. Pitch Perception

Pitch is the perceptual correlate of fundamental frequency — yet the relationship is not simple:

Place theory (von Helmholtz, 1863): Pitch determined by WHERE on the basilar membrane maximum vibration occurs. Explains frequency discrimination at high frequencies well (>5 kHz). But place alone predicts discrimination far worse than observed at low frequencies.
Temporal (timing) theory: For low frequencies (<4-5 kHz), auditory nerve fibre firing is phase-locked to the stimulus, with spikes occurring preferentially at certain phases of the waveform. The brain reads the inter-spike interval pattern → extracts the period → determines pitch. This explains the "missing fundamental" illusion: a complex tone with f₀ removed but its harmonics present is still heard at the pitch of f₀, because the waveform still repeats with period 1/f₀.
Modern synthesis: Duplex theory — both place and timing contribute. Low frequencies: timing dominant. High frequencies: place dominant. Middle frequencies: both contribute.
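The timing account of the missing fundamental can be illustrated with an autocorrelation, a common engineering stand-in for "reading the period" of a signal. This is a sketch and not a model of neural processing: the sample rate, search range, window length, and function names are all assumptions chosen for the example.

```python
import math

SAMPLE_RATE = 8000  # Hz (assumed for this sketch)


def harmonic_complex(freqs_hz, n_samples):
    """Sum of equal-amplitude sine waves at the given frequencies."""
    return [
        sum(math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f in freqs_hz)
        for n in range(n_samples)
    ]


def estimate_pitch(signal, f_min=70.0, f_max=1000.0, window=160):
    """Return the pitch (Hz) whose period maximises the autocorrelation."""
    lag_min = int(SAMPLE_RATE / f_max)
    lag_max = int(SAMPLE_RATE / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Correlate a fixed window of the signal with a delayed copy.
        score = sum(signal[n] * signal[n + lag] for n in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return SAMPLE_RATE / best_lag


# Harmonics 2-5 of 100 Hz; no energy at 100 Hz itself.
signal = harmonic_complex([200, 300, 400, 500], n_samples=400)
pitch = estimate_pitch(signal)
print(pitch)  # 100.0
```

The strongest repetition is at a lag of 80 samples (10 ms), so the estimate is 100 Hz even though the signal contains no 100 Hz component.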

Mel scale (pitch perception is compressive and non-linear):
  Mels approximate equal perceived pitch intervals.
  m = 2595 · log₁₀(1 + f/700), with f in Hz
  The mel scale is roughly linear below ~1 kHz and logarithmic above it. Musical pitch, by contrast, is logarithmic throughout (each piano octave = 2× frequency).

JND (just noticeable difference) in frequency:
  At 1 kHz: Δf ≈ 3 Hz (0.3%)
  At 8 kHz: Δf ≈ 40 Hz (0.5%)
  Trained musicians: ~1-2 cents (1 cent = 1/100th of a semitone ≈ 0.06% in frequency, about 0.6 Hz at 1 kHz)
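A short sketch of the conversions used in this section: the mel formula above, its algebraic inverse, and the standard definition of cents (1200 per octave). Function names are illustrative.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Mel value for a frequency in Hz: m = 2595 * log10(1 + f / 700)."""
    return 2595 * math.log10(1 + freq_hz / 700)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700 * (10 ** (mel / 2595) - 1)


def cents_between(f1_hz: float, f2_hz: float) -> float:
    """Interval from f1 to f2 in cents (1200 cents per octave)."""
    return 1200 * math.log2(f2_hz / f1_hz)


print(round(hz_to_mel(1000)))               # the formula is anchored near 1000 mel at 1 kHz
print(round(cents_between(1000, 1003), 1))  # the 3 Hz JND at 1 kHz expressed in cents
```

Expressed in cents, the 3 Hz JND at 1 kHz is roughly 5 cents, which puts the 1-2 cent figure for trained musicians in context.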

4. Critical Bands and Masking

Critical bandwidth (CBW): The frequency range over which masking and certain perceptual grouping effects operate. Related to the integrating bandwidth of the basilar membrane filter.

Bark scale (Zwicker 1961): 24 critical bands across the audible range.
  Each critical band spans ~100 Hz at low frequencies (below 500 Hz) and ~20% of centre frequency at higher frequencies.
  Bark formula: z(f) = 13 · arctan(0.76 · f) + 3.5 · arctan((f / 7.5)²), with f in kHz

Simultaneous masking:
  A masker tone at frequency f_m masks (makes inaudible) nearby tones.
  Masking is most effective for tones WITHIN the same critical band.
  "Upward spread of masking": lower-frequency tones mask higher ones more easily than the reverse (asymmetric spreading).

Temporal masking:
  Pre-masking (backward): a masker presented shortly AFTER the target can still mask it, because the louder masker is processed faster and overtakes the target. Duration: ~20 ms.
  Post-masking (forward): after the masker stops, residual masking persists for ~100-200 ms.

Application: MP3 / AAC psychoacoustic compression
  A perceptual model identifies tones and noise that fall below the masking threshold.
  These components are inaudible → the encoder can spend fewer bits on them, or none.
  A typical 128 kbps MP3 achieves a compression ratio of ~11:1 relative to CD audio (1,411 kbps), with modest perceptible quality loss for most listeners.
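The Bark formula can be used for a rough check of whether two tones are close enough to mask each other strongly. The sketch below implements the formula as given in this section; the one-Bark distance rule in `same_critical_band` is a simplifying assumption for illustration, not how a real codec's psychoacoustic model decides audibility.

```python
import math


def hz_to_bark(freq_hz: float) -> float:
    """Critical-band rate in Bark (Zwicker's arctan approximation)."""
    f_khz = freq_hz / 1000.0
    return 13 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)


def same_critical_band(f1_hz: float, f2_hz: float) -> bool:
    """Crude test: treat tones less than 1 Bark apart as sharing a band."""
    return abs(hz_to_bark(f1_hz) - hz_to_bark(f2_hz)) < 1.0


print(round(hz_to_bark(1000), 2))        # about 8.5 Bark
print(same_critical_band(1000, 1050))    # True: strong mutual masking expected
print(same_critical_band(1000, 2000))    # False: several Bark apart
```

With this formula 20 kHz maps to roughly 24.6 Bark, in line with the 24 critical bands across the audible range.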

5. Binaural Hearing and Localisation

Two ears provide multiple acoustic cues for sound localisation in three dimensions:

Interaural Time Difference (ITD): A sound from the right arrives ~700 μs earlier at the right ear than the left (for 90° azimuth). The auditory brainstem (superior olivary complex, Jeffress delay-line model) detects ITDs as small as 10–20 μs. Dominant cue for azimuth at low frequencies (<1500 Hz).
Interaural Level Difference (ILD): The head shadows higher frequencies → amplitude difference between ears. Dominant cue for azimuth at high frequencies (>2000 Hz).
Head-Related Transfer Function (HRTF): The pinna (outer ear) acts as a direction-dependent filter. Spectral coloration from the pinna provides elevation cues and front/back disambiguation. Personalised HRTFs enable convincing 3D audio (spatial audio in headphones, Apple AirPods Spatial Audio).
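The ~700 μs maximum ITD can be approximated from head geometry. The sketch below uses Woodworth's spherical-head formula, ITD = (a/c)(θ + sin θ), which is not named in the text above and is brought in here as an illustrative model; the head radius (8.75 cm) and speed of sound (343 m/s) are assumed typical values.

```python
import math

HEAD_RADIUS_M = 0.0875      # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0  # assumed, air at about 20 °C


def itd_seconds(azimuth_deg: float) -> float:
    """ITD for a distant source at 0-90 degrees azimuth (spherical head)."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


for az in (1, 30, 90):
    print(f"{az:>2} deg -> {itd_seconds(az) * 1e6:6.1f} us")
```

Under these assumptions a source at 90° gives about 656 μs, close to the ~700 μs quoted above, and a 1° shift near straight ahead changes the ITD by roughly 9 μs, about the size of the smallest detectable ITD.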

"Cone of confusion": All points on an imaginary cone centred on the interaural axis produce nearly the same ITD and ILD, so these two cues alone cannot tell them apart. The pinna's spectral cues (and small head movements) resolve this ambiguity. When pinna cues are unavailable or altered, front-back confusions and elevation errors increase dramatically. This is one reason in-ear and over-ear headphones can differ in spatial audio quality.

6. The Cocktail Party Effect

At a noisy party with many conversations, you can focus on and follow one speaker while filtering out others — even when the acoustics favour no individual voice in isolation. This remarkable ability involves multiple perceptual mechanisms:

Spatial attention: Binaural cues (ITD/ILD) segregate sources by location. Sounds from different directions activate different neural populations → attention can select by spatial stream.
Auditory scene analysis (ASA): Bregman (1990) showed that simultaneous sounds group into perceptual "streams" based on frequency proximity, temporal coherence, timbre similarity, and spatial origin. Once a target stream is formed, competing streams are attenuated by top-down attention.
Top-down prediction: Linguistic knowledge, expected prosody, and semantic context provide strong predictions that enhance target detection in noise (noise-filled gaps are perceptually completed using context — "phonemic restoration").
Neural mechanisms: Auditory attention selectively enhances neural responses to attended sounds in auditory cortex (~10 dB equivalent of SNR improvement). The frontal eye fields and parietal cortex exert top-down control via corticofugal projections to medial geniculate body.

7. Auditory Illusions

Shepard tone: A superposition of sine tones spaced an octave apart, weighted by a fixed bell-shaped amplitude envelope over frequency. As all components slowly rise in frequency, those leaving the top of the range fade to inaudibility while new ones fade in at the bottom. Perceptual result: the pitch appears to rise endlessly. The soundtrack of Christopher Nolan's "Dunkirk" uses the effect throughout to create unending tension.
Missing fundamental: A complex tone consisting of harmonics at 200, 300, 400, and 500 Hz (but NOT 100 Hz) is heard with a pitch of 100 Hz. The auditory system infers the fundamental from the harmonic pattern, not from direct stimulation at 100 Hz. This is why telephone speech band-limited to 300–3400 Hz still conveys voice pitch: the harmonics carry it even when the fundamental is filtered out.
Tritone paradox (Deutsch): Two Shepard tones a tritone (½ octave) apart are played in succession; some listeners hear the pair as ascending, others as descending. The percept depends on where the tones fall relative to the listener's internal pitch-class template, which Deutsch found to correlate with the language or dialect the listener grew up with.
Haas effect / Precedence effect: When the same sound arrives from two directions and one copy is delayed by roughly 1–40 ms, the listener localises the sound at the direction of the first arrival (even if the delayed copy is up to ~10 dB louder). Exploited in PA systems: delayed fill speakers add level while the sound still appears to come from the stage.
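The construction of a Shepard tone, described at the start of this section, can be sketched by listing the frequency and amplitude of each octave-spaced component. The raised-cosine envelope, the 27.5 Hz base frequency, the eight-octave span, and the function name are assumptions chosen for illustration; other bell-shaped envelopes work as well.

```python
import math


def shepard_components(step, n_octaves=8, base_hz=27.5, steps_per_octave=12):
    """Return (frequency_hz, amplitude) pairs for one Shepard tone.

    `step` is the position in a rising semitone sequence; the result
    repeats exactly every `steps_per_octave` steps.
    """
    shift = (step % steps_per_octave) / steps_per_octave  # 0 <= shift < 1 octave
    components = []
    for k in range(n_octaves):
        position = k + shift  # octaves above base_hz
        freq_hz = base_hz * 2 ** position
        # Fixed raised-cosine envelope over log-frequency: silent at both
        # edges of the range, loudest in the middle.
        amplitude = 0.5 * (1 - math.cos(2 * math.pi * position / n_octaves))
        components.append((freq_hz, amplitude))
    return components


for freq_hz, amplitude in shepard_components(step=3):
    print(f"{freq_hz:8.1f} Hz  amplitude {amplitude:.3f}")
```

Because step 12 yields exactly the same components as step 0, a sequence of rising steps can loop forever while each individual step still sounds higher than the one before.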