Timbre: The Colour of Sound

A violin and a flute playing the same A4 (440 Hz) sound completely different — same pitch, same loudness, yet instantly distinguishable. That quality is timbre: everything about a sound except its pitch, loudness, and duration. Understanding timbre means understanding Fourier analysis, resonant cavities, temporal envelopes, and how the auditory system maps spectral structure to perceived identity.

1. The Harmonic Series and Overtones

A stretched string of length L clamped at both ends supports standing waves only at specific wavelengths. The allowed frequencies are integer multiples of the fundamental:

Harmonics of a stretched string or an air column open at both ends (open pipe):

    fₙ = n · f₀    for n = 1, 2, 3, 4, ...

    f₀ = fundamental frequency (first harmonic)
    f₁ = f₀     ← fundamental
    f₂ = 2f₀    ← octave above
    f₃ = 3f₀    ← perfect fifth above the octave
    f₄ = 4f₀    ← two octaves above
    f₅ = 5f₀    ← major third above that

String wave speed:

    v = √(T/μ)    T = tension, μ = mass per unit length

Fundamental (string of length L):

    f₀ = v/(2L) = (1/(2L)) · √(T/μ)

Closed pipe (clarinet-like, closed at one end):

    fₙ = n · f₀, but only ODD harmonics: n = 1, 3, 5, 7, ...
    Here f₀ = c/(4L), with c = speed of sound in air and L = pipe length.
    → characteristic "hollow" sound; the first overtone is 3f₀, a twelfth above the fundamental rather than an octave.

Inharmonicity (piano, bells):

    Real strings have stiffness → modes deviate from exact integer ratios.
    fₙ ≠ n · f₀; instead fₙ = n · f₀ · √(1 + B · n²), where B = stiffness coefficient.
    Piano B ≈ 10⁻⁴ to 10⁻³ (higher for short, thick strings).
    Slight inharmonicity contributes to the "richness" of piano tone.
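The relations above can be checked numerically. The following is a minimal Python sketch, not a model of any particular instrument: the function names and the example string parameters (length, tension, mass per unit length, and the value of B) are illustrative assumptions.

```python
import math

def string_fundamental(length_m, tension_n, mu_kg_per_m):
    """f0 = (1 / 2L) * sqrt(T / mu) for an ideal string clamped at both ends."""
    return math.sqrt(tension_n / mu_kg_per_m) / (2.0 * length_m)

def open_harmonics(f0, count):
    """String or open pipe: every integer multiple of f0."""
    return [n * f0 for n in range(1, count + 1)]

def closed_pipe_harmonics(f0, count):
    """Pipe closed at one end: odd multiples of f0 only (1, 3, 5, ...)."""
    return [(2 * k - 1) * f0 for k in range(1, count + 1)]

def stiff_string_partials(f0, b_coeff, count):
    """f_n = n * f0 * sqrt(1 + B * n^2) for a string with stiffness."""
    return [n * f0 * math.sqrt(1.0 + b_coeff * n * n) for n in range(1, count + 1)]

def cents(f, f_ref):
    """Interval from f_ref to f in cents (100 cents = one equal-tempered semitone)."""
    return 1200.0 * math.log2(f / f_ref)

# Assumed example values, chosen only for illustration.
f0 = string_fundamental(length_m=0.65, tension_n=70.0, mu_kg_per_m=0.0004)
ideal = open_harmonics(f0, 10)
stiff = stiff_string_partials(f0, b_coeff=4e-4, count=10)
for n, (fi, fs) in enumerate(zip(ideal, stiff), start=1):
    print(f"n={n:2d}  ideal={fi:8.1f} Hz  stiff={fs:8.1f} Hz  sharp by {cents(fs, fi):5.1f} cents")
```

With B in the range quoted above, the lowest partials are almost exactly harmonic while the tenth partial is already sharp by a sizeable fraction of a semitone, which is why the deviation matters mainly for the upper partials.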

2. Fourier Decomposition of Instrument Sounds

Any periodic waveform can be decomposed into a sum of sinusoids at harmonically related frequencies — Fourier's theorem. For a periodic sound with period T = 1/f₀:

Fourier series of a periodic waveform x(t):

    x(t) = A₀/2 + Σ_{n=1}^{∞} [Aₙ cos(2πnf₀t) + Bₙ sin(2πnf₀t)]
         = Σ_{n=0}^{∞} Cₙ cos(2πnf₀t + φₙ)

    Cₙ = amplitude of nth harmonic (spectral amplitude); Cₙ = √(Aₙ² + Bₙ²) for n ≥ 1, C₀ = A₀/2
    φₙ = phase of nth harmonic (spectral phase)

    Power spectrum: |Cₙ|² vs. n, the "fingerprint" of timbre.

Characteristic harmonic spectra:

    Pure sine wave: C₁ only
        → single-line spectrum → flute (approximately)

    Square wave: x(t) = ±1 with equal duration
        Cₙ = 4/(nπ) for odd n only
        → odd harmonics with 1/n amplitude roll-off
        Hard, buzzy sound (resembles the clarinet's hollow, reedy quality)

    Sawtooth wave: x(t) = 2(t/T − ⌊t/T⌋) − 1
        Cₙ = 2/(nπ) for ALL n
        → all harmonics with 1/n amplitude roll-off
        Rich, sharp sound (resembles violin, bowed string)

    Triangle wave:
        Cₙ = 8/(n²π²) for odd n
        → odd harmonics with 1/n² roll-off
        Softer, more sinusoidal character (resembles flute)

    Violin (bowed string): rich in harmonics up to ~5 kHz; spectral envelope shaped by body resonances: the Helmholtz air-cavity mode (~275 Hz), a corpus mode (~440 Hz), and plate resonances (1-2 kHz). ~10-15 significant harmonics at each pitch.

    Trumpet: essentially ALL harmonics prominent up to very high frequency. Fundamental often weak; 2nd-8th harmonics dominate. Mouthpiece cup + bell flare create the characteristic spectral envelope.
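The coefficient formulas for the square, sawtooth, and triangle waves can be verified by projecting one sampled period onto cosines and sines. This is a rough sketch using a direct (slow) discrete Fourier sum in pure Python; the helper names and the sample count are assumptions made for illustration, and a real analysis would normally use an FFT library.

```python
import math

def harmonic_amplitudes(wave, n_harmonics, n_samples=2048):
    """Estimate C_n (n = 1..n_harmonics) for `wave`, a function of phase in [0, 1)."""
    samples = [wave(k / n_samples) for k in range(n_samples)]
    amps = []
    for n in range(1, n_harmonics + 1):
        re = sum(s * math.cos(2.0 * math.pi * n * k / n_samples) for k, s in enumerate(samples))
        im = sum(s * math.sin(2.0 * math.pi * n * k / n_samples) for k, s in enumerate(samples))
        # Factor 2/N converts the raw sum into the amplitude of a real sinusoid.
        amps.append(2.0 * math.hypot(re, im) / n_samples)
    return amps

def square(phase):
    return 1.0 if phase < 0.5 else -1.0

def sawtooth(phase):
    return 2.0 * phase - 1.0

def triangle(phase):
    return 1.0 - 4.0 * abs(phase - 0.5)

for name, wave in (("square", square), ("sawtooth", sawtooth), ("triangle", triangle)):
    amps = harmonic_amplitudes(wave, 6)
    print(name.ljust(9), " ".join(f"{a:.4f}" for a in amps))
```

The even-numbered entries come out essentially zero for the square and triangle waves, and the sawtooth amplitudes fall as 1/n, matching the closed-form coefficients up to a small sampling error.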

3. ADSR Envelopes and Temporal Character

Timbre is not static. The way a sound begins, sustains, and ends — its temporal envelope — is as important to identification as its steady-state spectrum. Remove the attack of a piano note: listeners misidentify it as an organ. The standard ADSR model divides amplitude evolution into four stages:

ADSR envelope A(t):

    Attack:  0 → peak amplitude over time T_A (linear or exponential rise)
    Decay:   peak → sustain level in time T_D (exponential fall)
    Sustain: maintains level S while key is held [not a time, a level]
    Release: S → 0 over time T_R after key release (exponential fall)

Typical values (illustrative):

    Instrument       T_A          T_D           S      T_R            Notes
    Piano            1-10 ms      100-300 ms    ~0     300-1000 ms    percussive, no sustain
    Organ            1-5 ms       0 ms          1.0    50-100 ms      instantaneous to sustain
    Violin (bowed)   50-200 ms    0 ms          1.0    100-300 ms     slow bow attack
    Drum             <5 ms        rapid         0      brief          pure percussive

Attack transients, the most critical temporal region for identification:

    The first ~50-200 ms of a note contains spectral structure that changes rapidly as resonators fill. These inharmonic, noisy transients give:
    • the "bite" of a clarinet's reed
    • the bow "scratchiness" of a violin
    • the hammer click of a piano

    Experiment: removing the first 50 ms of attack from a piano note causes misidentification as another instrument about 60% of the time on average.
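The four stages can be written as a single function of time. This is one possible sketch: it assumes a linear attack and exponential decay and release, and it assumes that each exponential segment uses a time constant of one fifth of its nominal duration so that it has essentially settled by the end. That factor of 5, the function name, and the example parameter values are illustrative choices, not part of the ADSR model itself.

```python
import math

def adsr(t, attack, decay, sustain, release, note_off):
    """Envelope amplitude (0..1) at time t seconds; the key is released at note_off."""
    if t < 0.0:
        return 0.0

    def held(u):
        # Level while the key is down: linear attack, then exponential decay to `sustain`.
        if u < attack:
            return u / attack
        if decay <= 0.0:
            return sustain
        return sustain + (1.0 - sustain) * math.exp(-5.0 * (u - attack) / decay)

    if t <= note_off:
        return held(t)
    if release <= 0.0:
        return 0.0
    # Release starts from whatever level was reached at note_off.
    return held(note_off) * math.exp(-5.0 * (t - note_off) / release)

# Assumed piano-like and organ-like settings, loosely based on the table above.
piano_like = dict(attack=0.005, decay=0.2, sustain=0.0, release=0.5)
organ_like = dict(attack=0.003, decay=0.0, sustain=1.0, release=0.08)
for t in (0.0, 0.0025, 0.005, 0.1, 0.5, 1.05):
    p = adsr(t, note_off=1.0, **piano_like)
    o = adsr(t, note_off=1.0, **organ_like)
    print(f"t={t:6.4f} s  piano-like={p:.3f}  organ-like={o:.3f}")
```

Multiplying each sample of an oscillator by this value gives the shaped note; the piano-like setting has died away long before the key is released, while the organ-like setting holds full level until note_off.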

4. Formant Frequencies and Vocal Timbre

The human vocal tract is an acoustic resonator whose shape (controlled by tongue, jaw, lips, and velum) selects which overtones of the voice source are amplified. These resonance peaks are called formants:

Source-filter model of speech (Fant 1960):

    S(f) = G(f) · V(f)

    G(f) = source spectrum (vocal fold vibrations)
         = harmonics of f₀ (fundamental) with -12 dB/octave roll-off
    V(f) = vocal tract transfer function (filter)
         = set of resonance peaks at formant frequencies F1, F2, F3, ...

Formant frequencies depend on vocal tract length L (~17 cm, adult male):

    Fn ≈ (2n-1) · c / (4L)    for a uniform tube (simplified); c = speed of sound

    F1 ≈ 300-900 Hz      (jaw height / mouth opening; first resonance)
    F2 ≈ 600-2500 Hz     (tongue front-back position)
    F3 ≈ 2000-3500 Hz    (tongue tip, secondary constrictions)

Vowel formants (approximate, adult male):

    Vowel   F1 (Hz)   F2 (Hz)   Description
    /i/     270       2290      high front: "ee" as in "see"
    /æ/     660       1720      open front: "a" as in "cat"
    /ɑ/     730       1090      open back: "ah" as in "father"
    /u/     300       870       high back: "oo" as in "food"
    /ε/     530       1840      mid front: "e" as in "bed"

Singer's formant (~2500-3500 Hz):

    Trained classical singers cluster the F3, F4, and F5 resonances into this band, near 3 kHz.
    Orchestral spectra have much less energy in this region → the singer "cuts through" without louder phonation.
    Characteristic of trained operatic voices. Typically much weaker in choral singing, where voices blend with the ensemble rather than project over it.
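The uniform-tube formula and the vowel table can be combined in a small Python sketch. The speed of sound (343 m/s), the log-frequency distance used to pick the closest vowel, and all names below are assumptions made for illustration; real vowel classification uses more formants and speaker normalisation.

```python
import math

def tube_formants(length_m, count=3, c=343.0):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4L)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]

# (F1, F2) in Hz, copied from the adult-male table above.
VOWELS = {
    "i": (270.0, 2290.0),
    "æ": (660.0, 1720.0),
    "ɑ": (730.0, 1090.0),
    "u": (300.0, 870.0),
    "ε": (530.0, 1840.0),
}

def nearest_vowel(f1, f2):
    """Vowel whose tabulated (F1, F2) is closest in log-frequency distance."""
    def distance(vowel):
        ref1, ref2 = VOWELS[vowel]
        return math.hypot(math.log(f1 / ref1), math.log(f2 / ref2))
    return min(VOWELS, key=distance)

print([round(f) for f in tube_formants(0.17)])  # roughly 500, 1500, 2500 Hz
print(nearest_vowel(280.0, 2200.0))
print(nearest_vowel(700.0, 1100.0))
```

For a 17 cm tube the formula gives evenly spaced resonances near 500, 1500, and 2500 Hz, the pattern of a neutral vowel; actual vowels arise when the tongue and jaw move F1 and F2 away from these values.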

5. Spectral Centroid, Brightness, and Perception

Spectral centroid: the "centre of mass" of a sound's spectrum. Correlates with perceived brightness / sharpness of timbre.

    SC = Σ(fₙ · |Xₙ|) / Σ|Xₙ|

    fₙ = frequency of nth spectral component
    |Xₙ| = amplitude of nth component

    High SC (e.g., 1000-3000 Hz): bright, sharp, metallic → trumpet, oboe, violin
    Low SC (e.g., 200-500 Hz): dark, warm, mellow → double bass, tuba, clarinet in its low register

Other perceptually relevant spectral features:

    Spectral flatness (tonal vs. noise-like):
        SF = geometric_mean(|X|) / arithmetic_mean(|X|)
        SF = 1: pure white noise; SF → 0: single sinusoidal tone
        Tracks how noise-like or how pure a sound is perceived to be.

    Spectral spread:
        Variance of the spectrum around the centroid. High spread: richer, less focused tone.

    Odd/even harmonic ratio:
        Instruments with strong odd harmonics: clarinets (closed pipe → very hollow character)
        Instruments with strong even + odd harmonics: violin, trumpet → "fuller" sound

Grey (1977) multidimensional scaling (MDS) study:

    16 instrument tones matched for pitch/loudness. Listeners judged similarity. MDS revealed 3 perceptual dimensions:
    1. Spectral energy distribution (centroid / brightness)
    2. Spectral flux (how fast the spectrum changes with time; attack character)
    3. Synchrony of attack transients (whether harmonics begin simultaneously)
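The three descriptors defined above are short to compute once a list of component frequencies and amplitudes is available. The sketch below is a plain-Python illustration under stated assumptions: the function names are hypothetical, the spread is returned as a standard deviation in Hz (the square root of the variance), and a tiny `eps` is added inside the logarithm so that zero amplitudes do not raise an error.

```python
import math

def spectral_centroid(freqs, mags):
    """SC = sum(f_n * |X_n|) / sum(|X_n|), in Hz."""
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)

def spectral_spread(freqs, mags):
    """Amplitude-weighted standard deviation of frequency around the centroid, in Hz."""
    sc = spectral_centroid(freqs, mags)
    variance = sum(m * (f - sc) ** 2 for f, m in zip(freqs, mags)) / sum(mags)
    return math.sqrt(variance)

def spectral_flatness(mags, eps=1e-12):
    """Geometric mean / arithmetic mean of the amplitudes (1 = flat, near 0 = tonal)."""
    geometric = math.exp(sum(math.log(m + eps) for m in mags) / len(mags))
    arithmetic = sum(mags) / len(mags)
    return geometric / arithmetic

# Ten harmonics of 220 Hz with two assumed roll-offs: 1/n (bright) and 1/n^2 (mellow).
freqs = [220.0 * n for n in range(1, 11)]
bright = [1.0 / n for n in range(1, 11)]
mellow = [1.0 / (n * n) for n in range(1, 11)]
for name, mags in (("1/n", bright), ("1/n^2", mellow)):
    print(f"{name:6s} centroid={spectral_centroid(freqs, mags):7.1f} Hz  "
          f"spread={spectral_spread(freqs, mags):7.1f} Hz  "
          f"flatness={spectral_flatness(mags):.3f}")
```

The steeper roll-off pulls the centroid down toward the fundamental, which corresponds to the darker, mellower end of the brightness scale described above.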

Roughness and beating: When two frequency components fall within the same critical band, they interact to produce amplitude fluctuations (beating) at their difference frequency. Differences below about 20 Hz are heard as a slow pulsation in loudness; differences of roughly 20–200 Hz fuse into the sensation of roughness, perceived as harsh. Instruments with dense, unresolved partials (e.g. a bowed cymbal) sound rough; those with well-spaced harmonics sound smooth. Roughness is a sensory effect and is distinct from musical dissonance, which also depends on learned musical expectations.
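Beating at the difference frequency follows from the identity sin a + sin b = 2 · sin((a+b)/2) · cos((a−b)/2): the sum of two tones is a tone at the mean frequency whose amplitude is modulated by a slow cosine. The Python sketch below checks this numerically. The classification thresholds are coarse assumptions based on the ranges quoted in the text (in reality the boundaries depend on the critical bandwidth, which varies with frequency), and the function names are hypothetical.

```python
import math

def two_tone(t, f1, f2):
    """Sum of two unit-amplitude sinusoids at time t (seconds)."""
    return math.sin(2.0 * math.pi * f1 * t) + math.sin(2.0 * math.pi * f2 * t)

def product_form(t, f1, f2):
    """Same signal written as a carrier at the mean frequency times a slow envelope."""
    carrier = math.sin(math.pi * (f1 + f2) * t)
    envelope = 2.0 * math.cos(math.pi * (f1 - f2) * t)
    return carrier * envelope

def beat_frequency(f1, f2):
    """Rate of the loudness fluctuation in Hz: |f1 - f2|."""
    return abs(f1 - f2)

def classify_interaction(f1, f2):
    """Very rough label for how two nearby partials are heard (assumed thresholds)."""
    diff = beat_frequency(f1, f2)
    if diff == 0.0:
        return "unison"
    if diff < 20.0:
        return "slow beating"
    if diff <= 200.0:
        return "roughness"
    return "resolved"

for pair in ((440.0, 444.0), (440.0, 500.0), (440.0, 880.0)):
    print(pair, beat_frequency(*pair), "Hz ->", classify_interaction(*pair))
```

Two strings tuned 4 Hz apart therefore pulse four times per second, while partials 60 Hz apart fluctuate too fast to be heard as separate pulses and are heard as roughness instead.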

6. Comparing Instruments: Spectral Signatures

Different instruments produce distinctly different spectral shapes even at the same pitch and loudness:

Flute: Predominantly fundamental with a strong 2nd harmonic; higher harmonics fall off rapidly. Low spectral centroid for a wind instrument. Noisy only during the attack (initial jet noise at the embouchure); close to a sine wave in the steady state.
Clarinet: Strong odd harmonics (1st, 3rd, 5th) due to cylindrical bore closed at one end. Characteristic hollow, reedy quality. Changes character across registers (register key jumps a 12th, not an octave). Very different spectral profile between chalumeau and clarion registers.
Oboe: Double reed source provides extremely rich harmonic content (all harmonics strong). Very high spectral centroid. Nasal, penetrating quality: powerful 3rd–8th harmonics. Body resonances create distinctive formant-like peaks at ~300 Hz and ~1200 Hz.
Violin (bowed): Rich harmonic spectrum shaped by top and back plate resonances of the body. Strong coupling via the bass bar and sound post. Wolf tone: certain pitches coincide with body resonance, creating unwanted feedback oscillation. Vibrato creates frequency modulation ±30–60 cents at 5–7 Hz, spreading energy and creating characteristic shimmer.
Trumpet: Very strong high partials, often out to 10 kHz. Bell flare cutoff frequency: components below ~750 Hz are largely reflected back into the bore, where they sustain the standing wave; components above it radiate out of the bell. Mutes change the spectral envelope dramatically: a straight mute attenuates the low frequencies, leaving a thinner, more nasal tone; a wah-wah mute dynamically shifts a formant peak.
Piano: Each string struck by felt hammer; contact time determines which harmonics are excited. Short contact time (harder strike, forte): all harmonics to high frequency. Long contact time (soft, pianissimo): high harmonics attenuated. Also: unison chorus tuning (piano has 2-3 strings per note, detuned by 1-3 cents) creates characteristic "beating" and prolonged decay.
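Several of the figures in this list are given in cents or as musical intervals (vibrato depth, unison detuning, the clarinet's register jump of a twelfth). The following Python sketch converts them to frequency ratios and beat rates. The function names and the choice of A4 = 440 Hz as the example pitch are assumptions for illustration only.

```python
import math

def cents_to_ratio(cents):
    """Frequency ratio for an interval in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)

def ratio_to_cents(ratio):
    """Interval in cents for a given frequency ratio."""
    return 1200.0 * math.log2(ratio)

def unison_beat_rate(freq_hz, detune_cents):
    """Beat rate (Hz) between a string at freq_hz and one detuned by detune_cents."""
    return abs(freq_hz * (cents_to_ratio(detune_cents) - 1.0))

def vibrato_extremes(freq_hz, depth_cents):
    """Lowest and highest instantaneous frequency for a vibrato of +/- depth_cents."""
    ratio = cents_to_ratio(depth_cents)
    return freq_hz / ratio, freq_hz * ratio

a4 = 440.0
print(f"2-cent unison detuning at A4: {unison_beat_rate(a4, 2.0):.2f} beats per second")
low, high = vibrato_extremes(a4, 50.0)
print(f"+/-50 cent vibrato at A4 sweeps {low:.1f} Hz to {high:.1f} Hz")
print(f"3:1 ratio (a twelfth): {ratio_to_cents(3.0):.1f} cents")
```

A detuning of a couple of cents at A4 yields a beat roughly once every two seconds for the fundamental, slow enough to be heard as a gentle swell rather than as roughness.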

7. Synthesizing Timbre: Additive, Subtractive, and FM

ADDITIVE SYNTHESIS:

    Build complex waveforms by summing oscillators, each at a harmonic frequency. Total control over each partial's amplitude and phase.

        x(t) = Σ_{n=1}^{N} Aₙ(t) · sin(2π·n·f₀·t + φₙ)

    Each Aₙ(t) can have its own ADSR envelope → realistic instrument evolution.
    Problem: computationally expensive for N = 50+ partials.
    Used in: Hammond organ (tonewheels = additive principle).

SUBTRACTIVE SYNTHESIS:

    Start with a spectrally rich source (sawtooth, noise), then apply band-pass or low-pass filters to shape the spectrum.

    Classic chain:
        VCO (oscillator) → VCF (filter) → VCA (amplifier)
        ↑ pitch            ↑ timbre       ↑ envelope

    Filter: resonant low-pass (Moog ladder filter: 4 cascaded RC stages → 24 dB/oct)
        f_cutoff: controls brightness
        Q / resonance: feedback at cutoff → sharp peak, even self-oscillation
    Moog-style analog synths are essentially all subtractive in architecture.

FREQUENCY MODULATION (FM) SYNTHESIS:

    Chowning (1973); commercialised in the Yamaha DX7 (1983), which sold more than 200,000 units.
    A modulator signal modulates the frequency of a carrier:

        x(t) = A · sin(2π·f_c·t + I·sin(2π·f_m·t))

        f_c = carrier frequency (perceived pitch)
        f_m = modulator frequency
        I = modulation index = Δf/f_m (controls "brightness" and harmonic content)

    Resulting spectrum: Bessel functions of the first kind
        Sidebands at f_c ± n·f_m for n = 0, 1, 2, 3, ...
        Amplitude of nth sideband: A·|J_n(I)|

    With I = 0: pure sine wave (carrier only)
    Increasing I → sidebands emerge → progressively richer, brighter timbres
    When f_c/f_m is a ratio of small integers: harmonic spectrum (musical pitched sounds)
    When f_c/f_m is not a simple ratio: inharmonic spectrum (bell, metallic sounds)

    DX7: 6 operators (oscillators), arranged in 32 "algorithms" (routing configurations). Produces everything from electric piano (Rhodes) to brass, strings, bells.

WAVETABLE SYNTHESIS (modern standard in software synths):

    Store one cycle of a waveform digitally and loop it.
    Morph between wavetables in real time (e.g. transition from attack to sustain spectrum). Allows arbitrary spectral shapes.
    Used in: Xfer Serum, Native Instruments Massive X.
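The FM sideband structure described above can be explored with a short Python sketch. To stay within the standard library it evaluates J_n(I) from Bessel's integral, J_n(x) = (1/π) ∫₀^π cos(nτ − x·sin τ) dτ, using the trapezoid rule; a numerical library routine such as SciPy's `scipy.special.jv` would normally be used instead. The function names and example parameters are assumptions, and the sketch deliberately ignores the sign of each sideband and the folding of negative-frequency sidebands back into the positive range, so it reports magnitudes only.

```python
import math

def bessel_j(order, x, steps=400):
    """J_order(x) via Bessel's integral, trapezoid rule on [0, pi]."""
    h = math.pi / steps
    # Endpoint terms: the integrand is 1 at tau = 0 and cos(order * pi) at tau = pi.
    total = 0.5 * (1.0 + math.cos(order * math.pi))
    for k in range(1, steps):
        tau = k * h
        total += math.cos(order * tau - x * math.sin(tau))
    return total * h / math.pi

def fm_sample(t, f_c, f_m, index, amplitude=1.0):
    """One sample of x(t) = A * sin(2*pi*f_c*t + I * sin(2*pi*f_m*t))."""
    return amplitude * math.sin(2.0 * math.pi * f_c * t
                                + index * math.sin(2.0 * math.pi * f_m * t))

def fm_sidebands(f_c, f_m, index, max_order=8, amplitude=1.0):
    """List of (frequency_hz, magnitude) for the carrier and sidebands f_c +/- n*f_m."""
    bands = [(f_c, amplitude * abs(bessel_j(0, index)))]
    for n in range(1, max_order + 1):
        mag = amplitude * abs(bessel_j(n, index))
        bands.append((f_c + n * f_m, mag))
        bands.append((f_c - n * f_m, mag))  # may be negative; folding is not modelled
    return sorted(bands)

# Assumed example: carrier 440 Hz, modulator 110 Hz, two values of the index.
for index in (0.5, 3.0):
    strong = [(f, m) for f, m in fm_sidebands(440.0, 110.0, index) if m > 0.05]
    print(f"I={index}: " + ", ".join(f"{f:.0f} Hz ({m:.2f})" for f, m in strong))
```

Raising the index from 0.5 to 3.0 moves energy out of the carrier and into many more sidebands, which is the "brightness" control described above.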

Psychoacoustic relevance: FM synthesis became commercially dominant not because it most faithfully replicated real instruments but because it produced perceptually convincing sounds. It reproduced the cues listeners rely on, such as formant regions, attack-transient structure, and harmonic density, well enough for the ear to find them believable. This illustrates a fundamental principle: the goal of sound synthesis is not physical accuracy but perceptual plausibility.