Signal Processing · Acoustics · Math

📅 Квітень 2026 ⏱ ≈ 14 хв читання 🎯 Beginner–Intermediate

Fourier Transform and Sound

Every sound — a piano note, a human voice, a thunderclap — is a superposition of pure sinusoidal waves. The Fourier transform reveals this hidden frequency structure. It turns waveforms into spectra, unlocks audio compression, enables pitch detection and speech recognition, and powers virtually every digital signal-processing pipeline in existence. This article takes you from the mathematical definition to the Cooley-Tukey FFT and a live browser spectrogram.

1. Sinusoids and the Frequency Domain

A pure tone at frequency f Hz is a sinusoid: x(t) = A cos(2πft + φ). Complex exponential notation unifies amplitude and phase into a single number, using Euler's formula:

e^{i θ} = cos θ + i sin θ x(t) = Re{ A e^{i(2πft + φ)} }

The Fourier transform decomposes an arbitrary signal x(t) into its constituent complex exponentials:

X(f) = ∫₋∞^∞ x(t) e^{-i 2πft} dt // continuous Fourier transform x(t) = ∫₋∞^∞ X(f) e^{+i 2πft} df // inverse: synthesis from frequency components

|X(f)|² is the power spectral density — how much power the signal carries at frequency f. Plotting |X(f)| against f produces the familiar spectrum diagram. In digital audio, signals are sampled at a fixed rate f_s (44 100 Hz for CD audio); this leads to the discrete version.

Nyquist-Shannon theorem: a bandlimited signal with maximum frequency f_max can be perfectly reconstructed from samples taken at rate f_s > 2 f_max. At 44 100 Hz sampling, frequencies up to 22 050 Hz can be represented — covering the entire audible range (20 Hz–20 kHz).

2. The Discrete Fourier Transform (DFT)

Given N complex samples x[0] … x[N-1], the DFT produces N complex frequency-domain values X[0] … X[N-1]:

X[k] = Σₙ₌₀^{N-1} x[n] · e^{-i 2π kn / N} k = 0, 1, …, N-1 x[n] = (1/N) Σₖ₌₀^{N-1} X[k] · e^{+i 2π kn / N} // IDFT

Each output X[k] is the inner product of the input with a complex sinusoid at frequency k/N cycles per sample. For real audio sampled at f_s Hz, the physical frequency of bin k is:

f_k = k \cdot f_s / N // Hz (bins k = 0 \dots N/2 cover 0 Hz to f_s/2)

The naive DFT has O(N²) complexity — unacceptably slow for N = 4096 or larger windows. The FFT computes exactly the same values in O(N log N).

3. Cooley-Tukey Fast Fourier Transform

The key insight (Cooley & Tukey, 1965; also found by Gauss in 1805) is the divide-and-conquer split of a length-N DFT into two length-N/2 DFTs:

X[k] = E[k] + W_N^k · O[k] X[k + N/2] = E[k] − W_N^k · O[k] where E[k] = DFT of even-indexed terms x[0], x[2], x[4], … O[k] = DFT of odd-indexed terms x[1], x[3], x[5], … W_N^k = e^{-i 2πk/N} (twiddle factor)

Applied recursively on N = 2^m samples, this butterfly network requires O(N log₂ N) complex multiplications instead of O(N²) — for N = 4096 that is a 682× speedup.

Butterfly Diagram

Each butterfly stage combines a pair of complex values (a, b) into:

a' = a + W \cdot b b' = a - W \cdot b

The complete radix-2, decimation-in-time (DIT) FFT proceeds with log₂ N stages of N/2 butterflies each. The input indices are bit-reversed before the first stage (because the even/odd split corresponds to taking bits from LSB to MSB).

4. Windowing and Spectral Leakage

Applying the DFT to a finite-length block that abruptly cuts off a signal is equivalent to multiplying the signal by a rectangular window. In the frequency domain, multiplication becomes convolution with the rectangular window's spectrum (a sinc function). The wide sinc sidelobes "leak" energy from strong frequency components into neighbouring bins.

Window functions taper the block smoothly to zero at both ends, suppressing leakage at the cost of slightly wider main lobes (reduced frequency resolution):

Hann: w[n] = 0.5 − 0.5 cos(2πn/(N-1)). Good general-purpose choice.
Hamming: w[n] = 0.54 − 0.46 cos(2πn/(N-1)). Better sidelobe suppression than Hann.
Blackman-Harris: Very low sidelobes (−92 dB). Used when frequency resolution is less critical than sidelobe rejection.
Flat-top: Minimal amplitude error; used in precision amplitude measurement.

Choosing block size N: larger N gives finer frequency resolution (f_s/N Hz per bin) but poorer time resolution. For speech analysis, N = 512–2048 samples (11–46 ms at 44 100 Hz) with 50–75% overlap between successive blocks is typical. Musical pitch detection often uses N = 4096.

5. Short-Time Fourier Transform and Spectrograms

The Short-Time Fourier Transform (STFT) applies the DFT to short, overlapping windows slid along the signal, producing a 2-D time-frequency representation:

STFT{x}(m, k) = Σₙ x[n] · w[n − mH] · e^{-i 2πkn/N} m = frame index (time axis), H = hop size (samples per step), k = frequency bin, w = window function

Plotting |STFT(m, k)|² as an image (time horizontal, frequency vertical, intensity = power) produces a spectrogram. Musical notes appear as horizontal bright lines; formants and harmonics of speech are visible as vertical striping patterns.

The mel spectrogram warps the frequency axis from Hz to the perceptual mel scale (log-spaced at high frequencies), producing compact representations used widely in machine learning models for audio: wave-to-vec, Whisper, and MusicLM all process mel spectrograms.

6. JavaScript: FFT from Scratch

// Radix-2 Cooley-Tukey FFT (in-place, complex input/output)
// re[] and im[] are separate real and imaginary arrays of length N (power of 2)
function fft(re, im) {
  const N = re.length;

  // Bit-reversal permutation
  for (let i = 1, j = 0; i < N; i++) {
    let bit = N >> 1;
    for (; j & bit; bit >>= 1) j ^= bit;
    j ^= bit;
    if (i < j) {
      [re[i], re[j]] = [re[j], re[i]];
      [im[i], im[j]] = [im[j], im[i]];
    }
  }

  // Butterfly stages
  for (let len = 2; len <= N; len <<= 1) {
    const ang = -2 * Math.PI / len;       // e^{-i 2π/len}
    const wRe = Math.cos(ang), wIm = Math.sin(ang);
    for (let i = 0; i < N; i += len) {
      let curRe = 1, curIm = 0;
      for (let k = 0; k < len / 2; k++) {
        const uRe = re[i + k],           uIm = im[i + k];
        const vRe = re[i + k + len / 2], vIm = im[i + k + len / 2];
        const tRe = curRe * vRe - curIm * vIm;
        const tIm = curRe * vIm + curIm * vRe;
        re[i + k]           = uRe + tRe;
        im[i + k]           = uIm + tIm;
        re[i + k + len / 2] = uRe - tRe;
        im[i + k + len / 2] = uIm - tIm;
        const newCurRe = curRe * wRe - curIm * wIm;
        curIm = curRe * wIm + curIm * wRe;
        curRe = newCurRe;
      }
    }
  }
}

// Hann windowing function
function hann(n, N) { return 0.5 * (1 - Math.cos(2 * Math.PI * n / (N - 1))); }

// Magnitude spectrum (first half, in dB) for a block of real audio samples
function spectrum(samples) {
  const N = samples.length;
  const re = Float64Array.from(samples, (v, n) => v * hann(n, N));
  const im = new Float64Array(N);
  fft(re, im);
  const mag = new Float64Array(N / 2);
  for (let k = 0; k < N / 2; k++)
    mag[k] = 20 * Math.log10(Math.hypot(re[k], im[k]) + 1e-10);
  return mag;  // dBFS magnitudes
}

7. Pitch Detection with the Browser Microphone

The Web Audio API provides a real-time audio pipeline accessible in standard browsers. A simple implementation uses FFT-based peak detection supplemented by parabolic interpolation to refine the frequency estimate:

async function startPitchDetector() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx    = new AudioContext();
  const src    = ctx.createMediaStreamSource(stream);
  const N = 4096;
  const analyser = ctx.createAnalyser();
  analyser.fftSize = N;
  src.connect(analyser);

  const buf = new Float32Array(N);

  function tick() {
    analyser.getFloatTimeDomainData(buf);

    // Apply FFT (re-use our routine)
    const re = Float64Array.from(buf, (v, n) => v * hann(n, N));
    const im = new Float64Array(N);
    fft(re, im);

    // Find peak bin (skip DC bin 0)
    let peakBin = 1, peakMag = 0;
    for (let k = 1; k < N / 2; k++) {
      const m = re[k] * re[k] + im[k] * im[k];
      if (m > peakMag) { peakMag = m; peakBin = k; }
    }

    // Parabolic interpolation: refine sub-bin peak
    const a = Math.hypot(re[peakBin - 1], im[peakBin - 1]);
    const b = Math.hypot(re[peakBin],     im[peakBin]);
    const c = Math.hypot(re[peakBin + 1], im[peakBin + 1]);
    const refinedBin = peakBin + (a - c) / (2 * (a - 2 * b + c));
    const hz = refinedBin * ctx.sampleRate / N;

    console.log(`Pitch: ${hz.toFixed(1)} Hz`);
    requestAnimationFrame(tick);
  }
  tick();
}

Autocorrelation (YIN algorithm) is generally more robust for monophonic pitch estimation than FFT peak picking — it handles the "missing fundamental" artefact that occurs when a note's harmonics are present but the fundamental itself is weak.

8. Applications

MP3 and Audio Compression

MP3 uses the Modified Discrete Cosine Transform (MDCT), a variant of the Fourier transform, to convert audio blocks to the frequency domain. A psychoacoustic model then masks frequency components that human hearing cannot discern (masked by louder nearby frequencies), and allocates fewer bits to those components.

Noise Cancellation

Active noise-cancelling headphones analyse the ambient noise spectrum in real time and produce an anti-phase signal to suppress it. The delay budgets and filter design are governed entirely by Fourier analysis: attenuation at each frequency requires the anti-noise phase to be offset by exactly π radians.

Image Compression (JPEG)

JPEG applies a 2-D Discrete Cosine Transform (a real-valued version of the DFT) to 8×8 pixel blocks. Low-frequency DCT coefficients carry most of the visual energy and are retained; high-frequency coefficients are quantised aggressively. The same core idea — sparse representation in the frequency domain — also underpins JPEG 2000, HEIC, and modern neural codecs.

MRI Reconstruction

An MRI scanner measures the Fourier transform of tissue magnetisation (called k-space) directly. Reconstructing the tissue image is simply an inverse 2-D FFT. Compressed sensing theory shows that undersampled k-space can be reconstructed exactly if tissue is sparse in some transform domain — cutting scan time by 4–8× in modern clinical protocols.

〰️ Open Fourier Transform →