How Shazam Works (Probably!) - Computerphile

Computerphile
Education | 3 min read | 30 min video
Mar 15, 2021 | 190,228 views

TL;DR

Shazam identifies songs by using the Fast Fourier Transform (FFT) to extract frequency 'fingerprints' from short audio slices and matching them against a database of known tracks.

Key Insights

1. Shazam identifies songs by analyzing audio frequencies, not just tempo.
2. The Fast Fourier Transform (FFT) is a core algorithm for breaking sound down into its component frequencies.
3. The process creates a 'fingerprint' of prominent frequencies within short audio slices.
4. A database of these fingerprints from known songs allows recorded audio to be matched.
5. The system accounts for variations in recording quality and background noise.
6. Efficient matching algorithms, potentially using hash tables, are crucial for speed.

THE CORE FUNCTIONALITY OF SHAZAM

Shazam is a popular service that identifies music by listening to a short snippet of audio. Users hold up their smartphone, press a button, and the app reveals the song's title and artist. While the exact, current algorithms are proprietary, the fundamental principles have been revealed through research and early implementations, allowing for a strong understanding of how such a service operates.

THE ROLE OF FAST FOURIER TRANSFORM (FFT)

At the heart of audio analysis for services like Shazam is the Fast Fourier Transform (FFT). This mathematical algorithm breaks down a complex audio waveform into its constituent frequencies. Imagine a sound as a combination of many simple sine waves at different frequencies and amplitudes; FFT's job is to identify these individual components and their loudness, effectively deconstructing the sound into its spectral ingredients.
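A minimal sketch of this idea in Python (using NumPy, not Shazam's actual code): mix two sine waves of known frequencies, then recover them from the FFT's magnitude spectrum. The sample rate and tone frequencies here are arbitrary choices for illustration.

```python
import numpy as np

sample_rate = 8000                        # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate  # one second of time points

# A "complex" sound built from two simple sine waves at 440 Hz and 1000 Hz.
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.abs(np.fft.rfft(signal))           # loudness at each frequency bin
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The two loudest bins should sit exactly at the frequencies we mixed in.
loudest = sorted(freqs[np.argsort(spectrum)[-2:]])
print(loudest)   # → [440.0, 1000.0]
```

Because the FFT reports amplitude per frequency, it also recovers relative loudness: the 440 Hz bin here is twice as strong as the 1000 Hz one.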

CREATING AUDIO FINGERPRINTS

To identify a song, Shazam doesn't analyze the entire audio stream at once. Instead, it slices the audio into small, typically 100-millisecond chunks. For each chunk, an FFT is applied to determine the prominent frequencies present. The magnitudes of these frequencies are recorded, creating a unique 'fingerprint' for that brief moment in the song. This process is repeated for numerous small segments throughout the track.
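The slicing step can be sketched as follows; this toy version keeps only the single loudest frequency per 100 ms chunk (a real fingerprint would keep more detail), and the test tone that switches frequency halfway through stands in for a song.

```python
import numpy as np

sample_rate = 8000
chunk_len = sample_rate // 10            # 100 ms = 800 samples at 8 kHz

t = np.arange(sample_rate * 2) / sample_rate   # two seconds of "song"
# A tone that jumps from 500 Hz to 1500 Hz at the one-second mark.
audio = np.where(t < 1.0,
                 np.sin(2 * np.pi * 500 * t),
                 np.sin(2 * np.pi * 1500 * t))

peaks = []
for start in range(0, len(audio), chunk_len):
    chunk = audio[start:start + chunk_len]
    spectrum = np.abs(np.fft.rfft(chunk))
    freqs = np.fft.rfftfreq(len(chunk), d=1 / sample_rate)
    peaks.append(freqs[np.argmax(spectrum)])   # loudest frequency in this slice

print(peaks)   # first ten ≈ 500 Hz, last ten ≈ 1500 Hz
```

The resulting sequence of per-slice peaks is the kind of time-ordered frequency trace that later stages compress and match.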

IDENTIFYING PROMINENT FREQUENCIES AND BUCKETING

When analyzing an audio slice, the system focuses on the most significant frequencies rather than exhaustively cataloging every single one. It groups frequencies into 'buckets' and identifies the loudest frequency within each. This compression reduces the data significantly. For instance, a range of frequencies might be collapsed into a single data point representing the peak amplitude in that range. This process generates a series of these 'bucketed' prominent frequencies over time.
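Bucketing can be sketched like this; the band edges below are assumptions for illustration, not Shazam's actual choices. One 100 ms chunk containing three tones collapses to just one peak frequency per band.

```python
import numpy as np

def bucketed_peaks(spectrum, freqs, band_edges):
    """Return the loudest frequency inside each [lo, hi) band."""
    peaks = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        peaks.append(freqs[mask][np.argmax(spectrum[mask])])
    return peaks

sample_rate = 8000
t = np.arange(800) / sample_rate         # one 100 ms chunk
chunk = (np.sin(2 * np.pi * 300 * t) +
         np.sin(2 * np.pi * 900 * t) +
         np.sin(2 * np.pi * 2500 * t))
spectrum = np.abs(np.fft.rfft(chunk))
freqs = np.fft.rfftfreq(len(chunk), d=1 / sample_rate)

# Three coarse bands (edges assumed): lows, mids, highs.
peaks = bucketed_peaks(spectrum, freqs, [0, 500, 1500, 4000])
print(peaks)   # → [300.0, 900.0, 2500.0]
```

Four hundred spectrum bins have been compressed to three numbers, which is what makes storing and comparing millions of songs tractable.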

THE DATABASE AND MATCHING PROCESS

A vast database stores these frequency fingerprints for millions of songs. When a user records a clip, their phone generates a similar fingerprint. The core challenge is efficiently matching the recorded fingerprint against the database. Instead of a direct, exhaustive comparison, the system likely uses an 'anchor point' strategy. It looks for a unique frequency signature from the recorded clip within the database and then searches for a sequence of subsequent matching signatures within a tolerance, indicating a likely song match.
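A toy sketch of this matching idea (the real scheme is more elaborate): an index maps each fingerprint value to (song, time offset), and a clip's fingerprints vote. Many matches at a consistent time shift within one song indicate a hit. The songs and fingerprint values below are invented.

```python
from collections import Counter, defaultdict

# Fake per-slice fingerprints for two known songs.
songs = {
    "song_a": [440, 440, 880, 660, 440, 330, 220],
    "song_b": [220, 330, 550, 770, 990, 550, 330],
}

# Inverted index: fingerprint value → list of (song, offset) occurrences.
index = defaultdict(list)
for name, prints in songs.items():
    for offset, fp in enumerate(prints):
        index[fp].append((name, offset))

def identify(clip):
    votes = Counter()
    for pos, fp in enumerate(clip):
        for name, offset in index.get(fp, []):
            # Matches from the same song at a consistent time shift
            # all land on the same (song, shift) key.
            votes[(name, offset - pos)] += 1
    (name, _shift), _count = votes.most_common(1)[0]
    return name

# A clip recorded from the middle of song_a:
print(identify([880, 660, 440]))   # → song_a
```

Requiring a consistent shift, not just shared fingerprint values, is what distinguishes a genuine sequence match from coincidental overlaps between songs.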

HANDLING REAL-WORLD CONDITIONS

The system must be robust to variations in audio quality, background noise, and the specific recording device. Even with lower-quality microphones or added chatter, the most prominent frequencies that define a song often remain detectable. The matching algorithm compensates for potential discrepancies, such as missing higher or lower frequencies due to recording limitations, by looking for a pattern of points rather than an exact replica, thereby ensuring reliable identification even in noisy environments.
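The tolerance idea can be sketched as follows; the 20 Hz tolerance and 70% threshold are assumptions for illustration. A clip counts as matching a stored run if most of its peak frequencies agree within the tolerance, so a peak lost to noise or drifted by a cheap microphone does not break identification.

```python
def fraction_matching(clip, stored, tol_hz=20):
    """Fraction of aligned peak frequencies that agree within tol_hz."""
    hits = sum(1 for c, s in zip(clip, stored) if abs(c - s) <= tol_hz)
    return hits / len(clip)

stored = [440, 880, 660, 440, 330]     # fingerprint run in the database
noisy  = [450, 870, 0, 445, 335]       # one peak lost to noise, others drifted

score = fraction_matching(noisy, stored)
print(score)           # → 0.8
print(score >= 0.7)    # still counts as a match
```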

OPTIMIZING FOR SPEED AND EFFICIENCY

Achieving near-instantaneous results requires highly optimized algorithms. The naive approach of comparing entire audio segments would be computationally prohibitive. By reducing the audio data to a series of prominent frequency points and using smart matching logic—potentially involving hash tables for quick database lookups—the system can swiftly compare a short audio clip against millions of song fingerprints, delivering the song identification in seconds.
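The speed difference is easy to demonstrate: an in-memory dict (a hash table) maps a fingerprint straight to its song in O(1) average time, where a naive linear scan touches every stored entry. The database contents below are synthetic.

```python
import time

# A synthetic index of one million fingerprint → song entries.
database = {fp: "song_%d" % (fp % 100) for fp in range(1_000_000)}

def naive_lookup(fp):
    for stored, song in database.items():   # linear scan over the whole index
        if stored == fp:
            return song
    return None

fp = 987_654

t0 = time.perf_counter(); fast = database.get(fp); t_hash = time.perf_counter() - t0
t0 = time.perf_counter(); slow = naive_lookup(fp); t_scan = time.perf_counter() - t0

print(fast == slow)      # same answer either way
print(t_scan > t_hash)   # but the hash lookup is far quicker
```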

Common Questions

How does Shazam identify a song from a short clip?

Shazam analyzes a short audio clip by breaking it down into its component frequencies using FFT. It then creates a unique 'fingerprint' of prominent frequency points and compares this against a massive database of song fingerprints to find a match.
