Audio Fingerprinting - A Decomposing Study

Hultman, Victor; Gälldin, Niklas

Hultman, Victor and Gälldin, Niklas (2024) EITM01 20241
Department of Electrical and Information Technology

Abstract: Audio fingerprinting is a widely employed technique that involves generating unique fingerprints for given audio signals that can later be used for identification. A well-known example of this is the Shazam application, where the concept is to match a short song snippet with a database to find the name of the song and artist. Generally, the audio fingerprints are created by applying a time-frequency transform to the audio signal and extracting the most prominent features in the time-frequency domain. There are different transforms with different properties, but the standard choice is the short-time Fourier transform (STFT). This study compares the performance of the STFT with the Hyper Localized Wavelet Transform (HLT) within an audio fingerprinting pipeline, focusing on three key metrics: correctly identifying songs (accuracy), robustness towards noise, and memory usage. Results indicate that while the STFT and the HLT demonstrate comparable accuracy, the latter exhibits superior noise robustness with smaller memory usage. The STFT was found to generate approximately 1.23 times more data than the HLT when creating the fingerprint database.
Abstract (Swedish, translated to English): Audio fingerprinting is a well-known technique that generates unique fingerprints for audio signals, which can later be used for identification. A well-known example is the Shazam application, whose concept is to match a short song snippet against a database to find the name of the song and the artist. Generally, audio fingerprints are created by applying a time-frequency transform to the audio signal and extracting the most prominent components in the time-frequency domain. The short-time Fourier transform (STFT) is the standard choice, but there are also transforms with other properties. This study compares the performance of the STFT with the Hyper Localized Wavelet Transform (HLT) within an audio fingerprinting process, focusing on three key metrics: correct identification of songs (precision), robustness to noise, and memory usage. The results show that while the STFT and the HLT exhibit comparable precision, the latter shows superior robustness to noise with lower memory usage. Furthermore, the STFT was found to generate approximately 1.23 times more data than the HLT when creating the fingerprint database.
Popular Abstract: Have you ever found yourself in a situation where you hear a captivating song at a pub or a restaurant, but cannot recall its name or artist? Searching for it based on fragmented lyrics often proves futile and frustrating. However, audio fingerprinting offers a solution to this common dilemma. By simply recording a snippet of the song with your phone, a unique fingerprint of the song can be extracted and matched with a large database of song fingerprints in no time.

The first step of audio fingerprinting is to perform a time-frequency decomposition of the audio to find the most characteristic frequencies over time. This Master's thesis explores two different methods for time-frequency decomposition, aiming to enhance the precision and robustness of audio fingerprinting systems for song identification. By comparing the short-time Fourier transform (STFT) and the Hyper Localized Wavelet Transform (HLT), this study seeks to evaluate their accuracy in correctly identifying songs.

Time-frequency decomposition methods play a pivotal role in extracting meaningful features from audio signals that are later used to create audio fingerprints. In short, the audio signal is divided into short time segments, and the frequencies that occur within each segment are extracted from the decomposition. The audio fingerprint is then created by mapping the frequencies from each segment in a way that is as unique as possible for each audio signal.
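As a rough illustration of the segment-then-map idea described above, the following Python sketch frames a signal, computes a plain discrete Fourier transform per frame, keeps the strongest frequency bin of each frame, and pairs nearby peaks into hashes. It is a minimal sketch under stated assumptions, not the thesis's pipeline: the frame length, hop size, one-peak-per-frame rule, and the pairing scheme (`fan_out`) are all illustrative choices, the `tone` helper is a hypothetical test signal, and the HLT is not shown because the record does not define it.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len=64, hop=32):
    """Split the signal into overlapping Hann-windowed frames and return
    the DFT magnitudes of each frame for bins 0..frame_len // 2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def peak_bins(frames):
    """Keep only the strongest frequency bin of each frame (an assumption;
    real systems usually keep several peaks per region)."""
    return [max(range(len(mags)), key=mags.__getitem__) for mags in frames]


def fingerprint(peaks, fan_out=2):
    """Pair each peak with the next few peaks: (bin_a, bin_b, frame_gap)."""
    hashes = set()
    for i, bin_a in enumerate(peaks):
        for gap in range(1, fan_out + 1):
            if i + gap < len(peaks):
                hashes.add((bin_a, peaks[i + gap], gap))
    return hashes


def tone(freq_bin, n_samples, frame_len=64):
    """Hypothetical test signal: a sine centred on one DFT bin."""
    return [math.sin(2 * math.pi * freq_bin * n / frame_len)
            for n in range(n_samples)]


# A "song" made of two tones, and a short "query" snippet from its start.
song = tone(5, 128) + tone(12, 128)
query = tone(5, 128)
song_hashes = fingerprint(peak_bins(stft_magnitudes(song)))
query_hashes = fingerprint(peak_bins(stft_magnitudes(query)))
shared = song_hashes & query_hashes
```

In this toy setup, a query is scored by how many of its hashes also appear in the stored set for a song; a real database would index hashes across many songs and also check that the matching hashes line up in time.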

However, more often than not, some noise or distortion is present when we want to identify a song. Noise makes it harder to obtain an accurate time-frequency decomposition, which is crucial for creating a unique audio fingerprint to match with the database. It is therefore important to use a time-frequency decomposition that is accurate and resistant to noise.
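One common way to probe this kind of noise robustness is to mix a clean signal with noise at a controlled signal-to-noise ratio (SNR) and check whether identification still succeeds. The sketch below assumes additive white Gaussian noise and an SNR expressed in decibels; the record does not state which noise model or levels the thesis used, so the function names and parameters here are illustrative only.

```python
import math
import random


def add_noise(signal, snr_db, seed=0):
    """Return the signal plus white Gaussian noise scaled so that the
    expected signal-to-noise ratio equals snr_db (in decibels)."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR actually achieved, from the clean and noisy signals."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical clean signal: a sine wave of 20000 samples.
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(20000)]
noisy = add_noise(clean, snr_db=10)
achieved = measured_snr_db(clean, noisy)
```

Lower `snr_db` values mean more noise relative to the signal, so sweeping it downward gives a simple robustness curve for any fingerprinting method under test.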

Our experiments reveal that the STFT and the HLT are quite similar in accurately identifying songs, but the HLT is superior when more noise is present. The HLT requires more time to perform the decomposition, but in a practical setting this is arguably not a significant drawback.

In conclusion, this Master's thesis highlights the significance of time-frequency decomposition methods in audio fingerprinting for song recognition. By providing insights into the performance of the STFT and the HLT, this study shows the differences and potential of both methods.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9169577

author: Hultman, Victor and Gälldin, Niklas
supervisor: Fredrik Edman
organization: Department of Electrical and Information Technology
course: EITM01 20241
year: 2024
type: H2 - Master's Degree (Two Years)
subject: Technology and Engineering
keywords: Audio Fingerprinting, Time-frequency decomposition, Resolution, Shazam
report number: LU/LTH-EIT 2024-1004
language: English
id: 9169577
date added to LUP: 2024-07-04 09:52:32
date last changed: 2024-07-04 09:52:32

@misc{9169577,
  abstract     = {{Audio fingerprinting is a widely employed technique that involves generating unique fingerprints for given audio signals that later can be used for identification. A well-known example of this is the Shazam application where the concept is to match a short song snippet with a database to find the name of the song and artist. Generally, the audio fingerprints are created by applying a time-frequency transform on the audio signal and extracting the most prominent features in the time-frequency domain. There are different transforms with different properties but the standard choice is the short-time Fourier transform (STFT). This study compares the performance of the STFT with the Hyper Localized Wavelet Transform (HLT) within an audio fingerprinting pipeline, focusing on three key metrics: correctly identifying songs (accuracy), robustness towards noise, and memory. Results indicate that while the STFT and the HLT demonstrate comparable accuracy, the latter exhibits superior noise robustness with a smaller memory usage. The STFT was found to generate approximately 1.23 times more data when creating the fingerprint database compared to the HLT.}},
  author       = {{Hultman, Victor and Gälldin, Niklas}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Audio Fingerprinting - A Decomposing Study}},
  year         = {{2024}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
