Diffusion-based Vocoding for Real-Time Text-To-Speech

Gardberg, Lukas

Gardberg, Lukas ^LU (2023) In Master's Theses in Mathematical Sciences FMSM01 20231
Mathematical Statistics

Abstract: The emergence of machine learning based text-to-speech systems has made fully automated customer service voice calls, spoken personal assistants, and the creation of synthetic voices seem well within reach. However, many technical challenges remain in building a system that generates audio both quickly and at high enough quality. One critical component of the typical text-to-speech pipeline is the vocoder, which produces the final waveform. This thesis investigates solving the vocoder problem using a statistical framework called diffusion, which is used to teach a neural network to sequentially transform noise into recorded speech. Experiments are done by extending the framework with three different theoretical improvements and evaluating a range of diffusion-based vocoders that use these improvements with respect to inference speed and audio quality. In addition, a new variant of one of these improvements, the "variance schedule", is proposed and shown to perform on par with previously adopted methods. Greater training stability is also achieved via methods inspired by diffusion models for image generation.
The extensions of the framework are found to have a mostly positive effect on model performance, and audio can be generated at a quality equal to current state-of-the-art vocoders based on Generative Adversarial Networks, but not at the same speeds. Furthermore, we find that it is possible for a diffusion-based vocoder to achieve a 12 times speed-up while retaining comparable audio quality, and are convinced that further speed-ups are possible. Inference for a real-time text-to-speech application is thought to be viable using a graphics processing unit, but not a central processing unit.
Popular Abstract (translated from Swedish): Alongside the pursuit of AI models that can create extremely realistic text and images, there is also the goal of generating human speech. The aim of this field, called "speech synthesis", is to create a synthetic human voice capable of reading a given text aloud. This can be used, for example, in automated customer service or for personal assistants. The problem has traditionally been solved in several steps. The first step converts text into a more specific representation of the pronunciation, such as phonetic symbols. The second step converts these into a spectrogram, that is, a low-dimensional representation of which frequencies the speech contains. Finally, this is converted into speech in the form of an audio file, which is done by a model called a "vocoder".

This work explores the use of a modern statistical method called "diffusion" to perform the final vocoder step of the speech synthesis process. This is made possible by a large amount of audiobook recordings, which are used to teach the model what human speech sounds like. By repeatedly letting the model try to generate speech for different phrases, and then adjusting the model according to how wrong it was, better and better results can be achieved step by step. This approach has become feasible thanks to greater access to powerful computing hardware and to more advanced models such as neural networks.

The main idea of diffusion is to start from speech recordings and gradually destroy them by adding more and more noise. This noise can, for example, be white noise, a completely random audio signal that contains equal amounts of all frequencies. This yields speech recordings with varying amounts of noise, ending in pure white noise. The core of diffusion is then to learn to invert this process, that is, to use the differently noisy audio clips to teach a model to remove the noise step by step. When a text is then to be read aloud, the diffusion model looks at the frequencies that the speech should contain, as given by the previous step, and then gradually "carves" away noise from an initial noise signal until human speech emerges.
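
The noising half of this idea can be illustrated with a small sketch. It assumes the common formulation in which a noisy clip is a weighted mix of the clean signal and Gaussian white noise, with the weights set by a per-step noise variance. The linear schedule, its start and end values, the step count, the sine-wave stand-in for a recording, and all function names are illustrative assumptions; they are not taken from the thesis and do not represent its proposed variance schedule.

```python
import math
import random


def linear_variance_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    """Hypothetical linear schedule of per-step noise variances (assumed values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal_levels(betas):
    """Running product of (1 - beta): share of the clean signal's variance left."""
    levels, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        levels.append(running)
    return levels


def add_noise(clean, signal_level, rng):
    """Mix the clean samples with Gaussian white noise at the given signal level."""
    keep = math.sqrt(signal_level)
    spread = math.sqrt(1.0 - signal_level)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in clean]


rng = random.Random(0)
# Toy stand-in for a speech recording: 0.1 s of a 440 Hz tone at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]

betas = linear_variance_schedule(num_steps=200)
signal_levels = cumulative_signal_levels(betas)

slightly_noisy = add_noise(clean, signal_levels[0], rng)       # first step
almost_pure_noise = add_noise(clean, signal_levels[-1], rng)   # last step
```

A vocoder of this kind would be trained to undo these steps one at a time, guided by the spectrogram; that learned denoising network is not part of this sketch.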

One of the biggest challenges investigated in this work is how fast this process can be made, since the timing requirements in a practical application are usually very strict. By reducing the number of steps in the noise removal, a speed 20 times faster than real time was achieved on a graphics card, that is, generating 20 seconds of speech in 1 second. Furthermore, a more sophisticated initial noise was also explored, which gives the model a head start in the noise removal and thereby achieves higher-quality audio in the same amount of time. Finally, an improvement to how the model is trained was also tested, which resulted in a more stable training process.
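
The "times faster than real time" figure is simply the ratio of audio duration to generation time. The following sketch spells out that arithmetic; the function names are hypothetical and the 60-second example is an illustration, not a measurement from the thesis.

```python
def speedup_over_real_time(audio_seconds, generation_seconds):
    """Seconds of audio produced per second of computation."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def generation_time(audio_seconds, speedup):
    """Time needed to synthesize a clip at a given speed-up over real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# The figure quoted above: 20 seconds of speech generated in 1 second.
reported_speedup = speedup_over_real_time(20.0, 1.0)

# Illustrative only: at that rate a 60-second clip would take 3 seconds.
one_minute_clip = generation_time(60.0, reported_speedup)
```

Any value above 1 means the vocoder produces audio faster than it is played back, which is the minimum requirement for real-time use.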

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9113799

author

Gardberg, Lukas ^LU

supervisor

Maria Sandsten ^LU
Michael Truong

organization

Mathematical Statistics

course

FMSM01 20231

year

2023

type

H2 - Master's Degree (Two Years)

subject

Mathematics and Statistics

keywords

Diffusion, Vocoding, Text-to-Speech, Machine Learning

publication/series

Master's Theses in Mathematical Sciences

report number

LUTFMS-3469-2023

ISSN

1404-6342

other publication id

2023:E12

language

English

id

9113799

date added to LUP

2023-04-27 10:45:44

date last changed

2023-04-28 14:24:14

@misc{9113799,
  abstract     = {{The emergence of machine learning based text-to-speech systems have made fully automated customer service voice calls, spoken personal assistants, and the creation of synthetic voices seem well within reach. However, there are still many technical challenges with creating such a system which can generate audio quickly and of high enough quality. One critical component of the typical text-to-speech pipeline is the vocoder, which is responsible for producing the final waveform in the process. This thesis investigates solving the vocoder problem using a statistical framework called diffusion, which is used to teach a neural network to sequentially transform noise into recorded speech. Experiments are done by extending the framework with three different theoretical improvements, and evaluating a range of different diffusion-based vocoders which use these improvements with respect to inference speed and audio quality. In addition to this, a new variant of one such improvement is proposed, called a ”variance schedule”, which is shown to perform on par with previously adopted methods. Greater training stability is also achieved via methods inspired by diffusion models for image generation. The extensions of the framework are found to have a mostly positive effect on model performance, and audio is shown to be able to be generated at a quality equal to current state-of-the-art vocoders based on Generative Adversarial Networks, but not at the same speeds. Furthermore, we find that it is possible for a diffusion-based vocoder to achieve a 12 times speed up while retaining a comparable audio quality, and are convinced that further speed ups are possible. Inference for a real-time text-to-speech application is thought to be viable using a graphics processing unit, but not a central processing unit.}},
  author       = {{Gardberg, Lukas}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Diffusion-based Vocoding for Real-Time Text-To-Speech}},
  year         = {{2023}},
}

