Generative Adversarial Networks in Lip-Synchronized Deepfakes for Personalized Video Messages

Liljegren, Johan; Nordqvist, Pontus

Liljegren, Johan and Nordqvist, Pontus (2021) In Master's Theses in Mathematical Sciences FMAM05 20211
Mathematics (Faculty of Engineering)

Abstract: Recent progress in deep learning has enabled more powerful frameworks for creating good-quality deepfakes. Deepfakes, which are mostly known for malicious uses, have great potential to be useful in areas such as the movie industry, education, and personalized messaging. This thesis focuses on lip-synchronization, which is one part of a broader pipeline for developing personalized video messages using deepfakes. For this application, the deep learning framework Generative Adversarial Networks (GAN), adapted to a given audio and video input, was used. The objectives were to implement a structure that performs lip-synchronization, to investigate which variations of GANs excel at this task, and to examine how different datasets affect the results.

Three models were investigated. First, the GAN architecture LipGAN was reimplemented in PyTorch. Second, a GAN variant, WGAN-GP, was adapted to the LipGAN architecture. Third, a novel approach inspired by both models, L1WGAN-GP, was developed and implemented. All models were trained on the GRID dataset and benchmarked using the metrics PSNR, SSIM, and FID-score. Lastly, the influence of the training dataset was tested by comparing our implementation of LipGAN with another implementation trained on a different dataset, LRS2.

WGAN-GP did not converge and showed suspected mode collapse. Of the other two models, the LipGAN implementation performed best in terms of PSNR and SSIM, whereas L1WGAN-GP performed better than LipGAN according to the FID-score. However, L1WGAN-GP produced samples that were polluted by artifacts. Our models trained on the GRID dataset generalized poorly compared to the same model trained on LRS2. Additionally, models trained on less data were outperformed by models trained on the full dataset.

Finally, our results suggest that LipGAN was the best-performing network, and with it we managed to produce satisfactory lip-synchronization.
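
For readers unfamiliar with PSNR, one of the benchmark metrics named in the abstract, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). It is not taken from the thesis: the function name `psnr`, the flat-list image representation, and the default `max_value` of 255 (8-bit pixels) are assumptions made for illustration only.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images.

    Both images are given as flat sequences of pixel intensities
    (an illustrative assumption; real pipelines typically use arrays).
    """
    if len(reference) != len(generated):
        raise ValueError("images must contain the same number of pixels")
    if len(reference) == 0:
        raise ValueError("images must not be empty")

    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the generated frame is closer to the reference frame pixel by pixel. This is a different notion of quality from the distribution-level comparison that FID makes, which is one reason two metrics can rank models differently.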
Popular Abstract (translated from Swedish): Using the latest advances in machine learning, we are developing a product that can automatically produce the next generation of video messages, with the help of so-called "deepfake" technology.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9060411

author

Liljegren, Johan and Nordqvist, Pontus

supervisor

Carina Geldhauser

organization

Mathematics (Faculty of Engineering)

course

FMAM05 20211

year

2021

type

H2 - Master's Degree (Two Years)

subject

keywords

Generative Adversarial Networks, GAN, Lip-Synchronization, Deepfake, Deep Learning, Autoencoder, WGAN, WGAN-GP, L1WGAN-GP, Skip-Connections, FID-Score

publication/series

Master's Theses in Mathematical Sciences

report number

LUTFMA-3450-2021

ISSN

1404-6342

other publication id

2021:E33

language

English

id

9060411

date added to LUP

2021-07-14 14:41:16

date last changed

2021-07-14 14:41:16

@misc{9060411,
  abstract     = {{Recent progress in deep learning has enabled more powerful frameworks for creating good-quality deepfakes. Deepfakes, which are mostly known for malicious uses, have great potential to be useful in areas such as the movie industry, education, and personalized messaging. This thesis focuses on lip-synchronization, which is one part of a broader pipeline for developing personalized video messages using deepfakes. For this application, the deep learning framework Generative Adversarial Networks (GAN), adapted to a given audio and video input, was used. The objectives were to implement a structure that performs lip-synchronization, to investigate which variations of GANs excel at this task, and to examine how different datasets affect the results.

Three models were investigated. First, the GAN architecture LipGAN was reimplemented in PyTorch. Second, a GAN variant, WGAN-GP, was adapted to the LipGAN architecture. Third, a novel approach inspired by both models, L1WGAN-GP, was developed and implemented. All models were trained on the GRID dataset and benchmarked using the metrics PSNR, SSIM, and FID-score. Lastly, the influence of the training dataset was tested by comparing our implementation of LipGAN with another implementation trained on a different dataset, LRS2.

WGAN-GP did not converge and showed suspected mode collapse. Of the other two models, the LipGAN implementation performed best in terms of PSNR and SSIM, whereas L1WGAN-GP performed better than LipGAN according to the FID-score. However, L1WGAN-GP produced samples that were polluted by artifacts. Our models trained on the GRID dataset generalized poorly compared to the same model trained on LRS2. Additionally, models trained on less data were outperformed by models trained on the full dataset.

Finally, our results suggest that LipGAN was the best-performing network, and with it we managed to produce satisfactory lip-synchronization.}},
  author       = {{Liljegren, Johan and Nordqvist, Pontus}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Generative Adversarial Networks in Lip-Synchronized Deepfakes for Personalized Video Messages}},
  year         = {{2021}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
