
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Machine Learning-based Multimodal Data Compression

Forsell, Jacob and Jin, Yuyang (2024) EITM01 20241
Department of Electrical and Information Technology
Abstract
The field of learned image compression has seen rapid development and strong research engagement. In this thesis, we seek to contribute to the field by extending a state-of-the-art learned image compression architecture, LIC-TCM, with a depth map as a second, complementary modality to further enhance image compression. We also explore the inverse: primarily compressing a depth map (which can be expressed as an image) with LIC-TCM while incorporating the corresponding image frame as the secondary, complementary modality. An important goal of this work has been to keep complexity low, both in encoding and decoding time and in implementation effort, and this is reflected in our proposed architectures.

In this thesis, we propose three architectures. The first, Attention-based Multimodal-LIC-TCM, uses a depth map as the secondary modality and expands the encoder so that depth is incorporated as a token in its Swin Transformer blocks. The second, Convolution-based Multimodal-LIC-TCM, uses an image as the secondary modality and introduces a convolution-based module that extracts features from both the image and depth modalities and then fuses them for further enhanced compression. The third, 4-channel LIC-TCM, jointly accepts an image and a depth map as input and reconstructs either the image or the depth map depending on configuration. For this last architecture, the LIC-TCM network itself is left unaltered, which minimizes added complexity.
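To make the third variant concrete, the sketch below shows how an RGB frame and a depth map can be stacked into a single 4-channel tensor and passed to an otherwise unmodified learned codec. This is a minimal PyTorch sketch under our own assumptions: ToyCodec is a stand-in for LIC-TCM, not the actual model, and only the channel stacking reflects the idea described above.

import torch
import torch.nn as nn

# Hypothetical stand-in for an unmodified learned codec such as LIC-TCM.
# Only the number of input channels differs from a standard 3-channel model.
class ToyCodec(nn.Module):
    def __init__(self, in_channels: int = 4, latent_channels: int = 192):
        super().__init__()
        # One strided convolution stands in for the real analysis transform.
        self.encoder = nn.Conv2d(in_channels, latent_channels,
                                 kernel_size=5, stride=2, padding=2)
        # The synthesis transform reconstructs the 3-channel RGB target.
        self.decoder = nn.ConvTranspose2d(latent_channels, 3, kernel_size=5,
                                          stride=2, padding=2, output_padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.encoder(x)      # latent that would be entropy coded
        return self.decoder(y)   # reconstruction of the target modality

rgb = torch.rand(1, 3, 256, 256)    # color image, values in [0, 1]
depth = torch.rand(1, 1, 256, 256)  # depth map normalized to [0, 1]

# Stack the two modalities along the channel dimension: (B, 4, H, W).
rgbd = torch.cat([rgb, depth], dim=1)

model = ToyCodec(in_channels=4)
print(model(rgbd).shape)  # torch.Size([1, 3, 256, 256])

To target the depth map instead of the image, the decoder's output channels would simply be set to 1 while the 4-channel input is kept unchanged.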

Generally, we find that all three architectures improve reconstruction performance without affecting the compressed size, although the improvement is not significant. The attention- and convolution-based architectures introduce additional time and implementation complexity, whereas 4-channel LIC-TCM is unaffected in this respect. These results point to promising research directions, but further investigation is required for more conclusive results.
Popular Abstract
Imagine you're having a text conversation with a friend who hints that their favorite animal is among the most common livestock in the world, and they want you to guess what it is. Your mind starts narrowing down the possibilities: is it a sheep, a cow, or a goat? Then your friend sends you a sound clip of the animal making a 'moo' sound, and you realize it's a cow. In solving this problem, you processed information from two different channels, text and sound. By combining these channels of information, or modalities, your final prediction became certain.

In this work, we investigate whether multiple modalities describing the same scene can be used jointly to improve the performance of machine learning models. Just as combining text and sound made the cow easy to identify, we aim to combine different kinds of sensory data from a camera to achieve better file compression.

Currently, two common methods for compressing standard color images are PNG and JPEG. PNG saves the image without compromising quality, whereas JPEG achieves smaller file sizes at the expense of image quality. In other words, PNG is a lossless method, while JPEG is lossy. Our focus in this work, however, isn't on conventional compression algorithms; instead, we aim to use lossy machine learning-based models to enhance image compression.
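For readers who want to see the lossless/lossy distinction in practice, the short Python sketch below (using the Pillow and NumPy libraries, which are our own choice for illustration and not something used in the thesis) saves the same image once as PNG and once as JPEG, then checks whether the pixels survive the round trip. Relative file sizes depend heavily on image content; for natural photographs, JPEG at moderate quality is usually far smaller than PNG.

import os
import numpy as np
from PIL import Image

# A small synthetic gradient image; any RGB photograph would work as well.
col = np.linspace(0, 255, 256).astype(np.uint8)
array = np.stack([np.tile(col, (256, 1))] * 3, axis=-1)
image = Image.fromarray(array, mode="RGB")

image.save("demo.png")              # lossless: every pixel is preserved exactly
image.save("demo.jpg", quality=75)  # lossy: the quality parameter trades size for fidelity

png_back = np.asarray(Image.open("demo.png"))
jpg_back = np.asarray(Image.open("demo.jpg"))

print("PNG size (bytes): ", os.path.getsize("demo.png"))
print("JPEG size (bytes):", os.path.getsize("demo.jpg"))
print("PNG identical to original:", bool(np.array_equal(png_back, array)))
print("JPEG mean absolute pixel error:",
      float(np.abs(jpg_back.astype(int) - array.astype(int)).mean()))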

The objective of this work is to enhance a modern state-of-the-art machine learning-based image compression model by incorporating multimodal properties. Specifically, we aim to enable the model to combine two modalities, RGB (a standard color image) and depth, during compression. Depth data comes from a sensor that measures the distance from the camera to the objects in the scene. By integrating these two modalities, the machine learning-based model can learn complex relationships between the sensory data and achieve higher compression rates while maintaining good quality in the RGB image. The decision to combine RGB and depth was deliberate: many modern cameras, such as those on smartphones, are equipped with both a standard camera and a depth sensor. If our compression method proves to be more effective, it could therefore mean increased disk space savings on a device!
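Models of this kind are commonly trained to balance compressed size against reconstruction quality with a rate-distortion objective. The sketch below shows that generic trade-off; the use of mean squared error and the lambda weight are common choices in the learned-compression literature, not necessarily the exact configuration used in this thesis.

import torch

def rate_distortion_loss(bits_per_pixel: torch.Tensor,
                         original: torch.Tensor,
                         reconstruction: torch.Tensor,
                         lmbda: float = 0.01) -> torch.Tensor:
    # Generic objective: rate + lambda * distortion. Larger lambda favors
    # reconstruction quality; smaller lambda favors smaller files.
    distortion = torch.mean((original - reconstruction) ** 2)
    return bits_per_pixel + lmbda * distortion

# Toy usage with made-up numbers, only to show how the terms combine.
bpp = torch.tensor(0.45)                # estimated rate from an entropy model
x = torch.rand(1, 3, 64, 64)            # original image
x_hat = x + 0.01 * torch.randn_like(x)  # noisy reconstruction
print(rate_distortion_loss(bpp, x, x_hat).item())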

Beyond merely combining these two modalities, we aimed to explore various implementation strategies and enhance state-of-the-art models. In this work, we propose three different multimodal machine learning models, each following its own implementation strategy. Overall, we believe we have made a modest yet meaningful contribution to the field of machine learning-based image compression. Our final results suggest promising research directions, although further investigation is needed to reach more conclusive outcomes.
Cite this publication:
@misc{9169232,
  author       = {{Forsell, Jacob and Jin, Yuyang}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Machine Learning-based Multimodal Data Compression}},
  year         = {{2024}},
}