
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Machine Learning-based Multimodal Data Compression

Forsell, Jacob and Jin, Yuyang (2024) EITM01 20241
Department of Electrical and Information Technology
Abstract
The field of learned image compression has seen rapid development and strong research engagement. In this thesis, we seek to contribute to the field by extending a state-of-the-art learned image compression architecture, LIC-TCM, with a depth map as a second, complementary modality to further enhance image compression. We also explore the inverse: primarily compressing a depth map (which can be expressed as an image) with LIC-TCM while incorporating the corresponding image frame as the secondary, complementary modality. An important goal of this work has been to keep complexity low, both in encoding and decoding time and in implementation effort, and this is reflected in our proposed architectures.

In this thesis, we propose three architectures. The first, Attention-based Multimodal-LIC-TCM, uses a depth map as the secondary modality and expands the encoder so that depth is incorporated as a token in its Swin Transformer blocks. The second, Convolution-based Multimodal-LIC-TCM, uses an image as the secondary modality and introduces a convolution-based module that extracts features from both the image and depth modalities and then fuses them for further enhanced compression. The third, 4-channel LIC-TCM, jointly accepts an image and a depth map as input and reconstructs either the image or the depth map depending on configuration. For this last architecture, the LIC-TCM network itself is left unaltered, which minimizes added complexity.
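To make the third variant concrete, the sketch below shows how an RGB frame and a depth map can be stacked into a single 4-channel tensor and passed to an otherwise unmodified learned codec. This is a minimal PyTorch sketch under our own assumptions: ToyCodec is a stand-in for LIC-TCM, not the actual model, and only the channel stacking reflects the idea described above.

import torch
import torch.nn as nn

# Hypothetical stand-in for an unmodified learned codec such as LIC-TCM.
# Only the number of input channels differs from a standard 3-channel model.
class ToyCodec(nn.Module):
    def __init__(self, in_channels: int = 4, latent_channels: int = 192):
        super().__init__()
        # One strided convolution stands in for the real analysis transform.
        self.encoder = nn.Conv2d(in_channels, latent_channels,
                                 kernel_size=5, stride=2, padding=2)
        # The synthesis transform reconstructs the 3-channel RGB target.
        self.decoder = nn.ConvTranspose2d(latent_channels, 3, kernel_size=5,
                                          stride=2, padding=2, output_padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.encoder(x)      # latent that would be entropy coded
        return self.decoder(y)   # reconstruction of the target modality

rgb = torch.rand(1, 3, 256, 256)    # color image, values in [0, 1]
depth = torch.rand(1, 1, 256, 256)  # depth map normalized to [0, 1]

# Stack the two modalities along the channel dimension: (B, 4, H, W).
rgbd = torch.cat([rgb, depth], dim=1)

model = ToyCodec(in_channels=4)
print(model(rgbd).shape)  # torch.Size([1, 3, 256, 256])

To target the depth map instead of the image, the decoder's output channels would simply be set to 1 while the 4-channel input is kept unchanged.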

Generally, we find that all three architectures improve reconstruction performance without affecting the compressed size, although the improvement is not significant. The attention- and convolution-based architectures introduce additional time and implementation complexity, whereas 4-channel LIC-TCM is unaffected in this respect. These results point to promising research directions, but further investigation is required for more conclusive results.
Popular Abstract
Imagine you're having a text conversation with a friend who hints that their favorite animal is among the most common livestock in the world, and they want you to guess what it is. Your mind starts narrowing down the possibilities: is it a sheep, a cow, or a goat? Then your friend sends you a sound clip of the animal making a 'moo' sound, and you realize it's a cow. In solving this problem, you processed information from two different channels, text and sound. By combining these channels of information, or modalities, your final prediction became certain.

In this work, we investigate whether multiple modalities describing the same scene can be used jointly to improve the performance of machine learning models. Just as combining text and sound made the cow easy to identify, we aim to combine different kinds of sensory data from a camera to achieve better file compression.

Currently, two common methods for compressing standard color images are PNG and JPEG. PNG saves the image without compromising quality, whereas JPEG achieves smaller file sizes at the expense of image quality. In other words, PNG is a lossless method, while JPEG is lossy. Our focus in this work, however, isn't on conventional compression algorithms; instead, we aim to use lossy machine learning-based models to enhance image compression.
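For readers who want to see the lossless/lossy distinction in practice, the short Python sketch below (using the Pillow and NumPy libraries, which are our own choice for illustration and not something used in the thesis) saves the same image once as PNG and once as JPEG, then checks whether the pixels survive the round trip. Relative file sizes depend heavily on image content; for natural photographs, JPEG at moderate quality is usually far smaller than PNG.

import os
import numpy as np
from PIL import Image

# A small synthetic gradient image; any RGB photograph would work as well.
col = np.linspace(0, 255, 256).astype(np.uint8)
array = np.stack([np.tile(col, (256, 1))] * 3, axis=-1)
image = Image.fromarray(array, mode="RGB")

image.save("demo.png")              # lossless: every pixel is preserved exactly
image.save("demo.jpg", quality=75)  # lossy: the quality parameter trades size for fidelity

png_back = np.asarray(Image.open("demo.png"))
jpg_back = np.asarray(Image.open("demo.jpg"))

print("PNG size (bytes): ", os.path.getsize("demo.png"))
print("JPEG size (bytes):", os.path.getsize("demo.jpg"))
print("PNG identical to original:", bool(np.array_equal(png_back, array)))
print("JPEG mean absolute pixel error:",
      float(np.abs(jpg_back.astype(int) - array.astype(int)).mean()))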

The objective of this work is to enhance a modern state-of-the-art machine learning-based image compression model by incorporating multimodal properties. Specifically, we aim to enable the model to combine two modalities, RGB (a standard color image) and depth, during compression. Depth data comes from a sensor that measures the distance from the camera to the objects in the scene. By integrating these two modalities, the machine learning-based model can learn complex relationships between the sensory data and achieve higher compression rates while maintaining good quality in the RGB image. The decision to combine RGB and depth was deliberate: many modern cameras, such as those on smartphones, are equipped with both a standard camera and a depth sensor. If our compression method proves to be more effective, it could therefore mean increased disk space savings on a device!
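Models of this kind are commonly trained to balance compressed size against reconstruction quality with a rate-distortion objective. The sketch below shows that generic trade-off; the use of mean squared error and the lambda weight are common choices in the learned-compression literature, not necessarily the exact configuration used in this thesis.

import torch

def rate_distortion_loss(bits_per_pixel: torch.Tensor,
                         original: torch.Tensor,
                         reconstruction: torch.Tensor,
                         lmbda: float = 0.01) -> torch.Tensor:
    # Generic objective: rate + lambda * distortion. Larger lambda favors
    # reconstruction quality; smaller lambda favors smaller files.
    distortion = torch.mean((original - reconstruction) ** 2)
    return bits_per_pixel + lmbda * distortion

# Toy usage with made-up numbers, only to show how the terms combine.
bpp = torch.tensor(0.45)                # estimated rate from an entropy model
x = torch.rand(1, 3, 64, 64)            # original image
x_hat = x + 0.01 * torch.randn_like(x)  # noisy reconstruction
print(rate_distortion_loss(bpp, x, x_hat).item())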

Beyond merely combining these two modalities, we aimed to explore various implementation strategies and enhance state-of-the-art models. In this work, we propose three different multimodal machine learning models, each following its own implementation strategy. Overall, we believe we have made a modest yet meaningful contribution to the field of machine learning-based image compression. Our final results suggest promising research directions, although further investigation is needed to reach more conclusive outcomes.
Cite this publication:
@misc{9169232,
  author       = {{Forsell, Jacob and Jin, Yuyang}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Machine Learning-based Multimodal Data Compression}},
  year         = {{2024}},
}