Where to Fuse
(2024) In Master's Theses in Mathematical Sciences, FMSM01 20232, Mathematical Statistics
- Abstract
- This thesis investigates fusion techniques in multimodal transformer models, focusing on enhancing the capabilities of large language models in understanding not just text, but also other modalities like images, audio, and sensor data. The study compares late fusion (concatenating modality tokens after separate encoding) and early fusion (concatenating before encoding) techniques, examining their respective advantages and disadvantages. It also examines a mid-fusion approach, aiming to combine the strengths of both methods. The effectiveness of this approach is evaluated in terms of accuracy and computational impact on the Visual Question Answering (VQA) task. Using a pretrained T5 model, the research incorporates image tokens (calculated by a Vision Transformer, ViT) into intermediate activations of the model. The findings indicate that standard early fusion techniques underperform with larger decoders, while late fusion with a smaller decoder yields the best results on the VQA task. This conclusion also extends to pooled modality tokens. Additionally, the thesis includes a comprehensive literature study, identifying benchmark datasets for video understanding in multimodal learning and highlighting datasets that demand a robust understanding of all involved modalities. This research contributes to the field by exploring and validating a novel fusion technique in multimodal learning, offering insights into its practical applications and limitations.
- Popular Abstract
- ChatGPT achieved record-breaking growth as a consumer product. However, the utility of Large
Language Models (LLMs) is somewhat limited by their reliance on text-only input. What are
the best ways to incorporate other forms of input into these models?
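The three fusion strategies compared in the abstract differ only in *where* the modality tokens are concatenated relative to the encoder layers. A minimal NumPy sketch of that difference is below; the `layer` function is a toy elementwise stand-in for a transformer encoder layer (the thesis itself uses a pretrained T5 encoder and ViT image tokens), and the token counts and layer split are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # shared embedding width (illustrative)
text = rng.normal(size=(5, d))         # 5 text token embeddings
image = rng.normal(size=(3, d))        # 3 image token embeddings (e.g. from a ViT)

def layer(x):
    """Toy stand-in for one transformer encoder layer."""
    return np.tanh(x)

def run_layers(x, n):
    for _ in range(n):
        x = layer(x)
    return x

# Early fusion: concatenate the modalities first, then encode jointly.
early = run_layers(np.concatenate([text, image], axis=0), 4)

# Late fusion: encode each modality separately, concatenate afterwards.
late = np.concatenate([run_layers(text, 4), run_layers(image, 4)], axis=0)

# Mid fusion: encode text alone through the first layers, then inject the
# image tokens into the intermediate activations and finish jointly.
mid_text = run_layers(text, 2)
mid = run_layers(np.concatenate([mid_text, image], axis=0), 2)

# All three variants produce a fused sequence of 5 + 3 = 8 token vectors;
# they differ only in how many layers see both modalities together.
```

Mid fusion is the compromise the thesis evaluates: later layers attend across modalities (as in early fusion) while earlier layers stay modality-specific (as in late fusion).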
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9147256
- author
- Petersson, Lukas LU
- supervisor
- organization
- alternative title
- Var man ska sammanfoga (Swedish: "Where to Fuse")
- course
- FMSM01 20232
- year
- 2024
- type
- H2 - Master's Degree (Two Years)
- subject
- publication/series
- Master's Theses in Mathematical Sciences
- report number
- LUTFMS-3493-2024
- ISSN
- 1404-6342
- other publication id
- 2024:E7
- language
- English
- id
- 9147256
- date added to LUP
- 2024-02-02 11:18:23
- date last changed
- 2024-02-08 13:46:24
@misc{9147256,
  abstract = {{This thesis investigates fusion techniques in multimodal transformer models, focusing on enhancing the capabilities of large language models in understanding not just text, but also other modalities like images, audio, and sensor data. The study compares late fusion (concatenating modality tokens after separate encoding) and early fusion (concatenating before encoding) techniques, examining their respective advantages and disadvantages. It examines a mid-fusion approach, aiming to combine the strengths of both methods. The effectiveness of this approach is evaluated in terms of accuracy and computational impact on the Visual Question Answering (VQA) task. Using a pretrained T5 model, the research incorporates image tokens (calculated by Vision Transformer, ViT) into intermediate activations of the model. The findings indicate that standard early fusion techniques underperform with larger decoders, while late fusion with a smaller decoder yields the best results on the VQA task. This conclusion also extends to pooled modality tokens. Additionally, the thesis includes a comprehensive literature study, identifying benchmark datasets for video understanding in multimodal learning and highlighting datasets that demand a robust understanding of all involved modalities. This research contributes to the field by exploring and validating a novel fusion technique in multimodal learning, offering insights into its practical applications and limitations.}},
  author = {{Petersson, Lukas}},
  issn = {{1404-6342}},
  language = {{eng}},
  note = {{Student Paper}},
  series = {{Master's Theses in Mathematical Sciences}},
  title = {{Where to Fuse}},
  year = {{2024}},
}