
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Where to Fuse

Petersson, Lukas LU (2024) In Master's Theses in Mathematical Sciences FMSM01 20232
Mathematical Statistics
Abstract
This thesis investigates fusion techniques in multimodal transformer models, focusing on enhancing the ability of large language models to understand not just text but also other modalities such as images, audio, and sensor data. The study compares late fusion (concatenating modality tokens after separate encoding) and early fusion (concatenating before encoding), examining their respective advantages and disadvantages. It then examines a mid-fusion approach that aims to combine the strengths of both methods. The effectiveness of this approach is evaluated in terms of accuracy and computational impact on the Visual Question Answering (VQA) task. Using a pretrained T5 model, the research incorporates image tokens (calculated by a Vision Transformer, ViT) into the intermediate activations of the model. The findings indicate that standard early fusion techniques underperform with larger decoders, while late fusion with a smaller decoder yields the best results on the VQA task. This conclusion also extends to pooled modality tokens. Additionally, the thesis includes a comprehensive literature study, identifying benchmark datasets for video understanding in multimodal learning and highlighting datasets that demand a robust understanding of all involved modalities. This research contributes to the field by exploring and validating a novel fusion technique in multimodal learning, offering insights into its practical applications and limitations.
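The three fusion strategies compared in the abstract can be sketched schematically. The following is a minimal illustration using toy NumPy "encoder layers" (plain matrix multiplications with a nonlinearity standing in for transformer blocks); all names, shapes, and the layer-split point `k` are hypothetical and are not taken from the thesis code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                             # shared model dimension
text = rng.normal(size=(5, d))    # 5 text tokens
image = rng.normal(size=(3, d))   # 3 image tokens (e.g. from a ViT)

# Toy stand-ins for transformer encoder layers: one linear map per layer.
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4)]

def run(tokens, layer_list):
    """Pass a token matrix through a stack of toy layers."""
    for w in layer_list:
        tokens = np.tanh(tokens @ w)
    return tokens

# Early fusion: concatenate modality tokens, then encode them jointly.
early = run(np.concatenate([text, image], axis=0), layers)

# Late fusion: encode each modality separately, concatenate afterwards.
late = np.concatenate([run(text, layers), run(image, layers)], axis=0)

# Mid fusion: encode text through the first k layers, inject the image
# tokens into the intermediate activations, then finish encoding jointly.
k = 2
mid_text = run(text, layers[:k])
mid = run(np.concatenate([mid_text, image], axis=0), layers[k:])

print(early.shape, late.shape, mid.shape)  # all (8, 8)
```

The sketch shows only where concatenation happens relative to encoding, which is the axis along which the thesis compares the three approaches; attention, positional information, and the decoder are omitted.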
Popular Abstract
ChatGPT achieved record-breaking growth as a consumer product. However, the utility of Large Language Models (LLMs) is somewhat limited by their reliance on text-only input. What are the best ways to incorporate other forms of input into these models?
author
Petersson, Lukas LU
supervisor
organization
alternative title
Var man ska sammanfoga (Swedish: "Where to fuse")
course
FMSM01 20232
year
type
H2 - Master's Degree (Two Years)
subject
publication/series
Master's Theses in Mathematical Sciences
report number
LUTFMS-3493-2024
ISSN
1404-6342
other publication id
2024:E7
language
English
id
9147256
date added to LUP
2024-02-02 11:18:23
date last changed
2024-02-08 13:46:24
@misc{9147256,
  abstract     = {{This thesis investigates fusion techniques in multimodal transformer models, focusing on enhancing the capabilities of large language models in understanding not just text, but also other
modalities like images, audio, and sensor data. The study compares late fusion (concatenating modality tokens after separate encoding) and early fusion (concatenating before encoding)
techniques, examining their respective advantages and disadvantages. It examines a mid-fusion
approach, aiming to combine the strengths of both methods. The effectiveness of this approach
is evaluated in terms of accuracy and computational impact on the Visual Question Answering
(VQA) task. Using a pretrained T5 model, the research incorporates image tokens (calculated
by Vision Transformer, ViT) into intermediate activations of the model. The findings indicate
that standard early fusion techniques underperform with larger decoders, while late fusion with a
smaller decoder yields the best results on the VQA task. This conclusion also extends to pooled
modality tokens. Additionally, the thesis includes a comprehensive literature study, identifying
benchmark datasets for video understanding in multimodal learning and highlighting datasets that
demand a robust understanding of all involved modalities. This research contributes to the field
by exploring and validating a novel fusion technique in multimodal learning, offering insights into
its practical applications and limitations.}},
  author       = {{Petersson, Lukas}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Where to Fuse}},
  year         = {{2024}},
}