Where to Fuse
(2024) In Master's Theses in Mathematical Sciences, FMSM01 20232, Mathematical Statistics
- Abstract
- This thesis investigates fusion techniques in multimodal transformer models, focusing on enhancing the capabilities of large language models in understanding not just text, but also other modalities like images, audio, and sensor data. The study compares late fusion (concatenating modality tokens after separate encoding) and early fusion (concatenating before encoding) techniques, examining their respective advantages and disadvantages. It also examines a mid-fusion approach, aiming to combine the strengths of both methods. The effectiveness of this approach is evaluated in terms of accuracy and computational impact on the Visual Question Answering (VQA) task. Using a pretrained T5 model, the research incorporates image tokens (calculated by a Vision Transformer, ViT) into intermediate activations of the model. The findings indicate that standard early fusion techniques underperform with larger decoders, while late fusion with a smaller decoder yields the best results on the VQA task. This conclusion also extends to pooled modality tokens. Additionally, the thesis includes a comprehensive literature study, identifying benchmark datasets for video understanding in multimodal learning and highlighting datasets that demand a robust understanding of all involved modalities. This research contributes to the field by exploring and validating a novel fusion technique in multimodal learning, offering insights into its practical applications and limitations.
- Popular Abstract
- ChatGPT achieved record-breaking growth as a consumer product. However, the utility of Large
Language Models (LLMs) is somewhat limited by their reliance on text-only input. What are
the best ways to incorporate other forms of input into these models?
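The three fusion strategies compared in the abstract differ only in *where* the modality tokens are concatenated relative to the encoder layers. A minimal NumPy sketch of that difference is below; the `layer` function is a toy elementwise stand-in for a transformer encoder layer (the thesis itself uses a pretrained T5 encoder and ViT image tokens), and the token counts and layer split are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # shared embedding width (illustrative)
text = rng.normal(size=(5, d))         # 5 text token embeddings
image = rng.normal(size=(3, d))        # 3 image token embeddings (e.g. from a ViT)

def layer(x):
    """Toy stand-in for one transformer encoder layer."""
    return np.tanh(x)

def run_layers(x, n):
    for _ in range(n):
        x = layer(x)
    return x

# Early fusion: concatenate the modalities first, then encode jointly.
early = run_layers(np.concatenate([text, image], axis=0), 4)

# Late fusion: encode each modality separately, concatenate afterwards.
late = np.concatenate([run_layers(text, 4), run_layers(image, 4)], axis=0)

# Mid fusion: encode text alone through the first layers, then inject the
# image tokens into the intermediate activations and finish jointly.
mid_text = run_layers(text, 2)
mid = run_layers(np.concatenate([mid_text, image], axis=0), 2)

# All three variants produce a fused sequence of 5 + 3 = 8 token vectors;
# they differ only in how many layers see both modalities together.
```

Mid fusion is the compromise the thesis evaluates: later layers attend across modalities (as in early fusion) while earlier layers stay modality-specific (as in late fusion).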
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9147256
- author
- Petersson, Lukas LU
- supervisor
- organization
- alternative title
- Var man ska sammanfoga (Swedish: "Where to Fuse")
- course
- FMSM01 20232
- year
- 2024
- type
- H2 - Master's Degree (Two Years)
- subject
- publication/series
- Master's Theses in Mathematical Sciences
- report number
- LUTFMS-3493-2024
- ISSN
- 1404-6342
- other publication id
- 2024:E7
- language
- English
- id
- 9147256
- date added to LUP
- 2024-02-02 11:18:23
- date last changed
- 2024-02-08 13:46:24
@misc{9147256,
  abstract = {{This thesis investigates fusion techniques in multimodal transformer models, focusing on enhancing the capabilities of large language models in understanding not just text, but also other modalities like images, audio, and sensor data. The study compares late fusion (concatenating modality tokens after separate encoding) and early fusion (concatenating before encoding) techniques, examining their respective advantages and disadvantages. It examines a mid-fusion approach, aiming to combine the strengths of both methods. The effectiveness of this approach is evaluated in terms of accuracy and computational impact on the Visual Question Answering (VQA) task. Using a pretrained T5 model, the research incorporates image tokens (calculated by Vision Transformer, ViT) into intermediate activations of the model. The findings indicate that standard early fusion techniques underperform with larger decoders, while late fusion with a smaller decoder yields the best results on the VQA task. This conclusion also extends to pooled modality tokens. Additionally, the thesis includes a comprehensive literature study, identifying benchmark datasets for video understanding in multimodal learning and highlighting datasets that demand a robust understanding of all involved modalities. This research contributes to the field by exploring and validating a novel fusion technique in multimodal learning, offering insights into its practical applications and limitations.}},
  author = {{Petersson, Lukas}},
  issn = {{1404-6342}},
  language = {{eng}},
  note = {{Student Paper}},
  series = {{Master's Theses in Mathematical Sciences}},
  title = {{Where to Fuse}},
  year = {{2024}},
}