Developing and Evaluating an Object Detection Application for Real-World Data

Paulsson, Arvid; Jóhannsson, Daníel

Paulsson, Arvid ^LU and Jóhannsson, Daníel ^LU (2025) In Master's Theses in Mathematical Sciences FMAM05 20251
Mathematics (Faculty of Engineering)

Abstract: This thesis presents a pipeline for automated object identification and dot-based annotation in User Generated Content (UGC) images. The work is motivated by IKEA’s digital product “Content Recommendations”, which aims to enhance customer engagement through shoppable UGC images, and it addresses the challenge of detecting specific products in varied real-world visual contexts. The approach combines state-of-the-art multimodal models: GroundingDINO for object localization based on textual prompts, CLIP for text-to-image classification, and EfficientSAM for instance segmentation. A custom dot placement algorithm then uses the segmentation masks from the previous step to assign coordinates (a dot) to each detected object. The pipeline performs well across diverse and unstructured UGC images, successfully identifying and labeling both frequent and infrequent items. Due to inconsistent ground truth annotations and varying object prominence in UGC data, quantitative evaluation proved challenging. To address this, a proxy dataset containing manually verified dot annotations was used as ground truth. This enabled a more controlled comparison and supported assumptions about the pipeline’s expected performance on UGC images. Results indicate that multimodal AI methods are well suited for scalable, fine-grained object detection and annotation in complex real-world visual data.
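
The abstract does not specify how the dot placement algorithm picks a coordinate from a segmentation mask. As an illustration only, the following minimal sketch assumes one plausible heuristic: compute the centroid of the mask and snap it to the nearest object pixel, so the dot always lands on the object even when the shape is concave. The function name `place_dot` and the list-of-lists mask format are hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple

# Hypothetical mask format: rows of 0/1 values, where 1 marks an object pixel.
Mask = List[List[int]]


def place_dot(mask: Mask) -> Tuple[int, int]:
    """Return (row, col) of a dot that is guaranteed to lie on the mask.

    Assumed heuristic (not the thesis's algorithm): take the centroid of all
    object pixels and return the object pixel closest to it.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no object pixels")

    # Centroid of the object pixels; may fall outside a concave object.
    centroid_r = sum(r for r, _ in pixels) / len(pixels)
    centroid_c = sum(c for _, c in pixels) / len(pixels)

    # Snap to the nearest object pixel (squared Euclidean distance).
    return min(
        pixels,
        key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2,
    )


# A U-shaped object: its centroid falls in the empty middle column.
u_shape = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
dot = place_dot(u_shape)
```

For a convex object the result coincides with the rounded centroid. For the U-shaped example the centroid lies in the gap, so the dot is moved to the closest pixel on the object. A production system would more likely operate on array masks and might prefer the pixel farthest from the object boundary, but that is likewise an assumption.
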
Popular Abstract (translated from Swedish): In a world where images flow in from social media, e-commerce, and inspiration sites, it is becoming increasingly important to be able to interpret and interact with visual content quickly. At the same time, users place ever higher demands on interactive experiences in which they can easily identify and get information about the objects they see in images. This creates a need for automated solutions that can understand and categorize objects in images intelligently while handling the variation and complexity that characterize real user images.

This master's thesis presents a system that can automatically identify and mark "interactive dots" on objects in images taken by ordinary users, so-called user generated content (UGC). The system combines several modern AI models: Grounding DINO is used to find objects in the image based on text descriptions, CLIP helps classify the objects, and EfficientSAM is used to separate the objects from the background. Finally, a dot is placed on each identified object to show its position. Figure 1 shows an example of how interactive dots are used in an IKEA image, where each interactive dot represents an identified object that links to further information.

The system shows promising results and often succeeds in finding and naming both common and more unusual objects. This shows that it is possible to use AI that combines image and text understanding to improve the interpretation of complex images. The results may be relevant for, among other things, automatic tagging of images in digital archives or for making image analysis more efficient in various consumer services.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9192337

author

Paulsson, Arvid ^LU and Jóhannsson, Daníel ^LU

supervisor

Anders Heyden ^LU

organization

Mathematics (Faculty of Engineering)

course

FMAM05 20251

year

2025

type

H2 - Master's Degree (Two Years)

subject

Technology and Engineering

keywords

Computer Vision, Zero-shot Learning, Object Detection, Instance Segmentation, User Generated Content (UGC), Multimodal AI, GroundingDINO, CLIP, EfficientSAM

publication/series

Master's Theses in Mathematical Sciences

report number

LUTFMA-3585-2025

ISSN

1404-6342

other publication id

2025:E37

language

English

id

9192337

date added to LUP

2025-09-03 17:02:42

date last changed

2025-10-22 14:24:54

@misc{9192337,
  abstract     = {{This thesis presents a pipeline for automated object identification and dot-based annotation in User Generated Content (UGC) images. The work is motivated by IKEA’s digital product “Content Recommendations”, which aims to enhance customer engagement through shoppable UGC images, and it addresses the challenge of detecting specific products in varied real-world visual contexts. The approach combines state-of-the-art multimodal models: GroundingDINO for object localization based on textual prompts, CLIP for text-to-image classification, and EfficientSAM for instance segmentation. A custom dot placement algorithm then uses the segmentation masks from the previous step to assign coordinates (a dot) to each detected object. The pipeline performs well across diverse and unstructured UGC images, successfully identifying and labeling both frequent and infrequent items. Due to inconsistent ground truth annotations and varying object prominence in UGC data, quantitative evaluation proved challenging. To address this, a proxy dataset containing manually verified dot annotations was used as ground truth. This enabled a more controlled comparison and supported assumptions about the pipeline’s expected performance on UGC images. Results indicate that multimodal AI methods are well suited for scalable, fine-grained object detection and annotation in complex real-world visual data.}},
  author       = {{Paulsson, Arvid and Jóhannsson, Daníel}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Developing and Evaluating an Object Detection Application for Real-World Data}},
  year         = {{2025}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
