Visual Re-ranking with Non-visual Side Information
(2025) 23rd Scandinavian Conference on Image Analysis, SCIA 2025. In Lecture Notes in Computer Science 15725 LNCS, p. 310-323.
- Abstract
The standard approach for visual place recognition is to use global image descriptors to retrieve the most similar database images for a given query image. The results can then be further improved with re-ranking methods that re-order the top-scoring images. However, existing methods focus on re-ranking based on the same image descriptors that were used for the initial retrieval, which we argue provides limited additional signal. In this work we propose Generalized Contextual Similarity Aggregation (GCSA), which is a graph neural network-based re-ranking method that, in addition to the visual descriptors, can leverage other types of available side information. This can, for example, be other sensor data (such as signal strength of nearby WiFi or Bluetooth endpoints) or geometric properties such as camera poses for database images. In many applications this information is already present or can be acquired with low effort. Our architecture leverages the concept of affinity vectors to allow for a shared encoding of the heterogeneous multi-modal input. Two large-scale datasets, covering both outdoor and indoor localization scenarios, are utilized for training and evaluation. In experiments we show significant improvement not only on image retrieval metrics, but also for the downstream visual localization task.
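To make the idea concrete, the sketch below illustrates re-ranking with affinity vectors and non-visual side information in the simplest possible form: cosine affinities between global descriptors are fused with a precomputed side-information affinity (e.g. from WiFi signal strength) before re-ordering the candidates. The function names (`rerank`, `cosine_affinities`), the linear fusion with weight `alpha`, and the toy data are all illustrative assumptions; the paper's GCSA method instead learns the aggregation with a graph neural network.

```python
import numpy as np

def cosine_affinities(query_desc, db_descs):
    """Affinity vector: cosine similarity of one query descriptor
    against every database descriptor (shape: [num_db])."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    return db @ q

def rerank(query_desc, db_descs, side_affinity=None, top_k=5, alpha=0.5):
    """Re-order database images by a fused score.

    side_affinity: optional [num_db] vector of side-information
    affinities (e.g. WiFi signal similarity), assumed pre-normalized.
    NOTE: plain linear fusion here, not the paper's learned GNN.
    """
    vis = cosine_affinities(query_desc, db_descs)
    if side_affinity is None:
        score = vis
    else:
        score = alpha * vis + (1.0 - alpha) * side_affinity
    order = np.argsort(-score)  # descending score
    return order[:top_k]

# Toy example: visually, database image 0 matches the query best,
# but side information promotes image 1 to the top.
query = np.array([1.0, 0.0])
db = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
side = np.array([0.0, 1.0, 0.5])
print(rerank(query, db, top_k=3))                       # visual only
print(rerank(query, db, side_affinity=side, top_k=3))   # fused
```

Even this crude fusion shows why side information helps: two visually near-identical candidates can be separated by a signal that is independent of the image descriptors.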
- author
- Hanning, Gustav (LU); Flood, Gabrielle (LU) and Larsson, Viktor (LU)
- organization
- publishing date
- 2025
- type
- Chapter in Book/Report/Conference proceeding
- publication status
- published
- subject
- keywords
- GNN, Image retrieval re-ranking, Visual localization
- host publication
- Image Analysis - 23rd Scandinavian Conference, SCIA 2025, Proceedings
- series title
- Lecture Notes in Computer Science
- editor
- Petersen, Jens and Dahl, Vedrana Andersen
- volume
- 15725 LNCS
- pages
- 14 pages
- publisher
- Springer Science and Business Media B.V.
- conference name
- 23rd Scandinavian Conference on Image Analysis, SCIA 2025
- conference location
- Reykjavik, Iceland
- conference dates
- 2025-06-23 - 2025-06-25
- external identifiers
- scopus:105009845345
- ISSN
- 0302-9743
- 1611-3349
- ISBN
- 9783031959103
- DOI
- 10.1007/978-3-031-95911-0_22
- language
- English
- LU publication?
- yes
- additional info
- Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
- id
- 4b7f1768-a803-4403-957d-3ef8381ba536
- date added to LUP
- 2025-12-22 10:30:45
- date last changed
- 2025-12-23 03:51:35
@inproceedings{4b7f1768-a803-4403-957d-3ef8381ba536,
abstract = {{The standard approach for visual place recognition is to use global image descriptors to retrieve the most similar database images for a given query image. The results can then be further improved with re-ranking methods that re-order the top-scoring images. However, existing methods focus on re-ranking based on the same image descriptors that were used for the initial retrieval, which we argue provides limited additional signal. In this work we propose Generalized Contextual Similarity Aggregation (GCSA), which is a graph neural network-based re-ranking method that, in addition to the visual descriptors, can leverage other types of available side information. This can, for example, be other sensor data (such as signal strength of nearby WiFi or Bluetooth endpoints) or geometric properties such as camera poses for database images. In many applications this information is already present or can be acquired with low effort. Our architecture leverages the concept of affinity vectors to allow for a shared encoding of the heterogeneous multi-modal input. Two large-scale datasets, covering both outdoor and indoor localization scenarios, are utilized for training and evaluation. In experiments we show significant improvement not only on image retrieval metrics, but also for the downstream visual localization task.}},
author = {{Hanning, Gustav and Flood, Gabrielle and Larsson, Viktor}},
booktitle = {{Image Analysis - 23rd Scandinavian Conference, SCIA 2025, Proceedings}},
editor = {{Petersen, Jens and Dahl, Vedrana Andersen}},
isbn = {{9783031959103}},
issn = {{0302-9743}},
keywords = {{GNN; Image retrieval re-ranking; Visual localization}},
language = {{eng}},
pages = {{310--323}},
publisher = {{Springer Science and Business Media B.V.}},
series = {{Lecture Notes in Computer Science}},
title = {{Visual Re-ranking with Non-visual Side Information}},
url = {{http://dx.doi.org/10.1007/978-3-031-95911-0_22}},
doi = {{10.1007/978-3-031-95911-0_22}},
volume = {{15725 LNCS}},
year = {{2025}},
}