Lund University Publications

Feature semantic alignment and information supplement for Text-based person search

Zhou, Hang; Li, Fan; Tian, Xuening and Huang, Yuling (2023) In Frontiers in Physics 11.
Abstract

The goal of person text-image matching is to retrieve images of specific pedestrians using natural language. Although many research results have been achieved in person text-image matching, existing methods still face two challenges. First, due to the ambiguous semantic information in the features, aligning the textual features with their corresponding image features is always tricky. Second, the absence of semantic information in each local feature of pedestrians poses a significant challenge to the network in extracting robust features that match both modalities. To address these issues, we propose a model for explicit semantic feature extraction and effective information supplement. On the one hand, by attaching consistent and clear semantic information to the textual and image features, coarse-grained alignment between the textual features and their corresponding image features is achieved. On the other hand, an information supplement network is proposed, which captures the relationships between the local features of each modality and supplements them to obtain more complete local features with semantic information. In the end, the local features are concatenated into a comprehensive global feature that is capable of precise alignment of the textual and described image features. We conducted extensive experiments on the CUHK-PEDES and RSTPReid datasets; the results show that our method achieves better performance. Additionally, ablation experiments demonstrate the effectiveness of each module designed in this paper.
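To make the method the abstract describes concrete, below is a minimal, hypothetical PyTorch sketch of its two core ideas: an information-supplement step in which each local feature gathers context from the other local features of the same modality, and the concatenation of the supplemented local features into a single global feature used to score text-image pairs. Everything here (module names, num_locals=6, dim=256, the choice of self-attention as the supplement mechanism) is an assumption for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationSupplement(nn.Module):
    # Hypothetical sketch: each local feature borrows context from the
    # other local features of the same modality via self-attention; a
    # residual connection preserves the original part-level information.
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, locals_):
        # locals_: (batch, num_locals, dim) local features of one modality
        context, _ = self.attn(locals_, locals_, locals_)
        return self.norm(locals_ + context)

class MatchingHead(nn.Module):
    # Supplement each modality's local features, concatenate them into
    # one global descriptor per modality, and compare by cosine similarity.
    def __init__(self, dim, num_locals):
        super().__init__()
        self.img_supp = InformationSupplement(dim)
        self.txt_supp = InformationSupplement(dim)
        self.img_proj = nn.Linear(num_locals * dim, dim)
        self.txt_proj = nn.Linear(num_locals * dim, dim)

    def forward(self, img_locals, txt_locals):
        img_global = self.img_proj(self.img_supp(img_locals).flatten(1))
        txt_global = self.txt_proj(self.txt_supp(txt_locals).flatten(1))
        return F.cosine_similarity(img_global, txt_global, dim=-1)

# Toy usage: a batch of 2 text-image pairs, 6 local parts, 256-d features.
head = MatchingHead(dim=256, num_locals=6)
scores = head(torch.randn(2, 6, 256), torch.randn(2, 6, 256))
print(scores.shape)  # torch.Size([2])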

author
Zhou, Hang; Li, Fan; Tian, Xuening and Huang, Yuling
publishing date
2023
type
Contribution to journal
publication status
published
subject
keywords
cross-modal retrieval, deep learning, neural network, Text-based image retrieval, Text-based person search
in
Frontiers in Physics
volume
11
article number
1192412
publisher
Frontiers Media S. A.
external identifiers
  • scopus:85161016905
ISSN
2296-424X
DOI
10.3389/fphy.2023.1192412
language
English
LU publication?
no
id
297fb711-53a3-4973-856c-ae407bc2879f
date added to LUP
2023-08-30 14:20:48
date last changed
2023-08-30 14:20:48
@article{297fb711-53a3-4973-856c-ae407bc2879f,
  abstract     = {{The goal of person text-image matching is to retrieve images of specific pedestrians using natural language. Although many research results have been achieved in person text-image matching, existing methods still face two challenges. First, due to the ambiguous semantic information in the features, aligning the textual features with their corresponding image features is always tricky. Second, the absence of semantic information in each local feature of pedestrians poses a significant challenge to the network in extracting robust features that match both modalities. To address these issues, we propose a model for explicit semantic feature extraction and effective information supplement. On the one hand, by attaching consistent and clear semantic information to the textual and image features, coarse-grained alignment between the textual features and their corresponding image features is achieved. On the other hand, an information supplement network is proposed, which captures the relationships between the local features of each modality and supplements them to obtain more complete local features with semantic information. In the end, the local features are concatenated into a comprehensive global feature that is capable of precise alignment of the textual and described image features. We conducted extensive experiments on the CUHK-PEDES and RSTPReid datasets; the results show that our method achieves better performance. Additionally, ablation experiments demonstrate the effectiveness of each module designed in this paper.}},
  author       = {{Zhou, Hang and Li, Fan and Tian, Xuening and Huang, Yuling}},
  issn         = {{2296-424X}},
  keywords     = {{cross-modal retrieval; deep learning; neural network; Text-based image retrieval; Text-based person search}},
  language     = {{eng}},
  publisher    = {{Frontiers Media S. A.}},
  series       = {{Frontiers in Physics}},
  title        = {{Feature semantic alignment and information supplement for Text-based person search}},
  url          = {{http://dx.doi.org/10.3389/fphy.2023.1192412}},
  doi          = {{10.3389/fphy.2023.1192412}},
  volume       = {{11}},
  year         = {{2023}},
}