
Lund University Publications


LidarCLIP or: How I Learned to Talk to Point Clouds

Hess, Georg; Tonderski, Adam; Petersson, Christoffer; Åström, Kalle and Svensson, Lennart (2024) 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024 p. 7423-7432
Abstract


Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL·E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. Code and pre-trained models at github.com/atonderski/lidarclip.
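
The abstract outlines a simple distillation recipe: freeze CLIP's image encoder and train a point cloud encoder so that each lidar sweep's embedding matches the CLIP embedding of the paired camera image. Below is a minimal, hypothetical PyTorch sketch of that objective; the LidarEncoder class, its pooling, and the cosine-similarity loss are illustrative assumptions, not the authors' implementation (see github.com/atonderski/lidarclip for the real code).

    # Hypothetical sketch of a LidarCLIP-style training step. The point cloud
    # encoder is supervised to reproduce the frozen CLIP image embedding of
    # the paired camera image; no text-lidar pairs are needed.
    import torch
    import torch.nn.functional as F

    class LidarEncoder(torch.nn.Module):
        """Placeholder encoder mapping (N, 4) lidar points to a CLIP-sized vector."""
        def __init__(self, embed_dim: int = 512):
            super().__init__()
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(4, 256), torch.nn.ReLU(), torch.nn.Linear(256, embed_dim)
            )
        def forward(self, points: torch.Tensor) -> torch.Tensor:
            # Per-point features, max-pooled into a single scene embedding.
            return self.mlp(points).max(dim=0).values

    def train_step(lidar_encoder, clip_model, preprocess, image, points, optimizer):
        with torch.no_grad():  # CLIP image encoder stays frozen
            target = clip_model.encode_image(preprocess(image).unsqueeze(0))[0].float()
        pred = lidar_encoder(points)
        # Align the lidar embedding with the image's CLIP embedding.
        loss = 1.0 - F.cosine_similarity(pred, target, dim=0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()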

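Once lidar embeddings live in CLIP's space, the zero-shot classification and text-based retrieval described above follow from standard CLIP usage: embed text prompts with CLIP's text encoder and rank lidar embeddings by cosine similarity. A minimal sketch, assuming the OpenAI clip package and illustrative prompt wording:

    # Zero-shot classification of a lidar embedding against text prompts.
    # The prompt templates below are assumptions for illustration.
    import clip
    import torch

    device = "cpu"
    model, _ = clip.load("ViT-B/32", device=device)  # standard CLIP text encoder
    prompts = ["a photo of a car", "a photo of a pedestrian", "a photo of a cyclist"]
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize(prompts).to(device))
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    def classify(lidar_embedding: torch.Tensor) -> str:
        """Return the prompt whose CLIP text embedding best matches the lidar embedding."""
        z = lidar_embedding / lidar_embedding.norm()
        scores = text_emb @ z  # cosine similarity per prompt
        return prompts[scores.argmax().item()]

The combined image-and-lidar retrieval mentioned in the abstract could analogously score prompts against, e.g., an average of the two normalized embeddings, though the exact fusion used in the paper is not specified here.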
author
Hess, Georg; Tonderski, Adam; Petersson, Christoffer; Åström, Kalle and Svensson, Lennart
organization
publishing date
2024
type
Chapter in Book/Report/Conference proceeding
publication status
published
subject
keywords
Algorithms, Applications, Autonomous Driving, Vision + language and/or other modalities
host publication
Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
pages
10 pages
publisher
IEEE - Institute of Electrical and Electronics Engineers Inc.
conference name
2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
conference location
Waikoloa, United States
conference dates
2024-01-04 - 2024-01-08
external identifiers
  • scopus:85188146235
ISBN
9798350318920
DOI
10.1109/WACV57701.2024.00727
language
English
LU publication?
yes
additional info
Publisher Copyright: © 2024 IEEE.
id
7c9f8308-87df-43af-998b-48dc4a30b4d2
date added to LUP
2024-06-07 14:24:28
date last changed
2024-08-12 16:44:05
@inproceedings{7c9f8308-87df-43af-998b-48dc4a30b4d2,
  abstract     = {{Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL·E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. Code and pre-trained models at github.com/atonderski/lidarclip.}},
  author       = {{Hess, Georg and Tonderski, Adam and Petersson, Christoffer and Åström, Kalle and Svensson, Lennart}},
  booktitle    = {{Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024}},
  isbn         = {{9798350318920}},
  keywords     = {{Algorithms; Applications; Autonomous Driving; Vision + language and/or other modalities}},
  language     = {{eng}},
  pages        = {{7423--7432}},
  publisher    = {{IEEE - Institute of Electrical and Electronics Engineers Inc.}},
  title        = {{LidarCLIP or: How I Learned to Talk to Point Clouds}},
  url          = {{http://dx.doi.org/10.1109/WACV57701.2024.00727}},
  doi          = {{10.1109/WACV57701.2024.00727}},
  year         = {{2024}},
}