Scalable Optimization of Product Category Embeddings Using Multi-Dimensional Scaling and LLM Embeddings
(2025) Department of Automatic Control
- Abstract
- Semantic search using large language models (LLMs) combined with an approximate nearest neighbor (ANN) index is the current state of the art in search technology. Semantic search aims to match a user’s search query with highly relevant results based on the query’s contextual meaning and intent, rather than relying solely on matching keywords. While semantic search is superior to keyword-based search systems in capturing user intent, it falls short in its ability to perform advanced filtering.
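The ranking idea behind semantic search can be sketched in a few lines: queries and documents are embedded as vectors, and results are ranked by cosine similarity. The product names and embedding vectors below are invented for illustration; a production system would obtain embeddings from an LLM embedding model and serve them through an ANN index rather than the linear scan shown here.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-dimensional embeddings (real LLM embeddings have
# hundreds to thousands of dimensions).
doc_embeddings = {
    "running shoes":  np.array([0.90, 0.10, 0.20]),
    "office chair":   np.array([0.10, 0.80, 0.30]),
    "trail sneakers": np.array([0.85, 0.15, 0.25]),
}
query = np.array([0.88, 0.12, 0.22])  # e.g. an embedding of "jogging footwear"

# Rank documents by descending similarity to the query.
ranked = sorted(doc_embeddings,
                key=lambda d: -cosine_sim(query, doc_embeddings[d]))
print(ranked)  # → ['running shoes', 'trail sneakers', 'office chair']
```

Note how "trail sneakers" ranks above "office chair" despite sharing no keywords with the query — this is the contextual matching that keyword search misses.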
This paper explores the encoding of product category embeddings onto an n-dimensional hypersphere to reduce the number of dimensions used to represent each product category embedding. This aims to prevent embedding vectors from growing to an unreasonable size, which can occur when multiple internal representations of data are concatenated during the querying process.
The goal is to preserve the spatial similarity property of category embeddings, allowing for accurate ranking based on these similarities.
This paper presents dimensionality reduction and optimization methods for placing category embeddings on an n-dimensional hypersphere while retaining neighbors based on semantic similarities or hierarchical distances. It is demonstrated that the dimensions of the embedding vectors can be reduced while optimizing them on the n-dimensional hypersphere, thereby retaining semantically similar neighbors. It is also demonstrated that hierarchical data from category trees can be utilized to optimize category embeddings on the n-dimensional hypersphere, thereby placing categories that are hierarchically close as neighbors. The methods presented are discussed and compared in terms of their performance and scalability, examining how the size of the input dataset and the dimension n, to which the embeddings are reduced, affect performance and time complexity. It is found that Riemannian optimization methods based on ADAM and SGD, initialized with a guess computed via the dimensionality reduction method PCA, perform the optimization task most effectively.
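The pipeline the abstract describes — a PCA initial guess, projection onto the unit hypersphere, and Riemannian gradient descent on an MDS-style stress — can be sketched as follows. This is a minimal NumPy illustration under assumed choices: the input data is synthetic in place of real LLM category embeddings, the target dissimilarities are cosine distances, the stress is the classical squared mismatch between target dissimilarities and pairwise distances, and the optimizer is plain Riemannian SGD with a fixed step size rather than the paper's ADAM variant.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 128))  # stand-in for 128-d category embeddings
n = 8                           # target dimension of the hypersphere

# --- Step 1: PCA initial guess (top-n principal directions via SVD) ---
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:n].T

# --- Step 2: project the initial guess onto the unit hypersphere ------
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

# Target dissimilarities: cosine distances in the original space.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
D = 1.0 - Xn @ Xn.T

def stress(Y, D):
    """MDS-style stress: squared mismatch between target
    dissimilarities and current pairwise distances."""
    G = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return np.sum((G - D) ** 2)

stress0 = stress(Y, D)

# --- Step 3: Riemannian gradient descent on the hypersphere -----------
lr = 1e-4
for _ in range(300):
    G = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    np.fill_diagonal(G, 1.0)          # avoid division by zero on the diagonal
    W = (G - D) / G
    np.fill_diagonal(W, 0.0)
    # Euclidean gradient of the stress with respect to each row of Y.
    grad = 4 * (W.sum(axis=1, keepdims=True) * Y - W @ Y)
    # Project onto the tangent space of the sphere at each point,
    # take a step, then retract back onto the sphere by normalizing.
    rgrad = grad - np.sum(grad * Y, axis=1, keepdims=True) * Y
    Y = Y - lr * rgrad
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)

print(f"stress: {stress0:.2f} -> {stress(Y, D):.2f}")
```

The tangent-space projection followed by renormalization is the defining pattern of Riemannian optimization on the sphere: the iterates never leave the manifold, so every intermediate embedding remains a valid point on the hypersphere.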
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9207872
- author
- Stenström, Adam and Cederberg, Nils
- supervisor
- organization
- year
- 2025
- type
- H3 - Professional qualifications (4 Years - )
- subject
- report number
- TFRT-6280
- other publication id
- 0280-5316
- language
- English
- id
- 9207872
- date added to LUP
- 2025-08-08 15:09:51
- date last changed
- 2025-08-08 15:09:51
@misc{9207872,
  author   = {{Stenström, Adam and Cederberg, Nils}},
  title    = {{Scalable Optimization of Product Category Embeddings Using Multi-Dimensional Scaling and LLM Embeddings}},
  year     = {{2025}},
  language = {{eng}},
  note     = {{Student Paper}},
}