LUP Student Papers

LUND UNIVERSITY LIBRARIES

Semantic Similarity in Data Mesh Environments: A Column Classification Approach

Delgado Medina, Analiz Alejandra LU and Cortez Budnik, Denisse LU (2025) DABN01 20251
Department of Economics
Department of Statistics
Abstract
Efficient classification techniques are needed in decentralized environments because of the rapid growth of high-dimensional and semi-structured data in recent years. In this thesis, we investigate the usability of text embeddings for column classification by analyzing distance metrics that detect semantic similarities between different database columns. Embeddings generated with OpenAI's text-embedding-3-small model are compared using four distance metrics (cosine, Euclidean, Manhattan and Chebyshev) to measure the distances between them in vector space. A clustering analysis is then conducted to further validate the classification performance. To refine the data, two dimensionality reduction techniques are compared, each combined with three clustering models. Our results show that cosine distance maintains high performance, in line with previous literature. Additionally, the clustering analysis corroborates the relationships obtained with the distance metrics.
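As a rough illustration of the four distance metrics named in the abstract, the sketch below computes them for a pair of vectors using only the Python standard library. It is not the authors' implementation: the function name `distance_metrics` and the toy three-dimensional vectors are assumptions made for this example, and real embeddings from an embedding model would have far more dimensions.

```python
import math


def distance_metrics(u, v):
    """Return cosine, Euclidean, Manhattan and Chebyshev distances between u and v.

    Hypothetical helper for illustration only; u and v are equal-length
    sequences of numbers standing in for column embeddings.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")

    # Absolute coordinate-wise differences, shared by three of the metrics.
    diffs = [abs(a - b) for a, b in zip(u, v)]

    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")

    return {
        # Cosine distance = 1 - cosine similarity.
        "cosine": 1.0 - dot / (norm_u * norm_v),
        # Euclidean (L2) distance.
        "euclidean": math.sqrt(sum(d * d for d in diffs)),
        # Manhattan (L1) distance.
        "manhattan": sum(diffs),
        # Chebyshev (L-infinity) distance: largest single-coordinate gap.
        "chebyshev": max(diffs),
    }


# Toy stand-ins for the embeddings of two column names (assumed values).
column_a = [0.2, 0.7, 0.1]
column_b = [0.25, 0.6, 0.2]

for name, value in distance_metrics(column_a, column_b).items():
    print(f"{name}: {value:.4f}")
```

Smaller values indicate that two columns are closer in the embedding space. Cosine distance depends only on the angle between the vectors, whereas the other three also depend on their magnitudes.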
author
Delgado Medina, Analiz Alejandra LU and Cortez Budnik, Denisse LU
supervisor
organization
course
DABN01 20251
year
2025
type
H1 - Master's Degree (One Year)
subject
keywords
text embeddings, column classification, semantic similarity, distance metrics, clustering analysis
language
English
id
9192280
date added to LUP
2025-09-12 09:03:50
date last changed
2025-09-12 09:03:50
@misc{9192280,
  abstract     = {{Efficient classification techniques are needed in decentralized environments due to the rapid increase in high-dimensional and semi-structured data in recent years. In this thesis, we investigate the usability of text embeddings for column classification by analyzing semantic distance metrics to detect semantic similarities between different database columns. OpenAI-text-embeddings-3-small model has been used to calculate four different distance metrics: Cosine, Euclidean, Manhattan and Chebyshev to measure the distances between embeddings in vectorial space. Furthermore, a clustering analysis has been conducted to further validate the classification performance. For a better refinement of the data a comparison between two different dimensionality reduction techniques has been implemented with three different clustering models, respectively. Our results depict cosine distance maintains a high performance in line with previous literature. Additionally, clustering implementation correctly corroborates the relationships obtained by the similarity distance metrics.}},
  author       = {{Delgado Medina, Analiz Alejandra and Cortez Budnik, Denisse}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Semantic Similarity in Data Mesh Environments: A Column Classification Approach}},
  year         = {{2025}},
}