Semantic Similarity in Data Mesh Environments: A Column Classification Approach
(2025) DABN01 20251
Department of Economics
Department of Statistics
- Abstract
- Efficient classification techniques are needed in decentralized environments because of the rapid growth of high-dimensional and semi-structured data in recent years. In this thesis, we investigate the usability of text embeddings for column classification by analyzing distance metrics that detect semantic similarities between database columns. Embeddings are generated with OpenAI's text-embedding-3-small model, and four distance metrics (cosine, Euclidean, Manhattan, and Chebyshev) are used to measure the distances between embeddings in vector space. A clustering analysis is then conducted to further validate the classification performance. To refine the data, two dimensionality reduction techniques are compared, each combined with three clustering models. Our results show that cosine distance maintains high performance, in line with previous literature. Additionally, the clustering results corroborate the relationships obtained with the distance metrics.
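The four distance metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the toy three-dimensional vectors are hypothetical, and real embeddings from the model would have far more dimensions. It only shows how each metric is defined for two vectors of equal length.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev_distance(u, v):
    """Largest absolute coordinate difference (L-infinity)."""
    return max(abs(a - b) for a, b in zip(u, v))


def all_distances(u, v):
    """Return the four distances between two equal-length vectors."""
    return {
        "cosine": cosine_distance(u, v),
        "euclidean": euclidean_distance(u, v),
        "manhattan": manhattan_distance(u, v),
        "chebyshev": chebyshev_distance(u, v),
    }


# Hypothetical toy "embeddings" standing in for two database columns.
column_a = [1.0, 2.0, 3.0]
column_b = [4.0, 6.0, 3.0]

for name, value in all_distances(column_a, column_b).items():
    print(f"{name}: {value:.4f}")
```

Cosine distance depends only on the angle between the vectors, while the other three depend on coordinate differences and therefore on vector magnitude as well.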
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9192280
- author: Delgado Medina, Analiz Alejandra LU and Cortez Budnik, Denisse LU
- supervisor:
- organization:
- course: DABN01 20251
- year: 2025
- type: H1 - Master's Degree (One Year)
- subject:
- keywords: text embeddings, column classification, semantic similarity, distance metrics, clustering analysis
- language: English
- id: 9192280
- date added to LUP: 2025-09-12 09:03:50
- date last changed: 2025-09-12 09:03:50
@misc{9192280,
  abstract = {{Efficient classification techniques are needed in decentralized environments due to the rapid increase in high-dimensional and semi-structured data in recent years. In this thesis, we investigate the usability of text embeddings for column classification by analyzing semantic distance metrics to detect semantic similarities between different database columns. OpenAI-text-embeddings-3-small model has been used to calculate four different distance metrics: Cosine, Euclidean, Manhattan and Chebyshev to measure the distances between embeddings in vectorial space. Furthermore, a clustering analysis has been conducted to further validate the classification performance. For a better refinement of the data a comparison between two different dimensionality reduction techniques has been implemented with three different clustering models, respectively. Our results depict cosine distance maintains a high performance in line with previous literature. Additionally, clustering implementation correctly corroborates the relationships obtained by the similarity distance metrics.}},
  author = {{Delgado Medina, Analiz Alejandra and Cortez Budnik, Denisse}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Semantic Similarity in Data Mesh Environments: A Column Classification Approach}},
  year = {{2025}},
}