Building a Corpus and Database for Rare and Undeciphered Scripts
(2026) Fourth Workshop on Language Technologies for Historical and Ancient Languages, p. 184-196
- Abstract
- Historical sources written in rare or undeciphered scripts represent an immense but underexploited part of the world’s cultural and linguistic heritage. Their study is often hindered by fragmentary preservation, non-standard symbol systems, and the absence of interoperable digital resources. While recent advances in imaging, transcription, and computational analysis have improved access to historical texts, most tools rely on large quantities of labeled data and standardized encodings, requirements that are rarely met for rare or unknown writing systems. This paper presents the design and methodology of a new corpus and database dedicated to rare and undeciphered scripts worldwide. The resource integrates high-quality images, transliterations, transcriptions, linguistic annotations, and metadata within a unified data model tailored for low-resource and non-standard scripts. By adhering to FAIR principles and existing standards for linguistic and cultural heritage data, the database enables reproducible, interdisciplinary research across philology, linguistics, cryptology, and computer science. The paper outlines the data collection and digitization workflow, describes the metadata and database architecture, and demonstrates applications in analysis and decipherment.
Please use this URL to cite or link to this publication:
https://lup.lub.lu.se/record/bb4a5bfc-9f02-4b1c-a127-1957f9493a66
- author
- Megyesi, Beáta; Rattenborg, Rune (LU); Láng, Benedek; Waldispühl, Michelle and Héder, Mihály
- publishing date
- 2026-05-11
- type
- Chapter in Book/Report/Conference proceeding
- publication status
- published
- host publication
- Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026
- editor
- Passarotti, Marco and Sprugnoli, Rachele
- pages
- 13 pages
- publisher
- European Language Resources Association
- conference name
- Fourth Workshop on Language Technologies for Historical and Ancient Languages
- conference location
- Palma, Spain
- conference dates
- 2026-05-11 - 2026-05-11
- ISBN
- 978-2-493814-58-6
- project
- Echoes of History: Analysis and Decipherment of Historical Writings
- language
- English
- LU publication?
- yes
- id
- bb4a5bfc-9f02-4b1c-a127-1957f9493a66
- alternative location
- http://lrec-conf.org/proceedings/lrec2026/workshops/lt4hala/2026.lt4hala-1.0.pdf
- date added to LUP
- 2025-11-19 11:38:59
- date last changed
- 2026-06-02 08:41:10
@inproceedings{bb4a5bfc-9f02-4b1c-a127-1957f9493a66,
abstract = {{Historical sources written in rare or undeciphered scripts represent an immense but underexploited part of the world’s cultural and linguistic heritage. Their study is often hindered by fragmentary preservation, non-standard symbol systems, and the absence of interoperable digital resources. While recent advances in imaging, transcription, and computational analysis have improved access to historical texts, most tools rely on large quantities of labeled data and standardized encodings, requirements that are rarely met for rare or unknown writing systems. This paper presents the design and methodology of a new corpus and database dedicated to rare and undeciphered scripts worldwide. The resource integrates high-quality images, transliterations, transcriptions, linguistic annotations, and metadata within a unified data model tailored for low-resource and non-standard scripts. By adhering to FAIR principles and existing standards for linguistic and cultural heritage data, the database enables reproducible, interdisciplinary research across philology, linguistics, cryptology, and computer science. The paper outlines the data collection and digitization workflow, describes the metadata and database architecture, and demonstrates applications in analysis and decipherment.}},
author = {{Megyesi, Beáta and Rattenborg, Rune and Láng, Benedek and Waldispühl, Michelle and Héder, Mihály}},
booktitle = {{Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026}},
editor = {{Passarotti, Marco and Sprugnoli, Rachele}},
isbn = {{978-2-493814-58-6}},
language = {{eng}},
month = {{05}},
pages = {{184--196}},
publisher = {{European Language Resources Association}},
title = {{Building a Corpus and Database for Rare and Undeciphered Scripts}},
url = {{http://lrec-conf.org/proceedings/lrec2026/workshops/lt4hala/2026.lt4hala-1.0.pdf}},
year = {{2026}},
}