Lund University Publications

Building a Corpus and Database for Rare and Undeciphered Scripts

Megyesi, Beáta ; Rattenborg, Rune ; Láng, Benedek ; Waldispühl, Michelle and Héder, Mihály (2026) Fourth Workshop on Language Technologies for Historical and Ancient Languages, p.184-196
Abstract
Historical sources written in rare or undeciphered scripts represent an immense but underexploited part of the world’s cultural and linguistic heritage. Their study is often hindered by fragmentary preservation, non-standard symbol systems, and the absence of interoperable digital resources. While recent advances in imaging, transcription, and computational analysis have improved access to historical texts, most tools rely on large quantities of labeled data and standardized encodings, requirements that are rarely met for rare or unknown writing systems. This paper presents the design and methodology of a new corpus and database dedicated to rare and undeciphered scripts worldwide. The resource integrates high-quality images, transliterations, transcriptions, linguistic annotations, and metadata within a unified data model tailored for low-resource and non-standard scripts. By adhering to FAIR principles and existing standards for linguistic and cultural heritage data, the database enables reproducible, interdisciplinary research across philology, linguistics, cryptology, and computer science. The paper outlines the data collection and digitization workflow, describes the metadata and database architecture, and demonstrates applications in analysis and decipherment.
author
Megyesi, Beáta ; Rattenborg, Rune ; Láng, Benedek ; Waldispühl, Michelle and Héder, Mihály
type
Chapter in Book/Report/Conference proceeding
publication status
published
host publication
Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026
editor
Passarotti, Marco and Sprugnoli, Rachele
pages
13 pages
publisher
European Language Resources Association
conference name
Fourth Workshop on Language Technologies for Historical and Ancient Languages
conference location
Palma, Spain
conference dates
2026-05-11 - 2026-05-11
ISBN
978-2-493814-58-6
project
Echoes of History: Analysis and Decipherment of Historical Writings
language
English
LU publication?
yes
id
bb4a5bfc-9f02-4b1c-a127-1957f9493a66
alternative location
http://lrec-conf.org/proceedings/lrec2026/workshops/lt4hala/2026.lt4hala-1.0.pdf
date added to LUP
2025-11-19 11:38:59
date last changed
2026-06-02 08:41:10
@inproceedings{bb4a5bfc-9f02-4b1c-a127-1957f9493a66,
  abstract     = {{Historical sources written in rare or undeciphered scripts represent an immense but underexploited part of the world’s cultural and linguistic heritage. Their study is often hindered by fragmentary preservation, non-standard symbol systems, and the absence of interoperable digital resources. While recent advances in imaging, transcription, and computational analysis have improved access to historical texts, most tools rely on large quantities of labeled data and standardized encodings, requirements that are rarely met for rare or unknown writing systems. This paper presents the design and methodology of a new corpus and database dedicated to rare and undeciphered scripts worldwide. The resource integrates high-quality images, transliterations, transcriptions, linguistic annotations, and metadata within a unified data model tailored for low-resource and non-standard scripts. By adhering to FAIR principles and existing standards for linguistic and cultural heritage data, the database enables reproducible, interdisciplinary research across philology, linguistics, cryptology, and computer science. The paper outlines the data collection and digitization workflow, describes the metadata and database architecture, and demonstrates applications in analysis and decipherment.}},
  author       = {{Megyesi, Beáta and Rattenborg, Rune and Láng, Benedek and Waldispühl, Michelle and Héder, Mihály}},
  booktitle    = {{Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026}},
  editor       = {{Passarotti, Marco and Sprugnoli, Rachele}},
  isbn         = {{978-2-493814-58-6}},
  language     = {{eng}},
  month        = {{05}},
  pages        = {{184--196}},
  publisher    = {{European Language Resources Association}},
  title        = {{Building a Corpus and Database for Rare and Undeciphered Scripts}},
  url          = {{http://lrec-conf.org/proceedings/lrec2026/workshops/lt4hala/2026.lt4hala-1.0.pdf}},
  year         = {{2026}},
}