Building a Corpus and Database for Rare and Undeciphered Scripts

Megyesi, Beáta; Rattenborg, Rune; Láng, Benedek; Waldispühl, Michelle; Héder, Mihály

Megyesi, Beáta ; Rattenborg, Rune ; Láng, Benedek ; Waldispühl, Michelle and Héder, Mihály (2026) Fourth Workshop on Language Technologies for Historical and Ancient Languages p.184-196

Abstract: Historical sources written in rare or undeciphered scripts represent an immense but underexploited part of the world’s cultural and linguistic heritage. Their study is often hindered by fragmentary preservation, non-standard symbol systems, and the absence of interoperable digital resources. While recent advances in imaging, transcription, and computational analysis have improved access to historical texts, most tools rely on large quantities of labeled data and standardized encodings, requirements that are rarely met for rare or unknown writing systems. This paper presents the design and methodology of a new corpus and database dedicated to rare and undeciphered scripts worldwide. The resource integrates high-quality images, transliterations, transcriptions, linguistic annotations, and metadata within a unified data model tailored for low-resource and non-standard scripts. By adhering to FAIR principles and existing standards for linguistic and cultural heritage data, the database enables reproducible, interdisciplinary research across philology, linguistics, cryptology, and computer science. The paper outlines the data collection and digitization workflow, describes the metadata and database architecture, and demonstrates applications in analysis and decipherment.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/bb4a5bfc-9f02-4b1c-a127-1957f9493a66

author

Megyesi, Beáta ; Rattenborg, Rune (LU) ; Láng, Benedek ; Waldispühl, Michelle and Héder, Mihály

organization

Greek (Ancient and Byzantine)

publishing date

2026-05-11

type

Chapter in Book/Report/Conference proceeding

publication status

published

host publication

Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026

editor

Passarotti, Marco and Sprugnoli, Rachele

pages

13 pages

publisher

European Language Resources Association

conference name

Fourth Workshop on Language Technologies for Historical and Ancient Languages

conference location

Palma, Spain

conference dates

2026-05-11 - 2026-05-11

ISBN

978-2-493814-58-6

project

Echoes of History: Analysis and Decipherment of Historical Writings

language

English

LU publication?

yes

id

bb4a5bfc-9f02-4b1c-a127-1957f9493a66

alternative location

http://lrec-conf.org/proceedings/lrec2026/workshops/lt4hala/2026.lt4hala-1.0.pdf

date added to LUP

2025-11-19 11:38:59

date last changed

2026-06-02 08:41:10

@inproceedings{bb4a5bfc-9f02-4b1c-a127-1957f9493a66,
  abstract     = {{Historical sources written in rare or undeciphered scripts represent an immense but underexploited part of the world’s cultural and linguistic heritage. Their study is often hindered by fragmentary preservation, non-standard symbol systems, and the absence of interoperable digital resources. While recent advances in imaging, transcription, and computational analysis have improved access to historical texts, most tools rely on large quantities of labeled data and standardized encodings, requirements that are rarely met for rare or unknown writing systems. This paper presents the design and methodology of a new corpus and database dedicated to rare and undeciphered scripts worldwide. The resource integrates high-quality images, transliterations, transcriptions, linguistic annotations, and metadata within a unified data model tailored for low-resource and non-standard scripts. By adhering to FAIR principles and existing standards for linguistic and cultural heritage data, the database enables reproducible, interdisciplinary research across philology, linguistics, cryptology, and computer science. The paper outlines the data collection and digitization workflow, describes the metadata and database architecture, and demonstrates applications in analysis and decipherment.}},
  author       = {{Megyesi, Beáta and Rattenborg, Rune and Láng, Benedek and Waldispühl, Michelle and Héder, Mihály}},
  booktitle    = {{Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026}},
  editor       = {{Passarotti, Marco and Sprugnoli, Rachele}},
  isbn         = {{978-2-493814-58-6}},
  language     = {{eng}},
  month        = {{05}},
  pages        = {{184--196}},
  publisher    = {{European Language Resources Association}},
  title        = {{Building a Corpus and Database for Rare and Undeciphered Scripts}},
  url          = {{http://lrec-conf.org/proceedings/lrec2026/workshops/lt4hala/2026.lt4hala-1.0.pdf}},
  year         = {{2026}},
}

Lund University Publications