A Handwritten Text Recognition Dataset for Ajami Manuscripts in Fulfulde and Hausa
(2026) 19th International Conference on Document Analysis and Recognition, ICDAR 2025. In Lecture Notes in Computer Science 16026 LNCS, p. 620-637
- Abstract
We present the first-ever dataset of manually segmented and transcribed Ajami manuscripts written in Fulfulde and Hausa. The term Ajami refers to modified Arabic-script orthographies in Africa. Existing handwritten text recognition (HTR) and optical character recognition (OCR) models for Arabic-script languages perform poorly on West African manuscripts due to a lack of these manuscripts’ representation in the models’ pre-training. This leads to models struggling to adapt to Ajami-style calligraphy, being unequipped to recognize Ajami-specific characters, and being unable to extract certain Arabic-script diacritics which are present in Ajami manuscripts but lacking in many manuscripts for other Arabic-script languages like Arabic and Persian. The latter poses a significant challenge to Ajami HTR. We release the following as an open-source dataset: an ALTO formatting of high-quality images of Fulfulde and Hausa manuscripts, manual segmentation (region and line), and manual transcriptions. Our HTR dataset is also the first to diplomatically transcribe newly Unicode-encoded, special Quranic recitation characters. We evaluate a suite of Arabic-script recognition models specifically for historical manuscripts and find that they produce character error rates of 65–84% when attempting to automatically transcribe our curated manuscripts. Transcriptions produced by the evaluated models are released as well.
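The abstract reports character error rates (CER) of 65–84%. As a point of reference, CER is conventionally the character-level edit distance between a model's output and the reference transcription, divided by the reference length. The following is a minimal sketch under that conventional definition, not the authors' evaluation code: the function names `levenshtein` and `cer` are illustrative, and any text normalization the paper may apply before scoring is not stated in this record and is not modelled here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two strings, counting insertions,
    deletions, and substitutions of single code points."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string needs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length.
    Assumption: characters are Unicode code points, so each combining
    diacritic counts as its own character."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Because Python strings iterate over code points, a dropped or misread Arabic-script diacritic counts as a full character error in this sketch, which is one way missing diacritics can inflate CER on vocalized manuscripts.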
- author
- Yousuf, Oreen ; Aminu, Abdulmalik ; Muhammad, Musa Salih ; Usman, Bashir ; Hashim, Mustapha Kurfi ; Nivre, Joakim ; Megyesi, Beáta and Høgel, Christian
- publishing date
- 2026
- type
- Chapter in Book/Report/Conference proceeding
- publication status
- published
- keywords
- Ajami, benchmark dataset, handwritten text recognition
- host publication
- Document Analysis and Recognition – ICDAR 2025 - 19th International Conference, Proceedings
- series title
- Lecture Notes in Computer Science
- editor
- Yin, Xu-Cheng ; Karatzas, Dimosthenis and Lopresti, Daniel
- volume
- 16026 LNCS
- pages
- 18 pages
- publisher
- Springer Science and Business Media B.V.
- conference name
- 19th International Conference on Document Analysis and Recognition, ICDAR 2025
- conference location
- Wuhan, China
- conference dates
- 2025-09-16 - 2025-09-21
- external identifiers
- scopus:105017374269
- ISSN
- 0302-9743
- 1611-3349
- ISBN
- 9783032046260
- DOI
- 10.1007/978-3-032-04627-7_36
- language
- English
- LU publication?
- yes
- id
- c33132da-8c68-4065-ae59-d3f1fdf643bb
- date added to LUP
- 2025-11-20 11:24:10
- date last changed
- 2025-11-20 11:25:28
@inproceedings{c33132da-8c68-4065-ae59-d3f1fdf643bb,
abstract = {{We present the first-ever dataset of manually segmented and transcribed Ajami manuscripts written in Fulfulde and Hausa. The term Ajami refers to modified Arabic-script orthographies in Africa. Existing handwritten text recognition (HTR) and optical character recognition (OCR) models for Arabic-script languages perform poorly on West African manuscripts due to a lack of these manuscripts’ representation in the models’ pre-training. This leads to models struggling to adapt to Ajami-style calligraphy, being unequipped to recognize Ajami-specific characters, and being unable to extract certain Arabic-script diacritics which are present in Ajami manuscripts but lacking in many manuscripts for other Arabic-script languages like Arabic and Persian. The latter poses a significant challenge to Ajami HTR. We release the following as an open-source dataset: an ALTO formatting of high-quality images of Fulfulde and Hausa manuscripts, manual segmentation (region and line), and manual transcriptions. Our HTR dataset is also the first to diplomatically transcribe newly Unicode-encoded, special Quranic recitation characters. We evaluate a suite of Arabic-script recognition models specifically for historical manuscripts and find that they produce character error rates of 65–84% when attempting to automatically transcribe our curated manuscripts. Transcriptions produced by the evaluated models are released as well.}},
author = {{Yousuf, Oreen and Aminu, Abdulmalik and Muhammad, Musa Salih and Usman, Bashir and Hashim, Mustapha Kurfi and Nivre, Joakim and Megyesi, Beáta and Høgel, Christian}},
booktitle = {{Document Analysis and Recognition – ICDAR 2025 - 19th International Conference, Proceedings}},
editor = {{Yin, Xu-Cheng and Karatzas, Dimosthenis and Lopresti, Daniel}},
isbn = {{9783032046260}},
issn = {{0302-9743}},
keywords = {{Ajami; benchmark dataset; handwritten text recognition}},
language = {{eng}},
pages = {{620--637}},
publisher = {{Springer Science and Business Media B.V.}},
series = {{Lecture Notes in Computer Science}},
title = {{A Handwritten Text Recognition Dataset for Ajami Manuscripts in Fulfulde and Hausa}},
url = {{http://dx.doi.org/10.1007/978-3-032-04627-7_36}},
doi = {{10.1007/978-3-032-04627-7_36}},
volume = {{16026}},
year = {{2026}},
}