Lund University Publications

LUND UNIVERSITY LIBRARIES

A Handwritten Text Recognition Dataset for Ajami Manuscripts in Fulfulde and Hausa

Yousuf, Oreen ; Aminu, Abdulmalik ; Muhammad, Musa Salih ; Usman, Bashir ; Hashim, Mustapha Kurfi ; Nivre, Joakim ; Megyesi, Beáta and Høgel, Christian (2026). 19th International Conference on Document Analysis and Recognition, ICDAR 2025. In Lecture Notes in Computer Science 16026 LNCS, p. 620-637.
Abstract

We present the first-ever dataset of manually segmented and transcribed Ajami manuscripts written in Fulfulde and Hausa. The term Ajami refers to modified Arabic-script orthographies in Africa. Existing handwritten text recognition (HTR) and optical character recognition (OCR) models for Arabic-script languages perform poorly on West African manuscripts because these manuscripts lack representation in the models’ pre-training. As a result, the models struggle to adapt to Ajami-style calligraphy, are unequipped to recognize Ajami-specific characters, and are unable to extract certain Arabic-script diacritics that are present in Ajami manuscripts but absent from many manuscripts in other Arabic-script languages such as Arabic and Persian. The latter poses a significant challenge to Ajami HTR. We release the following as an open-source dataset: an ALTO formatting of high-quality images of Fulfulde and Hausa manuscripts, manual segmentation (region and line), and manual transcriptions. Our HTR dataset is also the first to diplomatically transcribe newly Unicode-encoded, special Quranic recitation characters. We evaluate a suite of Arabic-script recognition models specifically for historical manuscripts and find that they produce character error rates of 65–84% when attempting to automatically transcribe our curated manuscripts. Transcriptions produced by the evaluated models are released as well.
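For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between a reference transcription and a model's output, divided by the reference length. The sketch below illustrates that standard definition only; it is not the authors' evaluation code, and this record does not state which normalization or tooling the paper used. The function names `levenshtein` and `cer` are illustrative. Characters are counted as Unicode code points, so each combining diacritic counts as one character (an assumption).

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn hyp into ref, counted per Unicode code point."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning an empty hypothesis into ref[:i] costs i edits
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing from hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition a CER of 0.65 means that edits amounting to 65% of the reference characters are needed to correct the output. Because insertions are counted, the value can exceed 1.0.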

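author
Yousuf, Oreen ; Aminu, Abdulmalik ; Muhammad, Musa Salih ; Usman, Bashir ; Hashim, Mustapha Kurfi ; Nivre, Joakim ; Megyesi, Beáta and Høgel, Christian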
type
Chapter in Book/Report/Conference proceeding
publication status
published
keywords
Ajami, benchmark dataset, handwritten text recognition
host publication
Document Analysis and Recognition – ICDAR 2025 - 19th International Conference, Proceedings
series title
Lecture Notes in Computer Science
editor
Yin, Xu-Cheng ; Karatzas, Dimosthenis and Lopresti, Daniel
volume
16026 LNCS
pages
18 pages
publisher
Springer Science and Business Media B.V.
conference name
19th International Conference on Document Analysis and Recognition, ICDAR 2025
conference location
Wuhan, China
conference dates
2025-09-16 - 2025-09-21
external identifiers
  • scopus:105017374269
ISSN
0302-9743
1611-3349
ISBN
9783032046260
DOI
10.1007/978-3-032-04627-7_36
language
English
LU publication?
yes
id
c33132da-8c68-4065-ae59-d3f1fdf643bb
date added to LUP
2025-11-20 11:24:10
date last changed
2025-11-20 11:25:28
@inproceedings{c33132da-8c68-4065-ae59-d3f1fdf643bb,
  abstract     = {{We present the first-ever dataset of manually segmented and transcribed Ajami manuscripts written in Fulfulde and Hausa. The term Ajami refers to modified Arabic-script orthographies in Africa. Existing handwritten text recognition (HTR) and optical character recognition (OCR) models for Arabic-script languages perform poorly on West African manuscripts because these manuscripts lack representation in the models’ pre-training. As a result, the models struggle to adapt to Ajami-style calligraphy, are unequipped to recognize Ajami-specific characters, and are unable to extract certain Arabic-script diacritics that are present in Ajami manuscripts but absent from many manuscripts in other Arabic-script languages such as Arabic and Persian. The latter poses a significant challenge to Ajami HTR. We release the following as an open-source dataset: an ALTO formatting of high-quality images of Fulfulde and Hausa manuscripts, manual segmentation (region and line), and manual transcriptions. Our HTR dataset is also the first to diplomatically transcribe newly Unicode-encoded, special Quranic recitation characters. We evaluate a suite of Arabic-script recognition models specifically for historical manuscripts and find that they produce character error rates of 65–84% when attempting to automatically transcribe our curated manuscripts. Transcriptions produced by the evaluated models are released as well.}},
  author       = {{Yousuf, Oreen and Aminu, Abdulmalik and Muhammad, Musa Salih and Usman, Bashir and Hashim, Mustapha Kurfi and Nivre, Joakim and Megyesi, Beáta and Høgel, Christian}},
  booktitle    = {{Document Analysis and Recognition – ICDAR 2025 - 19th International Conference, Proceedings}},
  editor       = {{Yin, Xu-Cheng and Karatzas, Dimosthenis and Lopresti, Daniel}},
  isbn         = {{9783032046260}},
  issn         = {{0302-9743}},
  keywords     = {{Ajami; benchmark dataset; handwritten text recognition}},
  language     = {{eng}},
  pages        = {{620--637}},
  publisher    = {{Springer Science and Business Media B.V.}},
  series       = {{Lecture Notes in Computer Science}},
  title        = {{A Handwritten Text Recognition Dataset for Ajami Manuscripts in Fulfulde and Hausa}},
  url          = {{http://dx.doi.org/10.1007/978-3-032-04627-7_36}},
  doi          = {{10.1007/978-3-032-04627-7_36}},
  volume       = {{16026 LNCS}},
  year         = {{2026}},
}