A Handwritten Text Recognition Dataset for Ajami Manuscripts in Fulfulde and Hausa

Yousuf, Oreen; Aminu, Abdulmalik; Muhammad, Musa Salih; Usman, Bashir; Hashim, Mustapha Kurfi; Nivre, Joakim; Megyesi, Beáta; Høgel, Christian

Yousuf, Oreen ; Aminu, Abdulmalik ; Muhammad, Musa Salih ; Usman, Bashir ; Hashim, Mustapha Kurfi ; Nivre, Joakim ; Megyesi, Beáta and Høgel, Christian (LU) (2026) 19th International Conference on Document Analysis and Recognition, ICDAR 2025. In Lecture Notes in Computer Science 16026 LNCS, p. 620-637

Abstract: We present the first-ever dataset of manually segmented and transcribed Ajami manuscripts written in Fulfulde and Hausa. The term Ajami refers to modified Arabic-script orthographies in Africa. Existing handwritten text recognition (HTR) and optical character recognition (OCR) models for Arabic-script languages perform poorly on West African manuscripts due to a lack of representation of these manuscripts in the models’ pre-training. This leads to models struggling to adapt to Ajami-style calligraphy, being unequipped to recognize Ajami-specific characters, and being unable to extract certain Arabic-script diacritics which are present in Ajami manuscripts but lacking in many manuscripts for other Arabic-script languages like Arabic and Persian. The latter poses a significant challenge to Ajami HTR. We release the following as an open-source dataset: an ALTO formatting of high-quality images of Fulfulde and Hausa manuscripts, manual segmentation (region and line), and manual transcriptions. Our HTR dataset is also the first to diplomatically transcribe newly Unicode-encoded, special Quranic recitation characters.
We evaluate a suite of Arabic-script recognition models specifically for historical manuscripts and find that they produce character error rates of 65–84% when attempting to automatically transcribe our curated manuscripts. Transcriptions produced by the evaluated models are released as well.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/c33132da-8c68-4065-ae59-d3f1fdf643bb

author

Yousuf, Oreen ; Aminu, Abdulmalik ; Muhammad, Musa Salih ; Usman, Bashir ; Hashim, Mustapha Kurfi ; Nivre, Joakim ; Megyesi, Beáta and Høgel, Christian (LU)

organization

publishing date

2026

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Natural Language Processing

keywords

Ajami, benchmark dataset, handwritten text recognition

host publication

Document Analysis and Recognition – ICDAR 2025 - 19th International Conference, Proceedings

series title

Lecture Notes in Computer Science

editor

Yin, Xu-Cheng ; Karatzas, Dimosthenis and Lopresti, Daniel

volume

16026 LNCS

pages

18 pages

publisher

Springer Science and Business Media B.V.

conference name

19th International Conference on Document Analysis and Recognition, ICDAR 2025

conference location

Wuhan, China

conference dates

2025-09-16 - 2025-09-21

external identifiers

scopus:105017374269

ISSN

1611-3349

0302-9743

ISBN

9783032046260

DOI

10.1007/978-3-032-04627-7_36

language

English

LU publication?

yes

id

c33132da-8c68-4065-ae59-d3f1fdf643bb

date added to LUP

2025-11-20 11:24:10

date last changed

2026-07-04 19:14:34

@inproceedings{c33132da-8c68-4065-ae59-d3f1fdf643bb,
  abstract     = {{We present the first-ever dataset of manually segmented and transcribed Ajami manuscripts written in Fulfulde and Hausa. The term Ajami refers to modified Arabic-script orthographies in Africa. Existing handwritten text recognition (HTR) and optical character recognition (OCR) models for Arabic-script languages perform poorly on West African manuscripts due to a lack of representation of these manuscripts in the models’ pre-training. This leads to models struggling to adapt to Ajami-style calligraphy, being unequipped to recognize Ajami-specific characters, and being unable to extract certain Arabic-script diacritics which are present in Ajami manuscripts but lacking in many manuscripts for other Arabic-script languages like Arabic and Persian. The latter poses a significant challenge to Ajami HTR. We release the following as an open-source dataset: an ALTO formatting of high-quality images of Fulfulde and Hausa manuscripts, manual segmentation (region and line), and manual transcriptions. Our HTR dataset is also the first to diplomatically transcribe newly Unicode-encoded, special Quranic recitation characters. We evaluate a suite of Arabic-script recognition models specifically for historical manuscripts and find that they produce character error rates of 65–84% when attempting to automatically transcribe our curated manuscripts. Transcriptions produced by the evaluated models are released as well.}},
  author       = {{Yousuf, Oreen and Aminu, Abdulmalik and Muhammad, Musa Salih and Usman, Bashir and Hashim, Mustapha Kurfi and Nivre, Joakim and Megyesi, Beáta and Høgel, Christian}},
  booktitle    = {{Document Analysis and Recognition – ICDAR 2025 - 19th International Conference, Proceedings}},
  editor       = {{Yin, Xu-Cheng and Karatzas, Dimosthenis and Lopresti, Daniel}},
  isbn         = {{9783032046260}},
  issn         = {{1611-3349}},
  keywords     = {{Ajami; benchmark dataset; handwritten text recognition}},
  language     = {{eng}},
  pages        = {{620--637}},
  publisher    = {{Springer Science and Business Media B.V.}},
  series       = {{Lecture Notes in Computer Science}},
  title        = {{A Handwritten Text Recognition Dataset for Ajami Manuscripts in Fulfulde and Hausa}},
  url          = {{http://dx.doi.org/10.1007/978-3-032-04627-7_36}},
  doi          = {{10.1007/978-3-032-04627-7_36}},
  volume       = {{16026 LNCS}},
  year         = {{2026}},
}
