Lund University Publications

Joint Handwritten Text Recognition and Word Classification for Tabular Information Extraction

Blomqvist, Christopher; Enflo, Kerstin; Jakobsson, Andreas and Åström, Kalle (2022) 26th International Conference on Pattern Recognition, 2022, p. 1564-1570
Abstract
In this paper, we present a system for extracting tabular information from loosely structured handwritten documents. The system consists of three parts: (i) a U-Net-like CNN-based method for text detection and segmentation, (ii) a new attention-based method for simultaneous text recognition and classification of word parts, and (iii) a method for matching the word parts into a tabular structure for each entry. A key contribution is the observation that the new attention-based recognition and classification module enables improved spatial analysis of the tabular information. The method is evaluated on a unique historical document: the Swedish Wealth Tax of 1571, consisting of 11,453 pages of handwritten tax records. The evaluation shows that the system provides a significant improvement over the state of the art for the problem of tabular extraction from loosely structured historical documents.
type
Chapter in Book/Report/Conference proceeding
publication status
published
subject
keywords
Histograms, Image segmentation, Text recognition, Finance, Writing, Information retrieval, Decoding
host publication
2022 26th International Conference on Pattern Recognition (ICPR)
pages
6 pages
publisher
IEEE - Institute of Electrical and Electronics Engineers Inc.
conference name
26th International Conference on Pattern Recognition, 2022
conference location
Montreal, Canada
conference dates
2022-08-21 - 2022-08-25
external identifiers
  • scopus:85128381076
ISBN
978-1-6654-9063-4
978-1-6654-9062-7
DOI
10.1109/ICPR56361.2022.9956282
project
Praise the people or praise the place: How culture and specialization drive long-term regional growth
language
English
LU publication?
yes
id
b5f50e29-597f-474b-b687-ab45f476d11d
date added to LUP
2022-12-12 14:43:41
date last changed
2024-06-13 21:41:09
@inproceedings{b5f50e29-597f-474b-b687-ab45f476d11d,
  abstract     = {{In this paper, we present a system for extracting tabular information from loosely structured handwritten documents. The system consists of three parts, (i) a u-net like CNN-based method for text detection and segmentation, (ii) a new attention-based method for simultaneous text recognition and classification of word-parts, and (iii) a method for matching the word parts into a tabular structure for each entry. A key contribution is the observation that the new attention-based recognition and classification module makes it possible for improved spatial analysis of the tabular information. The method is evaluated on a unique historical document: The Swedish Wealth Tax of 1571, consisting of 11,453 pages of hand-written tax records. The evaluation shows that the system provides a significant improvement to the state-of-the-art to the problem of tabular extraction from loosely structured historical documents.}},
  author       = {{Blomqvist, Christopher and Enflo, Kerstin and Jakobsson, Andreas and Åström, Kalle}},
  booktitle    = {{2022 26th International Conference on Pattern Recognition (ICPR)}},
  isbn         = {{978-1-6654-9063-4}},
  keywords     = {{Histograms; Image segmentation; Text recognition; Finance; Writing; Information retrieval; Decoding}},
  language     = {{eng}},
  month        = {{11}},
  pages        = {{1564--1570}},
  publisher    = {{IEEE - Institute of Electrical and Electronics Engineers Inc.}},
  title        = {{Joint Handwritten Text Recognition and Word Classification for Tabular Information Extraction}},
  url          = {{http://dx.doi.org/10.1109/ICPR56361.2022.9956282}},
  doi          = {{10.1109/ICPR56361.2022.9956282}},
  year         = {{2022}},
}