Assessing Data Quality in Image Recognition Datasets of Swedish Financial Reports

Demuth, Maren

Demuth, Maren ^LU (2024) In Master's Theses in Mathematical Sciences MASM02 20232
Mathematical Statistics

Abstract: A large number of financial statements of Swedish companies are registered with the Swedish companies registration office, Bolagsverket, as paper copies. These copies are then scanned and made available to the public as image PDFs, meaning the financial information contained in them is not digitised and therefore not easily processed in an automated manner. Being able to do so is valuable, however, e.g. to offer automated credit risk assessments or to investigate fraud cases with modern technologies. In order to digitise this financial information, the company I work for has developed image recognition algorithms that can create a structured data representation of the financial statements. In some cases, however, the image recognition fails to create an accurate representation of what is written in the financial statement. This poses a data quality issue: further applications built on the digitised financial data might be misled and offer a skewed or outright wrong perspective of the underlying financial situation of a company. The goal of this thesis is therefore to develop a solution that can identify cases in which the image recognition data does not match its true counterpart.
The solution developed in this thesis is threefold. First, the quality of the extracted data is evaluated through the assignment of so-called error labels. Second, a Random Forest classifier is trained to predict these error labels. Lastly, a quality score is calculated to suggest the best possible representation for each value in a financial statement. It is shown that this approach obtains reasonable results and beats an existing approach to the same problem by a considerable margin. The solution built in this thesis therefore offers a valuable extension to the image recognition program: it allows for a data quality assessment of the extracted information and thereby increases the confidence one can have that the digitised financial data accurately represents the original paper copies.
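As a rough illustration of the third step, the sketch below picks the most trustworthy candidate for a single value in a statement. It assumes, hypothetically, that several candidate extractions exist for the same value and that a classifier has already produced an error probability for each one. The score definition `1 - p_error`, the `Candidate` type, the function `best_candidate` and the numbers are illustrative choices, not the definitions or data used in the thesis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    value: int       # extracted figure (made-up numbers in this sketch)
    p_error: float   # hypothetical: predicted probability that the extraction is wrong


def quality_score(candidate: Candidate) -> float:
    """Illustrative score: the higher, the more likely the extraction is correct."""
    return 1.0 - candidate.p_error


def best_candidate(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest quality score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=quality_score)


# Two competing extractions of the same value, with assumed error probabilities.
candidates = [Candidate(value=8502, p_error=0.75), Candidate(value=4502, p_error=0.25)]
best = best_candidate(candidates)
print(best.value, quality_score(best))
```

In a real pipeline the error probabilities would presumably come from the trained classifier; here they are hard-coded so that the selection step can be read on its own.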
Popular Abstract: In this thesis, I build a solution that can identify whether a data point is of good or bad quality. These data points are created by a program that extracts information from financial reports. The program uses image recognition technology, and the problem is that it sometimes extracts the wrong information from the financial report. I want to figure out when that happens.
A financial report contains numbers that describe a company’s financial situation, usually over the period of a year. For example, it contains the company’s revenue and cost figures. In Sweden, these reports are often registered as paper copies with the financial authorities and then scanned into PDFs. Because these PDFs are essentially just images, the information contained in them cannot easily be processed with computers. It would be very valuable to be able to do that, however, for example to offer automated credit risk assessments of these companies. For this reason, the company I work for has developed an image recognition program that can extract information from the PDFs and store it in a table format.
The central problem of my thesis is that the program sometimes does not extract the correct information. In the PDF, the revenue value might read 4502 SEK, but the program has extracted the value 8502 SEK instead. I want to be able to identify when that happens, to make sure that the extracted information is of high quality. To solve this problem, I build a machine learning program that can predict whether an extracted number matches the number written in the PDF. As it turns out, this is a decent way to identify incorrect extractions. This means that my machine learning program can be integrated with the image recognition program developed by the company I work for, to improve the quality of the extracted information it offers.
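The idea of checking an extracted number against the number in the PDF can be sketched in a few lines. This is only an illustration, not the thesis's implementation: the function name `assign_error_label`, the label strings, and the assumption that a manually verified reference value is available are all hypothetical.

```python
from typing import Optional


def assign_error_label(extracted: Optional[int], reference: int) -> str:
    """Compare an extracted value with a manually verified reference value.

    The label names are hypothetical; the thesis may define its error
    labels differently.
    """
    if extracted is None:
        return "missing"
    return "correct" if extracted == reference else "error"


# A revenue value that reads 4502 SEK in the PDF, checked against three
# possible outcomes of the extraction program.
labels = [
    assign_error_label(4502, 4502),  # extraction matches the PDF
    assign_error_label(8502, 4502),  # one digit misread
    assign_error_label(None, 4502),  # nothing extracted
]
print(labels)
```

Labels of this kind could then serve as the target that a classifier learns to predict for new, unverified extractions.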

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9164455

author

Demuth, Maren ^LU

supervisor

Magnus Wiktorsson ^LU

organization

Mathematical Statistics

course

MASM02 20232

year

2024

type

H2 - Master's Degree (Two Years)

subject

Mathematics and Statistics

keywords

Data Quality, Random Forest, Financial Reports

publication/series

Master's Theses in Mathematical Sciences

report number

LUNFMS-3129-2024

ISSN

1404-6342

other publication id

2024:E58

language

English

id

9164455

date added to LUP

2024-06-19 16:24:58

date last changed

2025-01-22 14:22:35

@misc{9164455,
  abstract     = {{A large number of financial statements of Swedish companies are registered with the Swedish companies registration office, Bolagsverket, as paper copies. These copies are then scanned and made available to the public as image PDFs, meaning the financial information contained in them is not digitised and therefore not easily processed in an automated manner. Being able to do so is valuable, however, e.g. to offer automated credit risk assessments or to investigate fraud cases with modern technologies. In order to digitise this financial information, the company I work for has developed image recognition algorithms that can create a structured data representation of the financial statements. In some cases, however, the image recognition fails to create an accurate representation of what is written in the financial statement. This poses a data quality issue: further applications built on the digitised financial data might be misled and offer a skewed or outright wrong perspective of the underlying financial situation of a company. The goal of this thesis is therefore to develop a solution that can identify cases in which the image recognition data does not match its true counterpart.
The solution developed in this thesis is threefold. First, the quality of the extracted data is evaluated through the assignment of so-called error labels. Second, a Random Forest classifier is trained to predict these error labels. Lastly, a quality score is calculated to suggest the best possible representation for each value in a financial statement. It is shown that this approach obtains reasonable results and beats an existing approach to the same problem by a considerable margin. The solution built in this thesis therefore offers a valuable extension to the image recognition program: it allows for a data quality assessment of the extracted information and thereby increases the confidence one can have that the digitised financial data accurately represents the original paper copies.}},
  author       = {{Demuth, Maren}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Assessing Data Quality in Image Recognition Datasets of Swedish Financial Reports}},
  year         = {{2024}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
