Examining the role of training data for supervised methods of automated record linkage : Lessons for best practice in economic history

Feigenbaum, James J.; Helgertz, Jonas; Price, Joseph

Feigenbaum, James J.; Helgertz, Jonas and Price, Joseph (2025) In Explorations in Economic History 96.

Abstract: During the past decade, scholars have produced a vast amount of research using linked historical individual-level data, shaping and changing our understanding of the past. This linked data revolution has been powered by methodological and computational advances, partly focused on supervised machine-learning methods that rely on training data. The importance of obtaining high-quality training data for the performance of the record linkage algorithm largely, however, remains unknown. This paper comprehensively examines the role of training data, and, by extension, improves our understanding of best practices in supervised methods of probabilistic record linkage. First, we compare the speed and costs of building training data using different methods. Second, we document high rates of conditional accuracy across the training data sets, rates that are especially high when built with access to more information. Third, we show that data constructed by record linking algorithms learning from different training-data-generation methods do not substantially differ in their accuracy, either overall or across demographic groups, though algorithms tend to perform best when their feature space aligns with the features used to build the training data. Lastly, we introduce errors in the training data and find that the examined record linking algorithms are remarkably capable of making accurate links even working with flawed training data.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/46305d00-0b77-4034-b521-228183e4b564

author

Feigenbaum, James J.; Helgertz, Jonas and Price, Joseph

organization

publishing date

2025-04

type

Contribution to journal

publication status

published

subject

Other Computer and Information Science

keywords

Automated record linkage, Historical data, Probabilistic record linkage, Supervised record linkage, Training data

in

Explorations in Economic History

volume

96

article number

101656

publisher

Academic Press

external identifiers

scopus:85219581882

ISSN

0014-4983

DOI

10.1016/j.eeh.2025.101656

language

English

LU publication?

yes

id

46305d00-0b77-4034-b521-228183e4b564

date added to LUP

2025-06-18 13:07:30

date last changed

2025-10-14 09:11:58

@article{46305d00-0b77-4034-b521-228183e4b564,
  abstract     = {{During the past decade, scholars have produced a vast amount of research using linked historical individual-level data, shaping and changing our understanding of the past. This linked data revolution has been powered by methodological and computational advances, partly focused on supervised machine-learning methods that rely on training data. The importance of obtaining high-quality training data for the performance of the record linkage algorithm largely, however, remains unknown. This paper comprehensively examines the role of training data, and, by extension, improves our understanding of best practices in supervised methods of probabilistic record linkage. First, we compare the speed and costs of building training data using different methods. Second, we document high rates of conditional accuracy across the training data sets, rates that are especially high when built with access to more information. Third, we show that data constructed by record linking algorithms learning from different training-data-generation methods do not substantially differ in their accuracy, either overall or across demographic groups, though algorithms tend to perform best when their feature space aligns with the features used to build the training data. Lastly, we introduce errors in the training data and find that the examined record linking algorithms are remarkably capable of making accurate links even working with flawed training data.}},
  author       = {{Feigenbaum, James J. and Helgertz, Jonas and Price, Joseph}},
  issn         = {{0014-4983}},
  keywords     = {{Automated record linkage; Historical data; Probabilistic record linkage; Supervised record linkage; Training data}},
  language     = {{eng}},
  publisher    = {{Academic Press}},
  journal      = {{Explorations in Economic History}},
  title        = {{Examining the role of training data for supervised methods of automated record linkage : Lessons for best practice in economic history}},
  url          = {{http://dx.doi.org/10.1016/j.eeh.2025.101656}},
  doi          = {{10.1016/j.eeh.2025.101656}},
  volume       = {{96}},
  year         = {{2025}},
}

Lund University Publications
