Linking Individuals Across Historical Sources: A Fully Automated Approach

Ran Abramitzky, Roy Mill, Santiago Pérez

NBER Working Paper No. 24324
Issued in February 2018, Revised in March 2019
NBER Program(s): Economics of Aging, Development of the American Economy, Labor Studies

Linking individuals across historical datasets relies on information such as name and age that is both non-unique and prone to enumeration and transcription errors. These errors make it impossible to find the correct match with certainty. In the first part of the paper, we propose a fully automated probabilistic method for linking historical datasets that enables researchers to create samples at the frontier of minimizing type I (false positive) and type II (false negative) errors. The first step guides researchers in choosing which variables to use for linking. The second step uses the Expectation-Maximization (EM) algorithm, a standard tool in statistics, to compute the probability that each pair of records corresponds to the same individual. The third step suggests how to use these estimated probabilities to choose which records to use in the analysis. In the second part of the paper, we apply the method to link historical population censuses in the US and Norway, and use these samples to estimate measures of intergenerational occupational mobility. The estimates using our method are remarkably similar to those using IPUMS' method, which relies on hand linking to create a training sample. We provide R code and a Stata command that implement this method.
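To make the second and third steps concrete, the following is a minimal Python sketch and not the authors' R or Stata implementation. It assumes a simplified two-class mixture in which each candidate pair is summarized by binary agreement indicators (for example, name agrees, age agrees) that are independent within the match and non-match classes; the abstract does not specify the paper's actual model. The function names, starting values, and thresholds (`keep_above`, `runner_up_below`) are hypothetical and chosen only for illustration.

```python
def _clamp(x, eps=1e-6):
    # Keep probabilities strictly inside (0, 1) so no likelihood collapses to 0.
    return min(max(x, eps), 1.0 - eps)


def em_match_probabilities(patterns, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-class mixture over binary agreement patterns.

    patterns: list of equal-length tuples of 0/1, one tuple per candidate pair.
    p: starting share of true matches (assumed starting value).
    m[j], u[j]: probability that field j agrees for matches / non-matches.
    Returns (posteriors, p, m, u).
    """
    n, k = len(patterns), len(patterns[0])
    m = [m_init] * k
    u = [u_init] * k
    post = [p] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        post = []
        for g in patterns:
            like_m, like_u = p, 1.0 - p
            for j, agree in enumerate(g):
                like_m *= m[j] if agree else 1.0 - m[j]
                like_u *= u[j] if agree else 1.0 - u[j]
            post.append(like_m / (like_m + like_u))
        # M-step: re-estimate the parameters from the posterior weights.
        w_match = sum(post)
        w_non = n - w_match
        p = _clamp(w_match / n)
        for j in range(k):
            agree_m = sum(w * g[j] for w, g in zip(post, patterns))
            agree_u = sum((1.0 - w) * g[j] for w, g in zip(post, patterns))
            m[j] = _clamp(agree_m / max(w_match, 1e-12))
            u[j] = _clamp(agree_u / max(w_non, 1e-12))
    return post, p, m, u


def select_links(candidates, keep_above=0.9, runner_up_below=0.5):
    """Keep a link only when the best candidate is probable and clearly ahead.

    candidates: dict mapping a record id in the first source to a list of
    (record id in the second source, estimated match probability).
    Both thresholds are illustrative assumptions. Uniqueness from the second
    source's side is not enforced in this sketch.
    """
    links = {}
    for a_id, cands in candidates.items():
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)
        if not ranked:
            continue
        best_id, best_p = ranked[0]
        second_p = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_p >= keep_above and second_p <= runner_up_below:
            links[a_id] = best_id
    return links
```

In this sketch, raising `keep_above` or lowering `runner_up_below` trades fewer false positives for more false negatives, which is the type I versus type II frontier the abstract refers to.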

Document Object Identifier (DOI): 10.3386/w24324

Published: Ran Abramitzky & Roy Mill & Santiago Pérez, 2020. "Linking individuals across historical sources: A fully automated approach," Historical Methods: A Journal of Quantitative and Interdisciplinary History, vol. 53(2), pages 94-111.
