Automated Linking of Historical Data
The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms use the same linking variables, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
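The tradeoff described above between the false positive rate and the (true) match rate can be illustrated with a minimal Python sketch. It is not the authors' code or their Stata commands. The function name `linkage_rates`, the dictionary representation of links, and the exact definitions used (false positive rate as the share of links made that disagree with a benchmark, true match rate as the share of all records linked correctly) are assumptions made for illustration, and the toy data are invented.

```python
def linkage_rates(auto_links, benchmark_links, n_records):
    """Summarize a set of automated links against a benchmark (hypothetical helper).

    auto_links, benchmark_links: dicts mapping a source record id to the
    target record id it was linked to. n_records is the number of source
    records that the algorithm attempted to link.
    Assumption: any automated link that differs from, or is absent in,
    the benchmark is counted as a false positive.
    """
    if n_records <= 0:
        raise ValueError("n_records must be positive")

    n_linked = len(auto_links)
    n_correct = sum(
        1 for src, tgt in auto_links.items() if benchmark_links.get(src) == tgt
    )
    n_wrong = n_linked - n_correct

    return {
        # Share of all source records for which a link was made.
        "match_rate": n_linked / n_records,
        # Share of all source records linked to the benchmark target.
        "true_match_rate": n_correct / n_records,
        # Share of the links made that disagree with the benchmark.
        "false_positive_rate": n_wrong / n_linked if n_linked else 0.0,
    }


# Invented toy data: ten source records with known benchmark targets.
benchmark = {i: 100 + i for i in range(10)}

# A conservative method links few records, all of them correctly.
conservative = {i: 100 + i for i in range(4)}

# A more aggressive method links more records but makes two errors.
aggressive = {i: 100 + i for i in range(6)}
aggressive.update({6: 999, 7: 998})

conservative_rates = linkage_rates(conservative, benchmark, n_records=10)
aggressive_rates = linkage_rates(aggressive, benchmark, n_records=10)
```

In this toy case the conservative method has a true match rate of 0.4 with no false positives, while the aggressive method raises the true match rate to 0.6 at a false positive rate of 0.25, which is the shape of tradeoff the abstract describes.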
We are grateful to Jaime Arellano-Bover, Helen Kissel, and Tom Zohar for superb research assistance and useful comments and conversations, and to Horace Lee and Antigone Xenopoulos for help with data collection. We are grateful to Steven Durlauf (the editor) and six anonymous referees, as well as to Alvaro Calderón, Jacob Conway, John Parman, Laura Salisbury and Marianne Wanamaker for their most useful comments and suggestions. We are grateful to the Laura and John Arnold Foundation for financial support. We especially wish to thank Joe Price and Jacob Van Leeuwen from the BYU Record Linking Lab for comparing linkages made using our codes to the hand linkages made by users in the Family Tree Data on the FamilySearch.org website. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.
Ran Abramitzky & Leah Boustan & Katherine Eriksson & James Feigenbaum & Santiago Pérez, 2021. "Automated Linking of Historical Data," Journal of Economic Literature, American Economic Association, vol. 59(3), pages 865-918, September.