Linking Individuals Across Historical Sources: A Fully Automated Approach
Linking individuals across historical datasets relies on information such as name and age that is both non-unique and prone to enumeration and transcription errors. Because of these errors, the correct match cannot be identified with certainty. We propose a fully automated method for linking historical datasets that enables researchers to create samples that minimize both type I (false positive) and type II (false negative) errors. The first step of the method uses the Expectation-Maximization (EM) algorithm, a standard tool in statistics, to estimate the probability that any two observations correspond to the same individual. The second step uses these estimated probabilities to determine which records to use in the analysis. We provide code to implement this method.
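The two steps can be illustrated with a minimal sketch. It is not the code distributed with the paper. It assumes a simplified two-class mixture in which each candidate pair is summarized by binary agreement indicators (for example, name agrees, age agrees) that are independent within each class. The function names, starting values, and thresholds (`em_match_probabilities`, `select_links`, `min_best`, `max_second`) are hypothetical choices made for illustration only.

```python
def _clip(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid degenerate updates."""
    return min(max(x, eps), 1.0 - eps)


def em_match_probabilities(patterns, n_iter=500, tol=1e-10):
    """Step 1 (sketch): EM for a two-class mixture over agreement patterns.

    patterns: list of equal-length tuples of 0/1 indicators, one per
    candidate pair. Returns (posteriors, p, m, u), where posteriors[i] is
    the estimated probability that pair i is a true match, computed in the
    final E-step. Starting values are assumptions, not estimates.
    """
    n, k = len(patterns), len(patterns[0])
    p = 0.1            # assumed initial share of true matches
    m = [0.9] * k      # P(field agrees | match), initial guess
    u = [0.1] * k      # P(field agrees | non-match), initial guess
    post = [0.0] * n
    for _ in range(n_iter):
        # E-step: posterior match probability for every candidate pair.
        for i, g in enumerate(patterns):
            like_m, like_u = p, 1.0 - p
            for j, agrees in enumerate(g):
                like_m *= m[j] if agrees else 1.0 - m[j]
                like_u *= u[j] if agrees else 1.0 - u[j]
            post[i] = like_m / (like_m + like_u)
        # M-step: re-estimate parameters as posterior-weighted averages.
        total = max(sum(post), 1e-12)
        rest = max(n - sum(post), 1e-12)
        new_p = _clip(total / n)
        m = [_clip(sum(w * g[j] for w, g in zip(post, patterns)) / total)
             for j in range(k)]
        u = [_clip(sum((1.0 - w) * g[j] for w, g in zip(post, patterns)) / rest)
             for j in range(k)]
        converged = abs(new_p - p) < tol
        p = new_p
        if converged:
            break
    return post, p, m, u


def select_links(candidates, min_best=0.9, max_second=0.5):
    """Step 2 (sketch): keep a link only when it is likely and unambiguous.

    candidates: dict mapping a record id in source A to a list of
    (record id in source B, match probability) tuples. A link is kept when
    the best candidate's probability is at least min_best and the runner-up's
    is at most max_second. Only the A-to-B direction is checked here.
    """
    links = {}
    for a_id, cands in candidates.items():
        if not cands:
            continue
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)
        best_id, best_prob = ranked[0]
        second_prob = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_prob >= min_best and second_prob <= max_second:
            links[a_id] = best_id
    return links
```

In this sketch, raising `min_best` or lowering `max_second` trades fewer false positives for more false negatives, which is the type I versus type II trade-off the second step is meant to manage.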
Digital Object Identifier (DOI): 10.3386/w24324