This project will create a large database of digitized, linked census records for people living in the United States between 1850 and 1950. The project will expand the set of available links among the US censuses for these years by a factor of three, while simultaneously increasing the representation of women and minority groups in the data. The research team will also develop and use state-of-the-art handwriting recognition tools to digitize millions of census, vital statistics, and other records. The database will be used in promising research in the social, behavioral, and economic sciences that would not otherwise be possible. For example, the data will allow researchers to measure the intergenerational transmission of wealth or education, and to estimate the long-term impacts of childhood circumstances. This project will enhance the infrastructure for research and education because the new data science tools will be shared with the research community. Because the majority of the resources for this project will support graduate and undergraduate research assistants, the project will also contribute to the development of a globally competitive STEM workforce.
This project will create an infrastructure for transcribing and linking together large collections of handwritten historical documents using deep machine learning and crowdsourced human resources. Our project is informed by other initiatives that take advantage of recent advances in data science, computer vision, and machine learning to extract text from historical images, convert that text into usable data, and link the individuals in the record to other collections. Our work will complement these efforts, and we have unique advantages that will allow us to dramatically advance the frontier in this area. These advantages arise from our partnership with FamilySearch.org, the largest genealogical research organization in the world, which allows us to access the largest-ever corpus of labeled handwriting images from historical records (1 billion+). Furthermore, because the users of these genealogy platforms have themselves created millions of record links using their private information, we have access to a massive database of "true" links that can inform our data cleaning processes and act as training data to improve the machine learning algorithms that are part of our linking strategy. With these advantages, we will create a digitized, longitudinal panel of linked census records that will include each of the 217 million people who lived in the United States between 1850 and 1950. For the majority of these we will be able to link individuals across multiple censuses and across families, and we will extend this effort to other publicly available record collections. We will make the tools and training data that we create easy to access and use by other researchers and the general public.
Supported by National Science Foundation grant #2049762.