
Ancestry.com and IPUMS Complete Count Restricted File.

Ancestry.com has sponsored the digitization of the available complete count census files and allowed IPUMS to offer all but the respondent names on its website to the general user population.

1790-1830 are household level files. 1840-1940 are person level files. 1890 was lost in a fire and nothing is available for that year.

Confidentiality Considerations

The name (namefrst and namelast) fields are available to affiliated users at the NBER by special arrangement through IPUMS. NBER affiliates wishing to use the IPUMS-RESTRICTED (Ancestry.com) census files for a new project should sign the application and NDA forms here and send them to Carla Tokman and our IRB for our submission to IPUMS. Once approved and assigned a project number by IPUMS, the project will be forwarded to the NBER IRB for review. The IRB may follow up with additional questions.

Once all approvals are in place you can be added to the Linux groups with permission to read the data. These files (or extracts) may be processed on our servers but should not be downloaded from them.

To add an investigator or RA to an existing project send a signed agreement form marked with the project title and IPUMS number. The approval process is the same and you will be notified when the new researcher can access the data.

Linking between the restricted and public versions is fine, but the data must be maintained/analyzed on the NBER server.

Please ensure that your extracts are not world readable. It is important to respect the agreement to ensure continued access to this important resource, for you and your colleagues. I can create a shared directory for you and others working on the same project.
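As an illustration of the world-readable check, here is a minimal Python sketch using only the standard library. The function names are made up for this example, and the shell command `chmod o-rwx` on the file does the same job; this is one way to do it, not a required procedure.

```python
import os
import stat


def world_accessible(path):
    """Return True if users outside the owner and group have any
    read, write or execute permission on path."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IRWXO)


def strip_world_access(path):
    """Remove all permissions for 'other' users, leaving the owner
    and group permission bits unchanged."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~stat.S_IRWXO)


# Example use (the file name is hypothetical):
#   if world_accessible("my_extract.dta"):
#       strip_world_access("my_extract.dta")
```

Group permissions are left alone on purpose, so that a shared project directory keeps working for the other members of the project group.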

On at least one occasion a researcher has been allowed to export data to a very secure server for matching to other confidential data. Details are here but this is not done for convenience.

Replication

If the data is anonymized and remains on the server, we can give the journal representative access for data replication. If the data is going to be removed from the server, or will not be anonymized and remains on the server, IPUMS will need to provide additional approval. In one case, the researchers removed all individual names from their data and provided us a list of variables. IPUMS considered it anonymized when there were no potential string variables. We can always run the list of variables by IPUMS if it would be helpful.

Citation

Publications and research reports based on the IPUMS USA database must cite it appropriately. The citation should include the following:

Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0

File locations

Starting with the June 2019 distributions we keep our copies of the files in

/home/data/census-ipums/ and its subdirectories. We have 1850-1940 except 1890 (1890 was lost in a fire). These files contain all the named fields, not just the restricted fields. Unedited (numbered) fields are not included. If you need unedited fields, contact Dan Feenberg; we can add what you need.

Each IPUMS revision set is kept in a separate directory. The latest is /home/data/census-ipums/v2022, the version obtained by NBER in September 2022, whose file dates are from February and April of that year. Within that directory are directories "orig", "sps", etc. for programs that can read the data files in various packages. These programs are provided for reference, as we have already modified the Stata code to convert the raw ASCII files (in ./dat) to dta format in ./dta. Comma delimited files are in the ./csv directory. Parquet files (./parquet) were not created for v2022, but could be on request. Other formats can be added if useful and requested.

The locations of earlier versions of the files will not change, and those files will not be deleted or updated. /home/data/census-ipums/current will always point to the latest available revision, but older versions will be retained for consistency. For this reason there is no need to make private copies of the full datasets, so please do not.

Documentation

There is the beginning of an FAQ here. The IPUMS website covers all the publicly available variables. The additional variables in the restricted use file include:

namefrst: 16 character first name (and possibly middle initial)
namelast: 16 character last name
histid: 36 character person id for matching across IPUMS versions (but not census decades)
street: street address

Here is a compact concordance of variables and descriptions.

File Structure

The original files are hierarchical, but we have created the dta etc files as rectangular person datasets. That is, the household record is appended to each person record. We also apply the scaling factors in the IPUMS supplied code, which should conform the data to the documentation. Value labels that are merely the ASCII expression of the numeric value are dropped. There are no other changes.
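To make the rectangular layout concrete, here is a minimal Python sketch of turning a hierarchical file into person records with the household fields appended. The fixed-width column positions and the three example variables are invented for illustration and are not the real IPUMS layout; only the "H" and "P" record-type markers come from the files themselves. Person records with no matching household record are skipped in this sketch.

```python
# Hypothetical column positions, for illustration only.
H_FIELDS = {"serial": (1, 5), "statefip": (5, 7)}
P_FIELDS = {"serial": (1, 5), "pernum": (5, 7), "age": (7, 10)}


def parse(line, fields):
    """Cut a fixed-width line into integer fields."""
    return {name: int(line[a:b]) for name, (a, b) in fields.items()}


def rectangularize(lines):
    """Yield one dict per person, with the fields of the most recent
    household record merged in. Person records that do not belong to
    the current household are skipped."""
    household = None
    for line in lines:
        if line.startswith("H"):
            household = parse(line, H_FIELDS)
        elif line.startswith("P"):
            person = parse(line, P_FIELDS)
            if household is None or household["serial"] != person["serial"]:
                continue  # orphan person record
            yield {**household, **person}
```

Because the function is a generator, it can stream a file of any size one line at a time rather than holding the whole file in memory.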

Resource considerations

These are very large files, as evidenced by record counts and file sizes (up to 150GB), but with some thought it is practical to work with only traditional econometric software. The .dta files are somewhat more reasonable. There is advice for Stata users on dealing with very large files here, but see especially this general advice for large projects and these suggestions from James Feigenbaum specific to this data.

A new, fast and compact format is Parquet. It is column oriented, so if you load just a few variables only a fraction of the file need be read. Please see here for details. However, the Parquet support in Stata and R is so ineffective that we have not continued to supply Parquet versions after v2021.

Matching

The Census Linking Project at Princeton created a set of linked datasets between every historical Census pair using a variety of automated methods. There are considerable savings in time and resources in using a pre-made match. The code and documentation are also available on our system at

http://www.nber.org/data/census-ipums/linking_project

while access to the data on our system is restricted to members of the "cens1930" group. For internal use the files are at:

/home/data/census-ipums/linking_project

Publications using data from the matches should cite the Census Linking Project as: Ran Abramitzky, Leah Boustan and Myera Rashid. Census Linking Project: Version 1.0 [dataset]. 2020. https://censuslinkingproject.org

Please respect other users by being reasonably efficient with computational resources, especially memory. In particular when reading even one of these files into Stata you will want to subset on variables or rows, or both. There is a directory

/home/data/census-ipums/tiny with Arkansas records only. This is a good way to get a small sample for testing that can be followed through time. In Stata, using a qualifier such as use /home/data/census-ipums/current/dta/1920 in 1/10000 will give you a small file, but with no ability to link across decades. use /home/data/census-ipums/tiny/dta/1920 is much more satisfactory.
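The same subsetting idea applies outside Stata. Below is a minimal Python sketch that streams a comma delimited file and keeps only the requested variables and the first rows, so the full file is never held in memory. It assumes the csv files have a header row of variable names, which should be checked before relying on it; the function name and the path in the comment are illustrative.

```python
import csv
from itertools import islice


def read_subset(handle, columns, max_rows=None):
    """Read an open CSV file and return a list of dicts holding only
    the named columns, stopping after max_rows rows (all rows if
    max_rows is None)."""
    reader = csv.DictReader(handle)
    rows = ({c: row[c] for c in columns} for row in reader)
    return list(islice(rows, max_rows))


# Example use (hypothetical path):
#   with open("/path/to/1920.csv", newline="") as f:
#       sample = read_subset(f, ["histid", "namelast"], max_rows=10000)
```

As with `in 1/10000` in Stata, taking the first rows gives a small file with no ability to link across decades; the tiny directory is the better choice for that.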

Showload will show you the available memory on all the machines, and "top -o RES" will show how your job is doing on the current machine. Computer time is cheap, but waiting for the computer is not. You may run multiple jobs, but resist the urge to use more than half the available CPU or memory on any one machine. If you ask for more memory than the computer has free, your job will run so slowly that it may never finish. It is always a good idea to keep track of the progress of large, long jobs.

Note that all our disk storage is compressed in the filesystem. Zip or Z compression will not reduce the actual resources used, and will add complexity and time to your analysis.

Notes and questions for discussion

Are any more of the uscenNNNN_NNNN (numbered) variables useful? Including all of them would multiply the load times. At this time only these have been added and only in the decades noted:

rawmcd	Minor Civil Division	us1900m_0045
rawmcd	Minor Civil Division	us1910m_0053
rawhnum	House Number	us1910m_0056
rawmcd	Minor Civil Division	us1940b_0074
rawhnum	House number	us1930d_0061
rawhnum	House number	us1940b_0028

The unedited variables are given the prefix "raw" to avoid confusion when edited versions become available. Apparently none of these variables are supplied in the 2022 version of the files.

Online documentation from IPUMS suggests using datanum, serial and pernum for identifying individual records, but common practice among NBER users is to use histid. Datanum is absent in 1860-70. An advantage of histid is that it is mostly maintained across IPUMS versions. However, IPUMS had some early hiccups with this plan, and 1880 and 1940 do have different histids across versions.
The name fields sometimes contain what appear to me to be stray nonsense characters such as quote marks, dollar signs, brackets, etc. The quote marks especially can discomfort Stata. Would it be better to drop these?
The programs from IPUMS would create hierarchical files. I didn't think most users would like that. Is there a preference for hierarchical files?
I did not continue the practice of dividing the files into 100 pieces. The dta files can load in a couple of minutes. I do make extracts with the variables required for matching divided by birthplace and sex. Those are never very large and are available in ./mx. Is that ok?
It should be possible to greatly reduce the resource load for matching across decades and I would like to talk to users doing that. Some preliminary work is outlined here.
Jaro-Winkler distance programs don't seem to have uniform outcomes. The Feigenbaum -jarowinkler.ado- program applies the Winkler correction to all scores, while Winkler himself applies it only when the Jaro score is greater than .7. The Winkler adjustment parameter is .1; other authors use other values. The distance for null and one character strings varies across implementations. I can follow Feigenbaum, but seek input from all users.
Winkler also has an adjustment for often confused characters (such as "X" and "K") which is not often used. Other authors have nickname lists. I would like to collect such lists and offer them for more general use.
Some users have supplemented the data with additional variables. If you would like to allow others to use these new variables, I can add them to the common use files.
If you have a crosswalk across decades, I would be happy to post it here for other users. Records can be identified by histid or the combination of datanum, serial and pernum. I will standardize the variable names by adding a 4 digit year to each.
I have dropped 50,000 value labels which are simply ASCII presentations of numeric values, such as:
There are 196 "P" records in 1940 with no corresponding "H" record. This does not happen in any other year, and those records are omitted from the .dta file.
There are a number of records with blank histid.
What is the appropriate sort order? Matching by histid requires records sorted by histid, which is not the native order.
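
The Jaro-Winkler variations mentioned in the notes above can be made concrete with a short Python sketch. It is a textbook implementation written for this page and has not been checked against the Feigenbaum program or Winkler's own code. The `threshold` argument switches between the two behaviors described: `None` applies the prefix correction to every score, while `0.7` applies it only when the Jaro score exceeds .7. Returning 0 when either string is empty is an assumption, since implementations differ on that point.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if not s1 or not s2:
        return 0.0  # assumed convention for empty strings
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    j = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, threshold=None, max_prefix=4):
    """Jaro score with the Winkler common-prefix correction.
    threshold=None corrects every score; threshold=0.7 corrects
    only scores strictly greater than 0.7."""
    score = jaro(s1, s2)
    if threshold is not None and score <= threshold:
        return score
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1 - score)
```

For a pair such as "JOHN" and "JANE", whose Jaro score is below .7, the two settings give different answers, which is the kind of disagreement across programs noted above.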

Don't hesitate to contact me for computer issues, but I don't have that much domain-specific knowledge.

Daniel Feenberg
6 December 2022
617-863-0343