Ancestry.com and IPUMS Complete Count Restricted File.
Ancestry.com has sponsored the digitization of the available complete count census files and allowed IPUMS to offer all but the respondent names on its website to the general user population.
The 1790-1830 files are household-level and the 1840-1940 files are person-level. The 1890 census was lost in a fire, so nothing is available for that year.
Confidentiality Considerations
The name (namefrst and namelast) fields are available to
affiliated users at the NBER by special arrangement through
IPUMS. NBER affiliates wishing to use the IPUMS-RESTRICTED
(Ancestry.com) census files for a new project should sign
the
Once all approvals are in place you can be added to the Linux groups with permission to read the data. These files (or extracts) may be processed on our servers but should not be downloaded from them.
To add an investigator or RA to an existing project, send a signed agreement form marked with the project title and IPUMS number. The approval process is the same, and you will be notified when the new researcher can access the data.
Linking between the restricted and public versions is fine, but the data must be maintained/analyzed on the NBER server.
Please ensure that your extracts are not world-readable. It is important to respect the agreement to ensure continued access to this important resource, for you and your colleagues. I can create a shared directory for you and others working on the same project.
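As a minimal sketch of how one might check an extract and remove world access from it, the following Python uses only the standard library. The helper names and the temporary demonstration file are assumptions for illustration and are not part of any NBER tooling; from the shell, `chmod o-rwx file` has the same effect.

```python
import os
import stat
import tempfile

def is_world_accessible(path):
    """Return True if 'other' users have any read, write or execute permission."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IRWXO)

def remove_world_access(path):
    """Clear all permission bits for 'other'; owner and group bits are unchanged."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~stat.S_IRWXO)

# Demonstration on a throwaway file (a stand-in for a real extract).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name
os.chmod(demo_path, 0o644)             # owner rw, group r, world r
before = is_world_accessible(demo_path)
remove_world_access(demo_path)         # world bits cleared, leaving 0o640
after = is_world_accessible(demo_path)
final_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
os.remove(demo_path)
```

Group permissions are left alone on purpose, so that a shared project directory keeps working for the other members of the project group.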
On at least one occasion a researcher has been allowed to export data to a very secure server for matching to other confidential data. Details are here but this is not done for convenience.
Replication
If the data is anonymized and remains on the server, we can give the journal representative access for data replication. If the data is going to be removed from the server, or will not be anonymized and remains on the server, IPUMS will need to provide additional approval. In the one case so far, the researchers removed all individual names from their data and provided us a list of variables. IPUMS considered the data anonymized when there were no potential string variables. We can always run the list of variables by IPUMS if it would be helpful.
Citation
Publications and research reports based on the IPUMS USA database must cite it appropriately. The citation should include the following:
Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0
File locations
Starting with the June 2019 distributions we keep our copies of the files in
Archives
From 2019 on, each version of the data is kept available in /home/data/census-ipums/. There is no need to keep a personal copy.
Documentation
There is the beginning of an FAQ here. The IPUMS website covers all the publicly available variables. The additional variables in the restricted use file include:
namefrst
- 16 character first name (and possibly middle initial)
namelast
- 16 character last name
histid
- 36 character person id for matching across IPUMS versions (but not census decades)
street
- street address
File Structure
The original files are hierarchical, but we have created the .dta (etc.) files as rectangular person datasets. That is, the household record is appended to each person record. We also apply the scaling factors in the IPUMS-supplied code, which should make the data conform to the documentation. Value labels that are merely the ASCII expression of the numeric value are dropped. There are no other changes.
Resource considerations
These are very large files, as the record counts and file sizes (up to 150GB) show, but with some thought it is practical to work with only traditional econometric software. The .dta files are somewhat more reasonable.
There is advice for Stata users on dealing with very large files here, but see especially this general advice for large projects and these suggestions from James Feigenbaum specific to this data.
A new, fast and compact format is Parquet. It is column oriented, so if you load just a few variables only a fraction of the file need be read. Please see here for details; however, the Parquet support in Stata and R is so ineffective that we have not continued to supply Parquet versions after v2021.
Matching
The Census Linking Project at Princeton created a set of linked datasets between every pair of historical censuses using a variety of automated methods. There are considerable savings in time and resources in using a pre-made match. The code and documentation are also available on our system, while access to the data on our system is restricted to members of the "cens1930" group. For internal use the files are at:
- /home/data/census-ipums/linking_project
Please respect other users by being reasonably efficient with computational resources, especially memory. In particular when reading even one of these files into Stata you will want to subset on variables or rows, or both. There is a directory
Showload will show you the available memory on all the machines, and "top -o RES" will show how your job is doing on the current machine. Computer time is cheap, but waiting for the computer is not. You may run multiple jobs, but resist the urge to use more than half the available CPU or memory on any one machine. If you ask for more memory than the computer has free, your job will run so slowly that it may never finish. It is always a good idea to keep track of the progress of large, long jobs.
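As a rough sketch of the "no more than half" rule, the following Python parses text in the format of the Linux /proc/meminfo file and checks a job's estimated memory need against half of the memory currently available. Reading "available" as the MemAvailable field is my assumption, as are the function names and the sample figures; the estimate of what a job needs has to come from you.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: value in KiB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def fits_in_half(needed_gb, meminfo):
    """True if the job needs no more than half of the currently available memory."""
    available_gb = meminfo["MemAvailable"] / (1024 ** 2)   # KiB -> GiB
    return needed_gb <= available_gb / 2

# Made-up sample text; on a real machine, read the file instead:
#   with open("/proc/meminfo") as f: info = parse_meminfo(f.read())
SAMPLE = """MemTotal:       263856532 kB
MemFree:         8123456 kB
MemAvailable:   209715200 kB
"""
info = parse_meminfo(SAMPLE)
small_job_ok = fits_in_half(80, info)    # 80 GB against 200 GiB available
large_job_ok = fits_in_half(150, info)   # 150 GB against 200 GiB available
```

This only looks at one moment in time. Other users' jobs may grow after yours starts, so it is a courtesy check and not a guarantee.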
Note that all our disk storage is compressed in the filesystem. Zip or Z compression will not reduce the actual resources used, and will add complexity and time to your analysis.
Notes and questions for discussion
- Are any more of the uscenNNNN_NNNN (numbered) variables useful? Including all of them would multiply the load times. At this time only these have been added, and only in the decades noted:
  rawmcd   Minor Civil Division  us1900m_0045
  rawmcd   Minor Civil Division  us1910m_0053
  rawhnum  House number          us1910m_0056
  rawmcd   Minor Civil Division  us1940b_0074
  rawhnum  House number          us1930d_0061
  rawhnum  House number          us1940b_0028
- Online documentation from IPUMS suggests using datanum, serial and pernum for identifying individual records, but common practice among NBER users is to use histid. Datanum is absent in 1860-70. An advantage of histid is that it is mostly maintained across IPUMS versions; however, IPUMS had some early hiccups with this plan, and 1880 and 1940 do have different histids across versions.
- The name fields sometimes contain what appear to me to be stray nonsense characters such as quote marks, dollar signs, brackets, etc. The quote marks especially can discomfort Stata. Would it be better to drop these?
- The programs from IPUMS would create hierarchical files. I didn't think most users would like that. Is there a preference for hierarchical files?
- I did not continue the practice of dividing the files into 100 pieces. The dta files can load in a couple of minutes. I do make extracts with the variables required for matching divided by birthplace and sex. Those are never very large and are available in ./mx. Is that ok?
- It should be possible to greatly reduce the resource load for matching across decades and I would like to talk to users doing that. Some preliminary work is outlined here.
- Jaro-Winkler distance programs don't seem to have uniform outcomes. The Feigenbaum -jarowinkler.ado- program applies the Winkler correction to all scores, while Winkler himself applies it only when the Jaro score is greater than .7. The Winkler adjustment parameter here is .1; other authors use other values. The distance for null and one character strings varies across implementations. I can follow Feigenbaum, but seek input from all users.
- Winkler also has an adjustment for often confused characters (such as "X" and "K") which is not often used. Other authors have nickname lists. I would like to collect such lists and offer them for more general use.
- Some users have supplemented the data with additional variables. If you would like to allow others to use these new variables, I can add them to the common use files.
- If you have a crosswalk across decades, I would be happy to post it here for other users. Records can be identified by histid or the combination of datanum, serial and pernum. I will standardize the variable names by adding a 4 digit year to each.
- I have dropped 50,000 value labels which are simply ASCII presentations of numeric values, such as:
  label define value x 7 `7'
- There are 196 "P" records in 1940 with no corresponding "H" record. This does not happen in any other year, and those records are omitted from the .dta file.
- There are a number of records with blank histid.
- What is the appropriate sort order? Matching by histid requires records sorted by histid, which is not the native order.
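To make the Jaro-Winkler question in the notes above concrete, here is a minimal Python sketch in which the two conventions differ by a single parameter. It is not a port of the -jarowinkler.ado- program or of Winkler's own code. The names boost_threshold and max_prefix, the 4-character prefix cap, and the choice to score an empty string as 0 are assumptions made for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]. An empty string scores 0.0 here (one convention)."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters of both strings in order; each pair that
    # disagrees is out of order, and two such characters make one transposition.
    out_of_order = 0
    j = 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, boost_threshold=0.0, max_prefix=4):
    """Apply the Winkler prefix adjustment only when the Jaro score exceeds
    boost_threshold: 0.0 adjusts every nonzero score, 0.7 adjusts only
    scores greater than .7."""
    base = jaro(s1, s2)
    if base <= boost_threshold:
        return base
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)

always = jaro_winkler("JOHN", "JAMES")                           # adjusted
thresholded = jaro_winkler("JOHN", "JAMES", boost_threshold=0.7)  # left as plain Jaro
```

For a weak match such as JOHN and JAMES, which share only a first letter, the two calls give different scores. For close pairs well above .7 they agree, so the choice of convention matters mostly near whatever cutoff a matching procedure uses.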
Don't hesitate to contact me for computer issues, but I don't have that much domain-specific knowledge.
Daniel Feenberg
6 December 2022
617-863-0343