State Mortality Data, 1900-1936

Files containing the unbalanced panel of annual state-level mortality in registration states by cause and by age and sex for years 1900-1936:

Each state-by-year observation contains variables with self-explanatory names for total deaths, deaths by cause, deaths by age, deaths by age for males, and deaths by age for females in all death registration area states, 1900-1936. I have used conservative assumptions to standardize a few causes and age groups across years, and have not included causes that are inconsistent across years; these are present in the raw Excel data, which I'll provide as well.

Here is a list of anomalies of which I am aware and that appear to be present in the historical mortality volumes (rather than being due to data entry errors):

  • 1902: total deaths summed by cause and total deaths summed by age and sex for all reporting states do not agree
  • 1912 and 1913: deaths by state and age are not disaggregated by sex in the historical volumes
  • 1916: total deaths and the sum of age-specific deaths for females in Ohio do not agree
  • 1919: total deaths and the sum of age-specific deaths for males in Kentucky do not agree
  • 1920: total deaths and the sum of age-specific deaths for males in California and for females in Indiana do not agree; total deaths and the sums by age and sex for Kentucky do not agree
  • 1924 through 1936: deaths by age are aggregated for ages 1-4 (for earlier years they are provided by single years of age in this interval)
  • 1935: total deaths by cause and total deaths by age for Massachusetts do not agree

One nice feature of this data is that totals provided in the historical volumes can be compared with sums by cause and by age to detect data entry errors (the data was also double-entered by Digital Divide Data). I've double-checked all inconsistencies caught by these comparisons - the ones listed here appear to be present in the printed historical mortality statistics volumes.
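For anyone who wants to repeat this kind of check, here is a minimal sketch in Python. It is not the procedure I actually ran. The key names (state, year, total_deaths, and the cause_ prefix) and the sample numbers are made up for illustration and are not the variable names or values in the dataset. It also assumes the component columns are exhaustive, which holds for the raw data but not necessarily for the standardized file, where inconsistent causes were left out.

```python
def find_inconsistencies(rows, total_key="total_deaths", component_prefix="cause_"):
    """Return (state, year, reported_total, component_sum) for every row
    whose component columns do not sum to the reported total.

    All key names are hypothetical placeholders, not the dataset's real names.
    """
    mismatches = []
    for row in rows:
        # Collect every non-missing component column, e.g. cause_x, cause_y.
        parts = [
            value
            for key, value in row.items()
            if key.startswith(component_prefix) and value is not None
        ]
        component_sum = sum(parts)
        if component_sum != row[total_key]:
            mismatches.append((row["state"], row["year"], row[total_key], component_sum))
    return mismatches


# Made-up numbers for illustration only; not taken from the mortality volumes.
sample = [
    {"state": "State A", "year": 1901, "total_deaths": 100, "cause_x": 60, "cause_y": 40},
    {"state": "State B", "year": 1901, "total_deaths": 100, "cause_x": 60, "cause_y": 39},
]

for state, year, reported, summed in find_inconsistencies(sample):
    print(f"{state} {year}: reported total {reported}, sum of components {summed}")
```

The same function can be pointed at the age columns by passing a different prefix. A row flagged this way is either a data entry error or an inconsistency in the printed volume itself, so each one still has to be checked against the source by hand.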

I'll need to do a bit more work to document everything - the main thing that is not transparent is how I created a few variables. My objective was to make the Stata dataset consistent across years, so under conservative assumptions, I combined a few categories of deaths that are reported differently in different years. For example, in some years, "cancer" and "tumor" deaths are reported separately, while in other years they are reported together as "cancer and tumors." So I created a single variable throughout called "cancer and tumors." The variables that required a little manipulation and a few reasonable assumptions are:

As I mentioned, I'd like to encourage people to use this data, so please feel free to share it with whoever might be interested. In particular, I'd very much like to know if additional errors are found. Once it's up on the Berkeley demography web site, I'll let you know.

Grant Miller
ngmiller at stanford edu
23 March 2006