Tax Model Data at the NBER

In accordance with a data agreement, all of the data described below is available for use solely on computers resident at the NBER. See Dan Feenberg for an account, and access to the data directories.

The NBER has a complete collection of all the public use Tax Model files created by the Statistics of Income Division of the IRS. There are cross section files from 1960 to 2012 (except 1961, 1963, and 1965) and a panel file from 1979 to 1990. Each file includes about 200 variables, with some censoring of sensitive information. The cross section files are stratified random samples with oversampling of high income households. A weight variable is included, and population estimates should take the weight into account. In Stata it would be an -aweight-.
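To see why the weight matters, here is a minimal sketch in Python (chosen only so it runs anywhere, since the files themselves stay on NBER machines). The records, dollar amounts, and weights are made up for illustration and are not drawn from any Tax Model file.

```python
# Synthetic (agi, weight) pairs standing in for a stratified sample.
# High-income returns are oversampled, so each one carries a small weight.
records = [
    (20_000.0, 1000.0),
    (40_000.0, 1000.0),
    (60_000.0, 1000.0),
    (1_000_000.0, 10.0),
    (2_000_000.0, 10.0),
]


def weighted_total(rows):
    """Population total: each sampled value counts 'weight' times."""
    return sum(value * weight for value, weight in rows)


def weighted_mean(rows):
    """Population mean: weighted total divided by the sum of weights."""
    return weighted_total(rows) / sum(weight for _, weight in rows)


unweighted_mean = sum(value for value, _ in records) / len(records)
population_mean = weighted_mean(records)
population_returns = sum(weight for _, weight in records)

print(f"unweighted sample mean: {unweighted_mean:,.0f}")
print(f"weighted population mean: {population_mean:,.0f}")
print(f"estimated number of returns: {population_returns:,.0f}")
```

Ignoring the weight lets the oversampled high-income returns dominate, so the unweighted mean lands far above the population estimate.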

The panel files are random samples based on the last four digits of the SSN and are unweighted. Two to five endings are selected each year. New taxpayers enter the file if they have one of the selected four-digit SSN endings, and leave only if they fail to file. If they resume filing, they are included again. Note that the selection is on the primary taxpayer's SSN, so women leave the panel on marriage and return on divorce or widowhood.

Also note that the number of selected endings is larger in some years, so a taxpayer from a year with 5 endings can be missing from a year with only 2 endings, but could reappear in a later year when more endings are selected.
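The selection rule can be sketched in a few lines of Python. The endings, years, and SSNs below are hypothetical and chosen only to show the mechanics; they are not the endings SOI actually used.

```python
def in_panel(primary_ssn: str, endings: set) -> bool:
    """True if the primary taxpayer's SSN ends in a selected 4-digit ending."""
    return primary_ssn[-4:] in endings


# Hypothetical selected endings: five in 1980 and 1982, only two in 1981.
endings_by_year = {
    1980: {"0001", "0002", "0003", "0004", "0005"},
    1981: {"0001", "0002"},
    1982: {"0001", "0002", "0003", "0004", "0005"},
}

taxpayer = "000-00-0004"  # made-up SSN whose ending is dropped in 1981
presence = {
    year: in_panel(taxpayer, endings)
    for year, endings in endings_by_year.items()
}
print(presence)

# If this taxpayer later files a joint return on which a spouse with a
# non-selected ending is the primary taxpayer, the return is not selected.
joint_return_primary = "000-00-9999"  # made-up SSN of the primary taxpayer
print(in_panel(joint_return_primary, endings_by_year[1982]))
```

A gap in a taxpayer's record can therefore reflect the sampling design rather than a failure to file.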

A brief review of the data, the TAXSIM program for calculating tax liabilities and some of the applications made of it at the NBER is included at http://www.nber.org/taxsim/feenberg-coutts.pdf

All files are available as flat files in SAS and Stata format. The fully processed files are also available as ASCII text. In each data directory, file names follow a common structure: an initial letter ("x" for the full cross section, "s" for a two percent subset of the full cross section, or "p" for the panel file) is followed by a four digit year and an extension showing the file type. For example, x1960.dta is the full cross section for 1960 in Stata format.
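The naming rule is simple enough to build or parse mechanically. The Python helpers below are a hypothetical sketch (the function names are mine), and only the "dta" extension is taken from the example above.

```python
# Prefix letters as described in the text.
PREFIXES = {
    "x": "full cross section",
    "s": "two percent subset",
    "p": "panel",
}


def soi_file_name(kind: str, year: int, ext: str) -> str:
    """Build a name such as 'x1960.dta' from its three parts."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown file kind: {kind!r}")
    if not 1000 <= year <= 9999:
        raise ValueError("year must have four digits")
    return f"{kind}{year}.{ext}"


def parse_soi_file_name(name: str):
    """Split a name such as 'x1960.dta' into (description, year, extension)."""
    stem, _, ext = name.partition(".")
    kind, year = stem[:1], stem[1:]
    if kind not in PREFIXES or len(year) != 4 or not year.isdigit():
        raise ValueError(f"not a recognised file name: {name!r}")
    return PREFIXES[kind], int(year), ext


print(soi_file_name("x", 1960, "dta"))
print(parse_soi_file_name("s1999.dta"))
```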

There are no missing values in these files. If someone doesn't work, their wages are logically zero, and so forth for all the other variables. There is no ambiguity.

/homes/data/soi/raw/

These are the original Tax Model files from the Statistics of Income Division of the IRS. Many are in packed or zoned decimal and variables wander across the record layout from year to year so you will probably find it easier to use one of the more processed formats described below. Full documentation for these files is at http://www.nber.org/taxsim/gdb and is of interest even if you are using processed files, as it includes sample tax forms showing the source of every data item, or describing the calculation reported, as well as documenting what censoring has taken place.

/homes/data/soi/e/

These are SAS format (sas7bdat) files of the original Tax Model files, with the variable names assigned by SOI (E-codes, mostly). They start in 1998 because that is the first year that the E-codes were made available.

/homes/data/soi/sas/
/homes/data/soi/dta/

These are the raw files converted to SAS or Stata format, and with semi-consistent variable names. A concordance of names, descriptions and source locations for all variables is linked from the main TAXSIM web page. With these documents, you can quickly discover what is available for any given year. NBER has assigned names because the SOI considered the variable names confidential until recent files.

The names are only semi-consistent because while some variables such as adjusted gross income keep their name in all years, variables such as long term gains change their basic meaning as the share of gains included in AGI changes. In most years SOI reports only the included amount and the variable name changes as the inclusion fraction changes.

/homes/data/soi/taxsim/sas/
/homes/data/soi/taxsim/dta/
/homes/data/soi/taxsim/txt/

These files provide a highly consistent naming (actually numbering) convention through time for a subset of the original variables. Wages are always "data11". Full long term gains are calculated by dividing the SOI supplied amount by the inclusion fraction (1.0, 0.5, or 0.4) and stored as "data70". Similar calculations are done for various deductions subject to a floor, etc. We have an index showing all the names, and the years for which each item is available.
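The gains adjustment amounts to one division. The Python sketch below illustrates the arithmetic only; the function name is mine, the dollar amount is made up, and the inclusion fraction is passed in explicitly rather than looked up by year.

```python
def full_long_term_gains(included_amount: float, inclusion_fraction: float) -> float:
    """Recover full long term gains from the amount included in AGI.

    inclusion_fraction is the share of gains included in AGI for the
    year in question (1.0, 0.5, or 0.4 in the text above).
    """
    if not 0.0 < inclusion_fraction <= 1.0:
        raise ValueError("inclusion fraction must be in (0, 1]")
    return included_amount / inclusion_fraction


# A made-up return reporting 4,000 of included gains.
for fraction in (1.0, 0.5, 0.4):
    full = full_long_term_gains(4_000.0, fraction)
    print(f"inclusion fraction {fraction}: full gains {full:,.0f}")
```

Storing the full amount is what lets one year's data be run through another year's law, since the calculator can apply whatever inclusion fraction that law calls for.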

Adding 10% of AGI to deductible medical expenses (for taxpayers with deductible medical expenses) and calling the result "gross medical expenses" simplifies using data from one year with a tax calculator for another year and allows one to answer questions such as "What is the effect of raising the floor on medical deductions?" But no new data is created, so it doesn't provide any answer (even a simulated one) to the question "What is the effect of reducing the floor?", since taxpayers with expenses below the floor will be left at zero. Casualty losses are subject to the same treatment in those years when only the casualty deduction is provided by the SOI.
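A small worked example makes the asymmetry concrete. This Python sketch uses two made-up taxpayers and hypothetical alternative floors of 12% and 5%; the function names are mine and it is not the TAXSIM code.

```python
def gross_medical(deductible: float, agi: float, floor_rate: float = 0.10) -> float:
    """Add the floor back for taxpayers reporting a deduction; others stay at zero."""
    if deductible <= 0.0:
        return 0.0
    return deductible + floor_rate * agi


def deductible_medical(gross: float, agi: float, floor_rate: float) -> float:
    """Deduction allowed under a given floor, never negative."""
    return max(0.0, gross - floor_rate * agi)


AGI = 50_000.0
# Taxpayer A truly spent 7,000, so the return shows 2,000 deductible.
# Taxpayer B truly spent 4,000, below the 5,000 floor, so the return shows 0.
gross_a = gross_medical(2_000.0, AGI)  # floor added back
gross_b = gross_medical(0.0, AGI)      # nothing observed, stays at zero

# Raising the floor to 12% is simulated correctly for both taxpayers.
raised_a = deductible_medical(gross_a, AGI, 0.12)
raised_b = deductible_medical(gross_b, AGI, 0.12)

# Lowering the floor to 5% works for A but not for B.
lowered_a = deductible_medical(gross_a, AGI, 0.05)
lowered_b = deductible_medical(gross_b, AGI, 0.05)
true_lowered_b = deductible_medical(4_000.0, AGI, 0.05)

print(raised_a, raised_b)
print(lowered_a, lowered_b, true_lowered_b)
```

Taxpayer B would be entitled to a deduction under the lower floor, but the file has no record of the underlying expenses, so the simulation leaves B at zero.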

When using this file, keep in mind that very little is imputed without a firm source, and items are zero if not available on the tax return that year. A complete list of variables is available here.

At present the only statistically imputed variables are the state of residence for taxpayers with AGI over $200,000 and the split of husbands' and wives' wages in data85 and data86.

TAXSIM

These highly consistent files may be used in Stata or FORTRAN to calculate tax liabilities, marginal tax rates, and many intermediate tax calculations with the NBER TAXSIM model. More information about the Stata interface is available with the Stata help command:

    help taxpuf27

Because we are not allowed to distribute the PUF outside of 1050, taxpuf27 is not available elsewhere. Taxpuf is easy to use. For example:

    . use /homes/data/soi/taxsim/dta/s1999
    . taxpuf27
    . merge 1:1 data100 using taxpuf_out
    . table state [aw=data1], c(mean ftax mean stax)

reads in the subset version of the 1999 file and calculates federal and state tax liabilities by state. Note that the variable names for the input are documented here (as noted above), or you could use the "describe" command for a listing. Also note that state of residence is only available for years 1977-2008, and even in those years is an imputation for taxpayers with AGI over $200,000.

Sometimes it is easier to just use FORTRAN. In that case you should see me and I can get you started. FORTRAN is very easy, and you won't have a lot of trouble going forward once I get you on the right track with it. Calculations are much faster without the Stata interface. There is a demo program at /homes/nber/feenberg/taxsim-demo.

TAXSIM on the Web

You can submit a 35-variable characterization of an individual taxpayer, or a file of such taxpayers, to our website at http://www.nber.org/taxsim/taxsim35 and get back a detailed calculation of tax liabilities. This facility is intended to encourage users of survey datasets such as the CPS or SIPP to use after-tax prices in their work. If you have the PUF, you will want to use all the data rather than cut the PUF down to 35 variables.

    net cd http://www.nber.org/stata
    net describe taxsim35

TAXCALC

An alternative to TAXSIM is the newer TAXCALC program. TAXCALC differs from TAXSIM in that it is written in SAS or Stata, and operates directly on the SOI distributed PUF or on the internal, confidential file. It does not include any state tax calculation, and covers only 1993-2015. More information at:

http://www.nber.org/taxcalc

Modeling Tax Proposals

Changes to the tax system naturally divide into two categories: some can be modeled as changes to the data, and others cannot. For example, a secondary earner's deduction can be modeled as a change in wages (data11 in the model). A change in the top bracket rate requires access to the model internals and would have to be discussed with me and implemented here. In many cases we would be glad to make these modifications, or help you make them.
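As a rough illustration of the first category, the Python sketch below models a secondary earner's deduction by reducing data11 before the record goes to the tax calculator. The rate, cap, and record values are hypothetical parameters for illustration, and the function is not part of TAXSIM.

```python
def apply_secondary_earner_deduction(record: dict, rate: float, cap: float) -> dict:
    """Return a copy of the record with wages (data11) reduced by the deduction.

    The deduction is 'rate' times the lesser spouse's wages, with those
    wages counted only up to 'cap'. Both parameters are hypothetical.
    """
    lesser_wages = min(record["data85"], record["data86"])
    deduction = rate * min(lesser_wages, cap)
    modified = dict(record)  # leave the original record untouched
    modified["data11"] = record["data11"] - deduction
    return modified


# A made-up joint return: total wages 80,000, split 50,000 / 30,000.
original = {"data11": 80_000.0, "data85": 50_000.0, "data86": 30_000.0}
modified = apply_secondary_earner_deduction(original, rate=0.10, cap=30_000.0)

print(original["data11"], modified["data11"])
```

Running the calculator on both versions of the record and differencing the liabilities gives the simulated effect of the proposal, with no change to the model itself.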

If you are using these files, I would like to meet with you.

Daniel Feenberg
feenberg@nber.org
617-863-0343