Stata for very large datasets

The analysis of very large files, such as Medicare claims, has long been considered the preserve of SAS, because SAS could handle datasets of any size while Stata was limited to datasets that would fit in core. In many cases a preliminary extraction has been done in SAS, followed by analysis of a smaller subset in Stata. In this note we offer suggestions for doing the extraction in Stata, eliminating the SAS step.

Efficient sample selection

With some thought, and some lesser-known options to the -merge- command, it should be possible to dispense with the SAS step for nearly all work, as long as the final analysis dataset will fit in core. With memory allocations of up to 40 GB now readily available, that would cover 10 million observations on 1,000 float variables (4 bytes each) in the analysis subsample, with no limit on the size of the full file from which the sample is drawn.

In particular, there is no need for the sample selection and analysis variables to be in core at the same time.

Row and column selection with -merge-

In order to minimize the amount of memory used, we need to separate the sample selection decision from the variable inclusion decision. We assume that the initial ("full") .dta file will not fit in core, but has been created by Stat/Transfer, created on a larger computer, or is held as a fixed-format ASCII file. We also assume that a record id is present and that the data are sorted on that value. To get such a record id we can record the value of _n when the full file is created. The procedure is as follows:

  1. List only the variables needed for sample selection in the -use- command, together with the -using- keyword. Use the -if- qualifier to subset records, to the extent that the selection does not involve cross-observation restrictions.

    use var1 var2... if in_sample_condition using filename

  2. If the data are read via a Stata dictionary, list only the variables necessary for sample selection in the dictionary, and use the -if- qualifier with the -infile- command.
  3. Apply any cross-observation qualifications to identify the subsample required for analysis.
  4. Keep only the record id, and only for records in the subsample.
  5. Do a 1:1 merge back to the full dataset, with the -keep- and -keepusing- options to subset by record and variable at the same time.

    sort recno
    merge 1:1 recno using full,nogenerate keep(master match) sorted keepusing(varlist...)

  6. Save the analysis dataset.

See the benchmarking script below for a complete example. Note that if we use 10% of the variables for sample selection and select 5% of the records, then the first step uses memory equal to only 0.5% of the original file, unless there are cross-observation restrictions. If we then need 10% of the variables for the analysis, that step uses the same amount of memory.

We have done some initial experiments and found that, with 100 variables in the full dataset, 10 variables used for sample selection, and 10 different variables in the analysis subsample, our Linux PC can convert the full file to the analysis file at a rate of about 1 million full-sample records per second. Doubling the number of variables used in both steps adds about 10% to that time. Nearly all of the time is spent in the merge, which appears to be inefficient compared to the other Stata steps. There is no noticeable improvement with 4-core Stata/MP.

Merging files

If data files are provided on disk in sort order by record id, it is not necessary to sort them before merging, provided the -sorted- option is specified on the merge statement. Actually, the master (in-core) dataset will still need to be sorted before the merge, but this is very fast for an already sorted dataset.

Reading a compressed .dta file

The Stata knowledge base includes a note on reading ASCII data from a pipe, which allows one to process a file without storing the decompressed version on disk. You may have to unset the "noclobber" variable depending upon the shell you are using.
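As an illustration of that approach on a Unix-like system, the sketch below decompresses into a named pipe in the background and points -infile- at the pipe. The file names and variable list are placeholders rather than part of the knowledge base note, and zcat is assumed to be available.

    !mkfifo /tmp/claims.pipe
    !zcat claims.raw.gz > /tmp/claims.pipe &
    * csh/tcsh users may need to "unset noclobber" before the redirection above
    infile recno x1 x2 x3 using /tmp/claims.pipe, clear
    !rm /tmp/claims.pipe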

Reduced Nuisance Variables

-fvvarlist-

A new feature of Stata is the factor variable list. See -help fvvarlist- for more information, but briefly, it allows Stata to create dummy variables and interactions for each observation just as the estimation command calls for them, without ever saving the dummy variables themselves. This makes possible such constructs as interacting a hospital dummy with a time trend without using any memory to store the interaction itself. (You would still need memory for the cross-product matrix.)

For example,

    . reg dep i.state c.year#i.state

regresses dep on 50 state dummies and a time trend interacted with the state dummies, but does not use any memory to store the interactions (assuming year and state are what you would expect from the names).

-xtreg- and relations

There are a large number of regression procedures in Stata that avoid calculating nuisance parameters entirely, a potentially large saving in both space and time. Where an analysis bumps against the 9,000-variable limit in Stata/SE, they are essential. These are documented in the panel-data volume of the Stata manual set, or you can use the -help- command for xtreg, xtgee, xtgls, xtivreg, xtmixed, xtregar, or areg. Additional panel-analysis commands are available from the SSC archive.
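For example, a hospital fixed-effects regression can be estimated without ever creating the hospital dummies. The variable names below (cost, x1, x2, hospid) are placeholders for illustration:

    xtset hospid
    xtreg cost x1 x2, fe               // within estimator; no hospital dummies are stored
    areg cost x1 x2, absorb(hospid)    // same point estimates for x1 and x2, absorbing hospid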

Mundlak Procedure

From John.Antonakis@unil.ch Thu Apr 19 10:58:43 2012
Date: Thu, 19 Apr 2012 16:57:27 +0200
From: John Antonakis 
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Regression with about 5000 (dummy) variables

Hi:

Let me let you in on a trick that is relatively unknown.

One way around the problem of a huge amount of dummy variables is to use 
the Mundlak procedure:

Mundlak, Y. (1978). Pooling of Time-Series and Cross-Section Data. 
Econometrica, 46(1), 69-85.

....for an intuitive explanation, see:

Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On 
making causal claims: A review and recommendations. The Leadership 
Quarterly, 21(6). 1086-1120. 
http://www.hec.unil.ch/jantonakis/Causal_Claims.pdf

Basically, for each time varying independent variable (x1-x4), take the 
cluster mean and include that in the regression.  That is, do:

foreach var of varlist x1-x4 {
bys panelvar: egen cl_`var'=mean(`var')
}

Then, run your regression like this:

xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)

The Hausman test for fixed- versus random-effects is:

testparm cl_x1-cl_x4

This will save you on degrees of freedom and computational requirements. 
This estimator is consistent.  Try it out with a subsample of your 
dataset to see. Many econometricians have been amazed by this.

HTH,
J.
__________________________________________
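For readers who want to try the quoted procedure end to end, here is a small self-contained sketch on simulated data; the variable names (firm, t, y, x1, x2) are invented for the illustration and are not part of the message above.

    clear
    set seed 12345
    set obs 500
    gen firm = ceil(_n/5)                 // 100 panels of 5 observations each
    bysort firm: gen t = _n
    gen u = rnormal()                     // panel effect
    bysort firm (t): replace u = u[1]     // constant within firm
    gen x1 = rnormal() + 0.5*u            // regressor correlated with the panel effect
    gen x2 = rnormal()
    gen y = 1 + x1 + 0.5*x2 + u + rnormal()
    foreach var of varlist x1 x2 {
        bysort firm: egen cl_`var' = mean(`var')
    }
    xtset firm t
    xtreg y x1 x2 cl_x1 cl_x2, cluster(firm)
    testparm cl_x1 cl_x2                  // Mundlak version of the Hausman test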

Here is the benchmark test program for the subsampling merge process.

    log using medibench,replace text
    set more off
    * First we create the "full" dataset (not timed, but slow)
    cd /tmp
    set memory 1000m
    set obs 1000000
    gen double recno=_n
    forvalues i = 1(1)100 {
        quietly gen x`i' = mod(_n,1000)+`i'
    }
    sort recno /*See note below*/
    save full, replace
    clear
    timer on 1
    *Read in the sample selection variable(s)
    *Here we take a 5% sample of observations
    use recno x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 if mod(x1,20)==0 using full
    *Save the sample
    keep recno
    save sample,replace
    clear
    *Merge sample ids and analysis variables.
    *Here we have 10 analysis variables.
    use sample
    merge 1:1 recno using full,nogenerate keep(master match) sorted ///
        keepusing(x91 x92 x93 x94 x95 x96 x97 x98 x99 x100)
    *Save analysis file - now 5% of records and 10% of variables.
    save analysis,replace
    *Reload analysis file
    clear
    use analysis
    timer off 1
    timer list
    log close
    exit,clear

Note that the master dataset must be sorted before the merge, even though it is already in sort order, and -sorted- is specified as an option to the merge command. This is apparently a bug in the implementation of the option, but sorting a dataset already in sort order is done in linear time and is not a problem.


last revised 17 August 2011 by feenberg@nber.org