The analysis of very large files, such as Medicare claims, has long been considered the preserve of SAS, because SAS could handle datasets of any size, while Stata was limited to datasets that would fit in core. In many cases a preliminary extraction has been done in SAS, followed by analysis of a smaller subset in Stata. In this note we offer suggestions for doing the extraction in Stata, eliminating the SAS step.

With some thought, and some lesser-known options to the -merge- command, it should be possible to dispense with the SAS step for nearly all work, as long as the final analysis dataset will fit in core. With memory allocations of up to 40 GB now readily available, that would cover 10 million observations on 1,000 float variables (10 million x 1,000 x 4 bytes = 40 GB) in the analysis subsample, with no limit on the size of the full file from which the sample is drawn.

In particular, there is no need for the sample selection and analysis variables to be in core at the same time.

In order to minimize the amount of memory used, we need to separate the sample selection decision from the variable inclusion decision. We assume that the initial ("full") .dta file will not fit in core, but has been created by Stat/Transfer, created on a larger computer, or is available in a fixed format. We also assume a record id is present and that the data are sorted on that variable. To get such a record id we can record the value of _n when the full file is created. The procedure is as follows:
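Creating the record id is a one-time step at file-creation time, done on whatever machine is able to build the full file. A minimal sketch, with an illustrative filename:

```stata
* Add a permanent record id equal to the original row number.
* A long holds row numbers exactly up to about 2.1 billion;
* use a double beyond that.
use fullfile, clear
generate long recno = _n
sort recno
save fullfile, replace
```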

- List only the variables needed for sample selection in the -use- command with the -using- keyword, and use the -if- qualifier to subset records, to the extent the selection does not involve cross-observation restrictions:

      use var1 var2 ... if in_sample_condition using filename

- If the data are read via a Stata dictionary, list only the variables necessary for sample selection in the dictionary, and use the -if- qualifier on the -infile- command.
- Apply any cross-observation qualifications to identify the subsample required for analysis.
- Keep only the record id, and only for records in the subsample.
- Do a 1:1 merge back to the full dataset, with the -keep- and -keepusing- options to subset by record and variable at the same time:

      sort recno
      merge 1:1 recno using full, nogenerate keep(master match) sorted keepusing(varlist ...)

- Save the analysis dataset.
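Putting the steps together, a condensed extraction do-file might look like the following. The variable names and selection rule here are purely illustrative; the benchmarking script below gives a complete, timed example.

```stata
* Step 1: read only the id and selection variables, subsetting
* records on the fly with -if-.
use recno patid admtype if admtype == 1 using full, clear

* Step 2: apply a cross-observation restriction (here, keep only
* patients with at least two qualifying admissions).
bysort patid: generate byte nadm = _N
keep if nadm >= 2

* Step 3: keep only the record id for the selected records.
keep recno

* Step 4: merge back to the full file, pulling in only the
* analysis variables for the selected records.
sort recno
merge 1:1 recno using full, nogenerate keep(master match) sorted ///
    keepusing(patid cost los)

* Step 5: save the analysis dataset.
save analysis, replace
```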

See the benchmarking script below for a complete example. Note that if we use 10% of the variables in the sample selection, and select 5% of the records, then the first step uses only 0.5% as much memory as the full file would, unless there are cross-observation restrictions. If we then need 10% of the variables for the analysis, the merge step uses a similar amount of memory.

We have done some initial experiments, and found that with 100 variables in the full dataset, 10 variables in sample selection, and 10 different variables in the analysis subsample, our Linux PC can convert the full file to the analysis file at a rate of about 1 million full-sample records per second. Doubling the number of variables used in both steps adds about 10% to that time. Nearly all the time is spent in the merge, which appears to be inefficient compared to the other Stata steps. There is no noticeable improvement with four-core Stata/MP.

If datafiles are provided on disk in sort order by record id, it is not necessary to sort them before merging, provided the -sorted- option is specified on the merge statement. (The master, in-core, dataset does still need to be sorted before the merge, but this is very fast for a dataset already in sort order.)

The Stata knowledge base includes a note on reading ASCII data from a pipe, which allows one to process a file without storing the decompressed version on disk. You may have to unset the "noclobber" variable depending upon the shell you are using.
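On Unix, the shell steps can even be issued from inside Stata. A sketch, with illustrative filenames and an extract.dct dictionary assumed to exist:

```stata
* Decompress into a named pipe so the uncompressed file never
* touches the disk (Unix only; filenames are illustrative).
!mkfifo claims.pipe
!zcat claims.txt.gz > claims.pipe &
infile using extract.dct, using(claims.pipe) clear
!rm claims.pipe
```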

A new feature of Stata is the factor variable list. See -help fvvarlist- for more information, but briefly, it allows Stata to create dummy variables and interactions for each observation just as the estimation command calls for that observation, and without saving the dummy value. This makes possible such constructs as interacting a hospital dummy with a time trend without using any memory to store the interaction itself. (You would still need memory for the cross-product matrix).
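For instance, a hospital-by-trend interaction can be specified directly in the estimation command; the variable names here are illustrative:

```stata
* i.hospid##c.year expands to hospital dummies, a linear trend, and
* hospital-specific trends, all computed on the fly during estimation
* rather than stored as variables in the dataset.
regress cost i.hospid##c.year
```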

There are a large number of regression procedures in Stata that avoid calculating nuisance parameters entirely, a potentially large saving in both space and time. Where an analysis bumps against the 9,000-variable limit in Stata/SE, they are essential. These are documented in the panel data volume of the Stata manual set, or you can use the -help- command for xtreg, xtgee, xtgls, xtivreg, xtmixed, xtregar or areg. There are additional panel analysis commands available from SSC.
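As a sketch, a regression with hospital fixed effects that never creates the hospital dummies (variable names are illustrative):

```stata
* -areg- absorbs the hospital fixed effects rather than estimating
* thousands of dummy coefficients; xtreg, fe gives equivalent
* coefficients on the non-absorbed regressors.
areg cost los age, absorb(hospid)
```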

From John.Antonakis@unil.ch Thu Apr 19 10:58:43 2012
Date: Thu, 19 Apr 2012 16:57:27 +0200
From: John Antonakis
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Regression with about 5000 (dummy) variables

Hi:

Let me let you in on a trick that is relatively unknown. One way around the problem of a huge amount of dummy variables is to use the Mundlak procedure:

Mundlak, Y. (1978). Pooling of time-series and cross-section data. Econometrica, 46(1), 69-85.

For an intuitive explanation, see:

Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086-1120. http://www.hec.unil.ch/jantonakis/Causal_Claims.pdf

Basically, for each time-varying independent variable (x1-x4), take the cluster mean and include that in the regression. That is, do:

      foreach var of varlist x1-x4 {
          bys panelvar: egen cl_`var' = mean(`var')
      }

Then, run your regression like this:

      xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)

The Hausman test for fixed- versus random-effects is:

      testparm cl_x1-cl_x4

This will save you on degrees of freedom and computational requirements. This estimator is consistent. Try it out with a subsample of your dataset to see. Many econometricians have been amazed by this.

HTH,
J.

Here is the benchmark test program for the subsampling merge process.

Note that the master dataset must be sorted before the merge, even though it is already in sort order, and -sorted- is specified as an option to the merge command. This is apparently a bug in the implementation of the option, but sorting a dataset already in sort order is done in linear time and is not a problem.