Stata for nearly big data

Stata for very large datasets

The analysis of very large files, such as health insurance claims, has long been considered the preserve of SAS, because SAS could handle datasets of any size, while Stata was limited to datasets that would fit in core. In many cases a preliminary extraction has been done in SAS, followed by analysis of a smaller subset in Stata. In this note we offer suggestions for doing the extraction in Stata, eliminating the SAS step. This is followed by some suggestions for greatly reducing the run time for common operations in Stata.

It is a truism that computers are cheap and people are expensive. However, people waiting for computers are also expensive, and often a little thought put into programming can pay dividends in faster results, especially when programs are run repeatedly on datasets with tens or hundreds of millions of observations and take days or weeks to complete.

Decreasing sorts
Sample selection
Recoding
Reshape
Common subexpressions
Fixed effects
Multiple Fixed Effects
Fast Clustered 2SLS with IV and FE
Reading or writing pipes
Stata-MP
Simulations
Checkpoints
Restartable Bootstraps
I/O for very large files
Separate Slope Coefficients
Cross observation sampling restrictions - Medicare data
Conserving Memory
Memory Management in Stata 14
Percentiles and Quantiles
Non-linear non-convergence
Collapse and aggregation
Faster than -egen-
Regressions on many subsets
Rolling Regressions
levelsof

Links to other sources

Advice from the author of -gtools-, Mauricio Caceres Bravo
Too many variables
IV with fixed effects
Joe Canner on large datasets
Faster implementation of Stata's collapse, egen, xtile, isid, and more using C plugins
Guimaraes on Big Data

Daniel Feenberg
feenberg@nber.org

last update 28 October 2019 by drf