The analysis of very large files, such as health insurance claims, has long been considered the preserve of SAS, because SAS can handle datasets of any size, while Stata is limited to datasets that fit in core. In many cases a preliminary extraction has been done in SAS, followed by analysis of a smaller subset in Stata. In this note we offer suggestions for doing the extraction in Stata, eliminating the SAS step. This is followed by some suggestions for greatly reducing the run time of common operations in Stata.
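For example, Stata's -use- command accepts a varlist and an -if- qualifier, so an extract can be built directly from the full file on disk without ever loading every variable into memory. A minimal sketch, with hypothetical file and variable names (claims2010.dta, patid, admdate, dxcode, paid):

    * Read only the four variables needed, keeping only the
    * observations that satisfy the condition, so the extract
    * fits in core even when the full file would not.
    use patid admdate dxcode paid if dxcode == "250" ///
        using claims2010, clear
    save diabetes_extract, replace

Restricting the varlist is what saves the most memory here; the -if- qualifier still requires a pass over the file, but only the listed variables are retained.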
It is a truism that computers are cheap and people are expensive. However, people waiting for computers are also expensive, and often a little thought put into programming can pay dividends in faster results, especially when programs are run repeatedly on datasets with tens or hundreds of millions of observations and take days or weeks to complete.
- Decreasing sorts
- Sample selection
- Recoding
- Reshape
- Common subexpressions
- Fixed effects
- Multiple Fixed Effects
- Fast Clustered 2SLS with IV and FE
- Reading or writing pipes
- Stata-MP
- Simulations
- Checkpoints
- Restartable Bootstraps
- I/O for very large files
- Separate Slope Coefficients
- Cross observation sampling restrictions - Medicare data
- Conserving Memory
- Memory Management in Stata 14
- Percentiles and Quantiles
- Non-linear non-convergence
- Collapse and aggregation
- Faster than -egen-
- Regressions on many subsets
- Rolling Regressions
- levelsof
Links to other sources
- Too many variables
- IV with fixed effects
- Joe Canner on large datasets
- Faster implementation of Stata's collapse, egen, xtile, isid, and more using C plugins
- Guimaraes on Big Data
Daniel Feenberg
feenberg@nber.org