Fixed Effects

-fvvarlist-

A new feature of Stata is the factor variable list. See -help fvvarlist- for more information, but briefly, it allows Stata to create dummy variables and interactions for each observation just as the estimation command calls for that observation, without ever storing the dummy values. This makes possible such constructs as interacting a state dummy with a time trend without using any memory to store the 50 possible interactions themselves. (You would still need memory for the cross-product matrix.)

. reg dep i.state c.year#i.state

regresses dep on 50 state dummies and a time trend interacted with the state dummies, but doesn't use any memory to store the interactions (assuming year and state are as you would expect from the names; the c. prefix is needed so that year enters as a continuous trend rather than as a set of year dummies). While highly convenient, this is far from the most efficient way to obtain these estimates. With 20 years of data Stata is still doing all the arithmetic of multiplying each of the 98 zero dummy values times all 100 other values for each row of data added to the cross-product matrix. Nevertheless, when interaction terms are required, this may be the preferred method.

-xtreg- and relations

There are a large number of regression procedures in Stata that avoid calculating fixed effect parameters entirely, a potentially large saving in both space and time. Where analysis bumps against the 11,000 variable limit in Stata/SE, they are essential. These are documented in the panel data volume of the Stata manual set, or you can use the -help- command for -xtreg-, -xtgee-, -xtgls-, -xtivreg-, -xtivreg2-, -xtmixed-, -xtregar- or -areg-. There are additional panel analysis commands available from the SSC archive.

However, by and large these routines are not coded with efficiency in mind and will be intolerably slow for very large datasets. Worse still, -xtivreg2- requires additional memory for the de-meaned data, which is stored in doubles: 20GB of floats becomes a further 40GB of doubles (doubles take twice the space), for a total requirement of 60GB.

-xtreg- is the basic panel estimation command in Stata, but it is very slow compared to taking out means. For example:

. xtset id
. xtreg y1 y2, fe

runs in about 5 seconds per million observations, whereas the undocumented command

. _regress y1 y2, absorb(id)

takes less than half a second per million observations. The difference increases with more variables. The -_regress- command is in fact the old -regress- command from versions of Stata before the xt-style commands were made available. It seems to have been kept (with the name changed) because it was called by other Stata commands, and was left undocumented to induce users to transfer to the new and more comprehensive command set.
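You can time the comparison on your own data with Stata's -timer- command. A minimal sketch, assuming a panel with variables y1 and y2 and identifier id:

. timer clear
. timer on 1
. xtreg y1 y2, fe
. timer off 1
. timer on 2
. _regress y1 y2, absorb(id)
. timer off 2
. timer list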

Efficient Panel Estimation

What if you have endogenous variables, or need to cluster standard errors? Jacob Robbins has written a fast tsls.ado program that handles those complications:

. tsls y1 (y2 = z1) x1, demean replace fe(panelid) cluster(panelid)

will run about 30 times as fast as -xtivreg2- and use half to a ninth of the memory. It is available via:

. net from http://www.nber.org/stata
. net install tsls

If -tsls- doesn't do everything you need, it is useful to know how to do fixed effects and two-stage regressions "by hand".

The dof() option on the -reg- command can be used to correct the standard errors for the degrees of freedom lost in taking out means. -distinct- (from SSC) is a very fast way of calculating the number of panel units. I warn you against either of

. tab id, nofreq
. egen count = count(id)

which may take up to 50 times as long as the regression itself.
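A sketch of the fast alternative; -distinct- leaves the count behind in r(ndistinct):

. ssc install distinct
. distinct id
. display r(ndistinct)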

For taking out means, you may use

. sort id
. by id: egen temp = mean(var)
. replace var = var - temp

but if there are many groups -egen- becomes very slow. A faster method is:

. by id: generate temp = sum(var)
. by id: replace var = var - temp[_N]/_N

but it is important that all observations with missing values be dropped before this step is performed on any of the variables, because the running sum created by sum() treats missing values as zero while _N counts every observation in the group, so the computed mean would be wrong.
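Putting the pieces together, here is a sketch of the whole by-hand fixed effects procedure, with hypothetical variable names y, x and id. The exact argument that dof() expects is not spelled out above; the sketch assumes it takes the residual degrees of freedom:

* demeaning requires a consistent sample across variables
drop if missing(y, x)
sort id
* take out the panel mean of each variable with the fast running-sum method
foreach v of varlist y x {
    by id: generate temp = sum(`v')
    by id: replace `v' = `v' - temp[_N]/_N
    drop temp
}
* fast count of panel units, saved in r(ndistinct)
distinct id
* dof() argument is an assumption: observations minus absorbed means minus the regressor
reg y x, dof(`=_N - r(ndistinct) - 1')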

For IV regressions, taking out means is not sufficient to correct the standard errors. Use the -reg- command for the 1st stage regression, then run the 2nd stage regression using the predicted values (-predict- with the xb option) for the endogenous variables. In econometrics class you will have learned that the coefficients from this sequence are consistent, but the standard errors are not, because they are based on residuals computed with the predicted rather than the actual endogenous variables. The formulas for correcting the standard errors are known, and not computationally expensive. An easy way to obtain corrected standard errors is to regress the 2nd stage residuals (calculated with the actual, not predicted, data) on the independent variables. The standard errors reported by that regression are consistent for the coefficients of the 2nd stage regression.
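A sketch of that sequence, with hypothetical names: outcome y, endogenous regressor y2, instrument z1 and exogenous control x1:

* 1st stage: endogenous variable on the instrument and exogenous controls
reg y2 z1 x1
predict y2hat, xb

* 2nd stage: coefficients are consistent, reported standard errors are not
reg y y2hat x1

* residuals formed with the actual endogenous variable, not the predicted one
gen u = y - _b[y2hat]*y2 - _b[x1]*x1 - _b[_cons]

* the standard errors from this regression are the corrected ones
reg u y2hat x1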

-reghdfe- and Multiple Fixed Effects

-xtreg-, -tsls- and their ilk are good for one fixed effect, but what if you have more than one? Possibly you can take out means for the fixed effect with the largest number of categories and use factor variables for the others. That works until you reach the 11,000 variable limit for a Stata regression. Otherwise, there is -reghdfe- on SSC, which uses an iterative process that can deal with multiple high-dimensional fixed effects. It used to be slow, but I recently tested a regression with a million observations and three fixed effects, each with 100 categories. That took 8 seconds (limited to 2 cores). Increasing the number of categories to 10,000 only tripled the execution time.
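A minimal sketch of the syntax, assuming hypothetical variables y and x and three fixed-effect identifiers fe1, fe2 and fe3:

. ssc install reghdfe
. reghdfe y x, absorb(fe1 fe2 fe3)

Standard errors can be clustered with the usual vce(cluster ...) option.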


last revised 10 August 2018 by feenberg@nber.org