Spreading values across a by group without -egen-

Since I wrote the following Sergio Corriea has posted his -fegen- package to SSC, which promises a similar improvement in speed but retains the egen syntax.

-egen- is convenient for spreading values across a by group, but can be slow for very large files, or if repeated often. In this demonstration there are 10 million records in 2 million groups and we wish to place the maximum of variable x for each group in all the records for that group. The variable is random and sometimes missing. We can substitute 2 statements for one -egen- command and speed processing by a factor of 20. The method is quite general, it works for minimum, sum, etc with slight and obvious modification.

We use

by id: replace x = max(x[_n-1],x) to get the maximum within the group into the last member of the group. Then we use by id: gen groupmax = x[_N] to copy the last (and greatest) value to all the other records as the new variable "groupmax".

The speed increase in the test run is a factor of twenty over the equivalent -egen- command. The final -assert- statement shows that the results are equivalent. Here is the full .do file for the demonstration:

log using spread,text replace set obs 10000000 set rmsg on gen id = 5*int(_n/5-.1) sort id gen x = runiform() replace x = . if runiform()>.95 by id: replace x = max(x[_n-1],x) by id: gen groupmax = x[_N] by id: egen egenmax = max(x) assert egenmax == groupmax
last modified October 15, 2016 by drf