Faster Recoding

The typical Stata job involves considerable recoding of variables. There are many ways to recode variables and how this is done can have a substantial effect on runtime. Consider the conversion of the FIPS state code to the IRS state code. Stata includes the -recode- statement for just this task:

    recode fips ( 1 =  1) ( 2 =  2) ( 4 =  3) ( 5 =  4) ( 6 =  5) ( 8 =  6) ///
                ( 9 =  7) (10 =  8) (11 =  9) (12 = 10) (13 = 11) (15 = 12) ///
                (16 = 13) (17 = 14) (18 = 15) (19 = 16) (20 = 17) (21 = 18) ///
                (22 = 19) (23 = 20) (24 = 21) (25 = 22) (26 = 23) (27 = 24) ///
                (28 = 25) (29 = 26) (30 = 27) (31 = 28) (32 = 29) (33 = 30) ///
                (34 = 31) (35 = 32) (36 = 33) (37 = 34) (38 = 35) (39 = 36) ///
                (40 = 37) (41 = 38) (42 = 39) (44 = 40) (45 = 41) (46 = 42) ///
                (47 = 43) (48 = 44) (49 = 45) (50 = 46) (51 = 47) (53 = 48) ///
                (54 = 49) (55 = 50) (56 = 51), generate(irs)

That statement takes about 11 seconds per million observations. Given that the source values are all small integers, it is quite possible to substitute an array assignment:

    matrix define fips2irs = (1,2,.,3,4,5,.,6,7,8,9,10,11,.,             ///
        12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,  ///
        33,34,35,36,37,38,39,.,40,41,42,43,44,45,46,47,.,48,49,50,51)
    generate irs = fips2irs[1,fips]

which takes only .18 seconds per million observations, a speedup of about a factor of 60. A series of 51 statements, one generate followed by 50 replace...if statements,

    generate irs = 1 if fips==1
    replace irs = 2 if fips==2
    replace irs = 3 if fips==4
    ...

falls in between those times, at 1.6 seconds per million values, but is maximally flexible.
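Timings like these are easy to check on your own machine with Stata's -timer- command. The following is only a minimal sketch: the simulated data, the seed, and the choice of timer number are assumptions for illustration, not the author's test setup, and your timings will differ.

```stata
* Simulated test data (an assumption, not the author's dataset):
* one million observations with codes 1..56. Codes 3, 7, 14, 43 and 52
* have no entry in the lookup table and therefore map to missing.
clear
set obs 1000000
set seed 12345
generate byte fips = ceil(56 * runiform())

* Lookup table from the text: column j holds the IRS code for FIPS code j.
matrix define fips2irs = (1,2,.,3,4,5,.,6,7,8,9,10,11,.,      ///
    12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,    ///
    30,31,32,33,34,35,36,37,38,39,.,40,41,42,43,44,45,46,47,  ///
    .,48,49,50,51)

* Time the matrix-lookup assignment.
timer clear
timer on 1
generate byte irs = fips2irs[1, fips]
timer off 1
timer list 1
```

To compare methods, drop irs, wrap the alternative statement in a second timer (timer on 2 ... timer off 2), and list both.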

Still another way to recode is to sort and merge with a translation dataset. That took .65 seconds in my test dataset, but the time would be highly variable.
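The merge approach might look something like the sketch below. It uses the newer merge m:1 syntax (Stata 11 and later), which sorts internally, rather than an explicit sort followed by merge. The tiny master dataset and the three-row translation table are illustrative assumptions; a real crosswalk would hold every FIPS/IRS pair.

```stata
* Small master dataset for illustration (hypothetical values).
clear
input byte fips
1
4
2
4
end

* Build the translation dataset in a temporary file.
* Only three of the 51 pairs are shown here.
tempfile xwalk
preserve
clear
input byte fips byte irs
1 1
2 2
4 3
end
save `xwalk'
restore

* Attach irs to every observation by matching on fips.
* keep(master match) retains all master observations, matched or not;
* nogenerate suppresses the _merge indicator variable.
merge m:1 fips using `xwalk', keep(master match) nogenerate
list
```

Note that merge leaves the data sorted by the key variable, so the original observation order is not preserved unless you save and restore it yourself.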

In some cases the -group()- function of the -egen- statement will be practical and is also quite fast. It doesn't let you choose the new values (they are consecutive integers), but it may be used to rapidly and easily compress long strings into single bytes. See also -encode- or -multencode- (SSC). In my tests -encode- is nearly instantaneous: less than a second for converting a million strings with a thousand possible values.
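As a rough illustration of both commands, assuming a string variable with the hypothetical name county_name:

```stata
* Toy data; county_name is a hypothetical variable for illustration.
clear
input str20 county_name
"Middlesex"
"Suffolk"
"Middlesex"
"Essex"
end

* egen group(): consecutive integers 1, 2, 3, ... assigned in the
* sorted order of the distinct strings; no value labels by default.
egen long county_id = group(county_name)

* encode: also consecutive integers in sorted order, but the original
* strings are kept as value labels on the new variable.
encode county_name, generate(county_code)

list
```

The main practical difference is that -encode- keeps the original text as a value label, while -egen- with group() produces bare integers unless you ask for labels.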

Note that if you want regular quantiles such as deciles or percentiles, there are fast one-line commands.
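One built-in candidate is -xtile-, which assigns quantile categories in a single line. Whether this is the command the note has in mind is an assumption, and no timing is claimed for it here. The simulated income variable is hypothetical.

```stata
* Simulated data (hypothetical variable name and distribution).
clear
set obs 1000
set seed 12345
generate income = exp(rnormal())

* Assign each observation to a decile of income (categories 1..10).
xtile decile = income, nquantiles(10)
tabulate decile
```

Changing nquantiles(10) to nquantiles(100) gives percentiles instead of deciles.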

At least one of the techniques mentioned above should suit any recoding task, and even the most flexible methods are likely to be close to an order of magnitude faster than -recode-.