Stata has built-in commands -pctile- and -xtile- for calculating the quantiles and quantile ranks of a variable. For instance:

    xtile ptile = x, nq(100)

assigns to ptile the percentile rank associated with the variable x. For 100 million observations, this took 31 minutes. A faster way is:

    sort x
    gen ptile = int(100*(_n-1)/_N)+1

which took only 6 minutes, assuming you do not need to restore the original sort order. The plus and minus one move the ptiles from [0-99] to [1-100], matching the -xtile- command, but are otherwise superfluous.

There is a potential problem with this code: equal values of x may be assigned to different quantiles. That can be fixed with one line, at the expense of increasing the variation in the size of supposedly equal quantiles:

    replace ptile = ptile[_n-1] if x == x[_n-1]
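To try the comparison on your own machine, a minimal sketch along the following lines may help. It builds a simulated variable, times both approaches with -timer-, and counts the observations where the two results disagree. The sample size, the seed, and the variable names ptile_xt and ptile_diy are illustrative assumptions, not part of the benchmark reported above.

```stata
clear
set seed 12345
set obs 1000000
* double precision makes tied values of x very unlikely
gen double x = runiform()

timer clear

* timer 1: the built-in command
timer on 1
xtile ptile_xt = x, nq(100)
timer off 1

* timer 2: the DIY method, including the sort and the tie fix
timer on 2
sort x
gen ptile_diy = int(100*(_n-1)/_N) + 1
* pull tied values of x into the percentile of the preceding observation
replace ptile_diy = ptile_diy[_n-1] if x == x[_n-1]
timer off 2

timer list
* number of observations where the two methods disagree
count if ptile_xt != ptile_diy
```

Timings will depend on hardware and Stata flavor, so treat the numbers as relative rather than as a reproduction of the figures quoted above.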

There is a drop-in replacement SSC command for -xtile- called -fastxtile- that is even faster than the DIY method. However, like -xtile-, it is not byable. The DIY method extends easily to by variables:

    sort byvar x
    by byvar: gen ptile = int(100*(_n-1)/_N)+1

taking advantage of _n and _N referring to position in the current by group. -egen- helps us generalize to by variables and weights at the same time:

    sort byvar x
    by byvar: egen sumwgt = sum(wgt)
    by byvar: gen rsum = sum(wgt)
    by byvar: gen ptile = int(100*rsum/sumwgt)

Notice the two different meanings of the -sum()- function. It is a running sum in -generate- commands and a completed sum in -egen- commands (current Stata documents the -egen- version as -total()-). Note also that, as written, the weighted formula yields values from 0 to 100 rather than 1 to 100, since the last observation in each group has rsum equal to sumwgt. In this case the -egen- command added only a minute to the total time.
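The weighted by-group recipe can be sketched end to end on simulated data as follows. This is a variant rather than the exact code above: it subtracts each observation's own weight before dividing, so the result stays within 1 to 100 and reduces to the unweighted formula when every weight is 1. The data, the seed, and the use of -total()- in place of -sum()- inside -egen- are assumptions made for illustration.

```stata
clear
set seed 12345
set obs 100000
* hypothetical data: 5 groups, a continuous x, and positive weights
gen byvar = ceil(5*runiform())
gen double x = rnormal()
gen double wgt = 1 + 9*runiform()

sort byvar x
* completed sum of weights within each group
by byvar: egen double sumwgt = total(wgt)
* running sum of weights within each group
by byvar: gen double rsum = sum(wgt)
* share of group weight strictly below each observation, mapped to 1..100
by byvar: gen ptile = int(100*(rsum - wgt)/sumwgt) + 1
* keep tied values of x within a group in the same percentile
by byvar: replace ptile = ptile[_n-1] if _n > 1 & x == x[_n-1]

* sanity check on the range, then a per-group summary
assert inrange(ptile, 1, 100)
by byvar: summarize ptile
```

Because every step runs under -by byvar:-, _n and the running sum restart in each group, so no explicit loop over groups is needed.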

The speed of -xtile- and relatives is highly dependent on the number of categories requested: the more categories, the less efficient they are compared to DIY. With only two categories, the techniques are roughly equal.
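One way to see this dependence for yourself is to time both methods for several category counts. The sketch below is only illustrative: the sample size, the choice of 2, 10, and 100 categories, and the variable names are assumptions, and the data are sorted once up front, so the cost of the sort is excluded from the DIY timings.

```stata
clear
set seed 12345
set obs 1000000
gen double x = runiform()
sort x

timer clear
local i = 0
foreach q of numlist 2 10 100 {
    * odd-numbered timers: xtile with q categories
    local ++i
    timer on `i'
    xtile xt`q' = x, nq(`q')
    timer off `i'

    * even-numbered timers: DIY with q categories
    local ++i
    timer on `i'
    gen diy`q' = int(`q'*(_n-1)/_N) + 1
    timer off `i'
}
* timers 1-2 are q=2, timers 3-4 are q=10, timers 5-6 are q=100
timer list
```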

SSC also contains the -stile- function for -egen- and the -fastxtile- command, both of which may be worth looking into.


Last changed 17 September 2016 - drf