Efficient Reshapes

The reshape command is inexplicably slow. Take a million observation dataset with variables id, year and x2001-x2010. Then the command to reshape wide to long format:

. xtset id year . reshape long x, i(id) j(year) take about 20 seconds per million observations. But you can write out a separate file for each year of data, and then concatenate them into one long dataset in about 2 seconds. For example: . forvalues year = 2001/2010 { . use id year x`year' using "/tmp/reshape",replace . rename x`year' x . save "/tmp/reshape`year'",replace . } . clear . forvalues year = 2001/2010 { . append using "/tmp/reshape`year'" . }

Long back to wide might be more difficult, since it would require a merge command, and the Stata merge command is quite slow compared to -use-.

Paul von Hippel has posted even faster (but more complicated) methods that cover both directions at statalist forum

This thread includes a posting with a full explanation of why -reshape- is so slow, including a demonstration that the time runs as the square of the number of variables.

There are now faster commands -sreshape- and -fastreshape- in SSC and -greshape- in gtools that maintains the syntax of -reshape- but are reasonably fast. Rafal Raciborski has written me about his new command -tolong- also available from the ssc.

In limited testing, wide-to-long of 50 variables ran at a rate of 10,000 observations/second in -reshape-, 30,000/second in fastreshape and 300,000/second in tolong.


Last update 23 December 2022