Notes for Big Users
Picking a computer: Linux/x86-64, good for Stata, Xstata, or Stata/MP.
Check out the load on each machine with the "showload" command before starting a long job. You will want to select a machine with sufficient memory available, as well as low current load. A machine is not heavily loaded until the load is larger than the number of cores.
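On any Linux machine the same comparison can be made by hand; this is a sketch of the kind of check "showload" performs, reading the 1-minute load average and the core count (it is not the showload program itself):

```shell
# Compare the 1-minute load average to the number of cores.
# A load above the core count means the machine is heavily loaded.
cores=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
echo "load $load on $cores cores"
```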
These machines make it possible to run extremely large jobs in Stata. A workspace of 20GB will hold more than 25 million observations on 200 variables.
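That figure is a back-of-the-envelope calculation, assuming each value is stored as Stata's default 4-byte float:

```shell
# 25 million observations x 200 variables x 4 bytes per float
echo $((25000000 * 200 * 4)) bytes   # about 20 GB
```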
The "top -a" command will show you which jobs are running on the computer and how much memory they are using. As the memory in use approaches the amount of physical memory on the machine, performance tanks. Please don't start any large (memory-size) jobs while memory is short. Don't rely on the summary line 4th from the top of the page - it is confused by I/O buffering. Instead, look at the list of jobs and add up the numbers under "SIZE". Probably only the entries with a "G" (gigabyte) suffix matter.
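A rough equivalent of adding up that column can be scripted with ps; this sketch sums resident memory (RSS), which reflects RAM actually occupied, rather than top's virtual SIZE figure, which can overstate it:

```shell
# Sum the resident memory of all processes, reported in GB.
# "rss=" prints the RSS column (in KiB) with no header line.
ps -eo rss= | awk '{sum += $1} END {printf "%.1f GB resident\n", sum/1048576}'
```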
Anything more than 50 GB on a 64 GB machine is likely to overload the machine to the extent that it ceases to do any useful work - don't do that.
Long and large Stata jobs should be run in the background, so that they don't sit holding memory while waiting for keyboard input for hours after you have left for the evening.
Dan Feenberg is eager to help users improve the efficiency of their programs. It is often possible to cut run times by 50% to 95% with minor changes. We know that computers are cheap and you are expensive - but your waiting for the computer is expensive too, so efficiency still matters in long-running jobs.
Stata jobs can be run in the background with:
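A typical invocation looks like the following, assuming a do-file with the hypothetical name bigjob.do (substitute your own; use stata-mp if that is what is licensed on the machine):

```shell
# Batch mode (-b) writes output to bigjob.log; nohup and & keep the
# job running after you log out. Fails harmlessly if Stata is absent.
nohup stata -b do bigjob < /dev/null > /dev/null 2>&1 &
echo "started batch job with PID $!"
```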
See "Stata for very large datasets" for how to subset datasets too large for memory.
If you have to run multiple large jobs, please don't run them all at once - run them one after another, so that only one occupies memory at a time.
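Sequential execution can be sketched like this, using two hypothetical do-files job1.do and job2.do; the semicolon ensures the second starts only when the first has finished:

```shell
# Queue two batch jobs back to back inside one background shell,
# so at most one Stata process is using memory at any moment.
nohup sh -c 'stata -b do job1; stata -b do job2' > /dev/null 2>&1 &
echo "queued jobs under PID $!"
```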
Long jobs should be restartable from several points, so that if the job is cancelled or the machine crashes, you don't have to start again from the beginning.
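One simple way to get restartability is to leave a marker file after each completed stage, so that a rerun skips work already done. This is a generic sketch with hypothetical stage names, not a prescription for any particular job:

```shell
# Restartable pipeline: each finished step drops a ".done" marker,
# so rerunning after a crash resumes at the first unfinished step.
cd "$(mktemp -d)"                      # scratch dir for this demo
for step in extract merge analyze; do  # hypothetical step names
  if [ -f "$step.done" ]; then
    echo "skipping $step"
  else
    echo "running $step"               # e.g. stata -b do "$step"
    touch "$step.done"
  fi
done
```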
If your job crashes for lack of disk space, please check /tmp for files that may not have been deleted. Often users will fill /tmp, their job will die, and all subsequent jobs for that user and all other users will die because /tmp is still full. If a full /tmp (filled by another user) prevents you from working on one machine, it will not affect the other machine.
How big is too big?
Anything that will use more than a few dozen hours of CPU time or more than 10 gigabytes of memory is pretty big and you shouldn't start it without checking "top -a" to make sure there is room.
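Besides "top -a", the kernel's own estimate of reclaimable memory is a quick one-line check on any modern Linux kernel:

```shell
# MemAvailable estimates how much RAM a new job could use
# without pushing the machine into swapping.
awk '/MemAvailable/ {printf "%.1f GB available\n", $2/1048576}' /proc/meminfo
```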
A very large Stata job (more than 40 GB) that makes a subset from a large .dta file can reasonably be run during regular working hours, since it will not exclude others for more than an hour or so. But letting more than one such job crunch away for many hours would be inappropriate.
There is a 10 GB quota on home directories, but Mohan can expand it if you need additional space. The quota system will block writes that exceed the quota, but it doesn't produce a recognizable error message - suspect an over-quota condition if you can't write a large file to your home directory. Most home directories are stored in a compressed filesystem, which means that compressing your files with zip won't actually reduce your usage.
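When a write fails mysteriously, a quick look at the filesystem holding your home directory can confirm the suspicion (a site-specific quota command may also exist; this sketch uses only standard tools):

```shell
# Show total size, used space, and free space on the filesystem
# that holds your home directory.
df -h "$HOME"
```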
Last revised 7 July 2018 - drf