NATIONAL BUREAU OF ECONOMIC RESEARCH

Notes for Big Users

Picking a Computer:

Linux/X86-64, good for Stata, Xstata or Statamp
  • nber0 has 264 GB and 32 cores
  • nber1 has 64 GB and 2 cores
  • nber2 has 96 GB and 8 cores
  • nber3 has 198 GB and 6 cores
  • nber4 has 256 GB and 6 cores
  • nber5 has 64 GB and 12 cores
  • nber6 has 128 GB and 6 cores
  • nber7 has 512 GB and 4 cores
  • nber8 has 24 GB and 2 cores
  • nber9 has 128 GB and 8 cores
We also have sas1 with 32 GB and 2 cores, mostly for SAS jobs. We have plenty of CPU resources on these machines, and statamp is very fast. All the Stata machines (except nber0) have hypercores, and in many cases this doubles the number of available cores.

Please visit http://www.nber.org/xming.html for instructions on running xstata from your Windows box. A new package that shows promise is MobaXterm, linked from http://ssh.nber.org

Check out the load on each machine with the "showload" command before starting a long job. You will want to select a machine with sufficient memory available, as well as low current load. A machine is not heavily loaded until the load is larger than the number of cores.
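The rule of thumb above (a machine is heavily loaded only once the load exceeds the number of cores) can be sketched in shell. This is only an illustration: is_heavily_loaded is a hypothetical helper, not an NBER command, and it reads the standard Linux /proc/loadavg file and nproc rather than "showload".

```shell
#!/bin/sh
# Sketch: compare the 1-minute load average with the number of cores.
# is_heavily_loaded is a hypothetical helper name, not an NBER command.

# Succeeds (exit status 0) only when LOAD is strictly greater than CORES.
is_heavily_loaded() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}

# On Linux, the first field of /proc/loadavg is the 1-minute load average,
# and nproc reports the number of processing units available.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load rest < /proc/loadavg
    cores=$(nproc)
    if is_heavily_loaded "$load" "$cores"; then
        echo "load $load exceeds $cores cores: consider another machine"
    else
        echo "load $load is within $cores cores"
    fi
fi
```

Note that nproc counts hyperthreaded units as well, so its figure may be higher than the core counts listed above.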

These machines make it possible to run extremely large jobs in Stata. A workspace of 20GB will hold more than 25 million observations on 200 variables.
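The arithmetic behind that figure can be checked with a short calculation. This sketch assumes every variable is stored in 4 bytes (as a Stata float would be), that a GB is 2^30 bytes, and that per-observation overhead can be ignored; max_obs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: rough upper bound on observations that fit in a workspace.
# Assumptions (not from the original notes): 4 bytes per value,
# 1 GB = 1024^3 bytes, no storage overhead.

# max_obs GB NVARS BYTES_PER_VALUE
max_obs() {
    echo $(( $1 * 1024 * 1024 * 1024 / ($2 * $3) ))
}

max_obs 20 200 4   # prints 26843545
```

Wider storage types such as doubles or long strings reduce the count proportionally.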

Memory Considerations:

The "top -a" command will show you which jobs are running on the computer, and how much memory they are using. As the memory used approaches the amount of physical memory on the machine, performance tanks. Please don't start any large (memory size) jobs while memory is short. Don't look at the summary line 4th from the top of the page - that is confused by I/O buffering. Look at the list of jobs and add up the numbers under "SIZE". Probably only the ones with a "G" (gigabyte) matter.
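Adding up the per-job numbers by hand gets tedious, so here is a rough sketch that totals them. It is an approximation under stated assumptions: it sums the resident set size reported by ps rather than the "SIZE" column of top, it counts shared memory once per process, and total_gb is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: total memory used by all processes, in gigabytes.
# Assumption: resident set size (RSS) from ps is an acceptable stand-in
# for the per-job sizes shown by top.

# total_gb: read sizes in kilobytes (one per line) from standard input
# and print their sum in GB with one decimal place.
total_gb() {
    awk '{ kb += $1 } END { printf "%.1f\n", kb / 1024 / 1024 }'
}

# "ps -A -o rss=" prints the RSS in KB of every process, without a header.
if command -v ps >/dev/null 2>&1; then
    ps -A -o rss= 2>/dev/null | total_gb
fi
```

Compare the result with the physical memory of the machine before starting a large job.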

Anything more than 50 GB on a 64 GB machine is likely to overload the machine to the extent that it ceases to do any useful work - don't do that.

Long and large Stata jobs should be run in background, so that they don't use memory waiting for keyboard input for hours after you have left for the evening.

Dan Feenberg is eager to help users improve the efficiency of their programs. It is often possible to decrease run times by 50% to 95% with minor changes. We know that computers are cheap and you are expensive. However, your time spent waiting for the computer is also expensive, so efficiency matters in long-running jobs.

Stata jobs can be run in background with:

    stata -b file.do

and can be made immune from logouts and hangups with:

    nohup stata -b file.do

Advice on efficient programming in Stata is offered at http://www.nber.org/stata/efficient
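The background commands above can be wrapped in a small helper. This is a sketch, not an NBER-supplied script: run_in_background and the log file name are hypothetical, and the trailing "&" (which returns the shell prompt immediately) is an addition to the commands as given.

```shell
#!/bin/sh
# Sketch: start a command so that it survives logout and returns the prompt.
# run_in_background is a hypothetical helper name.

# run_in_background LOGFILE COMMAND [ARGS...]
# Prints the process ID of the started job.
run_in_background() {
    log=$1
    shift
    # nohup ignores hangups; output goes to LOGFILE; stdin is detached
    # so the job never waits for keyboard input.
    nohup "$@" > "$log" 2>&1 < /dev/null &
    echo "$!"
}

# Intended use (file.do is the placeholder name from the text above):
#   run_in_background file.out stata -b file.do
```

The printed process ID can be passed to kill if the job has to be cancelled later.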

See "Stata for very large datasets" for how to subset datasets too large for memory.

Multiple Jobs:

If you have to run multiple large jobs, please don't run all of them at once. That is, run:

    sas job1;sas job2;sas job3;

rather than

    sas job1 & sas job2 & sas job3 &

Run together, the jobs are likely to interfere with each other rather than finish sooner. You may be thinking that if the machine is shared by several users, you can get a greater percentage of the CPU by having more sheep grazing on the common. Put this out of your mind - there is a village chief guarding the common.

If I see more than half the cores on a machine running jobs from a single login, or more than half the memory devoted to a single user with multiple large jobs, I may cancel all but one or two. If the jobs use little memory I will allow more. It is OK to run a long job on each machine, but not if they require lots of memory. In short, don't hog the machines; leave some memory and CPU for others.
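For a longer list of jobs, the one-after-another form recommended above can be written as a loop. This is a minimal sketch; run_serially is a hypothetical helper name, and job1, job2, job3 are the placeholder names used above.

```shell
#!/bin/sh
# Sketch: run jobs strictly one after another, never in parallel.
# run_serially is a hypothetical helper name.

# run_serially RUNNER JOB [JOB...]
# Runs "RUNNER JOB" for each job in turn; each job starts only after
# the previous one has finished.
run_serially() {
    runner=$1
    shift
    for job in "$@"; do
        "$runner" "$job"
    done
}

# Intended use, equivalent to: sas job1;sas job2;sas job3;
#   run_serially sas job1 job2 job3
```

Like the semicolon form, the loop moves on to the next job even if an earlier one fails.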

Long jobs should be restartable from several points, so that if the job is cancelled or the machine crashes, you don't have to start again from the beginning.
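One way to make a job restartable is to split it into steps and record each completed step in a marker file. The sketch below is an illustration only: run_step, CHECKPOINT_DIR and the do-file names are hypothetical, and it assumes each command reports failure through its exit status, which you should verify for your own programs.

```shell
#!/bin/sh
# Sketch: skip steps that already finished in an earlier run.
# run_step and CHECKPOINT_DIR are hypothetical names.

CHECKPOINT_DIR=${CHECKPOINT_DIR:-./checkpoints}

# run_step NAME COMMAND [ARGS...]
# Runs COMMAND unless NAME.done already exists; creates the marker
# only when COMMAND exits successfully.
run_step() {
    name=$1
    shift
    mkdir -p "$CHECKPOINT_DIR"
    if [ -e "$CHECKPOINT_DIR/$name.done" ]; then
        echo "skipping $name (already done)"
        return 0
    fi
    "$@" && touch "$CHECKPOINT_DIR/$name.done"
}

# Intended use (do-file names are placeholders):
#   run_step subset stata -b subset.do
#   run_step merge  stata -b merge.do
```

After a crash, rerunning the same script resumes at the first step without a marker. Each step should save its results to disk so the next step can pick them up.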

If your job crashes for lack of disk space, please check /tmp for files that may not have been deleted. Often users will fill /tmp, their job will die, and all subsequent jobs for that user and all other users will die because /tmp is still full. If a full /tmp (filled by another user) prevents you from working on one machine, it will not affect the other machines.
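A quick way to look for your own leftovers is to list the files you own under /tmp. This is a sketch using standard find, id and df; my_files_in is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list regular files under a directory owned by the current user.
# my_files_in is a hypothetical helper name.

# my_files_in DIR
my_files_in() {
    # "id -u" prints the numeric user ID; find -user accepts it.
    # Errors from unreadable subdirectories are suppressed.
    find "$1" -type f -user "$(id -u)" 2>/dev/null
}

# Intended use:
#   df -k /tmp          # how full is /tmp on this machine?
#   my_files_in /tmp    # which files there are mine?
```

Review the list before deleting anything, since a job that is still running may need its temporary files.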

How big is too big?

Anything that will use more than a few dozen hours of CPU time or more than 10 gigabytes of memory is pretty big and you shouldn't start it without checking "top -a" to make sure there is room.

A very large Stata job (more than 40 GB) to make a subset from a large .dta file can reasonably be run during regular working hours since it will not exclude others for more than an hour or so. But to have more than one such job crunch away for many hours would be inappropriate.

Disk Space:

There is a 10GB quota on home directories, but Mohan can expand it if you need additional space. The quota system will prevent writes that exceed the quota, but it does not produce a recognizable error message. Suspect over-quota if you can't write a large file to your home directory. Most home directories are stored in a compressed filesystem. This means that compressing your files with zip won't actually reduce usage.
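Since the quota gives no clear error message, it can help to compare your usage with the limit yourself. The sketch below relies on du, which may not match exactly what the quota system counts, particularly on a compressed filesystem; kb_exceeds_gb and dir_over_limit are hypothetical helper names, and a GB is taken as 1024^2 kilobytes.

```shell
#!/bin/sh
# Sketch: check whether a directory's disk usage exceeds a limit in GB.
# Helper names are hypothetical; du figures are only an approximation
# of what the quota system measures.

# kb_exceeds_gb USED_KB LIMIT_GB: succeeds when USED_KB is above the limit.
kb_exceeds_gb() {
    [ "$1" -gt $(( $2 * 1024 * 1024 )) ]
}

# dir_over_limit DIR LIMIT_GB: succeeds when DIR uses more than LIMIT_GB.
dir_over_limit() {
    used_kb=$(du -sk "$1" 2>/dev/null | awk '{ print $1 }')
    kb_exceeds_gb "${used_kb:-0}" "$2"
}

# Intended use:
#   if dir_over_limit "$HOME" 10; then echo "probably over quota"; fi
```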


Last revised 7 July 2018 - drf
 

National Bureau of Economic Research, 1050 Massachusetts Ave., Cambridge, MA 02138; 617-868-3900; email: info@nber.org
