Aging Servers - Job submission - SLURM
NBER Large job submission/queuing utility
The NBER's current computing environment consists of a set of servers with varying numbers of CPU cores and amounts of memory, on which researchers can run jobs without restrictions. A key drawback of this setup is that a server's resources, especially memory, are often exhausted when multiple researchers run jobs simultaneously. As a consequence, a job may fail partway through its run, and the researcher may not learn its fate until much later. This is a source of frustration for our researchers. The current computing strategy suffers from a version of the "tragedy of the commons."
To alleviate this problem, we are instituting a job submission mechanism on a new server with 64 cores and 4TB of RAM, using the SLURM workload manager. SLURM (Simple Linux Utility for Resource Management) is used at many universities, typically on much larger and more varied clusters. In our case, we are targeting jobs that require large amounts of memory, so that they can be run on this new server without congesting the rest of the servers. The objective is to allow submission of jobs that are queued and started only when adequate resources (memory) are available for the job to run. Once a job is started, the requested memory is reserved for the duration of its run. The hope is that, while a job may wait longer in the queue, once it starts it has a high likelihood of completing. We will require users of this utility to estimate their job's memory requirement and specify it when submitting the job. If the job exceeds this specified memory during its run, SLURM will terminate it. It is therefore important to estimate the requirement before submitting your jobs. At the same time, it is important to resist over-subscribing resources, which makes the overall running of all jobs inefficient. We provide some rough guidance on how to make an estimate later in this document.
If you UNDER-estimate the amount of memory your job needs, your job will be killed when it goes over that limit.
If you OVER-estimate the amount, your job is likely to remain in the queue longer, and overall resource use will be less efficient.
The rest of our servers will operate as before. On those servers, for jobs requiring smaller amounts of memory, researchers are not subject to the job submission process, the requirement to estimate resources, or queue waits.

One inefficiency of the job submission environment is that the requested memory is reserved for the entire duration of your job, even though it may be needed for only a small portion of the job's run time. For example, a job may assemble several years of data and then collapse the data to some aggregate level before running statistical analysis; such a job does not need its peak memory throughout its run. On the non-job-submission servers, memory can be allocated dynamically among running jobs. We therefore recommend a workflow in which the data assembly step runs as a job in the job submission environment, and the statistical analysis on the aggregated data runs on the rest of the servers.
This setup is still experimental and will be refined as we learn more about its usage and the feedback we obtain from our researchers.
You can view this help document from the command line on the NBER servers with the command "man nberslurm".
Configuration and Policies
With the above objective in mind, and to support fair usage, here are the key details relevant to users:
Jobs submitted are put into a hold state at first.
Each user will be allowed to run a max of three (3) active jobs in their queue at any given time.
You are expected to specify the amount of memory your job requires using the --mem flag for every job you submit. If you do not specify memory, Slurm assumes the default allocation of 10 GB. Jobs that exceed their allocated memory are automatically canceled. No other option is required when submitting a job.
The Slurm scheduler will release at most three (3) jobs per user to run at a time. Jobs start based on their position in the queue and whether the resources (memory) they have requested are available at that time.
When determining how much memory to allocate to a job, users should consider: dataset size, number and complexity of variables, type and scope of operations, etc. If you are unsure, please contact it-support@nber.org for guidance.
How to submit a job?
Jobs can be submitted only from agerdp1 or agerdp2, from the command line. We will illustrate with an example of submitting a Stata job via Slurm. There are many variations of job submission; we illustrate only the simplest method here. Suppose you have a Stata do-file named run_job1.do that resides in the folder /disk/agediskX/…./myfolder/
First, open a Terminal (from your RDP session: Applications -> System Tools -> Terminal or MATE Terminal, or click the monitor icon at the top left corner of your Linux desktop).
From this terminal, you can use commands similar to the following:
srun --mem 500G /usr/local/bin/stata -b /disk/agediskX/…/myfolder/run_job1.do &
or
sbatch --mem 500G --wrap="/usr/local/bin/stata -b /disk/agediskX/…/myfolder/run_job1.do"
There are practical differences between srun and sbatch. srun submits the job to the queue and remains attached to your terminal session while it waits and runs (hence the trailing & to put it in the background), whereas sbatch submits the job to run detached from your session and returns immediately. Because sbatch expects a batch script rather than a bare command, wrap the command with --wrap as shown above. Both commands accept a --begin option to defer a job's start to a future time or date (see the deferred-start example below). You can submit multiple jobs; for example, after you submit run_job1.do, you can submit
srun --mem 500G /usr/local/bin/stata -b /disk/agediskX/…/myfolder/run_job2.do &
This will put your run_job2.do in the queue behind run_job1.do.
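If you want a job to start no earlier than a particular time, you can add the --begin option to sbatch (or srun). The example below is a sketch; the start time is illustrative:

sbatch --mem 500G --begin=22:00 --wrap="/usr/local/bin/stata -b /disk/agediskX/…/myfolder/run_job1.do"

This asks Slurm not to start the job before 10 PM; the job still waits until its requested memory is available.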
You can submit an R job via "/usr/bin/Rscript run_rjob1.R" or a Python job via "/usr/local/bin/python run_pythonjob.py", using srun or sbatch just as with Stata.
It is sometimes useful to record which code a job is running. You can add a comment to the job when you submit it. Below is an example:
srun --mem 100G --comment="100G memory job, x.stata" /usr/local/bin/stata -b x.do
How to check the status of your submitted job(s)?
The simplest way to get information on your jobs is to run the following command from a terminal on agerdp1 or agerdp2:
squeue --user=<username>
This will produce a list of your jobs in the queue, with the job ID, job name, job owner, job status, and run time.
To get a slightly more sophisticated/formatted output, you can use the command
squeue --user=<username> --format="%.18i %.8j %.8u %.8T %.10M %R %k"
This command will produce formatted output with the most relevant columns:
JobID, Job Name, Job User/Owner, Job Status, Job Run Time, Nodes (or pending Reason), and the Job's Comment.
How to cancel your submitted job(s)?
Use squeue to identify the JobID of the job you want to cancel. Then use
scancel <JobID>
You can provide multiple space-separated JobIDs to cancel several jobs at once. You can cancel only the jobs you own.
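For example, to cancel two jobs at once (the JobIDs below are placeholders):

scancel 123456 123457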
How to check how much memory your job is using or has used?
There are multiple ways in which you can check your job's memory utilization. First start by using the command:
sacct -u <username>
This will list the jobs you have, showing jobs in queue, running, aborted, or completed.
Note the <JobID> from the above command, which is the only variable you need for the following commands.
For a job that is still running, use the command:
sstat -j <JobID> --format=JobID,AveVMSize,MaxRSS
This command will produce output with your job's ID, the average virtual memory used, and the maximum memory used by the job up to that point.
For a job that has completed, use one of the two commands:
seff -j <JobID>
This command reports the efficiency of the resources used relative to what was requested. The line tagged "Memory Utilized:" gives the memory used by the job, and the line tagged "Memory Efficiency:" gives the percentage of the requested memory that was actually used. We want the "Memory Efficiency" to be near 100%.
sacct -j <JobID> --format=JobID,User%15,State,ExitCode,ReqMem,MaxRSS
This outputs the job's ID, the username (15 characters, to accommodate our long DUA-specific usernames), the state of the job, the job's exit code, the requested memory, and the maximum memory used by the job. (You can get information on multiple jobs by listing comma-separated Job IDs, e.g. -j JobID1,JobID2,...)
sacct -u <username> -Snow-2days --format=JobID,User%15,State,ExitCode,ReqMem,MaxRSS
gives the same output as the previous sacct command, but for all of your jobs from the last two days. You can then compare the requested memory (ReqMem) with the maximum memory used (MaxRSS) to evaluate and refine your memory estimates for similar jobs.
Please note that seff and sacct may report slightly different values for maximum memory used because they calculate it differently. Rely on sacct for a slightly more accurate number, and on seff for a more holistic summary.
How to estimate memory requirements?
We will start with the very basic first step. A rough estimate for a data frame/set is:
Memory Size (in bytes) ≈ Number of Rows × Number of Columns × 8 bytes.
The data frame/set can be in Stata, R, Python, or any other language. We have assumed that every column (variable) is stored as a double-precision floating point number. Let's refine this a bit: not every column/variable will be a floating point number. We could have string variables, a date variable, or a dummy variable stored as a byte.
Typically, an ASCII string variable occupies 1 byte per character. If the string is UTF-8, each character can take up to 4 bytes. A date variable takes about 8 bytes (although it depends on the format), and a byte variable, as the name suggests, takes 1 byte.
Whatever format the variable/column is in, the calculation is similar.
Memory Size (in bytes) ≈ Number of Rows × (sum over all columns of the number of bytes for that column's format/type)
Then, to convert Memory Size from bytes into KB, MB, GB, or TB, divide by 1024 for each step: by 1024 for KB, by 1024² for MB, by 1024³ for GB, and by 1024⁴ for TB.
For example, if you have 10 million obs/rows with one str15 ID variable, one date variable, 15 floating point variables, and 7 dummy (byte) variables, each row takes 15 + 8 + 15×8 + 7×1 = 150 bytes, so the memory required is 10,000,000 × 150 bytes ≈ 1.5 billion bytes, or about 1.4 GB.
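The same arithmetic can be scripted. Below is a minimal Python sketch of the calculation above; the per-type byte counts and the helper function are illustrative assumptions taken from this document, not exact figures for any particular statistical package.

# Rough memory estimate: rows x (sum of bytes per column).
# Byte sizes are the rough assumptions from this document.
BYTES_PER_TYPE = {"double": 8, "date": 8, "byte": 1}

def estimate_bytes(rows, columns):
    """columns: list of (type, count) pairs; a type like 'str15' means a 15-character string."""
    per_row = 0
    for col_type, count in columns:
        if col_type.startswith("str"):
            per_row += int(col_type[3:]) * count      # ~1 byte per ASCII character
        else:
            per_row += BYTES_PER_TYPE[col_type] * count
    return rows * per_row

# The worked example above: 10 million rows, one str15 ID, one date,
# 15 doubles, and 7 byte-sized dummies.
total = estimate_bytes(10_000_000, [("str15", 1), ("date", 1), ("double", 15), ("byte", 7)])
print(f"{total / 1024**3:.2f} GB")                    # about 1.40 GB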
In Stata, you can use describe to see each variable and its type/format.
In R, you can use object.size() to get the memory used by an object, mem_used() (from the pryr package) to get the total memory in use, or mem_change() (also from pryr) to see the change in memory usage while running an R expression.
In Python, you can use sys.getsizeof() from the standard sys module, the memory_profiler package, or the standard tracemalloc module.
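For instance, if your Python workflow happens to use pandas (an assumption for this sketch; the input file name is hypothetical), you can check how much memory a loaded data frame actually occupies:

import sys
import pandas as pd   # assumes your workflow uses pandas

df = pd.read_csv("mydata.csv")                        # hypothetical input file

# Per-column memory in bytes, counting string contents (deep=True), then the total in GB
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1024**3:.2f} GB")

# sys.getsizeof() reports the interpreter's size estimate for the object
print(sys.getsizeof(df))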
The next part gets a bit more complicated. Researchers often create interaction variables or dummy variables, or run fixed-effects models, which means the job will require additional memory beyond the initial estimate of the dataset's size. You can follow the same type of calculation: plan out the analyses you will run and add up the additional memory each one requires (a rough illustration follows).
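As a rough illustration using the same arithmetic (the counts and storage types here are assumptions for the example, not rules): expanding a categorical variable with 50 levels into dummy variables in a 10-million-row dataset adds roughly 10,000,000 × 50 × 1 byte ≈ 0.47 GB if the dummies are stored as bytes, or 10,000,000 × 50 × 8 bytes ≈ 3.7 GB if they are stored as doubles.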
Particular libraries/modules within a statistical package may also allocate space differently. For example, Stata's xtreg and reghdfe use double-precision variables for their internal calculations, which means 8 bytes each. reghdfe is more memory-efficient than xtreg or areg. You may want to see Daniel Feenberg's notes: Multiple Fixed Effects and -reghdfe-
We will rely on your own expertise in statistical software to help us provide such titbits for all our users. Please do share your knowledge and we will try to update this page accordingly.
Please contact it-support@nber.org if you have any questions or suggestions.