Advice for Using Medicare Claims Data in Research on the Economics of Aging
The notes in this document were written in the context of a workflow leveraging task-based directory structures, code-tracking with Git, build automation with Make, and computing resources on the University of Chicago Midway and Randi servers. For an example of such a workflow, see Jonathan Dingel's project template available on GitHub at https://github.com/jdingel/projecttemplate, from which this document is adapted. Terminology specific to the University of Chicago servers—such as ‘Midway’, ‘Randi’, hostnames, paths, and login instructions—may require adaptation when applying these notes to other computing environments, such as those at the NBER or elsewhere. Some comments are relevant on your local client but not on the remote server. Nevertheless, the majority of the guidance and tips provided here are broadly applicable to a variety of research settings that utilize standard scientific computing tools.
1. Organizing Research Code
We organize a research project as a series of tasks, so our organization of code and data takes a task-based perspective. After writing code, we automate its execution via make. We track our code (and the rest of the project) using Git, a version control system. Collaboration occurs via issue/task assignments, pull requests, and logbook entries that share research designs and results. See https://github.com/jdingel/projecttemplate for details omitted from this abridged documentation.
Our approach assumes that you’ll use Unix/Linux/MacOSX. Plain-text social science lives at the *nix command line. Gentzkow and Shapiro: “The command line is our means of implementing tools.” Per Janssens (2014): “the command line is: agile, augmenting, scalable, extensible, and ubiquitous.” Here are four intros to the Linux shell:
• https://ryanstutorials.net/linuxtutorial/
• http://swcarpentry.github.io/shell-novice/
• Grant McDermott’s “Learning to love the shell” via his Data science for economists
• William E. Shotts Jr’s “Learning the Shell”
Getting started at the command line can be a little overwhelming, but it’s well worth it.
Use a good text editor like SublimeText, Atom, or VSCode to write code, slides, and papers. Word processors are not the same as text editors. Your text editor should, at minimum, offer you syntax highlighting, tab autocomplete, and multiple selection. We recommend VSCode, which supports Git, remote development via SSH, and a GitHub extension.
Section 2 provides tips for using the *nix command line, which is essential when working on high-performance computing clusters. Section 3 introduces Julia, which we often use for numerical optimization (i.e., solving economic models). Section 4 has a few notes on Stata quirks we have encountered. Section 5 is a guide to working on the high-performance computing infrastructure at UChicago’s Center for Research Informatics that hosts the Medicare claims data.
2. Unix/Linux Shell Commands
After mentioning some tools you should install, this section documents basic Unix/Linux shell commands and then some clever combinations for tasks we often encounter during research. At the command line, type man <command> to get the manual page for <command>.
2.1 Tools to install
• The dot command used in some graphing scripts like task_graph requires Graphviz. Please follow these instructions to install Graphviz.
• The convert command used to handle some images requires ImageMagick. On a Mac using HomeBrew, run brew install imagemagick.
• The epstopdf command requires Ghostscript. On a Mac using HomeBrew, run brew install ghostscript.
• We generate CSV reports using GNU awk. On a Mac using HomeBrew, run brew install gawk.
• You should run the setup_environment task when you initially clone the repository. This task installs required packages for Julia, Stata, R, etc. This task should be revised as the project’s package requirements evolve.
• Our Makefiles assume that you can run the command stata-se from the command line. You need to know about adding your Stata application to your PATH variable. You might want to know about symbolic links placed in /usr/local/bin. For example, on one of my machines, the symbolic link is
/usr/local/bin/stata-se -> /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp
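One way to create such a link, assuming your Stata executable sits at the path shown above (adjust for your own installation and edition):
sudo ln -s /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp /usr/local/bin/stata-se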
• We use Stata packages like ppmlhdfe that depend on the gtools package. If you get an error on your Mac that says something like “Could not load gtools_macosx_v3.plugin”, see https://github.com/mcaceresb/stata-gtools/issues/73.
• Install Textidote via Homebrew or from https://github.com/sylvainhalle/textidote to support its use in paper/reviewing.
• For using Git at the command line, a very simple text editor like Nano suffices. Git’s default text editor is Vim, which has a steep learning curve. We recommend you just set git config --global core.editor "nano" (notes).
2.2 Navigating the file system
• pwd: identify the “present working directory”
• cd: “change directory” to the named destination (e.g., cd <<destination>>)
• ls -lht: lists the current directory’s contents. The -lht options list the files in detail, with human-readable file sizes, ordered by time last modified.
• tab completion: You do not have to type a complete filename. Start typing the file name and hit the tab key, which completes the command or filename. This is especially useful for long or difficult-to-spell filenames.
• To recall a command from your history, type ctrl-R and then a fragment of the command to search for it
• hashtag comments: comments in the shell are set off by #. Add a comment to your command to tag it for easier retrieval via search in the future
• copy files using cp, move files using mv
• copy files across different servers using scp
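For example, to copy a results file from a remote server to the current directory (the hostname and path are illustrative):
scp username@server.example.edu:/home/username/project/results.csv .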
2.3 Navigating text
• ctrl-A jumps to beginning of line
• ctrl-E jumps to end of line
• ctrl-K kills content (cuts) from cursor to end of line
• ctrl-U kills content (cuts) from cursor to beginning of line
2.4 Piping and writing to file
A pipeline is a sequence of processes chained together by their standard streams, so that the output of each process feeds directly as input to the next one. Pipe using |.
To write output to a file, use >. This overwrites the file if it already exists. Use >> to append to an existing file.
• parse a directory listing using grep: the command
ls -l | grep 'key'
will output the directory listing and select only the lines containing the phrase “key”
• write hello world to a file:
echo 'hello world' > file.txt
• look for missing files in a numbered sequence:
ls ../output/isoindices_{1..500}.dta > /dev/null
Redirecting standard output to the null device suppresses the names of files that exist, so only error messages for the missing files in that sequence are printed. (The null device is a device file that discards all data written to it but reports that the write operation succeeded.)
2.5 Text processing
• cat: Reads files sequentially, writing them to standard output. The name is derived from its function to concatenate files. At the command line, think of this as “print the file”.
• head -n <N> <filename>: outputs the first N lines of the file; the default is ten lines
• tail -n <N> <filename>: outputs the last N lines of the file; the default is ten lines
• grep: returns all lines of a file matching a specified expression (use -v option to return all lines not containing the expression)
– how to detect invalid UTF-8 unicode/binary in a text file: grep -axv '.*' file.txt
• sed: stream editor with many functions; I mostly use it to substitute one expression for another
• tr: change or delete characters. It is useful for changing filenames (e.g. deleting whitespace).
• awk: find and replace text, print columns, a number of other text editing functions
– An intro to the great language with the strange name (Daniel Robbins, 1 Dec 2000)
– Why you should learn just a little Awk (Greg Grothaus, 29 Sep 2010)
• paste: horizontally concatenate files with equal number of lines
Combine these well and you get something like “Command-line Tools can be 235x Faster than your Hadoop Cluster”.
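As a small illustration of combining these tools (the filename and column positions are hypothetical):
awk '{ total += $2 } END { print total }' data.txt              # sum the second column of a whitespace-delimited file
awk '{ print $1 }' data.txt | sort | uniq -c | sort -rn | head  # ten most common values in the first column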
2.6 Text editing on a Linux compute cluster
Most often, you’ll be in an environment where you get to choose your own text editor. However, in some (i.e., confidential) computing environments, you will not be free to install arbitrary software. Nano, Emacs, and Vim will typically be installed everywhere, so it is worth knowing some basic info.
Nano is the easiest tool. If you need to make a few edits to a modest file quickly without looking up the text editor’s documentation, just use nano.
Emacs is a family of text editors, dating to the 1970s, that is “the extensible, customizable, self-documenting, real-time display editor.” But there’s a learning curve. Even the introductions can be overwhelming.
With regard to Emacs key notation, C means the “control” key and M means the alt/option key.
• Quitting/exiting: C-x C-c
• Saving: C-x C-s
• Copy and paste: The selected region is where your cursor is relative to where you set a mark. Set a mark with C-space. Then move your cursor to end of region and hit C-w to cut (kill in Emacs terminology) or M-w to copy. Paste using C-y (yank in Emacs terminology).
If you know how to use Vim, you are likely already familiar with the material in this introductory guide. If you accidentally wind up in Vim (e.g., Git often uses Vim as the default editor), exit Vim by pressing the Esc key, typing :q!, and pressing Enter.
2.7 Multiple “windows”
You might think that you need to connect to the server to create another session. Not so. You should SSH into the server once and use tmux (cheat sheet) to have multiple command-line sessions that will persist even if you get disconnected from the server.
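A minimal tmux workflow (the session name is arbitrary):
tmux new -s work       # start a session named "work"
# work, then press ctrl-b followed by d to detach
tmux attach -t work    # reattach after reconnecting to the server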
Alternatively, Screen is “a full-screen window manager that multiplexes a physical terminal between several processes, typically interactive shells.” Consider using this if you want, e.g., to run Stata interactively in the bottom half of your screen while working on your do file in a text editor in the top half.
2.8 SLURM (Simple Linux Utility for Resource Management)
SLURM is a job scheduler used on many computing clusters. Read the RCC introduction and then head over to the official documentation. To understand the RCC compute ecosystem and get further information on RCC usage, read the slides from the UChicago Slurm workshop. This two-page PDF lists almost all the relevant commands you might need.
• sbatch: this command submits jobs (.sbatch scripts) to the job scheduler on the cluster
– Instead of writing a separate .sbatch file for each script you might want to run using SLURM, you can pass the particular script/filename as an argument using the --export option.
– It is possible to set job dependencies in Slurm: Slurm will not start a job until the specified dependencies are satisfied. To set job dependencies, specify the dependency type and job ID in the --dependency option. A sketch of both options appears after this list.
• sinteractive: start an interactive session on the server
• squeue --user=username: list running and queued jobs for the specified user
• Our preferred squeue command is the following:
squeue --user=username --format="%.17i %.13P %.20j %.8u %.8T %.9M %.9l %.6D %R"
This provides a good bit more information about each job.
• rcchelp sinfo: produce a summary of the partitions on Midway
• How to only see really costly jobs:
rcchelp usage --byjob | grep '[0-9][0-9][0-9]\.[0-9][0-9][[:blank:]]|'
Without options, rcchelp usage --byjob provides a complete history of job submissions. Piping it to grep to select only lines containing a number in the form ###.## returns only jobs that used at least 100 service units.
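A sketch of the two sbatch options discussed above; the wrapper script, exported variable, and job ID are illustrative:
sbatch --export=scriptname="clean_data.R" run.sbatch                           # pass the script name to a generic wrapper
sbatch --dependency=afterok:12345 --export=scriptname="estimate.R" run.sbatch  # start only after job 12345 succeeds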
2.9 Mac OS X is Unix and POSIX-compliant, not GNU/Linux
Mac OS X is based on the Darwin operating system, which is based on BSD. It counts as UNIX and is POSIX-compliant. But it is not GNU/Linux. You will therefore run into annoying differences when trying to use some utilities, mostly when you see a solution on StackExchange that works on GNU/Linux but doesn’t work on Mac OS X. Here’s a short list:
• tree is not available in OS X by default; run brew install tree to install it via Homebrew.
• cut lacks the --complement option in OS X
• wc lacks the -L, --max-line-length option available in Linux.
• Unfortunately, sed differs considerably between GNU and OS X.
– The -i option syntax differs between GNU and OS X (see the example after this list)
– GNU sed supports \|, \+, and \? in regular expressions but OS X (and POSIX) don’t.
– Other warnings (1, 2): for example, the -E flag gives you extended regular expressions in OS X’s sed, while GNU sed traditionally uses -r instead.
• OS X uses BSD grep, which differs from GNU grep.
– When debugging an issue that stems from this, you can install GNU grep on OS X with brew install grep and invoke it with ggrep.
– BSD grep and GNU grep may respond differently to non-ASCII text encoding. On GNU grep, this produces the warning message [file]: binary file matches. If faced with this issue, use grep -a to interpret the file as text and produce the same behavior across distributions.
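An example of the sed -i difference noted above (the filename and pattern are illustrative):
sed -i 's/foo/bar/g' file.txt       # GNU sed: in-place edit with no backup file
sed -i '' 's/foo/bar/g' file.txt    # BSD (OS X) sed: an explicit (here empty) backup suffix is required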
2.10 Data compression (Mac vs Linux)
In macOS, using the Finder to compress files produces hidden files that are useless to non-Mac users. (For details, see https://perishablepress.com/remove-macosx-ds-store-zip-files-mac/.) In addition, the Finder does not use the ZIP64 extension, which may make large ZIP files unreadable by other machines. To avoid these issues, compress data using the Terminal command line instead of the Finder. The following Unix utilities are available to compress or decompress data:
• tar for the tar archive format.
• zip for the zip archive format.
• gzip for the gz archive format.
When compressing files, it can also be useful to store only the file names without their directory paths. With the zip utility, this is done with the -j option to junk the paths.
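For example, to build archives from the command line (paths are illustrative):
zip -j results.zip ../output/*.csv    # zip the CSVs, junking their directory paths
tar -czf output.tar.gz ../output/     # create a gzipped tar archive
tar -xzf output.tar.gz                # extract it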
2.11 Jumping between MacOS GUI and Terminal
A few tips if you’re using Terminal on your Mac but not working exclusively at the command line:
• You can drag the path of a Mac folder into Terminal (or Stata) by dragging the folder icon at the top of its Finder window into the Terminal prompt (via Luke Stein)
• Typing open . in any directory in the Terminal will open that folder in the Finder (via Florian Oswald)
2.12 Other resources
• https://unix.stackexchange.com/questions/6/what-are-your-favorite-command-line-features-or-tricks
• MIT’s Hacker Tools course
• UChicago RCC Workshops and Training
3. Notes on Julia
3.1 Julia resources
We recommend starting with the QuantEcon introduction to Julia.
Here are some helpful resources:
1. julialang.org: for installation and general info about Julia: blogs, publications, conferences
2. docs.julialang.org/en/stable/manual/introduction: excellent manual for Julia
3. lectures.quantecon.org/jl: excellent lectures on Julia with a macro vibe
4. github.com/bkamins/Julia-DataFrames-Tutorial: excellent tutorial on how to read/use DataFrames in Julia
5. juliabloggers.com: to keep up with advancements in Julia
6. johnmyleswhite.com: for interesting discussions about performance in Julia
List of very useful packages:
• DataFrames.jl: DataFrames in Julia
• ReadStat.jl: read dta files in Julia
• JLD2.jl: fantastic way of storing output in Julia
• Optim.jl: general optimization in Julia
• Calculus.jl: for automatic differentiation in Julia (very practical to check your analytical expressions of gradients and Hessians)
3.2 Notes on using Julia on UChicago’s Midway computing cluster
• You cannot connect to the external internet from Midway’s compute nodes. Julia’s packages are pulled from Github, so you must install packages in Julia (and in Stata) when running on a login node.
• You can Pkg.activate(".") on a compute node, but you cannot Pkg.instantiate() on a compute node. You need to install packages, define the Project.toml, Pkg.instantiate(), and precompile (by running the using commands once) on a login node before moving to the compute nodes.
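A minimal sketch of that login-node setup step, assuming Project.toml and Manifest.toml sit in the current directory:
julia --project=. -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'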
3.3 Notes on managing package dependencies in Julia code
Running Julia code will involve packages. Here’s how we manage them in our repositories.
• Project.toml and Manifest.toml
– These two files are central to Pkg, Julia’s built-in package manager. They make it possible to instantiate the exact same package environment on different machines.
– Project.toml describes the project on a high level. The package dependencies and compatibility constraints are listed in the project file.
– Manifest.toml is an absolute record of the state of the packages in the environment. This is not pleasant to read.
• In each repository, we create a task called setup_environment that defines all package dependencies using the Project.toml and Manifest.toml files.
• In a Julia script, we set the active environment using the following commands:
import Pkg
Pkg.activate("../input/Project.toml")
using package_name
While we only explicitly name Project.toml as an input in this script, this presumes that the corresponding Manifest.toml file is available in the same directory.
• In each Makefile that executes a Julia script, we provide these toml files as inputs by creating symbolic links to setup_environment:
../input/Project.toml: ../../setup_environment/output/Project.toml | ../input/Manifest.toml ../input
	ln -s $< $@
../input/Manifest.toml: ../../setup_environment/output/Manifest.toml | ../input
	ln -s $< $@
Note that we define Manifest.toml as a prerequisite for Project.toml. If a list of prerequisites is generated by parsing the Julia script for all files that start with ../input/, only Project.toml will be flagged. Making Manifest.toml a prerequisite of Project.toml ensures that both files appear in the input folder.
4. Notes on Stata
When using Stata, beware of Stata gotchas. Stata also has a number of shortcomings, such as no proper package management, requiring researchers to manually ensure replicability across package versions.
4.1 Command-line execution
For old versions (before Stata 16), a few Stata commands only work in “interactive” mode and not when Stata is invoked at the command line.
• graph export: “ps and eps are available for all versions of Stata; png and tif are available for all versions of Stata except Stata(console) for Unix; pdf is available only for Stata for Windows and Stata for Mac.” See https://www.stata.com/manuals13/g-2graphexport.pdf#g-2graphexport. We usually choose eps format.
4.2 Temporary files
Stata (prior to version 16) loads at most one dataset at a time, so opening a dataset causes Stata to discard the dataset currently in memory. This constraint may push you to consider saving lots of intermediate output or temporary files to the output folder of your task. We strongly recommend against doing this. Use Stata’s tempfile command instead.
Here is a trivial example
clear
use "exampw1.dta", clear
tempfile temp1 // create a temporary file
save "`temp1'" // save the data in memory to the temporary file
use "exampw2.dta", clear
merge 1:1 v001 v002 v003 using "`temp1'" // use the temporary file
4.3 Evaluating inequalities
In Stata, missing numeric observations take the value of positive infinity. Thus, when evaluating inequalities,
. > X is true for any X. I prefer using the inrange() function to avoid this issue. Consider the following two ways of generating a dummy indicating that x1 takes a non-negative value:
gen byte positive1 = 1 if x1>=0
gen byte positive2 = (inrange(x1,0,.)==1)
The first command will cause positive1 to be true if x1 is non-negative or if x1 is missing. By contrast, the second command will set positive2 to true only if x1 is non-negative and non-missing.
4.4 fontface
The Stata command to use Times New Roman in your PNG file is graph set svg fontface "Times New Roman". Apparently Stata groups all rasterized image file types into svg.
4.5 Data in memory after an error
Stata can be befuddling. If you use the preserve command, the data in memory after your script returns an error may not be the data that was in memory at the time of the error: there is a restore operation after the error.
Your merge that fails an assert may not show the observations that caused the assertion to be false:
clear
tempfile tf1
set obs 15
gen id = _n
save `tf1'
keep if inrange(id,1,10)
merge 1:1 id using `tf1', keep(match) assert(master match)
This fails because the assert is false. The unfortunate thing is that it is hard to debug because the ids 11-15 are not in memory, so when you look at the data after it breaks, you only see matches. It’s like it applies the keep() after failing the assert!
. tab _merge

                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
            matched (3) |         10      100.00      100.00
------------------------+-----------------------------------
                  Total |         10      100.00
5. CRI Servers for Medicare Claims Data
A guide to working on the CRI servers.
5.1 Accessing CRI servers
There are two servers: Randi and Randi Premium.
• randi.cri.uchicago.edu: All open-source software (e.g., Python, R, and Julia). Uses the SLURM job scheduler.
• randi-prem.cri.uchicago.edu: Commercial software like Stata. These are loaded and run directly on the node.
Our shell_functions.make is set to use the SLURM scheduler for Python, R, and Julia jobs, while directly executing Stata scripts. You can type make on Randi-Prem and the job will be routed correctly depending on the software. If you type make on Randi and it attempts to run a Stata script, it will (harmlessly) return an error.
First, connect to the UChicago VPN. Then use one of the approaches listed below. A more up-to-date version of these instructions can be found here; however, those instructions are not specific to the CRI servers.
• Terminal: (example with user username):
– ssh username@randi.cri.uchicago.edu or ssh username@randi-prem.cri.uchicago.edu
– Enter password if needed.
– To find your project files, navigate to your project labshare folder using cd.
– Location of project labshares on CRI servers: /gpfs/data/cms-share/duas/{dua_num}/.
• (PREFERRED) Visual Studio Code: allows you to open your code in a convenient text editor.
– Install Visual Studio Code: Instructions
– Set up a private key on your machine and the server by accessing the server directly (see above for direct access). A command-line sketch of the GitLab key steps below appears after this list.
∗ Add SSH key for gitlab on Randi
∗ View SSH key on Randi
∗ Copy ssh key to gitlab
∗ Make sure git config user name and email are set up
∗ Restart terminal
– Here are step-by-step guides to install the SSH extension in Visual Studio:
∗ Visual Studio SSH Documentation
∗ Visual Studio SSH Tutorial
∗ Third-Party Visual Studio Setup Guide
– Following these instructions, you should be able to SSH into Randi and Randi Premium and work on the terminal from Visual Studio normally.
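A sketch of the GitLab key-setup steps above, run from a terminal on Randi; the key type and email are illustrative:
ssh-keygen -t ed25519 -C "your_email@uchicago.edu"          # generate a key pair; accept the default file location
cat ~/.ssh/id_ed25519.pub                                   # view the public key and copy it into GitLab's SSH Keys settings
git config --global user.name "Your Name"                   # make sure your Git user name is set
git config --global user.email "your_email@uchicago.edu"    # and your email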
Documentation on CRI servers (for old servers):
• Wiki
• Mathew Stephens’ GitHub Gardner Page
Requesting assistance from IT:
• Randi support is currently available through Slack (https://criscientific-dzi9891.slack.com)
5.2 Best practice for coding on the CRI servers
CRI policies forbid remotely mounting the lab share on your desktop.
Here are strategies to be able to write code while checking it in an interactive session:
• Visual Studio - Preferred
– You can open your code in a convenient text editor while running the lines in the terminal in an interactive session (see instructions below). From the terminal in Visual Studio, type code -r yourcode.do to open the file in the same window. Note that you cannot be on a compute node to use commands like code -r yourcode.do. You can edit the code from the text editor, save and then run it from terminal using the job scheduler. You can also check it interactively as you write code by opening another terminal, getting to a compute node as described below, and opening the necessary software in the terminal.
• Doing everything from the command line [to investigate further if necessary]
– Using tmux with one terminal to edit the code from a command line text editor and the other terminal to run the code interactively
To run code interactively:
• Get to a compute node by typing srun --pty bash -l (you can add additional options to allocate memory, a number of CPUs, etc.; see the documentation for srun and the sketch after the warning below).
NEVER run any analysis from a login node!!
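A minimal srun invocation with explicit resource requests (the values are illustrative):
srun --pty --mem=16G --cpus-per-task=4 --time=02:00:00 bash -l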
• On Randi: example to run R
– Load the necessary modules: module load gcc/11.3.0 R/4.2.1 (You can check available modules by typing module avail. You can check details on a particular module by typing module spider R/4.2.1)
– Open R by simply typing R. This will open R in the terminal, and you can run R code directly from there.
• On Randi Premium: example to run Stata
– Load the necessary module:
module load stata/18
You can check available modules by typing module avail. You can check details on a particular module by typing module spider stata/18.
– By default, the stata command will invoke Stata Basic Edition. Optionally, create an alias for Stata to run a specific version. This command adds an alias to your ~/.bashrc file to run Stata-MP instead:
echo "alias stata='stata-mp'" >> ~/.bashrc
– Reload the shell configuration file to apply the alias:
source ~/.bashrc
– Now, you can open Stata by simply typing stata. This will open an interactive Stata session in the terminal.
To view and move outputs: use Cyberduck or Filezilla.
• Cyberduck: download here.
• Filezilla: download here. Use Port 22.
• To move files, you can also type from your computer’s terminal (example for user username): scp -r username@randi-prem.cri.uchicago.edu:location_of_file/yourfile.eps location_on_your_computer
5.3 Parallel teamwork on CRI servers
Do not work directly in the main repository at /gpfs/data/cms-share/duas/{dua_num}/{repo_name}/. A starter guide:
1. Create a folder with your username in our DUA folder (example with user username):
mkdir /gpfs/data/cms-share/duas/{dua_num}/{repo_name}_local_repos/username. Creating a separate folder with a copy of the repo for each user allows us to work in parallel without getting in the way of each other when we switch branches, for example.
2. Clone the repository in your user directory. Navigate to your personal folder and type:
git clone https://rcg.bsd.uchicago.edu/gitlab/medicaid-max/{repo_name}{dua}{dua_num}.git
to clone the repository via HTTPS.
Communicating with Git over HTTPS requires a personal access token in place of a password, which can be generated on GitLab.
3. To avoid reproducing outputs of tasks that are computationally demanding, we link to the main repository’s outputs (/gpfs/data/cms-share/duas/{dua_num}/{repo_name}/) by modifying Makefiles locally. To do so, use the Makefile personal_repos_outputs.mak in the task tasks/setup_environment/code by running: make -f personal_repos_outputs.mak main_repo_paths.
!!WARNING!!: do NOT commit these changes to the Makefiles in your personal repository.
To avoid inadvertently committing these changes when submitting a PR, undo these changes by running
make -f personal_repos_outputs.mak main_repo_paths_UNDO in the task tasks/setup_environment/code. After this, running git status will only list Makefiles for which these changes have been inadvertently committed to the branch.
After this, you should be all set.
5.4 Additional Workflow notes
The organization of the repository is identical to the GitHub organization. To note:
• Package installation: tasks/setup_environment/code
• Creating functions so that calling R, Python, or Stata loads the necessary modules and uses the job scheduler:
– Bash file to use the job scheduler (SLURM) on Randi and Randi Premium: tasks/setup_environment/code/run.sbatch
– Shell script defining functions loading the necessary modules and using the job scheduler (using bash file above): tasks/shell_functions.sh
– Makefile to define how to call on shell script functions in the Makefiles in tasks: tasks/shell_functions.make
You can link to current versions of output files without re-creating them in your personal copy of the repo.
• The repo copy rerun in /gpfs/data/cms-share/duas/dua_num/{repo_name}_local_repos/ contains a recent run of all the code for the project, including output folders.
• You can symbolically link from some of the upstream outputs in rerun to your folder and use those outputs to run downstream code. The list of tasks that can be linked to is specified in personal_repos_outputs.mak.
• To do so, go to setup_environment/code and run make -f personal_repos_outputs.mak main_repo_paths. This will update Makefiles in your repo. Do not commit these changes!
• Run the downstream task you want to work on. This will create the symbolic links.
• Return to setup_environment/code and run make -f personal_repos_outputs.mak main_repo_paths_UNDO. This will undo the changes to Makefile links.
5.5 GitLab
Our repository is on GitLab, hosted by the BSD in the Medicaid MAX group:
• Sign-In
The workflow is similar to the GitHub workflow. However, we communicate on issues through GitHub. GitLab is used to write code and submit merge requests.
When submitting a merge request, if outputs are created that should appear in the logbook or slides:
1. Submit the merge request on GitLab. After it has been merged, you can work on the public-data repo.
2. Create a new branch on the public-data repository on GitHub with a branch name identical to the one on GitLab.
3. !! Check that anything that is moved out of the CRI servers meets the CMS minimum-patients criteria for patient-identifiable data !!
4. If it meets the minimum-patients-per-cell criteria, move and commit the necessary outputs to this public-data repository branch.
5. Submit a pull request on GitHub.
5.6 Modules for interactive session
Always get to a compute node before running code interactively: srun --pty bash -l.
On Randi:
• Python: module load gcc/12.1.0 python/3.10.5. Then type python3.
• R: module load gcc/11.3.0 R/4.2.1. Then type R.
On Randi Premium:
• Stata: module load stata/18. Then type stata.
5.7 Notes on Stata batch jobs with SLURM on Randi Premium
Our research infrastructure used to force Stata jobs to run interactively, while using the SLURM scheduler for Python, R, and Julia batch jobs. The interactive jobs helped us get around a problem with SLURM environment variables that prevented us from submitting Stata batch jobs. However, we have now set up a way to run Stata jobs with SLURM as well. This section describes the SLURM problem and how we worked around it.
To submit SLURM jobs, we used to use a bash script shell_functions.sh that contained the following line:
sbatch -W --export=command1="$command1",command2="$command2" [...] run.sbatch
Here, $command1 and $command2 are previously-defined variables containing instructions to load the necessary modules and run the script. The --export flag passes these to the SLURM job as environment variables. While a SLURM job typically inherits environment variables from the parent shell, using the --export flag removed some environment variables that would otherwise get passed to the SLURM job. This included the removal of environment variables that define the location of Stata on Randi-Premium. So when we used the --export flag, SLURM jobs would not know where to find Stata, and they would fail.
As a workaround, we now call export command1 and export command2 in the bash script before calling “sbatch -W [...] run.sbatch”. This ensures that these are added as environment variables locally, which then get inherited in the SLURM job. By not using --export, the other necessary environment variables (like the location of Stata) are still inherited in the SLURM job.
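A minimal sketch of this workaround (the module name and do-file path are illustrative; see shell_functions.sh for the actual definitions):
command1="module load stata/18"
command2="stata-mp -b do ../code/analysis.do"
export command1 command2    # set both in the parent shell so the SLURM job inherits them
sbatch -W run.sbatch        # no --export flag, so the job also inherits the rest of the environment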
In implementation, calling $(STATA) in a Makefile will run Stata in batch mode. If you want to run Stata interactively, you can use the $(STATA_INTERACTIVE) variable in a Makefile. See shell_functions.make and shell_functions.sh for details.
Supported by the National Institute on Aging grants #P01AG005842 and #3P01AG005842-32S1