Aging Servers - version control usage - Git
The modern workflow for storing, managing, and updating code while keeping prior versions and tracking changes between them relies primarily on "version control" tools such as git or subversion (svn). Increasingly, researchers use public web platforms such as github.com, Bitbucket.org, gitlab.com, or git.nber.org (based on Gitlab) for git, and/or tools such as TortoiseSVN or VisualSVN for svn. A key advantage of these facilities is that researchers can share their code publicly, or have a collaborating team work on the code for their analysis. Typically, a researcher creates an ssh key. Using this key and the ssh "hook" that these web platforms provide, they clone or branch a repository hosted on the platform onto the NBER servers, and use the same ssh hook to commit and push changes back to that repository (the origin repo).
The NBER's "Aging" computing environment hosts confidential data such as the CMS' Medicare and Medicaid claims data. Data agreements for these data require us to block any transfer of data between this environment and outside. Under these settings, it is not possible to directly operate with these public version control web platforms. So, what can be done?
There are two methods that we propose.
Method 1: This method is slightly convoluted and may be less convenient, but it allows indirect use of these public web platforms. It relies on Moving-files-aging-network and on rsync. The NBER has a set of servers (nber0.nber.org ... nber12.nber.org), similar to the servers in the "Aging" network, that sit outside the "Aging" environment and allow interaction with the external environment using ssh. A potential workflow under this method would look like the following:
- Log in to one of the servers outside the "Aging" network, e.g. nber1.nber.org, using the same username you use when you RDP to agerdp1 or agerdp2. Note that you cannot log in (ssh) to the "non-Aging" servers from inside your "Aging" RDP session. This is by design. You will need to start a separate ssh session on your device, outside your RDP session.
- Create a project folder - let's call it project1 (mkdir project1), and cd project1
- Run git clone. For example, if you are using github, a command you can use is "git clone ssh://git@ssh.github.com:443/YOUR-USERNAME/YOUR-REPOSITORY.git my-working-copy". This gives you a folder my-working-copy inside your project1 folder. (Note that YOUR-USERNAME and YOUR-REPOSITORY.git refer to the account and repository you have already established on github.)
- Now, hop over to your RDP session on the "Aging" network - agerdp1 or agerdp2.
- cd to your code folder for your project. e.g. cd /disk/agediskX/medicare.work/PI-DUA#####/your-DUA-username/project1/code
- rsync -av /disk/exthomedirs/your-DUA-username/project1/my-working-copy/ /disk/agediskX/medicare.work/PI-DUA#####/your-DUA-username/project1/code/ This syncs what is in your "project1/my-working-copy" on the "external to Aging" NBER servers to your "project1/code" folder in your DUA folder. (The -a option makes rsync recurse into the folder; without -a or -r, rsync skips directories and copies nothing.) An important point to note is that on the "Aging" side, everything under /disk/exthomedirs/... is read-only. This means that if you edit your code in the "project1/code" folder inside your DUA folder, you will not be able to sync it back into the "project1/my-working-copy" folder under /disk/exthomedirs. The sync can only go one way. If you add --delete (as in rsync -av --delete .....) to the command, it will delete any file in the project1/code folder in your DUA folder that is not in the project1/my-working-copy folder. So, for example, if you also save log files in your code folder, they will be wiped out if you use the --delete option.
- A slightly different workflow avoids the rsync step above. Instead of creating a code folder in your DUA folder, you can run your code directly from the /disk/exthomedirs/.... path. However, you must ensure that all output is redirected to your DUA folder on the "Aging" network, because /disk/exthomedirs is read-only. For example, if you run a stata job as "stata -b /disk/exthomedirs/.../x.do &", stata will try to create a log file at /disk/exthomedirs/..../x.log and will fail. So you would have to use "log using ...." to redirect your output elsewhere.
Method 2: Direct command line operations on the "Aging" servers. Git is installed on our servers. So is SVN. A potential workflow under this method would look like the following: Suppose you are a team of 2 users, A and B, on a DUA, working jointly on a project, and you want to store all your code in a shared repository. Let's say the DUA folder is /disk/agedisk1/medicare.work/pi-DUA12345 and you have a project folder project1. In this scenario, here is how you might set up a repository and use it:
- cd /disk/agedisk1/medicare.work/pi-DUA12345/project1
- mkdir project1-code-repo
- cd project1-code-repo; git init --bare
- The above is a one-time setup of a repository named "project1-code-repo". It resides in the folder /disk/agedisk1/medicare.work/pi-DUA12345/project1
- User A would then do the following: cd /disk/agedisk1/medicare.work/pi-DUA12345/A/project1
- git clone file:///disk/agedisk1/medicare.work/pi-DUA12345/project1/project1-code-repo code-working-copy This clones project1-code-repo into the code-working-copy folder of User A's project1 folder inside their DUA folder.
- cd /disk/agedisk1/medicare.work/pi-DUA12345/A/project1/code-working-copy
- Use git status to check the state of this copy relative to the master (the shared project1-code-repo, as of your last pull or fetch).
- Run git pull to sync with the master.
- Edit an existing code file as you normally would, or create a new code file.
- If you create a new file and want it added to the repository, use the command git add <newfile name>
- git commit -m "Comment describing the edits you made" <newfile name> This command commits the changes you made to the file.
- git push This pushes any changes you committed (edits to existing files, or new files added with git add) to the master.
- User B would then do something similar to User A. cd /disk/agedisk1/medicare.work/pi-DUA12345/B/project1
- git clone file:///disk/agedisk1/medicare.work/pi-DUA12345/project1/project1-code-repo code-working-copy
- Now this is User B's working copy.
- If either user A or user B edits, updates, or adds files in their working copy and then adds, commits, and pushes them, those updates are submitted to the master.
- If either user A or user B runs git pull from their working copy folder, it pulls all updated and new files from the master. A good practice is to start with a git pull before you make your own edits.
- You can keep other files in your working copy that are not added or committed. They remain in your working copy but are not available from the master repository.
- We are only providing a very simple outline of usage. There are more sophisticated git operations, such as creating branches. We will not give all the details.
- Finally, git gui is also available on the "Aging" servers. You can get to it either from the command line or from the Linux/Gnome Desktop: top right "Applications" -> "Programming" -> "Git GUI". This GUI does not have all the bells and whistles of a public git platform, but can still be useful.
- This method keeps all files local to the "Aging" environment. Since the folders are identical across all of the servers, you can run your git commands on any of the servers. When you want to publish your code, you would have to submit it for output review.
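The Method 2 steps can be tried end to end in a scratch area before setting up the real repository. The sketch below is only an illustration under stated assumptions: it builds everything under a temporary directory instead of /disk/agedisk1/..., and the file name analysis.do and the user names and e-mail addresses are hypothetical. The first push uses "git push -u origin HEAD", which is one way to make sure the first push into an empty bare repository also records the upstream branch; a plain git push may behave the same, depending on the git version and configuration.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch area standing in for the DUA project folder.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/A" "$DEMO/B"

# One-time setup of the shared bare repository (the "master" in the text).
git init --quiet --bare "$DEMO/project1-code-repo"

# User A: clone, add a file, commit, and push.
git clone --quiet "file://$DEMO/project1-code-repo" "$DEMO/A/code-working-copy"
cd "$DEMO/A/code-working-copy"
git config user.name  "User A"               # hypothetical identity
git config user.email "a@example.invalid"
echo 'display "hello"' > analysis.do
git add analysis.do
git commit --quiet -m "Add analysis.do"
git push --quiet -u origin HEAD              # first push; also sets the upstream

# User B: clone, edit the same file, commit, and push.
git clone --quiet "file://$DEMO/project1-code-repo" "$DEMO/B/code-working-copy"
cd "$DEMO/B/code-working-copy"
git config user.name  "User B"               # hypothetical identity
git config user.email "b@example.invalid"
echo '* edited by B' >> analysis.do
git commit --quiet -am "Annotate analysis.do"
git push --quiet

# User A: pull B's change before making further edits.
cd "$DEMO/A/code-working-copy"
git pull --quiet --ff-only
git log --oneline
```

After the final pull, User A's working copy contains both commits and B's edit. The --ff-only option is a cautious choice: the pull stops instead of creating a merge if A also has unpushed commits, which makes conflicts visible early.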