Lesson preamble

Lesson objectives:

  • Learn a bit about how version control and Git works
  • Learn the 4 main Git commands and the Git-RStudio interface
  • Know why a standard/consistent folder and file structure is important
  • Project work

Lesson outline:

  • Version control using Git (35 min)
    • (Basic) introduction to Git
    • Common commands for Git
  • Quick introduction to using GitHub (10 min)
  • Folders and files structure (15 min)
    • R projects in RStudio
    • Using Git in RStudio
  • Project work
    • Get started/work on assignment 7 (20 min)

Git

A short introduction

PhD Comics on filenaming

PhD Comics on filenaming

The comic above is teasing about prematurely naming your thesis “FINAL”, but it also highlights a common method that many people use to keep track of different versions of their files. This method involves making a copy of a file, naming it differently from the original, and editing the contents of that new file, repeated as many times as to get the document finished. This is the basic concept of “version control”, though quite primitive. More formal and structured methods of version control keep track of what exactly changed, when, and by whom (like how Google Docs). Git is one such system of version control. Git is specially powerful when your project involves more than just a single file or has any type of code or data involved.

Git is a software that keeps detailed tracking of your files and folders, keeping a history of exact characters that have been changed. It’s very powerful, open source, and is very popular among software developers, programmers, R coders, and scientists. This is one of the main reasons why we will be learning Git, because it has this massive online community, support, and documentation. Git is especially powerful with collaborating on more complex projects (such as a research project).

Git works by using several commands to track files (‘add’), save (‘commit’) them to history, check the ‘status’ of your folders and files (in a project/parent folder, called a ‘repo’), and see the history (the ‘log’) of what was saved. You can do some much more advanced and powerful things in Git, but this lesson we’ll focus on the basics (which you many only ever use).

Exercise

Follow the getting setup steps from command line:

Replace “Your Name”, with, well, your first and last name. Replace the fake email address with the email address you normally use.

Confirm it works by running:

The basic commands of Git

There are a few commands that you will use regularly when you start using Git. In fact, you may never need to use more than this depending on what you do.

  • git add
  • git commit
  • git log
  • git status

For your final assignment, you will be using a few more commands that we will cover a bit in this lesson and much more next week.

  • git remote
  • git push
  • git pull
  • git checkout
  • git branch (maybe)

But first, let’s focus on the first four basic commands.

  • git add tells Git to start ‘tracking’ or watching the file(s) you point to and put them into the ‘staging area’ (see the image below).
  • git commit tells Git to take the changes made to the files you select and save them into the history, moving files from the ‘staging area’ into the history.
  • git log shows you the history of your git repository (all the changes made to the files; this is all contained in the .git/ folder).
  • git status gives you a brief overview of what is going on in your git repository.
Image of the Stages from the Git Basics Book

Image of the Stages from the Git Basics Book

RStudio has an excellent Git interface, mainly for the basic things. However, for slightly more complicated things you will need to use the terminal. So we’ll start by learning how to use the terminal, then switch to using the RStudio interface.

We’ll go through the process of creating a Git repository (repo) and using the basic four commands during class.

Exercise

Create a new Markdown file called bio.md in the Git repo we created in class. Add your name and one or two sentence blurb about yourself. Using the command line, add this new file and commit to the history. Check the status and log.

GitHub and your project

GitHub is a commercial, Git repository hosting service. For public repositories, GitHub is free. (yeay open source!) GitHub is an extremely popular site for R packages, with many of the most popular and powerful packages all on there (all of the tidyverse packages are there, like dplyr and ggplot2). Your final assignment is also on GitHub. When collaborating on GitHub, generally you’ll need to make a ‘fork’ of the original repository. In GitHub, a fork is a copy of the original repo. While you could make edits to the files in the GitHub repo using GitHub, its much faster to download (‘clone’) the repo onto your computer to edit there. The image below shows the different repos and their interactions. Because you are downloading (‘cloning’) your Forked GitHub repo, you need to tell your computer where to look for the fork and original GitHub repo. That’s what the ‘remote origin’ (your fork) and ‘remote upstream’ (the original) mean. Git stores these links to the GitHub repos in these ‘origin’ and ‘upstream’ pointers.

We’ll all make a fork of the team final assignment. Then we’ll use git clone to download that repo onto your computer.

Exercise

Because we’ll be using GitHub and the final assignment team repositories, you’ll need to add something to your Git settings. Add this remote (link) to your Git repo of the final assignment:

File and folder structure

You’ll notice in your final assignment folder that there are several files that already exist. Many of these files are standard files that are included when you make your own R package. The reason we are using it for this assignment is because it provides an infrastructure to be able to build and create simple to complex projects, in a way that is standardized and already well-documented. Here’s a description of each of the files/folders:

  • R/ folder will contain custom functions and other bits of R code that you’ll use for your final report in the doc/ folder. Sometimes your code gets too complex and you need to break it down into multiple (or one) functions. Those functions you will save in this folder.
  • data/ folder will contain any small to medium sized datasets you may use or create. Most of the practice datasets are too massive to store on Git/GitHub, so what you’ll need to do is get everyone to have a copy of that dataset saved in the data/ folder but don’t commit it to the Git history. This is also a great place to save results output, since you probably don’t want to run analyses on your massive datasets every time to create your results.
  • doc/ contains all the R Markdown documents relevant to finishing your final report.
  • .Rbuildignore is a specific file for when you want to load all the R files in R/. We’ll talk about this more as you do your final report.
  • .gitignore is a special file used by Git to tell it to ignore certain files from tracking, such as files or figures that get generated from R code and data that you don’t need to save.
  • DESCRIPTION is a file that contains some metadata that is in the format used when making R packages. It contains information on who the authors are, details about the project, what R packages you need to install for the project, and more.
  • README.md is, as the name says, a file to read. GitHub by default will show this file as the first file to view in a repo. The README contains important and detailed information about the project.
  • *.Rproj tells RStudio that this folder is a project. You open this when you work on your project.

R projects and Git in RStudio

R projects are files that set a standard for how you and others use the files in your repo, when they use RStudio. Opening this file opens RStudio with the options and settings for your project.

Let’s create a new R project called testing. We’ll also create a Git repo in this project. The nice thing about using this project style is that the Git interface is quite nice. Create a new file called bio2.md and switch to the Git interface (either the Git tab or Git button at the top). You’ll see that the common commands (status, add, commit, log) are all there. Let’s add and commit this new file.

Exercise

Add your name and a random string of text to the bio2.md file. View what was changed on the RStudio Git interface. Add and commit that file. Then look in History to see what you changed and what was changed in the file.

Now, add more lines of text and also delete some of the original text from the file. Check out the Git interface and see what was changed. Add and commit. Check the history.

Project work exercises

Open your final assignment Rproj files, so that RStudio opens in that folder. Try to work through some of Assignment 7 (on the website), using RStudio and the Git interface.

Resources

  • Excellent tutorial on using Git from Software Carpentry. First five sections on the Software Carpentry website are relevant for this lesson.

This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.