This article is part of the R for Researchers series. For a list of topics covered by this series, see the Introduction article. If you're new to R we highly recommend reading the articles in order.

Overview

This article will introduce you to R commands, R programs called scripts, and Git central repositories.

This introduction covers commands used to get help and prepare an R session for your work.

Git central repositories are useful to provide a backup of your project and to coordinate the work done on multiple computers. Eventually most of us will need to work with others on projects. These Git tools will allow you the freedom to work on your part independently while easily coordinating with the others.

Preliminaries

You will get the most from this article if you follow along with the examples in RStudio. Working the exercises will further enhance your skills with the material. The following steps will prepare your RStudio session to run this article's examples.

  • Start RStudio and open your RFR project.
  • Confirm that RFR (the name of your project) is displayed in the upper left corner of the RStudio window.
  • Confirm that there is a Git tab in one of the tab panes.
  • Copy the Datasets folder with all of its content from "X:\SSCC Tutorials\RFR" on the SSCC network and paste the folder into your RFR project folder. If you are accessing this article from the web, you will need to create a Datasets folder in your RFR project directory. Then copy the files individually from RFR/Datasets and save them to your Datasets folder.

Workflow

R is an interpreted language. This means R commands can be entered and run individually, without being part of a compiled program. This allows great flexibility to interactively explore and analyze data. While exploring data interactively is effective, a researcher's work also needs to be reproducible. The workflow used in this article series allows for both interactive exploration and reproducibility. We achieve this by doing our exploration in scripts and keeping those scripts under source control.

Our typical workflow will be:

  • Open a script, either new or existing.
  • Add or change a single command, or a few commands, in the script.
  • Run the commands from the script.
  • Update the commands to repair errors or adjust the analysis and re-run them.
  • Repeat the add and update cycle as you explore your data and build an analysis.
  • When you have some functionality working, add the R code to R Markdown and write the supporting text in R Markdown.
  • Commit the changed source files.

The R Markdown and source control steps are important for their roles in reproducibility. We will practice these steps through this article series to help you incorporate them into your work habits.

Some will find it more natural to interleave creating chunks of R code with writing the accompanying document text. Others will find it more natural to write all or most of the R code and then write the document. The nature of your research project might also influence when to create the Markdown document. When your document is highly influenced by the analysis, delaying writing the document may work better. When the document form is not dependent on the analysis, there may be benefits to writing the document more in parallel with the analysis. With either approach there are typically some adjustments to the R code and Markdown document at the end to clarify the results of the analysis in the paper.

R commands and functions

The console is where R commands are entered and run, and where text results are displayed. Commands can be entered into the console at the prompt or sent to the console from a script or R Markdown file. To support reproducibility, we will enter our commands in a script and then send them to the console. Even though we will not be entering many commands at the console, you will need to know a little about how the console works to use it with your scripts.

The basics of using the console are as follows.

  • > is the command prompt. R will not display the command prompt until it has completed running the prior command. If the prompt is not displayed, R is not ready for a new command.
  • + is the prompt for the continuation of a command. If R reaches the end of a line and the current R command is not complete, R assumes the next line continues the prior line. Splitting some commands across multiple lines can improve the readability of your source code by allowing the structure of the command or data to be seen visually.
  • The escape key will cancel the command being entered. This is handy if R thinks the current command is not finished and you see an error in what has already been entered.
  • The up and down arrow keys are used to scroll through the history of prior commands. A prior command can be recalled from the history, edited if needed, and then run again.
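The continuation prompt is easiest to understand with a command that is deliberately split across lines. The following is a minimal sketch; the object name total is only illustrative. If these lines are typed at the console, R shows + at the start of the second and third lines because the opening parenthesis has not yet been closed.

```r
# A single command split across three lines. At the console, the
# second and third lines would be entered at the + continuation prompt.
total <- sum(1,
             2,
             3)

# Display the result of the completed command.
total
```

Pressing the escape key while the + prompt is showing abandons the unfinished command and returns you to the > prompt.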

Commands

R commands are similar to commands from general computing languages like C++ or Python. This is a little different from the syntax of languages such as Stata or SAS. R commands typically either assign values to an object or control which commands will get run. R objects will be covered in the next article. For now you can think of an object as a variable.

Functions

An R function is similar to a Stata or SAS command. A function performs some action and the action taken is adjusted based on the parameters given. In this series we use the term command in a very loose sense to refer to a function as well as to formal R commands.

  • Syntax and use of functions

    functionName(parameterList)

    functionName is the name that identifies the function in R.

    parameterList is a list of parameters. Parameters in the list are separated by commas. Parameters can be identified by either their position in the list or by a name. In most instances using the parameter name enhances the readability of your code. We will primarily use parameter names in this article series. There is one case where we will drop the use of parameter names. This is when the function name makes it clear what the first parameter is. An example of this is provided below with the help function.

    An R function returns an object as its result. The returned object must be saved if it is to be used again.

    The use of "()" directly following a name identifies the name as a function and the contents of the parentheses as parameters. The parentheses are needed for a function even if there are no parameters. Parentheses can also be used in an expression to identify order of operations. When used for order of operations, the parentheses will not directly follow a name. In the Data preparation article you will also see the use of the square brackets "[]". When "[]" follows a name, the name identifies a data object and not a function. The use of these brackets will be demonstrated further in this and the following lesson.
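The rules above for positional parameters, named parameters, parentheses, and square brackets can be sketched with a few base R functions. This is only an illustration: round(), seq(), and c() are standard R functions that the article has not introduced yet, and the object name values is hypothetical.

```r
# Parameters identified by position.
round(3.14159, 2)

# The same call with the parameters identified by name.
round(x = 3.14159, digits = 2)

# Named parameters can be given in any order.
seq(to = 10, from = 1, by = 3)

# Parentheses are required even when there are no parameters.
getwd()

# Square brackets after a name select from a data object.
values <- c(10, 20, 30)
values[2]
```

The two round() calls return the same value. The named form is longer, but it documents what the 2 means.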

Expressions

An expression in R is any text which, when interpreted in R, results in a data object. Beneath this simple definition is one of the powerful constructs of R: an expression can be used anywhere that a value is expected. This allows simple functions to be linked together to do much more sophisticated operations. Building up results by combining expressions in this way will be demonstrated throughout this article series as we learn more commands.

Expressions include numeric, logical, or character values with their associated operators. These values can be a variable, a constant, or the returned value from a function.

  • The numeric operators include +, -, *, /, ^, log(expression), exp(expression) which are addition, subtraction, multiplication, division, exponentiation, natural log, and the constant \(e\) raised to the power given by expression, respectively.

  • The logical operators include: == the logical test for equality, < less than, <= less than or equal, > greater than, >= greater than or equal, | logical or, & logical and.

  • The character operators are more special purpose and will not in general be covered. A few of them will be introduced where needed in subsequent articles.
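As a small sketch of the operators listed above, the following lines can be run one at a time. The last line shows an expression used where a value is expected: the result of a division is passed directly to round(), a base R function used here only for illustration.

```r
2^3                   # exponentiation
log(exp(2))           # exp() and log() undo each other
(4 > 3) & (2 == 5)    # logical and: both sides must be TRUE
(4 > 3) | (2 == 5)    # logical or: either side may be TRUE

# An expression used where a value is expected.
round(log(10) / 2, digits = 3)
```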

Our first commands

The first commands you will need are provided below.

  • Implied display command. This is used to display an R object at the console.

    expression

    The value of expression will be displayed on the console.

    Enter the following at the console.

    (3 * 5 + 1) / 2

    The following will be displayed at the console

    [1] 8

    Note, since the "()" did not follow a name, they were used as grouping operators and not as a means of identifying a parameter list.

  • Assignment command

    object <- expression

    Object is set to the value of expression.

    • <- is the assignment operator.

    Enter the following at the console.

    x <- (3 * 5 + 1) / 2

    Note, nothing was displayed. To see what was assigned to x, you would have to enter "x" at the console.

  • A Comment is text to remind yourself, and others, of how to use your code and how it works. Comments are ignored (not treated as commands) by R.

    # reminder text

    • Comments start with a # and continue to the end of the current line. Comments can start at the first character of a line or may follow a command.
  • The help and example commands provide assistance with an object. help provides a description of an object. The object may be a function or dataset. example shows an example of the use of a function.

    help(topic)

    • Help results are displayed in the help tab.

    example(topic)

    • Example text is displayed in the console.

    help() and example() are functions. R is case sensitive, so the names must be entered in lower case. There are a number of parameters which could be used with these functions. We are only interested in the first parameter of these functions, which is named topic.

    We will use help to get some information on the read.table function

    • Enter the following at the command prompt in the console.

      help(read.table)
    • Press the enter key to run the command.

    • Select the Help tab in the tools pane.

    • This could have also been entered as help(topic=read.table). The use of "topic=" does not enhance the readability of the code. This is one of the cases where we will identify the parameter by position.

    • A description of the function and its parameters will be displayed as is seen in the image below.

      Help tab
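The display, assignment, and comment commands introduced above can be combined in a few lines. This is a minimal sketch; the object names x and y are only illustrative.

```r
# Assign the value of an expression to x; nothing is displayed.
x <- (3 * 5 + 1) / 2

x            # implied display of x

y <- x * 2   # a comment may follow a command on the same line
y
```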

R's commands (functions) were written by many individuals over a large number of years. No central authority exists which controls naming conventions. This has led to differences in names for similar objects (functions, parameters, etc.). This causes no problems for R, but a little extra time is typically needed to learn R's parameter and function names. Use the help function to remind yourself as you learn the names in R.

Scripts

An R script is a series of commands in a file. R scripts have a file extension of .R. Scripts are ordinary text files and can be written using any text editor. We will use RStudio's editor to write our scripts. Using R or RStudio's editors makes it easy to work interactively, running commands as they are written. The editors in R and RStudio do not automatically save changes made to a script. You will need to save on a regular basis when you use either of these editors.

Keeping your R scripts to a reasonable length makes them easier to work with. It is easier to find code if the file is smaller. Also you will likely want to use some, but not all, of your code in your R Markdown files. Segmenting your work with this in mind will also make creating your Markdown documents easier. Multiple scripts can be collected in a single script. This allows all your code to be run together. Running a script from another script will be demonstrated in the Data presentation article using the source() function.
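As a brief preview of collecting scripts, the sketch below writes a tiny helper script to a temporary file and then runs it with source(). The temporary file is only a stand-in for a script kept in your project folder, and the names helperPath and helperValue are hypothetical.

```r
# Write a small helper script to a temporary file (a stand-in for a
# script that would normally live in your project folder).
helperPath <- tempfile(fileext = ".R")
writeLines(c("# helper script", "helperValue <- 2 + 3"), helperPath)

# Run the helper script from this script.
source(helperPath)

# The object created by the helper script is now available.
helperValue
```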

We are going to create a script for you to enter and run the example commands from this lesson.

  • Open a new R script file.

  • Save it as SalAnalysis.

Setting up your R session

An R session is started when R or RStudio is started. R creates a workspace for each session. The workspace contains the objects that R knows about. Your data and functions are added to the workspace as you create them in your session. R loads a set of core functions as part of starting a session.
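To see what the workspace contains, base R provides ls() and search(). The following is a minimal sketch, and the object name demoValue is only illustrative.

```r
demoValue <- 42        # add an object to the workspace
exists("demoValue")    # check that R now knows about this object
ls()                   # names of the objects in the workspace
search()               # attached packages, including the core ones
```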

There are two aspects to getting your session ready to use. The first is loading any non-core commands you need. R commands (functions) are grouped in packages. A package typically includes a set of related functions. It is typical to load at least a few packages when starting R. The second aspect of session preparation is setting the work directory, where R will look for files.

Packages

The packages which make up the core functions and commands of R are loaded when R is started. There are many packages which extend R's commands beyond the core commands. These extension packages need to be loaded in each R session before you can use the functions they contain. The functions in these extensions range from widely used functions to obscure functions used by only a small number of people.

A package needs to be installed on your computer before you can load it into your session. R and RStudio manage a library of packages that have been installed on your computer.

Winstat has a number of common packages installed for you. The packages installed in your library can be seen in the Packages tab.

Installing a package is shown here with the ggplot2 package being installed.

  • Select the Packages tab. The tab should look like the image below with a list of packages already in your library.

    Packages tab

  • Select the Install icon. An Install Packages window will open. Enter the package you want installed in the Packages box. The text will autocomplete once you have entered enough characters to distinguish the package.

    Packages install window

  • Leave the other boxes as the defaults.

  • Click the install button in the Install Packages window and the package will be installed.

The ggplot2 package is now installed and is in your library. The ggplot2 package can now be loaded in an R session.

  • The command to load a package into your R session is

    library(packageName)

    PackageName is the package which is to be loaded.

There are a few packages we will use in this article series. We will add code to your script to load these packages.

  • Enter the following commands into your SalAnalysis script. (The lines can be copied and pasted from this file into your script. Remember to save after updating your script.)

    #####################################################
    #####################################################
    ##
    ##   Demonstration from the R For Researchers series
    ##   
    ##   The focus of the analysis in these articles is
    ##   on demonstrating the use of R functions in the
    ##   analysis of data.  These analyses are not
    ##   complete analyses.  They include only the steps
    ##   needed to demonstrate the use of the R
    ##   functions.
    ## 
    ##   Name        Date
    ##
    #####################################################
    #####################################################
    
    #####################################################
    #####################################################
    ##
    ##   Session Setup
    ##
    #####################################################
    #####################################################
    
    library(faraway)      # glm support
    library(MASS)         # negative binomial support
    library(car)          # regression functions
    library(lme4)         # random effects
    library(ggplot2)      # plotting commands
    library(reshape2)     # wide to tall reshaping
    library(xtable)       # nice table formatting
    library(knitr)        # kable table formatting
    library(grid)         # units function for ggplot
  • These commands need to be sent to the console to be run. Highlight the commands you wish to run in the editor.
  • Click the run icon, which is on the right in the top row of icons in the source pane. The run icon has a green arrow pointing to the right. If you hover over the icon you will see the text "Run the current line or selection" as seen in the image below.

    Run script lines

  • The results will be displayed in the console. The results should be similar to the following.

    
    Attaching package: 'car'
    
    The following objects are masked from 'package:faraway':
    
        logit, vif
    
    Loading required package: Matrix

Notice that these packages required other packages to be loaded. R loaded these packages automatically. If any of the packages do not load, the package likely needs to be installed on your computer.
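One way to check ahead of time which packages are missing from your library is sketched below. It uses the base R function requireNamespace(), which reports whether a package can be found without attaching it. The object names needed and isInstalled are hypothetical, and this check is an addition for illustration, not a step the series requires.

```r
# The packages loaded in the session setup above.
needed <- c("faraway", "MASS", "car", "lme4", "ggplot2",
            "reshape2", "xtable", "knitr", "grid")

# TRUE for each package found in the library, FALSE otherwise.
isInstalled <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that would still need to be installed.
needed[!isInstalled]
```

Any names displayed by the last line can be installed from the Packages tab as described above.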

Several comment blocks were added to the script with the library functions. The first is used to identify what this script is for. The second is used to identify the beginning of the setup section of the script. Comment blocks make it easier to find sections of functionality in scripts. It's a good practice to use comment blocks to separate sections of your scripts.

It is also a good practice to load packages at the beginning of a script or R Markdown file. If while working you discover you need another package loaded, add the library command for it at the beginning of the file with the other library commands. By keeping the package loading at the beginning of the file, the package's commands can be used anywhere in the file. This avoids having to look through a script to see if you have loaded a package for a function you need.

Work directory

Setting the work directory allows you to reference files without giving a full path name to the file. There are several advantages to setting your working directory. The most important advantage occurs when the working directory is the same as the project folder. If you move the project folder, you have one line in your script to change to point to the new folder. The script and project become more portable with this approach. Setting the work directory will also save you typing in your scripts, since you will not need to enter the full path to each file. RStudio sets the work directory to the project directory when a project is opened. If a script will be run outside of the project, the work directory will need to be set.

We are going to set our work directory in our script.

  • Enter the following commands into your script and run them.

    saveDir <- getwd()  # get the current working directory
    saveDir             # show me the saved directory
    
    wd <- "u:/RFR"      # path to my project
    setwd(wd)           # set this path as my work directory
  • Your console should display a similar working path (Note your path will be different than the displayed path.)

    [1] "u:/RFR"
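The benefit of a work directory can be sketched without touching your project. In the example below a temporary folder stands in for the project folder, and the names projDir and demo.csv are hypothetical. A file written with a relative name lands in the work directory, and the saved directory is restored at the end.

```r
saveDir <- getwd()        # remember the current work directory

projDir <- tempdir()      # stand-in for a project folder (assumption)
setwd(projDir)            # make it the work directory

# A relative file name is resolved inside the work directory.
writeLines("id,salary", "demo.csv")
file.exists(file.path(projDir, "demo.csv"))

setwd(saveDir)            # restore the original work directory
```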

Your script now has the code needed to set up your session. This is a good time to commit the changes to SalAnalysis.

  • Commit your changes to SalAnalysis using the commit message "Added session setup to SalAnalysis".

Exercises

  1. Create a new script titled AlfAnalysis. This script will be used for the exercises in this article series.

  2. Set up the session for the AlfAnalysis. Load the same packages as were used in SalAnalysis.

  3. Commit your changes to AlfAnalysis.

Solutions

Git accessing prior versions

Git has a number of commands to access prior versions of your source code. RStudio has implemented only one of these functions, which RStudio calls revert. This function changes the files in the working directory to match the state of the head. The choice of the name "revert" for this function is unfortunate, since this is not what a Git revert does. The RStudio revert is like a Git reset with the hard option. This article series will refer to this function as RStudio's revert to distinguish it from a Git revert.

It is important to recognize that the intent of an RStudio revert is to overwrite files in the working directory. The overwritten changes in the working directory would not have been committed and as such are not part of the project's history. The overwritten changes are permanently removed from the working directory and cannot be recovered from the project's history. RStudio's revert needs to be used with great care.

There are times when RStudio's revert is what is needed. An example would be trying a new approach to a calculation for an analysis and determining it is not as good as the prior approach. Going back to the prior version of the calculation and not saving the failed improvement code might be what you want.

We recommend the use of a Git GUI to access source files from commits prior to the branch's head, the last commit.

We will make a meaningless change to our script and use RStudio's revert to restore the file to its committed state.

  • Add the following line to your SalAnalysis script.

    # Silly comment used to test RStudio's revert

  • Save the file, but do not commit the change.
  • Select the Git tab in the tabs pane.
  • Select the SalAnalysis file in the Git tab.
  • From the tools menu in the Git tab, select Revert.

    Git Revert

  • The comment line has been removed from the file.

This example would have been easier to do using the undo function in the editor. There are times when the editor undo is either not an option or would be difficult to use. For example, undo is no longer available once a file has been closed, and it is awkward when many changes involving multiple files have been made.

Git central repositories

A central repository is a repository which is used solely to store the project. No development is done in a central repository. Development is done in local repositories. Our RFR repository is a local repository. Central repositories are useful as a backup for a local repository or to coordinate work done in multiple local repositories.

A central repository is a remote repository. It is remote because it is remote with respect to our local repository. Remote does not necessarily mean far away. The remote repository might be saved to the same storage device as the local repository.

We will create two folders, named cen and home, to demonstrate the use of a central repository. The cen folder will be where our central repository will be stored. The home folder will hold a second local repository for the RFR project. These two additional repositories will be stored on the U drive as is your current RFR project, all on the same device. This is done for the convenience of this example. In practice, additional development repositories (local repositories) would likely be stored on different devices.

  • Create the cen and home folders on your U drive. If you do not have a U drive on your computer, put them in the documents folder.

We need to create the central repository for RFR and then connect our local repository to the central repository. RStudio does not support this functionality. We will do these steps using the shell.

  • Select the Shell option from the tools drop down menu in the Git tab.

    Git Shell

  • A shell window will open. The prompt shows the current folder. This is the folder in which commands will be executed. The prompt should include the path /u/RFR in the shell.

    Git Shell Prompt

  • We need to change our folder to where the central repository will be stored and then create the repository.

    Enter the following commands in the shell.

    cd ../cen
    git init --bare RFR.git

    The init command with the --bare parameter tells Git that the new repository will not be a development repository. RFR.git is the name of the repository. We will call all three of the repositories RFR and distinguish them by their locations rather than their names.

  • We now need to connect our local RFR repository to the central repository.

    Enter the following commands in the shell.

    cd ../RFR
    git remote add central ../cen/RFR.git
    git push -u central master
    

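    The remote add command and parameter tell Git to add a path to a remote repository. The added path will be named central in the RFR repository and the path to it is ../cen/RFR.git.

    The push -u command and parameter copy the master branch of the local RFR repository to the central repository. The -u parameter also sets central/master as the upstream branch of the local master branch, so later pushes and pulls know which remote branch to use.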

  • The shell should now look similar to the following image.

    Git Shell Done

  • Close the shell window by clicking the red X in the upper right corner.

  • The Pull and Push icons, in the top row of icons on the Git tab, should be fully displayed and not greyed out. You may need to click the refresh button in the Git tab to fully see the push and pull icons.

  • Click on the history icon in the Git tab.

    Git Central Log

  • In the Review Changes window you should see that the most recent commit, the top one, now has three identifiers associated with it. The first identifier, HEAD, indicates this commit is the head of our currently checked out branch. The second identifier, central/master, indicates this commit is the head of the master branch in the central repository. The third identifier, master, indicates this commit is the head of the master branch in this repository. When there is only one branch in a repository, HEAD and master will point to the same commit.

Clicking the Push button will now move to the central repository any commits in the local repository which are not in the central repository.

If your computer is connected to the SSCC network, we recommend that you use one of the network drives for your project, such as U, V, etc. If you are working on a computer not connected to the network, we recommend you set up a central repository on a network drive and push to the repository on a regular basis.

A central repository is also useful for coordinating work in multiple local repositories. One situation for this would be a project in which you work both on a University computer connected to the network and a computer not typically connected to the network, such as a laptop or home computer. This situation will be demonstrated in the next example. The procedure would be similar if the repositories were associated with different members of the team.

We will set up another repository to work on the RFR project. We will set up this second repository in the home folder you created above.

  • Open a second instance of RStudio.

  • From the File drop down menu, select New Project.

  • Select Version Control from the Create project menu in the New Project window.

  • Select Git from the Create project from Version Control menu in the New Project window.

  • Enter "file:///U:/cen/RFR.git" in the Repository URL box. The path to the central repository needs to include "file:///" to address some issues with network drives when cloning a repository. This prefix is not needed after the repository has been cloned.

  • Enter "RFR" in the Project directory name box.

  • Navigate to "U:/home" in the Create project as a subdirectory of box.

    Git Home Repo Create

  • Click the Create Project icon at the bottom of the New Project window.

  • The path "U:/home/RFR - master" should now be displayed in the upper left of the RStudio window next to the RStudio icon.

  • Select the Git tab in the tools pane.

  • Click on the history icon.

  • The Git log can be seen in the Review changes window.

    Git Home History

  • All of the commits we made in our primary repository can be seen in the new home/RFR repository. The name of the remote repository here is origin. This is the same repository that is named central in our primary RFR repository.

  • Close the Review Changes window.

There are now three RFR repositories on your U drive. Two of these repositories are local repositories. The first is our primary working repository, U:/RFR, and the second is the home repository, U:/home/RFR. Work done in one of these repositories can be shared with the other by pushing and pulling through the central repository.

As an example of sharing project work, we will make a change in the home repository and move the change to our primary working repository.

  • You should have both of the local RFR projects open. If they are not both open, open them. The path displayed in the upper left corner of the RStudio window will identify which project the window is associated with. The home RFR project will display the path "U:/home/RFR".

  • Open the SalAnalysis.R script in the home RFR project.

  • Add the following comment line to the SalAnalysis.R script in the home RFR project. Add this line after the other content in SalAnalysis.

    # Comment added in home/RFR
  • Save the SalAnalysis file.

  • Commit the change to SalAnalysis file with the commit message "Added comment in home project".

  • Open the Git history in the home RFR project and the log will show this new commit in the U:/home/RFR repository and not in the origin repository, which is our central repository.

    Git Home Added Commit

  • Open the Git history in our primary RFR project and the log will not show this new commit in the repository.

    Git Primary No Pull

  • We will now move the new commit from the home RFR repository to the central repository. Click the Push icon in the Git tab of the U:/home/RFR project.

  • Click the close button in the Git Push window.

  • Click the refresh icon in the Review changes window for the U:/home/RFR project.

    Git Home Push

  • The U:/home/RFR Git log shows that the origin repository now contains the new commit.

  • Clicking on the refresh icon in the Review changes window for our primary RFR project will show that this repository has not changed. Also note that there is no indication that there are changes in the central repository waiting to be pulled into this repository.

  • We will now move the new commit from the central repository to the primary RFR project. Click the Pull icon in the Git tab of the U:/RFR project.

  • Click the close button in the Git Pull window.

  • Click the refresh icon in the Review changes window for the U:/RFR project.

    Git Primary Pull

  • The commit which was done in the U:/home/RFR project is now seen in the log of the U:/RFR repository.

  • Click on the SalAnalysis.R file. You will see that the change made to this file in the U:/home/RFR project is now in this file.

  • The changes made in the U:/home/RFR project have been applied to the U:/RFR project.

This example worked this easily because there were no conflicts in the file pulled from the central repository. If there are conflicts, you will need to resolve them. There are tools to support this process. We will not cover these tools in this article series. While the tools can help, the heart of the conflict resolution process is you deciding what changes will be made in your project.

Next: Data preparation

Previous: R Markdown

Last Revised: 11/24/2015