This article is part of the R for Researchers series. For a list of topics covered by this series, see the Introduction article. If you're new to R we highly recommend reading the articles in order.

Overview

This article introduces sourcing R script files, formatting character variables, and formatting tables and graphs for documents. It also expands on the use of the R Markdown commands introduced in the R Markdown article.

The R chunks used in this article contain some rather long option lists. Chunk options cannot be split across multiple lines. As you read this article in your web browser, you will see that the long chunk options extend beyond their display box. A slider bar at the bottom of these display boxes allows you to scroll to the right to see the remainder of the chunk option list.

Preliminaries

You will get the most from this article if you follow along with the examples in RStudio. Working the exercise will further enhance your skills with the material. The following steps will prepare your RStudio session to run this article's examples.

  • Start RStudio and open your RFR project.
  • Confirm that RFR (the name of your project) is displayed in the upper left corner of the RStudio window.
  • Open your SalAnalysis script.
  • Run all the commands in the SalAnalysis script.

We will use markdown to document some of the results from our evaluation of the Salaries dataset. Use the following steps to create a new markdown file and get it ready for use.

  • Open a new R Markdown file with the Default Output Format of PDF. Use the title "Analysis of Professor Salaries".
  • Save the markdown file as SalDoc.
  • Delete the example text and code (the text following the metadata section).
  • Where the example text was, enter

    ## Data overview
    
    
    The Salaries data set is the basis for this analysis.
    It is a sample of academic salaries of Professors at
    United States universities from the 2008-9 school year.
    The salary information was collected to monitor the
    difference in salaries between men and women.
  • Knit the markdown file

  • The pdf previewer will open and the text you entered will be displayed.

Include R scripts in Rmd files

We want the markdown document to make use of your SalAnalysis code. There are two general approaches which can be used to include code. The first is to pull only the parts of the R code which are needed into the markdown file. This makes the markdown file run efficiently. The drawback is that you must either make another copy of part of your code or you must break your script into smaller scripts so that just the needed script files can be included in the markdown. The other approach is to include all of your script in your Markdown file. This is the easiest approach and is the approach we will use.

The R function to run code from another file is the source() function.

  • Syntax and use of the source() function

    source(file)

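    Runs all the commands in the file. By default the commands are not echoed to the console and results are not automatically printed; only output from explicit calls such as print() is displayed. Use echo=TRUE to display each command and its result.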

    The file parameter here is like the file parameter in read.table().

    The source() function can also be used to include a script within another script.
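
As a minimal sketch of how source() behaves, the following writes a throwaway script to a temporary file and then sources it. The script contents and the names scriptFile, x, and xMean are made up for illustration and are not part of SalAnalysis.

```r
# Create a small throwaway script (hypothetical contents, for illustration only).
scriptFile <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10",
             "xMean <- mean(x)",
             "xMean"),
           scriptFile)

# Run the script. With the default echo=FALSE nothing is displayed,
# but the objects the script creates are available afterwards.
source(scriptFile)
xMean

# echo=TRUE displays each command and its result as the script runs.
source(scriptFile, echo = TRUE)
```

This is why sourcing SalAnalysis.R gives the markdown document access to the objects the script creates.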

We do not want to see the code or any of its output right here in the document. We will use the following R chunk options to source the file quietly.

echo=FALSE, results="hide", message=FALSE, warning=FALSE, fig.show="hide"

Using source() with these options will allow the R Markdown session to have access to all data objects and functions defined and loaded by the included script without showing any of this information in the document.

  • Enter the following lines right after the metadata section of SalDoc (before the text starts.)

    ```{r, echo=FALSE, results="hide", message=FALSE, warning=FALSE, fig.show="hide"}  
    source("SalAnalysis.R")  
    ```
  • Knit the document again to make sure your script runs. The generated document should look the same as it did after the prior knit.

    If your script does not run, use the Clear Workspace function in the Session drop-down menu. This clears the variables you have already loaded from the session. It does not close your session or any of your files. Then source your script in the console to find your error.

Metadata options

Metadata options allow for some control over how your final document is created.

We will set several pandoc and output options. Pandoc is the program which creates the .pdf file from the .md file created during knit. The pandoc options control how the pdf file is created. The output options control how the md file is created. These two processes together make up the knit process for a pdf target document.

  • Pandoc options

    fontsize: sizept
    geometry: margin=mar
    documentclass: type

    Fontsize sets the size of the font to size.

    The geometry: margin option sets all page margins to mar.

    Documentclass sets the type of pdf document to be created. We will use "article" for the type.

  • Output options

    fig_caption: yes

    Fig_caption enables the display of figure captions in the document. The default for fig_caption was no when this article was written.

  • Replace the metadata section with the following.

    ---
    title: "Analysis of Professor Salaries"
    fontsize: 12pt
    geometry: margin=1in
    documentclass: article
    
    output: 
      pdf_document:
        fig_caption: yes
    ---
  • Knit the document again to check for errors. You should see the font size increase in the viewer. No other changes should be noticeable at this point in the development of the document.

Including text results

In the R Markdown article we saw how to include text results from the console inline. The text we need in our document may not be as readily available as the examples from the R Markdown lesson. Creating character variables to use as text is a common solution. The paste() function is a useful tool for putting together character variables. It is also useful for creating and modifying variables during data preparation.

The paste() function concatenates a set of character elements into one character scalar.

  • Syntax and use of the paste() function

    paste(char1, char2, ..., charn, sep=sepChar, collapse=colChar)

    Returns a scalar or vector of type character.

    The char1, char2 through charn are scalars or vectors which can be coerced to character. If the char parameters have different lengths, the shorter ones are recycled to the length of the longest. A scalar is therefore repeated for each element of the vector.

    The sepChar is a character scalar which is placed between each of the char elements. If char are vectors, then sepChar is placed between each of the elements in a row across the vectors. The default value for sep is " ", one space.

    The colChar is used when char are vectors. The character of colChar will be placed between the concatenated rows. This results in a character scalar being returned. The default is NULL, no collapsing of the rows. If the rows of the vector are not collapsed, the vector of characters is returned.
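
A short sketch may help separate the roles of sep and collapse. The vector ranks and the label text are chosen only for illustration.

```r
ranks <- c("AsstProf", "AssocProf", "Prof")

# sep is placed between the arguments, element by element.
# The scalar "Rank:" is repeated for each element of ranks.
labelled <- paste("Rank:", ranks, sep = " ")
labelled
# "Rank: AsstProf"  "Rank: AssocProf" "Rank: Prof"

# collapse joins the elements of the result into one character scalar.
oneString <- paste(ranks, collapse = ", ")
oneString
# "AsstProf, AssocProf, Prof"
```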

We are going to use paste() to concatenate a vector of names. Since we want the names to read as a list in the document, we will collapse the names using ", " (a comma followed by a space) as the separator. The last element in a list of three or more should have ", and" before it instead of ", ". To achieve this, we will collapse all but the last element in the vector and add the last element after the ", and". We will create character vectors for the sets of names we want to use.

  • Enter the following text in the SalDoc file after the current paragraph. This means that there should be at least one blank line after the existing paragraph of text in the SalDoc file.

    ```{r, echo=FALSE, results="hide", message=FALSE, warning=FALSE, fig.show="hide"}  
    vars <- colnames(SalariesIn)  
    dscplNames <- levels(salary$dscpl)
    rankNames <- levels(salary$rank)
    ```
    
    The data set included `r length(salary$salary)` professors.
    The variables included in the data set are 
    `r paste(vars[-length(vars)],collapse=", ")`,
    and `r vars[length(vars)]`. 
    The sampled professors had ranks of
    `r paste(rankNames[-length(rankNames)],collapse=", ")`,
    and `r rankNames[length(rankNames)]`
    and were from disciplines 
    `r paste(dscplNames[-length(dscplNames)],collapse=", ")`
    and `r dscplNames[length(dscplNames)]`.
  • The code chunk created the three vectors of names we are interested in. This chunk produces no text in the SalDoc.

  • The paste(vars[-length(vars)],collapse=", ") code uses the first n-1 elements of the vector vars, where n is the length of vars. It returns a single character string made by joining those elements with the collapse string between them.

  • Knit the document.

  • The new paragraph should look like the following

    The data set included 397 professors. The variables included in the data set are rank, discipline, yrs.since.phd, yrs.service, sex, and salary. The sampled professors had ranks of AsstProf, AssocProf, and Prof and were from disciplines A and B.
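
If the same list wording is needed in several places, the inline code above could be wrapped in a small function. The following is only a sketch; listText is a hypothetical helper name, not something defined in SalAnalysis, and it assumes the vector has at least one element.

```r
# Hypothetical helper: turn a character vector into list text.
listText <- function(x) {
  n <- length(x)
  if (n == 1) return(x)                               # "A"
  if (n == 2) return(paste(x, collapse = " and "))    # "A and B"
  # Collapse all but the last element, then add ", and" and the last element.
  paste0(paste(x[-n], collapse = ", "), ", and ", x[n])
}

listText(c("AsstProf", "AssocProf", "Prof"))
# "AsstProf, AssocProf, and Prof"
listText(c("A", "B"))
# "A and B"
```

With such a helper, an inline reference like `r listText(rankNames)` would replace the two-part inline code.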

Table and graph numbering in R Markdown

R Markdown does not have commands to provide captions or numbering for tables and figures. Remember that one of the advantages of Markdown is that it is easy to use due to its small command set. Features such as numbering tables and figures would complicate the syntax of Markdown.

There are several methods for referencing table and figure numbers in the text. The first is to enter the numbers as text. Although this approach is easy, it is error prone as tables get added, deleted, or moved within a document. Another approach is to write and use R functions to automatically number the tables and figures and another function to reference them. Although this is the most likely to be error free, it is complicated to implement. The approach we are going to take balances the above two approaches.

We recommend the use of a code chunk at the beginning of the file which assigns a table or figure number to a variable for each table and figure. The variable can then be used in the document text. If a figure or table is added, deleted, or moved there is only one place in the document that needs to be edited to change the numbering. The rest of the document is generated from these variables. While this approach still requires some manual maintenance, the effort is fairly small. This solution is suitable for modest size documents, such as for articles. More involved documents are best done using a formatting program such as LaTeX.

The tables and figures that will be used in our document have already been determined. We can enter the code chunk listing all the table and figure variables now. Typically the variables for the table and figure numbers would be added as the tables are created.

  • Add the following R chunk after the metadata section and before other code chunks or text.

    ```{r, echo=FALSE}
    #
    # table numbers
    #
    tabNumRankByDscpl <- 1
    
    #
    # figure numbers
    #
    FigNumYrsFacRank <- 1
    FigNumBoxDscplFacRank <- 2
    FigNumGenYrsFacDscpl <- 3
    FigNumBoxGenRankFacDscpl <- 4
    ```
  • Knit the document to check for errors.

  • There should be no change in the document from this code chunk being added. We'll use variables later.
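
For comparison, the function-based approach mentioned earlier could be sketched roughly as follows. This is not the method used in this article; makeNumberer and the label strings are hypothetical names, and the sketch assumes each table or figure is first referenced in the order it appears in the document.

```r
# Hypothetical helper: returns a function that hands out numbers by label.
# A label gets the next free number the first time it is used and
# the same number on every later use.
makeNumberer <- function() {
  labels <- character(0)
  function(label) {
    if (!(label %in% labels)) labels <<- c(labels, label)
    match(label, labels)
  }
}

figNum <- makeNumberer()
figNum("YrsFacRank")        # 1
figNum("BoxDscplFacRank")   # 2
figNum("YrsFacRank")        # 1 again, the label is already known
```

The variable-per-figure chunk above is simpler to read and debug, which is why it is used here.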

Table formatting

The Data exploration article introduced the table() and aggregate() functions. The presentation tools that are introduced below can be applied to the results from table(), aggregate(), or any other object which is two dimensional.

Table totals

We will include in SalDoc a table of the count of professors by rank and discipline. We will add the totals for each rank and discipline to the table which is produced by table(). We will use the rowSums() and colSums() functions.

  • Syntax and use of rowSums() and colSums() functions.

    rowSums(X)
    colSums(X)

    Returns a vector of the sums of either rows or columns of X.

    X is typically a two-dimensional object. These functions can also be used on objects with more than two dimensions.

    The rowMeans() and colMeans() functions similarly return the means of rows and columns.

  • Add the following commands to your SalAnalysis script.

    tempTab <- table(salary$rank,salary$dscpl)
    rankByDscpl <- as.data.frame.matrix(tempTab)
    rankByDscpl$Totals <- rowSums(tempTab)
    rankByDscpl["Totals",] <- colSums(rankByDscpl)
    rankByDscpl
  • The table() function creates a two-dimensional object of type table. The as.data.frame.matrix() function changes the type to a data.frame. The colSums() call includes the column with the row totals, so the new Totals row also contains the grand total.

  • Run these commands from the script.

  • The following will be displayed on the console

                A   B Totals
    AsstProf   24  43     67
    AssocProf  26  38     64
    Prof      131 135    266
    Totals    181 216    397
  • The table is now ready to be formatted for the document.
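
The row and column functions can be tried on their own with a small matrix. The sketch below types in the counts from the console output above rather than using the salary data, so it runs without the project. It also shows addmargins(), a base R function that is an alternative way to add totals; it is not used in the rest of this article.

```r
# Counts copied from the table above; the matrix is filled column by column.
counts <- matrix(c(24, 26, 131,
                   43, 38, 135),
                 ncol = 2,
                 dimnames = list(c("AsstProf", "AssocProf", "Prof"),
                                 c("A", "B")))

rowSums(counts)    # totals for each rank: 67, 64, 266
colSums(counts)    # totals for each discipline: 181, 216
rowMeans(counts)   # mean count for each rank

# Alternative: addmargins() adds a "Sum" row and column in one step.
withTotals <- addmargins(as.table(counts))
withTotals
```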

kable()

There are two functions we will look at for formatting tables. The first is the kable() function. This function will produce markdown formatting for tables. This formatting, like all of markdown, is basic, but kable() formatted tables will be appropriate for many documents.

  • Syntax and use of the kable() function

    kable(object,
          format=formatting,
          digits=num,
          align=colAlgn,
          caption=cap
          )

    Returns plain text. The plain text includes the formatting commands that will be processed by R Markdown to produce the table in the formatted document.

    The object parameter is the two-dimensional structure from which the table will be made. Its contents become the displayed table.

    formatting specifies what formatting command set to use. The values it can take include "latex", "html", and "markdown". This parameter is not needed when kable() is used in a knitted Rmd document. Knitr manages this parameter for you.

    Digits is an optional parameter. num is the maximum number of digits for numeric columns. It can also be a vector giving the number of digits for each column.

    Align is an optional parameter. colAlgn is a vector of type character. Each element is "r", "c", or "l" for right, center, or left alignment. The vector has length equal to the number of columns of data in the table.

    Caption is an optional parameter. cap is a character scalar. The tables are not automatically numbered in HTML documents.
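
The following sketch shows these parameters together on a tiny made-up data frame. The data frame demo and its values are invented for illustration, and format is set explicitly only because the code is meant to be run in the console rather than knitted.

```r
library(knitr)

# A tiny made-up data frame, used only to show the arguments.
demo <- data.frame(group = c("a", "b"),
                   avg   = c(1.23456, 20.5),
                   n     = c(10L, 12L))

tab <- kable(demo,
             format  = "markdown",        # normally chosen by knitr when knitting
             digits  = 2,                 # round numeric columns to 2 digits
             align   = c("l", "r", "r"),  # one alignment per column
             caption = "A small example table")
tab
```

Printing tab in the console shows the plain text that R Markdown would turn into a formatted table.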

As an example we will use kable() to format the rankByDscpl data.frame.

  • Add the following code at the bottom of the SalDoc.

    ```{r, echo=FALSE}
    library(knitr)  # provides kable(); harmless if already loaded
    capRankByDscpl <- "Count of Professors by discipline and Rank"
    kable(rankByDscpl, caption = capRankByDscpl)
    ```
  • Knit the SalDoc.

  • The following table will follow the last paragraph in the pdf document.

    Table 1: Count of Professors by discipline and Rank
    A B Totals
    AsstProf 24 43 67
    AssocProf 26 38 64
    Prof 131 135 266
    Totals 181 216 397
  • In your pdf document the table will have different formatting than in this html document and the caption will be below the table.

xtable()

The xtable() function, like kable(), is used to produce text with the formatting for the table. xtable() provides LaTeX or HTML formatting commands, but not Markdown table commands. This gives xtable() greater control over the formatting of the table.

Both kable() and xtable() result in plain text with formatting commands. R Markdown and kable() are built to work together. R Markdown passes the kable() results directly into the .md file. To get xtable() results passed directly into the .md file, the results chunk option needs to be set to "asis".

  • This would look like

    ```{r, results="asis"}

  • Other chunk options may be included as needed.

We will use the print.xtable() function with xtable() to create the table in our generated document. The print() function can be used in place of print.xtable(). We will use print.xtable() to make it clear which print function we are using and to make it easier to use help() on it. There are options for print.xtable() which are needed for formatting the xtable() tables.

  • The syntax and use of the xtable() function.

    xtable(object,
           digits=num,
           align=colAlgn,
           caption=cap,
           display=dis
           )

    Returns an object of class xtable. The plain text with the formatting commands for the table is produced when this object is printed with print.xtable().

    The object parameter is the two-dimensional structure from which the table will be made.

    Digits is an optional parameter. Num is the number of digits to display for numeric columns. It can also be a vector giving the number of digits for each column, with one extra leading element for the row names.

    Align is an optional parameter. ColAlgn is a vector of type character. Each element is "r", "c", or "l" for right, center, or left alignment. The vector has length one greater than the number of columns of data in the table, because the first element applies to the row names. Vertical lines can be included in the table by including "|" in the list of alignment characters where the vertical lines are to be placed.

    Caption is an optional parameter. Cap is a character scalar. The tables are not automatically numbered in HTML documents.

    Display is an optional parameter. dis is a character vector with one element for the row names followed by one element for each column. The common values are "s" for character, "d" for integers, and "f" for floating point.

  • The syntax and use of the print.xtable() function.

    print.xtable(xtable,
                 type=formatting,
                 caption.placement=loc,
                 hline.after=lines,
                 comment=logical
                 )

    Returns text. The text is the source formatting code and table content.

    formatting is either "html" or "latex". There is no support for R Markdown tables. Remember that Markdown consists of a small set of formatting commands. Controlling the formatting of tables typically requires more control than Markdown provides.

    Caption.placement is optional. The value of loc is either "top" or "bottom".

    Hline.after is optional. The vector lines contains the numbers of the rows which will have horizontal lines after them. Row 0 is the header row. Use -1 to produce a line before row 0, at the top of the table.

    comment is optional. You will need this set to FALSE to prevent extra text from being displayed in your document.
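
The two functions can be tried in the console on a small made-up data frame before they are used in the document. In this sketch demo and its counts are only illustrative, and capture.output() is used so the generated text can be inspected as a character vector.

```r
library(xtable)

# A tiny example with row names and two data columns.
demo <- data.frame(A = c(24L, 26L),
                   B = c(43L, 38L),
                   row.names = c("AsstProf", "AssocProf"))

# xtable() builds the table object; no formatting text exists yet.
xt <- xtable(demo,
             caption = "A small example table",
             align   = "|r|rr|",            # row names plus two data columns
             display = c("s", "d", "d"))    # row names plus two data columns

# print.xtable() produces the LaTeX (or HTML) text.
latexLines <- capture.output(
  print.xtable(xt,
               type = "latex",
               caption.placement = "top",
               hline.after = c(-1, 0, 2),   # top, after header, after last row
               comment = FALSE))
latexLines
```

Changing type to "html" produces HTML table markup from the same xtable object.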

Since one of our goals for using R Markdown is to have source code which can easily switch target types, such as pdf (done using LaTeX) or html, we will make an R variable to use for the type of formatting.

  • Insert the following R command in the code chunk which has the table and figure number variables. This keeps all the R variables we need to manage for this document together at the top of the source file.

    targetType <- "latex"  
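
As an optional variation, the value could be derived from the output format instead of being typed in. The sketch below assumes a version of knitr that provides is_latex_output(); it treats every non-LaTeX target as HTML, which is a simplification.

```r
# Sketch: pick the xtable type from the format knitr is producing.
# knitr::is_latex_output() is TRUE when knitting to a LaTeX-based target
# such as PDF, and FALSE otherwise (including in the console).
targetType <- if (knitr::is_latex_output()) "latex" else "html"
targetType
```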

We will now use xtable() to format the rankByDscpl data.frame, replacing the kable() version.

  • Add the following code right after the last paragraph in the document. Add the text with no blank lines so that it is a continuation of the last paragraph.

    The numbers of professors from each of these groups is
    displayed in table `r tabNumRankByDscpl`.
    There are about 20 percent more members of discipline B
    included in the sample than of discipline A.
    The table shows there are a greater proportion of Discipline B 
    members with a rank of associate or assistant professor.
    
    ```{r, echo=FALSE, results="asis"}
    library(xtable)  # provides xtable() and print.xtable(); harmless if already loaded
    xtab <- xtable(rankByDscpl,
                       caption="Count of Professors by discipline and Rank",
                       display=c("s","d","d","d"),
                       align="|r|rr|r|"
                       )
    print.xtable(xtab, 
          type=targetType, 
          caption.placement="top", 
          hline.after=c(-1,0,3,4),
          comment=FALSE
          )
    ```
  • Remove the code chunk with the kable() function.

  • Knit SalDoc

  • The updated paragraph should look like the following

    The data set included 397 professors. The variables included in the data set are rank, discipline, yrs.since.phd, yrs.service, sex, and salary. The sampled professors had ranks of AsstProf, AssocProf, and Prof and were from disciplines A and B. The numbers of professors from each of these groups is displayed in table 1. There are about 20 percent more members of discipline B included in the sample than of discipline A. The table shows there are a greater proportion of Discipline B members with a rank of associate or assistant professor.

    Table 1: Count of Professors by discipline and Rank
    A B Totals
    AsstProf 24 43 67
    AssocProf 26 38 64
    Prof 131 135 266
    Totals 181 216 397
  • The formatting of the table and caption of the table will be a little different in the pdf document than in this HTML document.

Presentation of graphs

The graphs we produced using ggplot in the Data exploration article are ready to be published. We will apply additional formatting to fit them into the final document. This formatting will size them for the document and control the amount of empty space around them.

The margins around a figure are controlled with the plot.margin argument of the theme() function. We will shrink the amount of white space around the plots.

  • The following function is added to the ggplot to shrink the white space.

    theme(plot.margin=unit(c(.1,.5,0,.5),"lines"))
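
The four values given to plot.margin are the top, right, bottom, and left margins, in that order. The sketch below applies the setting to a plot of the built-in mtcars data so it can be run without the salary data; mtcars and the object name p are used only for illustration, and grid::unit() is called explicitly so the example does not depend on how unit() is made available.

```r
library(ggplot2)

# Margins are given as top, right, bottom, left, measured here in lines of text.
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_bw() +
  theme(plot.margin = grid::unit(c(0.1, 0.5, 0, 0.5), "lines"))
p
```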

We will be adding four graphs to the SalDoc. The code to create these graphs is provided below.

  • Add the following commands to your SalAnalysis script.

    plotYrsFacRank <- ggplot(data=salary, aes(x=yrSer, y=salary)) +
      geom_point(aes(color=dscpl)) +
      facet_wrap(~rank) +
      theme_bw() +
      ggtitle("Salaries for each discipline") +
      theme( plot.title=element_text(vjust=1.0) ) +
      xlab("Years of service") +
      theme( axis.title.x = element_text(vjust=-.25) ) +
      ylab("Salary in $1000") +
      theme( axis.title.y = element_text(vjust=1.0) ) +
      theme(strip.background = element_rect(fill = "White")) +
      theme(plot.margin=unit(c(.1,.5,.15,.5),"lines")) +
      theme(legend.position = "right")
    
    boxDscplFacRank <- ggplot(data=salary, aes(x=dscpl, y=salary)) +
      geom_boxplot(aes(color=dscpl)) +
      facet_wrap(~rank) +
      theme_bw() +
      ggtitle("Salary summary by discipline and rank") +
      theme( plot.title=element_text(vjust=1.0) ) +
      xlab("Discipline") +
      theme( axis.title.x = element_text(vjust=-.25) ) +
      ylab("Salary in $1000") +
      theme( axis.title.y = element_text(vjust=1.0) ) +
      theme(strip.background = element_rect(fill = "White")) +
      theme(plot.margin=unit(c(.1,.5,.15,.5),"lines")) +
      theme(legend.position = "right") 
    
    plotYrsFacDscpl <- ggplot(data=salary, aes(x=yrSer, y=salary)) +
      geom_point(aes(color=sex)) +
      facet_wrap(~dscpl) +
      theme_bw() +
      ggtitle("Salaries for each gender") +
      theme( plot.title=element_text(vjust=1.0) ) +
      xlab("Years of service") +
      theme( axis.title.x = element_text(vjust=-.25) ) +
      ylab("Salary in $1000") +
      theme( axis.title.y = element_text(vjust=1.0) ) +
      theme(strip.background = element_rect(fill = "White")) +
      theme(plot.margin=unit(c(.1,.5,.1,.5),"lines")) +
      theme(legend.position = "right")
    
    boxGenRankFacDscpl <- ggplot(data=salary, 
                                 aes(x=rank, y=salary)) +
      geom_boxplot(aes(color=sex)) +
      facet_wrap(~dscpl) +
      theme_bw() +
      ggtitle("Salary summary by gender and discipline") +
      theme( plot.title=element_text(vjust=1.0) ) +
      xlab("Rank of professor") +
      theme( axis.title.x = element_text(vjust=-.25) ) +
      theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
      ylab("Salary in $1000") +
      theme( axis.title.y = element_text(vjust=1.0) ) +
      theme(strip.background = element_rect(fill = "White")) +
      theme(plot.margin=unit(c(.1,.5,0,.5),"lines")) +
      theme(legend.position = "right")

The display size of a plot in the target document is controlled through the R chunk options. These chunk options were provided in the R Markdown article. We will use fig.cap, fig.height, and fig.width to provide figure captions and control the size of the figure.

Figure captions will not be displayed in the target document unless the fig_caption metadata option is set to "yes". This was already done for the SalDoc.
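
Because the same options are repeated on several figure chunks, they could instead be set once as defaults. The following sketch uses knitr's opts_chunk$set() and would go in a chunk near the top of the Rmd file; the SalDoc code in this article does not do this and repeats the options in each chunk header.

```r
library(knitr)

# Set defaults for all later chunks; an individual chunk header
# can still override any of these (for example fig.height=3).
opts_chunk$set(echo = FALSE, fig.height = 2.75, fig.width = 5.5)
```

Options such as fig.cap differ for every figure and still belong in each chunk header.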

  • Add the following code after the table R chunk in SalDoc.

    The relationship between years of service and salary can be
    seen in figure `r FigNumYrsFacRank`.
    The panels seem to indicate that for associate and assistant 
    professors most of the members of discipline B have higher 
    salaries than most of the members of discipline A, at the
    same rank.
    The box plots, figure `r FigNumBoxDscplFacRank`, also support 
    this observation.
    
    ```{r, echo=FALSE, fig.cap="Salaries by discipline and rank", fig.height=2.75, fig.width=5.5 } 
    plotYrsFacRank
    ```
    
    ```{r, echo=FALSE, fig.cap="Salaries summarized by discipline and rank", fig.height=2.75, fig.width=5.5 } 
    boxDscplFacRank
    ```        
    
    The plots of salaries with color indicators for gender, figure
    `r FigNumGenYrsFacDscpl`, indicate there may be a gender
    difference in salary in discipline A.
    The box plots in figure `r FigNumBoxGenRankFacDscpl`
    indicate that the difference between the genders in
    discipline A is greatest at the rank of assistant professor.
    
    ```{r, echo=FALSE, fig.cap="Salaries by gender and discipline", fig.height=2.75, fig.width=5.5 } 
    plotYrsFacDscpl
    ```
    
    ```{r, echo=FALSE, fig.cap="Salaries summarized by gender and discipline", fig.height=3, fig.width=5.5 } 
    boxGenRankFacDscpl
    ```
  • Knit SalDoc

The complete SalDoc.Rmd file can be seen at SalDoc.Rmd.

Commit your changes to the SalAnalysis and SalDoc files.

HTML documents

Most of the techniques from this article work the same way for HTML documents. The most significant exception is that captions for tables and figures are not supported through R Markdown. There are also a few differences in tables generated by xtable() due to the differences between HTML and LaTeX.

Table and figure captions can be created in HTML documents by writing the caption as its own paragraph and starting the paragraph with "Table" or "Figure". The table or figure number can be added using inline R code, just as was done for the references above.

  • Example of a table caption in HTML.

  • The following R markdown code

    **Table `r tabNumRankByDscpl`:** My caption text.
  • Results in the following displayed in the document

    Table 1: My caption text.

This will not center the caption. Although this solution is less than ideal, the resulting document is quite usable and it correctly identifies the tables and figures.
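
If many captions are needed, the caption text could be built by a small function and inserted with inline R code. This is only a sketch; htmlCaption is a hypothetical helper name, and the number variable is assigned here just so the example runs on its own.

```r
# Hypothetical helper: build a Markdown caption line for HTML output.
htmlCaption <- function(kind, num, text) {
  paste0("**", kind, " ", num, ":** ", text)
}

tabNumRankByDscpl <- 1
htmlCaption("Table", tabNumRankByDscpl,
            "Count of Professors by discipline and Rank")
# "**Table 1:** Count of Professors by discipline and Rank"
```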

There is a new package, called printr, which will support captions for tables. This package was not on CRAN at the time of the writing of this document. R Markdown is still evolving. There will likely be other solutions which will make R Markdown useful for a wider set of documents in the future.

Next: Regression (ordinary least squares)

Previous: Data exploration

Last Revised: 8/26/2015