SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

1.2 Functions, Packages, and Getting help

1.2.1 Programming skills

1.2.1.1 Packages and Libraries

R and Python have a base set of functionality when started. The functionality of both languages can be expanded by the use of packages. Each of these languages has a repository that holds packages (CRAN for R and PyPI for Python). Packages in these repositories meet a minimum standard. The amount of user documentation available varies between packages.

This book uses the tidyverse package for R and the pandas package for Python. (The tidyverse is a set of packages gathered together as one.) While there are tools to accomplish wrangling tasks in the base functionality of these languages, these packages provide a set of consistent and integrated tools that ease wrangling tasks. This book uses the functionality of these packages when it exists and uses base functionality only when it is needed.

To use the functionality of a package, the package first needs to be installed from the repository to a local library on your computer. This only needs to be done once. If you are using an SSCC computer, these packages and a number of other packages have been installed for you.

R and Python when started will only have the base functionality ready for use. You will need to load any additional packages you plan to use after each restart of the program.
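The difference between a package being installed and being loaded can be sketched in Python as follows. This is only an illustration: the helper name is_installed is hypothetical, and the standard-library json module stands in for a package such as pandas so that the sketch runs on any machine.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if Python can find the package in the local library."""
    return importlib.util.find_spec(package_name) is not None


# json ships with Python, so it is always installed.
print(is_installed("json"))

# A made-up name that should not exist on any machine.
print(is_installed("no_such_package_xyz"))

# Installed is not the same as loaded. The import statement loads the
# package for the current session and must be repeated after each restart.
import json
print("json" in sys.modules)
```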

1.2.1.2 Functions and their parameters

A function is a set of commands that has been given a name. The commands of a function are designed to accomplish a specific task. A function may need data to accomplish its task. Data objects that are passed to a function are called parameters. Functions can return a data object when they have completed their task. In summary, data is given to a function, the function does a specific task using this data, and the function returns data.

For example, consider a hypothetical function that generates a sequence of numbers from a starting value to an ending value by a step amount. (Note, this example is not a function in either R or Python. Both languages can do this, but with different functionality.) A call to such a function might look like the following:

    generate_sequence(start = 1, stop = 11, step_by = 3)

The name of this imaginary function is "generate_sequence". It takes three data objects as parameters, "start", "stop", and "step_by". These parameters are all numeric values. Parameters can be of any type supported by the programming language. This function call would return the numbers 1, 4, 7, and 10.

The parameters in this example are identified by their name. That is, the function knew to make the first number 1 because we set "start" to 1. Named parameters can be given to a function in any order. The generate_sequence example could have been written as:

    generate_sequence(stop = 11, step_by = 3, start = 1)

The parameters that a function will accept are named when the function is defined.
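A minimal Python sketch of how the imaginary generate_sequence function might be defined is given here. The function is hypothetical, as in the text. Treating "stop" as inclusive and supporting only a positive "step_by" are assumptions made for this sketch.

```python
def generate_sequence(start, stop, step_by):
    """Return the numbers from start up to and including stop, by step_by.

    Hypothetical function from the text. Treating stop as inclusive
    is an assumption made for this sketch.
    """
    return list(range(start, stop + 1, step_by))


# Named parameters can be given in any order.
first = generate_sequence(start=1, stop=11, step_by=3)
second = generate_sequence(stop=11, step_by=3, start=1)

print(first)
print(first == second)
```

The names "start", "stop", and "step_by" in the def line are what make the named calls possible.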

Parameters to a called function can also be identified by the order in which they are defined in the function. In our imaginary function "generate_sequence", if the parameters were defined in the order "start", "stop", and "step_by", our example could have also been written as:

    generate_sequence(1, 11, 3)

Each parameter would be mapped to the correct name by its position in the parameter list. Named parameters and positional parameters can be mixed in a call to a function. When mixed, all position-identified parameters must be given before any named parameters. The imaginary "generate_sequence" could have also been called using,

    generate_sequence(1, 11, step_by = 3)

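But the following would generate an error in Python. (R would accept this call, matching the unnamed values to the remaining parameters in order, but the style is hard to read and best avoided.)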

    generate_sequence(start = 1, 11, 3)

It is often clear from the name of the function what the first one, or sometimes two, parameters are. In these cases the parameters are identified by position. Parameters after the first few are best identified by name.
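The positional and mixed calling styles can be tried in Python with the sketch below. The generate_sequence function is the hypothetical one from the text, with "stop" assumed to be inclusive. Python rejects a positional value that follows a named one before the code runs, so the invalid call is compiled from a string here in order to catch the error.

```python
def generate_sequence(start, stop, step_by):
    """Hypothetical function from the text; stop is assumed inclusive."""
    return list(range(start, stop + 1, step_by))


# Parameters identified by position only, then position mixed with name.
by_position = generate_sequence(1, 11, 3)
mixed = generate_sequence(1, 11, step_by=3)
print(by_position == mixed)

# A positional value after a named one is a syntax error in Python.
bad_call = "generate_sequence(start = 1, 11, 3)"
try:
    compile(bad_call, "<example>", "eval")
    rejected = False
except SyntaxError as err:
    rejected = True
    print("SyntaxError:", err.msg)
```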

Functions are also objects. This allows for some advanced programming that can be done with functions. This is beyond the scope of this book.

1.2.1.3 Methods

Methods are similar to functions in that they perform a specific set of operations on data and can return data. They differ from functions mostly in how they are called.

Methods are useful because they allow a common name to reference the same kind of result for different classes (kinds) of objects. For example, the head() method can be used to display the beginning portion of a data set. The head() method could also be used to display the beginning of a list. The name head is used for both since they are expected to do similar operations on an object. There are two different head methods inside the language, and which one is used is determined by whether it is acting on a data set or a list (or another class of object).

In Python the syntax for a method is as follows,

   <name>.<method>(<parameters>)

This applies the method given by <method> that is for the class of <name> to the object given by <name>, with the additional parameters of <parameters>. This can be thought of as a function of the form,

   <method for a particular class>(<name>, <parameters>)

That is, the <name> object can be thought of as a parameter to the <method> function.
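A small sketch of this equivalence, using the count method that Python provides for both strings and lists (the variable names are only for illustration):

```python
word = "wrangle"

# Method syntax: <name>.<method>(<parameters>)
as_method = word.count("a")

# The same method called as a function of the class, with the object
# passed as the first parameter.
as_function = str.count(word, "a")

print(as_method, as_function)

# A list has its own count method with the same name. Python picks the
# one that matches the class of the object.
letters = ["a", "b", "a"]
print(letters.count("a"), list.count(letters, "a"))
```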

In R the syntax for a method is as follows,

   <method>(<name>, <parameters>)

This applies the method given by <method> that is for the class of <name> to the object given by <name>, with the additional parameters of <parameters>. This can be thought of as a function of the form,

   <method>.<class>(<name>, <parameters>)

That is, the function that is run has a name of <method>.<class>, and the object <name> is a parameter to the <method>.<class> function.
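R's calling style can be imitated in Python with functools.singledispatch from the standard library, which chooses an implementation based on the class of the first parameter. This is only an analogy and not how R is implemented. The head function and its n parameter below are hypothetical.

```python
from functools import singledispatch


@singledispatch
def head(obj, n=3):
    """Fallback used when no class-specific version is registered."""
    raise TypeError(f"head() has no method for {type(obj).__name__}")


@head.register(list)
def _head_list(obj, n=3):
    # Plays the role of head.list in R's naming scheme.
    return obj[:n]


@head.register(dict)
def _head_dict(obj, n=3):
    # Plays the role of head.dict: keep the first n key/value pairs.
    return dict(list(obj.items())[:n])


print(head([10, 20, 30, 40, 50]))
print(head({"a": 1, "b": 2, "c": 3, "d": 4}, n=2))
```

The caller always writes head(<name>, <parameters>), and the class of <name> decides which version runs.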

There is a little more going on under the hood of Python and R than this. But this is a useful conceptualization of methods in R and Python.

1.2.1.4 Disambiguation of function and method names

Adding packages to R or Python extends the set of names of functions and methods supported.

R has only one name space (list of what R knows about) and the package functions and methods are added to this name space. This allows the names of functions and methods to be used without regard to what package they came from. This approach can result in a function being hidden (masked) in the name space when a package gives a function the same name as one that already exists. To work through these kinds of name conflicts, R provides the :: operator to specify which package a function is located in. For example, if you wanted to reference a function in dplyr (one of the packages that make up the tidyverse), you could use dplyr::<name>. Here dplyr is the name of the package.

Python manages the namespace for each package separately. It is always completely clear which function you are referencing. The cost for this clarity is that a reference to the package name must be included with the name of the function when the function is in a package. This is done with the . operator. For example, pandas is usually imported with the name pd, so if you want to use a function in pandas you use pd.<name>().
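The sketch below illustrates Python's approach using two standard-library packages, math and cmath, that both define a function named sqrt. The alias m is chosen only for this example, in the same way that pd is the customary alias for pandas.

```python
import math
import cmath
import math as m  # a short alias, in the same way pandas is imported as pd

# Both packages define a function named sqrt. The prefix makes it
# clear which one is being called.
real_root = math.sqrt(4)
complex_root = cmath.sqrt(4)

print(real_root)
print(complex_root)

# The alias refers to the same package, so m.sqrt is math.sqrt.
print(m.sqrt is math.sqrt)
```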

1.2.1.5 Getting help

Help is available in a variety of places. What follows are some approaches to looking for help.

You may need help in knowing how to accomplish something. That is, you do not know what functions/methods to use or maybe the steps needed. A good place to start here is the cheat sheets for the tidyverse or pandas. These can be scanned quickly to see what is provided by the package. What you need to do may already be directly implemented. You may also notice functions/methods that do part of what is needed and leave you something smaller that you do not know how to do.

If you do not find what you need on the cheat sheets, googling is your best option. We suggest starting the google search with the package name (pandas or tidyverse) and following this with what you are trying to do. This may not get you the help you need if there is a technical name for something that you do not know. In this case you may have to read through several of the initial responses to see how others talk about doing what you are doing. This is a good way to learn more of the lingo of programming and wrangling. Additional google searches can be done based on the new key words you see in the initial responses. You can also ask a question on help sites such as Stack Exchange or Stack Overflow. You will likely have seen these kinds of sites in response to your initial google searches.

When you know how you want to accomplish something but do not remember the function/method to use, the cheat sheets would be a good place to start. If you do not find what you need on the cheat sheet, reviewing the table of contents of this book may help you find what you need. You can also google the package name followed by what you want to do.

When you know the name of the function/method you want to use and need help with what the parameters are or details of how the function/method works, the function documentation is a good place to start. This documentation can be found by googling the package name followed by the name of the function or method you want help with.
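Function documentation is usually also available from inside the language. In R, typing ?round at the console opens the help page for round(). In Python, the documentation attached to a function can be read as sketched below; round() is used only as a convenient built-in example.

```python
import inspect

# inspect.getdoc returns the documentation text attached to a function.
doc = inspect.getdoc(round)
print(doc)

# In an interactive session, help(round) displays similar text.
```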

If you are an SSCC member, you can also send a question to the help desk or come in and see one of the consultants for help if you are not able to resolve the issue for yourself. Information on getting help from the SSCC can be found at the SSCC website.

1.2.2 Examples - R

1.2.2.1 Packages examples

  1. In RStudio, packages are installed using the install button on RStudio's package tab. The base R function to install a package in your library is install.packages(). The key parameter for this function is the name of the package to be installed. This is a function you should not need to use.

    The R function to check a package out from the library for use is library(). The key package we will be using in this article series is the tidyverse. The following code checks out (loads) the tidyverse package.

    library(tidyverse)

    Note there are a number of packages that are attached when the tidyverse is loaded. This is not uncommon when loading packages. Also, loading the tidyverse has caused a few conflicts with functions in other packages that were previously loaded. The tidyverse's functions are now available for use.

    You will likely see code using require() instead of library(). The require() function is designed to be used inside a conditional test: it returns FALSE with a warning, rather than stopping with an error, when the package cannot be loaded. The result is that the script continues running even if the package did not load. (Yes, the name of the function is really misleading.) This is undesirable behavior and can cause issues when sharing your work. Use library(), not require(), to load a package.

1.2.2.2 Function examples

  1. One function that is handy for displaying data is the round() function. It can be used to reduce the number of digits that gets displayed. This is particularly helpful when the data to be displayed is real numbers with many significant digits.

    x <- 5/3
    x
    [1] 1.666667
    round(x, digits = 3)
    [1] 1.667

1.2.2.3 Getting help example

  1. We will try to find out how to import a csv data file. We start by entering tidyverse import csv into a google search. The following were the first few google results on 12/3/2018.

    Read a delimited file (including csv & tsv) into a ... - readr - Tidyverse
    https://readr.tidyverse.org/reference/read_delim.html
    read_csv() and read_tsv() are special cases of the general read_delim() . They're useful for reading the most common types of flat file data, comma separated ...

    GitHub - tidyverse/readr: Read flat files (csv, tsv, fwf) into R
    https://github.com/tidyverse/readr
    The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). ... To accurately read a rectangular dataset with readr you combine two pieces: a function that parses the overall file, and a column specification. ... read_csv(): comma separated ...

    11 Data import - R for Data Science
    https://r4ds.had.co.nz/data-import.html
    If you've used R before, you might wonder why we're not using read.csv() . There are ... It doesn't fit quite so well into the tidyverse, but it can be quite a bit faster.

    These results show two possible functions, read_csv() and read.csv(). We can also see that the URL for the first result looks to be from the tidyverse documentation. This would be a logical place to do the next search.

    Enter tidyverse read_csv function into the google search. The following was the first google result on 12/3/2018.

    read_csv() - readr - Tidyverse
    https://readr.tidyverse.org/reference/read_delim.html
    read_csv() and read_tsv() are special cases of the general read_delim() . They're useful for reading the most common types of flat file data, comma separated ...

    This is the documentation page for the read_csv() function. This is the function we need. Note, we also learned that the tidyverse package used to import data is readr. This can be useful to include in future tidyverse google searches about importing data.

    Sometimes you may need to look at a few of the pages to determine what key words to use for your next search. It can sometimes take a few of these kinds of searches before you learn what key words describe what you are looking for. This is an important skill and one that will get easier with practice.

1.2.3 Examples - Python

1.2.3.1 Packages examples

Python packages are not installed with Python commands. There are several ways these packages can be installed; pip is one common tool. These tools are outside the scope of this book. If you are using an SSCC computer or have installed Python through Anaconda, the packages used in this book will already be installed.

  1. The Python command to check a package out from the library for use is import. The key package we will be using in this article series is pandas. The following code checks out (loads) the pandas package and tells Python that pd will be used to refer to pandas.

    import pandas as pd
    import os

    The functions of the pandas package are used by prefixing the function name with pd and a period, as in pd.<name>().

1.2.3.2 Function examples

  1. One function that is handy for displaying data is the round() function. It can be used to reduce the number of digits that gets displayed. This is particularly helpful when the data to be displayed is real numbers with many significant digits.

    x = 5/3
    print(x)
    1.6666666666666667
    print(round(x, ndigits=3))
    1.667
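As a small follow-up sketch, the second parameter of round() can be given by name or by position, and round() returns a new value rather than changing x. The variable names by_name and by_position are only for illustration.

```python
x = 5 / 3

# The second parameter can be identified by name or by position.
by_name = round(x, ndigits=3)
by_position = round(x, 3)

print(by_name == by_position)

# round() returns a new value. x itself still holds all of its digits.
print(x)
```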

1.2.3.3 Getting help example

  1. We will try to find out how to import a csv data file. We start by entering pandas import csv into a google search.

    The following was the first google result on 12/3/2018.

    pandas.read_csv - pandas 0.23.4 documentation
    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
    pandas. read_csv (filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None,
    index_col=None, usecols=None, squeeze=False, prefix=None, ...

    This is the documentation page for the read_csv() function. We found what we needed on the first search. You will not always find what you are looking for on the first search. You may need to look at a few of the pages to determine what key words to use for your next search. It can sometimes take a few of these kinds of searches before you learn what key words describe what you are looking for. This is an important skill and one that will get easier with practice.

1.2.4 Exercises

  1. Find out how to create an object with a class of list. There is a function in both R and Python to make a list object.

    A list is an object that holds an ordered set of values. List objects are included in the base functionality of both R and Python.

    Lists are useful structures in both R and Python. List objects and their uses will be explained further in later chapters.

  2. Use the function you found in problem 1 to create a list object with only one item, an a, as the sequence of values. Name this list object my_list.

  3. Display the contents of my_list.