SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

0.1 Organization of the book and chapters

The chapters of this book are a collection of related discourses. Each discourse covers a data wrangling task or, in a few cases, a prerequisite programming skill. A discourse begins with instruction on the new data and programming skills used by the wrangling task, followed by worked examples demonstrating the task. Each discourse ends with exercises that let readers check their understanding and reinforce the skills of the wrangling task.

We have organized the tasks of data wrangling into five activities: acquiring, cleaning, transforming, exploring, and relating. The instruction portion of a discourse is organized by these five activities, together with two further topics: data concepts and programming skills. The data concepts and programming skills subsections review prerequisite skills for wrangling data. If you already have programming and data science skills, you may be able to skim or skip these sections. The remaining five topics are our five Data Wrangler activities. This organization is designed to give the reader an understanding of how the current material fits into the skill set of a Data Wrangler. Not all seven topics are included in each discourse.
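As a rough illustration of how the five activities can fit together, here is a minimal pandas sketch. The data, the column names, and the pairing of each activity with one particular pandas call are assumptions made for this example only; the discourses that follow cover each activity in much more depth and may define them more broadly.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for files you would normally read from disk.
people_csv = io.StringIO(
    "id,name,height_cm\n"
    "1,Ana,170\n"
    "2,Ben,\n"
    "3,Cy,182\n"
)
visits_csv = io.StringIO(
    "id,visits\n"
    "1,3\n"
    "3,5\n"
)

# Acquiring: load the raw data into data frames.
people = pd.read_csv(people_csv)
visits = pd.read_csv(visits_csv)

# Cleaning: drop the rows that have a missing height.
people_clean = people.dropna(subset=["height_cm"])

# Transforming: derive a new column from an existing one.
people_clean = people_clean.assign(height_m=people_clean["height_cm"] / 100)

# Exploring: summarize a variable.
mean_height_m = people_clean["height_m"].mean()

# Relating: combine two tables on a shared key.
combined = people_clean.merge(visits, on="id", how="left")

print(combined)
print("Mean height (m):", mean_height_m)
```

In practice these steps are rarely this tidy or this linear; exploring the data often sends you back to do more cleaning or transforming.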

The chapters and their discourses are designed to be useful both collectively, as a complete introduction to wrangling, and individually, as a reference. If you are new to programming or data science, we recommend that you work through the chapters in order. If you already have some wrangling skills and need to expand your skills in one of the wrangling activity areas, an individual chapter or discourse can be worked through on its own.

The discourses of this book are designed to teach wrangling skills using either the tidyverse for R or pandas for Python. These packages are extensions of their base languages. There will be some introductory-level instruction on the base language tools, enough to use the wrangling packages and solve real-world wrangling problems. The instruction in this book should not be thought of as a course in either R or Python, since much of these languages will not be taught. While you will not be an R or Python programmer at the end of this book, you will be a tidyverse or pandas programmer.

The skills-first, language-second approach of this book allows a focus on the essential skill of wrangling data without being tied to the implementation in any one language. R and Python, and the packages used in this book, are similar enough that once you have learned to wrangle in one, moving to the other should not be too difficult. We recommend that you select one of these language/package pairs and follow the examples for that pair your first time through the articles.

The RStudio IDE (integrated development environment) is used in this course for the examples in both R and Python. RStudio has integrated support for R and Python and allows Python and R code to be used together in one wrangling task. If you are not familiar with the RStudio IDE, you can get a brief overview of what you need for these discourses by reviewing The RStudio IDE and RStudio Projects section of the R - A Brief Introduction knowledge base article.