R for Data Science

Logistics

Person Role Contact
Alejandro Schuler Instructor alejandro.schuler@berkeley.edu

Overview

This course will make you an expert at data I/O, transformation, programming, and visualization in R. We will use a consistent set of packages for these tasks called the tidyverse.

This is not a traditional programming or computer science course. It is meant to be an applied tour of how to actually use R for your data science needs. We also will not cover statistical analysis of data in this course, but the curriculum is a useful prerequisite for subsequent courses on statistics or machine learning.

This course is not graded, and there are no assignments or homework. The lectures are just to get you started; they will be frequently interrupted by active learning exercises that you will be asked to complete in pairs or small groups. That’s where the real learning will happen!

Prerequisites

No prior experience with R is expected. Those with experience using R will still likely find much of value in this course since it covers a more modern style of R programming that has gained traction in the past decade.

We will use R through the RStudio interface. The easiest way to access RStudio is through the cloud: posit.cloud. It’s fast and easy: just go to the link, click “get started”, and create an account. Once you’re in, click “new project” near the upper-right and the RStudio interface will open.

Alternatively, you can install R and RStudio on your own computer: follow this link and click on the appropriate options for your operating system to install R, then do the same to install RStudio.

Learning Goals

By the end of the course, you will be able to:

  • comfortably use R through the RStudio interface
  • read and write tabular data between R and flat files
  • subset, transform, summarize, join, and plot data
  • write reusable and readable programs
  • seek out, learn, and integrate new packages and code into your analyses
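To give a flavor of what these goals look like in practice, here is a minimal sketch of the kind of code the later modules build toward. It is only an illustration, not course material: it assumes the tidyverse is installed and R is version 4.1 or later (for the `|>` pipe), it uses the built-in `mtcars` dataset as a stand-in for data you would read from a file, and the names `mpg_by_cyl`, `kpl`, and `mean_kpl` are made up for this example.

```r
# Assumes the packages are installed, e.g. via install.packages("tidyverse")
library(dplyr)
library(ggplot2)

# mtcars ships with R; here it stands in for tabular data read from a flat file
mpg_by_cyl <- mtcars |>
  filter(hp > 100) |>                       # subset: keep cars with more than 100 horsepower
  mutate(kpl = mpg * 0.425) |>              # transform: miles per gallon to (roughly) km per litre
  group_by(cyl) |>                          # group by number of cylinders
  summarize(mean_kpl = mean(kpl), n = n())  # summarize: one row per group

# plot: one bar per cylinder count
p <- ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mean_kpl)) +
  geom_col()
print(p)
```

Each step in the pipeline maps onto one of the verbs in the list above; you are not expected to be able to read this yet.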

Textbook

I recommend the fantastic book R for Data Science, 2nd edition (R4DS:2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund (O’Reilly Media, 2023); it is available online and in hard copy. In the schedule below I have mapped the book chapters to the modules in the course if you want to do your own reading.

Schedule and Slides

Module 1: Intro and Plotting
Learning goals:
  • issue commands to R using the RStudio REPL interface
  • load a package into R
  • read some tabular data into R
  • visualize tabular data using ggplot geoms, aesthetics, and facets
Packages:
  • ggplot2
Reading: R4DS ch. 1, 9-11

Module 2: R Programming
Learning goals:
  • save values to variables
  • find and call R functions with multiple arguments by position and name
  • recognize and index vectors and lists
  • recognize, import, and inspect data frames
  • issue commands to R using the RStudio script pane
Packages:
  • tibble
  • readr
Reading: R4DS ch. 2, 4, 6, 8, 20

Module 3: Tabular Data
Learning goals:
  • filter rows of a dataset based on conditions
  • arrange rows of a dataset based on one or more columns
  • select columns of a dataset
  • mutate existing columns to create new columns
  • use the pipe to combine multiple operations
Packages:
  • dplyr
Reading: R4DS ch. 3, 12-16, 18

Module 4: Advanced Tabular Data
Learning goals:
  • group and summarize data by one or more columns
  • transform between long and wide data formats
  • combine multiple data frames using joins on one or more columns
Packages:
  • dplyr
  • tidyr
Reading: R4DS ch. 3, 5, 19

Module 5: Functional Programming
Learning goals:
  • write your own functions
  • iterate functions over lists of arguments
Packages:
  • purrr
Reading: R4DS ch. 25, 26