R Programming

Alejandro Schuler

Learning Goals:

save values to variables
find and call R functions with multiple arguments by position and name
recognize and index vectors and lists
recognize, import, and inspect data frames
issue commands to R using the Rstudio script pane

Programming Basics

We’ve seen code like

genes = read_csv("https://tinyurl.com/cjkuecnc")

We know this reads a .csv from a file and creates something called a “data frame”
We’ve been using this data frame in code like

ggplot(genes) + 
  geom_bar(aes(x = ancestry, fill = phenotype))

But what does this syntax really mean? Is it useful outside of making plots?

Assignment

To do complex computations, we need to be able to give names to things.

genes = read_csv("https://tinyurl.com/cjkuecnc")

This code assigns the result of running read_csv("https://tinyurl.com/cjkuecnc") to the name gene
You can do this with any values and/or functions

x = 1

R prints no result from this assignment, but what you entered causes a side effect: R has stored the association between x and the result of this expression (look at the Environment pane.)

Variables

[1] 1

x / 5

[1] 0.2

When R sees the name of a variable, it uses the stored value of that variable in the calculation.
We can break complex calculations into named parts. This is a simple, but very useful kind of abstraction.

Two ways to assign

In R, there are (unfortunately) two assignment operators. They have subtly different meanings (more details later).

<- requires that you type two characters but better captures spirit of assignmnet
= is easier to type but incorrectly suggests mathematical equality
You will see both used throughout R and user code.

x <- 10
x

[1] 10

x = 20
x

[1] 20

Assignment has no undo

x = 10
x

[1] 10

x = x + 1
x

[1] 11

If you assign to a name with an existing value, that value is overwritten.
There is no way to undo an assignment, so be careful in reusing variable names.

Naming variables

It is important to pick meaningful variable names.
Names can be too short, so don’t use x and y everywhere.
Names can be too long (Main.database.first.object.header.length).
Avoid silly names.
Pick names that will make sense to someone else (including the person you will be in six months).
ADVANCED: See ?make.names for the complete rules on what can be a name.

There are different conventions for constructing compound names. Warning: disputes over the right way to do this can get heated.

stringlength
string.length
StringLength
stringLength
string_length (underbar)
string-length (hyphen)

To be consistent with the packages we will use, I recommend snake_case where you separate lowercase words with _
Note that R itself uses several of these conventions.
One of these won’t work. Which one and why?

a = 1
A # this causes an error because A does not have a value

Error: object 'A' not found

R cares about upper and lower case in names.
names can’t start with numbers

for = 7 # this causes an error

for is a reserved word in R. (It is used in loop control.)
ADVANCED: see ?Reserved for the complete rules.

Exercise: birth year

Make a variable that represents the age you will be at the end of this year
Make a variable that represents the current year
Use them to compute the year of your birth and save that as a variable
Print the value of that variable

Assignment and Reference

x = 2
y = x
y

[1] 2

x = 1
y

[1] 2

What do you observe?

x = 2
y = x
y

[1] 2

x = 1
y

[1] 2

Functions

Calling functions

To call a function, type the function name, then the argument or arguments in parentheses. (Use a comma to separate the arguments, if more than one.)

sqrt(2)

[1] 1.414214

Arguments

Functions transform inputs to outputs
internally, however, they have an environment just like the one you see in your workspace
when you call a function, you tell it how to connect the variables in your environment to the ones it expects to have so that it can do its job
the names the function calls these inputs inside itself will be different than what you call them on the outside

aes(x=EIF3L, y=VAPA)

Named and positional arguments

Arguments can be supplied by name using the syntax variable = value.
you can see the names of the arguments in the help page for each function
When using names, the order of the named arguments does not matter.

ggplot(data=genes) + 
  geom_point(mapping=aes(y=EIF3L, x=VAPA))

If you leave the names off, R defaults to a positional order that is specific to each function (e.g. for aes(), x comes first, then y)
you can see the default order of the arguments in the help page for each function

ggplot(genes) + 
  geom_point(aes(VAPA, EIF3L))

Optional arguments

Many R functions have arguments that you don’t always have to specify. For example:

file_name = "https://tinyurl.com/cjkuecnc"
genes_10 = read_csv(file_name, n_max=10) # only read in 10 rows
genes = read_csv(file_name)

n_max tells read_csv() to only read the first 10 rows of the dataset.
If you don’t specify it, it defaults to infinity (i.e. R reads until there are no more lines in the file).

Exercise [together]

Why does this code generate errors?

ggplot(the_data=genes) + 
  geom_point(mapping=aes(y_axis=EIF3L, x_axis=VAPA))

Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error:
! object 'EIF3L' not found

Exercise [together]

I’m trying to generate this plot:

But when I use this code, I get:

ggplot(data=genes) + 
  geom_point(aes(VAPA, EIF3L))

What am I doing wrong?

Functions, assignment, and reference

x = 2
x^2

[1] 4

[1] 2

What do you observe?

x = 2
x^2

[1] 4

[1] 2

functions generally do not affect the variables you pass to them (x remains the same after sqrt(x))

Vectors

Repetitive calculations

x1 = 1
x2 = 2
x3 = 3

Let’s say I have these variables and I want to add 1 to all of them and save the result.

y1 = 1 + x1
y2 = 1 + x2
y3 = 1 + x3

This does the trick but it’s a lot of copy-paste

Vectors

Vectors solve the problem

x = c(1,2,3)
y = x + 1
y

[1] 2 3 4

A vector is a one-dimensional sequence of zero or more values
Vectors are created by wrapping the values separated by commas with the c( ) function, which is short for “combine”
Many R functions and operators (like +) automatically work with multi-element vector arguments.

Elementwise operations

This multiplies each element of c(1,2,3) by the corresponding element of c(4,5,6)

c(1,2,3) * c(4,5,6)

[1]  4 10 18

Many basic R functions operate on multi-element vectors as easily as on vectors containing a single number.

sqrt(c(1,2,3))

[1] 1.000000 1.414214 1.732051

c(1,2,3)^3

[1]  1  8 27

log(c(1,2,3))

[1] 0.0000000 0.6931472 1.0986123

Summaries

some R functions take vectors as arguments and summarize them instead of applying elementwise

numbers <- c(9, 12, 6, 10, 10, 16, 8, 4)
numbers

[1]  9 12  6 10 10 16  8  4

sum(numbers)

[1] 75

sum(numbers)/length(numbers)

[1] 9.375

mean(numbers)

[1] 9.375

Exercise: subtract the mean

A particular class has two quizzes which are taken by the same three students. Thier scores are below:

quiz_1_scores = c(70, 90, 55)
quiz_2_scores = c(76, 88, 70)

Write code to see if the quiz 1 average is higher than or lower than the quiz 2 average
Are students improving? Subtract the quiz 1 scores from the quiz 2 scores and take the average of the resulting differences to find out.

Exercise: a vector of variables [together]

Predict the output of the following code:

a = 1
b = 2
x = c(a,b)

a = 3
print(x)

Ranges

1:50

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

The colon : is a handy shortcut to create a vector that is a sequence of integers from the first number to the second number (inclusive).
Ranges can go the other way too and include negative numbers, e.g. 5:-5
Long vectors wrap around. (Your screen may have a different width than what is shown here.)

Indexing

x = c("a", "b", "c", "d")
x[1] # same as x[c(1)] since 1 is already a vector (of length 1)

[1] "a"

x[2:4]

[1] "b" "c" "d"

x[c(3, 1)]

[1] "c" "a"

x[c(1,1,1,1,1,1,4)]

[1] "a" "a" "a" "a" "a" "a" "d"

Indexing returns a subsequence of the vector. It does not change the original vector. Assign the result to a new variable to save it if you neeed it later.
R starts counting vector indices from 1.
You can index using a multi-element index vector.
You can repeat index positions

Exercise: reading vector code

What does this code do?

x = c("a", "b", "c", "d", "e") # some vector
x[length(x):1]

read inside out: first figure out what length(x) does, then think about what the output of length(x):1 should do, and then finally x[length(x):1]

Indexed Assignment

you can assign into an indexed position

[1] "a" "b" "c" "d"

x[1] = 'Z'
x

[1] "Z" "b" "c" "d"

or multiple

[1] "Z" "b" "c" "d"

x[c(1,2)] = c("Z", "X")
x

[1] "Z" "X" "c" "d"

x[c(1,2)] = "Q"
x

[1] "Q" "Q" "c" "d"

Data Types

Strings

text data in R is called a “string”

my_string = "hello"

when using data that is text in R, you have to refer to it using quotation marks (why?)

my_string = hello # what does this code do?

you can have a vector of strings, and functions can operate on these too:

words = c("hello", "how", "are", "you", "?")
paste(words, collapse=" ")

[1] "hello how are you ?"

Factors

library(forcats)

factors represent categorical data

seasons_str = c("spring", "summer", "fall", "winter") # string vector
seasons_str

[1] "spring" "summer" "fall"   "winter"

seasons_fct = fct(seasons_str) # factor vector
seasons_fct

[1] spring summer fall   winter
Levels: spring summer fall winter

this is useful to tightly control data and prevent accidents

seasons_str[1] = "Jan"

seasons_fct[1] = "Jan"

Logicals

c(-2, -1, 0, 1, 2) > 0

[1] FALSE FALSE FALSE  TRUE  TRUE

c(TRUE, TRUE, FALSE)

[1]  TRUE  TRUE FALSE

logical vectors can only be TRUE or FALSE
we’ll see more about this later

Coercion

If you try to do something to a vector of the wrong data type, R will often do its best to “make it work” by converting to another type

TRUE + 2

[1] 3

numbers = c(1,2,3)
numbers[1] = '5'
numbers + 2

Error in numbers + 2: non-numeric argument to binary operator

this is a frequent source of unexpected errors!

Exercise: data types [together]

What types are each of the following vectors? Are they all fundamentally the same, or are they different?

v1 = c(0,1)
v2 = c(FALSE, TRUE)
v3 = c("FALSE", "TRUE")
v4 = fct(v3)

Which of these lines of code will run and which will produce an error?

v1 + 1
v2 + 1
v3 + 1
v4 + 1

NA

R has a special value that represents missing data- it’s called NA

c(1,2,NA,4)

[1]  1  2 NA  4

NA can appear anywhere that R would expect some other kind of data
NA usually ruins computations:

1 + NA + 3

[1] NA

The result makes sense because if I don’t know what I’m adding together, I don’t know the result either
some functions have options to ignore the missing values in vectors:

mean(c(1,2,NA,4), na.rm=TRUE)

[1] 2.333333

Lists

A list is like an atomic vector, except the elements don’t have to be the same type of thing

a_vector = c(1,2,4,5)
maybe_a_vector = c(1,2,"hello",5,TRUE)
maybe_a_vector # R converted all of these things to strings!

[1] "1"     "2"     "hello" "5"     "TRUE"

You make them with list() and you can index them like vectors

a_list = list(1,2,"hello",5,TRUE)
a_list[3:5]

[[1]]
[1] "hello"

[[2]]
[1] 5

[[3]]
[1] TRUE

Anything can go in lists, including vectors, other lists, data frames, etc.
In fact, a data frame (or tibble) is actually just a list of named column vectors with an enforced constraint that all of the vectors have to be of the same length. That’s why the df$col syntax works for data frames.

Getting elements from a list

You can also name the elements in a list

a_list = list(
    first_number = 1,
    second_number = 2,
    a_string = "hello",
    third_number = 5,
    some_logical = TRUE)

and then retrieve elements by name or position

# returns the element named "thrid_number"
a_list$a_string

[1] "hello"

a_list[['a_string']]

[1] "hello"

# returns the 3rd element
a_list[[3]]

[1] "hello"

# subsets the list, so returns a list of length 1 that contains a single element (the third)
a_list[3]

$a_string
[1] "hello"

Examining lists

Use str() to dig into nested lists and other complicated objects

nested_list = lm(hp ~ ., mtcars)
str(nested_list)

List of 12
 $ coefficients : Named num [1:11] 79.048 -2.063 8.204 0.439 -4.619 ...
  ..- attr(*, "names")= chr [1:11] "(Intercept)" "mpg" "cyl" "disp" ...
 $ residuals    : Named num [1:32] -38.68 -30.63 13.01 -15.75 -8.22 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ effects      : Named num [1:32] -829.8 296.3 124.8 -19.6 90.3 ...
  ..- attr(*, "names")= chr [1:32] "(Intercept)" "mpg" "cyl" "disp" ...
 $ rank         : int 11
 $ fitted.values: Named num [1:32] 149 141 80 126 183 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ assign       : int [1:11] 0 1 2 3 4 5 6 7 8 9 ...
 $ qr           :List of 5
  ..$ qr   : num [1:32, 1:11] -5.657 0.177 0.177 0.177 0.177 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
  .. .. ..$ : chr [1:11] "(Intercept)" "mpg" "cyl" "disp" ...
  .. ..- attr(*, "assign")= int [1:11] 0 1 2 3 4 5 6 7 8 9 ...
  ..$ qraux: num [1:11] 1.18 1.02 1.29 1.19 1.05 ...
  ..$ pivot: int [1:11] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ tol  : num 1e-07
  ..$ rank : int 11
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 21
 $ xlevels      : Named list()
 $ call         : language lm(formula = hp ~ ., data = mtcars)
 $ terms        :Classes 'terms', 'formula'  language hp ~ mpg + cyl + disp + drat + wt + qsec + vs + am + gear + carb
  .. ..- attr(*, "variables")= language list(hp, mpg, cyl, disp, drat, wt, qsec, vs, am, gear, carb)
  .. ..- attr(*, "factors")= int [1:11, 1:10] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:11] "hp" "mpg" "cyl" "disp" ...
  .. .. .. ..$ : chr [1:10] "mpg" "cyl" "disp" "drat" ...
  .. ..- attr(*, "term.labels")= chr [1:10] "mpg" "cyl" "disp" "drat" ...
  .. ..- attr(*, "order")= int [1:10] 1 1 1 1 1 1 1 1 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(hp, mpg, cyl, disp, drat, wt, qsec, vs, am, gear, carb)
  .. ..- attr(*, "dataClasses")= Named chr [1:11] "numeric" "numeric" "numeric" "numeric" ...
  .. .. ..- attr(*, "names")= chr [1:11] "hp" "mpg" "cyl" "disp" ...
 $ model        :'data.frame':  32 obs. of  11 variables:
  ..$ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
  ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
  ..$ disp: num [1:32] 160 160 108 258 360 ...
  ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
  ..$ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
  ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
  ..$ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
  ..$ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
  ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
  ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language hp ~ mpg + cyl + disp + drat + wt + qsec + vs + am + gear + carb
  .. .. ..- attr(*, "variables")= language list(hp, mpg, cyl, disp, drat, wt, qsec, vs, am, gear, carb)
  .. .. ..- attr(*, "factors")= int [1:11, 1:10] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:11] "hp" "mpg" "cyl" "disp" ...
  .. .. .. .. ..$ : chr [1:10] "mpg" "cyl" "disp" "drat" ...
  .. .. ..- attr(*, "term.labels")= chr [1:10] "mpg" "cyl" "disp" "drat" ...
  .. .. ..- attr(*, "order")= int [1:10] 1 1 1 1 1 1 1 1 1 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(hp, mpg, cyl, disp, drat, wt, qsec, vs, am, gear, carb)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:11] "numeric" "numeric" "numeric" "numeric" ...
  .. .. .. ..- attr(*, "names")= chr [1:11] "hp" "mpg" "cyl" "disp" ...
 - attr(*, "class")= chr "lm"

Data Frames

Making data frames

use tibble() to make your own data frames from scratch in R

my_data = tibble(
  person = c("carlos", "nathalie", "christina", "alejandro"),
  age = c(33, 48, 8, 29)
)
my_data

# A tibble: 4 × 2
  person      age
  <chr>     <dbl>
1 carlos       33
2 nathalie     48
3 christina     8
4 alejandro    29

Data frame properties

dim() gives the dimensions of the data frame. ncol() and nrow() give you the number of columns and the number of rows, respectively.

dim(my_data)

[1] 4 2

ncol(my_data)

[1] 2

nrow(my_data)

[1] 4

names() gives you the names of the columns (a vector)

names(my_data)

[1] "person" "age"

glimpse() shows you a lot of information, head() returns the first n rows

glimpse(my_data)

Rows: 4
Columns: 2
$ person <chr> "carlos", "nathalie", "christina", "alejandro"
$ age    <dbl> 33, 48, 8, 29

head(my_data, n=2)

# A tibble: 2 × 2
  person     age
  <chr>    <dbl>
1 carlos      33
2 nathalie    48

Writing data frames

write_csv(my_data, "~/Desktop/my_data.csv")

after running this, you’ll see a new file called my_data.csv (or whatever you chose to name it) appear in the specified location on your computer (e.g. Desktop)
you can read and write .csv files in lots of programs (e.g. google sheets)
to read and write other formats look at documentation and use google + chatGPT!

Scripts

Using the script pane

Writing a series of expressions in the console rapidly gets messy and confusing.
The console window gets reset when you restart RStudio.
It is better (and easier) to write expressions and functions in the script pane (upper left), building up your analysis.
There, you can enter expressions, evaluate them, and save the contents to a .R file for later use.
Look at the RStudio ``Code’’ menu for some useful keyboard commands.

Create a script pane: File > New File > R Script
Put your cursor in the script pane.
Type: 1:10^2
Then hit Command-RETURN (Mac), or Ctrl-ENTER (Windows).
That line is copied to the console pane and evaluated.
You can save the script to a file.
Explore the RStudio Code menu for other commands.

Comments

## In this section, we make a vector and reverse its order
x = 1:3 * 10                # make a vector [10, 20 ... ]
x_reversed = x[length(x):1] # reverse its order

Use a # to start a comment.
A comment extends to the end of the line and is ignored by R.
comments are complemented by good code style!

RStudio Pro-tip: scrolling and multicursors

You should also be aware of cmd-<arrow> and alt-<arrow> for moving the cursor (by line and by word)
and cmd-shift-<arrow> and alt-shift-<arrow> for selecting text (by line and by word)
these also combo with shift (to select) and delete
RStudio’s script pane supports multi-cursors! Hold alt and drag your mouse up and down
You can also set a keyboard shortcut for find and add next
These features make it much easier to rename variables, etc.

Exercise: Plotting a parabola

Write an R script that starts with:

A = 1
B = 2
C = 3

In the rest of the script, do the following:

generate an evenly-spaced sequence of 100 values between -5 and 5 (find an R function that does this). Call this x
generate the corresponding vector of y-values y by computing the formula $y = Ax^2 + Bx + C$ elementwise
create a data frame with x and y as columns
use ggplot to create a line plot of x vs y

Run your script to see the generated plot. Try changing the values of A, B, and C at the top of the script and re-running to see how the plot changes.

Your result should look like: