[1] 3
The R console window is the left (or lower-left) window in RStudio. The R console uses a “read, eval, print” loop. This is sometimes called a REPL.
3 is the answer. Ignore the [1] for now.
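As a rough illustration of the read, eval, print loop: the input that produced the output above is not shown, so the expression `1 + 2` and the name `answer` below are assumptions chosen only because they yield 3.

```r
# Type an expression at the console prompt and press Enter.
# R reads the line, evaluates it, and prints the result.
1 + 2

# A result can also be stored under a name (hypothetical here) and printed later.
answer = 1 + 2
print(answer)
```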
R performs operations (called functions) on data and values
These can be composed arbitrarily
?function_name gives you information about what the function does
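A small sketch of the points above, using two base R functions (`abs()` and `sqrt()`) picked purely for illustration; the name `composed` is hypothetical.

```r
# Each function takes a value and returns a value.
abs(-16)
sqrt(16)

# Functions can be composed: the output of abs() becomes the input of sqrt().
composed = sqrt(abs(-16))
composed

# To read the documentation for a function, run this at the console:
# ?sqrt
```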
Solutions to a quadratic equation \(ax^2 + bx + c = 0\) are given by \[x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\]
Figure out how to use R functions and operations for square roots, exponentiation, and multiplication to calculate x given a=3, b=14, c=-5.
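To show the operators involved without giving away the exercise, here is a sketch with different, made-up coefficients (a=1, b=-3, c=2, i.e. \(x^2 - 3x + 2 = 0\)). The variable names are illustrative assumptions.

```r
# Hypothetical coefficients, not the exercise values.
a = 1
b = -3
c = 2

# ^ is exponentiation, * is multiplication, sqrt() takes a square root.
discriminant = b^2 - 4 * a * c

# The "plus" and "minus" branches of the formula give the two solutions.
x_plus  = (-b + sqrt(discriminant)) / (2 * a)
x_minus = (-b - sqrt(discriminant)) / (2 * a)

x_plus
x_minus
```

Swapping in the exercise's values for `a`, `b`, and `c` reuses the same three lines of arithmetic.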
The tidyverse package has a function called read_csv() that lets you read csv (comma-separated values) files into R.
# I have a file called "lupusGenes.csv" on GitHub that we can read from the URL
genes = read_csv("https://tinyurl.com/4vjrbwce")
Error in read_csv("https://tinyurl.com/4vjrbwce"): could not find function "read_csv"
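A "could not find function" error like this one usually means the package that provides the function has not been loaded. A minimal sketch of a version that runs, assuming the tidyverse package is installed; the temporary file and the names `example_path` and `example_data` are hypothetical stand-ins, so the example does not depend on network access.

```r
library(tidyverse)  # loads readr, which provides read_csv()

# Write a tiny csv file to a temporary location (values taken from the
# first two samples of the lupus dataset shown in this section).
example_path = tempfile(fileext = ".csv")
writeLines(c("sampleid,age,phenotype",
             "GSM3057239,70,SLE",
             "GSM3057241,78,SLE"),
           example_path)

# read_csv() takes the location of the file as its first argument.
example_data = read_csv(example_path)
example_data

# Once the package is loaded, the same call works with a URL:
# genes = read_csv("https://tinyurl.com/4vjrbwce")
```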
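The error above occurs because read_csv() was called before the tidyverse package was loaded; load it first with library(tidyverse).
read_csv() requires you to tell it where to find the file you want to read in:
On Windows: "C:\\Users\\me\\Desktop\\myfile.csv" (backslashes must be doubled inside an R string; forward slashes also work)
On a Mac: "/Users/me/Desktop/myfile.csv"
On the web: "http://www.mywebsite.com/myfile.csv"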
genes is now a dataset loaded into R. To look at it, just type its name:
genes
# A tibble: 59 × 11
   sampleid     age gender ancestry  phenotype FAM50A ERCC2 IFI44 EIF3L  RSAD2
   <chr>      <dbl> <chr>  <chr>     <chr>      <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 GSM3057239    70 F      Caucasian SLE         18.6  4.28  18.0  182.   25.5
 2 GSM3057241    78 F      Caucasian SLE         20.3  3.02  21.1  157.   37.2
 3 GSM3057243    64 F      Caucasian SLE         21.4  4.00  488.  169.   792.
 4 GSM3057245    32 F      Asian     SLE         17.1  4.49  34.0  149.   60.7
 5 GSM3057247    33 F      Caucasian SLE         20.9  5.00  34.4  224.   60.8
 6 GSM3057249    46 M      Maori     SLE         15.8  3.96  466.  111.  1382.
 7 GSM3057251    45 F      Asian     SLE         18.9  6.04  299.  157.   926.
 8 GSM3057253    67 M      Caucasian SLE         27.6  4.77  21.8  265.   20.6
 9 GSM3057255    33 F      Caucasian SLE         15.4  3.88  700.  98.6  1652.
10 GSM3057257    28 F      Caucasian SLE         19.9  7.21  278.  217.   972.
# ℹ 49 more rows
# ℹ 1 more variable: VAPA <dbl>
This is a data frame, one of the most powerful features in R (a “tibble” is a kind of data frame).
- Similar to an Excel spreadsheet.
- One row ~ one instance of some (real-world) object.
- One column ~ one variable, containing the values for the corresponding instances.
- All the values in one column should be of the same type (a number, a category, text, etc.), but different columns can be of different types.
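The row and column rules can be sketched with a tiny hand-built data frame. The name `patients` is hypothetical; the values echo the first three samples printed above.

```r
# Each row is one sample; each column is one variable.
patients = data.frame(
  sampleid  = c("GSM3057239", "GSM3057241", "GSM3057243"),  # text
  age       = c(70, 78, 64),                                # numbers
  phenotype = c("SLE", "SLE", "SLE"),                       # categories stored as text
  stringsAsFactors = FALSE
)
patients

# Every column holds a single type, but the types differ between columns.
column_types = sapply(patients, class)
column_types
```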
This is a subset of a real RNA-seq dataset (GSE112087) comparing RNA levels in blood between lupus (SLE) patients and healthy controls.
Let’s say we’re curious about the relationship between two genes, RSAD2 and IFI44.
ggplot(dataset) says “start a chart with this dataset”
+ geom_point(...) says “put points on this chart”
aes(x=x_values, y=y_values) says “map the values in the column x_values to the x-axis, and map the values in the column y_values to the y-axis” (aes is short for aesthetic)
ggplot is short for “grammar of graphics plot”
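Put together, the pieces described above might look like the sketch below. To keep it self-contained, it uses a small stand-in tibble (`genes_small`, a hypothetical name) built from the RSAD2 and IFI44 values of the first five samples shown earlier, rather than the full `genes` dataset.

```r
library(tidyverse)  # provides tibble(), ggplot(), geom_point(), and aes()

# Stand-in for the genes dataset: first five samples only.
genes_small = tibble(
  RSAD2 = c(25.5, 37.2, 792, 60.7, 60.8),
  IFI44 = c(18.0, 21.1, 488, 34.0, 34.4)
)

# Start a chart with the dataset, then add a layer of points whose
# x and y positions come from the RSAD2 and IFI44 columns.
scatter = ggplot(genes_small) +
  geom_point(aes(x = RSAD2, y = IFI44))

scatter  # printing the plot object draws it
```

With the full dataset loaded, replacing `genes_small` with `genes` should give the corresponding plot for all 59 samples.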
ggplot() and geom_point() are functions imported from the ggplot2 package, which is one of the “sub-packages” of the tidyverse package we loaded earlier
Make a scatterplot of phenotype vs IFI44 (another gene in the dataset). The result should look like this:
Let’s say we’re curious about the relationship between RSAD2 and IFI44.
Can you recreate this plot?
What will this do? Why?
Use ?geom_point to see what aesthetics are expected or allowed
Aesthetic mappings can be passed to the ggplot function directly instead of to each geom individually
Use Google or other resources to figure out how to recreate this plot in R:
ggplot(genes) +
...
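As a sketch of passing aesthetics to ggplot() directly rather than to each geom: the stand-in data, the choice of `color = age`, and the extra `geom_line()` layer are all illustrative assumptions, not the plot the exercise asks for.

```r
library(tidyverse)

# Stand-in for the genes dataset: first five samples shown earlier.
genes_small = tibble(
  age   = c(70, 78, 64, 32, 33),
  RSAD2 = c(25.5, 37.2, 792, 60.7, 60.8),
  IFI44 = c(18.0, 21.1, 488, 34.0, 34.4)
)

# Mappings given to ggplot() are inherited by every geom added afterwards.
shared = ggplot(genes_small, aes(x = RSAD2, y = IFI44, color = age)) +
  geom_point() +
  geom_line()   # reuses the same x, y, and color mappings

# The point layer alone, with the mappings passed to the geom instead.
per_geom = ggplot(genes_small) +
  geom_point(aes(x = RSAD2, y = IFI44, color = age))
```

Passing the mappings once to ggplot() avoids repeating them when several geoms share the same axes.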
facet_wrap is good for faceting according to unordered categories
facet_grid is better for ordered categories, and can be used with two variables
Use ggplot to investigate the relationship between gene expression and lupus using any combination of any kinds of plots that you like. Which genes are most associated with lupus? Does this vary by ancestry, age, or assigned sex at birth?
For some plots it may be helpful to reformat your data using this code (we’ll learn how to do this on day 4):
gene_names = names(genes)[6:11]
reformatted_genes = pivot_longer(genes, all_of(gene_names), names_to='gene', values_to='expression')
reformatted_genes
# A tibble: 354 × 7
   sampleid     age gender ancestry  phenotype gene   expression
   <chr>      <dbl> <chr>  <chr>     <chr>     <chr>       <dbl>
 1 GSM3057239    70 F      Caucasian SLE       FAM50A       18.6
 2 GSM3057239    70 F      Caucasian SLE       ERCC2        4.28
 3 GSM3057239    70 F      Caucasian SLE       IFI44        18.0
 4 GSM3057239    70 F      Caucasian SLE       EIF3L        182.
 5 GSM3057239    70 F      Caucasian SLE       RSAD2        25.5
 6 GSM3057239    70 F      Caucasian SLE       VAPA         159.
 7 GSM3057241    78 F      Caucasian SLE       FAM50A       20.3
 8 GSM3057241    78 F      Caucasian SLE       ERCC2        3.02
 9 GSM3057241    78 F      Caucasian SLE       IFI44        21.1
10 GSM3057241    78 F      Caucasian SLE       EIF3L        157.
# ℹ 344 more rows
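One possible starting point for combining the reshaped data with faceting is sketched below. It assumes a tiny stand-in tibble (`genes_small`, hypothetical) with two samples and two genes taken from the output above; the choice of axes is only an illustration.

```r
library(tidyverse)

# Stand-in for the genes dataset: two samples, two gene columns.
genes_small = tibble(
  sampleid  = c("GSM3057239", "GSM3057241"),
  phenotype = c("SLE", "SLE"),
  IFI44     = c(18.0, 21.1),
  RSAD2     = c(25.5, 37.2)
)

# One row per sample-gene pair instead of one column per gene.
long_genes = pivot_longer(genes_small, all_of(c("IFI44", "RSAD2")),
                          names_to = "gene", values_to = "expression")

# facet_wrap() draws one panel per gene (an unordered category).
faceted = ggplot(long_genes) +
  geom_point(aes(x = phenotype, y = expression)) +
  facet_wrap(~ gene)

faceted
```

Replacing the last layer with `facet_grid(phenotype ~ gene)` would instead arrange the panels by two variables.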