What does this code do? (note that df$col is a shortcut for df |> pull(col))
df$a = (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b = (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c = (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d = (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
- The name rescale() is informative and communicates what the function does
- min() is being computed twice in the function body, which is inefficient
- Why not use range() instead of getting and saving the results of min() and max() separately?

We can also test our function in cases where we know what the output should be, to make sure it works as intended before we let it loose on the real data
[1] 0 0 0 0 0 1
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
[1] TRUE
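The calls that produced the output above are not shown. A minimal sketch of checks in the same spirit, assuming the rescale() definition that appears later in this section; the test inputs are hypothetical, chosen because the correct answers are easy to work out by hand:

```r
# Definition taken from later in this section
rescale <- function(x, na.rm = TRUE) {
  x_rng <- range(x, na.rm = na.rm)
  (x - x_rng[1]) / (x_rng[2] - x_rng[1])
}

# Hypothetical inputs whose rescaled values are known in advance
rescale(c(0, 0, 0, 0, 0, 1))  # already between 0 and 1, so unchanged
rescale(0:10)                 # evenly spaced from 0 to 1
rescale(-10:0)                # shifting the input should not change the result

# all.equal() compares numbers with a small tolerance for floating point error
all.equal(rescale(0:10), rescale(-10:0))
```

Checks like these are cheap to write and catch copy-paste mistakes before the function is applied to real columns.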
To write a function, just wrap your code in some special syntax that tells it what variables will be passed in and what will be returned
- FUNCTION_NAME now isn’t a data frame, vector, etc.; it’s a function object that gets created by the function(..) {...} syntax
- You can explicitly return() the value, or R will automatically return the result of the last line of code in the body that gets run

FUNCTION_NAME <- function(ARGUMENTS...) { CODE }

- The argument names in the function(...) part are how the function will refer to these inputs internally, and they specify how it should be called using named arguments

aes = function(x,y) {...} # defining
aes(x=EIF3L, y=VAPA) # calling
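To make the two ways of returning a value concrete, here is a small sketch. The function names add_one() and safe_sqrt() are hypothetical and exist only for illustration:

```r
# Implicit return: the value of the last expression evaluated is returned
add_one <- function(x) {
  x + 1
}

# Explicit return: return() exits the function right away with that value
safe_sqrt <- function(x) {
  if (x < 0) {
    return(NA_real_)  # leave early for negative input
  }
  sqrt(x)             # otherwise the last line is the result
}

add_one(x = 2)   # calling with a named argument
safe_sqrt(9)
safe_sqrt(-1)
```

An explicit return() is mostly useful for early exits like the one above; for a short function the implicit form is usually enough.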
To make an argument optional, add an = after its name in the function(...) declaration and write in the default value that you would like it to take
rescale = function(x, na.rm=TRUE) {
x_rng = range(x, na.rm=na.rm)
(x - x_rng[1])/(x_rng[2] - x_rng[1])
}
vec = c(0,1,NA)
rescale(vec)
[1] 0 1 NA
rescale(vec, na.rm=TRUE)
[1] 0 1 NA
rescale(vec, na.rm=FALSE)
[1] NA NA NA
- It’s annoying that the sum() function returns NA if any values of the input vector are NA. You can fix this by passing in the optional argument na.rm=T every time you call sum(), but it’s inconvenient to type that every single time.
- Write a new function (called sum_obs(), short for “sum observed”) that takes a vector and returns the sum of all the non-NA values. Your function should call the usual sum() internally.
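One possible approach, offered as a sketch rather than the intended answer: wrap sum() and fix na.rm to TRUE inside the body.

```r
# Sum of the observed (non-NA) values; delegates to the usual sum()
sum_obs <- function(x) {
  sum(x, na.rm = TRUE)
}

sum_obs(c(1, 2, NA))  # 3
```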
- Write a function called both_na() that takes two vectors of the same length and returns the total number of positions that have an NA in both vectors
- Make a few vectors and test out your code
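A sketch of one way this could be written, assuming that stopping with an error is an acceptable response to inputs of different lengths (the prompt does not say what should happen in that case):

```r
# Count positions where both vectors are NA
both_na <- function(x, y) {
  # Assumption: unequal lengths are treated as an error
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}

# Only position 2 is NA in both vectors
both_na(c(1, NA, NA, 4), c(NA, NA, 3, 4))  # 1
```

The sum() of a logical vector counts its TRUE values, which is what turns the element-wise comparison into a single number.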
min() and then max()?
function(x, na.rm=TRUE) {
x_rng = range(x, na.rm=na.rm)
(x - x_rng[1])/(x_rng[2] - x_rng[1])
}
<bytecode: 0x126031510>
data_frames is a list of three data frames. Use map to output:
- One way to iterate is the for loop, which is common to many other languages as well. It looks like this:
- The for loop is very flexible and you can do a lot with it
- for loops are unavoidable when the result of one iteration depends on the result of the last iteration
- Compare with the map()-style solution:
- Compared to the for loop, the map() syntax is much more concise and eliminates much of the “admin” code in the loop (setting up indices, initializing the list that will be filled in, indexing into the data structures)
- The map() syntax also encourages you to write a function for whatever is happening inside the loop. This means you have something that’s reusable and easily testable, and your code will look cleaner

\(ARGUMENTS) BODY
function(ARGUMENTS) {BODY}
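The comparison between the two styles can be sketched on a toy input. The list inputs and the result names below are hypothetical; the last two calls show that the \(x) shorthand and the longer function(x) {...} form define the same anonymous function:

```r
library(purrr)

# Hypothetical input: a named list of numeric vectors
inputs <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# for loop: create the output list, loop over indices, fill in each slot
means_loop <- vector("list", length(inputs))
names(means_loop) <- names(inputs)
for (i in seq_along(inputs)) {
  means_loop[[i]] <- mean(inputs[[i]])
}

# map(): the same computation without the bookkeeping
means_map <- map(inputs, mean)

# Anonymous functions: shorthand and long form
doubled_short <- map(inputs, \(x) mean(x) * 2)
doubled_long  <- map(inputs, function(x) { mean(x) * 2 })

all.equal(means_loop, means_map)
```

The loop needs three lines of setup and indexing around a single line of real work, which is the "admin" code that map() takes over.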
Earlier we saw this example of reading in multiple files:
map() typically returns a list (why?)

My friend is interested in whether people prefer vanilla or chocolate ice cream in San Francisco, so he surveyed 20 random people, 14 of whom preferred vanilla. Based on the overwhelming majority in the survey, he concludes that most people in SF like vanilla.
Could he be wrong? Is it possible he got a lucky (or unlucky) sample and would have gotten a different answer if he repeated his survey? Let’s presume that, in reality, only 49% of people prefer vanilla (in other words, most people actually like chocolate and my friend is wrong). If we could observe that data it would look like this:
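The original data display is not reproduced here. As a rough sketch, the population and a single survey could be simulated as follows; the population size of 100,000 and the seed are assumptions made for illustration, and only the 49% share and the sample size of 20 come from the text:

```r
set.seed(2023)  # arbitrary seed so the sketch is reproducible

# Assumed population of 100,000 people, 49% of whom prefer vanilla
population <- c(rep("vanilla", 49000), rep("chocolate", 51000))
mean(population == "vanilla")  # share preferring vanilla: 0.49

# One survey: ask 20 randomly chosen people
survey <- sample(population, size = 20)
sum(survey == "vanilla")       # number in this sample who prefer vanilla
```

Rerunning the last two lines without the seed gives a different count each time, which is exactly the sampling variability the question is about.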
TRUE or FALSE). This simulates my friend’s survey.
- Use map to run this function 500 times (hint: pass 1:500 as the first argument to map) and record all the results as a logical vector

pmap() iterates over several inputs in parallel. For example, imagine you want to draw random numbers between a and b as both of those vary:

imap() lets you operate on the names of the input list.

$class_A
# A tibble: 5 × 2
grade class
<dbl> <chr>
1 90 class_A
2 87 class_A
3 92 class_A
4 78 class_A
5 69 class_A
$class_B
# A tibble: 7 × 2
grade class
<dbl> <chr>
1 88 class_B
2 85 class_B
3 76 class_B
4 78 class_B
5 77 class_B
6 97 class_B
7 91 class_B
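The code behind these two helpers is not shown, so here is a minimal sketch of each. The bounds in the pmap example are hypothetical; the grades in the imap example are copied from the printed output above, and adding the class column this way is one assumption about how that output could have been produced:

```r
library(purrr)
library(tibble)

# pmap-style iteration: walk over several inputs in parallel.
# Hypothetical bounds: one uniform draw between each pair (a, b)
bounds <- list(a = c(0, 10, 100), b = c(1, 20, 200))
draws <- pmap_dbl(bounds, \(a, b) runif(1, min = a, max = b))

# imap(): the function receives each element together with its name
grades <- list(
  class_A = tibble(grade = c(90, 87, 92, 78, 69)),
  class_B = tibble(grade = c(88, 85, 76, 78, 77, 97, 91))
)
labelled <- imap(grades, \(df, name) {
  df$class <- name  # record which list element each row came from
  df
})

labelled
```

pmap_dbl() is the variant of pmap() that returns a numeric vector instead of a list; the list elements are matched to the function arguments a and b by name.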
expand_grid() gives you every combination of the items in the list you pass it

Fill in the missing parts of the code below to programmatically create the following table:
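As a quick illustration of expand_grid() on its own, separate from the exercise, the following sketch uses two hypothetical inputs, x and y:

```r
library(tidyr)

# Every combination of the two inputs, one row per combination
combos <- expand_grid(x = 1:3, y = c("a", "b"))
combos
nrow(combos)  # 6 rows: 3 values of x times 2 values of y
```

Each column of the result can then be passed to a map-style function to run something once per combination.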
My collaborator has an online folder of experimental results named results that can be found at "https://raw.githubusercontent.com/alejandroschuler/r4ds-courses/summer-2023/data/results". In that folder, there are 20 sub-folders that represent the results of each repetition of her experiment. These sub-folders are each named rep_n, so, e.g., results/rep_14 would be one sub-folder. Within each sub-folder, there are 3 csv files called a.csv, b.csv, and c.csv that contain different kinds of measurements. Thus, a full path to one of these files might be results/rep_14/c.csv.
Write code to read these all into one long list of data frames. str_c() or glue() will be helpful to create the required file names.
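As a starting point, the file names could be built along these lines. This sketch only constructs the URLs and does not download anything; it assumes the repetitions are numbered 1 through 20, which the description implies but does not state outright:

```r
library(tidyr)
library(stringr)

base_url <- "https://raw.githubusercontent.com/alejandroschuler/r4ds-courses/summer-2023/data/results"

# Assumption: sub-folders are rep_1, rep_2, ..., rep_20
files <- expand_grid(rep = 1:20, file = c("a", "b", "c"))

# str_c() is vectorized, so this builds one URL per row of files
urls <- str_c(base_url, "/rep_", files$rep, "/", files$file, ".csv")

length(urls)  # 60: 20 repetitions times 3 files
urls[1]
```

Reading the files would then be a single map() call over urls with a csv reader, which needs network access and is left out here.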
Unfortunately, that wasn’t helpful because now you don’t know which data frames hold which results. Consider just the “a” files. Write code that reads in only the “a” files and concatenates them into one data frame. Include a column in this data frame that indicates which experimental repetition each row of the data frame came from (use imap()).
Turn your code from above into a function that takes as input the file name ('a', for example) and returns the single concatenated data frame. Iterate that function over the different file names to output three master data frames corresponding to the file types 'a', 'b', and 'c'.