Introduction to R, RStudio, and Quarto
CS&SS 508 • Lecture 1
Jess Kunke (slides adapted from Victoria Sass)
Let’s start by getting to know each other a bit better. Share the following with your neighbor:
Name and pronouns
Program and year
Experience with programming (in R or more generally)
One word that best describes your feelings about taking this class
If you could have any superpower, what would you choose?
Pair up with someone nearby and introduce yourself to one another (~ 5 min).
My research:
Making pretty maps, doing analyses
Teaching materials (these slides and the website are made with Quarto!)
The syllabus (as well as lots of other information) can be found on our course website
Feel free to follow along online as I run through the syllabus!
This course is intended to give students a foundational understanding of programming in the statistical language R. This knowledge is intended to be broadly useful wherever you encounter data in your education and career. General topics we will focus on include:
By the end of this course you should feel confident approaching any data you encounter in the future. We will cover almost no statistics; however, the intention is that this course will leave you prepared to progress in CS&SS or STAT courses with the ability to focus on statistics instead of coding. Additionally, the basic concepts you learn, such as logic and algorithmic thinking, will be applicable to other programming languages and research in general.
Lecture: Tuesdays 4:30-6:20pm, Thomson Hall Room 325
Office Hours: Wednesdays 4-6pm, Savery Hall Room 117 (CSSCR Lab)
How to Contact Me
Please post your questions on Ed Discussion (accessible through Canvas) rather than emailing me. This helps ensure I won’t miss them!
Communication
Learning is collaborative! In addition to being the place to communicate with me, please use Ed Discussion to ask each other questions, share resources, etc.
Homework & Peer-Reviews
We will be using Canvas only for homework & peer review submissions and to house lecture recordings and Zoom links.
Course Content
All course content (lecture slides and homework instructions) will be accessible on our course website.
None 😎
I recommend bringing a laptop to class so you can follow along and practice during class.
Keep In Mind
The versions of R, RStudio, and Quarto (as well as any packages you have installed) will not necessarily be the same/up to date if you do your work on different computers. My advice is to consistently use the same device for homework assignments or to make sure to download the latest versions of R, RStudio, and Quarto when using a new machine.
Textbooks: This course has no textbook. However, I will be suggesting selections from R for Data Science to pair with each week’s topic. While not required, I strongly suggest reading those selections before doing the homework for that week.
Credit/No Credit (C/NC). You need at least 60% to get credit.
9 total homeworks; assessed on a 0-3 point rubric. Assigned at the end of lecture sessions and due a week later.
Evaluation | Points |
---|---|
Not submitted. | 0 |
Turned in but low effort, ignoring many directions. | 1 |
Decent effort, followed directions with some minor issues. | 2 |
Submitted | 3 |
One per homework, assessed on a binary satisfactory/unsatisfactory scale. Due 5 days after homework due date.
Evaluation | Points |
---|---|
Didn’t follow all peer-review instructions. | 0 |
Peer review is at least several sentences long, | 1 |
Homework/peer grading instructions and deadlines can be found on the Homework page of the course website. All homework will be turned in on Canvas by 4:30pm the day it is due.
No late assignments will be accepted to ensure you receive feedback at regular intervals and stay on track. The grading is lenient, so submit whatever you have by each deadline, and your grade will be fine if you miss submitting an assignment or two due to illness or emergencies.
Peer reviews are randomly assigned when the due date/time is reached. Therefore, if you don’t submit your homework, you will not be given a peer’s homework to review and vice versa.
Yes, because:
You will write your reports better knowing others will see them
You learn alternative approaches to the same problem
You will have more opportunities to practice and have the material sink in
How to peer review:
Academic integrity is essential to this course and to your learning. Violations of the academic integrity policy include but are not limited to:
I hope you will collaborate with peers on assignments and use Internet resources when questions arise to help solve issues. The key is that you ultimately submit your own work.
Anything found in violation of this policy will be automatically given a score of 0 with no exceptions. If the situation merits, it will also be reported to the UW Student Conduct Office, at which point it is out of my hands. If you have any questions about this policy, please do not hesitate to reach out and ask.
I’m committed to fostering a friendly and inclusive classroom environment in which all students have an equal opportunity to learn and succeed. This course is an attempt to make an often difficult and frustrating experience (learning R for the first time) less obfuscating, daunting, and stressful. That said, learning happens in different ways and at a different pace for everyone. Learning is also a collaborative and creative process and my aim is to create an environment in which you all feel comfortable asking questions of me and each other. Treat your peers and yourself with empathy and respect as you all approach this topic from a range of backgrounds and experiences (in programming and in life).
Names & Pronouns: Everyone deserves to be addressed respectfully and correctly. You are welcome to send me your preferred name and correct gender pronouns at any time.
Diversity: Diverse backgrounds, embodiments, and experiences are essential to the critical thinking endeavor at the heart of university education. Therefore, I expect you to follow the UW Student Conduct Code in your interactions with your colleagues and me in this course by respecting the many social and cultural differences among us, which may include, but are not limited to: age, cultural background, disability, ethnicity, family status, gender identity and presentation, body size/shape, citizenship and immigration status, national origin, race, religious and political beliefs, sex, sexual orientation, socioeconomic status, and veteran status.
Accessibility & Accommodations: Your experience in this class is important to me. If you have already established accommodations with Disability Resources for Students (DRS), please communicate your approved accommodations to me at your earliest convenience so we can discuss your needs in this course. If you have not yet established services through DRS, but have a temporary health condition or permanent disability that requires accommodations (conditions include but are not limited to: mental health, attention-related, learning, vision, hearing, physical or health impacts), you are welcome to contact DRS at 206-543-8924, uwdrs@uw.edu, or through their website.
Religious Accommodations: Washington state law requires that UW develop a policy for accommodation of student absences or significant hardship due to reasons of faith or conscience, or for organized religious activities. The UW's policy, including more information about how to request an accommodation, is available at Religious Accommodations Policy. Accommodations must be requested within the first two weeks of this course using the Religious Accommodations Request form.
Getting Help: If at any point during the quarter you find yourself struggling to keep up, please let me know! I am here to help. A great place to start this process is by chatting before or after class, attending office hours, and/or reaching out on Ed Discussion. As much as possible, I encourage you to use my office hours.
Also, help one another as you navigate this course! Use Ed Discussion to discuss questions with each other, and feel free to form study/practice groups.
Feedback
If you have feedback on any part of this course or the classroom environment I want to hear it! You can submit anonymous feedback here. I will also send out a mid-quarter feedback survey.
Don’t ask like this:
tried lm(y~x) but it iddn’t work wat do
Instead, ask like this:
y <- seq(1:10) + rnorm(10)
x <- seq(0:10)
model <- lm(y ~ x)
Running the block above gives me the following error, anyone know why?
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : variable lengths differ (found for 'x')
FYI
If you ask me a question directly on Ed Discussion or in office hours, I may send out your question (anonymously) along with my answer to the whole course.
Bold usually indicates an important vocabulary term. Remember these!
Italics indicate emphasis but also are used to point out things you must click with a mouse.
Code represents R code or keyboard shortcuts you can use to perform actions (e.g. Ctrl-P to open the print dialogue).
Code chunks that span the page represent actual R code embedded in the slides.
Since the lectures for this class were created using Quarto, there are numerous built-in features meant to facilitate your learning, particularly of R.
For any R code embedded in the slides you will see a clipboard which you can click to copy the code. You can then paste it in your own Quarto document or R script to run it in your session of RStudio.
R is a programming language built for statistical computing.
If one already knows Stata or similar software, why use R?
RStudio is a “front-end” or integrated development environment (IDE) for R that can make your life easier.
We’ll show RStudio can…
It can also…
Manage git repositories (version control)
Run interactive tutorials
Handle other languages like C++, Python, SQL, HTML, and shell scripting
Built upon many of the developments of the R Markdown ecosystem, Quarto distills them into one coherent system and additionally expands its functionality by supporting other programming languages besides R, including Python and Julia.
The ability to create Quarto files in R is a powerful advantage. It allows us to:
If you don’t already have R and RStudio on your machine, now is the time to install them! Follow the instructions in the syllabus.
Open up RStudio now and choose File > New File > R Script.
Then, let’s get oriented with the interface:
Top Left: Code editor pane, data viewer (browse with tabs)
Bottom Left: Console for running code (> prompt)
Top Right: List of objects in environment, code history tab.
Bottom Right: Tabs for browsing files, viewing plots, managing packages, and viewing help files.
There are several ways to run R code in RStudio:
Highlight lines in the editor window and click Run at the top right corner of said window or hit Ctrl+Enter or ⌘+Enter to run them all.
With your caret on a line you want to run, hit Ctrl+Enter or ⌘+Enter. Note your caret moves to the next line, so you can run code sequentially with repeated presses.
Type code directly in the console and hit Enter.
The console will show the lines you ran followed by any printed output.
If you mess up (e.g. leave off a parenthesis), R might show a + sign prompting you to finish the command:
Finish the command or hit Esc to get out of this.
In the console, type 123 + 456 + 789 and hit Enter.
The [1] in the output indicates the numeric index of the first element on that line.
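As a quick check of what to expect, the same sum can be run from a script. This is a minimal sketch; the object name `total` is only for illustration and does not appear in the original slides.

```r
# Add three numbers, exactly as typed in the console
total <- 123 + 456 + 789

# Printing the result shows the [1] index in front of the value
print(total)
#> [1] 1368

# A longer vector wraps across several lines, and each line starts
# with the index of its first element
print(1:30)
```

The second `print()` call is there only to show why the index is useful once the output no longer fits on one line.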
Now in your blank R document in the editor, try typing the line sqrt(400) and either clicking Run or hitting Ctrl+Enter or ⌘+Enter.
sqrt() is an example of a function in R.
Arguments are the inputs to a function. In this case, the only argument to sqrt() is x which can be a number or a vector of numbers.
The basic template of a function is
function_name(argument1, argument2 = value2, argument3 = value3...)
Something to Note
Functions can have a wide range of arguments and some are required for the function to run, while others remain optional. You can see from each function’s help page which are not required because they will have an = with some default value pre-selected. If there is no = it is up to the user to define that value and it’s therefore a required specification.
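To make the required/optional distinction concrete, here is a small sketch. It uses `round()` as a second example; that choice, and the object names, are illustrative assumptions rather than something taken from the slides.

```r
# sqrt() has a single required argument, x
sqrt(400)
#> [1] 20

# x can also be a vector of numbers
sqrt(c(4, 9, 16))
#> [1] 2 3 4

# round() has a required argument x and an optional argument digits,
# which defaults to 0 when it is not supplied
rounded_default <- round(3.14159)
rounded_two <- round(3.14159, digits = 2)

rounded_default
#> [1] 3
rounded_two
#> [1] 3.14
```

Naming the optional argument (`digits = 2`) is not strictly required, but it tends to make the call easier to read later.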
If we didn’t have a good guess as to what sqrt() will do, we can type ?sqrt in the console and look at the Help panel on the bottom right.
If you’re trying to look up the help page for a function and can’t remember its name you can search by a keyword and you will get a list of help pages containing said keyword.
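One possible way to do such a keyword search from code is sketched below. The search phrase and the restriction to the base package are assumptions made to keep the example small and fast.

```r
# Search help page titles, aliases, and concepts for a keyword.
# In the console, ??"square root" is shorthand for the unrestricted search.
results <- help.search("square root", package = "base")

# The matching help pages are listed when the result is printed
results

# apropos() is a lighter alternative: it lists objects whose
# names contain a pattern
matches <- apropos("sqrt")
matches
```

`help.search()` looks through documentation text, while `apropos()` only looks at object names, so the two can return quite different lists.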
Help files provide documentation on how to use functions and what functions produce. They will generally consist of the following sections:
R stores everything as an object, including data, functions, models, and output.
Creating an object can be done using the assignment operators <- or =:
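The example that originally followed this sentence is not present here, so the following is a minimal sketch with hypothetical object names.

```r
# Create an object called new_object that stores the number 144
new_object <- 144

# = also assigns at the top level, though <- is the more common convention
another_object = 12

# Assignment prints nothing; typing the name displays the stored value
new_object
#> [1] 144
another_object
#> [1] 12
```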
Operators like <- are functions that look like symbols but typically sit between their arguments (e.g. numbers or objects) instead of having them inside () like in sqrt(x).
We do math with operators, e.g., x + y. + is the addition operator!
You can display or “call” an object simply by using its name.
Object names must begin with a letter and can contain letters, numbers, ., and _.
Try to be consistent in naming objects. RStudio auto-complete means long, descriptive names are better than short, vague ones! Good names save confusion later!
Separating words with _ is a common and practical naming convention that I strongly recommend.
Remember that object names are CaSe SeNsItIvE!!
Also, TYPOS MATTER!
An object’s name represents the information stored in that object, so you can treat the object’s name as if it were the values stored inside.
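A short sketch of that idea, assuming a hypothetical object named `new_object`:

```r
# Store a value under a descriptive name
new_object <- 144

# The name can now be used anywhere the value itself could be used
new_object + 10
#> [1] 154
new_object + new_object
#> [1] 288
sqrt(new_object)
#> [1] 12

# Names are case sensitive: New_Object is a different (undefined) name
exists("New_Object")
#> [1] FALSE
```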
A vector is one of many data types available in R. Specifically, it is a series of elements, such as numbers, strings, or booleans (i.e. TRUE, FALSE).
You can create a vector using the function c() which stands for “combine” or “concatenate”.
If you name an object the same name as an existing object, it will overwrite it.
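Both points, creating vectors with `c()` and overwriting an existing object, can be sketched as follows. All object names and values are hypothetical.

```r
# A numeric vector
new_vector <- c(4, 9, 16, 25)
new_vector
#> [1]  4  9 16 25

# Many functions work on every element of a vector at once
sqrt(new_vector)
#> [1] 2 3 4 5

# Vectors can also hold strings or booleans
fruit_names <- c("apple", "banana", "cherry")
is_ripe <- c(TRUE, FALSE, TRUE)

# Reusing an existing name silently replaces the old contents
new_vector <- c("now", "a", "character", "vector")
class(new_vector)
#> [1] "character"
```

R gives no warning when an object is overwritten, which is one more reason to prefer descriptive, distinct names.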
There are other, more complex data types in R which we will discuss later in the quarter! These include matrices, arrays, lists, and dataframes.
Most data sets you will work with will be read into R and stored as a dataframe, so this course will mainly focus on manipulating and visualizing these objects.
Let’s try making a Quarto file:
---
title: "ggplot2 demo"
author: "Norah Jones"
date: "5/22/2021"
format:
  html:
    fig-width: 8
    fig-height: 4
    code-fold: true
---
## Air Quality
@fig-airquality further explores the impact of temperature on ozone level.
```{r}
#| label: fig-airquality
#| fig-cap: "Temperature and ozone level."
#| warning: false
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) +
  geom_point() +
  geom_smooth(method = "loess")
```
Elements of a Quarto document include:
A YAML header (surrounded by ---s).
Code chunks (surrounded by ```s) and/or their output.
The header of a .qmd file is a YAML code block, and everything else is part of the main document. Try adding some of these other fields to your YAML and re-render it to see what it looks like.
To mess with global formatting, you can modify the header.
This is all basic markdown syntax which you can learn about here.
Include math \(y= \left( \frac{2}{3} \right)^2\) inline.
Or centered on your page like so:
\[\frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}_n\]
Or write code-looking font.
Or a block of code:
y <- 1:5
z <- y^2
Quarto docs can be modified in many ways. Visit these links for more information.
Inside Quarto, lines of R code are called chunks. Code is sandwiched between sets of three backticks and {r}. This chunk of code…
Produces this output in your document:
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
Add this code chunk to your document!
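The chunk itself is not shown above. Judging from the output (summary statistics for speed and dist), it was most likely a call to `summary(cars)`; treat that as an assumption. The object name `cars_summary` is hypothetical.

```r
# In a .qmd file this code would sit between an opening {r} fence line
# and a closing fence line; run on its own it prints the same table
cars_summary <- summary(cars)
cars_summary
```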
Chunks have options that control what happens with their code. They are specified as special comments at the top of a block. For example:
Some useful and common options include:
echo: false - Keeps R code from being shown in the document
eval: false - Shows R code in the document without running it
include: false - Hides all output but still runs code (good for setup chunks where you load packages!)
output: false - Doesn’t include the results of that code chunk in the output
cache: true - Saves results of running that chunk so if it takes a while, you won’t have to re-run it each time you re-render the document
fig.height: 5, fig.width: 5 - Modify the dimensions of any plots that are generated in the chunk (units are in inches)
fig.cap: "Text" - Add a caption to your figure in the chunk
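As a sketch of how such options look in practice, the body of a chunk might read as follows. The label `cars-summary` and the particular combination of options are hypothetical. Because the option lines start with `#|`, plain R treats them as comments, so the code also runs outside Quarto.

```r
#| label: cars-summary
#| echo: false
#| cache: true

# With echo: false the rendered document shows only the output below,
# and with cache: true the result is reused on later renders
summary(cars)
```

Options are written one per line at the very top of the chunk, before any code.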
Try adding or changing the chunk options for the chunk in my_first_Rmd.qmd and re-render your document to see what happens.
Sometimes we want to insert a value directly into our text. We do that using code in single backticks starting off with r.
Four score and seven years ago is the same as `r 4*20 + 7` years.
Four score and seven years ago is the same as 87 years.
The value of `x` rounded to the nearest two decimals is `r round(x, 2)`.
The value of x rounded to the nearest two decimals is 8.77.
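The expressions inside the inline code can be checked in the console first. In this sketch the value assigned to `x` is hypothetical, chosen only so that it rounds to the 8.77 shown above; the slides do not say what `x` actually holds.

```r
# The arithmetic from the first inline example
four_score <- 4 * 20 + 7
four_score
#> [1] 87

# A hypothetical value for x that rounds to 8.77
x <- 8.7684
round(x, 2)
#> [1] 8.77
```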
Having R dump values directly into your document protects you from silly mistakes:
In your YAML header, make the date come from R’s Sys.time() function by changing:
date: "March 26, 2024"
to
date: "`r Sys.time()`"
R and Packages
Base R
Simply by downloading R you have access to what is referred to as Base R. That is, the built-in functions and datasets that R comes equipped with, right out of the box.
Examples that we’ve already seen include <-, sqrt(), +, Sys.time(), and summary() but there are obviously many many more.
You can see a whole list of what Base R contains by running library(help = "base") in the console.
R Dataset: cars
In the sample Quarto document you are working on, we can load the built-in data cars, which loads as a dataframe, a type of object mentioned earlier. Then, we can look at it in a couple different ways.
data(cars) loads this dataframe into the Global Environment.
View(cars) pops up a Viewer tab in the source pane (“interactive” use only, don’t put in Quarto document!).
str() displays the structure of an object:
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary() displays summary information:
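The calls described above can be run together as a short script. This is a minimal sketch using only Base R functions; `View(cars)` is left out because it only works interactively.

```r
# Load the built-in dataset into the Global Environment
data(cars)

# Structure: object type, number of rows/columns, and a preview of each column
str(cars)

# Summary statistics for each column
summary(cars)

# Number of rows and columns
dim(cars)
#> [1] 50  2
```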
R is pretty…Basic
hist() generates a histogram of a vector. Note that you can access a vector that is a column of a dataframe using $, the extract operator.
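A minimal sketch of both ideas; the choice of the `speed` column and the object names are assumptions.

```r
# $ pulls a single column out of a dataframe as a vector
speeds <- cars$speed
head(speeds)
#> [1] 4 4 7 7 8 9

# hist() draws a histogram of that vector; it also returns the bin
# counts invisibly, which are stored here for inspection
speed_hist <- hist(cars$speed)
speed_hist$counts
```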
We can try and make this histogram a bit more appealing by adding more arguments and their specifications.
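The original code for this step is not included here, so the following is only a sketch of the kind of arguments one might add. The column, labels, colour, and number of breaks are all illustrative choices.

```r
# Same kind of histogram, with axis label, title, colour, and bin count set
dist_hist <- hist(cars$dist,
                  xlab = "Distance (ft)",                        # x-axis label
                  main = "Observed stopping distances of cars",  # plot title
                  col = "cornflowerblue",                        # bar colour
                  breaks = 10)                                   # suggested number of bins
```

Note that `breaks` is treated as a suggestion: R picks "pretty" cut points, so the number of bars may differ slightly from the number requested.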
We can also make scatterplots to show the relationship between two variables.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Speeds and stopping distances of cars",
     pch = 16) # Point shape
abline(h = mean(cars$dist), col = "firebrick") # add horizontal line (y-value)
abline(v = mean(cars$speed), col = "cornflowerblue") # add vertical line (x-value)
Note
dist ~ speed is a formula of the type y ~ x. The first element (dist) gets plotted on the y-axis and the second (speed) goes on the x-axis. Regression formulae follow this convention as well!
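To illustrate the remark about regression, the same formula can be passed to `lm()`. This is a sketch rather than part of the original slides, and `model` is a hypothetical object name.

```r
# Fit a linear regression with dist as the outcome (y) and speed as the predictor (x)
model <- lm(dist ~ speed, data = cars)

# Estimated intercept and slope
coef(model)

# The same formula drives the scatterplot, and abline() can draw the fitted line
plot(dist ~ speed, data = cars, pch = 16)
abline(model, col = "firebrick")
```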
R Dataset: swiss
Let’s look at another built-in dataset.
First, run ?swiss in the console to see what things mean.
Then, load it using data(swiss)
swiss
What makes R so powerful though is its extensive library of packages. Due to its open-source nature, anyone (even you!) can write a package that others can use.
Packages contain pre-made functions and/or data that can be used to extend Base R’s capabilities.
Base R/Package Analogy
Base R is like creating a recipe from scratch: going to the store and buying all the ingredients and cooking it by yourself. Using a package is more akin to using a meal-kit service: you still have to cook but you’re provided with the ingredients and step-by-step instructions for making the recipe.
As of this writing there are 22,254 available packages!
To use a package outside of Base R you need to do two things:
Download the package from CRAN (The Comprehensive R Archive Network) by running the following in your console:
install.packages("package_name")
This downloads the package to your local machine (or the server of whatever remote machine you’re using). Thus, you only ever need to do it once for each package!
Load the package into R so you can use it. You’ll do this by putting the following in an R Script or embedded in a code chunk in a Quarto file:
library(package_name)
gt Package
Let’s make a table that’s more polished than the code-y output R automatically gives us. To do this, we’ll want to install our first package called gt. In the console, run: install.packages("gt").
Nesting Functions
Note that we put the summary(swiss) function call inside the as.data.frame.matrix() call which all went into the gt() function. This is called nesting functions and is very common. I’ll introduce a method next week to avoid confusion from nesting too many functions inside each other.
What’s as.data.frame.matrix() Doing?
gt() takes as its first argument a data.frame-type object, while summary() produces a table-type object. Therefore, as.data.frame.matrix() was additionally needed to turn the table into a data.frame.
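The nested call described above can be unpacked step by step. This is a sketch with hypothetical intermediate object names; the `gt()` step runs only if the gt package happens to be installed.

```r
# Step 1: summary() returns a "table" object, not a dataframe
swiss_summary <- summary(swiss)
class(swiss_summary)
#> [1] "table"

# Step 2: as.data.frame.matrix() converts it, keeping one column per variable
swiss_summary_df <- as.data.frame.matrix(swiss_summary)
class(swiss_summary_df)
#> [1] "data.frame"

# Step 3: the nested version does both steps inside gt() in a single line
if (requireNamespace("gt", quietly = TRUE)) {
  swiss_table <- gt::gt(as.data.frame.matrix(summary(swiss)))
}
```

Nested calls are read from the inside out: `summary()` runs first, then `as.data.frame.matrix()`, then `gt()`.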
Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
---|---|---|---|---|---|
Min. :35.00 | Min. : 1.20 | Min. : 3.00 | Min. : 1.00 | Min. : 2.150 | Min. :10.80 |
1st Qu.:64.70 | 1st Qu.:35.90 | 1st Qu.:12.00 | 1st Qu.: 6.00 | 1st Qu.: 5.195 | 1st Qu.:18.15 |
Median :70.40 | Median :54.10 | Median :16.00 | Median : 8.00 | Median : 15.140 | Median :20.00 |
Mean :70.14 | Mean :50.66 | Mean :16.49 | Mean :10.98 | Mean : 41.144 | Mean :19.94 |
3rd Qu.:78.45 | 3rd Qu.:67.65 | 3rd Qu.:22.00 | 3rd Qu.:12.00 | 3rd Qu.: 93.125 | 3rd Qu.:21.70 |
Max. :92.50 | Max. :89.70 | Max. :37.00 | Max. :53.00 | Max. :100.000 | Max. :26.60 |
gt’s Version of head() and tail()
Fertility Agriculture Examination Education Catholic
Courtelary 80.2 17.0 15 12 9.96
Delemont 83.1 45.1 6 9 84.84
Franches-Mnt 92.5 39.7 5 5 93.40
Moutier 85.8 36.5 12 7 33.77
Neuveville 76.9 43.5 17 15 5.16
Porrentruy 76.1 35.3 9 7 90.57
Infant.Mortality
Courtelary 22.2
Delemont 22.2
Franches-Mnt 20.2
Moutier 20.3
Neuveville 20.6
Porrentruy 26.6
Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality | |
---|---|---|---|---|---|---|
1 | 80.2 | 17.0 | 15 | 12 | 9.96 | 22.2 |
2 | 83.1 | 45.1 | 6 | 9 | 84.84 | 22.2 |
3 | 92.5 | 39.7 | 5 | 5 | 93.40 | 20.2 |
4..44 | ||||||
45 | 35.0 | 1.2 | 37 | 53 | 42.34 | 18.0 |
46 | 44.7 | 46.6 | 16 | 29 | 50.43 | 18.2 |
47 | 42.8 | 27.7 | 22 | 29 | 58.33 | 19.3 |
👋 Bye Bye as.data.frame.matrix()
We no longer need as.data.frame.matrix() since we’re no longer using summary(). Both head() and gt_preview() take a data.frame-type object as their first argument which is the same data type as swiss.
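The code behind the two outputs above is not shown, so here is a sketch of calls that would produce similar results. The `top_n` and `bottom_n` values are inferred from the preview table (rows 1 to 3 and 45 to 47) and should be treated as assumptions; the gt step runs only if that package is installed.

```r
# Base R: the first six rows, and the last three rows
first_rows <- head(swiss)
last_rows <- tail(swiss, n = 3)
first_rows

# gt: a single table showing the top and bottom rows with row numbers
if (requireNamespace("gt", quietly = TRUE)) {
  swiss_preview <- gt::gt_preview(swiss, top_n = 3, bottom_n = 3)
}
```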
Comments
Anything written after # will be ignored by R.
Comments help collaborators and future-you understand what, and more importantly, why you are doing what you’re doing with that specific line/chunk of code.
Additionally, comments allow you to explain your overall coding plan and record anything important that you’ve discovered along the way.
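A small sketch of comments in practice, with hypothetical object names:

```r
# Plan: compute the averages used as reference lines on the cars scatterplot

# Average stopping distance (ft), drawn as the horizontal line
mean_dist <- mean(cars$dist)

mean_speed <- mean(cars$speed)  # a comment can also follow code on the same line

# print(mean_dist)   a line that starts with # is not run at all
```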