Lecture 6: Better plots and some terminology review

CSSS 508

Author

Jess Kunke

Published

November 5, 2024

Many R packages help you make better plots and tables. Illustration by Allison Horst.

Many R packages help you make better plots and tables. Illustration by Allison Horst.

Updates and reminders

  • Homework and peer reviews are graded for completion
  • Office hours are now Mon 4-5pm (Zoom and my office) and Fri 2-3pm (Zoom only)

Outline for today

  • Review of terminology
  • Data types and object types
  • Better plots

Terminology vocab list

Here are some terms we’ve been using so far.

R/coding Terminology:

  • Operation: a general term for something you do; can be conceptual steps or coding steps
  • Operator: a general term for a code construct to do a particular task/operation
    • We’ve seen lots of these! e.g. the extract operator $, the multiplication operator *, the equality comparison operator ==
    • Functions like select() and n() are a type of operator that takes zero or more arguments (inputs) supplied within parentheses () and carries out one or more tasks/operations
  • Index (plural: indices): the position of an element in an order
    • A row index tells you which row based on its order
    • The indices in gapminder[2,3] specify the second row and the third column
  • Object: a general term for a specific something (essentially some set of data) that you have defined
    • Objects are things/data, whereas functions and operations are instructions or recipes
  • Assignment: the process of storing an object under a name, such as in japan_data = filter(gapminder, country == "Japan")
  • Call (an object or function): to invoke/display (an object) or execute/“run” (a function); basically means to use it
  • Object name (e.g. japan_data) vs. filename (e.g. “gapminder.csv”) vs. (document/plot/table) title (e.g. “CSSS 508 Homework 5”)
  • Comments: parts of the code that are not run/executed
    • In R, the comment character is #
    • Therefore, in R, if there is a # somewhere on a particular line of code, everything on that line before the # will be run as code and everything on that line after the # will not be run as code
  • Command line: a simple interface with a text prompt where commands can be entered
    • This is the kind of interface we see in both the Console and the Terminal in RStudio; the Console is an interface for running R code within RStudio, while the Terminal is an interface for running shell scripting commands (different language than R) on your computer terminal
  • Base R: the set of code and data that come with an R installation
  • Environment: a virtual space that includes the set of all objects, variables, and functions that are defined within it
    • Your working environment is the space in which you are currently working; for instance, if you are running commands in the Console and looking at data in the Environment tab, your working environment is the global environment, whereas when you render a Quarto document, a fresh R environment is created in the background for that process.
    • Objects you define and packages you load in the Console (affecting your global environment) will not affect how your Quarto document renders.
    • Objects and packages in your Quarto document will not affect your global environment unless you run those commands outside of rendering (such as by clicking “Run”, using the keyboard shortcut command/control-return, or copying and pasting commands into the Console).
  • Package: a bundle of code/functions and data
    • Installing versus loading packages

Quarto terminology:

  • Markup language: a text-encoding system which specifies the structure and formatting of a document
  • YAML: a specific type of markup language, and the kind that you use in Quarto to set options
  • YAML header: the section of settings at the top of a Quarto document, delineated with ---
  • Code chunks: sections of a Quarto document that will be interpreted as code when the document is rendered
  • Chunk options: YAML settings at the start of a code chunk that determine something about how the chunk runs or appears in the final rendered document
  • Inline R code: R code surrounded by backticks `` and nestled directly into the text of a Quarto document; it is interpreted (and run) as code instead of text when the document is rendered

Data types

Data types affect how values are stored and used. The main data types we have seen are numeric (num), integer (int), character (chr), and logical (logi). We’ll also introduce a fifth type, factor.

You can use str() (and often class() and typeof()) to see the data type of a particular object or variable.

Numeric and integer refer to numbers. Integers are whole numbers only, whereas a numeric-type object can have decimal values.

Character-type objects are values in quotes and are treated like text.

x = 1:10

# take a look at x and some related objects
x
 [1]  1  2  3  4  5  6  7  8  9 10
x < 3
 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
as.character(x)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
# use str to check out their types
str(x)
 int [1:10] 1 2 3 4 5 6 7 8 9 10
str(x < 3)
 logi [1:10] TRUE TRUE FALSE FALSE FALSE FALSE ...
str(as.character(x))
 chr [1:10] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
# use class to check out their types
class(x)
[1] "integer"
class(x < 3)
[1] "logical"
class(as.character(x))
[1] "character"

Some more examples you can run yourself and check out:

str(1)
str(1L)
str("1")
str(TRUE)
str(1 < 5)
str(gapminder$year)
str(gapminder$country)
str(gapminder) # shows you the type of each variable all at once
class(1)
class(1L)
class("1")
class(TRUE)
class(1 < 5)
class(gapminder$year)
class(gapminder$country)
class(gapminder)

You can change datatypes using as.integer, as.logical, and so on.

identical(1.3, as.numeric("1.3"))
identical(1L, as.integer("1"))
identical("1", as.character(1))
identical(TRUE, as.logical(1))

The data type matters when you try to do certain operations:

1+2
"1"+"2"
as.character(1) + as.character(2)

In R, factors are useful for working with categorical variables (variables that have a fixed, known set of possible values) such as country and continent in the gapminder dataset. They are also handy if you want to display character vectors in a non-alphabetical order. We’ll investigate this further in the homework.

Object types

We started to talk about these in Lecture 1, but let’s talk about this more now that we’ve gotten more experience with R. You can think of object types as different data structures. Some of the main object types are vectors, dataframes, lists, matrices, and arrays, and each value in an object is called an element of that object. There are all sorts of more complicated ones too, like simple feature (sf) objects for spatial data and graph or network objects for network data.

Vectors are one-dimensional; they have a length, which could just be 1 if they contain only a single value. Matrices and dataframes are two-dimensional, with rows and columns; a main difference between them is that the columns of dataframes can have different data types (numerical, character, etc.). Arrays can have even higher dimensions. So far we have mostly worked with vectors and dataframes.

Lists allow for more flexible structures; in fact, dataframes are able to have columns with different formats because dataframes are a type of list. A list is one or more objects grouped together in some order into one object. Here’s a simple example of making a list:

y = 10:1
mylist = list(x, y)
mylist
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 10  9  8  7  6  5  4  3  2  1
mylist[[2]] # second element of the list
 [1] 10  9  8  7  6  5  4  3  2  1
mylist[[2]][3] # third element of the second element of the list
[1] 8

If we instead name the elements of the list, we can access them using the extract operator $ just as we do in a dataframe object:

mylist = list("x" = x, "y" = y)
mylist
$x
 [1]  1  2  3  4  5  6  7  8  9 10

$y
 [1] 10  9  8  7  6  5  4  3  2  1
mylist$y # second element of the list
 [1] 10  9  8  7  6  5  4  3  2  1
mylist$y[3] # third element of the second element of the list
[1] 8

The elements of a list don’t have to have the same dimension or even the same object type:

mylist = list("some_numbers" = x, "gapminder" = gapminder, my_name = "Jess Kunke", is_instructor = TRUE)
mylist
$some_numbers
 [1]  1  2  3  4  5  6  7  8  9 10

$gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

$my_name
[1] "Jess Kunke"

$is_instructor
[1] TRUE
str(mylist)
List of 4
 $ some_numbers : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ gapminder    : tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
  ..$ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
  ..$ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
  ..$ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
  ..$ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
  ..$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
 $ my_name      : chr "Jess Kunke"
 $ is_instructor: logi TRUE

Building a nicer plot step by step

Let’s build up a more complex plot step by step to learn how to do a bunch of other things. We’ll start with a line plot of life expectancy over time by country. (Why is group = country necessary??) Each time we add something to the plot, I’m just going to copy and paste the previous code into a new code chunk, then make one or more changes.

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line()

Now let’s split each continent into a different subplot. Is group = country still necessary? Why/why not?

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2)

Let’s change the theme:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  theme_minimal()

Let’s fix the axis labels and add a title and subtitle. You can do this with separate functions like xlab() and ggtitle() or with a single function like labs(). We’ve seen the separate functions before, so let’s try labs() this time:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) + 
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  theme_minimal()

Notice that this time I added the last command (labs()) before a previous command (theme_minimal()), not at the end. For the commands I have here (labs() and theme_minimal()), the order doesn’t matter, but I tend to try to put geometry layers (geom_xx()) toward the beginning of the block of plot code and theme settings at the end of it.

The x-axis is cluttered and difficult to read, so let’s fix that. We can start by having fewer axis tick labels:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(breaks = seq(1950, 2010, 20)) +
  theme_minimal()

Hmm, it doesn’t show a label at the right edge of the plot. If we want that to change, we can do that:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal()

Now the ends of the x-axes run together, so we can fix that by increasing the space or padding between the subplots. I didn’t remember how to do this, so I googled “r ggplot increase space between panels” to figure it out.

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"))

We could also tilt the text of the axis tick marks:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        axis.text.x = element_text(angle = 30))

Let’s color the lines by continent. Notice this automatically generates a legend.

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        axis.text.x = element_text(angle = 30))

We can move the legend:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options

If you want to change the colors or the color palette, you can do that using, for example, scale_color_manual. We can also use that same function to rename the legend. Notice that

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  scale_color_manual(
    name = "Which continent are\nwe looking at?", # \n adds a line break 
    values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", 
               "Asia" = "#e15759", "Europe" = "#76b7b2", "Oceania" = "#59a14f")) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options

Let’s add a line to each plot that represents the continent average:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  geom_line(stat = "smooth", 
            aes(group = continent)) +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  scale_color_manual(
    name = "Which continent are\nwe looking at?", # \n adds a line break 
    values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", 
               "Asia" = "#e15759", "Europe" = "#76b7b2", "Oceania" = "#59a14f")) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options

Oops, we can’t see the line! We should make it stand out as different from the other country-specific lines. Let’s make it thicker and black:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  geom_line(stat = "smooth", 
            aes(group = continent),
            color = "black", size = 1) +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  scale_color_manual(
    name = "Which continent are\nwe looking at?", # \n adds a line break 
    values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", 
               "Asia" = "#e15759", "Europe" = "#76b7b2", "Oceania" = "#59a14f")) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options

This probably gives you this warning:

What does it mean that the size aesthetic was “depracated”? It means that it used to be an argument and technically you can still use it but they’re trying to phase it out. So they’d rather you use something else instead– they tell you the new argument to use is linewidth. So let’s change that:

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  geom_line(stat = "smooth", 
            aes(group = continent),
            color = "black", linewidth = 1) +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", 
       y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  scale_color_manual(
    name = "Which continent are\nwe looking at?", # \n adds a line break 
    values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", 
               "Asia" = "#e15759", "Europe" = "#76b7b2", "Oceania" = "#59a14f")) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options

Your Turn

Let’s keep going!

  1. Make the continent average line a little transparent so that you can see the other lines through it.
  2. Rename the legend “Continent”.
  3. Put all the continents in a single row instead of having this two-row plot arrangement.
  4. Change the font size.
  5. Set the panel.spacing using a different unit than lines.