Lecture 7: Import, process, join and export data

CSSS 508

Author

Jess Kunke; drawn from R for Data Science by Hadley Wickham and Garrett Grolemund

Published

November 12, 2024

Data workflows. Illustration by Allison Horst.

Data workflows. Illustration by Allison Horst.

Updates and reminders

  • This week’s office hours were via Zoom only due to the holiday, but the Monday 4-5pm office hours will be hybrid again starting Nov 18th
  • Office hours are Mon 4-5pm (Zoom and my office) and Fri 2-3pm (Zoom only)

Outline for today

  • RStudio pro tips
  • Importing/reading in data
  • Processing data using factors
  • Joining two datasets
  • Exporting data to a file

RStudio pro tips

  • Gear symbol next to “Render” button
    • Control where your document preview appears (and whether it appears at all) when you render
    • Control whether the Source panel shows previews of your images and equations
    • Control whether output you run from your qmd file using Run (or the keyboard shortcut or the green arrow) appears in the Console or “inline” in your Source panel
  • Use keyboard shortcuts and autocompletion to reduce the typing and clicking you need to do
    • Command-enter/Control-enter to run the line of code your cursor is on
      • Add Shift to run the whole code chunk
    • To make similar code to what you already have, copy, paste, and edit instead of typing it all from scratch
    • Use find and replace (Command-F or Control-F)
    • Take advantage of RStudio’s autocompletion features to autocomplete function names, object names, and YAML options
    • If you want to rerun an old line of code in the Console, just use the up arrow to find it again in your command history
      • If you want to run a similar line of code, use the up arrow and then modify it before you run it
    • If you want to find a command you ran previously that starts with something, type that something in the Console and then hit Command-up or Control-up to scroll through commands you ran that start with that!

Import/read data

The simplest way to read data into R is from a text file such as a comma-separated values (csv) file. If you have data in Excel, I generally recommend exporting it as a csv and importing the csv file into R. There are packages for importing Excel files into R, but I find some of the data formatting is more consistent if I export to csv first instead of trying to read the Excel files directly into R. That’s something you can experiment with in your own work if you like! There are ways to read many other file types into R, of course, but for this class we will focus on csv files since they are common and very computer-friendly.

The package nycflights13 contains five datasets: airlines, airports, flights, planes, and weather. We’ll be working with two of these datasets today, but to practice reading data from a file, I exported them as csv files for us to read in: “airlines.csv” and “flights.csv”.

Please download those csv files from Canvas so you can follow along.

In theory, you can just read these datasets into R with one line of code each. For example, try this to load the airlines data:

airlines = read_csv("airlines.csv")

Did that work for you? It may or may not. To know why, we need to talk a bit about how files are organized on your computer, and where R looks for things when you tell it to read in a file. This can be a bit painful and confusing at first, but once you know a little about it as we’ll cover today, you can establish a workflow that works for you!

File paths

A file path is the path to a folder (directory) or file on your computer. File paths are specified in reference to a root directory or a home directory. So for example, on my computer, the file path “~/Documents/CSSS-508/data/airlines.csv” means that in my home directory (signified by “~”), there should be a “Documents” folder, which should contain a “CSSS-508” folder, which in turn should contain a folder called “data” that has a file called “airlines.csv”. This path may or may not exist; it’s an address, and a file may or may not actually live there, and one of those folders might not actually be in the folder it’s supposed to be in, etc.

File paths can be absolute or relative. An absolute file path is a sequence of directories telling you how to get from the root directory to some location. Each directory in the path is separated by a slash. On a Windows machine, the slashes in the path are all backward slashes “\” and the root directory is usually “C:\”. On Mac and Linux machines, the slashes in the path are all forward slashes “/” and the root directory is usually “/”. For example, “C:\User\Desktop\” is an example of a Windows absolute file path, and “/Users/jessicakunke/Documents/CSSS-508/data/airlines.csv” is an example of a Mac absolute file path.

Important note: if you are on a Windows machine, use a double backslash instead of a single backslash throughout your Windows file paths when you are typing them in R. Next week we will be able to better explain why this is once we learn more about working with text in R, but feel free to take a peek at the details below if you’re curious. Otherwise, for now, just consider this a handy tip. In fact, in order for me to get the single backslash to show up correctly in these lecture notes, I had to type them as double backslashes.

R and other languages use backslashes as an “escape character”. What does that mean? We’ll learn more next week about handling text in R, but let’s introduce the idea now. Consider how character values have to be surrounded by double quotes to indicate it’s a character value instead of a variable/object/function name. Then what do you do if your character string includes double quotes? You “escape” the quotes with a backslash:

# these two lines won't work if you uncomment them
# print("He said "whooooaaa"")
# cat("He said "whooooaaa"")

# but these work; note the different output of print and cat
print('He said \"whooooaaa\"')
cat('He said \"whooooaaa\"')

As a result, if you want to include a backslash as a character, you need to escape it with another backslash:

# these two lines won't work if you uncomment them
#   specifically, they expect you to type more (they think the commands aren't 
#   complete) because the \" is interpreted as part of the character string and
#   it's expecting another " to end the character string
# print("C:\User\Desktop\")
# cat("C:\User\Desktop\")

# but these work
print("C:\\User\\Desktop\\")
cat("C:\\User\\Desktop\\")

More info on Windows file paths

A relative file path is defined with reference to an arbitrary location. Examples:

  • “data/airlines.csv” means “look in your current directory for a folder called ‘data’, and in there, look for a file called ‘airlines.csv’”.
  • The home directory is typically “C:\Users\username” on Windows and “/Users/username” on a Mac/Linux system, where you replace “username” with your username. A shorthand for the home directory in either case is “~”.

If you want to look “up” a level, i.e. the directory that contains your current directory, then use “..”. For example, if you’re in “~/MyDocs/Pictures” and you want to look in “~/MyDocs/Music”, you would use “../Music”. The “..” brings you up/back a level to MyDocs, and then “/Music” brings you into the Music subfolder of MyDocs:

Reading in the data with airlines = read_csv("airlines.csv") will work if RStudio knows to look in the directory that contains our dataset. You can use the command getwd() (for “get working directory”) to see the folder where RStudio is currently looking for your files; this location or folder is called your current working directory. Any relative file paths you use are relative to this folder. So when you say the file you want is “airlines.csv”, you’re looking for that file in this folder.

getwd()

The length of an absolute file path is how deep (how many levels/folders) you are nested within the root folder. The length of a relative file path is how many levels removed you are from your current working directory. A major advantage of relative file paths is that if the things you need for a project are all relatively close to each other in your file system, then no matter where they are or how many folders deep they are, the relative file paths will be short even though the absolute file paths may be long and complicated. Absolute paths can also be annoying if you are sharing your code or collaborating with others. For an alternative project setup using relative paths, see the next section on R projects!

  • Check out setwd() and getwd()
  • In the RStudio Files pane, navigate to the data set you want, click the gear, select “Copy folder path to clipboard”, then paste that file path wherever you want the file path (e.g. inside read_csv()).
Your Turn
  1. Use getwd() to figure out what your current working directory is in RStudio.
  2. In the RStudio Files pane, navigate to some data set on your computer, click the gear, select “Copy folder path to clipboard”, then paste that file path somewhere (in a script, in the Console as a comment, into a text file, whatever) to take a look at it. Remember: a file path is not itself code, so either put it in quotes or comment it out if you’re pasting it in the Console, an R script, or a Quarto document. If you copy and paste it into Word or Notepad or something, then it doesn’t matter. Once you’ve got the file path, describe what each part of that file path tells you.
  3. Rewrite the absolute file path “~/Documents/UW/CSSS508/Data/mydata.csv” as a relative file path that is relative to the directory “UW”. In other words, if your current working directory is “~/Documents/UW”, write a relative file path to get from there to the file mydata.csv.
  4. Now suppose that UW also contains a folder called “Research” and suppose that folder is your current working directory; write a relative file path to the file mydata.csv.

Organize your work with R projects

A fairly painless and straightforward way to handle these file path challenges is to create an R Project. This R project will be associated with a folder where you put most or all of the code and data needed for the project. When you open the project in RStudio, it will tell RStudio to use that folder as your working directory, so any relative file paths will be relative to that folder.

Let’s try this approach. Create an R Project (File > New Project) and select either New Directory or Existing Directory.

Once your new project opens, figure out where the current working directory is (it should be the folder that you made the project in):

# get working directory (getwd)
getwd()

Make sure that airlines.csv is in this directory, then try loading the file again as before:

airlines = read_csv("airlines.csv")

Ta-da!

Notice this command is kind of noisy, printing out a bunch of stuff we don’t need. As the message says, we can make it “quieter” by setting another argument of the read_csv() function:

airlines = read_csv("airlines.csv", show_col_types = FALSE)

In the future, when you want to work on this project, open RStudio by navigating to that folder in Finder or File Explorer and double-click the .Rproj file. This will automatically set your working directory to that folder! Splendid!

Using R Projects, you can have multiple RStudio sessions open and running at once if you like: for example, one for your CSSS 508 homework, one for your research, and one for your homework in another course.

Your Turn

Suppose that you want to keep all your data in a “Data” subfolder of your project directory (I often recommend doing this). How would you modify the code below to read in data files from that folder?

airlines = read_csv("airlines.csv")

Assuming you opened your R Project by double-clicking the .RProj file, your working directory should be your project directory. Since “Data” is a subfolder of that project directory, you can use the following relative path to say “look for a folder here called ‘Data’ and look inside there for a file called ‘airlines.csv’”.

airlines = read_csv("Data/airlines.csv")

Read in today’s data

We’ll assume you have the csv data files in a “Data” subfolder of your R project folder, so either make sure you have your project set up that way in order to successfully load the data with the code below, or adapt the code below to look for the data wherever you actually stored it.

airlines = read_csv("Data/airlines.csv"))
flights = read_csv("Data/flights.csv"))

Notice that this requires copying and pasting “Data/” many times. What if later you decide to reorganize your files and you move the data into another folder, or you rename the folder? You might update end up missing one of these lines when you go to update the code. To avoid that issue, I recommend that if you are going to use a file path more than once, actually store it as an object and then use paste0() to paste together the two strings without a space in between (that’s what the 0 means, no added space in between).

To see how this works and practice a little:

data_path = "Data/"

paste0(data_path, "airlines.csv")

paste0("My name is", "Jess")

So here’s the code you would actually use to read in your data files:

# set the relative file path to the folder where your data is stored
data_path = "Data/"

# read in your data files
airlines = read_csv(paste0(data_path, "airlines.csv"))
flights = read_csv(paste0(data_path, "flights.csv"))

Process data using factors

We’ve already done a good bit of work with data manipulation using tidyverse, and we are going to keep doing more with data processing next week. For now, let’s build upon what we started to learn about factors and missing data in the homework.

Let’s start by reviewing what you saw in the homework. Consider a vector of months:

x = c("Dec", "Apr", "Jan", "Mar")

Handling this as character-type data has at least two disadvantages: (1) you might have typos or multiple formats like “January” or “01” or “Jam” instead of “Jan”, even though there should only be 12 possible values since there are 12 months, and (2) sorting the vector sorts it alphabetically, not in the order of months. For instance, try this out:

sort(x)

We can address these problems by making the variable a factor. If you don’t care about the order of the levels, you can make the variable a factor in one step:

x = factor(x)

Take a look at the vector now by calling it and trying to sort it:

# call x
x
# sort x
sort(x)

Otherwise, if you do care about the order, as we do with months, we’ll do it in two steps. First, we create a list of the levels (possible values) in the order we want them to follow:

month_levels = c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Then we make our variable a factor:

x = factor(x, levels = month_levels)

Take a look at the vector now by calling it and trying to sort it:

# call x
x
# sort x
sort(x)

If you have typos or different formats, they’ll be converted to NA, in which case you can look back at the original data to see what those values were and handle them accordingly. For example:

# here the last two values should be counted as the same month
x_oops = c("Dec", "Apr", "January", "Jan")
x_oops_factor = factor(x_oops, levels = month_levels)
# but they're not
x_oops_factor
# which value didn't match our 12 values?
x_oops[is.na(x_oops_factor)]
# ahh, that should be Jan, so let's change it using logical indices
# - this code says, "set every element of x_oops whose value is 'January' to 'Jan'"
x_oops[x_oops == "January"] = "Jan"

How would we adapt the above code if x were not an object on its own but a column of a dataset? For example, here’s a small example dataset:

test_data = data.frame(month = c("Dec", "Apr", "Jan", "Mar"), temp_F = c(-10, 51, 16, 47))

test_data$month

How do we define levels for the month column and format the column as a factor?

# first set the levels:
month_levels = c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

# then do one of the following:
# either this
test_data$month = factor(test_data$month, levels = month_levels)
# or this
test_data = test_data %>%
  mutate(month = factor(month, levels = month_levels))

# check the results:
test_data$month
str(test_data$month)
Your Turn: Applying this to today’s data
  1. Which column do the airlines and flights datasets have in common? How many different values does that column have in the airlines dataset? in the flights dataset? Does each row have a unique value for this column in the airlines dataset? in the flights dataset?

  2. Soon we’re going to merge these two datasets together by matching this column, so to prepare for that, let’s quality control this column in the two datasets.

    1. You should have found in Problem 1 that in one of the datasets (but not the other), every row has a unique value for this shared column. Make the column a factor in that dataset with a unique level for each unique value. Does order matter? Why or why not?

    2. Using the same levels as in Part a, make the same column of the other dataset a factor. Are there any missing values (NAs)? Why, what were the original values? You can read in that second dataset again to get the original values back instead of the NAs.

    3. Identify the unique values of that second dataset and correct them before making the column a factor. In other words, let’s say you find some typos like “Jam” instead of “Jan”. How would you write the following logic in R code: “For all the rows in which this column equals ‘Jam’, set this column to ‘Jan’.” Give it a shot and discuss with your neighbors!

By the end of this exercise, you should have made the following changes:

  • The shared column should now be formatted as a factor in both these datasets.
  • Any typos should be corrected so that there are no missing values in this column in either dataset.

Problem 1

Comparing the names of the variables in the two datasets, we see that carrier is the only shared column. (Tip: Since airlines only has two columns, just look for those names in flights.)

names(airlines)
names(flights)

A nice way to figure that out without having to visually scan the list of names yourself is to use the %in% operator. Question for HW7: Try out the following code to get a sense for how %in% works, then identify the line of code you’d use in this case. Explain how that line of code works, piece by piece, to help someone understand how they would come up with that line of code themselves.

names(airlines) %in% names(flights)
names(flights) %in% names(airlines)
names(airlines)[names(airlines) %in% names(flights)]

They each have 16 unique values:

n_distinct(airlines$carrier)
n_distinct(flights$carrier)

Since airlines has 16 rows, that means every row has a unique value for the carrier column. Since flights has over 300,000 rows, there certainly must be repeats of at least some of the 16 unique carriers.

Note that if you check which values they are, though, they are not the same 16 values:

unique(airlines$carrier)
unique(flights$carrier)

That should already tip you off that you will have to consider what to do about those values. For now, we’ll reserve that for later.

Problem 2a

A table like airlines, that lists each carrier by a short two-character ID and a full name, is often called a lookup table. Its purpose is to match every unique identifier, sometimes called a key (here, the carrier ID), with some other information about that variable (such as the full name in this case) which may or may not be unique for every value of the unique identifier. Usually lookup tables have every possible value for the unique identifier in them and have exactly one row for each value. Therefore, we might reasonably figure that the 16 values in the carrier column of the airlines dataset are meant to be the set of possible carrier IDs, not just the 16 values that happen to be in that dataset.

Here there is probably no particular inherent order to the different carrier IDs. Perhaps alphabetical makes sense, and that’s the order that R will pick by default. Therefore, we will make the carrier variable of the airlines dataset a factor without specifying the levels and their order:

# do one of the following:
# base R
airlines$carrier = factor(airlines$carrier)
# tidyverse
airlines = airlines %>%
  mutate(carrier = factor(carrier))

When might you handle this differently?

  • For one, if airlines is not a lookup table but, say, a sample of 16 carriers from the ones you care about, then you might want to set the levels to include all possible carriers that you care about for your analysis, whether or not they appear in airlines.
  • Also, if the carriers do not appear in alphabetical order in the column of your lookup table and you want the levels to reflect the order in the lookup table, you can set the levels using the column itself. Remember that a column is a vector, and the order in which the values appear in the column is the order they appear in airlines$carrier, so if you use that to set your levels, that will be the order of the levels for the factor variable. So if you sort those values, they will be sorted in that order. In this case it doesn’t matter because airlines$carrier is already in alphabetical order, but that may not be true in another dataset.
# for the second option (ordering the levels in the order they appear in airlines$carrier),
# do one of the following:
# base R
airlines$carrier = factor(airlines$carrier, levels = levels(airlines$carrier))
# tidyverse
airlines = airlines %>%
  mutate(carrier = factor(carrier, levels = levels(airlines$carrier)))

# again, this will give you the same result as the previous code that didn't set the levels,
# because airlines$carrier is already in alphabetical order

However you set up your factor variable, take a look at the result:

# the first row of output from this command shows the values as they appear in airlines$carrier
# the second row shows the possible levels, in the order you set them (or the default alphabetic order otherwise)
airlines$carrier

# the first row of output from this command shows the values as they appear in head(airlines$carrier)
# the second row shows the possible levels, in the order you set them (or the default alphabetic order otherwise)
head(airlines$carrier)

# the output says it's a factor variable with 16 levels, and lists the first few possible levels in order
# then after the colon (:), it says the first few values of airlines$carrier,
# but instead of listing them by the character value ("9E", "AA", etc.), it lists
# them by their order: 1 for the first level ("9E"), 2 for the second level ("AA"), etc.
str(airlines$carrier)

# not surprisingly, each value appears once in the dataset
# (why is this not surprising?)
summary(airlines$carrier)

Problem 2b

Now let’s make flights$carrier a factor variable using the same levels (same possible values in the same order) as airlines$carrier:

# do one of the following:
# base R
flights$carrier = factor(flights$carrier, levels = levels(airlines$carrier))
# tidyverse
flights = flights %>%
  mutate(carrier = factor(carrier, levels = levels(airlines$carrier)))

Then take a look at the result using any or all of the following code:

# I didn't call flights$carrier because it's so long (>300,000 rows!)
# instead, I start with head(flights$carrier)

# just as before,
# the first row of output from this command shows the values as they appear in head(flights$carrier)
# the second row shows the possible levels, in the order you set them (or the default alphabetic order otherwise)
head(flights$carrier)

# same as before,
# except that now the numbers after the colon say 12 12 2 4 5 ... instead of 1 2 3 4 5 ...
# why is that? because the first five values of the flights$carrier column
# are UA UA AA B6 DL, and UA is the 12th level, AA is the 2nd, and so on
str(flights$carrier)

# uh oh, 692 NAs!
summary(flights$carrier)

The fact that there are NAs means that there must have been some values of flights$carrier that are not in the possible levels we used when we defined it as a factor variable. Therefore, when we made it a factor variable, it set those values to NA because they didn’t match any of the 16 possible levels.

There are multiple ways to see what those original values were that got replaced with NA. Since we overwrote the original flights$carrier column with the factor version, to see the original values we’ll have to reload the original data from the file first:

# read the data in again from the file
flights = read_csv(paste0(data_path, "flights.csv"))

# look at the unique values
unique(flights$carrier)

# compare to airlines$carrier
unique(airlines$carrier)

We see that flights has an extra carrier that airlines doesn’t (“00”) and airlines has an extra carrier that flights doesn’t (“VX”). The “00” values will not match one of the levels if we’re setting the levels using airlines$carrier, so the “00” values will be converted to NAs.

Is that it, or are there other differences? How could we verify that all the values were originally “00”? Here are two good ways to check. One approach is to see how many values are “00”. If that’s the same as the number of NAs we got (692), then that supports our conclusion. Another approach is to create the factor column, but this time make it a separate column instead of overwriting the original carrier column so that we can compare them. Let’s try both these approaches, starting with the first.

# use one of these ways to see how many "00" values you have
table(flights$carrier)

Hmm, only 7 values are “00”, so that’s part of the story but not the whole story. Looking closely, we see that there’s another difference between the unique values of the two datasets: flights has “F8” while airlines has “F9”. As a result, they both have 16 unique values, but they aren’t the same sets of 16 values. Indeed, there are 685 F8 values, and \(685+7 = 692\). So maybe it’s the “00” and the “F8” values that account for all the NAs we got.

To be sure, let’s do the second approach: let’s create the factor column, but this time make it a separate column instead of overwriting the original carrier column so that we can compare them.

# do one of the following:
# base R
flights$carrier_factor = factor(flights$carrier, levels = levels(airlines$carrier))
# tidyverse
flights = flights %>%
  mutate(carrier_factor = factor(carrier, levels = levels(airlines$carrier)))

There are many ways you can compare these columns now, but there are a lot of rows. In rows without an NA, the two columns should match, so really we only care about rows in which carrier_factor is NA. Also, what we care about is seeing what values the carrier column has in those rows. So let’s do this:

flights_compare = flights %>%
  filter(is.na(carrier_factor))

unique(flights_compare$carrier)

What is this code doing, and why does it help us get the answer we want?

Problem 2c

This solution will be provided after HW7 is due. See HW7 for a discussion of this part of the problem.

Join two datasets

As our next step, we’re going to join the two datasets using this shared column. To understand conceptually how a general join works, let’s consider this diagram:

Joining two datasets.

Joining two datasets.

There are actually many different ways to join data: do you want to keep your first/left dataset and just add columns based on the second/right dataset? Vice versa? Keep rows from both? How do you handle any cases in which there may be multiple matches?

For now, we will assume that there are no multiple matches, that you have correctly identified the column you’re joining by as a unique identifier. We’ll also assume you want to keep your first/left dataset and just add columns based on the second/right dataset, so that’s known as a left join.

# start with your "left" dataset
combined = flights %>%
  left_join(airlines)

How can we check whether the join worked?

Export/Write data to a file

We used read_csv() to read in the data, and to write to file, we’ll use write_csv(). Where will the following line of code save your output? What if you wanted to save to another location?

write_csv(combined, file = "combined.csv")