This guide will get you ready to work with R and RStudio. It will also teach you the basics of version control with GitHub within your data analysis workflow to facilitate collaboration and ensure the safety of your data and scripts.
These tools are installed in the few steps below, which you should follow in order.
R is a free software environment and programming language for statistical computing and data visualization maintained by the R Foundation.
To install R, head to the Comprehensive R Archive Network (CRAN) and choose the version corresponding to your operating system (Windows, macOS, Linux). Make sure you click on a link in the upper box of the page to get the precompiled version of R.
If you prefer watching a step-by-step guide, check out roughly the first half of these videos:
Great! Now that you have successfully installed R, close the installer and start R. You should be greeted with a window looking like this:
This is the basic interface which you can use to interact with R. Try running the following command to test it out, then close the window again:
print("Congratulations on installing R!")
This simple GUI isn’t very pretty and does not display a lot of information. Let’s install RStudio to remedy that!
RStudio is a free integrated development environment (IDE) that allows you to be more productive while analysing and visualizing data by integrating features such as code highlighting, debugging tools, workspace managers, etc. You can even run Python scripts in RStudio!
To install RStudio, simply head to its website and select the installer corresponding to your operating system.
NB: If you have trouble installing RStudio, check out the second part of the installation videos linked above for a step by step guide.
Awesome! Now launch RStudio. You should see a window that is similar to the one I have while I am writing this article (with a different theme).
If this is the case, try running a simple line of code to test whether R and RStudio have been correctly installed. Type the following command in the console (i.e. the bottom left pane where you see the > prompt):
print("Congratulations on installing RStudio!")
Then check that the resulting message is:
[1] "Congratulations on installing RStudio!"
Getting this message means that you have successfully installed both R and RStudio.
Before we proceed, take the time to look at the RStudio IDE cheatsheet you will find here. This will get you acquainted with the different panes and buttons on the screen.
Now that you have installed base R and the RStudio IDE you’re almost ready for your first data visualization project! This little tutorial will also outline how to install new packages to extend R’s functionalities and play around with objects such as graphs and dataframes!
A helpful metaphor that allowed me to grasp the difference between base R and packages is the workshop metaphor. Base R, the program we just installed, is our workshop and already contains a number of basic tools (i.e. functions) such as the print() function we used above. Now, think of packages as toolboxes of additional functions/tools that you can add to your workshop to speed up your data analysis workflow or to conduct more specialized tasks. After all, a carpenter won’t use the same tools as an electrician.
Now that you know what packages are, you might wonder where one can download them. The good news is that the vast majority of them are one command away and are free and open-source (meaning that you can check their source code online!). There are essentially two places you will find them:
CRAN: via the install.packages("package_name") command.
GitHub: via the devtools::install_github("username/repositoryname") command, after having installed the {devtools} package from CRAN.
install.packages("devtools") # Install the devtools package
library(devtools) # Load the package into the R session
install_github("username/package_name") # Install some other package from GitHub
Now that we have seen how to install packages from the two main sources, let’s try it out and install a couple of packages ourselves!
# Run the following commands in your RStudio console before proceeding
install.packages("devtools") # Install the devtools package
install.packages("tidyverse") # A set of SUPER useful packages
install.packages("gapminder") # Some data for our data visualization exercise
# NB: A neat way of doing this is to install them all at once by putting
# their names inside a vector with c()!
install.packages(c("devtools", "tidyverse", "gapminder"))
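Note that install.packages() reinstalls a package even if it is already on your machine. If you would rather install only what is missing, here is a minimal sketch using base R only. The helper name missing_packages() is made up for this example, and the actual install line is left commented out so that nothing gets downloaded by accident.

```r
# Hypothetical helper (not part of any package): returns the names in `pkgs`
# that are not among the packages currently installed on this machine.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Which of the packages used in this tutorial still need to be installed?
to_install <- missing_packages(c("devtools", "tidyverse", "gapminder"))
print(to_install)

# Uncomment the next line to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Running this a second time, after the packages have been installed, should print an empty character vector.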
Awesome! You’ve just installed a few packages and considerably expanded your data analysis toolbox. Now, the unfortunate thing is that you cannot use these tools right away without telling R which toolbox they are in. In other words, you will need to either “import” them once per R session by running the library(package_name) command or reference them before each function by writing package_name::myfunction().
# tibble() comes from the {tibble} package, which we just installed with the
# {tidyverse} and which {dplyr} re-exports. Until one of these packages is
# loaded, simply calling tibble() won't work:
tibble(mtcars)
# We can access it either by loading the {dplyr} package...
library(dplyr)
tibble(mtcars)
# ... or by referencing it explicitly:
dplyr::tibble(mtcars)
You might wonder why the two ways exist since the first one seems way more convenient. One case where I found the second option safer was when I was dealing with network analysis methods that came in different packages. Unfortunately, these packages contained identically or very similarly named functions, which complicated my network analysis workflow. This becomes an issue because the functions of the most recently loaded package take precedence over (i.e. “mask”) functions of the same name from packages loaded earlier. Hence, I’ve taken the habit of writing out the explicit name of the package from which the function comes. Plus, it will jog your memory and help you remember in which package you’ll find which function!
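To see this for yourself, base R offers a few functions that tell you where a name comes from. The sketch below only uses packages that ship with R. The name filter is picked as an example because {stats} (attached by default) and {dplyr} both define a function called filter().

```r
# The explicit form always points at one specific package:
origin <- environmentName(environment(stats::filter))
print(origin)  # "stats"

# find() lists every attached package that provides an object of that name.
# If several are listed, the first one is the one a bare `filter` resolves to.
providers <- find("filter")
print(providers)

# conflicts() reports names that are defined in more than one attached place.
print(head(conflicts()))
```

If you run the same lines again after library(dplyr), find("filter") should list {dplyr} ahead of {stats}, which is exactly the masking described above.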
# Loading the previously installed packages
library(devtools)
library(tidyverse)
library(gapminder)
Every (user-accessible) function in a package comes with a documentation page that you can access by typing ?function_name after having loaded the package. This documentation details how the function works, e.g. what type of inputs it must be provided with and what its output is. This is usually the first place to look for help when encountering an error while using a function.
Finally, as with every academic paper or book, it is important to cite the tools you used within your data analysis workflow. R makes this simple by providing users with the citation("name_of_package") command.
library(ggplot2) # Load the {ggplot2} package
help("ggplot2") # Ask for package help page
?tibble # Ask for function help page
citation("dplyr") # Ask for citation info
To cite package 'dplyr' in publications use:
Wickham H, François R, Henry L, Müller K (2022). _dplyr: A Grammar of
Data Manipulation_. R package version 1.0.8,
<https://CRAN.R-project.org/package=dplyr>.
A BibTeX entry for LaTeX users is
@Manual{,
title = {dplyr: A Grammar of Data Manipulation},
author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller},
year = {2022},
note = {R package version 1.0.8},
url = {https://CRAN.R-project.org/package=dplyr},
}
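A few other base R helpers are handy alongside ?function_name and citation(). The short sketch below sticks to functions that ship with R, so it should run without any extra package. The choice of paste() and {stats} as examples is arbitrary.

```r
# args() shows the arguments of a function without opening its help page.
paste_args <- args(paste)
print(paste_args)

# packageVersion() tells you which version of a package is installed,
# which is worth reporting next to the citation.
stats_version <- packageVersion("stats")
print(stats_version)

# Called without a package name, citation() returns the citation for R itself;
# toBibtex() converts it into a BibTeX entry.
r_citation <- citation()
print(toBibtex(r_citation))
```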
Careful: R is case-sensitive, so ?ggplot2 works but ?Ggplot2 will not. Likewise, a variable called A is different from another called a.
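Here is a tiny illustration of case sensitivity that you can paste into the console. The object names and values are arbitrary, and the last part assumes a fresh session in which no loaded package happens to define a function called Print().

```r
# Two different objects whose names only differ by case
A <- "upper case"
a <- "lower case"
same_object <- identical(A, a)
print(same_object)  # FALSE

# The same applies to function names: print() exists, Print() does not.
print_exists <- exists("print")
print(print_exists)  # TRUE

# Calling the misspelled function raises an error; tryCatch() captures its
# message instead of stopping the script.
error_message <- tryCatch(Print("hello"), error = function(e) conditionMessage(e))
print(error_message)
```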
Now that you have all the tools at your disposal, let’s create our first data visualization project! Feel free to open up a new R script in RStudio by clicking on File --> New File --> R Script or pressing CTRL+SHIFT+N on your keyboard, then copy and paste the following code into your file.
The first thing we’ll do is to load all the necessary packages for our little project:
library(gapminder) # A package with the data we'll analyse
library(tidyverse) # A set of packages for data analysis and visualization
Datasets come in different formats and from various sources. We’ll
see in a further lab session how to import the most common types of
datasets. For now we will use the datasets contained in the {gapminder}
package. This data is essentially an excerpt of the famous gapminder database. Let’s have a
look at what we’re working with!
# Let's look at the original dataset from the {gapminder} package
glimpse(gapminder) # Displays a description of the data (dplyr)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
head(gapminder) # Displays the first observations (base R)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
tail(gapminder) # Displays the last observations (base R)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Zimbabwe Africa 1982 60.4 7636524 789.
2 Zimbabwe Africa 1987 62.4 9216418 706.
3 Zimbabwe Africa 1992 60.4 10704340 693.
4 Zimbabwe Africa 1997 46.8 11404948 792.
5 Zimbabwe Africa 2002 40.0 11926563 672.
6 Zimbabwe Africa 2007 43.5 12311143 470.
names(gapminder) # Displays column names (base R)
[1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
summary(gapminder) # Displays a summary of the data
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
It looks like a panel dataset with observations at 5-year intervals of the life expectancy, the population and the GDP per capita for each country over the period 1952-2007. Let’s create a new dataframe by selecting only the data for a few European countries to plot the evolution of life expectancy over time.
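As a quick sanity check of that reading, the arithmetic can be done in base R. The sketch below only uses numbers shown in the outputs above (1,704 rows, years from 1952 to 2007) and assumes that every country is observed in every wave, as the 12 observations per listed country suggest.

```r
# One observation every 5 years between 1952 and 2007
survey_years <- seq(from = 1952, to = 2007, by = 5)
n_waves <- length(survey_years)
print(n_waves)  # 12

# With 1,704 rows in total, a balanced panel implies this many countries:
n_rows <- 1704
n_countries <- n_rows / n_waves
print(n_countries)  # 142
```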
# Extract information with dplyr:
eu_data <- gapminder %>%
  filter(country == "Switzerland" |
           country == "France" |
           country == "Germany")
# The equivalent, non-pipe code looks like this:
# eu_data <- gapminder
# eu_data <- filter(eu_data,
#                   country == "Switzerland" |
#                     country == "France" |
#                     country == "Germany")
The pipe operator %>% passes the object on its left as the first argument of the function on its right, which allows us to chain multiple operations on a single dataset. Here, the gapminder dataset is piped into dplyr::filter(), which keeps only the observations for our three countries, and the filtered result is assigned to the new eu_data variable.
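To make the idea of piping concrete without needing any package, here is a minimal sketch with base R’s own pipe |>, which works in the same spirit as %>% (it assumes R 4.1 or later). The toy data frame and its score column are made up for illustration and are not the real gapminder values.

```r
# Made-up toy data, just to keep the example self-contained
toy <- data.frame(
  country = c("France", "Germany", "Italy", "Switzerland"),
  score   = c(1, 2, 3, 4)
)
wanted <- c("Switzerland", "France", "Germany")

# Nested call: has to be read from the inside out.
nested <- nrow(subset(toy, country %in% wanted))

# Piped call: reads from left to right. `x |> f(y)` is evaluated as `f(x, y)`.
piped <- toy |> subset(country %in% wanted) |> nrow()

print(identical(nested, piped))  # TRUE: both count the 3 matching rows
```

The %in% test used here is a compact alternative to chaining several == comparisons with |, and it also works inside dplyr::filter().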
How does this subsetted dataframe look now?
eu_data
# A tibble: 36 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 France Europe 1952 67.4 42459667 7030.
2 France Europe 1957 68.9 44310863 8663.
3 France Europe 1962 70.5 47124000 10560.
4 France Europe 1967 71.6 49569000 13000.
5 France Europe 1972 72.4 51732000 16107.
6 France Europe 1977 73.8 53165019 18293.
7 France Europe 1982 74.9 54433565 20294.
8 France Europe 1987 76.3 55630100 22066.
9 France Europe 1992 77.5 57374179 24704.
10 France Europe 1997 78.6 58623428 25890.
# … with 26 more rows
summary(eu_data)
country continent year lifeExp
France :12 Africa : 0 Min. :1952 Min. :67.41
Germany :12 Americas: 0 1st Qu.:1966 1st Qu.:70.95
Switzerland:12 Asia : 0 Median :1980 Median :74.34
Afghanistan: 0 Europe :36 Mean :1980 Mean :74.45
Albania : 0 Oceania : 0 3rd Qu.:1993 3rd Qu.:77.60
Algeria : 0 Max. :2007 Max. :81.70
(Other) : 0
pop gdpPercap
Min. : 4815000 Min. : 7030
1st Qu.: 7144182 1st Qu.:15767
Median :53799292 Median :22516
Mean :45627967 Mean :22155
3rd Qu.:74396451 3rd Qu.:28530
Max. :82400996 Max. :37506
Looks like we’ve successfully extracted the data for Switzerland, France and Germany! Let’s move on and plot the life expectancy of these three countries over the past half century.
# Create your first plot
plot <- ggplot(eu_data, aes(x = year,           # Set the x axis
                            y = lifeExp,        # Set the y axis
                            group = country,    # Set the countries to plot
                            color = country)) + # Set different colors
  geom_line() # Add a line geom to draw the actual lines
Hold on, where is my plot? Didn’t I just create it?
Absolutely, you just created your plot and saved it in your environment, which you can see in the top right corner of your screen, alongside the eu_data dataframe. You can now visualize it by typing plot in your console!
plot
Et voilà! You just created your first plot in R!
Choose three other countries in the dataset and try to recreate the plot above! Don’t forget to change the names of the objects you create, e.g. eu_data to yourcountries_data and plot to your_plot. Feel free to add titles, labels, or captions.
# Making the first plot fancy!
fancyplot <- ggplot(eu_data, aes(x = year,           # Set the x axis
                                 y = lifeExp,        # Set the y axis
                                 group = country,    # Set the countries to plot
                                 color = country)) + # Set different colors
  geom_line() + # Add a line geom to draw the actual lines
  labs(title = "Evolution of the Life Expectancy",
       subtitle = paste(unique(eu_data$country), collapse = ", "),
       caption = "Data: {gapminder}",
       x = "Year",
       y = "Life expectancy",
       color = "Country") + # Changing the labels of the various graph elements
  theme_minimal() # Add a more minimalist theme
# Displaying the fancy plot
fancyplot
The best way to save the plot you just created is to use ggplot2::ggsave(). This function makes it easy to specify the various export options of plots and provides opinionated but sensible defaults.
# Take a look at the documentation of the ggsave function
?ggsave
# Specify the name of the file and the plot object you'd like to save
ggsave("bernhards_plot.pdf", fancyplot)
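If the defaults do not suit you, ggsave() also lets you set the size of the exported figure through its width, height and units arguments. The sketch below is only an illustration: the file name, the stand-in plot built from the mtcars data that ships with R, and the 16 x 10 cm size are all arbitrary choices. The ggplot2 part is skipped if the package is not installed.

```r
# Write to a temporary folder so the example does not clutter your project
out_file <- file.path(tempdir(), "example_plot.pdf")  # hypothetical file name

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Small stand-in plot; in the tutorial you would pass fancyplot instead
  p <- ggplot2::ggplot(mtcars, ggplot2::aes(x = wt, y = mpg)) +
    ggplot2::geom_point()

  # Export a 16 cm x 10 cm PDF
  ggplot2::ggsave(filename = out_file, plot = p,
                  width = 16, height = 10, units = "cm")

  print(file.exists(out_file))
}
```

The file extension decides the output format, so changing .pdf to .png in the file name should be enough to get a PNG instead.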
Congrats! You’ve made it to the end of this first tutorial and have successfully installed R and RStudio! In the next tutorial, we will look at version control software and learn how to install Git and Fork (a GUI for Git) and how to create an account on GitHub. In the meantime, feel free to experiment further with the {gapminder} data to create pretty visualizations!
{gapminder} visualizations
Check out {gapminder}’s repository on GitHub! It has a few pretty data visualization examples on its README page that you could try to reproduce!
If you want to explore the data visualization capabilities of R and ggplot, check out the tidytuesday challenge and the associated shiny app showcasing this week’s charts.