1 Installing R and RStudio

This guide will get you ready to work with R and RStudio. It will also teach you the basics of version control with GitHub within your data analysis workflow to facilitate collaboration and keep your data and scripts safe.

Install the tools in the order in which they are presented below.

2 Installing R

R is a free software environment and programming language for statistical computing and data visualization maintained by the R Foundation.

To install R, head to the Comprehensive R Archive Network (CRAN) and choose the version corresponding to your operating system (Windows, macOS, Linux). Make sure you click on a link in the upper box of the page to get the precompiled version of R.

2.1 Try it out!

Great! Now that you have successfully installed R, close the installer and start R. You should be greeted with a window looking like this:

This is the basic interface which you can use to interact with R. To test it out, run the command print("Congratulations on installing R!") and then close the window again.
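
If you would like to double-check which version of R you just installed, the short sketch below can be typed into the same window. It only uses functions that ship with base R; the version number 4.1.0 is merely an example threshold chosen for illustration, not a requirement stated by this tutorial.

```r
# Print the version string of the R installation you are running
R.version.string

# Check whether the running version is at least a given version.
# "4.1.0" is an example threshold chosen for illustration only.
getRversion() >= "4.1.0"
```

The first command returns a short text describing your R version, the second one returns TRUE or FALSE.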

This simple GUI isn’t very pretty and does not display a lot of information. Let’s install RStudio to remedy that!

3 Installing RStudio

RStudio is a free integrated development environment (IDE) that allows you to be more productive while analysing and visualizing data by integrating features such as code highlighting, debugging tools, workspace managers, etc. You can even run Python scripts in RStudio!

To install RStudio, simply head to their website and select the installer corresponding to your operating system.

NB: If you have trouble installing RStudio, check out the second part of the installation videos linked above for a step by step guide.

3.1 Try out RStudio

Awesome! Now launch RStudio. You should see a window that is similar to the one I have while I am writing this article (with a different theme).

If this is the case, try running a few simple lines of code to test whether R and RStudio have been correctly installed. Let us type the following command in the console (i.e. the bottom left pane where you see >). Simply type print("Congratulations on installing RStudio!") and check that the resulting message you get is [1] "Congratulations on installing RStudio!". Getting this message means that you have successfully installed both R and RStudio.

Before we proceed, take the time to look at the RStudio IDE cheatsheet you will find here. This will get you acquainted with the different panes and buttons on the screen.

4 Getting ready for your first data visualization project!

Now that you have installed base R and the RStudio IDE you’re almost ready for your first data visualization project! This little tutorial will also outline how to install new packages to extend R’s functionalities and play around with objects such as graphs and dataframes!

4.1 Learning how to install R packages:

A helpful metaphor that allowed me to grasp the difference between base R and packages is the workshop metaphor. Base R, the program we just installed, is our workshop and already contains a number of basic tools (i.e. functions) such as the print() function we used above. Now, think of packages as toolboxes of additional tools that you can add to your workshop to speed up your data analysis workflow or to carry out more specialized tasks. After all, a carpenter won’t use the same tools as an electrician.

Now that you know what packages are, you might wonder where one can download them. The good news is that the vast majority of them are one command away and are free and open-source (meaning that you can check their source code online!). There are essentially two places where you will find them:

  • CRAN --> Installed via the install.packages("package_name") command
  • GitHub --> Installed via the devtools::install_github("username/repositoryname") command after having installed the {devtools} package from CRAN.
install.packages("devtools")  # Install the devtools package
library(devtools)             # Load the package into the R session
# "username/package_name" is a placeholder: replace it with a real repository
install_github("username/package_name") # Install some other package from GitHub

Now that we have seen how to install packages from the two main sources, let’s try it out and install a couple of packages ourselves!

# Run the following commands in your RStudio console before proceeding
install.packages("devtools")  # Install the devtools package
install.packages("tidyverse") # A set of SUPER useful packages
install.packages("gapminder") # Some data for our data visualization exercise

# NB: A neat way of doing this is to install them all at once by putting
# their names inside a vector with c()!
install.packages(c("devtools", "tidyverse", "gapminder"))
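
Running install.packages() again re-installs packages you already have. As a minimal sketch, you could first check which packages are missing and install only those. The object names wanted, is_installed and to_install are illustrative choices, and the installation line is commented out so that nothing is downloaded unless you decide to.

```r
# Packages this tutorial relies on
wanted <- c("devtools", "tidyverse", "gapminder")

# requireNamespace() returns TRUE if a package can be loaded and FALSE
# otherwise (note that it loads the package's namespace as a side effect)
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of the packages that are not installed yet
to_install <- wanted[!is_installed]
to_install

# Install only what is missing (uncomment the next line to actually do it)
# if (length(to_install) > 0) install.packages(to_install)
```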

4.2 Learning how to use R packages:

Awesome! You’ve just installed a few packages and considerably expanded your data analysis toolbox. Now, the unfortunate thing is that you cannot use these tools right away without telling R in which toolbox they are. In other words, you will need to either “import” them once per R session by running the library(package_name) command or reference them before each function by writing package_name::myfunction().

# tibble() comes from the {tibble} package, which was installed along with the
# {tidyverse}; {dplyr} re-exports it. Before any of these packages is loaded,
# simply calling tibble() fails with a "could not find function" error:
tibble(mtcars)
# We can access it either by loading the {dplyr} package...
library(dplyr)
tibble(mtcars)
# ... or by referencing it explicitly:
dplyr::tibble(mtcars)

You might wonder why both ways exist, since the first one seems far more convenient. I found the second option safer when I was working with network analysis methods that were spread across several packages. These packages contained identically or very similarly named functions, which complicated my workflow. This is a real issue because functions from the most recently loaded package take precedence over (mask) same-named functions from packages loaded earlier. Hence, I’ve taken the habit of writing out the name of the package each function comes from. As a bonus, it helps you remember which package provides which function!
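
To see masking in action without installing anything, here is a small sketch that imitates the situation with a function we define ourselves instead of two real packages. The replacement sd() below is purely hypothetical and only serves as a stand-in for a same-named function from another package.

```r
# Define our own sd() in the global environment; it now masks stats::sd()
sd <- function(x) "my own sd()"

sd(c(1, 2, 3))        # calls our function, not the one from {stats}
stats::sd(c(1, 2, 3)) # the explicit prefix always reaches the {stats} version

# Remove our version so that sd() refers to stats::sd() again
rm(sd)
sd(c(1, 2, 3))
```

The package_name:: prefix gives the same result no matter what has been defined or loaded in the meantime, which is exactly why it is the safer option.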

# Loading the previously installed packages
library(devtools)
library(tidyverse)
library(gapminder)

4.3 Package documentation and citation

Every (user-accessible) function in a package comes with a documentation file that you can access by typing ?function_name after having loaded the package. This documentation details how the function works, e.g. what type of inputs it expects and what its output is. This is usually the first place to look for help when encountering an error while using a function.

Finally, as with every academic paper or book, it is important to cite the tools you used within your data analysis workflow. R makes this simple with the citation("name_of_package") command.

library(ggplot2)  # Load the {ggplot2} package
help("ggplot2")   # Ask for package help page
?tibble           # Ask for function help page
citation("dplyr") # Ask for citation info

To cite package 'dplyr' in publications use:

  Wickham H, François R, Henry L, Müller K (2022). _dplyr: A Grammar of
  Data Manipulation_. R package version 1.0.8,
  <https://CRAN.R-project.org/package=dplyr>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {dplyr: A Grammar of Data Manipulation},
    author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller},
    year = {2022},
    note = {R package version 1.0.8},
    url = {https://CRAN.R-project.org/package=dplyr},
  }
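
The same command also works for R itself. As a minimal sketch, calling citation() without an argument returns the reference for the R version you are running, and toBibtex() converts it into a BibTeX entry. The object name r_citation is just an illustrative choice.

```r
# Without an argument, citation() returns the reference for R itself
r_citation <- citation()
r_citation

# Convert the reference into a BibTeX entry for LaTeX users
toBibtex(r_citation)
```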

Careful: R is case-sensitive, so ?ggplot2 works but ?Ggplot2 will not. Likewise, a variable called A is different from another called a.
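
A quick way to convince yourself is the following sketch, which uses only base R. The variable names A and a are taken from the sentence above, and the values assigned to them are arbitrary.

```r
# A and a are two different variables
A <- 1
a <- 2
identical(A, a) # FALSE: the two names refer to different values

# The same holds for function names
exists("print") # TRUE: print() is part of base R
exists("Print") # FALSE in a fresh R session: base R has no Print()
```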

4.4 Create your first plot

Now that you have all the tools at your disposal, let’s create our first data visualization project! Open a new R script in RStudio by clicking on File --> New File --> R Script or by pressing CTRL+SHIFT+N on your keyboard, then copy and paste the following code into your file.

The first thing we’ll do is to load all the necessary packages for our little project:

library(gapminder) # A package with the data we'll analyse
library(tidyverse) # A set of packages for data analysis and visualization

Datasets come in different formats and from various sources. We’ll see in a later lab session how to import the most common types of datasets. For now we will use the dataset contained in the {gapminder} package. This data is essentially an excerpt of the famous Gapminder database. Let’s have a look at what we’re working with!

# Let's look at the original dataset from the {gapminder} package
glimpse(gapminder) # Displays a description of the data (dplyr)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
head(gapminder) # Displays the first observations (base R)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
tail(gapminder) # Displays the last observations (base R)
# A tibble: 6 × 6
  country  continent  year lifeExp      pop gdpPercap
  <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
1 Zimbabwe Africa     1982    60.4  7636524      789.
2 Zimbabwe Africa     1987    62.4  9216418      706.
3 Zimbabwe Africa     1992    60.4 10704340      693.
4 Zimbabwe Africa     1997    46.8 11404948      792.
5 Zimbabwe Africa     2002    40.0 11926563      672.
6 Zimbabwe Africa     2007    43.5 12311143      470.
names(gapminder) # Displays column names (base R)
[1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
summary(gapminder) # Displays a summary of the data
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

It looks like a panel dataset with observations at 5-year intervals of life expectancy, population and GDP per capita for each country over the period 1952-2007. Let’s create a new dataframe containing only the data for a few European countries to plot the evolution of life expectancy over time.

# Extract information with dplyr:
eu_data <- gapminder %>%
  filter(country == "Switzerland" |
           country == "France" |
           country == "Germany")

# The equivalent, non-pipe code looks like this:
# eu_data <- gapminder
# eu_data <- filter(eu_data,
#                   country == "Switzerland" |
#                     country == "France" |
#                     country == "Germany")

The pipe operator %>% passes the object on its left, here the gapminder dataset, as the first argument to the function on its right, here dplyr::filter(). The assignment operator <- then stores the filtered result in the new eu_data variable. Pipes let us chain multiple operations on a single dataset without creating intermediate objects.

What does this subsetted dataframe look like now?
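
To make the idea of piping more tangible, here is a small self-contained sketch. It makes two assumptions that are not part of the code above: it uses base R's native pipe |> (available from R 4.1 onwards) instead of %>%, so that it runs without loading any package, and it works on a made-up toy dataframe instead of the gapminder data. It also uses %in%, a compact alternative to chaining several == comparisons with |.

```r
# Toy data standing in for gapminder (the values are made up for illustration)
toy <- data.frame(
  country = c("France", "Germany", "Kenya", "Switzerland"),
  value   = c(1, 2, 3, 4)
)
wanted <- c("Switzerland", "France", "Germany")

# Nested calls: you have to read them from the inside out
nested <- nrow(subset(toy, country %in% wanted))

# Piped calls: read from left to right, each result feeds the next function
piped <- toy |>
  subset(country %in% wanted) |>
  nrow()

# Both versions count the same number of matching rows
identical(nested, piped)
```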

eu_data
# A tibble: 36 × 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 France  Europe     1952    67.4 42459667     7030.
 2 France  Europe     1957    68.9 44310863     8663.
 3 France  Europe     1962    70.5 47124000    10560.
 4 France  Europe     1967    71.6 49569000    13000.
 5 France  Europe     1972    72.4 51732000    16107.
 6 France  Europe     1977    73.8 53165019    18293.
 7 France  Europe     1982    74.9 54433565    20294.
 8 France  Europe     1987    76.3 55630100    22066.
 9 France  Europe     1992    77.5 57374179    24704.
10 France  Europe     1997    78.6 58623428    25890.
# … with 26 more rows
summary(eu_data)
        country      continent       year         lifeExp     
 France     :12   Africa  : 0   Min.   :1952   Min.   :67.41  
 Germany    :12   Americas: 0   1st Qu.:1966   1st Qu.:70.95  
 Switzerland:12   Asia    : 0   Median :1980   Median :74.34  
 Afghanistan: 0   Europe  :36   Mean   :1980   Mean   :74.45  
 Albania    : 0   Oceania : 0   3rd Qu.:1993   3rd Qu.:77.60  
 Algeria    : 0                 Max.   :2007   Max.   :81.70  
 (Other)    : 0                                               
      pop             gdpPercap    
 Min.   : 4815000   Min.   : 7030  
 1st Qu.: 7144182   1st Qu.:15767  
 Median :53799292   Median :22516  
 Mean   :45627967   Mean   :22155  
 3rd Qu.:74396451   3rd Qu.:28530  
 Max.   :82400996   Max.   :37506  
                                   

Looks like we’ve successfully extracted the data for Switzerland, France and Germany! Let’s move on and plot the life expectancy of these three countries over the past half century.

# Create your first plot
plot <- ggplot(eu_data, aes(x = year,           # Set the x axis
                            y = lifeExp,        # Set the y axis
                            group = country,    # Set the countries to plot
                            color = country)) + # Set different colors
  geom_line() # Add a line geom to draw the actual lines

Hold on, where is my plot? Didn’t I just create it?

Absolutely, you just created your plot and saved it in your environment, which you can see in the top right corner of your screen, alongside the eu_data dataframe. You can now display it by typing plot in your console!

plot 

Et voilà! You just created your first plot in R!

5 Now it’s your turn!

Choose three other countries in the dataset and try to recreate the plot above! Don’t forget to change the names of the objects you create, e.g. eu_data to yourcountries_data and plot to your_plot. Feel free to add titles, labels, or captions.

# Making the first plot fancy!
fancyplot <- ggplot(eu_data, aes(x = year,      # Set the x axis
                            y = lifeExp,        # Set the y axis
                            group = country,    # Set the countries to plot
                            color = country)) + # Set different colors
  geom_line() + # Add a line geom to draw the actual lines
  labs(title = "Evolution of the Life Expectancy",
       subtitle = paste(unique(eu_data$country), collapse = ", "),
       caption = "Data: {gapminder}",
       x = "Year",
       y = "Life expectancy",
       color = "Country") + # Changing the labels of the various graph elements
  theme_minimal()           # Add a more minimalist theme
# Displaying the fancy plot
fancyplot

5.1 Save your creation

The best way to save the plot you just created is to use ggplot2::ggsave(). This function makes it easy to specify the various export options of plots and provides opinionated but sensible defaults.

# Take a look at the documentation of the ggsave function
?ggsave
# Specify the name of the file and the plot object you'd like to save
ggsave("bernhards_plot.pdf", fancyplot)

Congrats! You’ve made it to the end of this first tutorial and have successfully installed R and RStudio! In the next tutorial, we will look at version control software and learn how to install Git and Fork (a GUI for Git) and how to create an account on GitHub. In the meantime, feel free to experiment further with the {gapminder} data to create pretty visualizations!

6 Other nice {gapminder} visualizations

  • Check out {gapminder}’s repository on GitHub! They have a few pretty data visualization examples on their README page you could try to reproduce!

  • If you want to explore the data visualization capabilities of R and {ggplot2}, check out the TidyTuesday challenge and the associated Shiny app showcasing this week’s charts.