Chapter 4 Exploratory Data Analysis (EDA)
In Exploratory Data Analysis (EDA for short) we will want to explore our data. We usually start by getting summaries and plots that broadly summarise how our data is structured. We then proceed to more specific exploration of what variables interest us. There is no right way of doing EDA, and everyone does EDA slightly different. What matters in the end is that this part of your data analysis gives you a good glimpse about your data and potential targets to further explore and analyze. And of course, this will be a good time to put your plotting skills to test, and with R, you can do pretty much anything… and, again, I truly mean anything.
What is the data we will be working with? The data exemplified here comes from Dr. Kristen Gorman in Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. It contains information about 344 penguins, across 3 different species of penguins, collected from 3 islands in the Palmer Archipelago, Antarctica.
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/
Now lets look at the data We can either show the first rows (or last rows) to get a general view on what the columns are and what the values associated with each columns are, or we can View the entire data frame. Additionally we can also use several commands to provide us some more general info about the data frame itself.
4.1 Loading data
<- palmerpenguins::penguins_raw
penguins_raw head(penguins_raw)
## # A tibble: 6 × 17
## study…¹ Sampl…² Species Region Island Stage Indiv…³ Clutc…⁴ `Date Egg` Culme…⁵
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <date> <dbl>
## 1 PAL0708 1 Adelie… Anvers Torge… Adul… N1A1 Yes 2007-11-11 39.1
## 2 PAL0708 2 Adelie… Anvers Torge… Adul… N1A2 Yes 2007-11-11 39.5
## 3 PAL0708 3 Adelie… Anvers Torge… Adul… N2A1 Yes 2007-11-16 40.3
## 4 PAL0708 4 Adelie… Anvers Torge… Adul… N2A2 Yes 2007-11-16 NA
## 5 PAL0708 5 Adelie… Anvers Torge… Adul… N3A1 Yes 2007-11-16 36.7
## 6 PAL0708 6 Adelie… Anvers Torge… Adul… N3A2 Yes 2007-11-16 39.3
## # … with 7 more variables: `Culmen Depth (mm)` <dbl>,
## # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## # `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>, and
## # abbreviated variable names ¹studyName, ²`Sample Number`, ³`Individual ID`,
## # ⁴`Clutch Completion`, ⁵`Culmen Length (mm)`
# Use tail() if you want to see the last rows
The str()
command gives us some detail into the class of each column of the data frame, its length, its values, among others.
P.s., <dbl>
means a numerical value with decimal points.
4.2 Checking data
str(penguins_raw)
## tibble [344 × 17] (S3: tbl_df/tbl/data.frame)
## $ studyName : chr [1:344] "PAL0708" "PAL0708" "PAL0708" "PAL0708" ...
## $ Sample Number : num [1:344] 1 2 3 4 5 6 7 8 9 10 ...
## $ Species : chr [1:344] "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" ...
## $ Region : chr [1:344] "Anvers" "Anvers" "Anvers" "Anvers" ...
## $ Island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ Stage : chr [1:344] "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" ...
## $ Individual ID : chr [1:344] "N1A1" "N1A2" "N2A1" "N2A2" ...
## $ Clutch Completion : chr [1:344] "Yes" "Yes" "Yes" "Yes" ...
## $ Date Egg : Date[1:344], format: "2007-11-11" "2007-11-11" ...
## $ Culmen Length (mm) : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ Culmen Depth (mm) : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ Flipper Length (mm): num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ Body Mass (g) : num [1:344] 3750 3800 3250 NA 3450 ...
## $ Sex : chr [1:344] "MALE" "FEMALE" "FEMALE" NA ...
## $ Delta 15 N (o/oo) : num [1:344] NA 8.95 8.37 NA 8.77 ...
## $ Delta 13 C (o/oo) : num [1:344] NA -24.7 -25.3 NA -25.3 ...
## $ Comments : chr [1:344] "Not enough blood for isotopes." NA NA "Adult not sampled." ...
## - attr(*, "spec")=
## .. cols(
## .. studyName = col_character(),
## .. `Sample Number` = col_double(),
## .. Species = col_character(),
## .. Region = col_character(),
## .. Island = col_character(),
## .. Stage = col_character(),
## .. `Individual ID` = col_character(),
## .. `Clutch Completion` = col_character(),
## .. `Date Egg` = col_date(format = ""),
## .. `Culmen Length (mm)` = col_double(),
## .. `Culmen Depth (mm)` = col_double(),
## .. `Flipper Length (mm)` = col_double(),
## .. `Body Mass (g)` = col_double(),
## .. Sex = col_character(),
## .. `Delta 15 N (o/oo)` = col_double(),
## .. `Delta 13 C (o/oo)` = col_double(),
## .. Comments = col_character()
## .. )
# You can also use glimpse() from the dplyr package
# Or you can use the skim() function from the skimr package
The summary
command gives us a summary of each column, with, in case of numeric, the distribution parameters of those values, and in the case of categorical columns, the count of each value while also highlighting always the number of NA
entries.
summary(penguins_raw)
## studyName Sample Number Species Region
## Length:344 Min. : 1.00 Length:344 Length:344
## Class :character 1st Qu.: 29.00 Class :character Class :character
## Mode :character Median : 58.00 Mode :character Mode :character
## Mean : 63.15
## 3rd Qu.: 95.25
## Max. :152.00
##
## Island Stage Individual ID Clutch Completion
## Length:344 Length:344 Length:344 Length:344
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm)
## Min. :2007-11-09 Min. :32.10 Min. :13.10 Min. :172.0
## 1st Qu.:2007-11-28 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0
## Median :2008-11-09 Median :44.45 Median :17.30 Median :197.0
## Mean :2008-11-27 Mean :43.92 Mean :17.15 Mean :200.9
## 3rd Qu.:2009-11-16 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
## Max. :2009-12-01 Max. :59.60 Max. :21.50 Max. :231.0
## NA's :2 NA's :2 NA's :2
## Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
## Min. :2700 Length:344 Min. : 7.632 Min. :-27.02
## 1st Qu.:3550 Class :character 1st Qu.: 8.300 1st Qu.:-26.32
## Median :4050 Mode :character Median : 8.652 Median :-25.83
## Mean :4202 Mean : 8.733 Mean :-25.69
## 3rd Qu.:4750 3rd Qu.: 9.172 3rd Qu.:-25.06
## Max. :6300 Max. :10.025 Max. :-23.79
## NA's :2 NA's :14 NA's :13
## Comments
## Length:344
## Class :character
## Mode :character
##
##
##
##
At this phase you should start assessing your data and will probably need to modify the data frame a bit. Most of the tools you need were already introduced in Chapter 2, but you will learn some new ones here along the way. First though, lets get familiarized with the data by reading its description.
So a brief intro to this data. It has 17 columns, namely:
studyName: Sampling expedition from which data were collected, generated, etc.
Sample Number: an integer denoting the continuous numbering sequence for each sample
Species: a character string denoting the penguin species
Region: a character string denoting the region of Palmer LTER sampling grid
Island: a character string denoting the island near Palmer Station where samples were collected
Stage: a character string denoting reproductive stage at sampling
Individual ID: a character string denoting the unique ID for each individual in dataset
Clutch Completion: a character string denoting if the study nest observed with a full clutch, i.e., 2 eggs
Date Egg: a date denoting the date study nest observed with 1 egg (sampled)
Culmen Length: a number denoting the length of the dorsal ridge of a bird’s bill (millimeters)
Culmen Depth: a number denoting the depth of the dorsal ridge of a bird’s bill (millimeters)
Flipper Length: an integer denoting the length penguin flipper (millimeters)
Body Mass: an integer denoting the penguin body mass (grams)
Sex: a character string denoting the sex of an animal
Delta 15 N: a number denoting the measure of the ratio of stable isotopes 15N:14N
Delta 13 C: a number denoting the measure of the ratio of stable isotopes 13C:12C
Comments: a character string with text providing additional relevant information for data
There is also a subsetted (cleaner) version with just the species, island, size (flipper length, body mass, bill dimensions) and sex variables. We will, however, work with the raw version and transform it ourselves making it cleaner and easier to explore and gather some initial insights about the data.
head(penguins_raw)
## # A tibble: 6 × 17
## study…¹ Sampl…² Species Region Island Stage Indiv…³ Clutc…⁴ `Date Egg` Culme…⁵
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <date> <dbl>
## 1 PAL0708 1 Adelie… Anvers Torge… Adul… N1A1 Yes 2007-11-11 39.1
## 2 PAL0708 2 Adelie… Anvers Torge… Adul… N1A2 Yes 2007-11-11 39.5
## 3 PAL0708 3 Adelie… Anvers Torge… Adul… N2A1 Yes 2007-11-16 40.3
## 4 PAL0708 4 Adelie… Anvers Torge… Adul… N2A2 Yes 2007-11-16 NA
## 5 PAL0708 5 Adelie… Anvers Torge… Adul… N3A1 Yes 2007-11-16 36.7
## 6 PAL0708 6 Adelie… Anvers Torge… Adul… N3A2 Yes 2007-11-16 39.3
## # … with 7 more variables: `Culmen Depth (mm)` <dbl>,
## # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## # `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>, and
## # abbreviated variable names ¹studyName, ²`Sample Number`, ³`Individual ID`,
## # ⁴`Clutch Completion`, ⁵`Culmen Length (mm)`
4.3 Cleaning
4.3.1 Cleaning col. names
names(penguins_raw)
## [1] "studyName" "Sample Number" "Species"
## [4] "Region" "Island" "Stage"
## [7] "Individual ID" "Clutch Completion" "Date Egg"
## [10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
## [13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
## [16] "Delta 13 C (o/oo)" "Comments"
So first off, we might want to clean the data.
One thing we often do is rename variables (in this case columns).
We can do so simply with the use of the rename()
function from dplyr
.
rename()
: changes the names of individual variables using new_name = old_name syntax.
rename_with()
: renames columns using a function.
# Lets say I want to rename "Individual ID" to "ID".
<- penguins_raw %>%
penguins rename('ID' = 'Individual ID')
Now we might want to make every column, except ID, lower case, substitute spaces with “_” and remove “()”.
<- penguins %>%
penguins rename_with(tolower) %>% # lower-case every column
rename_with(toupper, starts_with('ID')) %>% # up-case column starting with "ID"
rename_with(~gsub(" ", "_", .x)) %>% # subs every space (" ") with no space ("_")
rename_with(~gsub("\\(", "", .x)) %>% #removes every "("
rename_with(~gsub("\\)", "", .x)) %>% # removes every ")"
rename_with(~gsub("/" ,"", .x)) # removes /
We could also skip a few things and be less specific, but instead use only the following function from the “janitor” package
%>%
penguins ::clean_names() janitor
## # A tibble: 344 × 17
## studyn…¹ sampl…² species region island stage id clutc…³ date_egg culme…⁴
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <date> <dbl>
## 1 PAL0708 1 Adelie… Anvers Torge… Adul… N1A1 Yes 2007-11-11 39.1
## 2 PAL0708 2 Adelie… Anvers Torge… Adul… N1A2 Yes 2007-11-11 39.5
## 3 PAL0708 3 Adelie… Anvers Torge… Adul… N2A1 Yes 2007-11-16 40.3
## 4 PAL0708 4 Adelie… Anvers Torge… Adul… N2A2 Yes 2007-11-16 NA
## 5 PAL0708 5 Adelie… Anvers Torge… Adul… N3A1 Yes 2007-11-16 36.7
## 6 PAL0708 6 Adelie… Anvers Torge… Adul… N3A2 Yes 2007-11-16 39.3
## 7 PAL0708 7 Adelie… Anvers Torge… Adul… N4A1 No 2007-11-15 38.9
## 8 PAL0708 8 Adelie… Anvers Torge… Adul… N4A2 No 2007-11-15 39.2
## 9 PAL0708 9 Adelie… Anvers Torge… Adul… N5A1 Yes 2007-11-09 34.1
## 10 PAL0708 10 Adelie… Anvers Torge… Adul… N5A2 Yes 2007-11-09 42
## # … with 334 more rows, 7 more variables: culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_ooo <dbl>, delta_13_c_ooo <dbl>, comments <chr>, and abbreviated
## # variable names ¹studyname, ²sample_number, ³clutch_completion,
## # ⁴culmen_length_mm
4.3.2 Date
One interesting and somewhat difficult (sometimes) parameter to adjust is turning a column that is often a character to a date
format.
In this case is simple, but I give you some examples below for you to work your way through other types of transformations.
$date_egg <- as.Date(penguins$date_egg) penguins
Now here are other examples, aside from this data.
<- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
characters <- as.Date(characters, "%d%b%Y")
dates dates
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
<- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
characters <- as.Date(characters, "%m/%d/%y")
dates dates
## [1] "1992-02-27" "1992-02-27" "1992-01-14" "1992-02-28" "1992-02-01"
Check also these link for better examples. https://www.r-bloggers.com/2013/08/date-formats-in-r/
4.3.3 Cleaning some columns
In particular, I want to remove the species latin name from the “species” column as well as the parenthesis.
Now this may seem simple, but messing with characters is one of those things that having an internet connection is key, since base commands, in my opinion, are not at all clear or simple.
The command below simple means replace the species
column with a new one (with the same name), but in this one replace all the text that is between the parenthesis (including the parenthesis themselves) " *\\(.*?\\) *"
and replace by nothing ""
.
<- penguins %>%
penguins mutate(species = gsub(" *\\(.*?\\) *", "", species))
Another thing that jumps to sight is the “stage” column, which appears to be telling the same thing always. Lets check.
table(penguins$stage)
##
## Adult, 1 Egg Stage
## 344
unique(penguins$stage)
## [1] "Adult, 1 Egg Stage"
Since its all the same, I can choose to remove it.
$stage <- NULL penguins
4.3.4 Dealing with NA values
Now lets deal with NA values.
First lets quickly identify which columns have NA values, and how many of them each one has.
We could just look at the summary(penguins)
, or we could do this with other base R commands, like so:
sapply(penguins, function(x) sum(is.na(x)))
## studyname sample_number species region
## 0 0 0 0
## island ID clutch_completion date_egg
## 0 0 0 0
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
## 2 2 2 2
## sex delta_15_n_ooo delta_13_c_ooo comments
## 11 14 13 290
apply(is.na(penguins), 2, sum)
## studyname sample_number species region
## 0 0 0 0
## island ID clutch_completion date_egg
## 0 0 0 0
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
## 2 2 2 2
## sex delta_15_n_ooo delta_13_c_ooo comments
## 11 14 13 290
Note: Go to https://www.guru99.com/r-apply-sapply-tapply.html to better understand “apply” and other functions.
Or we can do something more elegant and use some packages that show us, with graphics, where our NAs are. Here are a few examples that I took from a quick google search:
# Alternative 1
library(naniar)
## Warning: package 'naniar' was built under R version 4.2.2
library(UpSetR)
## Warning: package 'UpSetR' was built under R version 4.2.2
%>%
penguins as_shadow_upset() %>%
upset()
# Alternative n2
library(visdat)
## Warning: package 'visdat' was built under R version 4.2.2
vis_miss(penguins)
::vis_dat(penguins) visdat
Now, we need to decide what to do with these NA values. We can see below that two penguins have data missing from their base features (e.g., culmen_lenght and body_mass). Also, a few of them don’t have information regarding sex. Since these are critical features, I’m gonna go ahead and eliminate these entries. We can do so by several ways, but below you have two examples on how to eliminate NA values.
# Example 1
<- penguins %>%
penguins filter(!is.na(culmen_length_mm), # show only rows that are not NA in the culmen_length_mm
!is.na(sex)) # show only rows that are not NA in the sex column
# Example 2
<- penguins %>%
penguins drop_na(culmen_length_mm, sex)
Now for some final touches, I will transform the body mass column’s units from grams to kilograms.
# Converting body_mass from grams to kg
<- penguins %>%
penguins mutate(body_mass = body_mass_g/1000) %>%
mutate(body_mass_g = NULL) # removing old column
To finish this data cleaning, its important to point out that we could have just done everything in one go, as exemplified below. Although as you can see, when dealing with too many things, its usually best to separate adjustments per chunks.
<- penguins_raw %>%
penguins rename('ID' = 'Individual ID') %>%
rename_with(tolower) %>% # lower-case every column
rename_with(toupper, starts_with('ID')) %>% # up-case column starting with "ID"
rename_with(~gsub(" ", "_", .x)) %>% # subs every space (" ") with no space ("_")
rename_with(~gsub("\\(", "", .x)) %>% #removes every "("
rename_with(~gsub("\\)", "", .x)) %>% # removes every ")"
rename_with(~gsub("/" ,"", .x)) %>% # removes /
mutate(date_egg = as.Date(date_egg)) %>% # turns the column to data format
mutate(species = gsub(" *\\(.*?\\) *", "", species)) %>% # removes text between parenthesis in species
mutate(stage = NULL) %>% # removes stage column
filter(!is.na(culmen_length_mm), !is.na(sex)) %>% # removes NA rows from these 2 columns
mutate(body_mass = body_mass_g/1000) %>% # transforms body_mass_g column to kilograms
mutate(body_mass_g = NULL)
4.4 Summarising
Before we start drawing graphs left and right, its important to also explore the data with data summaries (in text format). As mentioned, from my perspective, EDA isn’t an exact science, in a sense that there is no appropriate set of measures to explore. Knowing what to do comes from both a) knowing your data, b) what your objectives are and c) experience. Below I’m going to try to cover some aspects of EDA (first in text than with plots) that I would do when analysing this data. Lets start…
So, how many observations per specie are there in the data?
%>%
penguins count(species)
## # A tibble: 3 × 2
## species n
## <chr> <int>
## 1 Adelie Penguin 146
## 2 Chinstrap penguin 68
## 3 Gentoo penguin 119
table(penguins$species)
##
## Adelie Penguin Chinstrap penguin Gentoo penguin
## 146 68 119
Alternatively, we can use the functions tabyl()
and adorn_totals()
from the janitor package and create a more elaborate summary with percentages and totals.
%>%
penguins ::tabyl(species) %>%
janitor::adorn_totals() janitor
## species n percent
## Adelie Penguin 146 0.4384384
## Chinstrap penguin 68 0.2042042
## Gentoo penguin 119 0.3573574
## Total 333 1.0000000
Alright, now lets say we want to know how they are distributed across islands
%>%
penguins group_by(island) %>%
count(species)
## # A tibble: 5 × 3
## # Groups: island [3]
## island species n
## <chr> <chr> <int>
## 1 Biscoe Adelie Penguin 44
## 2 Biscoe Gentoo penguin 119
## 3 Dream Adelie Penguin 55
## 4 Dream Chinstrap penguin 68
## 5 Torgersen Adelie Penguin 47
What seem to be the main problems presented in the observation column?
%>%
penguins filter(!is.na(comments)) %>%
count(comments)
## # A tibble: 4 × 2
## comments n
## <chr> <int>
## 1 Nest never observed with full clutch. 34
## 2 Nest never observed with full clutch. Not enough blood for isotopes. 1
## 3 No delta15N data received from lab. 1
## 4 Not enough blood for isotopes. 7
Since we spotted that some comments only appear very often we could quickly do some data cleaning again.
Namely we might wanna combine certain levels of a factor or characters that are less represented (compared to other levels) together, instead of representing each.
In this case we would combine less represented types of comments into a single label (e.g., “OtherProblems”).
For that we could use fct_lump()
.
This function “lumps” (joins) together factors or characters with little representation.
We just need to say which column we want to use this on, and we can define other parameters of interest.
These are:
n: (specifying which of the most common to preserve, default being 1). If I say
n = 2
only the two most common comments will be preserved, with the rest being labeled as “OtherProblems”.p: Alternative to “n” in which we can specify to preserve only the levels/characters which have at least a proportion of x. Again, if we said p = 0.2, only the comments which are represented in 20% of the data would be kept.
other_level: Here you set the name attributed to the lumped factors. In this case we say “OtherProblems”.
%>%
penguins mutate(comments2 = fct_lump(f=comments, n = 1, other_level = 'OtherProblems')) %>%
head()
## # A tibble: 6 × 17
## study…¹ sampl…² species region island ID clutc…³ date_egg culme…⁴ culme…⁵
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <date> <dbl> <dbl>
## 1 PAL0708 1 Adelie… Anvers Torge… N1A1 Yes 2007-11-11 39.1 18.7
## 2 PAL0708 2 Adelie… Anvers Torge… N1A2 Yes 2007-11-11 39.5 17.4
## 3 PAL0708 3 Adelie… Anvers Torge… N2A1 Yes 2007-11-16 40.3 18
## 4 PAL0708 5 Adelie… Anvers Torge… N3A1 Yes 2007-11-16 36.7 19.3
## 5 PAL0708 6 Adelie… Anvers Torge… N3A2 Yes 2007-11-16 39.3 20.6
## 6 PAL0708 7 Adelie… Anvers Torge… N4A1 No 2007-11-15 38.9 17.8
## # … with 7 more variables: flipper_length_mm <dbl>, sex <chr>,
## # delta_15_n_ooo <dbl>, delta_13_c_ooo <dbl>, comments <chr>,
## # body_mass <dbl>, comments2 <fct>, and abbreviated variable names
## # ¹studyname, ²sample_number, ³clutch_completion, ⁴culmen_length_mm,
## # ⁵culmen_depth_mm
Anyway, back to exploring the data… Now lets ask the mean for each specie across
all columns in the data frame, where
the values are numeric.
We can do this, by simply grouping the data by species, that using the command below.
%>%
penguins group_by(species) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE)) %>%
head()
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
## ℹ In group 1: `species = "Adelie Penguin"`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
## # A tibble: 3 × 8
## species sample_num…¹ culme…² culme…³ flipp…⁴ delta…⁵ delta…⁶ body_…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Penguin 79 38.8 18.3 190. 8.86 -25.8 3.71
## 2 Chinstrap penguin 34.5 48.8 18.4 196. 9.36 -24.5 3.73
## 3 Gentoo penguin 61.5 47.6 15.0 217. 8.25 -26.2 5.09
## # … with abbreviated variable names ¹sample_number, ²culmen_length_mm,
## # ³culmen_depth_mm, ⁴flipper_length_mm, ⁵delta_15_n_ooo, ⁶delta_13_c_ooo,
## # ⁷body_mass
4.5 Plots
4.5.1 Formula
https://r-graph-gallery.com/
R has a natural plot language, but its quite “primitive” and complicated.
It is adequate if you want really quick plots with minimal customization.
Otherwise I would recommend using ggplot2
.
ggplot2
uses a distinct grammar for expression plots.
It may seem complicated at first, but its actually quite simple.
So every plot must start with a simple command:
ggplot(dataframe, aes("data"))
In the first parameter you just tell the command which dataframe to use. Assuming your dataframe is labeled as “df” you can just the following:
ggplot(df, aes("data"))
or
df %>% ggplot(aes("data"))
The next thing we need to do is specify which variables and measures we want to add to the plot.
This is done inside the aes()
command.
You can add the obvious x and y info, and you can also add group related info, which can be coded as “color”, “fill” or “shape”.
Here’s a few generic examples:
ggplot(df, aes(x = weight, y = height))
ggplot(df, aes(x = education_level, y = average_grade))
ggplot(df, aes(x = country, y = height, color = sex))
ggplot(df, aes(x = country, y = average_reading_time,
color = sex, shape = genre))
After this is specified, you just need to add layers of what you want.
By adding, I mean inserting a +
and specifying the layers you want.
Do you want points?
Bars?
Lines?
Circles?
Whatever you want, you’ll add each as a layer with geom_something()
, with the something corresponding to what you want.
ggplot2
has plenty of geoms, and I’ll be showing you a few below.
If you want to see what the rest can do, visit: https://ggplot2.tidyverse.org/reference/
So your general plot code for a simple bar plot should look something like:
ggplot(dataframe, aes(x = variable, y = variable2)) +
geom_bar()
Lastly, you might want to specify other little details, such as the labels, the theme, the color scheme, etc. You can specify hundreds of things. I’m just gonna leave you the commands I use the most and I think will be most useful to you.
# Specifying the labels
labs(x = 'x-axis label', y = 'y-axis label', title = 'title text', color = 'color label text')
# Specifying the lower and upper bounds of the axis
coord_cartesian(ylim = c('lower_y_limit', 'upper_y_limit'),
xlim = c('lower_x_limit', 'upper_y_limit'))
# Splitting the graphs creating several based on a grouping variable
facet_wrap(~'grouping_variable')
# Themes (just some examples)
theme_classic()
theme_light()
theme_gray()
If you want to save your plots you can do so either by clicking in the “Export” button on your Plots separator or you can use the function ggsave()
.
Well this should give you a general idea on how plots work. For more info, visit: http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
On to some examples with our data.
Note: I purposely leave some additional parameters inside each plot building code, not to confuse you, but to introduce you more ways to tweak your plot. I encourage you to mess with them.
4.5.2 Cols and Bars
The bar plot is one of the most common ones. R has two types of “bar” plots.
geom_bar
: Creates proportional count of entries (same ascount
but in a plot). Only requires one variable.geom_col
: Creates a more typical bar plot, where the height represents values/statistics of the data,
Here’s an example usage of geom_bar
# Count
%>%
penguins count(species)
## # A tibble: 3 × 2
## species n
## <chr> <int>
## 1 Adelie Penguin 146
## 2 Chinstrap penguin 68
## 3 Gentoo penguin 119
# Geom bar (count as a plot)
ggplot(penguins, aes(species)) +
geom_bar()
%>%
penguins ggplot(aes(species)) +
geom_bar(fill = 'darkgreen', color = 'black', size = 2, width = .5) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Now regarding geom_col
and the more typical use of showing means.
You have two ways of doing it, either:
you compute the means and build the plot with
geom_col()
you use the
stat_summary()
function to compute the mean inside the ggplot and specify that you want the “col” geom (or “bar” geom).
I usually just use stat_summary()
, instead of pre-computing the means, because its faster and this allows me to add another function with stat_summary()
that can also compute things like error bars.
# 1.
%>%
penguins group_by(species, sex) %>%
summarise(body_mass = mean(body_mass)) %>%
ggplot(aes(species, body_mass, fill = sex)) +
geom_col(position = position_dodge(.9), color = 'black') +
coord_cartesian(ylim = c(2, 5)) +
theme_minimal() +
labs(title = 'Plot with Means: Method 1')
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
# 2.
%>%
penguins ggplot(aes(species, body_mass, fill = sex)) +
stat_summary(fun = mean, geom = 'col', position = position_dodge(.9),
color = 'black') +
coord_cartesian(ylim = c(0, 6)) +
theme_minimal() +
labs(title = 'Plot with Means: Method 2')
4.5.3 Boxplots
Boxplots are a really cool way of showing you how your data is distributed across each category.
Its particularly important when you have skewed data, or are just simply more interested in the median (as opposed to the mean).
Below I introduce you the standard box plot, while also showing you other layers that can both appear on their own or can complement the boxplot geom, namely geom_jitter()
and geom_violin()
.
%>%
penguins filter(!is.na(delta_15_n_ooo)) %>% # remove NAs in the column to prevent warnings
ggplot(aes(x = species, y = delta_15_n_ooo, fill = species)) +
geom_boxplot() +
geom_jitter(width = .2, alpha = 0.3, aes(shape = island, color = sex), size = 3) +
geom_violin(alpha = .1, size = 1) + # the alpha parameter models transparency (ranges from 0 to 1)
theme_light()
4.5.4 Density
Density plots are great and provide a simple way for you to measure distribution of a variable’s values. You just need to point out the variable name, in this case we will look at “body_mass”, and we will additionally want the distributions separated by “species”.
%>%
penguins ggplot(aes(x = body_mass, fill = species)) +
geom_density(alpha = 0.5) +
theme_minimal() +
labs(x = 'Body mass (kg)', y = 'Density',
fill = 'Species',
title = 'Body mass distributions across Species') +
scale_fill_brewer(palette = "Dark2")
4.5.5 Facets
facet_wrap
is quite useful if you want to compare two plots across 2 (or more) variables.
It divides the data across the variable you name and creates two distinct plots.
You just need to specify the variable, which in this case in done so with ~name_of_the_variable
.
Importantly, if you want the scales (y-limits and x-limits) to be adjusted for each variable’s info, you need to specify this using scales = 'free'
if every scale (x and y) can vary between facets, or scales = 'free_x'
and scales = 'free_y'
, if you just want the x or y scales to vary freely, respetively.
In this case since we want to compare them directly, I think its best if they are fixed (the default behavior of the function), so you get a better picture as to how different they really are.
%>%
penguins ggplot(aes(x = body_mass, fill = species)) +
geom_density(alpha = .5) +
theme_minimal() +
labs(x = 'Body mass (g)', y = 'Density',
fill = 'Species',
title = 'Body mass distributions across Species') +
scale_color_brewer(palette = "Dark2") +
facet_wrap(~sex, scales = 'fixed')
4.5.6 Coord flip
For some plots, it might also be interesting (either more aesthetically pleasing or just a necessity given space limitations) to “flip the coordinates”.
You can do so, by using the command coord_flip()
.
Here’s an example.
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(alpha = 0.8) +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
theme_minimal() +
facet_wrap(~species, ncol = 1) +
coord_flip()
4.5.7 Correlation plots
Another particularly useful feature that is usually (initially) explored with plots is correlations. So lets say we want to get a general idea about correlations between a set of variables in the dataframe. Here’s a way to do it:
<- penguins %>%
correlations select('culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass') %>%
cor()
::corrplot(correlations, method = 'number') corrplot
We can additionally separate this correlations by specie.
For that we can use the function ggpairs()
from the GGally
package.
%>%
penguins select('species','culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass') %>%
::ggpairs(aes(color = species), axisLabels = 'none') GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
When you want to assess something more specific you can use a point plot. Point plots allow us to see how two numeric variables correlate with each other. In this example we explore how bill length and flipper length might be correlated, within each species.
<- penguins %>%
plot_fixe ggplot(aes(x = flipper_length_mm, y = culmen_length_mm, color = species, shape = species)) +
geom_point(aes(group = species), size = 1.7) +
geom_smooth(aes(group = species), method = 'lm', alpha = .2, formula = 'y ~ x') + # adding a line based on linear regression
theme_minimal() +
labs(x = 'Flipper Length (mm)', y = 'Bill Length (mm)',
color = 'Species', shape = 'Species',
title = 'Correlation between Flipper Length and Bill Length') +
scale_color_brewer(palette = "Dark2")
plot_fixe
4.5.8 Multiplots
Another cool thing we might want to do is add multiple plots to a final image.
We can do this using the function multiplot
from the packages Rmisc
.
Lets say we want to measure the mean of the deltas (“both delta_15_n_ooo” and “delta_13_c_ooo”), by species and sex.
And we want to see them, side by side.
To do so, we must create each plot separately, and assign it to a object.
Afterwards, we just need to call these objects inside the multiplot
function.
Here’s how to do it:
<- penguins %>%
d15_plot filter(!is.na(delta_15_n_ooo)) %>%
group_by(species, sex) %>%
summarise(delta_15 = mean(delta_15_n_ooo)) %>%
ggplot(aes(species, delta_15, color = sex)) +
geom_point(size = 5) +
geom_line(aes(group = sex), size = 1)
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
<- penguins %>%
d13_plot filter(!is.na(delta_13_c_ooo)) %>%
group_by(species, sex) %>%
summarise(delta_13 = mean(delta_13_c_ooo)) %>%
ggplot(aes(species, delta_13, color = sex)) +
geom_point(size = 5) +
geom_line(aes(group = sex), size = 1)
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
::multiplot(d15_plot, d13_plot) Rmisc
# or
::multiplot(d15_plot, d13_plot, cols = 2) # the cols parameter establishes the number of columns in the plot matrix Rmisc
# or
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.2.2
+ d13_plot d15_plot
4.5.9 Date
Plotting date is quite easy, and its quite useful when dealing with time-series analysis. In this data, as you will see, data is not that relevant. Below, I first build a general plot showing how many collections there were per date. Since I found a pattern, showing that these collection took place between November and December in three separate years, I then build three blocks of data for each year, and split the plot to better show the number of data entries per date.
# Original plot
%>%
penguins ggplot(aes(date_egg)) +
geom_bar()
# Transformed plot
%>%
penguins mutate(DateBlock = case_when( # creating a new column that identifies which year that data bellongs to
< '2008-01-01' ~ '2007',
date_egg > '2008-01-01' & date_egg < '2009-01-01' ~ '2008',
date_egg > '2009-01-01' ~ '2009'
date_egg %>%
)) ggplot(aes(date_egg)) +
geom_bar(fill = 'lightblue', color = 'black') +
facet_wrap(~DateBlock, scales = 'free_x') +
theme_minimal()
4.5.10 Ordering factors
Another useful function, is the function fct_reorder()
which comes from the package forcats
.
This function is particularly (although not exclusively) useful for plots.
As default, R organizes the levels in a factor by alphabetical order.
This function allows us to alter the order by which levels in a factor are presented, according to certain conditions we can define ourselves.
For instance if we plot the mean “delta_15_ooo” per species of penguin, we get the x-axis with (alphabetically defined) Adelie, Chinstrap and Gentoo penguins.
But lets say instead, we want to reorder the x-axis in a descending order according to mean delta_15 of each species.
# Original plot
%>%
penguins mutate(species = as.factor(species)) %>% # Turning species to factor
group_by(species) %>%
summarise(delta_15 = mean(delta_15_n_ooo, na.rm = TRUE)) %>%
ggplot(aes(species, delta_15)) +
stat_summary(fun = mean, geom = 'bar', width=.75, fill = 'tomato1', color = 'black') + # plotting the mean
theme(axis.text.x = element_text(angle = 10, vjust = 0.6)) +# just to show the x-axis text better
coord_cartesian(ylim = c(5, 10))
# Reordered plot
%>%
penguins group_by(species) %>%
summarise(delta_15 = mean(delta_15_n_ooo, na.rm = TRUE)) %>%
mutate(species = fct_reorder(species, delta_15, .desc = TRUE)) %>% # organized the levels of species based on delta_15 (descending)
ggplot(aes(species, delta_15)) +
stat_summary(fun = mean, geom = 'bar', width=.75, fill = 'tomato1', color = 'black') + # plotting the mean
theme(axis.text.x = element_text(angle = 10, vjust = 0.6)) + # just to show the x-axis text better
coord_cartesian(ylim = c(5, 10))