Chapter 7 Introduction to Data Visualization with ggplot2

7.1 Introduction

7.1.1 Data column types affect plot types

The mtcars dataset contains information on 32 cars from a 1973 issue of Motor Trend magazine.

library(ggplot2)

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

After cyl is converted from a numeric variable to a factor, the x-axis no longer shows values such as 5 or 7, only the values that are present in the dataset.

# Original command
ggplot(mtcars, aes(cyl, mpg)) +
  geom_point()

# Wrap cyl in factor() so that it is treated as a categorical variable
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point()
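To see why the axis changes, it can help to compare the two representations of cyl directly. The sketch below uses only base R; the object names cyl_numeric and cyl_factor are introduced here for illustration and do not appear elsewhere in these notes.

```r
# cyl as stored in mtcars: a numeric column, so ggplot2 draws a continuous axis
cyl_numeric <- mtcars$cyl
range(cyl_numeric)   # the continuous axis has to span this whole range

# cyl as a factor: only the observed values become levels,
# so the discrete axis shows one position per level
cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)
```

Because a factor has no notion of values between its levels, the discrete axis has no place for 5 or 7.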

7.1.2 The grammar of graphics

The seven grammatical elements

Element	Description	Essential?
Data	The dataset being plotted.	Yes
Aesthetics	The scales onto which we map our data.	Yes
Geometries	The visual elements used for our data.	Yes
Themes	All non-data ink.	No
Statistics	Representations of our data to aid understanding.	No
Coordinates	The space on which the data will be plotted.	No
Facets	Plotting small multiples.	No
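As a rough illustration of how the seven elements fit together, the sketch below builds one plot in which each line contributes one element. The particular choices (a linear trend, facets by cyl, the y limits, theme_bw()) are arbitrary and made only for this example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # data + aesthetics
  geom_point() +                              # geometries
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                   # statistics
  facet_wrap(vars(cyl)) +                     # facets
  coord_cartesian(ylim = c(10, 35)) +         # coordinates
  theme_bw()                                  # themes

p
```

Only the first two lines are required to draw something; the remaining elements fall back to defaults when they are left out.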

Jargon for each element

7.1.2.1 Mapping data columns to aesthetics

# Edit to add a color aesthetic mapped to disp
ggplot(mtcars, aes(wt, mpg, color = disp)) +
  geom_point()

# Change the color aesthetic to a size aesthetic
ggplot(mtcars, aes(wt, mpg, size = disp)) +
  geom_point()

7.1.3 ggplot2 layers

7.1.3.1 Adding geometries

geom_smooth() adds a smooth trend curve.

str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and
## formula = 'y ~ s(x, bs = "cs")'

7.1.3.2 Changing one geom or every geom

geom_point() has an alpha argument that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible).

# Make the points 40% opaque
ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and
## formula = 'y ~ s(x, bs = "cs")'

7.1.3.3 Saving plots as variables

Plots can be saved as variables, which can be added to later on using the + operator. This is really useful if you want to make multiple related plots from a common base.

# Draw a ggplot
plt_price_vs_carat <- ggplot(
  # Use the diamonds dataset
  diamonds,
  # For the aesthetics, map x to carat and y to price
  aes(carat, price)
)

# Add a point layer to plt_price_vs_carat
plt_price_vs_carat + geom_point()

# Edit this to make points 20% opaque: plt_price_vs_carat_transparent
plt_price_vs_carat_transparent <- plt_price_vs_carat + geom_point(alpha = 0.2)

# See the plot
plt_price_vs_carat_transparent

# Edit this to map color to clarity,
# Assign the updated plot to a new object
plt_price_vs_carat_by_clarity <- plt_price_vs_carat + 
    geom_point(aes(color = clarity))

# See the plot
plt_price_vs_carat_by_clarity

7.2 Aesthetics

7.2.1 Visible aesthetics

Typical visible aesthetics

Aesthetic	Description
x	X axis position
y	Y axis position
fill	Fill color (the interior color of points)
color	Color of points, outlines of other geoms (the outline color of points)
size	Area or radius of points, thickness of lines
alpha	Transparency (the opacity of points)
linetype	Line dash pattern
label	Text on a plot or axes
shape	Shape of points

label and shape are only applicable to categorical data.

7.2.1.1 Aesthetics: color, shape and size

# Convert cyl and am into the factors fcyl and fam
mtcars$fcyl <- factor(mtcars$cyl)
mtcars$fam <- factor(mtcars$am)
str(mtcars)

## 'data.frame':    32 obs. of  13 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
##  $ fcyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ fam : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...

# Map x to wt, y to mpg and color to fcyl
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Set the shape and size of the points
  geom_point(shape = 1, size = 4)

7.2.1.2 Aesthetics: color vs. fill

Typically, the color aesthetic changes the outline of a geom and the fill aesthetic changes the inside. geom_point() is an exception: you use color (not fill) for the point color. However, some shapes have special behavior.

The default geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allows you to use both fill for the inside and color for the outline. This lets you map two aesthetics to each point.

All shape values are described on the ?points help page.

# Map fcyl to fill
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  geom_point(shape = 1, size = 4)

ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  # Change point shape; set alpha
  geom_point(shape = 21, size = 4, alpha = 0.6)

# Map color to fam
ggplot(mtcars, aes(wt, mpg, fill = fcyl, color = fam)) +
  geom_point(shape = 21, size = 4, alpha = 0.6)

Notice that mapping a categorical variable onto fill doesn’t change the colors, although a legend is generated. This is because point shapes such as shape = 1 and the default shape = 19 have only a color attribute and no fill attribute. Use fill when you have another shape (such as a bar), or when using a point that has both a fill and a color attribute, such as shape = 21, which is a circle with an outline. Any time you use a solid color, make sure to use alpha blending to account for overplotting.

7.2.1.3 Comparing aesthetics

# Establish the base layer
plt_mpg_vs_wt <- ggplot(mtcars, aes(x = wt, y = mpg))

# Map fcyl to size
plt_mpg_vs_wt +
  geom_point(aes(size = fcyl))

## Warning: Using size for a discrete variable is not
## advised.

# Map fcyl to alpha, not size
plt_mpg_vs_wt +
  geom_point(aes(alpha = fcyl))

## Warning: Using alpha for a discrete variable is not
## advised.

# Map fcyl to shape, not alpha
plt_mpg_vs_wt +
  geom_point(aes(shape = fcyl))

# Use text layer and map fcyl to label
plt_mpg_vs_wt +
  geom_text(aes(label = fcyl))

7.2.2 Using attributes

Set attributes in geom_*(). Attributes are always called in the geom layer!

7.2.2.1 Attributes: color, shape, size and alpha

This time you’ll use these arguments to set attributes of the plot, not map variables onto aesthetics.

You can specify colors in R using hex codes: a hash followed by two hexadecimal digits each for red, green, and blue ("#RRGGBB"). Hexadecimal is base-16 counting: the digits run from 0 to 9, then from A (representing 10) up to F (representing 15). Each pair of hexadecimal digits gives a range from 0 to 255. "#000000" is black (no color), "#FFFFFF" is white, and "#00FFFF" is cyan (mixed green and blue).
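The relationship between hex codes and 0 to 255 channel values can be checked with base R. This is a small sketch; the variable names cyan_rgb and rebuilt_blue are made up for the example.

```r
# Each pair of hex digits is one channel, from 0 ("00") to 255 ("FF")
strtoi("FF", base = 16L)

# Split a hex color into its red, green and blue channels
cyan_rgb <- col2rgb("#00FFFF")
cyan_rgb

# Build a hex color from 0-255 channel values
rebuilt_blue <- rgb(74, 190, 255, maxColorValue = 255)
rebuilt_blue
```

Here 74, 190 and 255 are the decimal equivalents of the hex pairs 4A, BE and FF.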

# A hexadecimal color
my_blue <- "#4ABEFF"

ggplot(mtcars, aes(wt, mpg)) +
  # Set the point color and alpha
  geom_point(color = my_blue, alpha = 0.6)

# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  # Set point size and shape
  geom_point(color = my_blue,
             size = 10,
             shape = 1)

7.2.2.2 Attributes: conflicts with aesthetics

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add point layer with alpha 0.5
  geom_point(alpha = 0.5)

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add text layer with label rownames of the dataset mtcars and color red
  geom_text(label = rownames(mtcars),
            color = "red")

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add points layer with shape 24 and color yellow
  geom_point(shape = 24,
             color = "yellow")

Notice that adding more aesthetic mappings to your plot is not always a good idea! You may just increase complexity and decrease readability.

# 3 aesthetics: qsec vs. mpg, colored by fcyl
ggplot(mtcars, aes(x = mpg, y = qsec, color = fcyl)) +
  geom_point()

# 4 aesthetics: add a mapping of shape to fam
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam)) +
  geom_point()

# 5 aesthetics: add a mapping of size to hp / wt
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam, size = hp/wt)) +
  geom_point()

Between the x and y dimensions, the color, shape, and size of the points, your plot displays five dimensions of the dataset.

7.2.3 Modifying aesthetics

7.2.3.1 Adjustment for overlapping

Positions

position = "*" or position_*()

identity: default. Don’t adjust data positions.
dodge: preserves the vertical position of a geom while adjusting the horizontal position. position_dodge(width = NULL, preserve = c("total", "single")).
stack: stacks bars on top of each other. This is the default of geom_bar and geom_area.
fill: stacks bars and standardizes each stack to have constant height.
jitter: adds some random noise on both the x and y axes. position_jitter(width = NULL, height = NULL, seed = num)
jitterdodge
nudge
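The two ways of specifying a position, as a string or as a function call, might look like this in practice. The sketch is illustrative only: the width, height and seed values are arbitrary, and base, p_default and p_seeded are names chosen for the example.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(factor(cyl), mpg))

# String form: uses the position's default settings
p_default <- base + geom_point(position = "jitter")

# Function form: jitter horizontally only, with a fixed seed so that
# the random offsets are the same every time the plot is drawn
p_seeded <- base +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42))

p_seeded
```

The function form is the one to reach for when the defaults need tuning, since the string form takes no arguments.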

7.2.3.2 Scale functions

scale_*_*(): all scales are accessed through the scale_ functions. The second part of the function name defines which scale (aesthetic) we want to modify. The third part must match the type of data we are using (e.g. continuous or discrete).

scale_x_*(), e.g. scale_x_continuous()
scale_y_*()
scale_color_*(), e.g. scale_color_discrete()
- Also scale_colour_*()
scale_fill_*()
scale_shape_*()
scale_linetype_*()
scale_size_*()

There are many arguments for the scale functions. Most common are limits, breaks, expand and labels.

limits: describe the scale’s range.
breaks: control the tick mark positions.
expand: a numeric vector of length two, giving a multiplicative and additive constant used to expand the range of the scales so that there is a small gap between the data and the axes.
labels: adjust the category names.
labs(): not a scale argument but a separate function that changes the axis labels.

ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar() +
  # Set the axis labels
  labs(x = "Number of Cylinders",
       y = "Count")
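For comparison with labs(), here is a hedged sketch of the scale arguments themselves on one continuous and one discrete scale. The titles, limits and breaks are arbitrary values picked for mtcars, and x_scale and color_scale are illustrative names.

```r
library(ggplot2)

# A continuous x scale: axis title, range, tick positions, and no padding
x_scale <- scale_x_continuous(
  name = "Weight (1000 lbs)",
  limits = c(1, 6),
  breaks = 1:6,
  expand = c(0, 0)
)

# A discrete color scale: legend title and relabelled categories
color_scale <- scale_color_discrete(
  name = "Transmission",
  labels = c("automatic", "manual")
)

ggplot(mtcars, aes(wt, mpg, color = factor(am))) +
  geom_point() +
  x_scale +
  color_scale
```

Storing a scale in a variable is optional; it is done here only so that the same scale could be added to several plots.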

scale_fill_manual() defines properties of the fill color scale. The first argument sets the legend title. values is a named vector of colors to use, with one color per factor level.

levels(mtcars$fam) <- c("automatic", "manual")
str(mtcars$fam)

##  Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...

palette <- c(automatic = "#377EB8", manual = "#E41A1C")

ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar() +
  labs(x = "Number of Cylinders", y = "Count") +
  # Set the fill color scale
  scale_fill_manual("Transmission", values = palette)

# Set the position
ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar(position = "dodge") +
  labs(x = "Number of Cylinders", y = "Count") +
  scale_fill_manual("Transmission", values = palette)

7.2.3.3 Setting a dummy aesthetic

You can make univariate plots in ggplot2, but you will need to add a fake y axis by mapping y to zero.

When setting y-axis limits, you can specify the limits as separate arguments, or as a single numeric vector. That is, ylim(low, high) or ylim(c(low, high)).

# Plot 0 vs. mpg
ggplot(mtcars, aes(x = mpg, y = 0)) +
  # Add jitter 
  geom_point(position = "jitter") +
  # Set the y-axis limits
  ylim(-2, 2)

7.2.4 Aesthetics best practices

Mapping continuous variables

Mapping categorical variables

7.3 Geometries

geom_*

Each geom is associated with specific aesthetic mappings, some of which are essential and some of which are optional (e.g., alpha, color, fill, shape, size, stroke).
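One way to look up which aesthetics a geom requires and which it merely accepts is to query the ggproto objects that ggplot2 exports for each geom. This is a sketch assuming GeomPoint and GeomLine, the objects behind geom_point() and geom_line().

```r
library(ggplot2)

# Aesthetics a point layer cannot be drawn without
GeomPoint$required_aes

# Every aesthetic the point geom understands, required or optional
GeomPoint$aesthetics()

# The same query for the line geom
GeomLine$aesthetics()
```

The help page of each geom (e.g. ?geom_point) lists the same information in its Aesthetics section.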

Common plot types

7.3.1 Scatter plots

Possible geoms: point, jitter, abline, smooth, count
Essential aes: x, y

7.3.1.1 Overplotting 1: large datasets

Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque, hollow shapes.

Small points are suitable for large datasets with regions of high density (lots of overlapping).

# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))

# Add a point layer with tiny points
plt_price_vs_carat_by_clarity + 
    geom_point(alpha = 0.5, shape = ".")

# Use solid circles (shape 16) with transparency 0.5
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = 16)

7.3.1.2 Overplotting 2: Aligned values

This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.

# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))

# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()

# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam + 
    geom_point(position = position_jitter(width = 0.3))

# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam + 
    geom_point(position = position_jitterdodge(jitter.width = 0.3, 
                                               dodge.width = 0.3))

7.3.1.3 Overplotting 3: Low-precision data

This results from low-resolution measurements, as in the iris dataset, which is measured to 1 mm precision. It’s similar to case 2, but in this case we can jitter on both the x and y axes.

Notice that jitter can be a geom itself (i.e. geom_jitter()), an argument in geom_point() (i.e. position = "jitter"), or a position function, (i.e. position_jitter()).

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Swap for jitter layer with width 0.1
  geom_jitter(width = 0.1, alpha = 0.5)

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Set the position to jitter
  geom_point(alpha = 0.5,
             position = "jitter")

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Use a jitter position function with width 0.1
  geom_point(alpha = 0.5,
             position = position_jitter(width = 0.1))

7.3.1.4 Overplotting 4: Integer data

The data can be of type integer (i.e. 1, 2, 3…) or categorical (i.e. class factor). A factor is stored internally as type integer, with a levels attribute that holds the labels.

You’ll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don’t realize that integer and factor data are the same as low precision data.
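Both points can be checked in base R. The sketch below uses a throwaway factor f and the built-in mtcars data rather than Vocab, so it runs without any data file.

```r
# A factor is stored as integer codes plus a set of labels (levels)
f <- factor(c("low", "high", "low"))
typeof(f)   # storage type
class(f)    # class

# Two integer-valued columns share only a few distinct (x, y) positions,
# so many rows must land on exactly the same point
positions <- unique(mtcars[, c("cyl", "gear")])
nrow(positions)
nrow(mtcars)
```

When the number of distinct positions is much smaller than the number of rows, a plain scatter plot hides most of the data.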

library(tidyverse)
Vocab <- read_csv("data/Vocab.csv")

# Examine the structure of Vocab
str(Vocab)

## spc_tbl_ [21,638 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year      : num [1:21638] 2004 2004 2004 2004 2004 ...
##  $ sex       : chr [1:21638] "Female" "Female" "Male" "Female" ...
##  $ education : num [1:21638] 9 14 14 17 14 14 12 10 11 9 ...
##  $ vocabulary: num [1:21638] 3 6 9 8 1 7 6 6 5 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   sex = col_character(),
##   ..   education = col_double(),
##   ..   vocabulary = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# Convert data structure
Vocab$sex <- factor(Vocab$sex)
str(Vocab)

## spc_tbl_ [21,638 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year      : num [1:21638] 2004 2004 2004 2004 2004 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 2 1 2 2 1 ...
##  $ education : num [1:21638] 9 14 14 17 14 14 12 10 11 9 ...
##  $ vocabulary: num [1:21638] 3 6 9 8 1 7 6 6 5 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   sex = col_character(),
##   ..   education = col_double(),
##   ..   vocabulary = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# Plot vocabulary vs. education
ggplot(Vocab, aes(education, vocabulary)) +
  # Add a point layer
  geom_point()

ggplot(Vocab, aes(education, vocabulary)) +
  # Change to a jitter layer
  geom_jitter()

ggplot(Vocab, aes(education, vocabulary)) +
  # Set the transparency to 0.2
  geom_jitter(alpha = 0.2)

ggplot(Vocab, aes(education, vocabulary)) +
  # Set the shape to 1
  geom_jitter(alpha = 0.2, shape = 1)

Notice how jittering and alpha blending serve as a great solution to the overplotting problem here. Setting the shape to 1 didn’t really help, but it was useful in the previous exercises when you had less data. You need to consider each plot individually.

7.3.2 Histograms

A histogram is a special type of bar plot that shows the binned distribution of a continuous variable.

Essential aes: x (continuous variable)
A plot of binned values.
Always set a meaningful bin width for your data.
No spaces between bars.
X axis labels are between bars (represent intervals and not actual values).

7.3.2.1 Drawing histograms

Recall that histograms cut up a continuous variable into discrete bins and, by default, map the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed with the .. notation, i.e. ..density.. (deprecated since ggplot2 3.4.0 in favor of after_stat(density)). Plotting this variable scales the bars so that the area of each one (its height times the bin width) is the relative frequency of that bin.

# Plot mpg
ggplot(mtcars, aes(mpg)) +
  # Add a histogram layer
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.

ggplot(mtcars, aes(mpg)) +
  # Set the binwidth to 1
  geom_histogram(binwidth = 1)

If you want to use density on the y-axis be sure to set your binwidth to an intuitive value.

# Map y to ..density..
ggplot(mtcars, aes(mpg, ..density..)) +
  geom_histogram(binwidth = 1)

## Warning: The dot-dot notation (`..density..`) was
## deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()`
## to see where this warning was generated.

datacamp_light_blue <- "#51A8C9"

ggplot(mtcars, aes(mpg, ..density..)) +
  # Set the fill color to datacamp_light_blue
  geom_histogram(binwidth = 1, fill = datacamp_light_blue)
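The deprecation warning shown earlier recommends after_stat(density) in place of ..density... The sketch below uses that form and then checks, under the assumption that layer_data() returns the per-bin count, density, xmin and xmax columns, how density relates to count and bin width. The names bins, bin_width, manual_density and relative_freq are introduced for this example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(mpg, after_stat(density))) +
  geom_histogram(binwidth = 1)

# Extract the per-bin values computed by the histogram stat
bins <- layer_data(p)
bin_width <- bins$xmax - bins$xmin

# density is the count scaled by (total count * bin width) ...
manual_density <- bins$count / (sum(bins$count) * bin_width)

# ... so bar areas (height * width) are relative frequencies that sum to 1
relative_freq <- bins$density * bin_width
sum(relative_freq)
```

With binwidth = 1 the bar heights and the relative frequencies coincide, which is one reason an intuitive bin width makes a density axis easier to read.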

7.3.2.2 Positions in histograms

geom_histogram(), a special case of geom_bar(), has a position argument.

stack (default): Bars for different groups are stacked on top of each other.

# Update the aesthetics so the fill color is by fam
ggplot(mtcars, aes(mpg, fill = fam)) +
  geom_histogram(binwidth = 1)

dodge: Bars for different groups are placed side by side.

ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to dodge
  geom_histogram(binwidth = 1,
                 position = "dodge")

fill: Bars for different groups are shown as proportions.

ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to fill
  geom_histogram(binwidth = 1, 
                 position = "fill")

## Warning: Removed 16 rows containing missing values
## (`geom_bar()`).

identity: Bars are drawn at their computed positions without any adjustment, so bars for different groups overlap; transparency keeps them visible.

ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to identity, with transparency 0.4
  geom_histogram(binwidth = 1, 
                 position = "identity", 
                 alpha = 0.4)

7.3.3 Bar plots

A categorical X-axis
Two types:
- Absolute counts
- Distributions: dynamite plots (avg & sd)

Note that the function geom_col() is just geom_bar() with the stat argument set to "identity" (the position still defaults to "stack"). It is used when we want the heights of the bars to represent the exact values in the data.
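A small sketch of the difference: geom_bar() counts rows itself, while geom_col() plots heights that are already in the data. The summary table cyl_counts is built here with base R purely for illustration.

```r
library(ggplot2)

# geom_bar() counts the rows in each category itself
p_bar <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar()

# geom_col() expects the bar heights to be in the data already
cyl_counts <- as.data.frame(table(cyl = factor(mtcars$cyl)))
p_col <- ggplot(cyl_counts, aes(cyl, Freq)) +
  geom_col()

p_col
```

Both plots should show the same bars; the only difference is where the counting happens.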

7.3.3.1 Position in bar and col plots

All positions are available.

# Plot fcyl, filled by fam
ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Add a bar layer
  geom_bar()

ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Set the position to "fill"
  geom_bar(position = "fill")

ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Change the position to "dodge"
  geom_bar(position = "dodge")

7.3.3.2 Overlapping bar plots

You can use position_dodge() (and position_jitter()) to specify how much dodging (or jittering) you want.

ggplot(mtcars, aes(cyl, fill = fam)) +
  # Change position to use the functional form, with width 0.2
  geom_bar(position = position_dodge(width = 0.2))

ggplot(mtcars, aes(cyl, fill = fam)) +
  # Set the transparency to 0.6
  geom_bar(position = position_dodge(width = 0.2), 
           alpha = 0.6)

7.3.3.3 Sequential color palette

In this bar plot, we’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color palette.

# Convert data structure
Vocab$vocabulary <- factor(Vocab$vocabulary, ordered = TRUE)

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
    geom_bar()

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill")

Notice the warning message produced by the following plot.

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill") +
  # Add a brewer fill scale with default palette
  scale_fill_brewer()

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors

library(RColorBrewer)

# Definition of a set of blue colors
blues <- brewer.pal(9, "Blues") # from the RColorBrewer package

# Make a color range using colorRampPalette() and the set of blues
blue_range <- colorRampPalette(blues)

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill") +
  # Use a manual fill scale with 11 blues interpolated from the palette
  scale_fill_manual(values = blue_range(11))

7.3.4 Line plots

Well suited to time series.

Possible geoms: line, path
Essential aes: x, y

When we have multiple lines, we have to consider which aesthetic best distinguishes the individual trends. The candidates are:

linetype
size
color: The most salient choice, when available, since it is the easiest way to distinguish between series.

geom_area(): defaults to position "stack", so instead of overlapping, the time series are added together at each point.

geom_area(position = "fill"): shows each series as a proportion of the total at each time point, e.g. the share of the total capture for each fish species. Note that the y-axis label needs to change, since these are no longer counts. This kind of plot is popular for looking at proportional trends over time. Its difficulty is that only the bottom and top groups are drawn on a common scale; all the others are irregular shapes, so they can be hard to decipher.

geom_ribbon(): use this for overlapping area plots. In this case we have to force the y-min to be 0 and set the alpha level so that we can see the overlap. Deciphering all the time series remains a challenge, particularly at the bottom, where many series overlap.
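The three area variants can be sketched on a small made-up dataset. Everything in catch (the years, the species labels A, B and C, and the capture values) is hypothetical and stands in for real time-series data.

```r
library(ggplot2)

# Hypothetical long-format data: three series observed over ten years
catch <- data.frame(
  year    = rep(2001:2010, times = 3),
  species = rep(c("A", "B", "C"), each = 10),
  capture = rep(c(10, 20, 30), each = 10) + rep(0:9, times = 3)
)

# Stacked areas (the default): series are added together at each year
p_stack <- ggplot(catch, aes(year, capture, fill = species)) +
  geom_area()

# Filled areas: each year is rescaled to sum to 1, so relabel the y axis
p_fill <- ggplot(catch, aes(year, capture, fill = species)) +
  geom_area(position = "fill") +
  labs(y = "Proportion of total capture")

# The per-row share that position = "fill" displays
catch$proportion <- catch$capture / ave(catch$capture, catch$year, FUN = sum)

# Overlapping areas: ribbons from 0 up to the value, made translucent
p_ribbon <- ggplot(catch, aes(year, fill = species)) +
  geom_ribbon(aes(ymin = 0, ymax = capture), alpha = 0.3)
```

Printing p_stack, p_fill and p_ribbon in turn shows the trade-off: stacking and filling avoid overlap but distort the middle series, while ribbons keep a common baseline at the cost of overlap.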

7.3.4.1 Basic line plots

The economics dataset contains a time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the United States.

# Print the head of economics
head(economics)

## # A tibble: 6 × 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018

# Using economics, plot unemploy vs. date
ggplot(economics, aes(date, unemploy)) +
  # Make it a line plot
  geom_line()

# Change the y-axis to the proportion of the population that is unemployed
ggplot(economics, aes(x = date, y = unemploy / pop)) +
  geom_line()

7.3.4.2 Multiple time series

fish.species contains the global capture rates of seven salmon species from 1950–2010.

# Read dataset
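# load() restores the saved objects (fish.species, fish.tidy) into the
# workspace, so its return value does not need to be assigned
load("data/fish.RData")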

str(fish.species)

## 'data.frame':    61 obs. of  8 variables:
##  $ Year    : int  1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
##  $ Pink    : int  100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
##  $ Chum    : int  139300 155900 113800 99800 148700 143700 158480 125377 132407 113114 ...
##  $ Sockeye : int  64100 51200 58200 66100 83800 72000 84800 69676 100520 62472 ...
##  $ Coho    : int  30500 40900 33600 32400 38300 45100 40000 39900 39200 32865 ...
##  $ Rainbow : int  0 100 100 100 100 100 100 100 100 100 ...
##  $ Chinook : int  23200 25500 24900 25300 24500 27700 25300 21200 20900 20335 ...
##  $ Atlantic: int  10800 9701 9800 8800 9600 7800 8100 9000 8801 8700 ...

str(fish.tidy)

## 'data.frame':    427 obs. of  3 variables:
##  $ Species: Factor w/ 7 levels "Pink","Chum",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year   : int  1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
##  $ Capture: int  100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
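fish.tidy holds the same information as fish.species in long format: 61 years times 7 species gives the 427 rows shown above. How the file's author produced it is not stated; one plausible way, assuming the tidyr package, is pivot_longer(). The sketch below applies it to a two-year, two-species excerpt typed in from the output above, stored under the illustrative names wide and long.

```r
library(tidyr)

# Excerpt of the wide table: one row per year, one column per species
wide <- data.frame(
  Year = c(1950L, 1951L),
  Pink = c(100600L, 259000L),
  Chum = c(139300L, 155900L)
)

# Long format: one row per year-species combination
long <- pivot_longer(
  wide,
  cols = -Year,
  names_to = "Species",
  values_to = "Capture"
)

long
```

The long format is what allows a single column (Species) to be mapped to group or color.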

# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
  geom_line()

# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
  geom_line()

# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
  geom_line(aes(group = Species))

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line()

7.4 Themes

All non-data ink.
Visual elements not part of the data.

7.4.1 Themes from scratch

7.4.1.1 Moving the legend

To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.

p + theme(legend.position = new_value)

Here, the new value can be

"top", "bottom", "left", or "right": place it at that side of the plot.
"none": don’t draw it.
c(x, y): c(0, 0) means the bottom-left and c(1, 1) means the top-right.

plt_prop_unemployed_over_time <- 
    ggplot(economics, aes(date, unemploy/pop )) + 
        geom_line(aes(color = pce)) + 
        theme(legend.position = "right") +
        ggtitle("unemployed proportion over time")
 

# View the default plot
plt_prop_unemployed_over_time

# Remove legend entirely
plt_prop_unemployed_over_time +
  theme(legend.position = "none")

# Position the legend at the bottom of the plot
plt_prop_unemployed_over_time +
  theme(legend.position = "bottom")

# Position the legend inside the plot at (0.6, 0.1)
plt_prop_unemployed_over_time +
  theme(legend.position = c(0.6, 0.1))

7.4.1.2 Modifying theme elements

Many plot elements have multiple properties that can be set. For example, line elements in the plot such as axes and gridlines have a color, a thickness (size), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line(). Similarly, element_rect() changes rectangles and element_text() changes text. You can remove a plot element using element_blank().

plt_prop_unemployed_over_time +
  theme(
    # For all rectangles, set the fill color to grey92
    rect = element_rect(fill = "grey92"),
    # For the legend key, turn off the outline
    legend.key = element_rect(color = NA)
  )

plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    # Turn off axis ticks
    axis.ticks = element_blank(),
    # Turn off the panel grid
    panel.grid = element_blank()
  )

plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    # Add major y-axis panel grid lines back
    panel.grid.major.y = element_line(
      # Set the color to red
      color = "red",
      # Set the size to 0.5
      size = 0.5,
      # Set the line type to dotted
      linetype = "dotted"
      )
    )

## Warning: The `size` argument of `element_line()` is
## deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument
##   instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()`
## to see where this warning was generated.
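As the warning suggests, element_line() takes linewidth instead of size from ggplot2 3.4.0 onward. A sketch of the same grid-line styling with the newer argument, assuming ggplot2 3.4.0 or later and rebuilding a simplified version of the plot under the illustrative names p and p_dotted_grid:

```r
library(ggplot2)

p <- ggplot(economics, aes(date, unemploy / pop)) +
  geom_line()

p_dotted_grid <- p +
  theme(
    panel.grid = element_blank(),
    # linewidth replaces the deprecated size argument for line elements
    panel.grid.major.y = element_line(
      color = "red",
      linewidth = 0.5,
      linetype = "dotted"
    )
  )

p_dotted_grid
```

On older ggplot2 versions the linewidth argument is not available, so the size form used in these notes is the one that works there.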

plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(
      color = "red",
      size = 0.5,
      linetype = "dotted"
    ),
    # Set the axis text color to grey25
    axis.text = element_text(color = "grey25"),
    # Set the plot title font face to italic and font size to 16
    plot.title = element_text(size = 16, face = "italic")
  )

7.4.1.3 Modifying whitespace

Whitespace means all the non-visible margins and spacing in the plot.

To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.

Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit).

The default unit is "pt" (points), which scales well with text. Other options include "cm", "in" (inches) and "lines" (of text).

plt_mpg_vs_wt_by_cyl <- 
    ggplot(mtcars, aes(mpg, wt, color = fcyl)) + 
        geom_point() + 
        theme(panel.border = element_rect(color = "blue",
                                          fill = NA, 
                                          linewidth = 0.6,
                                          linetype = "dotted"),
              legend.box.background = element_rect(color = "blue",
                                                   fill = NA,
                                                   linewidth = 0.6,
                                                   linetype = "dotted"),
              legend.position = "right")

plt_mpg_vs_wt_by_cyl

plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the axis tick length to 2 lines
    axis.ticks.length = unit(2, "lines")
  )

plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the legend key size to 3 centimeters
    legend.key.size = unit(3, "cm")
  )

plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the legend margin to (20, 30, 40, 50) points
    legend.margin = ggplot2::margin(20, 30, 40, 50, "pt")
  )

plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the plot margin to (10, 30, 50, 70) millimeters
    plot.margin = ggplot2::margin(10, 30, 50, 70, "mm")
  )

7.4.2 Theme flexibility

7.4.2.1 Built-in themes

In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.

theme_gray() is the default.
theme_bw() is useful when you use transparency.
theme_classic() is more traditional.
theme_void() removes everything but the data.

# Add a black and white theme
plt_prop_unemployed_over_time +
  theme_bw()

# Add a classic theme
plt_prop_unemployed_over_time +
  theme_classic()

# Add a void theme
plt_prop_unemployed_over_time +
  theme_void()

7.4.2.2 Exploring ggthemes

Outside of ggplot2, another source of built-in themes is the ggthemes package.

library(ggthemes)

# Use the fivethirtyeight theme
plt_prop_unemployed_over_time +
  theme_fivethirtyeight()

# Use Tufte's theme
plt_prop_unemployed_over_time +
  theme_tufte()

# Use the Wall Street Journal theme
plt_prop_unemployed_over_time +
  theme_wsj()

7.4.2.3 Setting themes

Reusing a theme across many plots helps to provide a consistent style. You have several options for this.

Assign the theme to a variable, and add it to each plot.
Set your theme as the default using theme_set().

A good strategy that you’ll use here is to begin with a built-in theme then modify it.

# Save the theme as theme_recession
theme_recession <- theme(
  rect = element_rect(fill = "grey92"),
  legend.key = element_rect(color = NA),
  axis.ticks = element_blank(),
  panel.grid = element_blank(),
  panel.grid.major.y = element_line(color = "white", linewidth = 0.5, linetype = "dotted"),
  axis.text = element_text(color = "grey25"),
  plot.title = element_text(face = "italic", size = 16),
  legend.position = c(0.6, 0.1)
)

# Combine the Tufte theme with theme_recession
theme_tufte_recession <- theme_tufte() + theme_recession

# Add the Tufte recession theme to the plot
plt_prop_unemployed_over_time + theme_tufte_recession
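
The order of the addition matters when combining themes. As a rough sketch (using theme_bw() so that it runs without ggthemes; the names tweak, tweak_kept, and tweak_lost are illustrative), adding a complete built-in theme after a custom theme() call replaces the earlier settings:

```r
library(ggplot2)

# A one-element custom theme, standing in for theme_recession
tweak <- theme(legend.position = "none")

# Built-in theme first, tweak second: the tweak survives
tweak_kept <- theme_bw() + tweak

# Tweak first, built-in theme second: the complete theme overrides the tweak
tweak_lost <- tweak + theme_bw()

# Inspect the resulting legend position in each combined theme
tweak_kept$legend.position
tweak_lost$legend.position
```

This is why the built-in theme is placed first and the custom modifications second.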

Setting the theme as the default with theme_set() produces the same output.

# Set theme_tufte_recession as the default theme
theme_set(theme_tufte_recession)

# Draw the plot (without explicitly adding a theme)
plt_prop_unemployed_over_time
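
Because theme_set() changes the default for every later plot in the session, it can help to keep the previous default so that it can be restored. A minimal sketch, relying on the fact that theme_set() invisibly returns the theme that was active before the call (previous_theme and plt_demo are illustrative names):

```r
library(ggplot2)

# theme_set() invisibly returns the previously active theme
previous_theme <- theme_set(theme_bw())

# Plots drawn at this point use theme_bw() by default
plt_demo <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Restore the earlier default when done
theme_set(previous_theme)
```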

7.4.2.4 Publication-quality plots

plt_prop_unemployed_over_time +
  # Add Tufte's theme
  theme_tufte()

plt_prop_unemployed_over_time +
  theme_tufte() +
  # Add individual theme elements
  theme(
    # Turn off the legend
    legend.position = "none",
    # Turn off the axis ticks
    axis.ticks = element_blank()
    )

plt_prop_unemployed_over_time +
  theme_tufte() +
  # Add individual theme elements
  theme(
    legend.position = "none",
    axis.ticks = element_blank(),
    # Set the axis title's text color to grey60
    axis.title = element_text(color = "grey60"),
    # Set the axis text's text color to grey60
    axis.text = element_text(color = "grey60")
  )

plt_prop_unemployed_over_time +
  theme_tufte() +
  # Add individual theme elements
  theme(
    legend.position = "none",
    axis.ticks = element_blank(),
    axis.title = element_text(color = "grey60"),
    axis.text = element_text(color = "grey60"),
    # Set the panel gridlines major y values
    panel.grid.major.y = element_line(
      # Set the color to grey60
      color = "grey60",
      # Set the line width to 0.25
      linewidth = 0.25,
      # Set the linetype to dotted
      linetype = "dotted"
    )
  )

7.4.3 Effective explanatory plots

7.4.3.1 Using geoms for explanatory plots

# read_csv() comes from the readr package
library(readr)

gm2007 <- read_csv("data/gm2007.csv")

# Add a geom_segment() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), linewidth = 2)

# Add a geom_text() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), linewidth = 2) +
  geom_text(aes(label = lifeExp), color = "white", size = 1.5)

# Set the color scale (brewer.pal() comes from the RColorBrewer package)
library(RColorBrewer)
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]

# Modify the scales
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), linewidth = 2) +
  geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
  scale_x_continuous("", expand = c(0, 0), limits = c(30, 90), position = "top") +
  scale_color_gradientn(colors = palette)
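
The palette line above uses negative indexing, which may be easier to follow when spelled out. In this sketch (full_palette and end_colors are illustrative names), a five-color RdYlBu palette is created and positions 2 through 4 are dropped, leaving only the first and last colors as the two ends of the gradient:

```r
library(RColorBrewer)

# Five colors running from red through yellow to blue
full_palette <- brewer.pal(5, "RdYlBu")

# Negative indices remove elements: drop positions 2, 3, and 4
end_colors <- full_palette[-(2:4)]

# Only the two extreme colors remain
length(end_colors)
```

scale_color_gradientn() then interpolates between whatever colors it is given.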

# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]

# Add a title and caption
plt_country_vs_lifeExp <- ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), linewidth = 2) +
  geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
  scale_x_continuous("", 
                     expand = c(0,0), 
                     limits = c(30,90), 
                     position = "top") +
  scale_color_gradientn(colors = palette) +
  labs(title = "Highest and lowest life expectancies, 2007",
       caption = "Source: gapminder")

plt_country_vs_lifeExp

7.4.3.2 Using annotate() for embellishments

In this exercise, you’ll use annotate() to add text and a curve to the plot.

# Define the theme
plt_country_vs_lifeExp +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none")

gm2007_full <- read_csv("data/gm2007_full.csv")

global_mean <- mean(gm2007_full$lifeExp)
x_start <- global_mean + 4
y_start <- 5.5
x_end <- global_mean
y_end <- 7.5
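
The y coordinates above are numbers even though country is categorical. ggplot2 places the levels of a discrete axis at integer positions 1, 2, 3, and so on, so a value such as 5.5 falls midway between the fifth and sixth categories. A toy sketch of the same idea, using a made-up data frame (toy and plt_toy are not part of the original data):

```r
library(ggplot2)

# Made-up data with three categories on the y-axis
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 5, 4))

plt_toy <- ggplot(toy, aes(value, group)) +
  geom_point() +
  # y = 1.5 sits halfway between "a" (position 1) and "b" (position 2)
  annotate("text", x = 4, y = 1.5, label = "between a and b")
```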


plt_country_vs_lifeExp +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none") +
  # Add a vertical line
  geom_vline(xintercept = global_mean, 
             color = "grey40", 
             linetype = 3)

plt_country_vs_lifeExp +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none") +
  geom_vline(xintercept = global_mean, 
             color = "grey40", 
             linetype = 3) +
  # Add text
  annotate(
    "text",
    x = x_start, y = y_start,
    label = "The\nglobal\naverage",
    vjust = 1, size = 3, color = "grey40"
  )

plt_country_vs_lifeExp +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none") +
  geom_vline(xintercept = global_mean, 
             color = "grey40", 
             linetype = 3) +
  annotate(
    "text",
    x = x_start, y = y_start,
    label = "The\nglobal\naverage",
    vjust = 1, size = 3, color = "grey40"
  ) +
  # Add a curve
  annotate(
    "curve",
    x = x_start, y = y_start,
    xend = x_end, yend = y_end,
    arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
    color = "grey40"
  )