Chapter 7 Introduction to Data Visualization with ggplot2
7.1 Introduction
7.1.1 Data column types affect plot types
The mtcars dataset contains information on 32 cars from the 1974 Motor Trend US magazine.
library(ggplot2)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
After changing the variable's data type to a factor, the x-axis no longer shows values like 5 or 7; it shows only the values that are present in the dataset.
# Original command
ggplot(mtcars, aes(cyl, mpg)) +
geom_point()
# Change the command below so that cyl is treated as factor
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_point()
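As a quick check of why the axis changes, the sketch below compares the numeric column with its factor version. The object names (cyl_levels, p_numeric, p_factor) are only illustrative and are not part of the original exercise. A factor carries just the levels observed in the data, so ggplot2 draws a discrete axis with exactly those values.

```r
library(ggplot2)

# cyl is stored as a number, so ggplot2 uses a continuous x axis
class(mtcars$cyl)

# Converting to a factor keeps only the values that occur in the data
cyl_levels <- levels(factor(mtcars$cyl))
cyl_levels

# Same data, two different kinds of x scale
p_numeric <- ggplot(mtcars, aes(cyl, mpg)) + geom_point()
p_factor  <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_point()
```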
7.1.2 The grammar of graphics
The seven grammatical elements
Element | Description | Role |
---|---|---|
Data | The dataset being plotted. | Essential grammatical element |
Aesthetics | The scales onto which we map our data. | Essential grammatical element |
Geometries | The visual elements used for our data. | Essential grammatical element |
Themes | All non-data ink. | |
Statistics | Representations of our data to aid understanding. | |
Coordinates | The space on which the data will be plotted. | |
Facets | Plotting small multiples. | |
Jargon for each element
7.1.3 ggplot2 layers
7.1.3.1 Adding geometries
geom_smooth() adds a smooth trend curve.
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and
## formula = 'y ~ s(x, bs = "cs")'
7.1.3.2 Changing one geom or every geom
geom_point() has an alpha argument that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible).
# Make the points 40% opaque
ggplot(diamonds, aes(carat, price, color = clarity)) +
geom_point(alpha = 0.4) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and
## formula = 'y ~ s(x, bs = "cs")'
7.1.3.3 Saving plots as variables
Plots can be saved as variables, which can be added to later on using the + operator. This is really useful if you want to make multiple related plots from a common base.
# Draw a ggplot
plt_price_vs_carat <- ggplot(
  # Use the diamonds dataset
  diamonds,
  # For the aesthetics, map x to carat and y to price
  aes(carat, price)
)
# Add a point layer to plt_price_vs_carat
plt_price_vs_carat + geom_point()
# Make the points 20% opaque and save the result as plt_price_vs_carat_transparent
plt_price_vs_carat_transparent <- plt_price_vs_carat + geom_point(alpha = 0.2)
# See the plot
plt_price_vs_carat_transparent
# Map color to clarity and
# assign the updated plot to a new object
plt_price_vs_carat_by_clarity <- plt_price_vs_carat +
  geom_point(aes(color = clarity))
# See the plot
plt_price_vs_carat_by_clarity
7.2 Aesthetics
7.2.1 Visible aesthetics
Typical visible aesthetics
Aesthetic | Description |
---|---|
x | X axis position |
y | Y axis position |
fill | Fill color (the interior color of points) |
color | Color of points, outlines of other geoms (the outline color of points) |
size | Area or radius of points, thickness of lines |
alpha | Transparency (how see-through the points are) |
linetype | Line dash pattern |
label | Text on a plot or axes |
shape | Shape of points |
label and shape are only applicable to categorical data.
7.2.1.1 Aesthetics: color, shape and size
# Convert cyl and am into the factors fcyl and fam
mtcars$fcyl <- factor(mtcars$cyl)
mtcars$fam <- factor(mtcars$am)
str(mtcars)
## 'data.frame': 32 obs. of 13 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## $ fcyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ fam : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
# Map x to wt, y to mpg and color to fcyl
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Set the shape and size of the points
geom_point(shape = 1, size = 4)
7.2.1.2 Aesthetics: color vs. fill
Typically, the color aesthetic changes the outline of a geom and the fill aesthetic changes the inside. geom_point() is an exception: you use color (not fill) for the point color. However, some shapes have special behavior.
The default geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allows you to use both fill for the inside and color for the outline. This lets you map two aesthetics to each point.
All shape values are described on the ?points help page.
# Map fcyl to fill
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
geom_point(shape = 1, size = 4)
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
# Change point shape; set alpha
geom_point(shape = 21, size = 4, alpha = 0.6)
# Map color to fam
ggplot(mtcars, aes(wt, mpg, fill = fcyl, color = fam)) +
geom_point(shape = 21, size = 4, alpha = 0.6)
Notice that mapping a categorical variable onto fill doesn’t change the colors when the point shape has no interior to fill (such as shape = 1 above, or the default shape), although a legend is generated! This is because such shapes only have a color attribute and not a fill attribute. Use fill when you have another shape (such as a bar), or when using a point that does have a fill and a color attribute, such as shape = 21, which is a circle with an outline. Any time you use a solid color, make sure to use alpha blending to account for overplotting.
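One way to confirm that a fill mapping is actually used is to inspect what ggplot2 computed for the layer. The sketch below is only an illustration: layer_data() is a ggplot2 helper, while the names p_fill and fill_colors are made up here.

```r
library(ggplot2)

mtcars$fcyl <- factor(mtcars$cyl)

# shape = 21 has both an outline (color) and an interior (fill)
p_fill <- ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  geom_point(shape = 21, size = 4, alpha = 0.6)

# layer_data() returns the values ggplot2 computed for the layer;
# with three cylinder levels there should be three distinct fill colors
fill_colors <- unique(layer_data(p_fill)$fill)
fill_colors
```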
7.2.1.3 Comparing aesthetics
# Establish the base layer
plt_mpg_vs_wt <- ggplot(mtcars, aes(x = wt, y = mpg))
# Map fcyl to size
plt_mpg_vs_wt +
  geom_point(aes(size = fcyl))
## Warning: Using size for a discrete variable is not
## advised.
# Map fcyl to alpha, not size
plt_mpg_vs_wt +
  geom_point(aes(alpha = fcyl))
## Warning: Using alpha for a discrete variable is not
## advised.
# Map fcyl to shape, not alpha
plt_mpg_vs_wt +
  geom_point(aes(shape = fcyl))
# Use text layer and map fcyl to label
plt_mpg_vs_wt +
  geom_text(aes(label = fcyl))
7.2.2 Using attributes
Set attributes in geom_*(). Attributes are always called in the geom layer!
7.2.2.1 Attributes: color, shape, size and alpha
This time you’ll use these arguments to set attributes of the plot, not map variables onto aesthetics.
You can specify colors in R using hex codes: a hash followed by two hexadecimal digits each for red, green, and blue ("#RRGGBB"). Hexadecimal is base-16 counting. You have 0 to 9, and A representing 10 up to F representing 15. Pairs of hexadecimal digits give you a range from 0 to 255. "#000000" is “black” (no color), "#FFFFFF" means “white”, and "#00FFFF" is cyan (mixed green and blue).
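The arithmetic behind hex codes can be checked directly in base R. This is a small sketch using strtoi(), col2rgb() and rgb(); the variable names channels and cyan are just for illustration.

```r
# Each pair of hexadecimal digits is one channel in the range 0 to 255
strtoi("FF", base = 16L)   # largest possible channel value
strtoi("4A", base = 16L)   # red channel of "#4ABEFF"

# col2rgb() splits a hex code into its red, green and blue channels
channels <- col2rgb("#4ABEFF")
channels

# rgb() goes the other way: no red, full green and full blue give cyan
cyan <- rgb(0, 255, 255, maxColorValue = 255)
cyan
```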
# A hexadecimal color
my_blue <- "#4ABEFF"
ggplot(mtcars, aes(wt, mpg)) +
# Set the point color and alpha
geom_point(color = my_blue, alpha = 0.6)
# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
# Set point size and shape
geom_point(color = my_blue,
size = 10,
shape = 1)
7.2.2.2 Attributes: conflicts with aesthetics
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add point layer with alpha 0.5
geom_point(alpha = 0.5)
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add text layer with label rownames of the dataset mtcars and color red
geom_text(label = rownames(mtcars),
color = "red")
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add points layer with shape 24 and color yellow
geom_point(shape = 24,
color = "yellow")
Notice that adding more aesthetic mappings to your plot is not always a good idea! You may just increase complexity and decrease readability.
# 3 aesthetics: qsec vs. mpg, colored by fcyl
ggplot(mtcars, aes(x = mpg, y = qsec, color = fcyl)) +
geom_point()
# 4 aesthetics: add a mapping of shape to fam
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam)) +
geom_point()
# 5 aesthetics: add a mapping of size to hp / wt
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam, size = hp/wt)) +
geom_point()
Between the x and y dimensions, the color, shape, and size of the points, your plot displays five dimensions of the dataset.
7.2.3 Modifying aesthetics
7.2.3.1 Adjustment for overlapping
Positions
position = "*" or position_*()
- identity: the default. Don’t adjust data positions.
- dodge: preserves the vertical position of a geom while adjusting the horizontal position. position_dodge(width = NULL, preserve = c("total", "single")).
- stack: stacks bars on top of each other. This is the default of geom_bar and geom_area.
- fill: stacks bars and standardizes each stack to have constant height.
- jitter: adds some random noise on both the x and y axes. position_jitter(width = NULL, height = NULL, seed = num)
- jitterdodge
- nudge
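The difference between the string form and the function form of a position can be sketched as follows. The plot names and the seed value 42 are arbitrary choices for this illustration; the point is that the function form exposes arguments such as width, height and seed.

```r
library(ggplot2)

# String form: uses the default jitter amounts in both directions
p_default <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point(position = "jitter")

# Function form: jitter horizontally only, with a fixed seed
p_seeded <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42))

# With a seed, two builds of the same plot give the same x positions
x_first  <- layer_data(p_seeded)$x
x_second <- layer_data(p_seeded)$x
```

Fixing the seed is mainly useful when a jittered plot has to look the same every time a report is rebuilt.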
7.2.3.2 Scale functions
scale_*_*()
Appropriately enough, we can access all the scales with the scale underscore functions. The second part of the function defines which scale we want to modify. The third part must match the type of data we are using.
- scale_x_*(), e.g. scale_x_continuous()
- scale_y_*()
- scale_color_*(), e.g. scale_color_discrete()
  - Also scale_colour_*()
- scale_fill_*()
- scale_shape_*()
- scale_linetype_*()
- scale_size_*()
There are many arguments for the scale functions. The most common are limits, breaks, expand and labels.
- limits: describe the scale’s range.
- breaks: control the tick mark positions.
- expand: a numeric vector of length two, giving a multiplicative and additive constant used to expand the range of the scales so that there is a small gap between the data and the axes.
- labels: adjust the category names.
- labs(): a separate function (not a scale argument) that changes the axis labels.
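The arguments listed above can be combined in one plot, as in this minimal sketch. The axis titles, limits and breaks are illustrative choices, and x_scale and p_scaled are made-up names; only the ggplot2 functions themselves come from the package.

```r
library(ggplot2)

# A continuous x scale with explicit range, tick positions and no padding
x_scale <- scale_x_continuous(
  "Weight (1000 lbs)",
  limits = c(1, 6),
  breaks = 1:6,
  expand = c(0, 0)
)

p_scaled <- ggplot(mtcars, aes(wt, mpg, color = factor(am))) +
  geom_point() +
  x_scale +
  # labels renames the categories shown in the legend (0 = automatic, 1 = manual)
  scale_color_discrete("Transmission", labels = c("automatic", "manual")) +
  # labs() sets axis titles without touching the scale itself
  labs(y = "Miles per gallon")

p_scaled
```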
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
# Set the axis labels
labs(x = "Number of Cylinders",
y = "Count")
scale_fill_manual() defines properties of the color scale (i.e. axis). The first argument sets the legend title. values is a named vector of colors to use.
levels(mtcars$fam) <- c("automatic", "manual")
str(mtcars$fam)
## Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
palette <- c(automatic = "#377EB8", manual = "#E41A1C")
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
labs(x = "Number of Cylinders", y = "Count") +
# Set the fill color scale
scale_fill_manual("Transmission", values = palette)
# Set the position
ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar(position = "dodge") +
  labs(x = "Number of Cylinders", y = "Count") +
  scale_fill_manual("Transmission", values = palette)
7.2.3.3 Setting a dummy aesthetic
You can make univariate plots in ggplot2, but you will need to add a fake y axis by mapping y to zero.
When setting y-axis limits, you can specify the limits as separate arguments, or as a single numeric vector. That is, ylim(low, high) or ylim(c(low, high)).
# Plot 0 vs. mpg
ggplot(mtcars, aes(x = mpg, y = 0)) +
# Add jitter
geom_point(position = "jitter") +
# Set the y-axis limits
ylim(-2, 2)
7.3 Geometries
geom_*
Each geom is associated with specific aesthetic mappings, some of which are essential and some of which are optional (e.g., alpha, color, fill, shape, size, stroke).
Common plot types
7.3.1 Scatter plots
Possible geoms: points, jitter, abline, smooth, count
Essential aes: x, y
7.3.1.1 Overplotting 1: large datasets
Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque, hollow shapes.
Small points are suitable for large datasets with regions of high density (lots of overlapping).
# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))
# Add a point layer with tiny points
plt_price_vs_carat_by_clarity +
  geom_point(alpha = 0.5, shape = ".")
# Set transparency to 0.5
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = 16)
7.3.1.2 Overplotting 2: Aligned values
This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.
# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))
# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()
# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam +
  geom_point(position = position_jitter(width = 0.3))
# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam +
  geom_point(position = position_jitterdodge(jitter.width = 0.3,
                                             dodge.width = 0.3))
7.3.1.3 Overplotting 3: Low-precision data
This results from low-resolution measurements, as in the iris dataset, which is measured to 1 mm precision. It’s similar to case 2, but in this case we can jitter on both the x and y axes.
Notice that jitter can be a geom itself (i.e. geom_jitter()), an argument in geom_point() (i.e. position = "jitter"), or a position function (i.e. position_jitter()).
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Swap for jitter layer with width 0.1
geom_jitter(width = 0.1, alpha = 0.5)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Set the position to jitter
geom_point(alpha = 0.5,
position = "jitter")
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Use a jitter position function with width 0.1
geom_point(alpha = 0.5,
position = position_jitter(width = 0.1))
7.3.1.4 Overplotting 4: Integer data
This can be type integer (i.e. 1, 2, 3…) or categorical (i.e. class factor) variables. factor is just a special class of type integer.
You’ll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don’t realize that integer and factor data are the same as low precision data.
library(tidyverse)
Vocab <- read_csv("data/Vocab.csv")
# Examine the structure of Vocab
str(Vocab)
## spc_tbl_ [21,638 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:21638] 2004 2004 2004 2004 2004 ...
## $ sex : chr [1:21638] "Female" "Female" "Male" "Female" ...
## $ education : num [1:21638] 9 14 14 17 14 14 12 10 11 9 ...
## $ vocabulary: num [1:21638] 3 6 9 8 1 7 6 6 5 1 ...
## - attr(*, "spec")=
## .. cols(
## .. year = col_double(),
## .. sex = col_character(),
## .. education = col_double(),
## .. vocabulary = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Convert data structure
Vocab$sex <- factor(Vocab$sex)
str(Vocab)
## spc_tbl_ [21,638 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:21638] 2004 2004 2004 2004 2004 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 2 1 2 2 1 ...
## $ education : num [1:21638] 9 14 14 17 14 14 12 10 11 9 ...
## $ vocabulary: num [1:21638] 3 6 9 8 1 7 6 6 5 1 ...
## - attr(*, "spec")=
## .. cols(
## .. year = col_double(),
## .. sex = col_character(),
## .. education = col_double(),
## .. vocabulary = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Plot vocabulary vs. education
ggplot(Vocab, aes(education, vocabulary)) +
# Add a point layer
geom_point()
ggplot(Vocab, aes(education, vocabulary)) +
# Change to a jitter layer
geom_jitter()
ggplot(Vocab, aes(education, vocabulary)) +
# Set the transparency to 0.2
geom_jitter(alpha = 0.2)
ggplot(Vocab, aes(education, vocabulary)) +
# Set the shape to 1
geom_jitter(alpha = 0.2, shape = 1)
Notice how jittering and alpha blending serve as a great solution to the overplotting problem here. Setting the shape to 1 didn’t really help, but it was useful in the previous exercises when you had less data. You need to consider each plot individually.
7.3.2 Histograms
A histogram is a special type of bar plot that shows the binned distribution of a continuous variable.
- Essential aes: x (continuous variable)
- A plot of binned values.
- Always set a meaningful bin width for your data.
- No spaces between bars.
- X axis labels are between bars (represent intervals and not actual values).
7.3.2.1 Drawing histograms
Recall that histograms cut up a continuous variable into discrete bins and, by default, map the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed with after_stat(density); the older ..density.. notation does the same thing but is deprecated as of ggplot2 3.4.0. Plotting this variable makes the area of each bar (its height times the bin width) equal to the relative frequency of that bin.
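The relationship between density and relative frequency can be verified numerically. The sketch below assumes a bin width of 1 and uses layer_data() to read the computed density values; the names bin_width, p_density, bins and rel_freq are illustrative.

```r
library(ggplot2)

bin_width <- 1

p_density <- ggplot(mtcars, aes(mpg, after_stat(density))) +
  geom_histogram(binwidth = bin_width)

# One row per bin, including the computed count and density
bins <- layer_data(p_density)

# Relative frequency of each bin = density (bar height) * bin width
rel_freq <- bins$density * bin_width

# The relative frequencies of all bins add up to 1
sum(rel_freq)
```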
# Plot mpg
ggplot(mtcars, aes(mpg)) +
# Add a histogram layer
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.
ggplot(mtcars, aes(mpg)) +
# Set the binwidth to 1
geom_histogram(binwidth = 1)
If you want to use density on the y-axis be sure to set your binwidth to an intuitive value.
# Map y to density (after_stat(density) replaces the deprecated ..density..)
ggplot(mtcars, aes(mpg, after_stat(density))) +
  geom_histogram(binwidth = 1)
datacamp_light_blue <- "#51A8C9"
ggplot(mtcars, aes(mpg, after_stat(density))) +
  # Set the fill color to datacamp_light_blue
  geom_histogram(binwidth = 1, fill = datacamp_light_blue)
7.3.2.2 Positions in histograms
geom_histogram(), a special case of geom_bar(), has a position argument.
stack (default): Bars for different groups are stacked on top of each other.
# Update the aesthetics so the fill color is by fam
ggplot(mtcars, aes(mpg, fill = fam)) +
geom_histogram(binwidth = 1)
dodge: Bars for different groups are placed side by side.
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to dodge
geom_histogram(binwidth = 1,
position = "dodge")
fill: Bars for different groups are shown as proportions.
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to fill
geom_histogram(binwidth = 1,
position = "fill")
## Warning: Removed 16 rows containing missing values
## (`geom_bar()`).
identity: Plot the values as they appear in the dataset.
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to identity, with transparency 0.4
geom_histogram(binwidth = 1,
position = "identity",
alpha = 0.4)
7.3.3 Bar plots
- A categorical X-axis
- Two types:
  - Absolute counts
  - Distributions: dynamite plots (avg & sd)
Note that the function geom_col() is just geom_bar() with the stat argument set to "identity" (its position still defaults to "stack"). It is used when we want the heights of the bars to represent the exact values in the data.
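To make the relationship concrete, the sketch below draws the same bars twice: once letting geom_bar() count rows, and once giving geom_col() a table of counts computed beforehand. The helper object cyl_counts and the plot names are made up for this illustration.

```r
library(ggplot2)

# geom_bar() counts the rows in each category for you
p_bar <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar()

# geom_col() expects the bar heights to be in the data already
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))
p_col <- ggplot(cyl_counts, aes(cyl, Freq)) +
  geom_col()

# Both layers end up with the same bar heights
heights_bar <- layer_data(p_bar)$y
heights_col <- layer_data(p_col)$y
```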
7.3.3.1 Position in bar and col plots
All positions are available.
# Plot fcyl, filled by fam
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Add a bar layer
geom_bar()
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Set the position to "fill"
geom_bar(position = "fill")
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Change the position to "dodge"
geom_bar(position = "dodge")
7.3.3.2 Overlapping bar plots
You can use position_dodge() (and position_jitter()) to specify how much dodging (or jittering) you want.
ggplot(mtcars, aes(cyl, fill = fam)) +
# Change position to use the functional form, with width 0.2
geom_bar(position = position_dodge(width = 0.2))
ggplot(mtcars, aes(cyl, fill = fam)) +
# Set the transparency to 0.6
geom_bar(position = position_dodge(width = 0.2),
alpha = 0.6)
7.3.3.3 Sequential color palette
In this bar plot, we’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color palette.
# Convert data structure
Vocab$vocabulary <- factor(Vocab$vocabulary, ordered = TRUE)
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
geom_bar()
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill")
Notice the warning message.
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill") +
# Add a brewer fill scale with default palette
scale_fill_brewer()
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors
library(RColorBrewer)
# Definition of a set of blue colors
blues <- brewer.pal(9, "Blues") # from the RColorBrewer package
# 1 - Make a color range using colorRampPalette() and the set of blues
blue_range <- colorRampPalette(blues)
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill") +
  # Use a manual fill scale with 11 colors interpolated by blue_range
  scale_fill_manual(values = blue_range(11))
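The key detail is that colorRampPalette() returns a function rather than a vector of colors. The sketch below uses two anchor colors chosen arbitrarily for illustration (they are an assumption, not the Blues palette itself) and needs only base R.

```r
# Two anchor colors (a hypothetical choice for this illustration)
anchors <- c("#FFFFFF", "#08306B")

# colorRampPalette() returns a function, not a vector of colors
ramp <- colorRampPalette(anchors)

# Ask that function for as many colors as there are categories
eleven_colors <- ramp(11)
eleven_colors
```

Because the palette is interpolated, it can supply more colors than the nine that the ColorBrewer palette provides, which is why the warning above goes away.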
7.3.4 Line plots
Well suited to time series.
Possible geoms: line, path
Essential aes: x, y
When we have multiple lines, we have to consider which aesthetic best lets us distinguish the individual trends:
- linetype
- size
- color: the most salient choice, when available, since it is the easiest way to distinguish between series.
Related geoms for multiple series:
- geom_area(): defaults to position "stack", so instead of overlapping, the time series are added together at each point.
- geom_area(position = "fill"): shows each series as a proportion of the total at each time point (for example, each fish species’ share of the total capture). Note that the y axis label needs to change, since these are no longer counts! This kind of plot is popular for looking at proportional trends over time. A difficulty is that only the bottom and top groups are drawn on a common scale; all the others are irregular shapes, so they can be hard to decipher.
- geom_ribbon(): use this for overlapping area plots. In this case we have to force ymin to be 0 and set the alpha level so that we can see the overlap. Deciphering all the time series is still a challenge, particularly at the bottom, where many series overlap.
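The three variants can be sketched with economics_long, a long-format dataset that ships with ggplot2. Using its rescaled value01 column here is an assumption made purely to show the mechanics; stacking these particular series is not meaningful in itself, and the plot names are illustrative.

```r
library(ggplot2)

# economics_long ships with ggplot2; value01 is each series rescaled to [0, 1]
p_base <- ggplot(economics_long, aes(date, value01, fill = variable))

# Stacked areas: series are added on top of each other
p_stack <- p_base + geom_area()

# Proportions: each date's stack is rescaled to a total of 1
p_prop <- p_base + geom_area(position = "fill")

# Overlapping areas: ribbons from 0 up to the value, made see-through
p_ribbon <- p_base +
  geom_ribbon(aes(ymin = 0, ymax = value01), alpha = 0.3)
```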
7.3.4.1 Basic line plots
The economics dataset contains a time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the United States.
# Print the head of economics
head(economics)
## # A tibble: 6 × 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018
# Using economics, plot unemploy vs. date
ggplot(economics, aes(date, unemploy)) +
# Make it a line plot
geom_line()
# Change the y-axis to the proportion of the population that is unemployed
ggplot(economics, aes(x = date, y = unemploy / pop)) +
geom_line()
7.3.4.2 Multiple time series
fish.species contains the global capture rates of seven salmon species from 1950–2010.
# Read dataset (load() returns the names of the objects it loaded)
fish <- load("data/fish.RData")
str(fish.species)
## 'data.frame': 61 obs. of 8 variables:
## $ Year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
## $ Pink : int 100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
## $ Chum : int 139300 155900 113800 99800 148700 143700 158480 125377 132407 113114 ...
## $ Sockeye : int 64100 51200 58200 66100 83800 72000 84800 69676 100520 62472 ...
## $ Coho : int 30500 40900 33600 32400 38300 45100 40000 39900 39200 32865 ...
## $ Rainbow : int 0 100 100 100 100 100 100 100 100 100 ...
## $ Chinook : int 23200 25500 24900 25300 24500 27700 25300 21200 20900 20335 ...
## $ Atlantic: int 10800 9701 9800 8800 9600 7800 8100 9000 8801 8700 ...
str(fish.tidy)
## 'data.frame': 427 obs. of 3 variables:
## $ Species: Factor w/ 7 levels "Pink","Chum",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
## $ Capture: int 100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
geom_line()
# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
geom_line()
# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
geom_line(aes(group = Species))
# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
geom_line()
7.4 Themes
- All non-data ink.
- Visual elements not part of the data.
7.4.1 Themes from scratch
7.4.1.1 Moving the legend
To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.
p + theme(legend.position = new_value)
Here, the new value can be
- "top", "bottom", "left", or "right": place it at that side of the plot.
- "none": don’t draw it.
- c(x, y): place it inside the plot, where c(0, 0) means the bottom-left and c(1, 1) means the top-right.
plt_prop_unemployed_over_time <- ggplot(economics, aes(date, unemploy / pop)) +
  geom_line(aes(color = pce)) +
  theme(legend.position = "right") +
  ggtitle("unemployed proportion over time")
# View the default plot
plt_prop_unemployed_over_time
# Remove legend entirely
plt_prop_unemployed_over_time +
  theme(legend.position = "none")
# Position the legend at the bottom of the plot
plt_prop_unemployed_over_time +
  theme(legend.position = "bottom")
# Position the legend inside the plot at (0.6, 0.1)
plt_prop_unemployed_over_time +
  theme(legend.position = c(0.6, 0.1))
7.4.1.2 Modifying theme elements
Many plot elements have multiple properties that can be set. For example, line elements in the plot such as axes and gridlines have a color, a thickness (linewidth, called size before ggplot2 3.4.0), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line(). Similarly, element_rect() changes rectangles and element_text() changes text. You can remove a plot element using element_blank().
plt_prop_unemployed_over_time +
  theme(
    # For all rectangles, set the fill color to grey92
    rect = element_rect(fill = "grey92"),
    # For the legend key, turn off the outline
    legend.key = element_rect(color = NA)
  )
plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    # Turn off axis ticks
    axis.ticks = element_blank(),
    # Turn off the panel grid
    panel.grid = element_blank()
  )
plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    # Add major y-axis panel grid lines back
    panel.grid.major.y = element_line(
      # Set the color to red
      color = "red",
      # Set the line width to 0.5 (linewidth replaces size as of ggplot2 3.4.0)
      linewidth = 0.5,
      # Set the line type to dotted
      linetype = "dotted"
    )
  )
plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(
      color = "red",
      linewidth = 0.5,
      linetype = "dotted"
    ),
    # Set the axis text color to grey25
    axis.text = element_text(color = "grey25"),
    # Set the plot title font face to italic and font size to 16
    plot.title = element_text(size = 16, face = "italic")
  )
7.4.1.3 Modifying whitespace
Whitespace means all the non-visible margins and spacing in the plot.
To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.
Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit).
The default unit is "pt" (points), which scales well with text. Other options include “cm”, “in” (inches) and “lines” (of text).
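As a small sketch of the two helpers, the code below builds one single-value length and one four-sided margin and passes both to theme(). The specific numbers and the names tick_length and plot_margin are illustrative only.

```r
library(ggplot2)

# A single length: 2 lines of text
tick_length <- unit(2, "lines")

# Four lengths in the order top, right, bottom, left
plot_margin <- margin(20, 30, 40, 50, unit = "pt")

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme(
    axis.ticks.length = tick_length,
    plot.margin = plot_margin
  )
```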
plt_mpg_vs_wt_by_cyl <- ggplot(mtcars, aes(mpg, wt, color = fcyl)) +
  geom_point() +
  # linewidth replaces the size argument of element_rect() as of ggplot2 3.4.0
  theme(panel.border = element_rect(color = "blue",
                                    fill = NA,
                                    linewidth = 0.6,
                                    linetype = "dotted"),
        legend.box.background = element_rect(color = "blue",
                                             fill = NA,
                                             linewidth = 0.6,
                                             linetype = "dotted"),
        legend.position = "right")
plt_mpg_vs_wt_by_cyl
plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the axis tick length to 2 lines
    axis.ticks.length = unit(2, "lines")
  )
plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the legend key size to 3 centimeters
    legend.key.size = unit(3, "cm")
  )
plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the legend margin to (20, 30, 40, 50) points
    legend.margin = ggplot2::margin(20, 30, 40, 50, "pt")
  )
plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the plot margin to (10, 30, 50, 70) millimeters
    plot.margin = ggplot2::margin(10, 30, 50, 70, "mm")
  )
7.4.2 Theme flexibility
7.4.2.1 Built-in themes
In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.
- theme_gray() is the default.
- theme_bw() is useful when you use transparency.
- theme_classic() is more traditional.
- theme_void() removes everything but the data.
# Add a black and white theme
plt_prop_unemployed_over_time +
  theme_bw()
# Add a classic theme
plt_prop_unemployed_over_time +
  theme_classic()
# Add a void theme
plt_prop_unemployed_over_time +
  theme_void()
7.4.2.2 Exploring ggthemes
Outside of ggplot2, another source of built-in themes is the ggthemes package.
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.1
# Use the fivethirtyeight theme
plt_prop_unemployed_over_time +
  theme_fivethirtyeight()
# Use Tufte's theme
plt_prop_unemployed_over_time +
  theme_tufte()
# Use the Wall Street Journal theme
plt_prop_unemployed_over_time +
  theme_wsj()
7.4.2.3 Setting themes
Reusing a theme across many plots helps to provide a consistent style. You have several options for this.
Assign the theme to a variable, and add it to each plot.
Set your theme as the default using theme_set().
A good strategy, which you’ll use here, is to begin with a built-in theme and then modify it.
# Save the theme as theme_recession
theme_recession <- theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
axis.text = element_text(color = "grey25"),
plot.title = element_text(face = "italic", size = 16),
legend.position = c(0.6, 0.1)
)
# Combine the Tufte theme with theme_recession
theme_tufte_recession <- theme_tufte() + theme_recession
# Add the Tufte recession theme to the plot
plt_prop_unemployed_over_time + theme_tufte_recession
Setting the default with theme_set() produces the same output.
# Set theme_tufte_recession as the default theme
theme_set(theme_tufte_recession)
# Draw the plot (without explicitly adding a theme)
plt_prop_unemployed_over_time
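Because theme_set() changes the default for every later plot in the session, it can help to keep the previous default so it can be restored. theme_set() invisibly returns the theme that was active before the call. A minimal sketch, using theme_bw() and a plain mtcars plot as stand-ins (old_theme and p are illustrative names, not part of the exercise):

```r
library(ggplot2)

# theme_set() invisibly returns the previously active default theme
old_theme <- theme_set(theme_bw())

# Drawn with theme_bw(), without adding a theme explicitly
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()
p

# Restore the earlier default so later plots are unaffected
theme_set(old_theme)
```

theme_get() returns whichever theme is currently the default, which is a quick way to check what theme_set() left in place.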
7.4.2.4 Publication-quality plots
plt_prop_unemployed_over_time +
# Add Tufte's theme
theme_tufte()
plt_prop_unemployed_over_time +
theme_tufte() +
# Add individual theme elements
theme(
# Turn off the legend
legend.position = "none",
# Turn off the axis ticks
axis.ticks = element_blank()
)
plt_prop_unemployed_over_time +
theme_tufte() +
# Add individual theme elements
theme(
legend.position = "none",
axis.ticks = element_blank(),
# Set the axis title's text color to grey60
axis.title = element_text(color = "grey60"),
# Set the axis text's text color to grey60
axis.text = element_text(color = "grey60")
)
plt_prop_unemployed_over_time +
theme_tufte() +
# Add individual theme elements
theme(
legend.position = "none",
axis.ticks = element_blank(),
axis.title = element_text(color = "grey60"),
axis.text = element_text(color = "grey60"),
# Set the panel gridlines major y values
panel.grid.major.y = element_line(
# Set the color to grey60
color = "grey60",
# Set the size to 0.25
size = 0.25,
# Set the linetype to dotted
linetype = "dotted"
)
)
7.4.3 Effective explanatory plots
7.4.3.1 Using geoms for explanatory plots
# read_csv() comes from the readr package
library(readr)
gm2007 <- read_csv("data/gm2007.csv")
# Add a geom_segment() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2)
# Add a geom_text() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = lifeExp), color = "white", size = 1.5)
# Set the color scale
# brewer.pal() comes from the RColorBrewer package
library(RColorBrewer)
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]
# Modify the scales
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("", expand = c(0, 0), limits = c(30, 90), position = "top") +
scale_color_gradientn(colors = palette)
# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]
# Add a title and caption
plt_country_vs_lifeExp <- ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("",
expand = c(0,0),
limits = c(30,90),
position = "top") +
scale_color_gradientn(colors = palette) +
labs(title = "Highest and lowest life expectancies, 2007",
caption = "Source: gapminder")
plt_country_vs_lifeExp
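The palette in this section is built by requesting five colors from the RdYlBu palette and then dropping positions 2 to 4 with negative indexing, so only the two end colors remain. scale_color_gradientn() then interpolates between those two colors. A minimal sketch of what the subsetting does (all_five and end_colors are illustrative names, not part of the exercise):

```r
library(RColorBrewer)

# Five colors from the diverging red-yellow-blue palette
all_five <- brewer.pal(5, "RdYlBu")

# Negative indexing removes positions 2, 3, and 4,
# leaving only the first and last colors
end_colors <- all_five[-(2:4)]

# Number of colors kept
length(end_colors)
```

Dropping the middle colors removes the pale yellow midpoint, which would otherwise be hard to see against a light background.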
7.4.3.2 Using annotate() for embellishments
In this exercise, you’ll use annotate() to add text and a curve to the plot.
# Define the theme
plt_country_vs_lifeExp +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
gm2007_full <- read_csv("data/gm2007_full.csv")
global_mean <- mean(gm2007_full$lifeExp)
x_start <- global_mean + 4
y_start <- 5.5
x_end <- global_mean
y_end <- 7.5
plt_country_vs_lifeExp +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none") +
# Add a vertical line
geom_vline(xintercept = global_mean,
color = "grey40",
linetype = 3)
plt_country_vs_lifeExp +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none") +
geom_vline(xintercept = global_mean,
color = "grey40",
linetype = 3) +
# Add text
annotate(
"text",
x = x_start, y = y_start,
label = "The\nglobal\naverage",
vjust = 1, size = 3, color = "grey40"
)
plt_country_vs_lifeExp +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none") +
geom_vline(xintercept = global_mean,
color = "grey40",
linetype = 3) +
annotate(
"text",
x = x_start, y = y_start,
label = "The\nglobal\naverage",
vjust = 1, size = 3, color = "grey40"
) +
# Add a curve
annotate(
"curve",
x = x_start, y = y_start,
xend = x_end, yend = y_end,
arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
color = "grey40"
)