Chapter 13 Exploratory Data Analysis in R
13.1 Exploring Categorical Data
13.1.1 Categorical data
13.1.1.1 Contingency table
Create a contingency table by table()
, is a useful way to represent the total counts of observations that fall into each combination of the levels of categorical variables.
comics
dataset is a collection of characteristics on all of the superheroes created by Marvel and DC comics in the last 80 years.
library(tidyverse)
<- read_csv("data/comics.csv", col_types = "ffffffffiff")
comics glimpse(comics)
## Rows: 23,272
## Columns: 11
## $ name <fct> "Spider-Man (Peter Parker)", "Captain America (Steven Rog…
## $ id <fct> Secret, Public, Public, Public, No Dual, Public, Public, …
## $ align <fct> Good, Good, Neutral, Good, Good, Good, Good, Good, Neutra…
## $ eye <fct> Hazel Eyes, Blue Eyes, Blue Eyes, Blue Eyes, Blue Eyes, B…
## $ hair <fct> Brown Hair, White Hair, Black Hair, Black Hair, Blond Hai…
## $ gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Mal…
## $ gsm <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ alive <fct> Living Characters, Living Characters, Living Characters, …
## $ appearances <int> 4043, 3360, 3061, 2961, 2258, 2255, 2072, 2017, 1955, 193…
## $ first_appear <fct> "Aug-62", "Mar-41", "Oct-74", "Mar-63", "Nov-50", "Nov-61…
## $ publisher <fct> marvel, marvel, marvel, marvel, marvel, marvel, marvel, m…
# Check levels of align
levels(comics$align)
## [1] "Good" "Neutral" "Bad"
## [4] "Reformed Criminals"
# Check the levels of gender
levels(comics$gender)
## [1] "Male" "Female" "Other"
# Create a 2-way contingency table
<- table(comics$gender, comics$align); tab tab
##
## Good Neutral Bad Reformed Criminals
## Male 4809 1799 7561 2
## Female 2490 836 1573 1
## Other 17 17 32 0
13.1.1.2 Dropping levels
The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.
Two steps:
filtering out any rows with the levels that have very low counts,
removing these levels from the factor variable with
droplevels()
.
(This is because the droplevels()
would keep levels that have just 1 or 2 counts; it only drops levels that don’t exist in a dataset.)
Find out which level of align
has the fewest total entries. Then filter out all rows of comics with that level, then drop the unused level with droplevels()
.
# Remove align level
<- comics %>%
comics_filtered filter(align != "Reformed Criminals") %>%
droplevels()
# See the result
comics_filtered
## # A tibble: 19,856 × 11
## name id align eye hair gender gsm alive appearances first_appear
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <int> <fct>
## 1 "Spider-… Secr… Good Haze… Brow… Male <NA> Livi… 4043 Aug-62
## 2 "Captain… Publ… Good Blue… Whit… Male <NA> Livi… 3360 Mar-41
## 3 "Wolveri… Publ… Neut… Blue… Blac… Male <NA> Livi… 3061 Oct-74
## 4 "Iron Ma… Publ… Good Blue… Blac… Male <NA> Livi… 2961 Mar-63
## 5 "Thor (T… No D… Good Blue… Blon… Male <NA> Livi… 2258 Nov-50
## 6 "Benjami… Publ… Good Blue… No H… Male <NA> Livi… 2255 Nov-61
## 7 "Reed Ri… Publ… Good Brow… Brow… Male <NA> Livi… 2072 Nov-61
## 8 "Hulk (R… Publ… Good Brow… Brow… Male <NA> Livi… 2017 May-62
## 9 "Scott S… Publ… Neut… Brow… Brow… Male <NA> Livi… 1955 Sep-63
## 10 "Jonatha… Publ… Good Blue… Blon… Male <NA> Livi… 1934 Nov-61
## # ℹ 19,846 more rows
## # ℹ 1 more variable: publisher <fct>
str(comics_filtered$align)
## Factor w/ 3 levels "Good","Neutral",..: 1 1 2 1 1 1 1 1 2 1 ...
13.1.1.3 Side-by-side bar charts
While a contingency table represents the counts numerically, it’s often more useful to represent them graphically.
Here you’ll construct two side-by-side bar charts of the comics data. This shows that there can often be two or more options for presenting the same data.
# Create side-by-side bar chart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "dodge")
# Create side-by-side bar chart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))
13.1.2 Counts vs. proportions
13.1.2.1 Proportions
# Simplify display format
options(scipen = 999, digits = 3)
<- table(comics$id, comics$align)
tab_cnt tab_cnt
##
## Good Neutral Bad Reformed Criminals
## Secret 2475 959 4493 1
## Public 2930 965 2172 1
## No Dual 647 390 474 0
## Unknown 0 2 7 0
Compute the proportions by prop.table(table)
.
prop.table(tab_cnt)
##
## Good Neutral Bad Reformed Criminals
## Secret 0.1595128 0.0618072 0.2895721 0.0000644
## Public 0.1888373 0.0621939 0.1399845 0.0000644
## No Dual 0.0416989 0.0251353 0.0305491 0.0000000
## Unknown 0.0000000 0.0001289 0.0004511 0.0000000
Note that because these are all proportions out of the whole dataset, the sum of all of these proportions is 1.
sum(prop.table(tab_cnt))
## [1] 1
13.1.2.2 Conditional proportions
condition on the rows: prop.table(table, 1)
prop.table(tab_cnt, 1)
##
## Good Neutral Bad Reformed Criminals
## Secret 0.312185 0.120964 0.566726 0.000126
## Public 0.482861 0.159031 0.357943 0.000165
## No Dual 0.428193 0.258107 0.313700 0.000000
## Unknown 0.000000 0.222222 0.777778 0.000000
Because we’re conditioning on row, it’s every row that now sums to one.
rowSums(prop.table(tab_cnt, 1))
## Secret Public No Dual Unknown
## 1 1 1 1
condition on the columns: prop.table(table, 2)
prop.table(tab_cnt, 2)
##
## Good Neutral Bad Reformed Criminals
## Secret 0.408956 0.414076 0.628743 0.500000
## Public 0.484137 0.416667 0.303946 0.500000
## No Dual 0.106907 0.168394 0.066331 0.000000
## Unknown 0.000000 0.000864 0.000980 0.000000
Because we’re conditioning on column, it’s every column that now sums to one.
colSums(prop.table(tab_cnt, 2))
## Good Neutral Bad Reformed Criminals
## 1 1 1 1
Approximately 41% of all good characters are secret.
13.1.2.3 Conditional bar chart
Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on.
By adding position = "fill"
to geom_bar()
, you are saying you want the bars to fill the entire height of the plotting window, thus displaying proportions and not raw counts.
Create a stacked bar chart of gender counts.
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
Create a stacked bar chart of gender proportions.
# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "fill") +
ylab("proportion")
13.1.3 Distribution of one variable
13.1.3.1 Marginal bar chart
If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a bar chart for just that single variable.
You can improve the interpretability of the plot, though, by implementing some sensible ordering.
# Change the order of the levels in align
$align <- factor(comics$align,
comicslevels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) +
geom_bar()
13.1.3.2 Conditional bar chart
Now, if you want to break down the distribution of alignment based on gender, you’re looking for conditional distributions.
You could make these by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender.
Create a bar chart of align
faceted by gender
.
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
13.1.3.3 Improve pie chart
The pie chart is a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than bar charts.
This is a pie chart of a dataset called pies
that contains the favorite pie flavors of 98 people.
<- read_delim("data/pies.txt", delim = ",") pies
## Rows: 98 Columns: 1
## ── Column specification ──────────────────────
## Delimiter: ","
## chr (1): flavor
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
$flavor <- as.factor(pies$flavor)
pies
# pie chart, 很難看出數量很難看出數量高低
ggplot(pies, aes(x = factor(1), fill = flavor)) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
theme_void()
Improve the representation of these data by constructing a bar chart that is ordered in descending order of count.
# Put levels of flavor in descending order
<- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
lev $flavor <- factor(pies$flavor, levels = lev)
pies
# Create bar chart of flavor
ggplot(pies, aes(x = flavor)) +
geom_bar(fill = "chartreuse") +
theme(axis.text.x = element_text(angle = 90))
13.2 Exploring numerical data
13.2.1 Numerical Data
13.2.1.1 Faceted histogram
cars
dataset, which records characteristics on all of the new models of cars for sale in the US in a certain year.
You will investigate the distribution of mileage across a categorical variable,
<- read_csv("data/cars04.csv")
cars
# Learn data structure
str(cars, give.attr = FALSE)
## spc_tbl_ [428 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:428] "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
## $ sports_car : logi [1:428] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ suv : logi [1:428] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wagon : logi [1:428] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ minivan : logi [1:428] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pickup : logi [1:428] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ all_wheel : logi [1:428] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rear_wheel : logi [1:428] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ msrp : num [1:428] 11690 12585 14610 14810 16385 ...
## $ dealer_cost: num [1:428] 10965 11802 13697 13884 15357 ...
## $ eng_size : num [1:428] 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : num [1:428] 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : num [1:428] 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : num [1:428] 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : num [1:428] 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : num [1:428] 2370 2348 2617 2676 2617 ...
## $ wheel_base : num [1:428] 98 98 104 104 104 105 105 103 103 103 ...
## $ length : num [1:428] 167 153 183 183 183 174 174 168 168 168 ...
## $ width : num [1:428] 66 66 69 68 69 67 67 67 67 67 ...
Plot a histogram of city_mpg
faceted by suv
, a logical variable indicating whether the car is an SUV or not.
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
geom_histogram() +
facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values
## (`stat_bin()`).
you can facet a plot by any categorical variable using facet_wrap()
.
13.2.1.2 Boxplots & density plots
The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders).
To explore the relationship between these two variables, you’ll try your hand at two alternatives: the box plot and the density plot.
A quick look at ncyl
shows that there are more possible levels of ncyl
than you might think. Here, restrict attention to the most common levels.
unique(cars$ncyl)
## [1] 4 6 3 8 5 12 10 -1
# Filter cars with 4, 6, 8 cylinders
<- filter(cars, cars$ncyl %in% c(4, 6, 8)) common_cyl
Create side-by-side box plots.
# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
## Warning: Removed 11 rows containing non-finite values
## (`stat_boxplot()`).
Create overlaid density plots.
# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values
## (`stat_density()`).
13.2.2 Distribution of one variable
13.2.2.1 Marginal & conditional hist
Now, turn your attention to a new variable: horsepwr
. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.
# Create hist of horsepwr
%>%
cars ggplot(aes(x = horsepwr)) +
geom_histogram() +
ggtitle("marginal distribution of horsepower")
## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.
# Create hist of horsepwr for affordable cars
%>%
cars filter(msrp < 25000) %>%
ggplot(aes(x = horsepwr)) +
geom_histogram() +
xlim(c(90, 550)) +
ggtitle("distribution of horsepower conditional on the price less than $25000")
## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values
## (`geom_bar()`).
13.2.2.2 Binwidths
The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes.
It’s good practice to consider several binwidths in order to detect different types of structure in your data.
# Create hist of horsepwr with binwidth of 3
<- cars %>%
plot_A ggplot(aes(x = horsepwr)) +
geom_histogram(binwidth = 3) +
ggtitle("Plot A, binwidth = 3")
# Create hist of horsepwr with binwidth of 30
<- cars %>%
plot_B ggplot(aes(x = horsepwr)) +
geom_histogram(binwidth = 30) +
ggtitle("Plot B, binwidth = 30")
# Create hist of horsepwr with binwidth of 60
<- cars %>%
plot_C ggplot(aes(x = horsepwr)) +
geom_histogram(binwidth = 60) +
ggtitle("Plot C, binwidth = 60")
# put all plot together to compare
::grid.arrange(plot_A, plot_B, plot_C, nrow = 1) gridExtra
Plot A is the only histogram that shows the count for cars with exactly 200 and 300 horsepower.
13.2.3 Box plots
13.2.3.1 Box plots for outliers
In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers.
You can apply this method to the msrp
column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.
# Construct box plot of msrp
%>%
cars ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
# Exclude outliers from data
<- cars %>%
cars_no_out filter(msrp < 100000)
# Construct box plot of msrp using the reduced dataset
%>%
cars_no_out ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
13.2.3.2 Plot selection
Both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.
Consider two other columns in the cars dataset: city_mpg
and width
. Which is the most appropriate plot for displaying the important features of their distributions?
For each variable, try both plots and decide which one is better at capturing the important structure.
# Create plot of city_mpg
# box plot
<- cars %>%
city_mpg_boxplot ggplot(aes(x = 1, y = city_mpg)) +
geom_boxplot()
# density plot
<- cars %>%
city_mpg_density ggplot(aes(x = city_mpg)) +
geom_density()
# Create plot of width
# box plot
<- cars %>%
width_boxplot ggplot(aes(x = 1, y = width)) +
geom_boxplot()
# density plot
<- cars %>%
width_density ggplot(aes(x = width)) +
geom_density()
::grid.arrange(city_mpg_boxplot, city_mpg_density, width_boxplot, width_density, nrow = 2) gridExtra
## Warning: Removed 14 rows containing non-finite values
## (`stat_boxplot()`).
## Warning: Removed 14 rows containing non-finite values
## (`stat_density()`).
## Warning: Removed 28 rows containing non-finite values
## (`stat_boxplot()`).
## Warning: Removed 28 rows containing non-finite values
## (`stat_density()`).
Because the city_mpg
variable has a much wider range with its outliers, it’s best to display its distribution as a box plot.
13.2.4 Visual in higher dimensions
Higher dimensional plots:
Shape, Size, Color, Pattern, Movement, x-coordinate, y-coordinate
13.2.4.1 three variables plot
Faceting is a valuable technique for looking at several conditional distributions at the same time.
If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.
Using common_cyl
, create a histogram of hwy_mpg. Grid-facet the plot rowwise by ncyl
and columnwise by suv
.
# Facet hists using hwy mileage and ncyl
%>%
common_cyl ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv) +
ggtitle("hwy_mpg faceted by ncyl and suv")
## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values
## (`stat_bin()`).
13.3 Numerical Summaries
Characteristics of a distribution
Center
Variability
Shape
Outliers
13.3.1 Measures of center
Center: mean, median, mode
The choice of measure for center can have a dramatic impact on what we consider to be a typical observation, so it is important that you consider the shape of the distribution before deciding on the measure.
13.3.1.1 Calculate center measures
You will use data from gapminder
, which tracks demographic data in countries of the world over time. How the life expectancy differs from continent to continent?
# Create dataset of 2007 data
<- filter(gapminder::gapminder, year == 2007)
gap2007
# Compute groupwise mean and median lifeExp
%>%
gap2007 group_by(continent) %>%
summarize(mean(lifeExp),
median(lifeExp))
## # A tibble: 5 × 3
## continent `mean(lifeExp)` `median(lifeExp)`
## <fct> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
# Generate box plots of lifeExp for each continent
%>%
gap2007 ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
13.3.2 Measures of variability
Variability: variance, standard deviation, interquartile range(IQR), range
The choice of measure for spread can dramatically impact how variable we consider our data to be, so it is important that you consider the shape of the distribution before deciding on the measure.
IQR is useful when your dataset is heavily skewed or has extreme observations.
13.3.2.1 Calculate spread measures
summarize life expectancies using the sd()
, the IQR()
, and the count of countries, n()
.
# Compute groupwise measures of spread
%>%
gap2007 group_by(continent) %>%
summarize(sd(lifeExp),
IQR(lifeExp),
n())
## # A tibble: 5 × 4
## continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
## <fct> <dbl> <dbl> <int>
## 1 Africa 9.63 11.6 52
## 2 Americas 4.44 4.63 25
## 3 Asia 7.96 10.2 33
## 4 Europe 2.98 4.78 30
## 5 Oceania 0.729 0.516 2
# Graphically compare the spread of these distributions
# Generate overlaid density plots
%>%
gap2007 ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.3)
13.3.2.2 Choose measures for center & spread
Using the shapes of the density plots, calculate the most appropriate measures of center and spread.
# The distribution of life expectancy in the countries of the Americas.
%>%
gap2007 filter(continent == "Americas") %>%
ggplot(aes(x = lifeExp)) +
geom_density()
# Compute stats for lifeExp in Americas
%>%
gap2007 filter(continent == "Americas") %>%
summarize(mean(lifeExp),
sd(lifeExp))
## # A tibble: 1 × 2
## `mean(lifeExp)` `sd(lifeExp)`
## <dbl> <dbl>
## 1 73.6 4.44
# The distribution of country populations across the entire gap2007 dataset.
%>%
gap2007 ggplot(aes(x = pop)) +
geom_density()
median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.
# Compute stats for population
%>%
gap2007 summarize(median(pop),
IQR(pop))
## # A tibble: 1 × 2
## `median(pop)` `IQR(pop)`
## <dbl> <dbl>
## 1 10517531 26702008.
13.3.3 Shape & transformations
Modality
The modality of a distribution is the number of prominent humps that show up in the distribution.
Skew
The skew is where the long tail is.
13.3.3.1 Transformations
Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.
# Original pop shape
<- gap2007 %>%
pop_nontrans_plot ggplot(aes(x = pop)) +
geom_density() +
ggtitle("population without transform")
# Transform the skewed pop variable
<- gap2007 %>%
gap2007 mutate(log_pop = log(pop))
# Create density plot of new variable
<- gap2007 %>%
pop_trans_plot ggplot(aes(x = log_pop)) +
geom_density() +
ggtitle("population with log transform")
# compare transforming
::grid.arrange(pop_nontrans_plot, pop_trans_plot, nrow = 1) gridExtra
13.3.4 Outliers
13.3.4.1 Identify outliers
Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy.
# box plot for Asia lifeExp
%>%
gap2007 filter(continent == "Asia") %>%
ggplot(aes(x = 1, y = lifeExp)) +
geom_boxplot()
# which country with lowest lifeExp
which.min(gap2007$lifeExp), c("country", "lifeExp")] gap2007[
## # A tibble: 1 × 2
## country lifeExp
## <fct> <dbl>
## 1 Swaziland 39.6
Building a plot with that country removed.
# Filter for Asia, add column indicating outliers
<- gap2007 %>%
gap_asia filter(continent == "Asia") %>%
mutate(is_outliner = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
%>%
gap_asia filter(!is_outliner) %>%
ggplot(aes(x = 1, y = lifeExp)) +
geom_boxplot()
13.4 Case Study
13.4.1 New data
What characteristics of an email are associated with it being spam?
Here, you’ll use the email
dataset to settle that question.
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:lattice':
##
## ethanol, lsegments
## The following object is masked from 'package:babynames':
##
## births
str(email)
## tibble [3,921 × 21] (S3: tbl_df/tbl/data.frame)
## $ spam : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ to_multiple : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
## $ from : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ cc : int [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ sent_email : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
## $ time : POSIXct[1:3921], format: "2012-01-01 14:16:41" "2012-01-01 15:03:59" ...
## $ image : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ attach : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ dollar : num [1:3921] 0 0 4 0 0 0 0 0 0 0 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num [1:3921] 0 0 1 0 0 0 0 0 0 0 ...
## $ viagra : num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num [1:3921] 0 0 0 0 2 2 0 0 0 0 ...
## $ num_char : num [1:3921] 11.37 10.5 7.77 13.26 1.23 ...
## $ line_breaks : int [1:3921] 202 202 192 255 29 25 193 237 69 68 ...
## $ format : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 1 2 ...
## $ re_subj : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ exclaim_subj: num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
## $ urgent_subj : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ exclaim_mess: num [1:3921] 0 1 6 48 1 1 1 18 1 0 ...
## $ number : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...
# label factor for easier to read
$spam <- factor(email$spam, labels = c("non-spam", "spam"))
emailstr(email$spam)
## Factor w/ 2 levels "non-spam","spam": 1 1 1 1 1 1 1 1 1 1 ...
13.4.1.1 Spam and num_char
Is there an association between spam and the length of an email?
# distribution of num_char
ggplot(email, aes(x = num_char)) +
geom_density()
Compute appropriate measures of the center and spread of num_char
for both spam and not-spam.
# Compute summary statistics
%>%
email group_by(spam) %>%
summarise(median(num_char),
IQR(num_char))
## # A tibble: 2 × 3
## spam `median(num_char)` `IQR(num_char)`
## <fct> <dbl> <dbl>
## 1 non-spam 6.83 13.6
## 2 spam 1.05 2.82
Construct side-by-side box plots to visualize the association between the same two variables.
# Create plot
%>%
email mutate(log_num_char = log(num_char)) %>%
ggplot(aes(x = spam, y = log_num_char)) +
geom_boxplot()
Interpretation: the median length of not-spam emails is greater than that of spam emails.
13.4.1.2 Spam and !!!
Let’s look at a more obvious indicator of spam: exclamation marks. exclaim_mess
contains the number of exclamation marks in each message.
Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.
Side-by-side box plots
Faceted histograms
Overlaid density plots
# distribution of exclaim_mess
ggplot(email, aes(x = exclaim_mess)) +
geom_histogram() +
facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.
Calculate appropriate measures of the center and spread of exclaim_mess
for both spam and not-spam.
# Compute center and spread for exclaim_mess by spam
%>%
email group_by(spam) %>%
summarise(median(exclaim_mess),
IQR(exclaim_mess))
## # A tibble: 2 × 3
## spam `median(exclaim_mess)` `IQR(exclaim_mess)`
## <fct> <dbl> <dbl>
## 1 non-spam 1 5
## 2 spam 0 1
Construct an appropriate plot to visualize the association between the same two variables, adding in a log-transformation step if necessary.
log(0)
is -Inf
in R, which isn’t a very useful value! You can get around this by adding a small number (like 0.01
) to the quantity inside the log()
function.
# Create plot for spam and exclaim_mess
%>%
email mutate(log_exclaim_mess = log(0.01 + exclaim_mess)) %>%
ggplot(aes(x = log_exclaim_mess)) +
geom_histogram() +
facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.
The dominant feature of the exclaim mess variable, though, is the large proportion of cases that report zero or on this log scale, -4 (point) 6 exclamation marks. This is a common occurrence in data analysis that is often termed “zero-inflation”.
13.4.2 Check-in 1
Zero inflation strategies
Analyze the two components separately (zero / non-zero)
Collapse into two-level categorical variable (zero / non-zero)
13.4.2.1 Collapsing levels
The number of images attached to each email (image
) poses even more of a challenge.
# get a sense of image's distribution
table(email$image)
##
## 0 1 2 3 4 5 9 20
## 3811 76 17 11 2 2 1 1
Given the very low counts at the higher number of images, let’s collapse image into a categorical variable that indicates whether or not the email had at least one image.
# Create plot of proportion of spam by image
%>%
email mutate(has_image = image > 0) %>%
ggplot(aes(x = has_image, fill = spam)) +
geom_bar(position = "fill")
An email without an image is more likely to be not-spam than spam.
13.4.2.2 Data Integrity
In the process of exploring a dataset, you’ll sometimes come across something that will lead you to question how the data were compiled.
Consider the variables image
and attach
. Do attached images count as attached files in this dataset?
# Test if images count as attachments
sum(email$image > email$attach)
## [1] 0
Since image
is never greater than attach
, we can infer that images are counted as attachments.
13.4.2.3 Answer questions with chains
Build a chain to answer each of the following questions, both about the variable dollar
.
For emails containing the word “dollar”, does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email?
# Question 1
%>%
email filter(dollar > 0) %>%
group_by(spam) %>%
summarize(median(num_char))
## # A tibble: 2 × 2
## spam `median(num_char)`
## <fct> <dbl>
## 1 non-spam 12.3
## 2 spam 1.41
If you encounter an email with greater than 10 occurrences of the word dollar, is it more likely to be spam
or not-spam
?
# Question 2
%>%
email filter(dollar > 10) %>%
ggplot(aes(x = spam)) +
geom_bar()
13.4.3 Check-in 2
13.4.3.1 What’s in a number?
Turn your attention to the variable called number
. It’s a factor variable saying whether there was no number, a small number (under 1 million), or a big number.
To explore the association between this variable and spam
, select and construct an informative plot.
Faceted bar charts
Side-by-side bar charts
Stacked and normalized bar charts.
# Reorder the levels of number so that they preserve the natural ordering.
$number_reordered <- factor(email$number, levels = c("none", "small", "big"))
email
# Construct a faceted bar chart of the association between number_reordered and spam.
ggplot(email, aes(x = number_reordered)) +
geom_bar() +
facet_wrap(~ spam)