Chapter 24 Intermediate SQL
24.1 Selecting Data
24.1.1 Querying a database
Setup films database
# Creating a new database
# Connect to the default postgres database
library(DBI)
<- dbConnect(RSQLite::SQLite(), "")
con ::opts_chunk$set(connection = "con")
knitr
# Loading data
<- read.csv("data/films.csv", header = FALSE, na.strings = "")
films colnames(films) <- c("id", "title", "release_year", "country", "duration", "language", "certification", "gross", "budget")
<- read.csv("data/people.csv", header = FALSE, col.names = c("id", "name", "birthdate", "deathdate"), na.strings = "")
people
<- read.csv("data/reviews.csv", header = FALSE, col.names = c("id", "film_id","num_user", "num_critic", "imdb_score", "num_votes", "facebook_likes"), na.strings = "")
reviews
<- read.csv("data/roles.csv", header = FALSE, col.names = c("id", "film_id", "person_id", "role"), na.strings = "")
roles
# Create database tables
dbWriteTable(con, "films", films)
dbWriteTable(con, "people", people)
dbWriteTable(con, "reviews", reviews)
dbWriteTable(con, "roles", roles)
# List database tables
dbListTables(con)
## [1] "films" "people" "reviews" "roles"
24.1.1.1 COUNT & DISTINCT
COUNT
Counts the number of records with a value in a field
COUNT(field_name)
counts values in a field, multiple fields: separate by,
COUNT(*)
counts records in a table: total observation
Use an
AS
for clarityIncludes duplicates
DISTINCT
DISTINCT
removes duplicates to return only unique valuesCombine
COUNT()
withDISTINCT
to count unique values
-- Count the number of records in the people table
SELECT COUNT(*) AS count_records
FROM people;
count_records |
---|
8397 |
-- Count the number of birthdates in the people table
SELECT COUNT(birthdate) AS count_birthdate
FROM people;
count_birthdate |
---|
6152 |
-- Count the records for languages and countries represented in the films table
SELECT COUNT(language) AS count_languages, COUNT(country) AS count_countries
FROM films;
count_languages | count_countries |
---|---|
4957 | 4966 |
SELECT DISTINCT
-- Return the unique countries from the films table
SELECT DISTINCT country
FROM films;
country |
---|
USA |
Germany |
Japan |
Denmark |
UK |
Italy |
France |
West Germany |
Sweden |
Soviet Union |
-- Count the distinct countries from the films table
SELECT COUNT(DISTINCT country) AS count_distinct_countries
FROM films
count_distinct_countries |
---|
64 |
24.1.2 Query execution
Order of execution
SQL is not processed in its written order
- e.g., SQL needs to know where to
SELECT
dataFROM
before it canLIMIT
the results.
- e.g., SQL needs to know where to
Good to know processing order for debugging and aliasing
Most common errors
Misspelling
Incorrect capitalization
Incorrect or missing punctuation, especially commas
24.2 Filtering Records
24.2.1 Filtering numbers
WHERE
Comparison operators
>
Greater than or after<
Less than or before=
Equal to>=
Greater than or equal to<=
Less than or equal to<>
Not equal to 👈
WHERE
with strings- Use single-quotes around strings we want to filter:
WHERE country = 'Japan'
- Use single-quotes around strings we want to filter:
24.2.1.1 WHERE
Using WHERE with numbers
In this case, you’ll want to filter your data to a specific budget
range.
-- Select film_ids and imdb_score with an imdb_score over 7.0
SELECT film_id, imdb_score
FROM reviews
WHERE imdb_score > 7;
film_id | imdb_score |
---|---|
3934 | 7.1 |
74 | 7.6 |
1254 | 8.0 |
4841 | 8.1 |
3252 | 7.2 |
1181 | 7.3 |
3929 | 7.1 |
3298 | 7.4 |
2744 | 7.4 |
4707 | 7.4 |
-- Select imdb_score with an imdb_score over 7.0
SELECT COUNT(imdb_score)
FROM reviews
WHERE imdb_score > 7;
COUNT(imdb_score) |
---|
1536 |
-- Select film_ids and facebook_likes for ten records with less than 1000 likes
SELECT film_id, facebook_likes
FROM reviews
WHERE facebook_likes < 1000
LIMIT 10;
film_id | facebook_likes |
---|---|
3405 | 0 |
478 | 491 |
74 | 930 |
740 | 0 |
2869 | 689 |
1181 | 0 |
2020 | 0 |
2312 | 912 |
1820 | 872 |
831 | 975 |
-- Count the records with at least 100,000 votes
SELECT COUNT(num_votes) AS films_over_100K_votes
FROM reviews
WHERE num_votes >= 100000;
films_over_100K_votes |
---|
1211 |
Using WHERE with text
WHERE
can also filter string values.
-- Count the Spanish-language films
SELECT COUNT(language) AS count_spanish
FROM films
WHERE language = 'Spanish';
count_spanish |
---|
40 |
There are 40 Spanish-language films in this table.
24.2.2 Multiple criteria
Multiple criteria
OR
,AND
,BETWEEN
-- OR
SELECT *
FROM coats
WHERE color = 'yellow' OR length = 'short';
-- AND
SELECT *
FROM coats
WHERE color = 'yellow' AND length = 'short';
-- BETWEEN, AND: BETWEEN 上下界都包含(1,5也算) 👈
SELECT *
FROM coats
WHERE buttons BETWEEN 1 AND 5;
-- AND, OR: Enclose individual clauses in parentheses👈
SELECT title
FROM films
WHERE (release_year = 1994 OR release_year = 1995)
AND (certification = 'PG' OR certification = 'R');
-- BETWEEN, AND, OR
SELECT title
FROM films
WHERE release_year
BETWEEN 1994 AND 2000 AND country='UK';
24.2.2.1 AND
The following exercises combine AND
and OR
with the WHERE
clause.
Combining conditions with AND
will prove to be very useful when we want our query to return a specific subset of records.
-- Select the title and release_year for all German-language films released before 2000
SELECT title, release_year
FROM films
WHERE (language = 'German')
AND (release_year < 2000);
title | release_year |
---|---|
Metropolis | 1927 |
Pandora’s Box | 1929 |
The Torture Chamber of Dr. Sadism | 1967 |
Das Boot | 1981 |
Run Lola Run | 1998 |
Aimee & Jaguar | 1999 |
-- Update the query to see all German-language films released after 2000
SELECT title, release_year
FROM films
WHERE (release_year > 2000)
AND (language = 'German');
title | release_year |
---|---|
Good Bye Lenin! | 2003 |
Downfall | 2004 |
Summer Storm | 2004 |
The Lives of Others | 2006 |
The Baader Meinhof Complex | 2008 |
The Wave | 2008 |
Cargo | 2009 |
Soul Kitchen | 2009 |
The White Ribbon | 2009 |
3 | 2010 |
Select all details for German-language films released after 2000 but before 2010 using only WHERE
and AND
.
-- Select all records for German-language films released after 2000 and before 2010
SELECT *
FROM films
WHERE (language = 'German')
AND (release_year > 2000 AND release_year < 2010);
id | title | release_year | country | duration | language | certification | gross | budget |
---|---|---|---|---|---|---|---|---|
1952 | Good Bye Lenin! | 2003 | Germany | 121 | German | R | 4063859 | 4800000 |
2130 | Downfall | 2004 | Germany | 178 | German | R | 5501940 | 13500000 |
2224 | Summer Storm | 2004 | Germany | 98 | German | R | 95016 | 2700000 |
2709 | The Lives of Others | 2006 | Germany | 137 | German | R | 11284657 | 2000000 |
3100 | The Baader Meinhof Complex | 2008 | Germany | 184 | German | R | 476270 | 20000000 |
3143 | The Wave | 2008 | Germany | 107 | German | NA | NA | 5000000 |
3220 | Cargo | 2009 | Switzerland | 112 | German | NA | NA | 4500000 |
3346 | Soul Kitchen | 2009 | Germany | 99 | German | NA | 274385 | 4000000 |
3412 | The White Ribbon | 2009 | Germany | 144 | German | R | 2222647 | 12000000 |
24.2.2.2 OR
This time you’ll write a query to get the title
and release_year
of films
released in 1990 or 1999, which were in English or Spanish and took in more than $2,000,000 gross
.
-- Find the title and year of films from the 1990 or 1999
SELECT title, release_year
FROM films
WHERE (release_year = 1990 OR release_year = 1999)
-- Add a filter to see only English or Spanish-language films
AND (language = 'English' OR language = 'Spanish')
-- Filter films with more than $2,000,000 gross
AND (gross > 2000000);
title | release_year |
---|---|
Arachnophobia | 1990 |
Back to the Future Part III | 1990 |
Child’s Play 2 | 1990 |
Dances with Wolves | 1990 |
Days of Thunder | 1990 |
Dick Tracy | 1990 |
Die Hard 2 | 1990 |
Edward Scissorhands | 1990 |
Flatliners | 1990 |
Ghost | 1990 |
24.2.2.3 BETWEEN
Let’s use BETWEEN
with AND
on the films
database to get the title
and release_year
of all Spanish-language films released between 1990 and 2000 (inclusive) with budgets
over $100 million.
-- Select the title and release_year for films released between 1990 and 2000
SELECT title, release_year
FROM films
WHERE (release_year BETWEEN 1990 AND 2000)
-- Narrow down your query to films with budgets > $100 million
AND (budget > 100000000)
-- Restrict the query to only Spanish-language films
AND (language = 'Spanish');
title | release_year |
---|---|
Tango | 1998 |
SELECT title, release_year
FROM films
WHERE release_year BETWEEN 1990 AND 2000
AND budget > 100000000
-- Amend the query to include Spanish or French-language films
AND (language = 'Spanish' OR language = 'French');
title | release_year |
---|---|
Les couloirs du temps: Les visiteurs II | 1998 |
Tango | 1998 |
24.2.3 Filtering text
Filter a pattern rather than specific text.
LIKE
eg.,
WHERE name LIKE '%r'
,WHERE name LIKE '___t%'
NOT LIKE
eg.,
WHERE name NOT LIKE 'A.%'
IN
eg.,
WHERE release_year IN (1920, 1930, 1940)
,WHERE country IN ('Germany', 'France')
Pattern
%
match zero, one, or many characters_
match a single character
24.2.3.1 LIKE & NOT LIKE
The LIKE
and NOT LIKE
operators can be used to find records that either match or do not match a specified pattern, respectively. They can be coupled with the wildcards %
and _
.
This is useful when you want to filter text, but not to an exact word.
-- Select the names that start with B
SELECT name
FROM people
WHERE name LIKE 'B%';
name |
---|
B.J. Novak |
Babak Najafi |
Babar Ahmed |
Bahare Seddiqi |
Bai Ling |
Bailee Madison |
Balinese Tari Legong Dancers |
Bálint Péntek |
Baltasar Kormákur |
Balthazar Getty |
SELECT name
FROM people
-- Select the names that have r as the second letter
WHERE name LIKE '_r%'
name |
---|
Ara Celi |
Aramis Knight |
Arben Bajraktaraj |
Arcelia RamÃrez |
Archie Kao |
Archie Panjabi |
Aretha Franklin |
Ari Folman |
Ari Gold |
Ari Graynor |
SELECT name
FROM people
-- Select names that don't start with A
WHERE name NOT LIKE 'A%'
name |
---|
50 Cent |
Ãlex Angulo |
Ãlex de la Iglesia |
Ãngela Molina |
B.J. Novak |
Babak Najafi |
Babar Ahmed |
Bahare Seddiqi |
Bai Ling |
Bailee Madison |
SELECT COUNT(name) AS count_name_beginA
FROM people
-- Select names that don't start with A
WHERE name NOT LIKE 'A%'
count_name_beginA |
---|
7768 |
24.2.3.2 WHERE IN
You can query multiple conditions using the IN
operator and a set of parentheses. It is a valuable piece of code that helps us keep our queries clean and concise.
-- Find the title and release_year for all films over two hours in length released in 1990 and 2000
SELECT title, release_year
FROM films
WHERE (duration > 120)
AND (release_year IN (1990, 2000));
title | release_year |
---|---|
Dances with Wolves | 1990 |
Die Hard 2 | 1990 |
Ghost | 1990 |
Goodfellas | 1990 |
Mo’ Better Blues | 1990 |
Pretty Woman | 1990 |
The Godfather: Part III | 1990 |
The Hunt for Red October | 1990 |
All the Pretty Horses | 2000 |
Almost Famous | 2000 |
-- Find the title and language of all films in English, Spanish, and French
SELECT title, language
FROM films
WHERE language IN ('English', 'Spanish', 'French');
title | language |
---|---|
The Broadway Melody | English |
Hell’s Angels | English |
A Farewell to Arms | English |
42nd Street | English |
She Done Him Wrong | English |
It Happened One Night | English |
Top Hat | English |
Modern Times | English |
The Charge of the Light Brigade | English |
Snow White and the Seven Dwarfs | English |
-- Find the title, certification, and language all films certified NC-17 or R that are in English, Italian, or Greek
SELECT title, certification, language
FROM films
WHERE (certification IN ('NC-17', 'R'))
AND (language IN ('English', 'Italian', 'Greek'));
title | certification | language |
---|---|---|
Psycho | R | English |
A Fistful of Dollars | R | Italian |
Rosemary’s Baby | R | English |
The Wild Bunch | R | English |
Catch-22 | R | English |
Cotton Comes to Harlem | R | English |
The Ballad of Cable Hogue | R | English |
The Conformist | R | Italian |
Woodstock | R | English |
Sweet Sweetback’s Baadasssss Song | R | English |
24.2.3.3 Combine filtering & selecting
How many 90’s films we have in our dataset that would be suitable for English-speaking teens?
(You will be using DISTINCT
here too because, surprise, there are two movies named ‘Hamlet’ in this dataset.)
-- Count the unique titles
SELECT COUNT(DISTINCT title) AS nineties_english_films_for_teens
FROM films
-- Filter to release_years to between 1990 and 1999
WHERE (release_year BETWEEN 1990 AND 1999)
-- Filter to English-language films
AND (language = 'English')
-- Narrow it down to G, PG, and PG-13 certifications
AND (certification IN ('G', 'PG', 'PG-13'));
nineties_english_films_for_teens |
---|
310 |
24.2.4 NULL values
null
: Missing values
COUNT(field_name)
includes only non-missing valuesCOUNT(*)
includes missing values
Use IS NULL
or IS NOT NULL
to:
Identify missing values
Select missing values
Exclude missing values
-- List all film titles with missing budgets
SELECT title AS no_budget_info
FROM films
WHERE budget IS NULL
no_budget_info |
---|
Pandora’s Box |
The Prisoner of Zenda |
The Blue Bird |
Bambi |
State Fair |
Open Secret |
Deadline - U.S.A. |
Ordet |
The Party’s Over |
The Torture Chamber of Dr. Sadism |
-- Count the number of films we have language data for
SELECT COUNT(title) AS count_language_known
FROM films
WHERE language IS NOT NULL;
count_language_known |
---|
4957 |
24.3 Aggregate Functions
24.3.1 Summarizing data
Aggregate functions
AVG()
, SUM()
, MIN()
, MAX()
, COUNT()
Numerical fields only
AVG()
,SUM()
Various data types
MIN()
,MAX()
,COUNT()
MIN() <-> MAX() in non-numerical data
A <-> Z
1715 <-> 2022
Aliasing when summarizing
Perhaps you’d like to know how old the oldest film in the films table is, what the most expensive film is, or how many films you have listed.
-- Query the sum of film durations
SELECT SUM(duration) AS total_duration
FROM films;
total_duration |
---|
534882 |
-- Calculate the average duration of all films
SELECT AVG(duration) AS average_duration
FROM films;
average_duration |
---|
108 |
-- Find the latest release_year
SELECT MAX(release_year) AS latest_year
FROM films;
latest_year |
---|
2016 |
-- Find the duration of the shortest film
SELECT MIN(duration) AS shortest_film
FROM films;
shortest_film |
---|
7 |
24.3.2 Summarizing subsets
Using
WHERE
with aggregate functionsROUND()
: Round a number to a specified decimalROUND(number_to_round, decimal_places)
whole number:
ROUND(number_to_round, 0)
小數點第二位:
ROUND(number_to_round, 2)
to thousands:
ROUND(number_to_round, -3)
- 123456 ⟶ 12000, negative parameter
24.3.2.1 Aggregate functions & WHERE
In your film-industry role, as an example, you may like to summarize each certification category to compare how they each perform or if one certification has a higher average budget than another.
-- Calculate the sum of gross from the year 2000 or later
SELECT SUM(gross) AS total_gross
FROM films
WHERE release_year >= 2000;
total_gross |
---|
150900926358 |
-- Calculate the average gross of films that start with A
SELECT AVG(gross) AS avg_gross_A
FROM films
WHERE title LIKE 'A%';
avg_gross_A |
---|
47893236 |
-- Calculate the lowest gross film in 1994
SELECT MIN(gross) AS lowest_gross
FROM films
WHERE release_year = 1994;
lowest_gross |
---|
125169 |
-- Calculate the highest gross film released between 2000-2012
SELECT MAX(gross) AS highest_gross
FROM films
WHERE release_year BETWEEN 2000 AND 2012;
highest_gross |
---|
760505847 |
24.3.2.2 ROUND()
Aggregate functions work great with numerical values; however, these results can sometimes get unwieldy when dealing with long decimal values. SQL provides you with the ROUND()
function to tame these long decimals.
-- Round the average number of facebook_likes to one decimal place
SELECT ROUND(AVG(facebook_likes), 1) AS avg_facebook_likes
FROM reviews;
avg_facebook_likes |
---|
7803 |
A useful thing you can do with ROUND()
is have a negative number as the decimal place parameter. This can come in handy if your manager only needs to know the average number of facebook_likes
to the hundreds since granularity below one hundred likes won’t impact decision making.
-- Calculate the average budget rounded to the thousands, 39902000
SELECT ROUND(AVG(budget), -3) AS avg_budget_thousands
FROM films;
avg_budget_thousands |
---|
39902826 |
The ROUND()
function is very handy when making financial calculations to get a top-level view or specify to the penny or cent.
24.3.3 Aliasing and arithmetic
Arithmetic
+
, -
, *
, and /
When dividing, SQL assumes that we want to get an integer back if we divide an integer by an integer.We can add decimal places to our numbers if we want more precision
SELECT (4 / 3);
⟶ 1SELECT (4.0 / 3.0);
⟶ 1.333…
Aliasing with arithmetic & function
Aggregate functions vs. arithmetic
aggregate functions: perform their operations on the fields vertically
arithmetic: adds up the records horizontally
-- default
SELECT (2 / 10);
(2 / 10) |
---|
0 |
-- add decimal
SELECT (2.0 / 10.0);
(2.0 / 10.0) |
---|
0.2 |
24.3.3.1 Aliasing with functions
Aliasing can be a lifesaver, especially as we start to do more complex SQL queries with multiple criteria. Aliases help you keep your code clean and readable.
For example, if you want to find the MAX()
value of several fields without aliasing, you’ll end up with the result with several columns called max
and no idea which is which. You can fix this with aliasing.
-- Calculate the title and duration_hours from films
SELECT title, (duration / 60.0) AS duration_hours
FROM films;
title | duration_hours |
---|---|
Intolerance: Love’s Struggle Throughout the Ages | 2.05 |
Over the Hill to the Poorhouse | 1.83 |
The Big Parade | 2.52 |
Metropolis | 2.42 |
Pandora’s Box | 1.83 |
The Broadway Melody | 1.67 |
Hell’s Angels | 1.60 |
A Farewell to Arms | 1.32 |
42nd Street | 1.48 |
She Done Him Wrong | 1.10 |
-- Calculate the percentage of people who are no longer alive
SELECT COUNT(deathdate) * 100.0 / COUNT(*) AS percentage_dead
FROM people;
percentage_dead |
---|
9.37 |
-- Find the number of decades in the films table
SELECT (MAX(release_year) - MIN(release_year)) / 10.0 AS number_of_decades
FROM films;
number_of_decades |
---|
10 |
films
table covers films released over one hundred years!
24.3.3.2 Rounding results
In the previous exercise, many of the results were inconveniently long. Update the query by adding ROUND()` around the calculation.
-- Round duration_hours to two decimal places
SELECT title, ROUND((duration / 60.0), 2) AS duration_hours
FROM films;
title | duration_hours |
---|---|
Intolerance: Love’s Struggle Throughout the Ages | 2.05 |
Over the Hill to the Poorhouse | 1.83 |
The Big Parade | 2.52 |
Metropolis | 2.42 |
Pandora’s Box | 1.83 |
The Broadway Melody | 1.67 |
Hell’s Angels | 1.60 |
A Farewell to Arms | 1.32 |
42nd Street | 1.48 |
She Done Him Wrong | 1.10 |
24.4 Sorting and Grouping
24.4.1 Sorting results
ORDER BY
: ASC
, DESC
Text
ASC
: Alphabetically (A-Z)ORDER BY multiple fields:
ORDER BY field_one, field_two
Different orders:
ORDER BY birthdate, name DESC;
24.4.1.1 ORDER BY
Sorting single fields
-- Select name from people and sort alphabetically
SELECT name
FROM people
ORDER BY name ASC;
name |
---|
50 Cent |
A. Michael Baldwin |
A. Raven Cruz |
A.J. Buckley |
A.J. DeLucia |
A.J. Langer |
AJ Michalka |
Aaliyah |
Aaron Ashmore |
Aaron Hann |
-- Select the title and duration from longest to shortest film
SELECT title, duration
FROM films
ORDER BY duration DESC;
title | duration |
---|---|
Carlos | 334 |
Blood In, Blood Out | 330 |
Heaven’s Gate | 325 |
The Legend of Suriyothai | 300 |
Das Boot | 293 |
Apocalypse Now | 289 |
The Company | 286 |
Gods and Generals | 280 |
Gettysburg | 271 |
Arn: The Knight Templar | 270 |
Sorting multiple fields
It will sort by the first field specified, then sort by the next, and so on.
-- Select the release year, duration, and title sorted by release year and duration
SELECT release_year, duration, title
FROM films
WHERE release_year IS NOT NULL
ORDER BY release_year, duration;
release_year | duration | title |
---|---|---|
1916 | 123 | Intolerance: Love’s Struggle Throughout the Ages |
1920 | 110 | Over the Hill to the Poorhouse |
1925 | 151 | The Big Parade |
1927 | 145 | Metropolis |
1929 | 100 | The Broadway Melody |
1929 | 110 | Pandora’s Box |
1930 | 96 | Hell’s Angels |
1932 | 79 | A Farewell to Arms |
1933 | 66 | She Done Him Wrong |
1933 | 89 | 42nd Street |
-- Select the certification, release year, and title sorted by certification and release year
-- Ordered first by certification (alphabetically) and second by release year, starting with the most recent year
SELECT certification, release_year, title
FROM films
WHERE certification IS NOT NULL
ORDER BY certification ASC, release_year DESC;
certification | release_year | title |
---|---|---|
Approved | 1967 | In Cold Blood |
Approved | 1967 | Point Blank |
Approved | 1967 | You Only Live Twice |
Approved | 1966 | A Funny Thing Happened on the Way to the Forum |
Approved | 1966 | A Man for All Seasons |
Approved | 1966 | Batman: The Movie |
Approved | 1966 | The Good, the Bad and the Ugly |
Approved | 1966 | Torn Curtain |
Approved | 1965 | Major Dundee |
Approved | 1965 | Thunderball |
24.4.2 Grouping data
24.4.2.1 GROUP BY
GROUP BY
multiple fields:GROUP BY certification, language
GROUP BY single fields
-- Find the release_year and film_count of each year
SELECT release_year, COUNT(*) AS film_count
FROM films
WHERE release_year IS NOT NULL
GROUP BY release_year;
release_year | film_count |
---|---|
1916 | 1 |
1920 | 1 |
1925 | 1 |
1927 | 1 |
1929 | 2 |
1930 | 1 |
1932 | 1 |
1933 | 2 |
1934 | 1 |
1935 | 1 |
-- Find the release_year and average duration of films for each year
SELECT release_year, AVG(duration) AS avg_duration
FROM films
WHERE release_year IS NOT NULL
GROUP BY release_year;
release_year | avg_duration |
---|---|
1916 | 123.0 |
1920 | 110.0 |
1925 | 151.0 |
1927 | 145.0 |
1929 | 105.0 |
1930 | 96.0 |
1932 | 79.0 |
1933 | 77.5 |
1934 | 65.0 |
1935 | 81.0 |
Using GROUP BY with a time or date field such as release_year
can help us identify trends such as a period of time where movies were really short!
GROUP BY multiple fields
Look at the maximum budget for each country in each year
-- Find the release_year, country, and max_budget, then group and order by release_year and country
SELECT release_year, country, MAX(budget) AS max_budget
FROM films
WHERE release_year IS NOT NULL
GROUP BY release_year, country
ORDER BY release_year, country;
release_year | country | max_budget |
---|---|---|
1916 | USA | 385907 |
1920 | USA | 100000 |
1925 | USA | 245000 |
1927 | Germany | 6000000 |
1929 | Germany | NA |
1929 | USA | 379000 |
1930 | USA | 3950000 |
1932 | USA | 800000 |
1933 | USA | 439000 |
1934 | USA | 325000 |
Answering business questions
Which release_year
had the most language diversity?
SELECT release_year, COUNT(DISTINCT language) AS num_language
FROM films
GROUP BY release_year
ORDER BY num_language DESC;
release_year | num_language |
---|---|
2006 | 16 |
2015 | 15 |
2005 | 14 |
2013 | 13 |
2008 | 13 |
2009 | 12 |
2004 | 12 |
2007 | 11 |
2011 | 10 |
2010 | 10 |
24.4.3 Filtering grouped data
HAVING vs WHERE
HAVING | WHERE |
---|---|
HAVING filters grouped records | WHERE filters individual records |
In what years was the average film duration over two hours? | What films were released in the year 2000? |
|
|
24.4.3.1 HAVING
Filtering grouped data can be especially handy when working with a large dataset. When working with thousands or even millions of rows, HAVING
will allow you to filter for just the group of data you want
Find out which countries (or country) have the most varied film certifications.
-- Select the country and distinct count of certification as certification_count
SELECT country, COUNT(DISTINCT certification) AS certification_count
FROM films
-- Group by country
GROUP BY country
-- Filter results to countries with more than 10 different certifications
HAVING COUNT(DISTINCT certification) > 10;
country | certification_count |
---|---|
USA | 12 |
24.4.3.2 HAVING and sorting
Writing a query showing what countries have the highest average film budgets.
-- Select the country and average_budget from films
SELECT country, AVG(budget) AS average_budget
FROM films
-- Group by country
GROUP BY country
-- Filter to countries with an average_budget of more than one billion
HAVING AVG(budget) > 1000000000
-- Order by descending order of the aggregated budget
ORDER BY average_budget DESC;
country | average_budget |
---|---|
South Korea | 1383960000 |
Hungary | 1260000000 |
South Korea and Hungary seem to have pricey films… or do they? Actually, these budgets are pretty standard for their local currency.
24.4.3.3 All together
You’ll write a query that returns the average budget and gross earnings for films each year after 1990 if the average budget is greater than 60 million.
-- Select the release_year, average budget and average gross
-- for films released after 1990 grouped by year
SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM films
WHERE release_year > 1990
GROUP BY release_year
-- Only years with an avg_budget of more than 60 million
HAVING AVG(budget) > 60000000
-- Order the results from highest to lowest average gross and limit to one
ORDER BY avg_gross DESC
LIMIT 1;
release_year | avg_budget | avg_gross |
---|---|---|
2005 | 70323938 | 41159143 |
# disconnect the sql connection
dbDisconnect(con)