Chapter 24 Intermediate SQL

24.1 Selecting Data

24.1.1 Querying a database

Setup films database

# Creating a new database
# Connect to the default postgres database
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "")
knitr::opts_chunk$set(connection = "con")

# Loading data
films <- read.csv("data/films.csv", header = FALSE, na.strings = "")
colnames(films) <- c("id", "title", "release_year", "country", "duration", "language", "certification", "gross", "budget")

people <- read.csv("data/people.csv", header = FALSE, col.names = c("id", "name", "birthdate", "deathdate"), na.strings = "") 

reviews <- read.csv("data/reviews.csv", header = FALSE, col.names = c("id", "film_id","num_user", "num_critic", "imdb_score", "num_votes", "facebook_likes"), na.strings = "")

roles <- read.csv("data/roles.csv", header = FALSE, col.names = c("id", "film_id", "person_id", "role"), na.strings = "")

# Create database tables
dbWriteTable(con, "films", films)
dbWriteTable(con, "people", people)
dbWriteTable(con, "reviews", reviews)
dbWriteTable(con, "roles", roles)

# List database tables
dbListTables(con)
## [1] "films"   "people"  "reviews" "roles"

24.1.1.1 COUNT & DISTINCT

COUNT

  • Counts the number of records with a value in a field

    • COUNT(field_name) counts values in a field, multiple fields: separate by ,

    • COUNT(*) counts records in a table: total observation

  • Use an AS for clarity

  • Includes duplicates

DISTINCT

  • DISTINCT removes duplicates to return only unique values

  • Combine COUNT() with DISTINCT to count unique values

-- Count the number of records in the people table
SELECT COUNT(*) AS count_records
FROM people;
Table 24.1: 1 records
count_records
8397
-- Count the number of birthdates in the people table
SELECT COUNT(birthdate) AS count_birthdate
FROM people;
Table 24.2: 1 records
count_birthdate
6152
-- Count the records for languages and countries represented in the films table
SELECT COUNT(language) AS count_languages, COUNT(country) AS count_countries
FROM films;
Table 24.3: 1 records
count_languages count_countries
4957 4966

SELECT DISTINCT

-- Return the unique countries from the films table
SELECT DISTINCT country
FROM films;
Table 24.4: Displaying records 1 - 10
country
USA
Germany
Japan
Denmark
UK
Italy
France
West Germany
Sweden
Soviet Union
-- Count the distinct countries from the films table
SELECT COUNT(DISTINCT country) AS count_distinct_countries
FROM films
Table 24.5: 1 records
count_distinct_countries
64

24.1.2 Query execution

Order of execution

  • SQL is not processed in its written order

    • e.g., SQL needs to know where to SELECT data FROM before it can LIMIT the results.
  • Good to know processing order for debugging and aliasing

Most common errors

  • Misspelling

  • Incorrect capitalization

  • Incorrect or missing punctuation, especially commas

24.1.3 SQL style

Holywell’s style guide

Other notices:

  • Semicolon ;

    • Indicates the end of a query

    • Easier to translate between SQL flavors

  • Dealing with non-standard field names

    • Put non-standard field names in double-quotes: "release year"

      SELECT title, "release year", country

24.2 Filtering Records

24.2.1 Filtering numbers

WHERE

  • Comparison operators

    • > Greater than or after

    • < Less than or before

    • = Equal to

    • >= Greater than or equal to

    • <= Less than or equal to

    • <> Not equal to 👈

  • WHERE with strings

    • Use single-quotes around strings we want to filter: WHERE country = 'Japan'

24.2.1.1 WHERE

Using WHERE with numbers

In this case, you’ll want to filter your data to a specific budget range.

-- Select film_ids and imdb_score with an imdb_score over 7.0
SELECT film_id, imdb_score
FROM reviews
WHERE imdb_score > 7;
Table 24.6: Displaying records 1 - 10
film_id imdb_score
3934 7.1
74 7.6
1254 8.0
4841 8.1
3252 7.2
1181 7.3
3929 7.1
3298 7.4
2744 7.4
4707 7.4
-- Select imdb_score with an imdb_score over 7.0
SELECT COUNT(imdb_score)
FROM reviews
WHERE imdb_score > 7;
Table 24.7: 1 records
COUNT(imdb_score)
1536
-- Select film_ids and facebook_likes for ten records with less than 1000 likes
SELECT film_id, facebook_likes
FROM reviews
WHERE facebook_likes < 1000
LIMIT 10;
Table 24.8: Displaying records 1 - 10
film_id facebook_likes
3405 0
478 491
74 930
740 0
2869 689
1181 0
2020 0
2312 912
1820 872
831 975
-- Count the records with at least 100,000 votes
SELECT COUNT(num_votes) AS films_over_100K_votes
FROM reviews
WHERE num_votes >= 100000;
Table 24.9: 1 records
films_over_100K_votes
1211

Using WHERE with text

WHERE can also filter string values.

-- Count the Spanish-language films
SELECT COUNT(language) AS count_spanish
FROM films
WHERE language = 'Spanish';
Table 24.10: 1 records
count_spanish
40

There are 40 Spanish-language films in this table.

24.2.2 Multiple criteria

Multiple criteria

  • OR , AND , BETWEEN
-- OR
SELECT *
FROM coats
WHERE color = 'yellow' OR length = 'short';

-- AND
SELECT *
FROM coats
WHERE color = 'yellow' AND length = 'short';

-- BETWEEN, AND: BETWEEN 上下界都包含(1,5也算) 👈
SELECT *
FROM coats
WHERE buttons BETWEEN 1 AND 5;

-- AND, OR: Enclose individual clauses in parentheses👈
SELECT title
FROM films
WHERE (release_year = 1994 OR release_year = 1995)
    AND (certification = 'PG' OR certification = 'R');
    
-- BETWEEN, AND, OR
SELECT title
FROM films
WHERE release_year
    BETWEEN 1994 AND 2000 AND country='UK';

24.2.2.1 AND

The following exercises combine AND and OR with the WHERE clause.

Combining conditions with AND will prove to be very useful when we want our query to return a specific subset of records.

-- Select the title and release_year for all German-language films released before 2000
SELECT title, release_year
FROM films
WHERE (language = 'German')
    AND (release_year < 2000);
Table 24.11: 6 records
title release_year
Metropolis 1927
Pandora’s Box 1929
The Torture Chamber of Dr. Sadism 1967
Das Boot 1981
Run Lola Run 1998
Aimee & Jaguar 1999
-- Update the query to see all German-language films released after 2000
SELECT title, release_year
FROM films
WHERE (release_year > 2000)
    AND (language = 'German');
Table 24.12: Displaying records 1 - 10
title release_year
Good Bye Lenin! 2003
Downfall 2004
Summer Storm 2004
The Lives of Others 2006
The Baader Meinhof Complex 2008
The Wave 2008
Cargo 2009
Soul Kitchen 2009
The White Ribbon 2009
3 2010

Select all details for German-language films released after 2000 but before 2010 using only WHERE and AND.

-- Select all records for German-language films released after 2000 and before 2010
SELECT *
FROM films
WHERE (language = 'German')
    AND (release_year > 2000 AND release_year < 2010);
Table 24.13: 9 records
id title release_year country duration language certification gross budget
1952 Good Bye Lenin! 2003 Germany 121 German R 4063859 4800000
2130 Downfall 2004 Germany 178 German R 5501940 13500000
2224 Summer Storm 2004 Germany 98 German R 95016 2700000
2709 The Lives of Others 2006 Germany 137 German R 11284657 2000000
3100 The Baader Meinhof Complex 2008 Germany 184 German R 476270 20000000
3143 The Wave 2008 Germany 107 German NA NA 5000000
3220 Cargo 2009 Switzerland 112 German NA NA 4500000
3346 Soul Kitchen 2009 Germany 99 German NA 274385 4000000
3412 The White Ribbon 2009 Germany 144 German R 2222647 12000000

24.2.2.2 OR

This time you’ll write a query to get the title and release_year of films released in 1990 or 1999, which were in English or Spanish and took in more than $2,000,000 gross.

-- Find the title and year of films from the 1990 or 1999
SELECT title, release_year
FROM films
WHERE (release_year = 1990 OR release_year = 1999)
    -- Add a filter to see only English or Spanish-language films
    AND (language = 'English' OR language = 'Spanish')
    -- Filter films with more than $2,000,000 gross
    AND (gross > 2000000);
Table 24.14: Displaying records 1 - 10
title release_year
Arachnophobia 1990
Back to the Future Part III 1990
Child’s Play 2 1990
Dances with Wolves 1990
Days of Thunder 1990
Dick Tracy 1990
Die Hard 2 1990
Edward Scissorhands 1990
Flatliners 1990
Ghost 1990

24.2.2.3 BETWEEN

Let’s use BETWEEN with AND on the films database to get the title and release_year of all Spanish-language films released between 1990 and 2000 (inclusive) with budgets over $100 million.

-- Select the title and release_year for films released between 1990 and 2000
SELECT title, release_year
FROM films
WHERE (release_year BETWEEN 1990 AND 2000)
    -- Narrow down your query to films with budgets > $100 million
    AND (budget > 100000000)
    -- Restrict the query to only Spanish-language films
    AND (language = 'Spanish');
Table 24.15: 1 records
title release_year
Tango 1998
SELECT title, release_year
FROM films
WHERE release_year BETWEEN 1990 AND 2000
    AND budget > 100000000
    -- Amend the query to include Spanish or French-language films
    AND (language = 'Spanish' OR language = 'French');
Table 24.16: 2 records
title release_year
Les couloirs du temps: Les visiteurs II 1998
Tango 1998

24.2.3 Filtering text

Filter a pattern rather than specific text.

  • LIKE

    eg., WHERE name LIKE '%r', WHERE name LIKE '___t%'

  • NOT LIKE

    eg., WHERE name NOT LIKE 'A.%'

  • IN

    eg., WHERE release_year IN (1920, 1930, 1940),

    WHERE country IN ('Germany', 'France')

Pattern

  • % match zero, one, or many characters

  • _ match a single character

24.2.3.1 LIKE & NOT LIKE

The LIKE and NOT LIKE operators can be used to find records that either match or do not match a specified pattern, respectively. They can be coupled with the wildcards % and _.

This is useful when you want to filter text, but not to an exact word.

-- Select the names that start with B
SELECT name
FROM people
WHERE name LIKE 'B%';
Table 24.17: Displaying records 1 - 10
name
B.J. Novak
Babak Najafi
Babar Ahmed
Bahare Seddiqi
Bai Ling
Bailee Madison
Balinese Tari Legong Dancers
Bálint Péntek
Baltasar Kormákur
Balthazar Getty
SELECT name
FROM people
-- Select the names that have r as the second letter
WHERE name LIKE '_r%'
Table 24.18: Displaying records 1 - 10
name
Ara Celi
Aramis Knight
Arben Bajraktaraj
Arcelia Ramírez
Archie Kao
Archie Panjabi
Aretha Franklin
Ari Folman
Ari Gold
Ari Graynor
SELECT name
FROM people
-- Select names that don't start with A
WHERE name NOT LIKE 'A%'
Table 24.19: Displaying records 1 - 10
name
50 Cent
Álex Angulo
Álex de la Iglesia
Ángela Molina
B.J. Novak
Babak Najafi
Babar Ahmed
Bahare Seddiqi
Bai Ling
Bailee Madison
SELECT COUNT(name) AS count_name_beginA
FROM people
-- Select names that don't start with A
WHERE name NOT LIKE 'A%'
Table 24.20: 1 records
count_name_beginA
7768

24.2.3.2 WHERE IN

You can query multiple conditions using the IN operator and a set of parentheses. It is a valuable piece of code that helps us keep our queries clean and concise.

-- Find the title and release_year for all films over two hours in length released in 1990 and 2000
SELECT title, release_year
FROM films
WHERE (duration > 120)
    AND (release_year IN (1990, 2000));
Table 24.21: Displaying records 1 - 10
title release_year
Dances with Wolves 1990
Die Hard 2 1990
Ghost 1990
Goodfellas 1990
Mo’ Better Blues 1990
Pretty Woman 1990
The Godfather: Part III 1990
The Hunt for Red October 1990
All the Pretty Horses 2000
Almost Famous 2000
-- Find the title and language of all films in English, Spanish, and French
SELECT title, language
FROM films
WHERE language IN ('English', 'Spanish', 'French');
Table 24.22: Displaying records 1 - 10
title language
The Broadway Melody English
Hell’s Angels English
A Farewell to Arms English
42nd Street English
She Done Him Wrong English
It Happened One Night English
Top Hat English
Modern Times English
The Charge of the Light Brigade English
Snow White and the Seven Dwarfs English
-- Find the title, certification, and language all films certified NC-17 or R that are in English, Italian, or Greek
SELECT title, certification, language
FROM films
WHERE (certification IN ('NC-17', 'R'))
    AND (language IN ('English', 'Italian', 'Greek'));
Table 24.23: Displaying records 1 - 10
title certification language
Psycho R English
A Fistful of Dollars R Italian
Rosemary’s Baby R English
The Wild Bunch R English
Catch-22 R English
Cotton Comes to Harlem R English
The Ballad of Cable Hogue R English
The Conformist R Italian
Woodstock R English
Sweet Sweetback’s Baadasssss Song R English

24.2.3.3 Combine filtering & selecting

How many 90’s films we have in our dataset that would be suitable for English-speaking teens?

(You will be using DISTINCT here too because, surprise, there are two movies named ‘Hamlet’ in this dataset.)

-- Count the unique titles
SELECT COUNT(DISTINCT title) AS nineties_english_films_for_teens
FROM films
-- Filter to release_years to between 1990 and 1999
WHERE (release_year BETWEEN 1990 AND 1999)
-- Filter to English-language films
    AND (language = 'English')
-- Narrow it down to G, PG, and PG-13 certifications
    AND (certification IN ('G', 'PG', 'PG-13'));
Table 24.24: 1 records
nineties_english_films_for_teens
310

24.2.4 NULL values

null : Missing values

  • COUNT(field_name) includes only non-missing values

  • COUNT(*) includes missing values

Use IS NULL or IS NOT NULL to:

  • Identify missing values

  • Select missing values

  • Exclude missing values

-- List all film titles with missing budgets
SELECT title AS no_budget_info
FROM films
WHERE budget IS NULL 
Table 24.25: Displaying records 1 - 10
no_budget_info
Pandora’s Box
The Prisoner of Zenda
The Blue Bird
Bambi
State Fair
Open Secret
Deadline - U.S.A.
Ordet
The Party’s Over
The Torture Chamber of Dr. Sadism
-- Count the number of films we have language data for
SELECT COUNT(title) AS count_language_known
FROM films
WHERE language IS NOT NULL;
Table 24.26: 1 records
count_language_known
4957

24.3 Aggregate Functions

24.3.1 Summarizing data

Aggregate functions

AVG() , SUM() , MIN() , MAX() , COUNT()

  • Numerical fields only

    • AVG(), SUM()
  • Various data types

    • MIN() , MAX() , COUNT()

    • MIN() <-> MAX() in non-numerical data

      • A <-> Z

      • 1715 <-> 2022

  • Aliasing when summarizing

Perhaps you’d like to know how old the oldest film in the films table is, what the most expensive film is, or how many films you have listed.

-- Query the sum of film durations
SELECT SUM(duration) AS total_duration
FROM films;
Table 24.27: 1 records
total_duration
534882
-- Calculate the average duration of all films
SELECT AVG(duration) AS average_duration
FROM films;
Table 24.28: 1 records
average_duration
108
-- Find the latest release_year
SELECT MAX(release_year) AS latest_year
FROM films;
Table 24.29: 1 records
latest_year
2016
-- Find the duration of the shortest film
SELECT MIN(duration) AS shortest_film
FROM films;
Table 24.30: 1 records
shortest_film
7

24.3.2 Summarizing subsets

  • Using WHERE with aggregate functions

  • ROUND() : Round a number to a specified decimal

    • ROUND(number_to_round, decimal_places)

    • whole number: ROUND(number_to_round, 0)

    • 小數點第二位: ROUND(number_to_round, 2)

    • to thousands: ROUND(number_to_round, -3)

      • 123456 ⟶ 12000, negative parameter

24.3.2.1 Aggregate functions & WHERE

In your film-industry role, as an example, you may like to summarize each certification category to compare how they each perform or if one certification has a higher average budget than another.

-- Calculate the sum of gross from the year 2000 or later
SELECT SUM(gross) AS total_gross
FROM films
WHERE release_year >= 2000;
Table 24.31: 1 records
total_gross
150900926358
-- Calculate the average gross of films that start with A
SELECT AVG(gross) AS avg_gross_A
FROM films
WHERE title LIKE 'A%';
Table 24.32: 1 records
avg_gross_A
47893236
-- Calculate the lowest gross film in 1994
SELECT MIN(gross) AS lowest_gross
FROM films
WHERE release_year = 1994;
Table 24.33: 1 records
lowest_gross
125169
-- Calculate the highest gross film released between 2000-2012
SELECT MAX(gross) AS highest_gross
FROM films
WHERE release_year BETWEEN 2000 AND 2012;
Table 24.34: 1 records
highest_gross
760505847

24.3.2.2 ROUND()

Aggregate functions work great with numerical values; however, these results can sometimes get unwieldy when dealing with long decimal values. SQL provides you with the ROUND() function to tame these long decimals.

-- Round the average number of facebook_likes to one decimal place
SELECT ROUND(AVG(facebook_likes), 1) AS avg_facebook_likes
FROM reviews;
Table 24.35: 1 records
avg_facebook_likes
7803

A useful thing you can do with ROUND() is have a negative number as the decimal place parameter. This can come in handy if your manager only needs to know the average number of facebook_likes to the hundreds since granularity below one hundred likes won’t impact decision making.

-- Calculate the average budget rounded to the thousands, 39902000
SELECT ROUND(AVG(budget), -3) AS avg_budget_thousands
FROM films;
Table 24.36: 1 records
avg_budget_thousands
39902826

The ROUND() function is very handy when making financial calculations to get a top-level view or specify to the penny or cent.

24.3.3 Aliasing and arithmetic

Arithmetic

+ , - , * , and /

  • When dividing, SQL assumes that we want to get an integer back if we divide an integer by an integer.We can add decimal places to our numbers if we want more precision

    • SELECT (4 / 3); ⟶ 1

    • SELECT (4.0 / 3.0); ⟶ 1.333…

  • Aliasing with arithmetic & function

Aggregate functions vs. arithmetic

  • aggregate functions: perform their operations on the fields vertically

  • arithmetic: adds up the records horizontally

-- default
SELECT (2 / 10);
Table 24.37: 1 records
(2 / 10)
0
-- add decimal
SELECT (2.0 / 10.0);
Table 24.38: 1 records
(2.0 / 10.0)
0.2

24.3.3.1 Aliasing with functions

Aliasing can be a lifesaver, especially as we start to do more complex SQL queries with multiple criteria. Aliases help you keep your code clean and readable.

For example, if you want to find the MAX() value of several fields without aliasing, you’ll end up with the result with several columns called max and no idea which is which. You can fix this with aliasing.

-- Calculate the title and duration_hours from films
SELECT title, (duration / 60.0) AS duration_hours
FROM films;
Table 24.39: Displaying records 1 - 10
title duration_hours
Intolerance: Love’s Struggle Throughout the Ages 2.05
Over the Hill to the Poorhouse 1.83
The Big Parade 2.52
Metropolis 2.42
Pandora’s Box 1.83
The Broadway Melody 1.67
Hell’s Angels 1.60
A Farewell to Arms 1.32
42nd Street 1.48
She Done Him Wrong 1.10
-- Calculate the percentage of people who are no longer alive
SELECT COUNT(deathdate) * 100.0 / COUNT(*) AS percentage_dead
FROM people;
Table 24.40: 1 records
percentage_dead
9.37
-- Find the number of decades in the films table
SELECT (MAX(release_year) - MIN(release_year)) / 10.0 AS number_of_decades
FROM films;
Table 24.41: 1 records
number_of_decades
10

films table covers films released over one hundred years!

24.3.3.2 Rounding results

In the previous exercise, many of the results were inconveniently long. Update the query by adding ROUND()` around the calculation.

-- Round duration_hours to two decimal places
SELECT title, ROUND((duration / 60.0), 2) AS duration_hours
FROM films;
Table 24.42: Displaying records 1 - 10
title duration_hours
Intolerance: Love’s Struggle Throughout the Ages 2.05
Over the Hill to the Poorhouse 1.83
The Big Parade 2.52
Metropolis 2.42
Pandora’s Box 1.83
The Broadway Melody 1.67
Hell’s Angels 1.60
A Farewell to Arms 1.32
42nd Street 1.48
She Done Him Wrong 1.10

24.4 Sorting and Grouping

24.4.1 Sorting results

ORDER BY : ASC, DESC

  • Text ASC : Alphabetically (A-Z)

  • ORDER BY multiple fields: ORDER BY field_one, field_two

  • Different orders: ORDER BY birthdate, name DESC;

24.4.1.1 ORDER BY

Sorting single fields

-- Select name from people and sort alphabetically
SELECT name
FROM people
ORDER BY name ASC;
Table 24.43: Displaying records 1 - 10
name
50 Cent
A. Michael Baldwin
A. Raven Cruz
A.J. Buckley
A.J. DeLucia
A.J. Langer
AJ Michalka
Aaliyah
Aaron Ashmore
Aaron Hann
-- Select the title and duration from longest to shortest film
SELECT title, duration
FROM films
ORDER BY duration DESC;
Table 24.44: Displaying records 1 - 10
title duration
Carlos 334
Blood In, Blood Out 330
Heaven’s Gate 325
The Legend of Suriyothai 300
Das Boot 293
Apocalypse Now 289
The Company 286
Gods and Generals 280
Gettysburg 271
Arn: The Knight Templar 270

Sorting multiple fields

It will sort by the first field specified, then sort by the next, and so on.

-- Select the release year, duration, and title sorted by release year and duration
SELECT release_year, duration, title
FROM films
WHERE release_year IS NOT NULL
ORDER BY release_year, duration;
Table 24.45: Displaying records 1 - 10
release_year duration title
1916 123 Intolerance: Love’s Struggle Throughout the Ages
1920 110 Over the Hill to the Poorhouse
1925 151 The Big Parade
1927 145 Metropolis
1929 100 The Broadway Melody
1929 110 Pandora’s Box
1930 96 Hell’s Angels
1932 79 A Farewell to Arms
1933 66 She Done Him Wrong
1933 89 42nd Street
-- Select the certification, release year, and title sorted by certification and release year
-- Ordered first by certification (alphabetically) and second by release year, starting with the most recent year
SELECT certification, release_year, title
FROM films
WHERE certification IS NOT NULL
ORDER BY certification ASC, release_year DESC;
Table 24.46: Displaying records 1 - 10
certification release_year title
Approved 1967 In Cold Blood
Approved 1967 Point Blank
Approved 1967 You Only Live Twice
Approved 1966 A Funny Thing Happened on the Way to the Forum
Approved 1966 A Man for All Seasons
Approved 1966 Batman: The Movie
Approved 1966 The Good, the Bad and the Ugly
Approved 1966 Torn Curtain
Approved 1965 Major Dundee
Approved 1965 Thunderball

24.4.2 Grouping data

24.4.2.1 GROUP BY

  • GROUP BY multiple fields: GROUP BY certification, language

GROUP BY single fields

-- Find the release_year and film_count of each year
SELECT release_year, COUNT(*) AS film_count
FROM films
WHERE release_year IS NOT NULL
GROUP BY release_year;
Table 24.47: Displaying records 1 - 10
release_year film_count
1916 1
1920 1
1925 1
1927 1
1929 2
1930 1
1932 1
1933 2
1934 1
1935 1
-- Find the release_year and average duration of films for each year
SELECT release_year, AVG(duration) AS avg_duration
FROM films
WHERE release_year IS NOT NULL
GROUP BY release_year;
Table 24.48: Displaying records 1 - 10
release_year avg_duration
1916 123.0
1920 110.0
1925 151.0
1927 145.0
1929 105.0
1930 96.0
1932 79.0
1933 77.5
1934 65.0
1935 81.0

Using GROUP BY with a time or date field such as release_year can help us identify trends such as a period of time where movies were really short!

GROUP BY multiple fields

Look at the maximum budget for each country in each year

-- Find the release_year, country, and max_budget, then group and order by release_year and country
SELECT release_year, country, MAX(budget) AS max_budget
FROM films
WHERE release_year IS NOT NULL
GROUP BY release_year, country
ORDER BY release_year, country;
Table 24.49: Displaying records 1 - 10
release_year country max_budget
1916 USA 385907
1920 USA 100000
1925 USA 245000
1927 Germany 6000000
1929 Germany NA
1929 USA 379000
1930 USA 3950000
1932 USA 800000
1933 USA 439000
1934 USA 325000

Answering business questions

Which release_year had the most language diversity?

SELECT release_year, COUNT(DISTINCT language) AS num_language
FROM films
GROUP BY release_year
ORDER BY num_language DESC;
Table 24.50: Displaying records 1 - 10
release_year num_language
2006 16
2015 15
2005 14
2013 13
2008 13
2009 12
2004 12
2007 11
2011 10
2010 10

24.4.3 Filtering grouped data

HAVING vs WHERE

HAVING WHERE
HAVING filters grouped records WHERE filters individual records
In what years was the average film duration over two hours? What films were released in the year 2000?
SELECT release_year
FROM films
GROUP BY release_year
HAVING AVG(duration) > 120;
SELECT title
FROM films
WHERE release_year = 2000;

24.4.3.1 HAVING

Filtering grouped data can be especially handy when working with a large dataset. When working with thousands or even millions of rows, HAVING will allow you to filter for just the group of data you want

Find out which countries (or country) have the most varied film certifications.

-- Select the country and distinct count of certification as certification_count
SELECT country, COUNT(DISTINCT certification) AS certification_count
FROM films
-- Group by country
GROUP BY country
-- Filter results to countries with more than 10 different certifications
HAVING COUNT(DISTINCT certification) > 10;
Table 24.51: 1 records
country certification_count
USA 12

24.4.3.2 HAVING and sorting

Writing a query showing what countries have the highest average film budgets.

-- Select the country and average_budget from films
SELECT country, AVG(budget) AS average_budget
FROM films
-- Group by country
GROUP BY country
-- Filter to countries with an average_budget of more than one billion
HAVING AVG(budget) > 1000000000
-- Order by descending order of the aggregated budget
ORDER BY average_budget DESC;
Table 24.52: 2 records
country average_budget
South Korea 1383960000
Hungary 1260000000

South Korea and Hungary seem to have pricey films… or do they? Actually, these budgets are pretty standard for their local currency.

24.4.3.3 All together

You’ll write a query that returns the average budget and gross earnings for films each year after 1990 if the average budget is greater than 60 million.

-- Select the release_year, average budget and average gross
-- for films released after 1990 grouped by year
SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM films
WHERE release_year > 1990
GROUP BY release_year
-- Only years with an avg_budget of more than 60 million
HAVING AVG(budget) > 60000000
-- Order the results from highest to lowest average gross and limit to one
ORDER BY avg_gross DESC
LIMIT 1;
Table 24.53: 1 records
release_year avg_budget avg_gross
2005 70323938 41159143
# disconnect the sql connection
dbDisconnect(con)