Solution Practice Set 6
Get the solution to the exercise of viewing the information using aggregate functions queried from a database.
We'll cover the following...
Solution Practice Set 6
The database relationship model is reprinted below for reference.
Connect to the terminal below by clicking in the widget. Once connected, the command line prompt will show up. Enter or copy and paste the command ./DataJek/Lessons/quiz2.sh and wait for the MySQL prompt to start-up.
-- The lesson queries are reproduced below for convenient copy/paste into the terminal.-- Question # 1, Query 1SELECT Distributor, COUNT(Id) FROM MoviesGROUP BY DistributorHAVING COUNT(Id) > 1ORDER BY 2 DESC;-- Question # 1, Query 2SELECT y.Distributor, y.Name, y.CollectionInMillionsFROM (SELECT Distributor, COUNT(Id)FROM MoviesGROUP BY DistributorHAVING COUNT(Id) > 1ORDER BY 2 DESC) AS xINNER JOIN Movies yON y.Distributor = x.Distributor;-- Question # 1, Query 3SELECT y.Distributor, y.Name, y.CollectionInMillionsFROM (SELECT Distributor, COUNT(Id)FROM MoviesGROUP BY DistributorHAVING COUNT(Id) > 1ORDER BY 2 DESC) AS xINNER JOIN Movies yON y.Distributor = x.DistributorORDER BY y.Distributor, y.CollectionInMillions DESC;-- Question # 1, Query 4SET @dist_rank := 0, @current_dist := '';SELECT y.Distributor, y.Name, y.CollectionInMillions,@dist_rank := IF(@current_dist = y.Distributor, @dist_rank + 1, 1) AS distributor_rank,@current_dist := y.Distributor AS dummyFROM (SELECT Distributor, COUNT(Id)FROM MoviesGROUP BY DistributorHAVING COUNT(Id) > 1ORDER BY 2 DESC) AS xINNER JOIN Movies yON y.Distributor = x.DistributorORDER BY y.Distributor, y.CollectionInMillions DESC;-- Question # 1, Query 5SELECT Distributor, Name AS Movie, CollectionInMillions, distributor_rank AS RankFROM (SELECT y.Distributor, y.Name, y.CollectionInMillions,@dist_rank := IF(@current_dist = y.Distributor, @dist_rank + 1, 1) AS distributor_rank,@current_dist := y.Distributor AS dummyFROM (SELECT Distributor, COUNT(Id)FROM MoviesGROUP BY DistributorHAVING COUNT(Id) > 1ORDER BY 2 DESC) AS xINNER JOIN Movies yON y.Distributor = x.DistributorORDER BY y.Distributor, y.CollectionInMillions DESC) as zWHERE distributor_rank <= 2;-- Question # 1, Query 6( SELECT Distributor, Name AS Movie, CollectionInMillionsFROM MoviesWHERE Distributor = 'Paramount Pictures'ORDER BY CollectionInMillions DESCLIMIT 2)UNION( SELECT Distributor, Name AS Movie, CollectionInMillionsFROM MoviesWHERE Distributor = 'Warner Bros'ORDER BY CollectionInMillions DESCLIMIT 2);-- Question # 2, Query 1SELECT 'Total' AS Name, SUM(BudgetInMillions) AS BudgetInMillions, SUM(CollectionInMillions) AS CollectionInMillionsFROM MoviesUNIONSELECT 'Average', AVG(BudgetInMillions), AVG(CollectionInMillions) FROM MoviesUNIONSELECT 'Minimum', MIN(BudgetInMillions), MIN(CollectionInMillions) FROM MoviesUNIONSELECT 'Maximum', MAX(BudgetInMillions), MAX(CollectionInMillions) FROM Movies;-- Question # 2, Query 2SELECT 'Total' AS Name, SUM(BudgetInMillions) AS BudgetInMillions, SUM(CollectionInMillions) AS CollectionInMillionsFROM MoviesUNIONSELECT 'Average', AVG(BudgetInMillions), AVG(CollectionInMillions) FROM MoviesUNIONSELECT 'Minimum', MIN(BudgetInMillions), MIN(CollectionInMillions) FROM MoviesUNIONSELECT 'Maximum', MAX(BudgetInMillions), MAX(CollectionInMillions) FROM MoviesUNIONSELECT Name, BudgetInMillions, CollectionInMillions FROM Movies;-- Question # 3, Query 1SELECT @row_num := @row_num + 1 AS rownum, RunningTimeFROM Movies, (SELECT @row_num := 0)xORDER BY RunningTime;-- Question # 3, Query 2SELECT ROUND(AVG(RunningTime),1) As MedianFROM (SELECT @row_num := @row_num + 1 as rownum, RunningTimeFROM Movies, (SELECT @row_num := 0)xORDER BY RunningTime)mWHERE m.rownum IN( FLOOR( (@row_num+1)/2), CEIL( (@row_num+1)/2) );-- Question # 3, Query 3SELECT RunningTime, Count(*) AS FrequencyFROM MoviesGROUP BY RunningTimeORDER BY Frequency DESC;-- Question # 3, Query 4SELECT RunningTime, Count(*) AS FrequencyFROM MoviesGROUP BY RunningTimeHAVING Count(*) >= ALL (SELECT Count(*) FROM Movies GROUP BY RunningTime);-- Question # 3, Query 5SELECT 'Mean' AS 'Measure' , ROUND(AVG(RunningTime),1) AS 'Value'FROM MoviesUNIONSELECT 'Median', ROUND(AVG(RunningTime),1)FROM (SELECT @row_num := @row_num + 1 as rownum, RunningTimeFROM Movies, (SELECT @row_num := 0)xORDER BY RunningTime)mWHERE m.rownum IN( FLOOR( (@row_num+1)/2), CEIL( (@row_num+1)/2) )UNIONSELECT 'Mode', GROUP_CONCAT(RunningTime)FROM(SELECT RunningTime, Count(*) AS FrequencyFROM MoviesGROUP BY RunningTimeHAVING Count(*) >= ALL(SELECT Count(*) FROM Movies GROUP BY RunningTime))m;-- Question # 4, Query 1SELECT @av_budget := AVG(BudgetInMillions),@av_collection := AVG(CollectionInMillions),@av_time := AVG(RunningTime),@stdv_budget := STDDEV_SAMP(BudgetInMillions),@stdv_collection := STDDEV_SAMP(CollectionInMillions),@stdv_time := STDDEV_SAMP(RunningTime)FROM Movies;-- Question # 4, Query 2SELECT 'BudgetInMillions' AS Row, 'BudgetInMillions' AS Col,ROUND(SUM( ( BudgetInMillions - @av_budget ) * (BudgetInMillions - @av_budget) ) / ((COUNT(BudgetInMillions) -1) * (@stdv_budget * @stdv_budget)),2) AS CorrelationFROM MoviesUNIONSELECT 'BudgetInMillions' AS Row, 'CollectionInMillions' AS Col,ROUND( SUM( ( BudgetInMillions - @av_budget ) * (CollectionInMillions - @av_collection) ) / ((COUNT(BudgetInMillions) -1) * (@stdv_budget * @stdv_collection)),2) AS CorrelationFROM MoviesUNIONSELECT 'BudgetInMillions' AS Row, 'RunningTime' AS Col,ROUND( SUM( ( BudgetInMillions - @av_budget ) * (RunningTime - @av_time) ) / ((COUNT(BudgetInMillions) -1) * (@stdv_budget * @stdv_time)),2) AS CorrelationFROM MoviesUNIONSELECT 'CollectionInMillions' AS Row, 'BudgetInMillions' AS Col,ROUND( SUM( ( CollectionInMillions - @av_collection ) * (BudgetInMillions - @av_budget) ) / ((COUNT(CollectionInMillions) -1) * (@stdv_collection * @stdv_budget)),2) AS CorrelationFROM MoviesUNIONSELECT 'CollectionInMillions' AS Row, 'CollectionInMillions' AS Col,ROUND( SUM( ( CollectionInMillions - @av_collection ) * (CollectionInMillions - @av_collection) ) / ((COUNT(CollectionInMillions) -1) * (@stdv_collection * @stdv_collection)),2) AS CorrelationFROM MoviesUNIONSELECT 'CollectionInMillions' AS Row, 'RunningTime' AS Col,ROUND( SUM( ( CollectionInMillions - @av_collection ) * (RunningTime - @av_time) ) / ((COUNT(CollectionInMillions) -1) * (@stdv_collection * @stdv_time)),2) AS CorrelationFROM MoviesUNIONSELECT 'RunningTime' AS Row, 'BudgetInMillions' AS Col,ROUND( SUM( ( RunningTime - @av_time ) * (BudgetInMillions - @av_budget) ) / ((COUNT(RunningTime) -1) * (@stdv_time * @stdv_budget)),2) AS CorrelationFROM MoviesUNIONSELECT 'RunningTime' AS Row, 'CollectionInMillions' AS Col,ROUND( SUM( ( RunningTime - @av_time ) * (CollectionInMillions - @av_collection) ) / ((COUNT(RunningTime) -1) * (@stdv_time * @stdv_collection)),2) AS CorrelationFROM MoviesUNIONSELECT 'RunningTime' AS Row, 'RunningTime' AS Col,ROUND( SUM( ( RunningTime - @av_time ) * (RunningTime - @av_time) ) / ((COUNT(RunningTime) -1) * (@stdv_time * @stdv_time)),2) AS CorrelationFROM Movies;--Question # 4, Query 3SELECT Row,SUM(CASE WHEN Col='BudgetInMillions' THEN Correlation ELSE 0 END) AS Budget,SUM(CASE WHEN Col='CollectionInMillions' THEN Correlation ELSE 0 END) AS Collection,SUM(CASE WHEN Col='RunningTime' THEN Correlation ELSE 0 END) AS RunningTimeFROM (SELECT 'BudgetInMillions' AS Row, 'BudgetInMillions' AS Col,ROUND(SUM( ( BudgetInMillions - @av_budget ) * (BudgetInMillions - @av_budget) ) / ((COUNT(BudgetInMillions) -1) * (@stdv_budget * @stdv_budget)),2) AS CorrelationFROM MoviesUNIONSELECT 'BudgetInMillions' AS Row, 'CollectionInMillions' AS Col,ROUND( SUM( ( BudgetInMillions - @av_budget ) * (CollectionInMillions - @av_collection) ) / ((COUNT(BudgetInMillions) -1) * (@stdv_budget * @stdv_collection)),2) AS CorrelationFROM MoviesUNIONSELECT 'BudgetInMillions' AS Row, 'RunningTime' AS Col,ROUND( SUM( ( BudgetInMillions - @av_budget ) * (RunningTime - @av_time) ) / ((COUNT(BudgetInMillions) -1) * (@stdv_budget * @stdv_time)),2) AS CorrelationFROM MoviesUNIONSELECT 'CollectionInMillions' AS Row, 'BudgetInMillions' AS Col,ROUND( SUM( ( CollectionInMillions - @av_collection ) * (BudgetInMillions - @av_budget) ) / ((COUNT(CollectionInMillions) -1) * (@stdv_collection * @stdv_budget)),2) AS CorrelationFROM MoviesUNIONSELECT 'CollectionInMillions' AS Row, 'CollectionInMillions' AS Col,ROUND( SUM( ( CollectionInMillions - @av_collection ) * (CollectionInMillions - @av_collection) ) / ((COUNT(CollectionInMillions) -1) * (@stdv_collection * @stdv_collection)),2) AS CorrelationFROM MoviesUNIONSELECT 'CollectionInMillions' AS Row, 'RunningTime' AS Col,ROUND(SUM( ( CollectionInMillions - @av_collection ) * (RunningTime - @av_time) ) / ((COUNT(CollectionInMillions) -1) * (@stdv_collection * @stdv_time)),2) AS CorrelationFROM MoviesUNIONSELECT 'RunningTime' AS Row, 'BudgetInMillions' AS Col,ROUND( SUM( ( RunningTime - @av_time ) * (BudgetInMillions - @av_budget) ) / ((COUNT(RunningTime) -1) * (@stdv_time * @stdv_budget)),2) AS CorrelationFROM MoviesUNIONSELECT 'RunningTime' AS Row, 'CollectionInMillions' AS Col,ROUND( SUM( ( RunningTime - @av_time ) * (CollectionInMillions - @av_collection) ) / ((COUNT(RunningTime) -1) * (@stdv_time * @stdv_collection)),2) AS CorrelationFROM MoviesUNIONSELECT 'RunningTime' AS Row, 'RunningTime' AS Col,ROUND( SUM( ( RunningTime - @av_time ) * (RunningTime - @av_time) ) / ((COUNT(RunningTime) -1) * (@stdv_time * @stdv_time)),2) AS CorrelationFROM Movies) TGROUP BY RowORDER BY Row;-- Question # 5, Query 1SELECT Distributor, SUM(CollectionInMillions)FROM MoviesGROUP BY DistributorORDER BY 2 DESCLIMIT 3;-- Question # 5, Query 2CREATE TEMPORARY TABLE T1(SELECT Distributor, SUM(CollectionInMillions)FROM MoviesGROUP BY DistributorORDER BY 2 DESCLIMIT 3);-- Question # 5, Query 3CREATE TEMPORARY TABLE T2SELECT * FROM T1;SELECT * FROM T1UNIONSELECT 'All Others' as Distributor, SUM(CollectionInMillions)FROM MoviesWHERE Distributor NOT IN (SELECT DistributorFROM T2);
Question # 1
Find the top two movies of distributors who have more than one movie to their name.
First we’ll check our Movies table for information on Distributors:
SELECT Distributor, COUNT(Id) FROM Movies
GROUP BY Distributor
HAVING COUNT(Id) > 1
ORDER BY 2 DESC;
We have two distributors with more than one movie in our table; Paramount Pictures and Warner Bros with three movies each.
Next we will find the movies of these distributors. We need to join the results of the previous query with the Movies table:
SELECT y.Distributor, y.Name, y.CollectionInMillions
FROM (
SELECT Distributor, COUNT(Id)
FROM Movies
GROUP BY Distributor
HAVING COUNT(Id) > 1
ORDER BY 2 DESC
) AS x
INNER JOIN Movies y
ON y.Distributor = x.Distributor;
This query lists all the movies from distributors who meet our criterion of having more than one movie in the table. We are a step closer to finding the top two movies of each distributor.
Now we need ORDER BY clause to put the results in order according to CollectionInMillions:
SELECT y.Distributor, y.Name, y.CollectionInMillions
FROM (
SELECT Distributor, COUNT(Id)
FROM Movies
GROUP BY Distributor
HAVING COUNT(Id) > 1
ORDER BY 2 DESC
) AS x
INNER JOIN Movies y
ON y.Distributor = x.Distributor
ORDER BY y.Distributor, y.CollectionInMillions DESC;
To find the top two movies, we need to rank the rows returned by the previous query. For older versions of MySQL which do not support the ROW_NUMBER() function, it can be emulated using session variables.
The @current_dist variable is updated in each iteration as @current_dist := y.Distributor AS dummy. The variable @dist_rank is incremented if @current_dist is the same, otherwise it is reset to 1.
@dist_rank := IF( @current_dist = y.Distributor, @dist_rank + 1, 1)
We will initialize the session variables and use them in the previously created query as follows:
SET @dist_rank := 0, @current_dist := '';
SELECT y.Distributor, y.Name, y.CollectionInMillions,
@dist_rank := IF(@current_dist = y.Distributor, @dist_rank + 1, 1) AS distributor_rank,
@current_dist := y.Distributor AS dummy
FROM (
SELECT Distributor, COUNT(Id)
FROM Movies
GROUP BY Distributor
HAVING COUNT(Id) > 1
ORDER BY 2 DESC
) AS x
INNER JOIN Movies y
ON y.Distributor = x.Distributor
ORDER BY y.Distributor, y.CollectionInMillions DESC;
Two more columns are added to the result set and we have now numbered the rows.
The last step is selecting the top 2 rows based on the ranking. We will use the above query as a subquery:
SELECT Distributor, Name AS Movie,
CollectionInMillions, distributor_rank AS Rank
FROM (
SELECT y.Distributor, y.Name, y.CollectionInMillions,
@dist_rank := IF(@current_dist = y.Distributor, @dist_rank + 1, 1) AS distributor_rank,
@current_dist := y.Distributor AS dummy
FROM (
SELECT Distributor, COUNT(Id)
FROM Movies
GROUP BY Distributor
HAVING COUNT(Id) > 1
ORDER BY 2 DESC
) AS x
INNER JOIN Movies y
ON y.Distributor = x.Distributor
ORDER BY y.Distributor, y.CollectionInMillions DESC
) as z
WHERE distributor_rank <= 2;
This is not the most optimal query because of the number of sub-queries and joins but it serves our purpose of dynamically selecting groups with more than one entry in the table and then finding the top two rows from those groups. A much cleaner and better performing query uses UNION as follows:
(
SELECT Distributor, Name AS Movie, CollectionInMillions
FROM Movies
WHERE Distributor = 'Paramount Pictures'
ORDER BY CollectionInMillions DESC
LIMIT 2
)
UNION
(
SELECT Distributor, Name AS Movie, CollectionInMillions
FROM Movies
WHERE Distributor = 'Warner Bros'
ORDER BY CollectionInMillions DESC
LIMIT 2
);
Here we are ordering the movies according to their CollectionInMillions and then limiting those results to top two. UNION then combines all the results. The drawback with this query is that we have to manually specify the Distributors we want in the result set. This is not a problem with only two distributors but it’s not hard to imagine what will happen if the number of distributors who meet our criterion is large.
Using the analytic ROW_NUMBER() function, this question can be solved without using any session variables. The sub-query in our solution becomes:
SELECT Distributor, Name AS Movie, CollectionInMillions,
ROW_NUMBER() OVER(PARTITION BY Distributor ORDER BY RevenueInMillions DESC) AS movie_rank
FROM Movies
The above query will return the Movies tables numbered according to distributors. Next step is to select only those who have more than one movie to their credit. We will join the results of the query above with the one we created in the first step as follows:
SELECT x.Distributor, Name AS Movie, CollectionInMillions, movie_rank
FROM
(
SELECT Distributor, Name AS Movie, CollectionInMillions,
ROW_NUMBER() OVER(PARTITION BY Distributor ORDER BY RevenueInMillions DESC) as movie_rank
FROM Movies) x
JOIN
(SELECT Distributor, Count(MovieID)
FROM Movies
GROUP BY Distributor
HAVING COUNT(MovieID) > 1
) y
ON x.Distributor = y.Distributor
)
WHERE movie_rank <=2;
Question # 2
Find the total, average, minimum and maximum of the production budget and revenue earned from the Movies table and append the summary data to the top of the table.
Summary tables provide a quick look into the data when the tables are large or contain a lot of numerical values. Summary statistics include sum, average, maximum, and minimum. Maximum and minimum show the degree of variability of the values in the columns.
We can calculate summary statistics for the BudgetInMillions and CollectionInMillions columns using four separate queries and join the results using UNION clause as follows:
...