Aggregate Expressions
Learn how to use SQL to calculate aggregates over a group or rows.
Overview
So far, we’ve seen how to use GROUP BY
to calculate aggregates such as min, max, sum, and count at different levels using the GROUP BY
clause. However, there is another way to calculate aggregates in SQL without using GROUP BY
.
To demonstrate aggregations, we are going to use a table with temperature data from two cities:
CREATE TABLE temperatures (day DATE,city VARCHAR(30),temperature INT);INSERT INTO temperatures (day, city, temperature)VALUES('2021-01-01', 'NY', 10),('2021-01-02', 'NY', 12),('2021-01-03', 'NY', 13),('2021-01-04', 'NY', 14),('2021-01-05', 'NY', 18),('2021-01-06', 'NY', 15),('2021-01-07', 'NY', 16),('2021-01-08', 'NY', 17),('2021-01-01', 'LA', 22),('2021-01-02', 'LA', 21),('2021-01-03', 'LA', 19),('2021-01-04', 'LA', 22),('2021-01-05', 'LA', 25),('2021-01-06', 'LA', 27),('2021-01-07', 'LA', 25),('2021-01-08', 'LA', 27);SELECT * FROM temperatures;
The table stores the daily temperature in each city.
Simple aggregation
Let's say we want to produce a report showing the difference between each day, and the hottest day ever in the same city. Using MAX
, we can find the maximum temperature in all cities, at all times:
SELECTMAX(temperature)FROMtemperatures;
In the result, we get only one row—the hottest day in all cities at all times. To compare each day with the hottest day in the same city, we need to add the city to the GROUP BY
clause:
SELECTcity,MAX(temperature)FROMtemperaturesGROUP BYcity;
Using GROUP BY
we defined two groups—one for each distinct value in the column city
. The database then calculated the maximum temperature for each group, and we got two rows in the result with the hottest temperature in each city.
At this point, we might consider using the aggregate result and the temperature at each row to calculate the difference, but if we try it, it will fail:
SELECTcity,temperature - MAX(temperature) as diff_from_hottestFROMtemperaturesGROUP BYcity;
The query failed because if we want to use the temperature
and MAX(temperature)
in the same expression, we must add temperature
to the GROUP BY
clause. However, if we do that, we won't be able to get the maximum temperature per city. Using GROUP BY
we have expressions with both aggregate and non-aggregate.
Another approach is to join the results of the aggregate query to the temperature table:
WITH hottest AS (SELECTcity,MAX(temperature) as temperatureFROMtemperaturesGROUP BYcity)SELECTt.*,h.temperature as hottest_temperature,t.temperature - h.temperature AS diff_from_hottestFROMtemperatures tJOIN hottest h ON t.city = h.cityORDER BYt.city,t.day;
We first pre-calculated the hottest temperature in each city and stored it in a common table expression called hottest
. We then joined the results to the temperatures tables based on the city and calculated the difference between the temperature and the hottest temperature in the diff_from_hottest
column.
This type of analysis, where each value is compared to an aggregate of a larger group of values, is very common. In the next sections, we are going to use aggregate expressions to produce the same results more easily.
Aggregate expression with partition
Using aggregate expressions, we can calculate an aggregate over a group of rows without reducing the number of rows. For example, to add a column with the highest temperature in each city, we can use the following aggregate expression:
SELECT*,MAX(temperature) OVER (PARTITION BY city) AS max_temperature_at_cityFROMtemperatures;
The results now include a max_temperature_at_city
column with the maximum temperature in each city.
Adding the OVER()
keyword to the aggregate MAX
function, turns it into an aggregate expression. The result of the aggregate MAX(temperature) OVER (PARTITION BY city)
expression is the maximum temperature in each city.
We use the PARTITION
clause to tell the database what groups to use to calculate the aggregate expression, exactly like in the GROUP BY
clause. In this case, we want a group for each city, so we partition it by the city
column.
Notice that even though we used an aggregate MAX
function, we still get all the rows in the table. The database calculated the results of the aggregate expression over each partition in addition to the existing data set. This is the main difference between regular aggregation using GROUP BY
, and aggregate expressions.
To finish off our report, we can add an expression to calculate the difference between the temperature and the hottest temperature using our aggregate expression:
SELECT*,temperature - MAX(temperature) OVER (PARTITION BY city) AS diff_from_hottestFROMtemperatures;
The query uses the result of the aggregate expression to calculate the difference between the temperature in each row, with the hottest temperature in each city. The query is very simple and requires no subqueries, common table expressions, or joins.
Partition by expression
The value provided to PARTITION
can be an expression. For example, we can extract the year from the date, and add a column with the max temperature in each year:
SELECT*,EXTRACT('YEAR' from day) as year,MAX(temperature) OVER (PARTITION BY EXTRACT('YEAR' from day)) AS max_temperature_in_yearFROMtemperatures;
Partition by multiple expressions
Partition is not restricted to just one column or expression. We can partition by multiple columns and expressions. For example, we can partition by both the city and the year, and include a column with the hottest temperature in the same city at that year:
SELECT*,EXTRACT('YEAR' from day) as year,MAX(temperature) OVER (PARTITION BY city, EXTRACT('YEAR' from day)) AS max_temperature_in_city_at_yearFROMtemperatures;
The PARTITION
clause can take multiple columns and expressions, just like the GROUP BY
clause.
Partition by all rows
To produce an aggregate expression on all the rows, we can provide the PARTITION
clause with a constant value:
SELECT*,EXTRACT('YEAR' from day) as year,MAX(temperature) OVER (PARTITION BY 1) AS max_temperatureFROMtemperatures;
If we use a constant as a partition, it will cause all the rows in the table to be evaluated as a single partition. The result of the aggregate expression will be the highest temperature, similar to the result of the query, SELECT MAX(temperature) FROM temperatures
.
Another way to produce an aggregate over all the rows is to simply omit the partition clause:
SELECT*,EXTRACT('YEAR' from day) as year,MAX(temperature) OVER () AS max_temperatureFROMtemperatures;
Omitting the partition clause will cause all rows to be considered as a single partition.
Multiple aggregate expressions
A single query can include multiple aggregate expressions for different groups. For example, we can have a single query with the hottest day in each city—the coldest day in each city, and the hottest and coldest day across all cities:
SELECT*,MAX(temperature) OVER () AS max_temperature,MIN(temperature) OVER () AS min_temperature,MAX(temperature) OVER (PARTITION BY city) AS max_temperature_in_city,MIN(temperature) OVER (PARTITION BY city) AS min_temperature_in_cityFROMtemperatures;
The query above contains multiple aggregate expressions using different partitions and different aggregate functions.
Named windows
If you have many aggregate expressions in a query, and some of them are using the same window, you can define a named window and reuse it in the same query:
SELECT*,MAX(temperature) OVER all_rows AS max_temperature,MIN(temperature) OVER all_rows AS min_temperature,MAX(temperature) OVER city_rows AS max_temperature_in_city,MIN(temperature) OVER city_rows AS min_temperature_in_cityFROMtemperaturesWINDOWcity_rows AS (PARTITION BY city),all_rows AS (PARTITION BY 1);
The query defines two names windows:
city_rows
: Window for each city.all_rows
: Window including all the rows in the table. This is achieved by partitioning by the constant value1
.
Using named windows can reduce code duplication and at the same time make the query more readable.