Overview

So far, we’ve seen how to use GROUP BY to calculate aggregates such as min, max, sum, and count at different levels using the GROUP BY clause. However, there is another way to calculate aggregates in SQL without using GROUP BY.

To demonstrate aggregations, we are going to use a table with temperature data from two cities:

Press + to interact
CREATE TABLE temperatures (
day DATE,
city VARCHAR(30),
temperature INT
);
INSERT INTO temperatures (day, city, temperature)
VALUES
('2021-01-01', 'NY', 10),
('2021-01-02', 'NY', 12),
('2021-01-03', 'NY', 13),
('2021-01-04', 'NY', 14),
('2021-01-05', 'NY', 18),
('2021-01-06', 'NY', 15),
('2021-01-07', 'NY', 16),
('2021-01-08', 'NY', 17),
('2021-01-01', 'LA', 22),
('2021-01-02', 'LA', 21),
('2021-01-03', 'LA', 19),
('2021-01-04', 'LA', 22),
('2021-01-05', 'LA', 25),
('2021-01-06', 'LA', 27),
('2021-01-07', 'LA', 25),
('2021-01-08', 'LA', 27)
;
SELECT * FROM temperatures;

The table stores the daily temperature in each city.

Simple aggregation

Let's say we want to produce a report showing the difference between each day, and the hottest day ever in the same city. Using MAX, we can find the maximum temperature in all cities, at all times:

Press + to interact
SELECT
MAX(temperature)
FROM
temperatures
;

In the result, we get only one row—the hottest day in all cities at all times. To compare each day with the hottest day in the same city, we need to add the city to the GROUP BY clause:

Press + to interact
SELECT
city,
MAX(temperature)
FROM
temperatures
GROUP BY
city;

Using GROUP BY we defined two groups—one for each distinct value in the column city. The database then calculated the maximum temperature for each group, and we got two rows in the result with the hottest temperature in each city.

At this point, we might consider using the aggregate result and the temperature at each row to calculate the difference, but if we try it, it will fail:

Press + to interact
SELECT
city,
temperature - MAX(temperature) as diff_from_hottest
FROM
temperatures
GROUP BY
city;

The query failed because if we want to use the temperature and MAX(temperature) in the same expression, we must add temperature to the GROUP BY clause. However, if we do that, we won't be able to get the maximum temperature per city. Using GROUP BY we have expressions with both aggregate and non-aggregate.

Another approach is to join the results of the aggregate query to the temperature table:

Press + to interact
WITH hottest AS (
SELECT
city,
MAX(temperature) as temperature
FROM
temperatures
GROUP BY
city
)
SELECT
t.*,
h.temperature as hottest_temperature,
t.temperature - h.temperature AS diff_from_hottest
FROM
temperatures t
JOIN hottest h ON t.city = h.city
ORDER BY
t.city,
t.day
;

We first pre-calculated the hottest temperature in each city and stored it in a common table expression called hottest. We then joined the results to the temperatures tables based on the city and calculated the difference between the temperature and the hottest temperature in the diff_from_hottest column.

This type of analysis, where each value is compared to an aggregate of a larger group of values, is very common. In the next sections, we are going to use aggregate expressions to produce the same results more easily.

Aggregate expression with partition

Using aggregate expressions, we can calculate an aggregate over a group of rows without reducing the number of rows. For example, to add a column with the highest temperature in each city, we can use the following aggregate expression:

Press + to interact
SELECT
*,
MAX(temperature) OVER (PARTITION BY city) AS max_temperature_at_city
FROM
temperatures;

The results now include a max_temperature_at_city column with the maximum temperature in each city.

Adding the OVER() keyword to the aggregate MAX function, turns it into an aggregate expression. The result of the aggregate MAX(temperature) OVER (PARTITION BY city) expression is the maximum temperature in each city.

We use the PARTITION clause to tell the database what groups to use to calculate the aggregate expression, exactly like in the GROUP BY clause. In this case, we want a group for each city, so we partition it by the city column.

Notice that even though we used an aggregate MAX function, we still get all the rows in the table. The database calculated the results of the aggregate expression over each partition in addition to the existing data set. This is the main difference between regular aggregation using GROUP BY, and aggregate expressions.

To finish off our report, we can add an expression to calculate the difference between the temperature and the hottest temperature using our aggregate expression:

Press + to interact
SELECT
*,
temperature - MAX(temperature) OVER (PARTITION BY city) AS diff_from_hottest
FROM
temperatures;

The query uses the result of the aggregate expression to calculate the difference between the temperature in each row, with the hottest temperature in each city. The query is very simple and requires no subqueries, common table expressions, or joins.

Partition by expression

The value provided to PARTITION can be an expression. For example, we can extract the year from the date, and add a column with the max temperature in each year:

Press + to interact
SELECT
*,
EXTRACT('YEAR' from day) as year,
MAX(temperature) OVER (PARTITION BY EXTRACT('YEAR' from day)) AS max_temperature_in_year
FROM
temperatures;

Partition by multiple expressions

Partition is not restricted to just one column or expression. We can partition by multiple columns and expressions. For example, we can partition by both the city and the year, and include a column with the hottest temperature in the same city at that year:

Press + to interact
SELECT
*,
EXTRACT('YEAR' from day) as year,
MAX(temperature) OVER (
PARTITION BY city, EXTRACT('YEAR' from day)
) AS max_temperature_in_city_at_year
FROM
temperatures;

The PARTITION clause can take multiple columns and expressions, just like the GROUP BY clause.

Partition by all rows

To produce an aggregate expression on all the rows, we can provide the PARTITION clause with a constant value:

Press + to interact
SELECT
*,
EXTRACT('YEAR' from day) as year,
MAX(temperature) OVER (PARTITION BY 1) AS max_temperature
FROM
temperatures;

If we use a constant as a partition, it will cause all the rows in the table to be evaluated as a single partition. The result of the aggregate expression will be the highest temperature, similar to the result of the query, SELECT MAX(temperature) FROM temperatures.

Another way to produce an aggregate over all the rows is to simply omit the partition clause:

Press + to interact
SELECT
*,
EXTRACT('YEAR' from day) as year,
MAX(temperature) OVER () AS max_temperature
FROM
temperatures;

Omitting the partition clause will cause all rows to be considered as a single partition.

Multiple aggregate expressions

A single query can include multiple aggregate expressions for different groups. For example, we can have a single query with the hottest day in each city—the coldest day in each city, and the hottest and coldest day across all cities:

Press + to interact
SELECT
*,
MAX(temperature) OVER () AS max_temperature,
MIN(temperature) OVER () AS min_temperature,
MAX(temperature) OVER (PARTITION BY city) AS max_temperature_in_city,
MIN(temperature) OVER (PARTITION BY city) AS min_temperature_in_city
FROM
temperatures;

The query above contains multiple aggregate expressions using different partitions and different aggregate functions.

Named windows

If you have many aggregate expressions in a query, and some of them are using the same window, you can define a named window and reuse it in the same query:

Press + to interact
SELECT
*,
MAX(temperature) OVER all_rows AS max_temperature,
MIN(temperature) OVER all_rows AS min_temperature,
MAX(temperature) OVER city_rows AS max_temperature_in_city,
MIN(temperature) OVER city_rows AS min_temperature_in_city
FROM
temperatures
WINDOW
city_rows AS (PARTITION BY city),
all_rows AS (PARTITION BY 1)
;

The query defines two names windows:

  1. city_rows: Window for each city.

  2. all_rows: Window including all the rows in the table. This is achieved by partitioning by the constant value 1.

Using named windows can reduce code duplication and at the same time make the query more readable.