Using LATERAL Joins To Get “Top N” Records From Each Group In MySQL 8.0.14

Cyberdime
Published: December 1, 2022

One type of SQL query that has been particularly hard, historically, is getting the “Top N” rows for each group in a MySQL aggregation. When MySQL added LATERAL joins in 8.0.14, however, they opened the door for some simple but rather powerful query techniques. A few months ago, I looked at using LATERAL joins to gather row-specific aggregates. In that post, I used traditional aggregate functions like COUNT() and MAX(). But, as I recently demonstrated, MySQL’s JSON aggregates (5.7.22+) can be nested for some pretty exciting outcomes. In this post, I want to look at combining LATERAL joins with JSON aggregates to read the “Top N” rows from a GROUP BY query.

To demonstrate this, in the context of this blog, I’m going to craft a SQL query that gets the oldest 10 members; and, for each member, I want to get the newest 5 comments. Getting the oldest 10 members is simple and has been possible in SQL since the dawn of time:

SELECT
	m.id,
	m.name
FROM
	member m
WHERE
	m.id <= 10

In this case, I’m leaning on the fact that my id is an AUTO_INCREMENT column. Which means, the first 10 id values in the table represent the oldest members in the table.

ASIDE: In a simple query like this, I could have done ORDER BY m.id DSEC LIMIT 10 to truly get the oldest 10 members. But, using this technique blows up when you start joining to other tables.

Now, I’m going to add a LATERAL derived join table that will get the most recent 5 comments for each member. Since the LATERAL sub-query is executed for each row of the outer table, it means that our LATERAL SQL can reference columns in the outer row. In this case, we’re going to match on m.id:

SELECT
	m.id,
	m.name
FROM
	member m
-- By using a LATERAL JOIN (MySQL 8+), we can create a derived table PER EACH ROW of the
-- outer table. This per-row derived table is made available within the SELECT clause.
INNER JOIN LATERAL
	(
		-- Since this lateral join derived table is being calculated per row, we can
		-- easily grab the "TOP N" corresponding records (using LIMIT) for each row in the
		-- outer table.
		SELECT
			c.id,
			c.createdAt
		FROM
			blog_comment c
		WHERE
			c.memberID = m.id -- Per-row join condition.
		ORDER BY
			c.id DESC
		LIMIT
			5
	) AS recent
WHERE
	m.id <= 10

As you can see here, the LATERAL join allows us to use a per-row join condition:

WHERE c.memberID = m.id

… which, in turn, allows us to gather unique blog_comment records for each of the member records.

Of course, this INNER JOIN LATERAL still works like any other INNER JOIN which means that the result of this SQL query is a cross product of the two tables. This gives us way more rows that we wanted:

MySQL recordset showing that the LATERAL join returns 5 rows of comments per 1 row of members, creating a cross-product between the two tables.

As you can see, we end up with “Ben Nadel” 5 times in this resultset since we get the cross product of each member row and the (at most) 5 comments that each member has left on the blog.

To fix this, we can now GROUP BY the outer row. In some SQL engines, your GROUP BY clause has to contain all the columns that you want to group. However, one of the very nice things in the MySQL-specific syntax is that you only need to include just one of the columns and MySQL figures it out. In this case, we we are going to group by the member id:

SELECT
	m.id,
	m.name
FROM
	member m
-- By using a LATERAL JOIN (MySQL 8+), we can create a derived table PER EACH ROW of the
-- outer table. This per-row derived table is made available within the SELECT clause.
INNER JOIN LATERAL
	(
		-- Since this lateral join derived table is being calculated per row, we can
		-- easily grab the "TOP N" corresponding records (using LIMIT) for each row in the
		-- outer table.
		SELECT
			c.id,
			c.createdAt
		FROM
			blog_comment c
		WHERE
			c.memberID = m.id -- Per-row join condition.
		ORDER BY
			c.id DESC
		LIMIT
			5
	) AS recent
WHERE
	m.id <= 10
-- To reduce the cross product of the member table and the blog_comment table, we can
-- group the results by the member.
GROUP BY
	m.id

Now, when we run this, with the GROUP BY m.id, we get the following MySQL results:

MySQL recordset showing that the added GROUP BY has collapsed the LATERAL join cross-product resulting in a single row per member.

As you can see, we’re back down to a single record per member. The “Top N” comments that we generated in our LATERAL join have been collapsed. In order to extract the comment information, we’re going to use MySQL’s JSON aggregate functions. The JSON_ARRAYAGG() function allows us to compose grouped expressions into a JSON array. And, the JSON_OBJECT() function allows us to aggregate row columns as the expression being composed into the JSON array:

JSON_ARRAYAGG(
	-- Applied to each row in the LATERAL derived table.
	JSON_OBJECT(
		'id', recent.id,
		'createdAt', recent.createdAt
	)
) AS comments

Here, the JSON_OBJECT() function is being applied to each row in the grouping. Meaning, we’re looking at each blog_comment row in the LATERAL derived table and we’re constructing a JSON object. Then, the JSON_ARRAYAGG() function takes those JSON objects and composes them into a single array per member record.

The whole query looks like this:

SELECT
	m.id,
	m.name,
	-- Since we performed a GROUP BY on the outer table, our LATER JOIN derived table is
	-- now available for aggregation.
	COUNT( * ) AS commentCount,
	-- Using the JSON functions (MySQL 5.7.22+), we can collapse the "TOP N" rows for
	-- each outer row into a JSON payload (array of objects).
	JSON_ARRAYAGG(
		-- Applied to each row in the LATERAL derived table.
		JSON_OBJECT(
			'id', recent.id,
			'createdAt', recent.createdAt
		)
	) AS comments
FROM
	member m
-- By using a LATERAL JOIN (MySQL 8+), we can create a derived table PER EACH ROW of the
-- outer table. This per-row derived table is made available within the SELECT clause.
INNER JOIN LATERAL
	(
		-- Since this lateral join derived table is being calculated per row, we can
		-- easily grab the "TOP N" corresponding records (using LIMIT) for each row in the
		-- outer table.
		SELECT
			c.id,
			c.createdAt
		FROM
			blog_comment c
		WHERE
			c.memberID = m.id -- Per row join condition.
		ORDER BY
			c.id DESC
		LIMIT
			5
	) AS recent
WHERE
	m.id <= 10
GROUP BY
	m.id
;

And, when we run this in MySQL 8.0.14+, we get the following output:

MySQL recordset showing that the added JSON_ARRAYAGG() and JSON_OBJECT() functions have exposed the collapsed LATERAL join records as a JSON payload.

As you can see, the JSON_ARRAYAGG() and JSON_OBJECT() functions have taken the collapsed / grouped LATERAL join records and exposed them as a JSON payload in the MySQL recordset. This has allowed us to get the “Top N” comments for each member in the group. Of course, in your application code you’ll have to deserialize the JSON payload to get the comments as an array; however, that should be a capability natively built into your application runtime.

Isn’t SQL just thrilling! Every now and then, I’ll hear someone diminish SQL as something that people “have” to use. Forget that! I love SQL. And, I love that the SQL engines are constantly adding new and groovy functionality.

Check out the license.

Source: www.bennadel.com