The rough layout I have for some complex queries follows this table structure in Postgres:
CREATE TABLE primari (
  id BIGINT NOT NULL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE secondary (
  primary_id BIGINT NOT NULL REFERENCES primari (id),
  id BIGINT NOT NULL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE tertiary (
  primary_id BIGINT NOT NULL REFERENCES primari (id),
  id BIGINT NOT NULL PRIMARY KEY,
  name TEXT NOT NULL
);
The relationship is one-to-*, as in "one to zero or more", for all non-primary tables. A primary might be associated with 25 secondaries and no tertiaries, or 10 tertiaries and no secondaries, or many of either, or none of either. I always want more than one column back from each table being queried. Expand this out to a few more tables and you have what I'm trying to do. In completely invalid SQL, I'm looking for something like:
SELECT
  primari.*,
  ARRAY_AGG(secondary.*) WHERE secondary.primary_id = primari.id,
  ARRAY_AGG(tertiary.*) WHERE tertiary.primary_id = primari.id
FROM primari
WHERE id IN (..)
  AND (..);
Enter multisets, and multiset functionality via jOOQ.
I recently learned about multisets in SQL via jOOQ and got excited. Theoretically, this would allow me to push quite a bit of query logic from my application layer to the database, where I could retrieve everything in one fell swoop instead of wiring together all the grouping and querying from my application. I have run into a bit of a problem: the queries don’t seem to scale particularly well, in the first case becoming basically unusable when querying for just a few thousand elements. I realize multisets aren’t natively supported in Postgres yet, but would still like to use the functionality if I can get it even reasonably performant.
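For context, the standard SQL construct jOOQ emulates looks roughly like the sketch below. Postgres doesn't accept this syntax, which is why jOOQ rewrites it into the JSONB aggregation shown next:

-- Standard SQL MULTISET value constructor (NOT valid Postgres syntax);
-- jOOQ emulates it with JSONB_AGG / JSONB_BUILD_OBJECT as in the next query.
SELECT
  p.*,
  MULTISET(SELECT s.id, s.primary_id, s.name FROM secondary s WHERE s.primary_id = p.id) AS secondaries,
  MULTISET(SELECT t.id, t.primary_id, t.name FROM tertiary t WHERE t.primary_id = p.id) AS tertiaries
FROM primari p
WHERE p.id IN (..);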
My first attempt was to put the multiset in the SELECT as a sort-of-kind-of scalar subquery:
SELECT
  primari.*,
  (
    SELECT
      -- reminder: this is a jOOQ multiset
      JSONB_PRETTY(COALESCE(JSONB_AGG(JSONB_BUILD_OBJECT('id', s.id, 'primary_id', s.primary_id, 'name', s.name)), JSONB_BUILD_ARRAY()))
    FROM (
      SELECT secondary.* FROM secondary WHERE secondary.primary_id = primari.id
    ) AS s
  ) AS secondaries,
  (
    SELECT
      JSONB_PRETTY(COALESCE(JSONB_AGG(JSONB_BUILD_OBJECT('id', t.id, 'primary_id', t.primary_id, 'name', t.name)), JSONB_BUILD_ARRAY()))
    FROM (
      SELECT tertiary.* FROM tertiary WHERE tertiary.primary_id = primari.id
    ) AS t
  ) AS tertiaries
FROM primari
WHERE primari.id IN (..);
This pattern was quite readable in both SQL and jOOQ, and in test datasets it seemed to perform fine. But when I upped my scale, the query did not. From the EXPLAIN ANALYZE output, I believe this was executing one subquery against each non-primary table for every row returned by the outer query. That's not great.
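A cut-down way to see this for yourself (ids here are placeholders) is to look at the loops count on the SubPlan nodes:

-- In the EXPLAIN output, the SubPlan node's "loops" count equals the number
-- of rows returned from primari, i.e. one subquery execution per row.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  primari.*,
  (
    SELECT JSONB_AGG(JSONB_BUILD_OBJECT('id', s.id, 'name', s.name))
    FROM secondary s
    WHERE s.primary_id = primari.id
  ) AS secondaries
FROM primari
WHERE primari.id IN (1, 2, 3);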
The only other thing I have thought of is to perform one subquery per non-primary table and aggregate rows there, then join those subqueries back to the original table.
SELECT
  primari.*,
  JSONB_PRETTY(COALESCE(secondary_sq.secondaries, JSONB_BUILD_ARRAY())) AS secondaries,
  JSONB_PRETTY(COALESCE(tertiary_sq.tertiaries, JSONB_BUILD_ARRAY())) AS tertiaries
FROM primari
LEFT OUTER JOIN (
  SELECT
    secondary.primary_id,
    COALESCE(JSONB_AGG(JSONB_BUILD_OBJECT('id', id, 'primary_id', primary_id, 'name', name)), JSONB_BUILD_ARRAY()) AS secondaries
  FROM secondary
  WHERE primary_id IN (..)
  GROUP BY primary_id
) AS secondary_sq
  ON secondary_sq.primary_id = primari.id
LEFT OUTER JOIN (
  SELECT
    tertiary.primary_id,
    COALESCE(JSONB_AGG(JSONB_BUILD_OBJECT('id', id, 'primary_id', primary_id, 'name', name)), JSONB_BUILD_ARRAY()) AS tertiaries
  FROM tertiary
  WHERE primary_id IN (..)
  GROUP BY primary_id
) AS tertiary_sq
  ON tertiary_sq.primary_id = primari.id
WHERE primari.id IN (..);
This solution seems to scale much better, but still gets slow as the number of ids to filter on gets very large.
The question then: am I thinking about this wrong, am I missing something, or is this the best I can do? It’s still faster than doing individual queries and grouping data in application code, but it kind of feels like I’m missing something that would make this run faster as my data scales.
Before anyone brings it up, there are indexes on all tables based on what the potential filters are for each table. For all non-primary tables, that means there's at least one index whose first column is primary_id.
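Concretely, something along these lines (names illustrative; the real indexes cover more of the filter columns):

-- Illustrative only: each non-primary table has at least one index
-- leading on primary_id, so the FK lookups are index-backed.
CREATE INDEX secondary_primary_id_idx ON secondary (primary_id);
CREATE INDEX tertiary_primary_id_idx ON tertiary (primary_id);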
I set up a dbfiddle to get folks started if you’d like to tinker.
2 Answers
I’ve investigated performance of the nesting approach on various RDBMS including PostgreSQL, comparing it with alternatives. In short:
Indeed, nesting stuff in SQL tends to produce nested loop joins, which have higher complexity than the more desirable hash joins or merge joins that would otherwise arise from ordinary joins (a nested loop is roughly O(N×M) in the sizes of its two inputs, where a hash join is closer to O(N+M)). Higher complexity means worse performance for large data sets. As with any Big-O consideration, there's a break-even point past which the worse complexity starts getting truly worse, whereas for smaller data sets it can actually be better.
At this point, you'll probably be best off with classic joins and 2 queries, one per join, i.e. a query joining primari and secondary and a query joining primari and tertiary, and then assembling everything in memory. You can't run everything with a single query, because of the cartesian product between secondary and tertiary, which is hard and, again, slow to resolve.

So, while MULTISET is very powerful and convenient for small to medium sized data sets (which is usually the case, especially for the nested data sets), it doesn't scale for reports.

I'm not sure why you need the subselects and the multiple levels of COALESCE, though. Can't you just join the tables and generate the required JSON? E.g.:
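A sketch of what that might look like against the schema above, one query per child table to avoid the cartesian product (the tertiary query is symmetrical):

-- Sketch: a plain LEFT JOIN plus GROUP BY, generating the JSON directly.
-- A single COALESCE handles primaries that have no secondaries at all;
-- the FILTER clause keeps the NULLs from the outer join out of the array.
SELECT
  primari.id,
  primari.name,
  COALESCE(
    JSONB_AGG(JSONB_BUILD_OBJECT('id', s.id, 'primary_id', s.primary_id, 'name', s.name))
      FILTER (WHERE s.id IS NOT NULL),
    '[]'::jsonb
  ) AS secondaries
FROM primari
LEFT JOIN secondary s ON s.primary_id = primari.id
WHERE primari.id IN (..)
GROUP BY primari.id, primari.name;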