The rough layout I have for some complex queries follows this table structure in Postgres:
CREATE TABLE primari (
  id BIGINT NOT NULL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE secondary (
  primary_id BIGINT NOT NULL REFERENCES primari (id),
  id BIGINT NOT NULL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE tertiary (
  primary_id BIGINT NOT NULL REFERENCES primari (id),
  id BIGINT NOT NULL PRIMARY KEY,
  name TEXT NOT NULL
);
The relationship is one-to-*, as in "one to zero or more", for all non-primary tables. A primary might be associated with 25 secondaries and no tertiaries, or 10 tertiaries and no secondaries, or many of either, or none of either. I always want more than one column back from each table being queried. Expand this out to a few more tables and you have what I'm trying to do. In completely invalid SQL, I'm looking for something like:
SELECT
  primari.*,
  ARRAY_AGG(secondary.*) WHERE secondary.primary_id = primari.id,
  ARRAY_AGG(tertiary.*) WHERE tertiary.primary_id = primari.id
FROM primari
WHERE id IN (..)
  AND (..);
Enter multisets, and multiset functionality via jOOQ.
I recently learned about multisets in SQL via jOOQ and got excited. Theoretically, this would allow me to push quite a bit of query logic from my application layer to the database, where I could retrieve everything in one fell swoop instead of wiring together all the grouping and querying from my application. I have run into a bit of a problem: the queries don’t seem to scale particularly well, in the first case becoming basically unusable when querying for just a few thousand elements. I realize multisets aren’t natively supported in Postgres yet, but would still like to use the functionality if I can get it even reasonably performant.
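For context, the standard SQL construct jOOQ emulates looks roughly like the sketch below. Postgres doesn't accept this syntax, which is why jOOQ rewrites it into the JSONB aggregation shown next:

-- Standard SQL MULTISET value constructor (NOT valid Postgres syntax);
-- jOOQ emulates it with JSONB_AGG / JSONB_BUILD_OBJECT as in the next query.
SELECT
  p.*,
  MULTISET(SELECT s.id, s.primary_id, s.name FROM secondary s WHERE s.primary_id = p.id) AS secondaries,
  MULTISET(SELECT t.id, t.primary_id, t.name FROM tertiary t WHERE t.primary_id = p.id) AS tertiaries
FROM primari p
WHERE p.id IN (..);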
My first attempt was to put the multiset in the SELECT as a sort-of-kind-of scalar subquery:
SELECT
  primari.*,
  (
    SELECT
      -- reminder: this is a jOOQ multiset
      JSONB_PRETTY(COALESCE(JSONB_AGG(JSONB_BUILD_OBJECT('id', s.id, 'primary_id', s.primary_id, 'name', s.name)), JSONB_BUILD_ARRAY()))
    FROM (
      SELECT secondary.* FROM secondary WHERE secondary.primary_id = primari.id
    ) AS s
  ) AS secondaries,
  (
    SELECT
      JSONB_PRETTY(COALESCE(JSONB_AGG(JSONB_BUILD_OBJECT('id', t.id, 'primary_id', t.primary_id, 'name', t.name)), JSONB_BUILD_ARRAY()))
    FROM (
      SELECT tertiary.* FROM tertiary WHERE tertiary.primary_id = primari.id
    ) AS t
  ) AS tertiaries
FROM primari
WHERE primari.id IN (..);
This pattern was quite readable in both SQL and jOOQ, and in test datasets it seemed to perform fine. But when I upped my scale, the query did not. From the EXPLAIN ANALYZE output, I believe this was executing one subquery against each non-primary table for every row returned by the outer query. That's not great.
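A cut-down way to see this for yourself (ids here are placeholders) is to look at the loops count on the SubPlan nodes:

-- In the EXPLAIN output, the SubPlan node's "loops" count equals the number
-- of rows returned from primari, i.e. one subquery execution per row.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  primari.*,
  (
    SELECT JSONB_AGG(JSONB_BUILD_OBJECT('id', s.id, 'name', s.name))
    FROM secondary s
    WHERE s.primary_id = primari.id
  ) AS secondaries
FROM primari
WHERE primari.id IN (1, 2, 3);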
The only other thing I have thought of is to perform one subquery per non-primary table and aggregate rows there, then join those subqueries back to the original table.
SELECT
  primari.*,
  JSONB_PRETTY(COALESCE(secondary_sq.secondaries, JSONB_BUILD_ARRAY())) AS secondaries,
  JSONB_PRETTY(COALESCE(tertiary_sq.tertiaries, JSONB_BUILD_ARRAY())) AS tertiaries
FROM primari
LEFT OUTER JOIN (
  SELECT
    secondary.primary_id,
    COALESCE(JSONB_AGG(JSONB_BUILD_OBJECT('id', id, 'primary_id', primary_id, 'name', name)), JSONB_BUILD_ARRAY()) AS secondaries
  FROM secondary
  WHERE primary_id IN (..)
  GROUP BY primary_id
) AS secondary_sq
  ON secondary_sq.primary_id = primari.id
LEFT OUTER JOIN (
  SELECT
    tertiary.primary_id,
    COALESCE(JSONB_AGG(JSONB_BUILD_OBJECT('id', id, 'primary_id', primary_id, 'name', name)), JSONB_BUILD_ARRAY()) AS tertiaries
  FROM tertiary
  WHERE primary_id IN (..)
  GROUP BY primary_id
) AS tertiary_sq
  ON tertiary_sq.primary_id = primari.id
WHERE primari.id IN (..);
This solution seems to scale much better, but still gets slow as the number of ids to filter on gets very large.
The question then: am I thinking about this wrong, am I missing something, or is this the best I can do? It’s still faster than doing individual queries and grouping data in application code, but it kind of feels like I’m missing something that would make this run faster as my data scales.
Before anyone brings it up, there are indexes on all tables based on what the potential filters are for each table. For all non-primary tables, that means there's at least one index whose first column is primary_id.
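Concretely, something along these lines (names illustrative; the real indexes cover more of the filter columns):

-- Illustrative only: each non-primary table has at least one index
-- leading on primary_id, so the FK lookups are index-backed.
CREATE INDEX secondary_primary_id_idx ON secondary (primary_id);
CREATE INDEX tertiary_primary_id_idx ON tertiary (primary_id);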
I set up a dbfiddle to get folks started if you’d like to tinker.
2 Answers
I’ve investigated performance of the nesting approach on various RDBMS including PostgreSQL, comparing it with alternatives. In short:
Indeed, nesting stuff in SQL tends to produce nested loop joins, which have higher complexity than the more desirable hash joins or merge joins that would otherwise arise from ordinary joins (a nested loop is roughly O(N×M) in the sizes of its two inputs, where a hash join is closer to O(N+M)). Higher complexity means worse performance for large data sets. As with any Big-O consideration, there's a break-even point past which the worse complexity starts getting truly worse, whereas for smaller data sets it can actually be better.
At this point, you'll probably be best off with classic joins and 2 queries, one per join, i.e. a query joining primari and secondary and a query joining primari and tertiary, and then assembling everything in memory. You can't run everything with a single query, because of the cartesian product between secondary and tertiary, which is hard and, again, slow to resolve.

So, while MULTISET is very powerful and convenient for small to medium sized data sets (which is usually the case, especially for the nested data sets), it doesn't scale for reports.

I'm not sure why you need the subselects and the multiple levels of COALESCE, though. Can't you just join the tables and generate the required JSON? E.g.:
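A sketch of what that might look like against the schema above, one query per child table to avoid the cartesian product (the tertiary query is symmetrical):

-- Sketch: a plain LEFT JOIN plus GROUP BY, generating the JSON directly.
-- A single COALESCE handles primaries that have no secondaries at all;
-- the FILTER clause keeps the NULLs from the outer join out of the array.
SELECT
  primari.id,
  primari.name,
  COALESCE(
    JSONB_AGG(JSONB_BUILD_OBJECT('id', s.id, 'primary_id', s.primary_id, 'name', s.name))
      FILTER (WHERE s.id IS NOT NULL),
    '[]'::jsonb
  ) AS secondaries
FROM primari
LEFT JOIN secondary s ON s.primary_id = primari.id
WHERE primari.id IN (..)
GROUP BY primari.id, primari.name;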