PostgreSQL select distinct JSONB poor performance

JohnnyMetz
November 14, 2022
290 views
0 votes
2 Answers

Let’s say I have the following PostgreSQL table called products which has millions of records:

CREATE TABLE IF NOT EXISTS mytable (
    id serial NOT NULL PRIMARY KEY,
    label VARCHAR(50) NOT NULL,
    info jsonb NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);

I have a SELECT statement targeting this table with several JOIN‘s that is generating duplicate rows. This appears to be common: Why do multiple-table joins produce duplicate rows?

I know I can fix this issue using SELECT DISTINCT .... However, the query is taking several seconds, whereas the vanilla SELECT ... query takes milliseconds.

I presume this has to do with the info JSONB field, which can be very large. When I remove info from the DISTINCT calculation by using SELECT DISTINCT ON (id) ... then the query is much faster.

However, DISTINCT ON breaks some of my queries that use ORDER BY [non-id field] due to this condition:

SELECT DISTINCT ON expressions must match initial ORDER BY expressions

I’ve noticed I can fix the error by using subqueries (good example here):

SELECT * FROM (
    SELECT DISTINCT ON (id) ...
) ORDER BY [non-id field]

Two questions:

Is passing a JSONB field through DISTINCT a known performance problem? I want to make sure my theory is reasonable.
Is my solution of using subqueries a good solution for fixing the SELECT DISTINCT ON expressions must match initial ORDER BY expressions error? Or is there a better solution I’m not thinking of?

Answers

- ErwinBrandstetter
- November 14, 2022 at 6:19 am
- 0 votes
0
Is passing a JSONB field through DISTINCT a known performance problem?

Passing any large column through DISTINCT is a known performance problem. jsonb isn’t special in this regard.

Or is there a better solution I’m not thinking of?

That depends. You didn’t provide requested information, yet.

Login or Signup to reply.

- LaurenzAlbe
- November 14, 2022 at 8:12 am
- 0 votes
0
You can get by with a single sort and still have good performance:
```
SELECT DISTINCT ON ([non-id column], id) ...
ORDER BY [non-id column], id;
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.