So I have this query which is extremely slow to run:

```sql
SELECT
    "conversation".*,
    "teamBotLocation"."id" AS "teamBotLocation.id",
    "teamBotLocation"."name" AS "teamBotLocation.name"
FROM
    "conversation" AS "conversation"
    LEFT OUTER JOIN "teamBotLocation" AS "teamBotLocation"
        ON "conversation"."teamBotLocationId" = "teamBotLocation"."id"
WHERE
    "conversation"."teamBotId" = 'some_id'
    AND "conversation"."teamBotLocationId" IN ('some_id', 'some_id')
    AND "conversation"."isDeleted" = false
    AND "conversation"."isBlocked" = false
    AND "conversation"."isBroadcast" = false
    AND "conversation"."isGroup" = false
ORDER BY
    CASE
        WHEN "firstName" = '' THEN NULL
        ELSE 0
    END ASC NULLS LAST,
    "conversation"."firstName" NULLS LAST,
    "conversation"."id" DESC
LIMIT 10 OFFSET 100;
```
What does this query actually do?
I have a conversation table with almost 20 million rows, and I display these conversations in a table with a default page size of 10. The table can be sorted by any of these five columns: firstName, lastName, phone, email, and createdAt, in ascending or descending order.

I am currently using this query in production. For smaller result sets the execution time is below 1 s, which is fine. But when the result set runs into the millions, say 1 million rows, the execution time is far too slow, around 10 s. I found out that the primary reason for the slowness is the sorting.
This is my query plan in text:

```
Limit  (cost=353265.17..353265.20 rows=10 width=1742) (actual time=130013.935..130013.939 rows=10 loops=1)
  ->  Sort  (cost=353265.17..353481.27 rows=86438 width=1742) (actual time=130013.934..130013.937 rows=10 loops=1)
        Sort Key: (CASE WHEN ((conversation."firstName")::text = ''::text) THEN NULL::integer ELSE 0 END), conversation."firstName", conversation.id DESC
        Sort Method: top-N heapsort  Memory: 37kB
        ->  Hash Left Join  (cost=454.80..351397.28 rows=86438 width=1742) (actual time=12.856..128302.754 rows=1387007 loops=1)
              Hash Cond: (conversation."teamBotLocationId" = "teamBotLocation".id)
              ->  Index Scan using "conversation_teambot_teambotLocation_email" on conversation  (cost=0.56..350499.44 rows=86438 width=1706) (actual time=1.165..127696.780 rows=1387007 loops=1)
                    Index Cond: (("teamBotId" = 'f92eb170-28b2-11ed-a128-098b9f10e3d6'::uuid) AND ("teamBotLocationId" = ANY ('{f9978ff0-28b3-11ed-a128-098b9f10e3d6,f1e505b0-f033-11ed-b347-43d7d9fb1309}'::uuid[])))
                    Filter: ((NOT "isDeleted") AND (NOT "isBlocked") AND (NOT "isBroadcast") AND (NOT "isGroup"))
                    Rows Removed by Filter: 6940
              ->  Hash  (cost=434.66..434.66 rows=1566 width=32) (actual time=11.677..11.678 rows=1566 loops=1)
                    Buckets: 2048  Batches: 1  Memory Usage: 115kB
                    ->  Seq Scan on "teamBotLocation"  (cost=0.00..434.66 rows=1566 width=32) (actual time=0.004..11.326 rows=1566 loops=1)
Planning Time: 35.488 ms
Execution Time: 130014.015 ms
```
I need a solution that executes the query in under 1-2 s, keeping in mind that the sorting can be done in either ascending or descending order on any of these five columns: firstName, lastName, phone, email, and createdAt.
**Note** -
- If any row has an empty string in a particular column, that row should be kept at the end while sorting.
- To keep in mind, all these conversations belong to different accounts in a 1-to-1 mapping, so technically this query is for a particular account. Also, each account can have different locations, so the conversations also have a 1-to-1 mapping with these locations.
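As a minimal illustration of the empty-string-last requirement, the CASE/NULLS LAST trick from the query behaves like this on a throwaway value list (PostgreSQL syntax):

```sql
-- Empty strings map to NULL, which NULLS LAST pushes to the end;
-- everything else sorts by name ascending.
SELECT name
FROM (VALUES ('bob'), (''), ('alice')) AS t(name)
ORDER BY CASE WHEN name = '' THEN NULL ELSE 0 END ASC NULLS LAST,
         name;
-- Result order: alice, bob, ''
```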
2 Answers
You can first sort by the firstName column with LIMIT 10, and only after that sort the resulting rows by the other four columns; sorting 10 rows is much quicker.
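Read literally, this suggestion could be sketched as a subquery that pages on firstName first and then re-sorts only the small page (table and column names follow the question; the filters are abbreviated, and whether this preserves the ordering you actually want needs checking):

```sql
-- Page on "firstName" first, then sort only the 10 returned rows
-- by the remaining columns.
SELECT *
FROM (
    SELECT "conversation".*
    FROM "conversation"
    WHERE "conversation"."teamBotId" = 'some_id'
      AND "conversation"."isDeleted" = false
    ORDER BY ("firstName" = ''), "firstName", "id" DESC
    LIMIT 10 OFFSET 100
) AS page
ORDER BY "firstName", "lastName", "phone", "email", "createdAt";
```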
My answer assumes that "firstName" is a column of conversation. Otherwise, there is little hope for your query.

You can simplify the ORDER BY clause by sorting on the boolean expression "firstName" = '' instead of the CASE expression, because FALSE < TRUE.
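Spelled out, the simplified ORDER BY might look like this (a sketch: the boolean expression is FALSE for non-empty names, which therefore sort first):

```sql
ORDER BY "conversation"."firstName" = '',  -- FALSE (non-empty) sorts before TRUE (empty)
         "conversation"."firstName",
         "conversation"."id" DESC
```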
.Then you need an index to support the
ORDER BY
clause (and, in part, theWHERE
condition):and an index for the nested loop join (on
"teamBotLocation".id
, but if that is the primary key, you already have an index there).If you need to support different
ORDER BY
clauses, you need an index for each of these. Yes, that is expensive and will slow down data modifications. But if you need these queries to be fast, that’s the price you have to pay.
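Under those assumptions, a supporting index for this particular ORDER BY might be sketched as follows (the column list and index name are an assumption, not taken from the original answer):

```sql
-- Leads with the WHERE equality column, then mirrors the simplified sort keys.
-- PostgreSQL allows expressions such as ("firstName" = '') in an index.
CREATE INDEX conversation_firstname_page_idx
    ON conversation ("teamBotId", ("firstName" = ''), "firstName", id DESC);
```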