The Problem
We have a relational table where we store user activity. A query like the following takes 77 seconds!
FROM "site_activity"
WHERE
(
NOT "site_activity"."is_deleted"
AND "site_activity"."user_id" = 68812389
AND NOT (
"site_activity"."kind" IN (
'updated',
'duplicated',
'reapplied'
)
)
AND NOT (
"site_activity"."content_type_id" = 14
AND "site_activity"."kind" = 'created'
)
)
ORDER BY
"site_activity"."created_at" DESC,
"site_activity"."id" DESC
LIMIT 9;
The query plan looks like this
QUERY PLAN
--------------------------------------------------------------------------------------------
Limit
(cost=17750.72..27225.75 rows=9 width=16)
(actual time=199501.336..199501.338 rows=9 loops=1)
Output: id, created_at
Buffers: shared hit=4502362 read=693523 written=37273
I/O Timings: read=190288.205 write=446.870
-> Incremental Sort
(cost=17750.72..2003433582.97 rows=1902974 width=16)
(actual time=199501.335..199501.336 rows=9 loops=1)
Output: id, created_at
Sort Key: site_activity.created_at DESC, site_activity.id DESC
Presorted Key: site_activity.created_at
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 25kB Peak Memory: 25kB
Buffers: shared hit=4502362 read=693523 written=37273
I/O Timings: read=190288.205 write=446.870
-> Index Scan Backward using site_activity_created_at_company_id_idx on public.site_activity
(cost=0.58..2003345645.30 rows=1902974 width=16)
(actual time=198971.283..199501.285 rows=10 loops=1)
Output: id, created_at
Filter: (
(NOT site_activity.is_deleted) AND (site_activity.user_id = 68812389)
AND ((site_activity.kind)::text <> ALL ('{updated,duplicated,reapplied}'::text[]))
AND ((site_activity.content_type_id <> 14) OR ((site_activity.kind)::text <> 'created'::text))
)
Rows Removed by Filter: 14735308
Buffers: shared hit=4502353 read=693523 written=37273
I/O Timings: read=190288.205 write=446.870
Settings: effective_cache_size = '261200880kB',
effective_io_concurrency = '400',
jit = 'off',
max_parallel_workers = '24',
random_page_cost = '1.5',
work_mem = '64MB'
Planning:
Buffers: shared hit=344
Planning Time: 6.429 ms
Execution Time: 199501.365 ms
(22 rows)
Time: 199691.997 ms (03:19.692)
Table Facts
-
It contains a little more than 4 billion rows.
-
The table structure is
Table "public.site_activity" Column | Type | Collation | Nullable | Default ----------------+--------------------------+-----------+----------+---------------------------------------------- id | bigint | | not null | nextval('site_activity_id_seq'::regclass) created_at | timestamp with time zone | | not null | modified_at | timestamp with time zone | | not null | is_deleted | boolean | | not null | object_id | bigint | | not null | kind | character varying(32) | | not null | context | text | | not null | company_id | integer | | not null | content_type_id | integer | | not null | user_id | integer | | | Indexes: "site_activity_pkey" PRIMARY KEY, btree (id) "site_activity_modified_at_idx" btree (modified_at) "site_activity_company_id_idx" btree (company_id) "site_activity_created_at_company_id_idx" btree (created_at, company_id) "site_activity_object_id_idx" btree (object_id) "site_activity_content_type_id_idx" btree (content_type_id) "site_activity_kind_idx" btree (kind) "site_activity_kind_idx1" btree (kind varchar_pattern_ops) "site_activity_user_id_idx" btree (user_id) Foreign-key constraints: "site_activity_company_id_fk_site_company_id" FOREIGN KEY (company_id) REFERENCES site_company(id) DEFERRABLE INITIALLY DEFERRED "site_activity_content_type_id_fk_django_co" FOREIGN KEY (content_type_id) REFERENCES django_content_type(id) DEFERRABLE INITIALLY DEFERRED "site_activity_user_id_fk_site_user_id" FOREIGN KEY (user_id) REFERENCES site_user(id) DEFERRABLE INITIALLY DEFERRED
a.
kind
is treated as anenum
. In db we store it as varchar. But in the application (python) we treat it as enum. So the values are fixed. There are around 100 values in it.b.
content_type_id
has around 80 values. -
This is the distribution of values,
a.
context
is actually JSON with a max 8Mb size.a. 3
content_type_id
values holds 92% of the rows. They are 14 and 19.a. 3
kind
consumes 75% rows. These arecreated
,updated
andsent
.a. The combination of
kind
andcontent_type_id
creates 460 values. Among them, 1 combination contains 35% of rows and we exclude them in the query all time. -
The replica instance has type
db.r5.12xlarge
. 24 cores, 48 vCPUs, 384GB Mem, storage type io1.
Question
- How do we handle if the table grows to 100 billion? In the current projection, this can happen in the next 3-5 years.
- Is NoSQL a good solution? Note we are not accessing the documents with only id or kind.
Notes
- The facts that I presented might bias the solution to replication in the same host and then later sharding over multiple hosts. But if there is some other solution that can keep up to the 100 billion mark, we should be good.
- We don’t have to use AWS. But preferred.
2
Answers
The current plan is to scan the rows already ordered by "created_at" (using an index) and then stop once it finds 10 (plus maybe a few rows to account for ties) passing the rest of of the conditions. It thinks it will do this very quickly, after only about 1/73,000 of the table (27225.75 / 2003433582.97). but in fact it had to scan much more than that (14735308 / 4000000000, or 1/270 of the table). So it grossly misestimated that part. I don’t know if it misestimated it because the number of rows meeting the conditions was estimated incorrectly (It thought there would be 1902974, we don’t know how many there actually were, since it stopped early and so stopped counting them) or because it assumed the matching rows would be evenly dispersed over the index, when they were not.
The best index for you will probably be on
(user_id, created_at)
. That way you get to jump to just the part of the index which has the correct user_id (which I assume is where the vast majority of your selectivity comes from) and then still walk that part already in order by "created_at". And you can drop the original index just on(user_id)
, as the new one will be good for anything that the old one is good for. You could also add "is_deleted" between the other two columns in that index, as it will not spoil the ordering property and will provide some additional selectivity (but probably not much). Any other columns added there will spoil the ordering property, however.Query
Start by formatting the
WHERE
clause to make it easier to understand. Comes down to:Index
You commented you always exclude rows for these two conditions. So this partial, multicolumn index would be the optimum:
Adding
id
only makes sense if there are many rows with the same(user_id, created_at)
. Else dropid
from the index.Excluding large, irrelevant parts of the table from the index can pay for such big indexes. (But you may prevent HOT updates for changes on any of the columns involved in the index, including the ones in the
WHERE
clause.)The index can only be used while its filters are an obvious subset of the filters in the query.
Table definition
It would pay to optimize your table definition. Like:
Most importantly,
kind
now occupies 2 bytes instead of 33 bytes or more. See:Plus substantial savings from rearranging the column order. See:
The big column
context
("with a max 8Mb size") will typically be stored out-of-line in a TOAST table for most rows, so the tuples to work with shrink to half their size. This makes a difference for most operations.And I suspect that some of your indexes may be expendable.