We have Postgres DB design that in simplified form looks like this:
CREATE TABLE products
(
id varchar,
version bigint,
profile varchar,
PRIMARY KEY(id, profile, version)
);
CREATE TABLE product_attributes
(
productId varchar,
key varchar,
value varchar,
profile varchar,
version bigint,
PRIMARY KEY(productId, key, value, profile, version)
);
INSERT INTO products (id, version) VALUES ('prod_123', 1);
INSERT INTO product_attributes (productId, key, value, version)
VALUES ('prod_123', 'country', 'US', 1);
INSERT INTO product_attributes (productId, key, value, version)
VALUES ('prod_123', 'country', 'MX', 1);
INSERT INTO product_attributes (productId, key, value, version)
VALUES ('prod_123', 'country', 'ES', 1);
INSERT INTO product_attributes (productId, key, value, version)
VALUES ('prod_123', 'currency', 'USD', 1);
INSERT INTO product_attributes (productId, key, value, version)
VALUES ('prod_123', 'startDate', '2023-01-01', 1);
Large volume of data is inserted to DB and each product would have 20-30 versions, but we need to retrieve only latest version. Some of the attributes are collections, some are single values, but it was designed this way for easier attributes abstraction in the application to Map[String, Set[String]], and also to make it easier to add new attributes.
We were retrieving data using query like:
SELECT e.*
FROM products as e
JOIN (SELECT id, profile, MAX(version) AS version
FROM products
GROUP BY id, profile) AS vs ON e.id = vs.id
AND e.profile = vs.profile
AND e.version = vs.version
JOIN products_attributes oav ON e.id = oav.productId
AND e.version = oav.version;
However, performance of this query was very really poor and moving all attributes to jsonb field in products table seemed like good idea as it was significant performance improvement when we need to retrieve a whole set of products in single query.
{
"endDate": [
"2024-02-20T21:00:00.000Z"
],
"countries": [
"US", "MX","ES"
],
"type": [
"RETAIL"
],
"startDate": [
"2024-02-13T08:00:00.000Z"
],
"categories": [
"ELECTRONICS"
],
"currency": [
"USD"
],
"status": [
"ACTIVE"
]
}
Now, we are considering having server side filtering and most of the filters would filter data based on attributes, e.g. filter all products where at least one of the countries is in the list [US, ES, IT] and/or products where start date is after certain date.
I’m looking for some ideas on how to move further on to achieve the best performance with this server side filtering, as it’s most critical aspect. Filtering over jsonb field seem a bit complicated, and reverting back separate attribute table would affect performance of queries where all products are needed.
UPDATE:
We are having these indexes:
indexes that are currently added
products_pk(id, profile, version)
products_version_key(version desc)
products_id_profile_version_inded(id, asc, profile asc, version desc)
producs_attributes_pk(product_id, key, value, profile, version)
product_attributes_id_profile_version(product_id asc, profile asc, version desc)
explain(analyze, verbose, buffer, settings) results for 2 queries. First query performs better:
explain(analyze, verbose, buffers, settings)
SELECT e.*
FROM products as e
JOIN (SELECT id, profile, MAX(version) AS version FROM products GROUP BY id, profile) as vs
ON e.id = vs.id and e.profile = vs.profile and e.version = vs.version
JOIN product_attribute_values oav on e.id = oav.product_id and e.version = oav.version and e.profile = oav.profile;
Nested Loop (cost=1.25..59210.34 rows=11 width=885) (actual time=0.051..587.036 rows=125034 loops=1)
" Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
Inner Unique: true
Join Filter: (((e.id)::text = (products.id)::text) AND ((e.profile)::text = (products.profile)::text) AND (e.version = (max(products.version))))
Buffers: shared hit=607499
-> Nested Loop (cost=0.84..59209.24 rows=1 width=100) (actual time=0.038..204.454 rows=125034 loops=1)
" Output: products.id, products.profile, (max(products.version)), oav.product_id, oav.version, oav.profile"
Buffers: shared hit=107363
-> GroupAggregate (cost=0.41..23863.19 rows=10721 width=50) (actual time=0.016..39.909 rows=12007 loops=1)
" Output: products.id, products.profile, max(products.version)"
" Group Key: products.id, products.profile"
Buffers: shared hit=23986
-> Index Only Scan using products_id_profile_version_index on public.products (cost=0.41..23453.43 rows=40339 width=50) (actual time=0.007..28.559 rows=42433 loops=1)
" Output: products.id, products.profile, products.version"
Heap Fetches: 8381
Buffers: shared hit=23986
-> Index Only Scan using product_attribute_values_product_id_profile_version_index on public.product_attribute_values oav (cost=0.42..3.29 rows=1 width=50) (actual time=0.008..0.011 rows=10 loops=12007)
" Output: oav.product_id, oav.profile, oav.version"
Index Cond: ((oav.product_id = (products.id)::text) AND (oav.profile = (products.profile)::text) AND (oav.version = (max(products.version))))
Heap Fetches: 41197
Buffers: shared hit=83377
-> Index Scan using products_version_key on public.products e (cost=0.41..1.09 rows=1 width=885) (actual time=0.002..0.002 rows=1 loops=125034)
" Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
Index Cond: (e.version = oav.version)
Filter: (((e.id)::text = (oav.product_id)::text) AND ((e.profile)::text = (oav.profile)::text))
Buffers: shared hit=500136
"Settings: maintenance_io_concurrency = '1', effective_cache_size = '21585496kB', search_path = 'public'"
Query Identifier: -2874336356913203089
Planning:
Buffers: shared hit=46
Planning Time: 1.284 ms
Execution Time: 594.111 ms
SELECT e.*
FROM products as e
JOIN (SELECT id, profile, deleted, version, row_number() OVER (PARTITION BY id, profile ORDER BY version DESC) as rn
FROM products) as vs
ON e.id = vs.id and e.profile = vs.profile and e.version = vs.version
JOIN product_attribute_values oav on e.id = oav.product_id and e.version = oav.version and e.profile = oav.profile;
Gather (cost=1001.25..53915.82 rows=201 width=885) (actual time=0.466..849.478 rows=460724 loops=1)
" Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=2082508
-> Nested Loop (cost=1.25..52895.72 rows=84 width=885) (actual time=0.105..788.148 rows=153575 loops=3)
" Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
Inner Unique: true
Join Filter: (((e.id)::text = (products.id)::text) AND ((e.profile)::text = (products.profile)::text) AND (e.version = products.version))
Buffers: shared hit=2082508
Worker 0: actual time=0.138..797.222 rows=156639 loops=1
Buffers: shared hit=708216
Worker 1: actual time=0.127..787.080 rows=152762 loops=1
Buffers: shared hit=690788
-> Nested Loop (cost=0.84..52837.28 rows=53 width=100) (actual time=0.083..194.444 rows=153575 loops=3)
" Output: products.id, products.profile, products.version, oav.product_id, oav.version, oav.profile"
Buffers: shared hit=239610
Worker 0: actual time=0.114..199.550 rows=156639 loops=1
Buffers: shared hit=81659
Worker 1: actual time=0.097..193.384 rows=152762 loops=1
Buffers: shared hit=79739
-> Parallel Index Only Scan using products_id_profile_version_index on public.products (cost=0.41..23218.12 rows=16808 width=59) (actual time=0.032..13.041 rows=14144 loops=3)
" Output: products.id, products.profile, NULL::boolean, products.version, NULL::bigint"
Heap Fetches: 8381
Buffers: shared hit=23986
Worker 0: actual time=0.041..13.810 rows=14433 loops=1
Buffers: shared hit=8087
Worker 1: actual time=0.045..12.751 rows=14153 loops=1
Buffers: shared hit=7931
-> Index Only Scan using product_attribute_values_product_id_profile_version_index on public.product_attribute_values oav (cost=0.42..1.74 rows=1 width=50) (actual time=0.008..0.010 rows=11 loops=42433)
" Output: oav.product_id, oav.profile, oav.version"
Index Cond: ((oav.product_id = (products.id)::text) AND (oav.profile = (products.profile)::text) AND (oav.version = products.version))
Heap Fetches: 61992
Buffers: shared hit=215624
Worker 0: actual time=0.008..0.010 rows=11 loops=14433
Buffers: shared hit=73572
Worker 1: actual time=0.008..0.010 rows=11 loops=14153
Buffers: shared hit=71808
-> Index Scan using products_version_key on public.products e (cost=0.41..1.09 rows=1 width=885) (actual time=0.003..0.003 rows=1 loops=460724)
" Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
Index Cond: (e.version = oav.version)
Filter: (((e.id)::text = (oav.product_id)::text) AND ((e.profile)::text = (oav.profile)::text))
Buffers: shared hit=1842898
Worker 0: actual time=0.003..0.003 rows=1 loops=156639
Buffers: shared hit=626557
Worker 1: actual time=0.003..0.003 rows=1 loops=152762
Buffers: shared hit=611049
"Settings: maintenance_io_concurrency = '1', effective_cache_size = '21585496kB', search_path = 'public'"
Query Identifier: 49276727828285631
Planning:
Buffers: shared hit=78
Planning Time: 1.757 ms
Execution Time: 874.783 ms
2
Answers
See example from my test data and query.
First, query with row_number(…) turned out to be worse in my test. Somewhat unexpectedly for me, but you were right.
With index
you should have a query with good performance.
Test tables about 1M rows.
Explain
For interest
and explain
Avoid the self-join with
products
and use anEXISTS
clause withproduct_attribute_values
(since you don’t need any data from that table):That query will return only one result, even if there are tow
products
with the sameid
,profile
andversion
, so it is not 100% identical to yours.