Postgresql - Postgres data filtering over attributes performance

zoranjeremic
March 18, 2024
81 views
1 vote
2 Answers

We have Postgres DB design that in simplified form looks like this:

CREATE TABLE products
(
    id varchar,
    version bigint,
    profile varchar,
    PRIMARY KEY(id, profile, version)
);

CREATE TABLE product_attributes
(
    productId varchar,
    key varchar,
    value varchar,
    profile varchar,
    version bigint,
    PRIMARY KEY(productId, key, value, profile, version)
);

INSERT INTO products (id, version) VALUES ('prod_123', 1);

INSERT INTO product_attributes (productId, key, value, version) 
VALUES ('prod_123', 'country', 'US', 1);
INSERT INTO product_attributes (productId, key, value, version) 
VALUES ('prod_123', 'country', 'MX', 1);
INSERT INTO product_attributes (productId, key, value, version) 
VALUES ('prod_123', 'country', 'ES', 1);
INSERT INTO product_attributes (productId, key, value, version) 
VALUES ('prod_123', 'currency', 'USD', 1);
INSERT INTO product_attributes (productId, key, value, version) 
VALUES ('prod_123', 'startDate', '2023-01-01', 1);

Large volume of data is inserted to DB and each product would have 20-30 versions, but we need to retrieve only latest version. Some of the attributes are collections, some are single values, but it was designed this way for easier attributes abstraction in the application to Map[String, Set[String]], and also to make it easier to add new attributes.

We were retrieving data using query like:

SELECT e.*
FROM products as e
JOIN (SELECT id, profile, MAX(version) AS version 
      FROM products  
      GROUP BY id, profile) AS vs ON e.id = vs.id 
                                  AND e.profile = vs.profile 
                                  AND e.version = vs.version
JOIN products_attributes oav ON e.id = oav.productId 
                             AND e.version = oav.version;

However, performance of this query was very really poor and moving all attributes to jsonb field in products table seemed like good idea as it was significant performance improvement when we need to retrieve a whole set of products in single query.

{
  "endDate": [
    "2024-02-20T21:00:00.000Z"
  ],
  "countries": [
    "US", "MX","ES"
  ],
  "type": [
    "RETAIL"
  ],
  "startDate": [
    "2024-02-13T08:00:00.000Z"
  ],
  "categories": [
    "ELECTRONICS"
  ],
  "currency": [
    "USD"
  ],
  "status": [
    "ACTIVE"
  ]
}

Now, we are considering having server side filtering and most of the filters would filter data based on attributes, e.g. filter all products where at least one of the countries is in the list [US, ES, IT] and/or products where start date is after certain date.

I’m looking for some ideas on how to move further on to achieve the best performance with this server side filtering, as it’s most critical aspect. Filtering over jsonb field seem a bit complicated, and reverting back separate attribute table would affect performance of queries where all products are needed.

UPDATE:
We are having these indexes:

indexes that are currently added
 products_pk(id, profile, version)
 products_version_key(version desc)
 products_id_profile_version_inded(id, asc, profile asc, version desc)


 producs_attributes_pk(product_id, key, value, profile, version)
 product_attributes_id_profile_version(product_id asc, profile asc, version desc)

explain(analyze, verbose, buffer, settings) results for 2 queries. First query performs better:

explain(analyze, verbose, buffers, settings)
SELECT e.*
FROM products as e
JOIN (SELECT id, profile, MAX(version) AS version FROM products  GROUP BY id, profile) as vs
    ON e.id = vs.id and e.profile = vs.profile and e.version = vs.version
JOIN product_attribute_values oav on e.id = oav.product_id and e.version = oav.version and e.profile = oav.profile;

Nested Loop  (cost=1.25..59210.34 rows=11 width=885) (actual time=0.051..587.036 rows=125034 loops=1)
"  Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
  Inner Unique: true
  Join Filter: (((e.id)::text = (products.id)::text) AND ((e.profile)::text = (products.profile)::text) AND (e.version = (max(products.version))))
  Buffers: shared hit=607499
  ->  Nested Loop  (cost=0.84..59209.24 rows=1 width=100) (actual time=0.038..204.454 rows=125034 loops=1)
"        Output: products.id, products.profile, (max(products.version)), oav.product_id, oav.version, oav.profile"
        Buffers: shared hit=107363
        ->  GroupAggregate  (cost=0.41..23863.19 rows=10721 width=50) (actual time=0.016..39.909 rows=12007 loops=1)
"              Output: products.id, products.profile, max(products.version)"
"              Group Key: products.id, products.profile"
              Buffers: shared hit=23986
              ->  Index Only Scan using products_id_profile_version_index on public.products  (cost=0.41..23453.43 rows=40339 width=50) (actual time=0.007..28.559 rows=42433 loops=1)
"                    Output: products.id, products.profile, products.version"
                    Heap Fetches: 8381
                    Buffers: shared hit=23986
        ->  Index Only Scan using product_attribute_values_product_id_profile_version_index on public.product_attribute_values oav  (cost=0.42..3.29 rows=1 width=50) (actual time=0.008..0.011 rows=10 loops=12007)
"              Output: oav.product_id, oav.profile, oav.version"
              Index Cond: ((oav.product_id = (products.id)::text) AND (oav.profile = (products.profile)::text) AND (oav.version = (max(products.version))))
              Heap Fetches: 41197
              Buffers: shared hit=83377
  ->  Index Scan using products_version_key on public.products e  (cost=0.41..1.09 rows=1 width=885) (actual time=0.002..0.002 rows=1 loops=125034)
"        Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
        Index Cond: (e.version = oav.version)
        Filter: (((e.id)::text = (oav.product_id)::text) AND ((e.profile)::text = (oav.profile)::text))
        Buffers: shared hit=500136
"Settings: maintenance_io_concurrency = '1', effective_cache_size = '21585496kB', search_path = 'public'"
Query Identifier: -2874336356913203089
Planning:
  Buffers: shared hit=46
Planning Time: 1.284 ms
Execution Time: 594.111 ms


SELECT e.*
FROM products as e
JOIN (SELECT id, profile, deleted, version, row_number() OVER (PARTITION BY id, profile ORDER BY version DESC) as rn
  FROM products) as vs
    ON e.id = vs.id and e.profile = vs.profile and e.version = vs.version
JOIN product_attribute_values oav on e.id = oav.product_id and e.version = oav.version and e.profile = oav.profile;


Gather  (cost=1001.25..53915.82 rows=201 width=885) (actual time=0.466..849.478 rows=460724 loops=1)
"  Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=2082508
  ->  Nested Loop  (cost=1.25..52895.72 rows=84 width=885) (actual time=0.105..788.148 rows=153575 loops=3)
"        Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
        Inner Unique: true
        Join Filter: (((e.id)::text = (products.id)::text) AND ((e.profile)::text = (products.profile)::text) AND (e.version = products.version))
        Buffers: shared hit=2082508
        Worker 0:  actual time=0.138..797.222 rows=156639 loops=1
          Buffers: shared hit=708216
        Worker 1:  actual time=0.127..787.080 rows=152762 loops=1
          Buffers: shared hit=690788
        ->  Nested Loop  (cost=0.84..52837.28 rows=53 width=100) (actual time=0.083..194.444 rows=153575 loops=3)
"              Output: products.id, products.profile, products.version, oav.product_id, oav.version, oav.profile"
              Buffers: shared hit=239610
              Worker 0:  actual time=0.114..199.550 rows=156639 loops=1
                Buffers: shared hit=81659
              Worker 1:  actual time=0.097..193.384 rows=152762 loops=1
                Buffers: shared hit=79739
              ->  Parallel Index Only Scan using products_id_profile_version_index on public.products  (cost=0.41..23218.12 rows=16808 width=59) (actual time=0.032..13.041 rows=14144 loops=3)
"                    Output: products.id, products.profile, NULL::boolean, products.version, NULL::bigint"
                    Heap Fetches: 8381
                    Buffers: shared hit=23986
                    Worker 0:  actual time=0.041..13.810 rows=14433 loops=1
                      Buffers: shared hit=8087
                    Worker 1:  actual time=0.045..12.751 rows=14153 loops=1
                      Buffers: shared hit=7931
              ->  Index Only Scan using product_attribute_values_product_id_profile_version_index on public.product_attribute_values oav  (cost=0.42..1.74 rows=1 width=50) (actual time=0.008..0.010 rows=11 loops=42433)
"                    Output: oav.product_id, oav.profile, oav.version"
                    Index Cond: ((oav.product_id = (products.id)::text) AND (oav.profile = (products.profile)::text) AND (oav.version = products.version))
                    Heap Fetches: 61992
                    Buffers: shared hit=215624
                    Worker 0:  actual time=0.008..0.010 rows=11 loops=14433
                      Buffers: shared hit=73572
                    Worker 1:  actual time=0.008..0.010 rows=11 loops=14153
                      Buffers: shared hit=71808
        ->  Index Scan using products_version_key on public.products e  (cost=0.41..1.09 rows=1 width=885) (actual time=0.003..0.003 rows=1 loops=460724)
"              Output: e.id, e.name, e.description, e.discount_id, e.legacy, e.author, e.datetime, e.profile, e.version, e.deleted, e.start_date, e.end_date, e.type, e.status, e.attributes"
              Index Cond: (e.version = oav.version)
              Filter: (((e.id)::text = (oav.product_id)::text) AND ((e.profile)::text = (oav.profile)::text))
              Buffers: shared hit=1842898
              Worker 0:  actual time=0.003..0.003 rows=1 loops=156639
                Buffers: shared hit=626557
              Worker 1:  actual time=0.003..0.003 rows=1 loops=152762
                Buffers: shared hit=611049
"Settings: maintenance_io_concurrency = '1', effective_cache_size = '21585496kB', search_path = 'public'"
Query Identifier: 49276727828285631
Planning:
  Buffers: shared hit=78
Planning Time: 1.757 ms
Execution Time: 874.783 ms

Answers

See example from my test data and query.
First, query with row_number(…) turned out to be worse in my test. Somewhat unexpectedly for me, but you were right.

explain analyze
SELECT e.*
FROM products as e
JOIN (SELECT id, profile, MAX(version) AS version 
      FROM products  
      GROUP BY id, profile) AS vs ON e.id = vs.id 
                                  AND e.profile = vs.profile 
                                  AND e.version = vs.version
JOIN product_attributes oav ON e.id = oav.productId 
   AND e.profile = oav.profile AND e.version = oav.version;

With index

create index  (ix_)products_id_profile_version on products (id asc, profile asc, version desc);
create index  products_id_profile_version_inded on products_attributes (id asc, profile asc, version desc);

you should have a query with good performance.

Test tables about 1M rows.

Explain

"Gather  (cost=63249.94..84865.64 rows=268 width=75) (actual time=428.216..905.465 rows=98198 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  ->  Nested Loop  (cost=62249.94..83838.84 rows=112 width=75) (actual time=408.680..844.521 rows=32733 loops=3)"
"        Join Filter: (((products.id)::text = (e.id)::text) AND ((products.profile)::text = (e.profile)::text) AND ((max(products.version)) = e.version))"
"        ->  Hash Join  (cost=62249.52..83684.79 rows=303 width=34) (actual time=408.599..594.368 rows=32733 loops=3)"
"              Hash Cond: (((oav.productid)::text = (products.id)::text) AND ((oav.profile)::text = (products.profile)::text) AND (oav.version = (max(products.version))))"
"              ->  Parallel Seq Scan on product_attributes oav  (cost=0.00..12188.02 rows=441302 width=17) (actual time=0.040..64.395 rows=353042 loops=3)"
"              ->  Hash  (cost=59858.97..59858.97 rows=102317 width=17) (actual time=408.104..408.105 rows=96168 loops=3)"
"                    Buckets: 65536  Batches: 2  Memory Usage: 3085kB"
"                    ->  GroupAggregate  (cost=0.42..58835.80 rows=102317 width=17) (actual time=0.062..367.465 rows=96168 loops=3)"
"                          Group Key: products.id, products.profile"
"                          ->  Index Only Scan using products_id_profile_version on products  (cost=0.42..50138.83 rows=1023173 width=17) (actual time=0.050..203.666 rows=1023173 loops=3)"
"                                Heap Fetches: 1023173"
"        ->  Index Scan using products_id_profile_version on products e  (cost=0.42..0.49 rows=1 width=75) (actual time=0.007..0.007 rows=1 loops=98198)"
"              Index Cond: (((id)::text = (oav.productid)::text) AND ((profile)::text = (oav.profile)::text) AND (version = oav.version))"
"Planning time: 2.967 ms"
"Execution time: 908.023 ms"

For interest

explain analyze
select *
from (
  SELECT id,profile,version
     ,row_number()over(partition by id, profile order by version)rn
  FROM products p
)as e
JOIN product_attributes oav ON rn=1 and e.id = oav.productId 
   AND e.profile = oav.profile AND e.version = oav.version

and explain

"Nested Loop  (cost=0.85..114709.32 rows=1949 width=51) (actual time=0.111..1359.362 rows=97618 loops=1)"
"  ->  Subquery Scan on e  (cost=0.42..83391.96 rows=5116 width=25) (actual time=0.099..646.132 rows=96168 loops=1)"
"        Filter: (e.rn = 1)"
"        Rows Removed by Filter: 927005"
"        ->  WindowAgg  (cost=0.42..70602.29 rows=1023173 width=25) (actual time=0.098..608.730 rows=1023173 loops=1)"
"              ->  Index Only Scan using ix_products_id_profile_version on products p  (cost=0.42..50138.83 rows=1023173 width=17) (actual time=0.091..238.432 rows=1023173 loops=1)"
"                    Heap Fetches: 1023173"
"  ->  Index Scan using products_id_profile_version_inded on product_attributes oav  (cost=0.43..6.11 rows=1 width=26) (actual time=0.006..0.006 rows=1 loops=96168)"
"        Index Cond: (((productid)::text = (e.id)::text) AND ((profile)::text = (e.profile)::text) AND (version = e.version))"
"Planning time: 0.868 ms"
"Execution time: 1362.314 ms"

- LaurenzAlbe
- March 18, 2024 at 4:54 am
- 0 votes
0
Avoid the self-join with products and use an EXISTS clause with product_attribute_values (since you don’t need any data from that table):
```
SELECT e.*
FROM products AS e
WHERE EXISTS (SELECT 1 FROM product_attribute_values AS oav
              WHERE e.id = oav.product_id
                AND e.version = oav.version
                AND e.profile = oav.profile)
ORDER BY id, profile, version DESC
LIMIT 1;
```
That query will return only one result, even if there are tow products with the same id, profile and version, so it is not 100% identical to yours.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Postgres data filtering over attributes performance

Answers