Problem with PostgreSQL, JSONB, nested queries and database indices

greenwizard
January 23, 2023
258 views
1 vote
2 Answers

currently I’m struggeling with PostgreSQL 11, JSONB and database indices.

We have a customers table (shortened):

create table public.customers (   
   id uuid not null,   
   created timestamp without time zone,   
   lastmodified timestamp without time zone,   
   data jsonb,   
   primary key (id, mandantid) 
);

The JSON in the data column looks like this (shortened):

{
  "id": "...",
  "$encrypted": {
    "iv": "...",
    "data": "...",
    "keyRef": "key1"
  }
}

What I would like to do is to create a query which counts all db-entries which i.e. do NOT have ‘key1’ as ‘keyRef’. I’m not quite an PostgreSQL expert, so I tried several approaches.

I searched several documentation / stackoverflow / blogs and found the following approaches:

Approach 1

CREATE INDEX idx_customer_encryption_key ON customers(( (data->'$encrypted')::jsonb ->>'keyRef'::text));
COUNT(id) FROM customers WHERE data->'$encrypted' ->> 'keyRef' != 'key1';

Query works but is slow (no index is used).

Finalize Aggregate  (cost=163614.73..163614.74 rows=1 width=8) (actual time=6604.325..6610.718 rows=1 loops=1)
  Output: count(id)
  Buffers: shared hit=6550991 read=731452
  ->  Gather  (cost=163614.51..163614.72 rows=2 width=8) (actual time=6604.315..6610.709 rows=3 loops=1)
        Output: (PARTIAL count(id))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=6550991 read=731452
        ->  Partial Aggregate  (cost=162614.51..162614.52 rows=1 width=8) (actual time=6594.083..6594.083 rows=1 loops=3)
              Output: PARTIAL count(id)
              Buffers: shared hit=6550991 read=731452
              Worker 0:  actual time=6588.966..6588.967 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 0.314 ms, Inlining 0.000 ms, Optimization 0.164 ms, Emission 3.179 ms, Total 3.657 ms"
                Buffers: shared hit=2180408 read=243504
              Worker 1:  actual time=6589.102..6589.102 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 0.321 ms, Inlining 0.000 ms, Optimization 0.165 ms, Emission 3.174 ms, Total 3.661 ms"
                Buffers: shared hit=2142345 read=238322
              ->  Parallel Seq Scan on public.customers  (cost=0.00..160368.33 rows=898475 width=16) (actual time=3.728..6539.198 rows=722392 loops=3)
"                    Output: id, created, lastmodified, mandantid, data, customernumberpseudonym, internetnumberpseudonym, deletiondate"
                    Filter: (((customers.data -> '$encrypted'::text) ->> 'keyRef'::text) <> 'key1'::text)
                    Buffers: shared hit=6550991 read=731452
                    Worker 0:  actual time=3.343..6533.485 rows=720314 loops=1
                      Buffers: shared hit=2180408 read=243504
                    Worker 1:  actual time=3.808..6535.494 rows=706508 loops=1
                      Buffers: shared hit=2142345 read=238322
Planning:
  Buffers: shared hit=20 read=2 dirtied=1
Planning Time: 0.561 ms
JIT:
  Functions: 17
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 1.021 ms, Inlining 0.000 ms, Optimization 0.569 ms, Emission 10.153 ms, Total 11.744 ms"
Execution Time: 6611.164 ms

Approach 2

CREATE INDEX idx_customer_encryption_key ON customers (jsonb_extract_path_text(data, '$encrypted', 'keyRef'));
SELECT COUNT(id) FROM customers WHERE jsonb_extract_path_text(data, '$encrypted', 'keyRef') != 'key1';

Query works but is slow (no index is used).

Finalize Aggregate  (cost=161357.25..161357.26 rows=1 width=8) (actual time=6531.578..6538.081 rows=1 loops=1)
  Output: count(id)
  Buffers: shared hit=6550991 read=731452 written=2
  ->  Gather  (cost=161357.04..161357.25 rows=2 width=8) (actual time=6531.568..6538.072 rows=3 loops=1)
        Output: (PARTIAL count(id))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=6550991 read=731452 written=2
        ->  Partial Aggregate  (cost=160357.04..160357.05 rows=1 width=8) (actual time=6521.380..6521.381 rows=1 loops=3)
              Output: PARTIAL count(id)
              Buffers: shared hit=6550991 read=731452 written=2
              Worker 0:  actual time=6516.406..6516.407 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 0.297 ms, Inlining 0.000 ms, Optimization 0.186 ms, Emission 3.301 ms, Total 3.784 ms"
                Buffers: shared hit=2185669 read=243926
              Worker 1:  actual time=6516.327..6516.328 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 0.331 ms, Inlining 0.000 ms, Optimization 0.161 ms, Emission 3.384 ms, Total 3.876 ms"
                Buffers: shared hit=2155723 read=241152
              ->  Parallel Seq Scan on public.customers  (cost=0.00..158110.85 rows=898475 width=16) (actual time=3.594..6466.590 rows=722392 loops=3)
"                    Output: id, created, lastmodified, mandantid, data, customernumberpseudonym, internetnumberpseudonym, deletiondate"
"                    Filter: (jsonb_extract_path_text(customers.data, VARIADIC '{$encrypted,keyRef}'::text[]) <> 'key1'::text)"
                    Buffers: shared hit=6550991 read=731452 written=2
                    Worker 0:  actual time=3.490..6460.749 rows=724558 loops=1
                      Buffers: shared hit=2185669 read=243926
                    Worker 1:  actual time=3.937..6460.802 rows=712789 loops=1
                      Buffers: shared hit=2155723 read=241152
Planning:
  Buffers: shared hit=18 read=2 dirtied=1
Planning Time: 0.453 ms
JIT:
  Functions: 17
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 0.931 ms, Inlining 0.000 ms, Optimization 0.516 ms, Emission 9.871 ms, Total 11.319 ms"
Execution Time: 6538.429 ms

Approach 3

CREATE INDEX idx_customer_encryption_key ON customers ( ((data #> '{$encrypted,keyRef}')::varchar));
SELECT COUNT(id) FROM customers WHERE (data #> '{$encrypted,keyRef}')::varchar <> 'initialKey';

Query works but is slow (no index is used).

Finalize Aggregate  (cost=165872.20..165872.21 rows=1 width=8) (actual time=6914.265..6920.442 rows=1 loops=1)
  Output: count(id)
  Buffers: shared hit=6550978 read=731465 written=13
  ->  Gather  (cost=165871.99..165872.20 rows=2 width=8) (actual time=6914.144..6920.432 rows=3 loops=1)
        Output: (PARTIAL count(id))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=6550978 read=731465 written=13
        ->  Partial Aggregate  (cost=164871.99..164872.00 rows=1 width=8) (actual time=6904.453..6904.454 rows=1 loops=3)
              Output: PARTIAL count(id)
              Buffers: shared hit=6550978 read=731465 written=13
              Worker 0:  actual time=6899.759..6899.759 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 0.349 ms, Inlining 0.000 ms, Optimization 0.173 ms, Emission 3.345 ms, Total 3.867 ms"
                Buffers: shared hit=2172611 read=241759 written=7
              Worker 1:  actual time=6899.599..6899.600 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 0.347 ms, Inlining 0.000 ms, Optimization 0.174 ms, Emission 3.352 ms, Total 3.874 ms"
                Buffers: shared hit=2175840 read=243601 written=3
              ->  Parallel Seq Scan on public.customers  (cost=0.00..162625.80 rows=898475 width=16) (actual time=3.750..6848.950 rows=722392 loops=3)
"                    Output: id, created, lastmodified, mandantid, data, customernumberpseudonym, internetnumberpseudonym, deletiondate"
"                    Filter: ((((customers.data #> '{$encrypted,keyRef}'::text[]))::character varying)::text <> 'initialKey'::text)"
                    Buffers: shared hit=6550978 read=731465 written=13
                    Worker 0:  actual time=3.532..6844.562 rows=716749 loops=1
                      Buffers: shared hit=2172611 read=241759 written=7
                    Worker 1:  actual time=4.014..6843.198 rows=721111 loops=1
                      Buffers: shared hit=2175840 read=243601 written=3
Planning:
  Buffers: shared hit=18 read=2 dirtied=1
Planning Time: 0.367 ms
JIT:
  Functions: 17
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 1.042 ms, Inlining 0.000 ms, Optimization 0.531 ms, Emission 10.215 ms, Total 11.788 ms"
Execution Time: 6920.835 ms

Approach 4

CREATE INDEX idx_customer_encryption_key ON customers USING gin( (data -> '$encrypted') jsonb_path_ops);
SELECT COUNT(id) FROM customers WHERE data -> '$encrypted' @> '{"$encrypted": { "keyRef": "key1"}}';

Not exactly what I want, just to test if the index works. Uses an index but result always is 0.

Aggregate  (cost=58652.33..58652.34 rows=1 width=8) (actual time=0.020..0.021 rows=1 loops=1)
  Output: count(data)
  Buffers: shared hit=1 read=3
  ->  Bitmap Heap Scan on public.customers  (cost=215.96..58598.15 rows=21672 width=341) (actual time=0.018..0.018 rows=0 loops=1)
"        Output: id, created, lastmodified, mandantid, data, customernumberpseudonym, internetnumberpseudonym, deletiondate"
"        Recheck Cond: ((customers.data -> '$encrypted'::text) @> '{""$encrypted"": {""keyRef"": ""key1""}}'::jsonb)"
        Buffers: shared hit=1 read=3
        ->  Bitmap Index Scan on idx_customer_encryption_key  (cost=0.00..210.54 rows=21672 width=0) (actual time=0.016..0.016 rows=0 loops=1)
"              Index Cond: ((customers.data -> '$encrypted'::text) @> '{""$encrypted"": {""keyRef"": ""key1""}}'::jsonb)"
              Buffers: shared hit=1 read=3
Planning:
  Buffers: shared read=1
Planning Time: 0.680 ms
Execution Time: 0.062 ms

SELECT COUNT(data) FROM customers WHERE data -> '$encrypted' @> '{ "keyRef": "key1"}';

Counts correctly but is slow (does not use index).

Aggregate  (cost=58652.33..58652.34 rows=1 width=8) (actual time=32023.053..32023.054 rows=1 loops=1)
  Output: count(data)
  Buffers: shared hit=6550821 read=731787
  ->  Bitmap Heap Scan on public.customers  (cost=215.96..58598.15 rows=21672 width=341) (actual time=112.830..31779.441 rows=2167122 loops=1)
"        Output: id, created, lastmodified, mandantid, data, customernumberpseudonym, internetnumberpseudonym, deletiondate"
"        Recheck Cond: ((customers.data -> '$encrypted'::text) @> '{""keyRef"": ""initialKey""}'::jsonb)"
        Rows Removed by Index Recheck: 49
        Heap Blocks: exact=45746 lossy=98820
        Buffers: shared hit=6550821 read=731787
        ->  Bitmap Index Scan on idx_customer_encryption_key  (cost=0.00..210.54 rows=21672 width=0) (actual time=106.353..106.354 rows=2167122 loops=1)
"              Index Cond: ((customers.data -> '$encrypted'::text) @> '{""keyRef"": ""initialKey""}'::jsonb)"
              Buffers: shared hit=2 read=303
Planning:
  Buffers: shared hit=1
Planning Time: 0.068 ms
Execution Time: 32023.422 ms

Bergis’s comment

explain(analyze, buffers, verbose, format text) SELECT COUNT(data) FROM customers WHERE NOT data -> '$encrypted' @> '{"$encrypted": { "keyRef": "key1"}}';

Finalize Aggregate  (cost=161345.97..161345.98 rows=1 width=8) (actual time=6835.857..6842.275 rows=1 loops=1)
  Output: count(data)
  Buffers: shared hit=6550964 read=731479
  ->  Gather  (cost=161345.75..161345.96 rows=2 width=8) (actual time=6835.611..6842.266 rows=3 loops=1)
        Output: (PARTIAL count(data))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=6550964 read=731479
        ->  Partial Aggregate  (cost=160345.75..160345.76 rows=1 width=8) (actual time=6821.304..6821.304 rows=1 loops=3)
              Output: PARTIAL count(data)
              Buffers: shared hit=6550964 read=731479
              Worker 0:  actual time=6814.290..6814.291 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 1.647 ms, Inlining 0.000 ms, Optimization 0.230 ms, Emission 3.716 ms, Total 5.592 ms"
                Buffers: shared hit=2168672 read=241803
              Worker 1:  actual time=6814.158..6814.158 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 1.658 ms, Inlining 0.000 ms, Optimization 0.234 ms, Emission 3.721 ms, Total 5.613 ms"
                Buffers: shared hit=2196206 read=244975
              ->  Parallel Seq Scan on public.customers  (cost=0.00..158110.85 rows=893960 width=341) (actual time=4.449..6766.917 rows=722392 loops=3)
"                    Output: id, created, lastmodified, mandantid, data, customernumberpseudonym, internetnumberpseudonym, deletiondate"
"                    Filter: (NOT ((customers.data -> '$encrypted'::text) @> '{""$encrypted"": {""keyRef"": ""key1""}}'::jsonb))"
                    Buffers: shared hit=6550964 read=731479
                    Worker 0:  actual time=5.220..6760.360 rows=717794 loops=1
                      Buffers: shared hit=2168672 read=241803
                    Worker 1:  actual time=3.965..6761.016 rows=725142 loops=1
                      Buffers: shared hit=2196206 read=244975
Planning Time: 0.061 ms
JIT:
  Functions: 17
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 3.638 ms, Inlining 0.000 ms, Optimization 0.691 ms, Emission 11.364 ms, Total 15.692 ms"
Execution Time: 6842.656 ms

Approach 5

CREATE INDEX idx_customer_encryption_key ON customers USING GIN( ((data -> '$encrypted')::jsonb));
SELECT COUNT(data) FROM customers WHERE data -> '$encrypted' @> '{"keyRef" : "key1"}';

Query works but is slow (no index is used).

Aggregate  (cost=58676.33..58676.34 rows=1 width=8) (actual time=27662.270..27662.272 rows=1 loops=1)
  Output: count(data)
  Buffers: shared hit=6551064 read=731894 written=13274
  ->  Bitmap Heap Scan on public.customers  (cost=239.96..58622.15 rows=21672 width=341) (actual time=145.119..27449.141 rows=2167122 loops=1)
"        Output: id, created, lastmodified, mandantid, data, customernumberpseudonym, internetnumberpseudonym, deletiondate"
"        Recheck Cond: ((customers.data -> '$encrypted'::text) @> '{""keyRef"": ""initialKey""}'::jsonb)"
        Rows Removed by Index Recheck: 49
        Heap Blocks: exact=45746 lossy=98820
        Buffers: shared hit=6551064 read=731894 written=13274
        ->  Bitmap Index Scan on idx_customer_encryption_key  (cost=0.00..234.54 rows=21672 width=0) (actual time=138.368..138.369 rows=2167122 loops=1)
"              Index Cond: ((customers.data -> '$encrypted'::text) @> '{""keyRef"": ""initialKey""}'::jsonb)"
              Buffers: shared hit=240 read=415
Planning:
  Buffers: shared hit=22 read=4 dirtied=1
Planning Time: 0.575 ms
Execution Time: 27662.936 ms

Approach 6

CREATE INDEX idx_customer_encryption_key ON customers USING gin( (data #> '{$encrypted,keyRef}') );
SELECT COUNT(data) from customers WHERE (data #>> '{$encrypted,keyRef}')::text = 'key1';

Query works but is slow (no index is used).

Finalize Aggregate  (cost=159122.35..159122.36 rows=1 width=8) (actual time=7124.276..7130.794 rows=1 loops=1)
  Output: count(data)
  Buffers: shared hit=6550972 read=731471 written=307
  ->  Gather  (cost=159122.14..159122.35 rows=2 width=8) (actual time=7124.001..7130.783 rows=3 loops=1)
        Output: (PARTIAL count(data))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=6550972 read=731471 written=307
        ->  Partial Aggregate  (cost=158122.14..158122.15 rows=1 width=8) (actual time=7108.420..7108.420 rows=1 loops=3)
              Output: PARTIAL count(data)
              Buffers: shared hit=6550972 read=731471 written=307
              Worker 0:  actual time=7100.765..7100.765 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 0.632 ms, Inlining 0.000 ms, Optimization 0.411 ms, Emission 9.283 ms, Total 10.327 ms"
                Buffers: shared hit=2175813 read=242587 written=85
              Worker 1:  actual time=7100.660..7100.661 rows=1 loops=1
                JIT:
                  Functions: 5
"                  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"                  Timing: Generation 0.620 ms, Inlining 0.000 ms, Optimization 0.393 ms, Emission 8.127 ms, Total 9.140 ms"
                Buffers: shared hit=2180525 read=244599 written=117
              ->  Parallel Seq Scan on public.customers  (cost=0.00..158110.85 rows=4515 width=341) (actual time=8.311..7054.700 rows=722374 loops=3)
"                    Output: id, created, lastmodified, mandantid, data, customernumberpseudonym, internetnumberpseudonym, deletiondate"
"                    Filter: ((customers.data #>> '{$encrypted,keyRef}'::text[]) = 'initialKey'::text)"
                    Rows Removed by Filter: 18
                    Buffers: shared hit=6550972 read=731471 written=307
                    Worker 0:  actual time=10.292..7046.967 rows=718818 loops=1
                      Buffers: shared hit=2175813 read=242587 written=85
                    Worker 1:  actual time=8.559..7046.322 rows=726031 loops=1
                      Buffers: shared hit=2180525 read=244599 written=117
Planning:
  Buffers: shared hit=17 dirtied=1
Planning Time: 0.187 ms
JIT:
  Functions: 17
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 1.666 ms, Inlining 0.000 ms, Optimization 1.118 ms, Emission 23.162 ms, Total 25.945 ms"
Execution Time: 7131.263 ms

Approach 7

Now I have no more ideas why the index is not used for the query.

Any ideas out there? Any help is highly appreciated!

Answers

Chosen as BEST ANSWER
- greenwizard
- January 26, 2023 at 12:35 pm
- 0 votes
0
As the indexing of nested JSONB in PostgreSQL seems to be difficult, I ended up in adding an extra table-column and index for keyRef.

This is not elegant, I have data duplication but querying the database is as fast as expected.

Thanks to everybod who invested time to read my question!

(Edit)

- jjanes
- January 23, 2023 at 6:32 pm
- 0 votes
0
While in theory a regular B-tree index could implement != by doing a piece-wise scan of the < and of the > and merging them, that is not implemented. So with the current system, you would need a filtered index whose WHERE matches the one used in the query.
```
create index on customers (id) where data->'$encrypted' ->> 'keyRef' != 'key1';

select COUNT(id) FROM customers WHERE data->'$encrypted' ->> 'keyRef' != 'key1';
```
Of course unless the condition is quite rare, the index might not be useful (or used) anyway. Also, this might not do what you want in case "data" doesn’t have the path '$encrypted' -> 'keyRef', as then you would be testing NULL != 'key1' which does not yield true.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.