
We have a table with around 10k rows, with the following schema:

  1. item_id: TEXT (Primary Key)
  2. xml_1: XML
  3. xml_2: XML
  4. country: TEXT

Running the following query takes around 9 to 10 seconds:

SELECT * FROM info_table
WHERE item_id IN ('item1', 'item2', ..., 'item2000')

Each of our SELECT queries passes an array of roughly 2,000 item IDs (strings). The query is extremely simple and we are looking to optimize it (if possible).
The size of the XML in each row is around 100 KB.

If it helps, the query is issued from Node.js using Knex, like so:

client.select('*').from('info_table').where('item_id','in',ids)

The server runs PostgreSQL 14 hosted on GCP Cloud SQL with 2 vCPUs, 8 GB of memory and a 100 GB SSD.

Results of EXPLAIN (ANALYZE, BUFFERS):

Seq Scan on epg_test  (cost=4.85..740.17 rows=1939 width=601) (actual time=0.168..3.432 rows=1837 loops=1)
  Filter: (epg_id = ANY (<array of 2,000 IDs>))
  Rows Removed by Filter: 6051
  Buffers: shared hit=617
Planning:
  Buffers: shared hit=130
Planning Time: 1.999 ms
Execution Time: 3.590 ms

Any ideas on what we can do?

3 Answers


  1. Chosen as BEST ANSWER

    Thank you everyone for the input; it was very helpful.

    We managed to get the query down to 2 seconds by compressing the XML data before inserting it into PostgreSQL, which massively reduced the overall size.

    We compressed the XML using Node's "zlib" module with gzip.
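
    A minimal sketch of that approach, assuming the compressed XML is stored in bytea columns (the xml_1_gz / xml_2_gz names below are placeholders, not our real schema) and using Node's built-in zlib module alongside Knex:

    const zlib = require('zlib');

    // Compress the XML strings into Buffers before inserting.
    // xml_1_gz / xml_2_gz are assumed to be bytea columns.
    async function insertItem(client, item) {
      await client('info_table').insert({
        item_id: item.itemId,
        xml_1_gz: zlib.gzipSync(item.xml1),
        xml_2_gz: zlib.gzipSync(item.xml2),
        country: item.country,
      });
    }

    // Fetch a batch of rows and decompress the XML on the way out.
    async function getItems(client, ids) {
      const rows = await client.select('*').from('info_table').whereIn('item_id', ids);
      return rows.map((row) => ({
        ...row,
        xml_1: zlib.gunzipSync(row.xml_1_gz).toString('utf8'),
        xml_2: zlib.gunzipSync(row.xml_2_gz).toString('utf8'),
      }));
    }

    The synchronous zlib calls are only for brevity; the promisified zlib.gzip/zlib.gunzip (or a streaming approach) keeps the Node event loop free when handling many 100 KB documents.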


  2. From your question and comments, it’s clear that the PostgreSQL side of your query is cheap: it executes in under 4 ms. Therefore, indexing or other SQL tuning isn’t part of your solution.

    It’s also clear that you’re returning a large result set, amounting to something like 0.2 GiB, and you’re doing it in ten seconds or so. That means your throughput is about 20 MiB/s, which is excellent, especially if you’re retrieving the data onto a machine on your premises from a server located somewhere in GCP. (Keep in mind that 20 megabytes per second is upwards of 160 megabits per second, which is a significant amount of bandwidth to push from one machine to another.)

    How can you get this big data transfer to complete faster?

    1. More bandwidth. That is something to take up with your operations people, or you can move the machine running the query closer on the network to the database machine.

    2. Compressing the data in transit. XML is generally quite compressible (information-theoretically, it’s almost pathologically verbose). The PostgreSQL driver for Node.js (and Knex) has a deprecated sslcompression connection-string flag that applies lossless compression to the client-server network traffic. That might help.

      Or, you may be able to tunnel your database connection through an ssh session set up with the -C (compressed protocol) flag.

    3. Compressing the data at rest in your database. If you do this, make sure you store the compressed XML in columns with a binary data type (see the sketch after this list).
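
    As a minimal sketch of point 3, here is a Knex migration that adds binary columns (bytea on PostgreSQL) to hold the compressed XML; the column names are placeholders:

    exports.up = async (knex) => {
      await knex.schema.alterTable('info_table', (table) => {
        // Knex's binary() creates a bytea column on PostgreSQL.
        table.binary('xml_1_gz');
        table.binary('xml_2_gz');
      });
    };

    exports.down = async (knex) => {
      await knex.schema.alterTable('info_table', (table) => {
        table.dropColumn('xml_1_gz');
        table.dropColumn('xml_2_gz');
      });
    };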

    All that being said, ten seconds to process that much data doesn’t seem terribly unreasonable.

  3. Building on @PepeNO’s comment: avoid the network latency altogether by writing a GCP Cloud Function that performs the query, assembles the combined XML document you’re looking for, and returns that. That way the heavy network traffic all stays inside GCP. If the combined XML document can benefit from caching, that will also help your UX, since you won’t need to do the transformation as often. A rough sketch of the idea follows.
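
    This sketch assumes an HTTP-triggered Cloud Function on the Node.js runtime that reaches Cloud SQL through Knex; the function name, environment variable and response shape are made up:

    const zlib = require('zlib');
    const knex = require('knex')({
      client: 'pg',
      connection: process.env.DATABASE_URL, // assumed to point at the Cloud SQL instance
    });

    // HTTP-triggered Cloud Function: accepts { ids: [...] }, runs the query
    // inside GCP, and returns a single gzip-compressed JSON payload.
    exports.getItems = async (req, res) => {
      const ids = req.body.ids || [];
      const rows = await knex.select('*').from('info_table').whereIn('item_id', ids);

      const payload = zlib.gzipSync(JSON.stringify(rows));
      res.set('Content-Type', 'application/json');
      res.set('Content-Encoding', 'gzip');
      res.send(payload);
    };

    Most HTTP clients transparently decompress a Content-Encoding: gzip response, so the caller’s code barely needs to change.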
