Postgresql - Parallel table row processing across multiple service worker nodes

OverflowStack
August 15, 2023
342 views
0 votes
2 Answers

My system consists of

Database (Timescale Db) and table telemetry_aggregations
.Net Worker nodes (1…n)

Every couple minutes, telemetry_aggregations table gets updated with couple hundreds of new rows.

How system should work

Every node periodically requests new batch of data (SQL rows) that haven’t been processed yet.
Same data (SQL rows) cannot be shared and processed across multiple nodes. If Node 1 gets rows with id 100, 101 … 200, no other services can read/update these rows while Node 1 is processing them. Data processing speed is not guaranteed.

The problem

What sort of architectural pattern or approach should be used to achieve this sort of a system?

I probably need a new table of some sort to keep track of what data has been processed and which haven’t been. I’m suspecting I also need some sort of a lock mechanism which would indicate that this particular row is being worked on and that it should be skipped by other services. But other than that, I am not sure what the best approach to tackle this is.

Answers

- noctarius
- August 14, 2023 at 10:24 am
- 0 votes
0
What you try to achieve sounds a lot like you use your database as some kind of a message / task queue. Not sure this is the best approach. While possible, I’d probably look into task scheduling frameworks, or a messaging solution that supports partitioning, such as Kafka or NATS. That way you can write the metrics to the database (for archiving) and to the message queue for processing. I mentioned the partitioning because that way you can drop different nodes onto different partitions (like using the node id) to make sure that data for the same node is processed by the same processor (and in order of occurrence). I hope this makes sense.

Login or Signup to reply.

- TmTron
- August 15, 2023 at 8:53 am
- 0 votes
0
You may want to use SKIP LOCKED for that.

With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped. Skipping locked rows provides an inconsistent view of the data, so this is not suitable for general purpose work, but can be used to avoid lock contention with multiple consumers accessing a queue-like table.

Here’s a nice blog that explains the feature in detail.

And as you said, you will need some info about which rows have been processed. What to do depends on your use-case of course: e.g.
- when the data has been consumed, you may just delete the original row after it has been processed
- if you still need that data later, then maybe add a boolean column processed to each row (and set it to true after it has been processed)
Some additional hints:
- When you want to avoid that your nodes constantly poll the database, you can use the LISTEN/NOTIFY feature.
- You should also think about what to do when a row cannot be processed for any reason: e.g. when your logic just throws an error and you roll-back the transaction, then the next iteration will read the same row again and fail, and so on. Thus, you may want to introduce some error state
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Parallel table row processing across multiple service worker nodes

Answers