
I will try my best to walk through the issue. Please let me know if anything needs clarification.

Environment:
The application is deployed on AWS, with multiple instances connected to a single data store.
The data store contains the following tables:

Legacy tables:

instance_info (id, instance_details, ...)
task_info (id, task_id, ...)

Newly added table:

new_table (id, instance_info_id, task_info_id, ...)  

Schema design:

  1. id is the primary key in every table.
  2. In new_table:
    • task_info_id is a foreign key to table task_info,
    • instance_info_id is a foreign key to table instance_info,
    • a unique constraint exists on the column pair (instance_info_id, task_info_id).
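
For reference, a minimal sketch of what the new_table definition could look like, expressed here as DDL executed over JDBC. The column types, connection URL and constraint names are assumptions for illustration; only the columns and constraints listed above come from the actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class NewTableDdlSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; types and constraint names are assumed.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://db-host/app", "app", "secret");
             Statement st = con.createStatement()) {
            st.execute(
                "CREATE TABLE new_table (" +
                "  id               BIGINT PRIMARY KEY," +
                "  instance_info_id BIGINT NOT NULL," +
                "  task_info_id     BIGINT NOT NULL," +
                "  CONSTRAINT fk_new_instance FOREIGN KEY (instance_info_id) REFERENCES instance_info (id)," +
                "  CONSTRAINT fk_new_task     FOREIGN KEY (task_info_id)     REFERENCES task_info (id)," +
                "  CONSTRAINT uq_instance_task UNIQUE (instance_info_id, task_info_id)" +
                ")");
        }
    }
}
```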

Problem:
When the code executes, it divides (forks) its operation into multiple threads that execute independently & in parallel. On completion, these threads join and try to insert data into one of the legacy tables – "task_info".
As a result, these multiple threads (running concurrently on a single node) can each succeed, populating duplicate entries into the table.

Requirement:
If multiple threads are working in parallel, only one thread should INSERT a record into the table "task_info", while the other threads only UPDATE it.
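
To make the requirement concrete, here is a minimal Java/JDBC sketch of the intended behaviour; task_id as the lookup key and the details column are assumptions. Note that this naive check-then-insert is exactly what races when several threads run it at once, which is the problem described above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TaskInfoWriter {
    // Sketch only: the lookup on task_id and the column list are assumptions.
    void insertOrUpdate(Connection con, long taskId, String details) throws Exception {
        boolean exists;
        try (PreparedStatement check = con.prepareStatement(
                "SELECT id FROM task_info WHERE task_id = ?")) {
            check.setLong(1, taskId);
            try (ResultSet rs = check.executeQuery()) {
                exists = rs.next();
            }
        }
        String sql = exists
                ? "UPDATE task_info SET details = ? WHERE task_id = ?"
                : "INSERT INTO task_info (details, task_id) VALUES (?, ?)";
        try (PreparedStatement write = con.prepareStatement(sql)) {
            write.setString(1, details);
            write.setLong(2, taskId);
            write.executeUpdate();
        }
        // Without extra coordination, two threads can both see "not exists"
        // and both INSERT -- which produces the duplicate rows described above.
    }
}
```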

Limitations:

  1. Cannot add unique constraints to the task_info table, as this approach breaks existing (legacy code) functionality in the retry mechanism.
  2. Cannot lock the whole table during a write operation, as this could create performance issues for us.
  3. A "write-through" mechanism (distributed Memcache) was considered; however, once downtime is taken into consideration, it could lead to data loss.

Is there any efficient design approach (with minimal/no changes in the legacy code/design) that can be looked into?

UPDATE

There are some really tough restrictions on implementing a solution (due to the cost of adding additional resources), as follows:

  1. The supported databases are Oracle, SQL Server, MySQL & MariaDB. Hence, the locking mechanism must be interoperable across all of them (see the sketch after this list).
  2. There are limitations on the resources that can be used – the database & Memcache.
  3. The system can be deployed both on cloud & on-prem.
  4. Cannot carve a module out of the application, or create/depend on a new external service. I really loved the ideas suggested by Rob, as they are elegant & let the frameworks handle the majority of the complexity for me. However, that adds the cost of adding & maintaining resources.
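
One direction that appears to stay within these restrictions (a sketch under assumptions, not a definitive answer): let the database itself provide the mutual exclusion by locking a single agreed-upon row, for example the corresponding instance_info row or a dedicated one-row lock table, for the duration of the insert-or-update. Row locking is available on all four databases, though the syntax differs (SELECT ... FOR UPDATE on Oracle/MySQL/MariaDB, a WITH (UPDLOCK, ROWLOCK) hint on SQL Server). The Java/JDBC sketch below assumes the FOR UPDATE form and made-up method names.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RowLockSketch {
    // Sketch only: serialises insert-or-update by locking one existing row.
    // The locking SELECT differs per vendor, so the statement would have to be
    // chosen based on the configured database.
    void withInstanceLock(Connection con, long instanceInfoId, Runnable criticalSection) throws Exception {
        boolean oldAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false);
        try {
            // Oracle / MySQL / MariaDB form; SQL Server would instead use
            // "SELECT id FROM instance_info WITH (UPDLOCK, ROWLOCK) WHERE id = ?".
            try (PreparedStatement lock = con.prepareStatement(
                    "SELECT id FROM instance_info WHERE id = ? FOR UPDATE")) {
                lock.setLong(1, instanceInfoId);
                try (ResultSet rs = lock.executeQuery()) {
                    rs.next(); // the row lock is now held until commit/rollback
                }
            }
            criticalSection.run(); // check-then-insert-or-update runs here, serialised
            con.commit();
        } catch (Exception e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(oldAutoCommit);
        }
    }
}
```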

I guess the architecture, & the restrictions on changing it, make it complicated to find a correct & cost-effective solution.

2 Answers


  1. You are looking for a distributed lock manager. There are lots of options for this, but since you are already using AWS you should consider the one they built using DynamoDB as a lock-store. There are lots of alternatives though; if you don’t like the one AWS built, there are things like ZooKeeper that help maintain distributed lock systems.
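
    For illustration, a rough sketch of how the AWS-built option might be used from Java via the open-source DynamoDB Lock Client. The table name, lock key and lease/heartbeat values are assumptions, and the builder signatures vary between library versions (this follows the SDK-v1 flavour of its README), so treat it as a sketch rather than working integration code.

```java
import com.amazonaws.services.dynamodbv2.AcquireLockOptions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClient;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClientOptions;
import com.amazonaws.services.dynamodbv2.LockItem;

import java.util.Optional;
import java.util.concurrent.TimeUnit;

public class DistributedLockSketch {
    // dynamoDb is an already-configured DynamoDB client; "lockTable" and the
    // lease/heartbeat numbers are illustrative assumptions.
    void insertOrUpdateUnderLock(AmazonDynamoDB dynamoDb, String taskKey, Runnable dbWrite)
            throws InterruptedException {
        AmazonDynamoDBLockClient lockClient = new AmazonDynamoDBLockClient(
                AmazonDynamoDBLockClientOptions.builder(dynamoDb, "lockTable")
                        .withTimeUnit(TimeUnit.SECONDS)
                        .withLeaseDuration(10L)
                        .withHeartbeatPeriod(3L)
                        .withCreateHeartbeatBackgroundThread(true)
                        .build());

        // Only the holder of the lock for this task key performs the write;
        // other nodes either wait and retry or fall back to an UPDATE-only path.
        Optional<LockItem> lock = lockClient.tryAcquireLock(
                AcquireLockOptions.builder(taskKey).build());
        if (lock.isPresent()) {
            try {
                dbWrite.run();
            } finally {
                lockClient.releaseLock(lock.get());
            }
        }
    }
}
```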

  2. It sounds like Rob Conklin knows more about this than I do, so definitely take a look at his answer.

    One option that comes to mind is using a queue. I have never used this approach myself within an application, but in theory your various instances can throw whatever they like at the queue, which manages all the randomness of the incoming calls by ensuring they are processed according to whatever rules you want (like FIFO – first in, first out). This would mean you never have two calls trying to lock the DB, because the queue would make sure that never happened.

    Another advantage of some queuing solutions is that they can store the events/messages and play them back, or replay them, later. This means you can take the DB offline and let events gather in the queue, then play them through once the DB is back up.

    Obviously you just need some logic to manage the first-in-create / next-in-update approach. This will be easier with the sequence of messages now being consistent and more predictable via the queue.
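
    As a toy illustration of that first-in-create / next-in-update logic (a real deployment would put a durable broker such as SQS or RabbitMQ between the instances and the consumer, which this sketch does not model), all writers enqueue messages and a single consumer decides whether each message becomes an INSERT or an UPDATE.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TaskWriteQueueSketch {
    record TaskWrite(long taskId, String details) {}   // assumed message shape

    private final BlockingQueue<TaskWrite> queue = new LinkedBlockingQueue<>();
    private final Set<Long> seenTaskIds = new HashSet<>(); // touched only by the consumer thread

    // Any thread/instance calls this; it returns as soon as the message is queued.
    public void submit(TaskWrite write) {
        queue.add(write);
    }

    // Exactly one consumer drains the queue, so ordering is deterministic:
    // the first message for a task_id becomes an INSERT, later ones become UPDATEs.
    public void consumeLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            TaskWrite write = queue.take();
            if (seenTaskIds.add(write.taskId())) {
                insertTaskInfo(write);   // first-in: create the row
            } else {
                updateTaskInfo(write);   // next-in: update the existing row
            }
        }
    }

    private void insertTaskInfo(TaskWrite w) { /* JDBC INSERT would go here */ }
    private void updateTaskInfo(TaskWrite w) { /* JDBC UPDATE would go here */ }
}
```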

    Update 2-Jul-2021

    Regarding your comment… In terms of synchronicity, assuming I understood your concern correctly – the call to the queue (or whatever façade you have in front of it) would typically be synchronous, so callers won’t have to hang around and wait, because all that’s happening is that their call is accepted into the queue – which should be relatively fast. One potential issue is if the calling software assumes the call it has made is complete against the DB when actually it might still be in the queue – is that what you mean? If so, it’s a bit hard to say what the right approach would be based on what’s been said so far.

    What if multiple nodes "re-tried" (retry is a legacy functionality)
    the same process and all of those nodes started updating the database?

    A Façade or Proxy pattern might be useful here, where you have a proxy that manages all calls against the database. This could maybe also help with the synchronicity issues.

    [Image: diagram of the "Uber Proxy" – multiple callers feed a queue inside the proxy, which makes the actual database calls]

    Here I have the "Uber Proxy", which contains the queue to help wrangle the randomness generated by the multiple callers/instances, and a proxy component that performs the actual database calls.

    The thing with the proxy is that you can program logic into it to help it make decisions about which calls to execute and which ones to ignore or whatever.
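
    A minimal sketch of that idea (all class and method names here are made up for illustration): the proxy funnels every database call through a single worker, so only one call is in flight at a time, and the proxy's own logic decides whether a given call turns into an INSERT, an UPDATE, or is dropped as a duplicate retry. Callers get a Future back immediately, which also speaks to the synchronicity point above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TaskInfoProxy {
    // Single worker thread = all database writes are serialised through the proxy.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Set<Long> knownTaskIds = ConcurrentHashMap.newKeySet();

    // Callers return immediately with a Future; the proxy decides what actually hits the DB.
    public Future<?> write(long taskId, String details) {
        return worker.submit(() -> {
            if (knownTaskIds.add(taskId)) {
                insert(taskId, details);   // first call for this task creates the row
            } else {
                update(taskId, details);   // retries / later calls only update it
            }
        });
    }

    private void insert(long taskId, String details) { /* JDBC INSERT would go here */ }
    private void update(long taskId, String details) { /* JDBC UPDATE would go here */ }
}
```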
