
I am building a system that lets our clients convert PDF bank statements (from many different banks) into CSV, which is more useful because it can be imported into an accounting application. The system will find tables on the PDF pages and convert them into CSV files.

I am going to use:

  1. A simple static webpage with an HTML form to upload PDFs and choose which bank to process. It will also display job status and allow downloading the result of the transformation (CSV files). It should operate without user authentication.
  2. A backend running on Node.js (more on that later)
  3. Excalibur
  4. Puppeteer (to operate Excalibur)

The backend is responsible for:

  1. Receiving the request from the UI (PDF payload)
  2. Generating a new job id
    1. Sending it back to the UI
    2. Providing an HTTP resource the UI can ask for job status
  3. Creating a new Puppeteer instance and passing it the received PDF and job id
  4. Waiting for Puppeteer to finish and receiving the archive file (Excalibur puts each page of the table in a separate CSV file)
  5. Unpacking the archived CSV files
  6. Normalizing them with transformers (written with https://www.npmjs.com/package/mississippi)
  7. Sending the response to the UI (client)
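
As a rough illustration of step 2, here is a minimal sketch of an in-memory job registry that hands out job ids and answers status lookups. The names (`JobRegistry`, `JobStatus`, `setStatus`) are hypothetical and not part of Excalibur, Puppeteer, or any library; the HTTP layer is left out, and a counter stands in for a proper id generator.

```typescript
// Hypothetical in-memory job registry; every name here is illustrative.
type JobStatus = "queued" | "processing" | "done" | "failed";

interface Job {
  id: string;
  bank: string;
  status: JobStatus;
  resultFiles: string[]; // normalized CSV file names, filled in when done
}

class JobRegistry {
  private jobs = new Map<string, Job>();
  private counter = 0;

  // Step 2: generate a new job id and remember the job.
  create(bank: string): Job {
    this.counter += 1;
    // Assumption: a counter keeps the sketch short. Because the UI is
    // unauthenticated, a real service would want an unguessable id.
    const id = `job-${this.counter}`;
    const job: Job = { id, bank, status: "queued", resultFiles: [] };
    this.jobs.set(id, job);
    return job;
  }

  // Step 2.2: what the status resource would look up for a given id.
  get(id: string): Job | undefined {
    return this.jobs.get(id);
  }

  // Called by the processing code as the job moves through steps 3 to 7.
  setStatus(id: string, status: JobStatus, resultFiles: string[] = []): void {
    const job = this.jobs.get(id);
    if (!job) {
      throw new Error(`Unknown job: ${id}`);
    }
    job.status = status;
    job.resultFiles = resultFiles;
  }
}
```

Because the registry is keyed by job id rather than by anything global, concurrent users do not share state as long as each request only touches its own id.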

Problems I expect:

  1. Multi-tenancy: multiple users will access the system at once. (I am used to PHP, which runs in the context of a single user session, whereas a Node.js process stays resident in memory. I plan to handle this with the ‘continuation-local-storage’ package.)
  2. FE<->BE communication: processing big PDF files will take a long time, and the user needs feedback while it runs. That’s why I need some sort of job id to identify clients.
  3. Disabling the Excalibur database: my solution does not need to save any state.

As you can see, there are quite a lot of things to do. I do not want to discuss the decisions (e.g. why Puppeteer and not direct access to the Excalibur API). This is only the first, crude version. I have plenty of ideas to improve this system later.

My question is: should I use a message queue to simplify this system (make it more readable)? How could the system benefit from a queue such as AMQP, Azure Queues, or simply MongoDB used as a queue? What would a simple design (block diagram) of such a system look like with a message queue? I have no previous experience with message queues, but I feel one could help me design a better structure for this system.

2 Answers


  1. In general, queuing is not used to simplify a system. The simplest approach is to do the translation when the message is received and immediately respond with the result. The primary function of a queue is to add a layer of isolation between the data producer and the data consumer, supporting a dynamic, ordered backlog of messages to work on. Using a queue can be useful in situations where:

    1. Incoming messages do not need to be processed in real time.
    2. Message production rates may temporarily exceed consumption rates.
    3. Message consumers do not depend on message producers.
    4. Processing order of messages is important.

    Given that translating PDF files to CSV is a relatively expensive operation and doesn’t need to complete immediately, writing incoming requests to a queue and responding with a job ID is a reasonable approach.
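
To make the producer/consumer isolation concrete, here is a minimal sketch in which a plain in-process array stands in for a real broker (AMQP, Azure Queues, a MongoDB collection, and so on). `ConversionQueue`, `submit`, `take`, and `drain` are hypothetical names chosen for illustration, and the `convert` callback is a placeholder for the Puppeteer/Excalibur step.

```typescript
// Hypothetical names throughout; the array is a stand-in for a real broker.
interface ConversionRequest {
  jobId: string;
  pdfPath: string;
}

class ConversionQueue {
  private backlog: ConversionRequest[] = [];
  private nextId = 1;

  // Producer side: called by the HTTP handler. It only records the work
  // and returns a job id, so the response never waits for the conversion.
  submit(pdfPath: string): string {
    const jobId = `job-${this.nextId}`;
    this.nextId += 1;
    this.backlog.push({ jobId, pdfPath });
    return jobId;
  }

  // Consumer side: a worker takes the oldest request first (FIFO order).
  take(): ConversionRequest | undefined {
    return this.backlog.shift();
  }

  // Number of requests still waiting for a worker.
  size(): number {
    return this.backlog.length;
  }
}

// A worker drains the backlog at its own pace and reports how many
// requests it handled.
function drain(
  queue: ConversionQueue,
  convert: (request: ConversionRequest) => void
): number {
  let handled = 0;
  for (let req = queue.take(); req !== undefined; req = queue.take()) {
    convert(req);
    handled += 1;
  }
  return handled;
}
```

The producer and the worker share only the queue, which is what allows uploads to arrive faster than conversions finish without the web request blocking.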

  2. AMQP, SQS, or Azure Queues do not really work well with large payloads. Further, they are not in themselves a job engine, i.e. something you can query for job progress, use to cancel a job, etc. Such queues are used primarily to shuffle and buffer a lot of smaller messages across your system, or to notify other parts of your system.
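
One common way around the payload limitation is to store the PDF somewhere else and put only a small reference on the queue. The sketch below assumes that approach; `ConversionMessage`, `InMemoryStore`, and `toMessage` are hypothetical names, and the in-memory store is a placeholder for a disk directory or blob storage.

```typescript
// Hypothetical message shape: the queue carries a reference to the PDF,
// never the PDF bytes themselves.
interface ConversionMessage {
  jobId: string;
  bank: string;
  pdfLocation: string; // e.g. a path on shared storage or a blob key
}

// Hypothetical storage abstraction. A real one would write to disk or
// blob storage; a Map keeps the sketch self-contained.
class InMemoryStore {
  private files = new Map<string, Uint8Array>();

  put(key: string, bytes: Uint8Array): void {
    this.files.set(key, bytes);
  }

  get(key: string): Uint8Array | undefined {
    return this.files.get(key);
  }
}

// Producer side: persist the upload first, then build the small message
// that would be published to the queue.
function toMessage(
  jobId: string,
  bank: string,
  pdf: Uint8Array,
  store: InMemoryStore
): ConversionMessage {
  const pdfLocation = `uploads/${jobId}.pdf`;
  store.put(pdfLocation, pdf);
  return { jobId, bank, pdfLocation };
}
```

The worker would then read `pdfLocation` from the message and fetch the file from the store, so the message stays small no matter how large the statement is.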

    So, perhaps depending on the compute time of a text recognition job (I have no idea), a queue would help you buffer the load. You could also use one worker per tenant if it is important to give some amount of "fairness" among your tenants. Otherwise one tenant could submit an entire library to scan, and the others would have to wait a week or two to use your system for a single line of text.
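
One worker per tenant is one option; another way to get similar fairness with a single worker is to keep a backlog per tenant and serve the tenants in turn. The following is a minimal sketch of that round-robin idea, not a recommendation of a specific product; `FairScheduler` and `TenantBacklog` are hypothetical names.

```typescript
// Hypothetical per-tenant backlog; job ids stand in for full job records.
interface TenantBacklog {
  tenant: string;
  jobs: string[];
}

class FairScheduler {
  private backlogs: TenantBacklog[] = [];
  private cursor = 0; // index of the tenant to try first on the next call

  enqueue(tenant: string, jobId: string): void {
    let entry = this.backlogs.find((b) => b.tenant === tenant);
    if (!entry) {
      entry = { tenant, jobs: [] };
      this.backlogs.push(entry);
    }
    entry.jobs.push(jobId);
  }

  // Returns the next job, visiting tenants in turn so that one tenant
  // with a huge backlog cannot starve the others.
  next(): string | undefined {
    for (let i = 0; i < this.backlogs.length; i++) {
      const index = (this.cursor + i) % this.backlogs.length;
      const jobId = this.backlogs[index]?.jobs.shift();
      if (jobId !== undefined) {
        this.cursor = (index + 1) % this.backlogs.length;
        return jobId;
      }
    }
    return undefined; // every backlog is empty
  }
}
```

With this scheme, a tenant that submits a single job waits behind at most one job from each other tenant, regardless of how many jobs those tenants have queued.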

    However, for reporting status to the user ("job is 10% done" and so forth), you can probably send some WebSocket messages, but if jobs take more than a couple of seconds to finish you will probably end up wanting to store the progress of each job in a database.
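
A minimal sketch of such a progress record follows. It assumes progress can be measured in pages processed, which the answer does not state, and it uses a Map as a stand-in for a database table; `ProgressStore` and its methods are hypothetical names.

```typescript
// Hypothetical progress row; in a real system this would be a database
// record keyed by job id.
interface JobProgress {
  jobId: string;
  pagesDone: number;
  pagesTotal: number;
}

class ProgressStore {
  private rows = new Map<string, JobProgress>();

  // Assumption: the total page count is known before processing starts.
  start(jobId: string, pagesTotal: number): void {
    if (!Number.isInteger(pagesTotal) || pagesTotal <= 0) {
      throw new RangeError("pagesTotal must be a positive integer");
    }
    this.rows.set(jobId, { jobId, pagesDone: 0, pagesTotal });
  }

  // Called by the worker after each page; never exceeds the total.
  pageDone(jobId: string): void {
    const row = this.rows.get(jobId);
    if (!row) {
      throw new Error(`Unknown job: ${jobId}`);
    }
    row.pagesDone = Math.min(row.pagesDone + 1, row.pagesTotal);
  }

  // What a status endpoint or WebSocket message would report.
  percent(jobId: string): number | undefined {
    const row = this.rows.get(jobId);
    if (!row) {
      return undefined;
    }
    return Math.floor((row.pagesDone * 100) / row.pagesTotal);
  }
}
```

Keeping progress in a store rather than only pushing it over a socket means a client that reconnects or reloads the page can still ask where its job stands.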
