The goal of this feature is to let the user upload a PDF file in the front end (Angular) and display the information it contains on screen. It should not render the complete PDF, only the information extracted from it.

For this, I thought I could convert the PDF file to a .TXT file or something similar so that I can extract the information.

Example:

  • Input: Cedula de Indentificación Fiscal.pdf (an example PDF file)
  • Output: { name: "Gregorio Emanuel Hernandez Rivera", address: "…", …, etc. }

2 Answers


  1. You can use Tesseract.js to read the text from the PDF.

  2. By default, OCR has no notion of such fields; it only returns the recognized text.

    You have a few options depending on your scenario:

    1. The documents all follow the same template (as provided) and have the same resolution.

    If you are working with a small number of fields, run the OCR and create a JSON mapping of the expected fields to their pixel coordinates, e.g. {"name": [x1, y1, x2, y2], …}, then match these regions of interest (ROIs) against the OCR output. Most OCR engines support coordinate output at the word level.

    2. The documents all follow the same template (as provided) but may arrive as a scan or as a photo (taken with a mobile phone, say).

    You would need to apply image transformations based on key-point matching to align each image with a template document. Then you can try option 1.

    3. The documents don’t follow any template (the provided sample was just one example).

    In this case, given the fields you listed, one option is to OCR the image first and then run named-entity recognition (NER) on the extracted text to identify those fields (name, address, etc.). See, as an example, spaCy NER.
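The fixed-template approach (field names mapped to pixel coordinates) can be sketched roughly as follows. This is only an illustration: the `OcrWord` shape, the `template` coordinates, and the `extractFields` helper are assumptions made up for the example, not the API of any particular OCR engine. A word is assigned to a field when the center of its bounding box falls inside that field's ROI.

```typescript
// Assumed shape of one OCR word: its text plus a pixel bounding box.
interface OcrWord {
  text: string;
  bbox: { x0: number; y0: number; x1: number; y1: number };
}

// One region of interest: [x1, y1, x2, y2] in template pixel coordinates.
type Roi = [number, number, number, number];

// Hypothetical template; these coordinates are invented for illustration.
const template: Record<string, Roi> = {
  name: [100, 50, 600, 90],
  address: [100, 120, 600, 200],
};

// Collects, for each field, the words whose box center lies inside the ROI.
// Words keep the order in which the OCR engine returned them.
function extractFields(
  words: OcrWord[],
  rois: Record<string, Roi>
): Record<string, string> {
  const result: Record<string, string> = {};
  for (const field of Object.keys(rois)) {
    const [x1, y1, x2, y2] = rois[field];
    const inside = words.filter((w) => {
      const cx = (w.bbox.x0 + w.bbox.x1) / 2;
      const cy = (w.bbox.y0 + w.bbox.y1) / 2;
      return cx >= x1 && cx <= x2 && cy >= y1 && cy <= y2;
    });
    result[field] = inside.map((w) => w.text).join(" ");
  }
  return result;
}
```

Testing the box center rather than requiring full containment tolerates boxes that slightly overlap an ROI edge; a field with no matching words comes back as an empty string.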

    Another thing to consider is whether the PDF you are given already has a text layer. If it does, you do not need OCR at all and can parse the text directly, then try something like NER to extract the names and addresses. If NER is not giving good results, fall back to options 1 and 2.
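For scanned or photographed documents, the alignment step can be illustrated with a deliberately simplified sketch. It assumes three matching key points have already been found in both the incoming image and the template, and fits an affine transform between them. The `Point`, `Affine`, `affineFromThreePoints`, and `applyAffine` names are hypothetical. A real pipeline would normally detect and match key points automatically, and a phone photo with perspective distortion needs a homography (four or more point pairs) rather than an affine map.

```typescript
interface Point { x: number; y: number }

// Affine map: u = a*x + b*y + c, v = d*x + e*y + f
interface Affine { a: number; b: number; c: number; d: number; e: number; f: number }

// Fits the affine transform that sends each src point to its dst point.
function affineFromThreePoints(
  src: [Point, Point, Point],
  dst: [Point, Point, Point]
): Affine {
  const [p1, p2, p3] = src;
  const [q1, q2, q3] = dst;
  const dx1 = p2.x - p1.x, dy1 = p2.y - p1.y;
  const dx2 = p3.x - p1.x, dy2 = p3.y - p1.y;
  const det = dx1 * dy2 - dx2 * dy1;
  if (Math.abs(det) < 1e-12) {
    throw new Error("source points are collinear");
  }
  // Solves one output coordinate (u or v) by Cramer's rule on the 2x2 system.
  const solve = (t1: number, t2: number, t3: number): [number, number, number] => {
    const dt1 = t2 - t1, dt2 = t3 - t1;
    const kx = (dt1 * dy2 - dt2 * dy1) / det;
    const ky = (dx1 * dt2 - dx2 * dt1) / det;
    return [kx, ky, t1 - kx * p1.x - ky * p1.y];
  };
  const [a, b, c] = solve(q1.x, q2.x, q3.x);
  const [d, e, f] = solve(q1.y, q2.y, q3.y);
  return { a, b, c, d, e, f };
}

// Maps a point from image coordinates into template coordinates.
function applyAffine(t: Affine, p: Point): Point {
  return { x: t.a * p.x + t.b * p.y + t.c, y: t.d * p.x + t.e * p.y + t.f };
}
```

With such a transform, OCR word coordinates from the photo can be mapped into template space before comparing them with the template's ROIs, instead of warping the image itself.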
