skip to Main Content

I am creating an OCR internal tool using aws textract and nodejs to detect text from a scanned pdf, specifically StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. Currently returned in a list of block objects with the lines first and then starts detecting each word by word. Is there any way for me to add in a parameter or something where it will just return the lines for me and not the word by word in the pdf.

3

Answers


  1. No, this is not possible. There are multiple block types, lines link to words via relationships.

    Is there some reason why you cannot simply select only the block types you are interested in (lines)?

    Login or Signup to reply.
  2. Response will always contain the lines and words. But you can iterate the response[‘Blocks’] and find only the blocks with BlockType == ‘LINES’.
    Eg. below:

        for block in response["Blocks"]:
            if block["BlockType"] == "LINE":
                print(block)
    
    Login or Signup to reply.
  3. I would suggest to use the Amazon Textract Textractor library pip install amazon-textract-textractor

    It makes parsing and using the Textract output much easier than the raw JSON.

    from textractor import Textractor
    
    extractor = Textractor(profile_name="default")
    document = extractor.detect_document_text('test.png')
    print(document.lines)
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search