I am using AWS Textract StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. I want only lines to be returned, not the single words - Amazon Web Services

FarisAshhab
August 30, 2022
275 views
2 votes
3 Answers

I am creating an internal OCR tool using AWS Textract and Node.js to detect text in a scanned PDF, specifically with StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. Currently the result comes back as a list of Block objects, with the lines first, followed by every individual word. Is there a parameter or some other option that makes it return only the lines and not the individual words in the PDF?

Answers

- SalazNumpt
- September 23, 2022 at 6:14 pm
- 0 votes
No, this is not possible. The response contains multiple block types, and the LINE blocks link to their WORD blocks via relationships.

Is there some reason why you cannot simply select only the block types you are interested in (lines)?
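To illustrate how lines link to words, here is a minimal Python sketch. It assumes a response dictionary shaped like the Textract output, where each block has an Id and a LINE block lists its words under a CHILD relationship. The helper name `lines_with_words` and the sample data are hypothetical and only there for illustration.

```python
def lines_with_words(response):
    """Return (line_text, [word_text, ...]) pairs from a Textract-style response."""
    # Index every block by Id so relationship Ids can be resolved quickly.
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue  # skip PAGE and WORD blocks
        word_texts = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "CHILD":
                continue
            for child_id in relationship["Ids"]:
                child = blocks_by_id.get(child_id)
                if child is not None and child["BlockType"] == "WORD":
                    word_texts.append(child["Text"])
        result.append((block["Text"], word_texts))
    return result


# Hand-written sample (hypothetical), trimmed to the fields used above.
sample_response = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE"},
        {
            "Id": "l1",
            "BlockType": "LINE",
            "Text": "Hello world",
            "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}],
        },
        {"Id": "w1", "BlockType": "WORD", "Text": "Hello"},
        {"Id": "w2", "BlockType": "WORD", "Text": "world"},
    ]
}

print(lines_with_words(sample_response))
```

If you only need the line text, you can ignore the relationships entirely and read the Text field of each LINE block.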


- JayalekshmiRJ
- October 12, 2022 at 8:23 am
- 0 votes
The response will always contain both the lines and the words, but you can iterate over response["Blocks"] and keep only the blocks with BlockType == "LINE".
For example:
```
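# response is assumed to be the parsed GetDocumentTextDetection result
for block in response["Blocks"]:
    # Keep LINE blocks; WORD and PAGE blocks are skipped
    if block["BlockType"] == "LINE":
        print(block)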
```
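Since the question uses the asynchronous API, note that GetDocumentTextDetection can return its results in several pages linked by NextToken. Below is a rough sketch of collecting only the line text across all pages. It assumes a boto3-style Textract client is passed in and that the job has already finished successfully. The function name `collect_line_texts` is made up for this example.

```python
def collect_line_texts(client, job_id):
    """Page through a finished text-detection job and return the text of every LINE block."""
    lines = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            # Ask for the page that follows the one we just read.
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)
        lines.extend(
            block["Text"]
            for block in response.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        next_token = response.get("NextToken")
        if not next_token:
            return lines


# Usage sketch (not executed here; needs AWS credentials and a finished job):
#   import boto3
#   client = boto3.client("textract")
#   print(collect_line_texts(client, job_id))
```

The same loop carries over to the Node.js SDK: send GetDocumentTextDetectionCommand repeatedly with the previous NextToken and filter each page's Blocks on BlockType.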

- Thomas
- October 26, 2022 at 11:27 pm
- 0 votes
I would suggest using the Amazon Textract Textractor library, which you can install with `pip install amazon-textract-textractor`.

It makes parsing and using the Textract output much easier than the raw JSON.
```
from textractor import Textractor

extractor = Textractor(profile_name="default")
document = extractor.detect_document_text('test.png')
print(document.lines)
```