I am creating an OCR internal tool using aws textract and nodejs to detect text from a scanned pdf, specifically StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. Currently returned in a list of block objects with the lines first and then starts detecting each word by word. Is there any way for me to add in a parameter or something where it will just return the lines for me and not the word by word in the pdf.
- August 30, 2022
- 275 views
- 2 votes
- 3 Answers
-
3
Answers
No, this is not possible. There are multiple block types, lines link to words via relationships.
Is there some reason why you cannot simply select only the block types you are interested in (lines)?
Response will always contain the lines and words. But you can iterate the response[‘Blocks’] and find only the blocks with BlockType == ‘LINES’.
Eg. below:
I would suggest to use the Amazon Textract Textractor library
pip install amazon-textract-textractor
It makes parsing and using the Textract output much easier than the raw JSON.