I want to convert a Bengali PDF to a text file. The tool I'm currently using, pdftotext from poppler-utils, doesn't give accurate results because the PDF uses the Kalpurush font. Are there any tools that let me specify the Kalpurush font to get accurate results? I'd like to do this using Python, PHP, JS, or a Bash script.
2 Answers
You can try the Tesseract OCR engine. It supports Bengali through its "ben" language data; note that recognition is driven by the selected language model rather than by a named font, so the PDF pages are treated as images. You need to install the packages:
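Which packages you need depends on your setup. As a minimal sketch, assuming the pytesseract and pdf2image Python wrappers plus the tesseract and pdftoppm command-line programs (this tool set is an assumption, not something the answer names), the following checks what is still missing before you go further:

```python
import importlib.util
import shutil

# Assumed tool set (not named in the answer): adjust to your own setup.
REQUIRED_COMMANDS = ["tesseract", "pdftoppm"]    # OCR engine, poppler rasterizer
REQUIRED_MODULES = ["pytesseract", "pdf2image"]  # Python wrappers


def missing_dependencies(commands=REQUIRED_COMMANDS, modules=REQUIRED_MODULES):
    """Return the names of commands and modules that cannot be found."""
    missing = [c for c in commands if shutil.which(c) is None]
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing


if __name__ == "__main__":
    print("Missing:", missing_dependencies() or "nothing")
```

Tesseract also needs the Bengali language data installed; running `tesseract --list-langs` should list `ben` if it is available.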
Then, convert the PDF to images:
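One way to do this from Python, assuming the pdf2image package (a wrapper around poppler) is what you installed; the function name and the 300 dpi default are illustrative choices, not part of the answer:

```python
from pathlib import Path


def pdf_to_images(pdf_path, dpi=300):
    """Render every page of the PDF to a PIL image and return them as a list."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(pdf_path)

    # Imported lazily so the module loads even if pdf2image is not installed.
    from pdf2image import convert_from_path

    # A higher dpi gives the OCR engine more pixels per glyph,
    # at the cost of memory and run time.
    return convert_from_path(str(pdf_path), dpi=dpi)
```

A resolution around 300 dpi is a common starting point for OCR; small or dense Bengali text may benefit from a higher value.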
Finally, retrieve the text:
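A sketch of the recognition step, assuming the pytesseract wrapper and page images held as PIL images. The `recognize` parameter is a hypothetical hook added here so the function can be exercised without Tesseract installed:

```python
def ocr_pages(images, lang="ben", recognize=None):
    """Run OCR on each page image and join the page texts with blank lines."""
    if recognize is None:
        # Imported lazily; requires the Tesseract binary and its "ben" data.
        import pytesseract
        recognize = pytesseract.image_to_string

    # lang="ben" selects Tesseract's Bengali language data.
    return "\n\n".join(recognize(image, lang=lang).strip() for image in images)
```

Writing the result with `open("output.txt", "w", encoding="utf-8")` keeps the Bengali characters intact in the text file.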
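Alternatively, you can use the pdfplumber library, which extracts the text layer embedded in PDF files. It works from the fonts and encodings stored in the PDF itself, so there is no font to specify; how accurate the result is depends on how the Kalpurush text was encoded. Similarly, you need to install it first (e.g. pip install pdfplumber).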
Then, extract the text from the PDF:
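A minimal sketch using pdfplumber's `open` and `extract_text`; the function name is illustrative:

```python
from pathlib import Path


def extract_text_with_pdfplumber(pdf_path):
    """Return the embedded text of every page, one page per line group."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(pdf_path)

    # Imported lazily so the module loads even if pdfplumber is not installed.
    import pdfplumber

    with pdfplumber.open(str(pdf_path)) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

If this returns garbled Bengali, the PDF's text layer itself is probably unreliable, and the OCR route is the more promising one.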
PyMuPDF lets you write and read text written with Bengali fonts, e.g. also kalpurush.ttf:
Reading the text again:
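A rough round-trip sketch with PyMuPDF, assuming a local copy of the font at the hypothetical path kalpurush.ttf. The function names, page position, and font size are illustrative; the first function writes, the second reads the text back:

```python
from pathlib import Path


def write_bengali_pdf(text, out_path, font_path="kalpurush.ttf"):
    """Write `text` onto a one-page PDF using the given TrueType font file."""
    import fitz  # PyMuPDF; imported lazily

    doc = fitz.open()      # new, empty PDF
    page = doc.new_page()
    # Register the font file under a reference name, then draw with it.
    page.insert_font(fontname="kalpurush", fontfile=str(font_path))
    page.insert_text((50, 72), text, fontname="kalpurush", fontsize=14)
    doc.save(str(out_path))
    doc.close()


def read_pdf_text(pdf_path):
    """Return the text of every page of the PDF, joined by newlines."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(pdf_path)

    import fitz  # PyMuPDF; imported lazily

    with fitz.open(str(pdf_path)) as doc:
        return "\n".join(page.get_text() for page in doc)
```

For the original question only `read_pdf_text` is needed. As far as I know, `insert_text` places glyphs without complex-script shaping, so conjuncts written this way may not look the way a word processor would render them.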