PHP - I want to extract Bengali text from a PDF

MdAtiqurRahmanMazumder
May 28, 2024

I want to convert a Bengali PDF to a text file. The tool I currently use, pdftotext from poppler-utils, doesn't give accurate results because the PDF uses the Kalpurush font. Are there any tools that let me specify the Kalpurush font to get accurate results? I'd like to do this with Python, PHP, JS, or a Bash script.

Answers

- TaycirYahmed
- May 28, 2024 at 10:40 pm
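You can try the Tesseract OCR engine. It supports Bengali through its `ben` language data. It has no option to select a specific font such as Kalpurush, but because OCR works on the rendered page images, it does not depend on how the font is encoded in the PDF. You need the Tesseract binary with the Bengali language data installed, plus Poppler (which pdf2image uses to render pages), and then the Python packages: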
```
pip install pdf2image pytesseract
```
Then, convert the pdf to images:
```
from pdf2image import convert_from_path
images = convert_from_path(pdf_path)  # pdf_path: path to your PDF
```
And, finally, retrieve the text:
```
import pytesseract

text = ""
for page in images:
    # OCR each page image with the Bengali language data ('ben').
    # No character whitelist: a Latin-only whitelist would block Bengali output.
    text += pytesseract.image_to_string(page, lang='ben', config='--psm 6 --oem 3')
```
Alternatively, you can use pdfplumber, which extracts the text layer of a PDF. It has no option for loading a custom font, so it only helps if the PDF's text layer maps to correct Unicode. Install it with:
```
pip install pdfplumber
```
Then, extract the text from the pdf:
```
import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    text = ""
    for page in pdf.pages:
        # extract_text() can return None for a page without text in some versions
        text += page.extract_text() or ""
```
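To decide whether the text layer is usable or whether you should fall back to OCR, a rough check is to measure how much of the extracted text actually falls in the Bengali Unicode block (U+0980 to U+09FF). This is only a sketch: `bengali_ratio`, `looks_like_bengali`, and the 0.5 threshold are my own illustrative names and values, not part of any library.

```python
def bengali_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Bengali Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bengali = sum(1 for c in chars if "\u0980" <= c <= "\u09ff")
    return bengali / len(chars)


def looks_like_bengali(text: str, threshold: float = 0.5) -> bool:
    # The threshold is an arbitrary assumption; tune it for your documents.
    return bengali_ratio(text) >= threshold
```

If `looks_like_bengali(text)` is false for the pdfplumber or pdftotext output, the font is probably not mapped to Unicode and OCR is the safer route. Note that digits, Latin words, and the danda (।, which sits outside the Bengali block) lower the ratio.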

PyMuPDF can write and read text set in Bengali fonts, including kalpurush.ttf:

```
import pymupdf

fontfile = "kalpurush.ttf"
doc = pymupdf.open()  # make a new PDF
page = doc.new_page()  # add a page
rect = pymupdf.Rect(100, 100, 400, 400)  # write text inside this

text = """সমস্ত মানুষ স্বাধীনভাবে সমান মর্যাদা এবং অধিকার নিয়ে জন্মগ্রহণ
করে। তাঁদের বিবেক এবং বুদ্ধি আছে; সুতরাং সকলেরই একে অপরের প্রতি ভ্রাতৃত্বসুলভ
 মনোভাব নিয়ে আচরণ করা উচিৎ।"""

# instructions to use the font file
css = """@font-face {font-family: bengali; src: url(kalpurush.ttf);}
* {font-family: bengali;}
"""

# search current folder for font file
arch = pymupdf.Archive(".")

# insert the text
page.insert_htmlbox(rect, text, css=css, archive=arch)
doc.ez_save("bengali.pdf")
```

Reading the text back:

```
import pymupdf
doc = pymupdf.open("bengali.pdf")
page = doc[0]
text = page.get_text()
```
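Since the goal is a text file, here is a minimal sketch of saving the extracted pages as UTF-8. The helper name `save_pages_as_text` is hypothetical, and the NFC normalization step is my own assumption: it makes equivalent Bengali sequences (for example the two-part vowel sign ো) compare and search consistently.

```python
import unicodedata
from pathlib import Path
from typing import Iterable


def save_pages_as_text(pages: Iterable[str], out_path: str) -> str:
    """Join page texts, normalize them to NFC, and write the result as UTF-8."""
    joined = "\n".join(page.strip() for page in pages)
    normalized = unicodedata.normalize("NFC", joined)
    Path(out_path).write_text(normalized, encoding="utf-8")
    return normalized
```

With PyMuPDF this could be called as `save_pages_as_text((page.get_text() for page in doc), "bengali.txt")`, where `doc` is the opened document from the snippet above and `bengali.txt` is an example output name.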
