I want to convert a Bengali PDF to a text file. The tool I'm currently using, pdftotext from poppler-utils, doesn't give accurate results because the PDF uses the Kalpurush font. Are there any tools that let me specify the Kalpurush font to get accurate results? I'd like to do this using Python, PHP, JS, or a Bash script.

2 Answers


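  1. You can try the Tesseract OCR engine. It supports Bengali through its `ben` language data. Tesseract has no option for selecting a font; it recognizes text from the rendered page image with a trained language model, so the font does not need to be specified. Install the Python packages (pdf2image also needs Poppler, and pytesseract needs the Tesseract binary with the `ben` language data installed):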

    pip install pdf2image pytesseract
    

    Then convert the PDF pages to images:

    from pdf2image import convert_from_path
    images = convert_from_path(pdf_path)  # pdf_path: path to the Bengali PDF
    

    Finally, run OCR on each page image to retrieve the text:

    import pytesseract

    text = ""
    for page in images:
        # OCR each page image with the Bengali language model ('ben').
        # No character whitelist is set: a Latin-only whitelist would
        # suppress the Bengali output. Tesseract takes no font argument.
        text += pytesseract.image_to_string(page, lang='ben', config='--psm 6 --oem 3')
    

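    Alternatively, you can use the pdfplumber library, which extracts the text layer embedded in the PDF. It does not take a custom font, so if the problem lies in the font's character mapping inside the PDF, it will likely show the same limits as pdftotext. Install it: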

    pip install pdfplumber
    

    Then extract the text from the PDF:

    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page in pdf.pages:
            # extract_text() may return None for a page without text
            text += page.extract_text() or ""
    
  2. PyMuPDF can both write and read text set in Bengali fonts, including kalpurush.ttf:

    import pymupdf
    
    fontfile = "kalpurush.ttf"
    doc = pymupdf.open()  # make a new PDF
    page = doc.new_page()  # add a page
    rect = pymupdf.Rect(100, 100, 400, 400)  # write text inside this
    
    text = """সমস্ত মানুষ স্বাধীনভাবে সমান মর্যাদা এবং অধিকার নিয়ে জন্মগ্রহণ
    করে। তাঁদের বিবেক এবং বুদ্ধি আছে; সুতরাং সকলেরই একে অপরের প্রতি ভ্রাতৃত্বসুলভ
     মনোভাব নিয়ে আচরণ করা উচিৎ।"""
    
    # instructions to use the font file
    css = """@font-face {font-family: bengali; src: url(kalpurush.ttf);}
    * {font-family: bengali;}
    """
    
    # search current folder for font file
    arch = pymupdf.Archive(".")
    
    # insert the text
    page.insert_htmlbox(rect, text, css=css, archive=arch)
    doc.ez_save("bengali.pdf")
    

    Reading the text again:

    import pymupdf
    doc = pymupdf.open("bengali.pdf")
    page = doc[0]
    text = page.get_text()
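
Neither answer shows how to tell whether extracting the text layer (pdfplumber, PyMuPDF) worked or whether OCR is the better route. The following is a minimal sketch of one possible check, assuming that correctly extracted Bengali text consists mostly of code points in the Unicode Bengali block (U+0980 to U+09FF). The names `bengali_ratio` and `needs_ocr` and the 0.5 threshold are illustrative assumptions, not part of any of the libraries above:

```python
# Unicode Bengali block boundaries
BENGALI_START, BENGALI_END = 0x0980, 0x09FF


def bengali_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Bengali block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    bengali = sum(BENGALI_START <= ord(ch) <= BENGALI_END for ch in chars)
    return bengali / len(chars)


def needs_ocr(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (assumed threshold): few Bengali characters suggests a bad text layer."""
    return bengali_ratio(text) < threshold
```

If `needs_ocr(text)` is true for the text returned by `page.get_text()` or `page.extract_text()`, the PDF's text layer probably does not map to Unicode Bengali, and the Tesseract route is worth trying. The threshold is a guess and may need tuning for documents that mix Bengali with Latin text or digits.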
    