
I have installed Apache Tika 1.8 and it runs perfectly, except that the OCR part is not working. I have Tesseract installed and it is also working properly.
When I try to send a PDF with an image in it, I get the following:

WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

Can I configure the TikaConfig using the command line utility? Or do I have to clone the project, update the POMs, and rebuild? I really do not want to have to do that.

There is some info here on how to use the command line utility and the TikaConfig but I cannot figure out how to enable TesseractOCRParser with it.

Any help is greatly appreciated.

3 Answers


  1. Chosen as BEST ANSWER

    OK, so with the help of this post on the Apache Tika forum (thank you, guys), I managed to get it working. It’s a hack, but it works. I extracted the tika-app JAR file, located PDFParser.properties, and changed the following properties like this:

    extractInlineImages true 
    extractUniqueInlineImagesOnly false 
    ocrStrategy ocr_and_text_extraction
    

    Then locate TesseractOCRConfig.properties and change this one property to 1:

    enableImageProcessing=1
    

    Save both properties files and zip everything up again. Use your new zipped-up JAR file and it will now extract both regular text and text from images in a PDF file.


  2. I tried user3250052’s approach, but I was unable to recompress the JAR file in a way that was executable. That’s owing to my own inexperience with Java; regardless, the less hacky way is to pass a custom Tika config file when calling Tika:

    java -jar tika-app.jar --config=tika-config.xml image.pdf
    

    This is what my tika-config.xml looks like:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <properties>
      <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
      <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
      <encodingDetectors>
        <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
      </encodingDetectors>
      <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
      <detectors>
        <detector class="org.apache.tika.detect.DefaultDetector"/>
      </detectors>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
          <params>
            <param name="extractInlineImages" type="bool">true</param>
          </params>
        </parser>
      </parsers>
    </properties>
    

    To build that config file, I first ran:

    java -jar tika-app.jar --dump-current-config
    

    That dumps the default config. I put that into tika-config.xml and added:

    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
    

    which I gleaned from https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox) (option 1).

    Even though tesseract is enabled by default (so OCR will work out of the box on image files), PDFs do not get OCRed without that option set because, as noted in the above link, “by default, extracting inline images is turned off because some rare PDFs contain thousands of inline images per page, and it has a big hit on performance, both memory usage and time”.

    Now everything (OCR on image files, OCR of images in PDFs or of image-based PDFs, and naturally also text extraction from text-based PDFs) works with the Tika app. I found plenty of documentation on getting this to work with the Tika server but very little on the Tika app, so I’m hoping this saves someone the few hours it took me to figure that out (let me know).
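
    As a rough sketch, the three PDFParser.properties values from the first answer could probably be set through the same params mechanism in tika-config.xml, rather than by editing files inside the jar. The parameter names and values are copied from that answer; the type attributes on the two additional params are assumptions, and this only applies to Tika versions that accept params on a parser, so check it against your version:

    <!-- Sketch only: would replace the PDFParser entry inside <parsers> -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- values copied from the edited PDFParser.properties in the first answer -->
        <param name="extractInlineImages" type="bool">true</param>
        <!-- assumption: type="bool" -->
        <param name="extractUniqueInlineImagesOnly" type="bool">false</param>
        <!-- assumption: type="string" -->
        <param name="ocrStrategy" type="string">ocr_and_text_extraction</param>
      </params>
    </parser>

    The Tesseract enableImageProcessing setting from that answer is not shown, since its name and type in the XML config may differ between Tika versions.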

  3. I would recommend using ocrStrategy auto.

    This tries to extract the text first and then falls back on OCR.
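
    A minimal sketch of how that might look in a tika-config.xml, reusing the PDFParser entry from the answer above. The value auto comes from this answer; type="string" is an assumption, and older Tika versions may not recognise the auto strategy:

    <!-- Sketch only: goes inside <parsers> in tika-config.xml -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- assumption: ocrStrategy is passed as a string param -->
        <param name="ocrStrategy" type="string">auto</param>
      </params>
    </parser>

    The file would be passed the same way as above, with --config=tika-config.xml.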
