How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility? - Apache

Dunski
August 2, 2018

I have installed Apache Tika 1.8 and it runs perfectly, except that the OCR part does not work. I have Tesseract installed, and it also works properly.
When I send it a PDF with an image in it, I get the following:

WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

Can I configure the TikaConfig using the command line utility? Or do I have to clone the project, update the POMs, and rebuild? I really do not want to have to do that.

There is some info here on how to use the command line utility with a TikaConfig, but I cannot figure out how to enable the TesseractOCRParser with it.

Any help is greatly appreciated.

Answers

Chosen as BEST ANSWER
- Dunski
- August 3, 2018 at 11:00 am
- 0 votes
OK, so I got it working with the help of this post on the Apache Tika forum. Thank you, guys.

It's a hack, but it works. I extracted the tika-app jar file, located PDFParser.properties, and changed the following properties like this:
```
extractInlineImages true 
extractUniqueInlineImagesOnly false 
ocrStrategy ocr_and_text_extraction
```
Then locate TesseractOCRConfig.properties and change this one property to 1:
```
enableImageProcessing=1
```
Save both properties files and zip everything up again. Use the new zipped-up jar file, and it will now extract both the regular text and the text in images from a PDF file.
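For anyone who struggles to re-zip the jar so that it still runs, here is a minimal sketch of the same hack that patches the two files in place instead of repacking everything. It assumes the JDK `jar` tool is on your PATH and that the in-jar paths shown are right for your Tika version (check with `jar tf tika-app.jar`). The function names `set_prop` and `patch_tika_jar` are made up for this example.

```shell
#!/bin/sh
# Sketch only. Assumes the JDK "jar" tool is on PATH and that the two
# in-jar paths below match your Tika version (verify with "jar tf").

PDF_PROPS=org/apache/tika/parser/pdf/PDFParser.properties
OCR_PROPS=org/apache/tika/parser/ocr/TesseractOCRConfig.properties

# set_prop FILE KEY VALUE
# Replace the value of KEY, keeping the file's own separator (space or "=").
set_prop() {
    sed "s|^\($2\)\([ =]\).*|\1\2$3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# patch_tika_jar JAR
# Pull the two properties files out of JAR, edit them, and write them back.
patch_tika_jar() {
    jar_abs=$(cd "$(dirname "$1")" && pwd)/$(basename "$1") || return 1
    work=$(mktemp -d) || return 1
    (
        cd "$work" || exit 1
        jar xf "$jar_abs" "$PDF_PROPS" "$OCR_PROPS" || exit 1
        set_prop "$PDF_PROPS" extractInlineImages true
        set_prop "$PDF_PROPS" extractUniqueInlineImagesOnly false
        set_prop "$PDF_PROPS" ocrStrategy ocr_and_text_extraction
        set_prop "$OCR_PROPS" enableImageProcessing 1
        # "jar uf" updates the entries in place and leaves the manifest untouched.
        jar uf "$jar_abs" "$PDF_PROPS" "$OCR_PROPS"
    )
}
```

It is probably safest to work on a copy, for example `cp tika-app.jar tika-app-ocr.jar && patch_tika_jar tika-app-ocr.jar`. Because the manifest is never rewritten, the patched jar should still start with `java -jar`.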


- SsshirazzZ
- February 26, 2020 at 12:58 am
- 0 votes
I tried user3250052’s approach, but I was unable to recompress the jar file in a way that was executable. That’s owing to my own inexperience with Java, but regardless, the less hacky way is to pass a custom Tika config file when calling Tika:
```
java -jar tika-app.jar --config=tika-config.xml image.pdf
```
This is what my tika-config.xml looks like:
```
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  
  <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
  </encodingDetectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```
To build that config file, I first ran:
```
java -jar tika-app.jar --dump-current-config
```
That dumps the default config for you. I put that into tika-config.xml and added:
```
<parser class="org.apache.tika.parser.pdf.PDFParser">
  <params>
    <param name="extractInlineImages" type="bool">true</param>
  </params>
</parser>
```
which I gleaned from https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox) (option 1).

Even though Tesseract is enabled by default (so OCR works out of the box on image files), PDFs do not get OCRed without that option set because, as noted in the above link, “by default, extracting inline images is turned off because some rare PDFs contain thousands of inline images per page, and it has a big hit on performance, both memory usage and time”.

Now everything (OCR on image files, OCR of images in PDFs or of image-based PDFs, and naturally text extraction from text-based PDFs) works with the Tika app (tika-app). I found plenty of documentation on getting this to work with the Tika server but very little on the Tika app, so I’m hoping this saves someone the few hours it took me to figure that out (let me know).
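If you run this often, a small wrapper can save some typing. This is only a sketch: the function name `ocr_pdf` and the `TIKA_JAR` variable are my own placeholders, and it assumes `tesseract` and `java` are on your PATH. The `--config` and `--text` options are the regular tika-app ones.

```shell
#!/bin/sh
# Sketch only. ocr_pdf and TIKA_JAR are placeholder names for this example.

# ocr_pdf CONFIG PDF
# Run tika-app with a custom config and print the extracted text to stdout.
ocr_pdf() {
    config=$1
    pdf=$2
    # Tika shells out to the tesseract binary, so fail early if it is missing.
    command -v tesseract >/dev/null 2>&1 || {
        echo "tesseract not found on PATH" >&2
        return 1
    }
    [ -f "$config" ] || {
        echo "missing config: $config" >&2
        return 1
    }
    java -jar "${TIKA_JAR:-tika-app.jar}" --config="$config" --text "$pdf"
}
```

A call would look like `ocr_pdf tika-config.xml image.pdf > image.txt`.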

- MatthewFord
- June 16, 2020 at 7:04 pm
- 0 votes
I would recommend using ocrStrategy auto.

This tries to extract the text first and then falls back to OCR.
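As a rough sketch of how that could be set without touching the jar, the snippet below writes a small config file with `ocrStrategy` as a PDFParser param. It assumes your Tika version accepts `ocrStrategy` in the config and knows the `auto` value (older releases may not). The function name `write_tika_config` is made up for this example. The `parser-exclude` element keeps the DefaultParser from also registering its own PDFParser.

```shell
#!/bin/sh
# Sketch only. Assumes your Tika version accepts "ocrStrategy" as a
# PDFParser param and knows the "auto" value; older releases may not.

# write_tika_config FILE [STRATEGY]
# Write a minimal Tika config to FILE; STRATEGY defaults to "auto".
write_tika_config() {
    cat > "$1" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Leave PDFs to the explicitly configured parser below. -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">${2:-auto}</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}
```

After `write_tika_config tika-config.xml`, the file can be passed to the app with `java -jar tika-app.jar --config=tika-config.xml image.pdf`.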
