I have installed Apache Tika 1.8 and it runs perfectly, except that the OCR part is not working. I have Tesseract installed, and it also works properly on its own.
When I try to send a PDF with an image in it, I get the following:
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you’ve excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Can I configure the TikaConfig using the command line utility, or do I have to clone the project, update the POMs, and rebuild? I really do not want to have to do that.
There is some info here on how to use the command line utility with a TikaConfig, but I cannot figure out how to enable the TesseractOCRParser with it.
Any help is greatly appreciated.
3 Answers
OK, so with the help of this post on the Apache Tika forum (thank you, guys), I managed to get it working. It's a hack, but it works. What I did was extract the tika-app jar file, then locate PDFParser.properties and change the following properties like this
Then locate TesseractOCRConfig.properties and change this one property to 1.
Save both properties files and zip everything up again. Use the newly zipped jar file, and it will extract both the regular text and the text from images in a PDF file.
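As a rough sketch of that workflow, the JDK's `jar` tool can extract and write back just the two properties files, which avoids rebuilding the archive and its manifest by hand. The jar name `tika-app.jar`, the helper name `patch_tika_jar`, and the two resource paths are assumptions for illustration; list the archive with `jar tf` to confirm where the files live in your version.

```shell
#!/bin/sh
# Sketch: patch two properties files inside the Tika app jar in place.
# TIKA_JAR and both resource paths are assumptions; confirm them with
#   jar tf "$TIKA_JAR" | grep properties
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"
PDF_PROPS="org/apache/tika/parser/pdf/PDFParser.properties"
OCR_PROPS="org/apache/tika/parser/ocr/TesseractOCRConfig.properties"

# Hypothetical helper: extract, edit by hand, and write back.
patch_tika_jar() {
    target="$1"
    cp "$target" "$target.bak"                    # keep a backup of the original jar
    jar xf "$target" "$PDF_PROPS" "$OCR_PROPS"    # extract only the two files
    "${EDITOR:-vi}" "$PDF_PROPS" "$OCR_PROPS"     # change the properties by hand
    jar uf "$target" "$PDF_PROPS" "$OCR_PROPS"    # update the jar; manifest is left alone
}

if [ -f "$TIKA_JAR" ]; then
    patch_tika_jar "$TIKA_JAR"
else
    echo "jar not found: $TIKA_JAR" >&2
fi
```

Because `jar uf` updates entries in the existing archive, the result should stay runnable with `java -jar`, but keep the `.bak` copy until you have checked that.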
I tried user3250052’s approach, but I was unable to recompress the jar file in a way that was executable. That’s owing to my own inexperience with Java, but regardless, the less hacky way is to pass a custom Tika config file when calling Tika:
This is what my tika-config.xml looks like:
To build that config file, first I ran:
That dumps the default config. I took that output, put it into tika-config.xml, and added:
which I gleaned from https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox) (option 1).
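Putting those steps together, a minimal sketch might look like the following. The jar name `tika-app.jar` and the input file `scanned.pdf` are placeholders, and the config is a pared-down guess at the shape described on the wiki page (default parsers plus a `PDFParser` with `extractInlineImages` enabled), not a copy of the file used in this answer. The Tika app also has `--dump-minimal-config` and `--dump-current-config` options for printing a starting point.

```shell
#!/bin/sh
# Sketch: write a custom tika-config.xml and pass it to the Tika app.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"   # placeholder jar name
INPUT="${INPUT:-scanned.pdf}"          # placeholder input file

# Default parsers for everything except PDF, plus a PDFParser with
# inline-image extraction switched on so embedded images reach Tesseract.
cat > tika-config.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF

if [ -f "$TIKA_JAR" ] && [ -f "$INPUT" ]; then
    # --config must come before the output option
    java -jar "$TIKA_JAR" --config=tika-config.xml --text "$INPUT"
else
    echo "would run: java -jar $TIKA_JAR --config=tika-config.xml --text $INPUT"
fi
```

Keeping the config in its own file means the stock jar stays untouched, so upgrading Tika only requires pointing `TIKA_JAR` at the new file.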
Even though Tesseract is enabled by default (so OCR will work out of the box on image files), PDFs do not get OCRed without that option set because, as noted in the above link, “by default, extracting inline images is turned off because some rare PDFs contain thousands of inline images per page, and it has a big hit on performance, both memory usage and time”.
Now everything (OCR on image files, OCR of images embedded in PDFs or of image-based PDFs, and of course text extraction from text-based PDFs) works with the Tika Java app. I found plenty of documentation on getting this to work with the Tika server but very little on the Tika app, so I’m hoping this saves someone the few hours it took me to figure that out (let me know if it does).
I would recommend using
ocrStrategy auto
This tries text extraction first and then falls back to OCR.
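One way to apply that setting without editing the jar is through a config file. The sketch below assumes the PDF parser accepts an `ocrStrategy` parameter of type `string` in tika-config.xml, and that your Tika version supports the `auto` value; the helper name `write_pdf_ocr_config` and the output file name are made up for illustration.

```shell
#!/bin/sh
# Sketch: emit a tika-config.xml that sets the PDF parser's OCR strategy.
# The param name and its string type are assumptions; check them against
# the PDFParser documentation for your Tika version.
write_pdf_ocr_config() {
    strategy="${1:-auto}"   # e.g. no_ocr, ocr_only, ocr_and_text_extraction, auto
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">${strategy}</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}

# Write the config; pass it to the Tika app with --config=tika-config-ocr.xml
write_pdf_ocr_config auto > tika-config-ocr.xml
```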