I've been running a pipeline in Azure for 4 months, and it suddenly broke last night. I have the following code:
!pip install tabula-py
from io import BytesIO
from tabula.io import read_pdf
import tabula
df = tabula.io.read_pdf(BytesIO(pdf_content), pandas_options={'header': None}, pages=3, stream=True)[0]
It now suddenly fails with this error:
~/cluster-env/env/lib/python3.8/site-packages/tabula/io.py in __init__(self, java_options, silent)
92
93 from java import lang
---> 94 from org.apache.commons import cli
95 from technology import tabula
96
ModuleNotFoundError: No module named 'org.apache.commons'
Any help would be appreciated.
2 Answers
The same happened to me today in a Databricks environment, after tabula had been running smoothly for 6 months. My hotfix was to pip install version 2.7.0, as I suspect the error is caused by the latest version, 2.8.1, which was published today.
Installing version 2.7.0 with the command pip install tabula-py==2.7.0 worked for me as well.
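Since an unpinned pip install tabula-py picks up whatever release is newest, it may help to confirm which version the cluster actually has before calling read_pdf. Below is a minimal sketch of such a check; the helper names installed_version and matches_pin are hypothetical and not part of tabula-py, and 2.7.0 is simply the version the answers here report as working.

```python
from importlib.metadata import PackageNotFoundError, version

# Version the answers above report as working; adjust as needed.
PINNED_VERSION = "2.7.0"


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package="tabula-py", pinned=PINNED_VERSION):
    """Return True only if `package` is installed at exactly the pinned version."""
    return installed_version(package) == pinned


if __name__ == "__main__":
    found = installed_version("tabula-py")
    if not matches_pin():
        print(f"tabula-py is {found!r}, expected {PINNED_VERSION}; "
              f"try: pip install tabula-py=={PINNED_VERSION}")
```

In a notebook, the install cell would then be !pip install tabula-py==2.7.0 in place of the unpinned command from the question.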