Azure - PySpark Tabula-Py Read_PDF (ERROR: No module named 'org.apache.commons')

mohamadmaarouf_
September 18, 2023

I’ve been running a pipeline in Azure for 4 months and it suddenly broke last night. I have the following code:

!pip install tabula-py
from io import BytesIO
from tabula.io import read_pdf
import tabula
# pdf_content holds the raw bytes of the PDF
df = tabula.io.read_pdf(BytesIO(pdf_content), pandas_options={'header': None}, pages=3, stream=True)[0]

Now I suddenly get this error:

~/cluster-env/env/lib/python3.8/site-packages/tabula/io.py in __init__(self, java_options, silent)
     92 
     93         from java import lang
---> 94         from org.apache.commons import cli
     95         from technology import tabula
     96 

ModuleNotFoundError: No module named 'org.apache.commons'

Any help would be appreciated.

Answers

- jlwwu
- September 11, 2023 at 12:00 pm
- 0 votes
The same happened to me today in a Databricks environment after tabula had been running smoothly for 6 months. My hotfix was to pip install version 2.7.0, as I suppose the error is caused by the most recent version, 2.8.1, which was published today.


- LiamAulph
- September 12, 2023 at 9:32 pm
- 0 votes
Installing version 2.7.0 with the command pip install tabula-py==2.7.0 worked for me as well.
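
For a scheduled pipeline, the pin suggested in both answers could be applied from Python so that a new release is not picked up silently. The following is only a minimal sketch: the helper names `pin_spec`, `installed_version`, and `ensure_pinned` are hypothetical, and 2.7.0 is assumed to be the working version solely because the answers above report it.

```python
import subprocess
import sys
from importlib import metadata

# Assumption: 2.7.0 is the version reported as working in the answers above.
PINNED_VERSION = "2.7.0"


def pin_spec(package, version):
    """Build a pip requirement string such as 'tabula-py==2.7.0'."""
    return f"{package}=={version}"


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def ensure_pinned(package="tabula-py", version=PINNED_VERSION):
    """Install the pinned version with pip unless it is already present."""
    if installed_version(package) != version:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pin_spec(package, version)]
        )
```

Calling `ensure_pinned()` before the first `import tabula` in the notebook would have the same effect as `pip install tabula-py==2.7.0`, but skips the install when the pinned version is already in the environment.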
