I have a Python notebook running the following imports on a Databricks cluster to install and run the Microsoft Presidio library to anonymise data:
# Notebook-scoped installs of the Presidio packages
%pip install presidio_analyzer
%pip install presidio_anonymizer

# Download and load the large English spaCy model that Presidio's analyzer uses
import spacy.cli
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

import csv
import pprint
import collections
from typing import List, Iterable, Optional, Union, Dict
import pandas as pd
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine, RecognizerResult, DictAnalyzerResult
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import EngineResult
The code works fine when run through the Databricks notebook UI, but when the notebook is called as a step in an Azure Data Factory (ADF) pipeline, it fails with the following error:
"runError": "ImportError: cannot import name dataclass_transform"
From trial and error in the Databricks UI, I determined that the error was caused by missing parts of the imported libraries, and the install commands at the beginning of the code resolved it when the notebook was run in the Databricks UI. I cannot work out why the same step fails when called from an ADF pipeline.
2 Answers
The solution was that the libraries needed to be installed directly on the cluster via the Compute tab in the Databricks UI. I am unclear why the install commands failed to run when the notebook was called from an Azure Data Factory pipeline. If anyone has a clear answer as to why, please expand on my answer.
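If you want to automate the same fix instead of clicking through the Compute tab, cluster-scoped libraries can also be installed through the Databricks Libraries API. A minimal sketch, assuming a personal access token and a known cluster ID; the placeholder host, token, and cluster values are mine, not from the original post:

import requests

# Hypothetical placeholders - substitute your workspace URL, token and cluster ID
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

# Cluster-scoped installs are available to every notebook and job on the
# cluster, unlike %pip installs, which are scoped to one notebook session
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": CLUSTER_ID,
        "libraries": [
            {"pypi": {"package": "presidio-analyzer"}},
            {"pypi": {"package": "presidio-anonymizer"}},
        ],
    },
)
resp.raise_for_status()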
I recently had a similar issue in my environment. It looks like it is caused by spaCy version 3.5.0. I downgraded (explicitly pinned) to version 3.3.0 (3.4.0 may also work) and it worked again.
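If that is the cause, pinning spaCy before installing the Presidio packages should avoid the error. A minimal sketch of the install cell; the 3.3.0 pin comes from this answer, while the comment about typing_extensions is my assumption about why spaCy 3.5.0 triggers the ImportError (dataclass_transform only exists in newer typing_extensions releases than some Databricks runtimes ship):

# Pin spaCy below 3.5.0 before installing Presidio, so the dependency
# resolver does not pull in a spaCy/pydantic combination that imports
# dataclass_transform from a typing_extensions version the cluster lacks
%pip install spacy==3.3.0
%pip install presidio_analyzer presidio_anonymizer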