
I have a Python notebook that runs the following imports on a Databricks cluster:

%pip install presidio_analyzer
%pip install presidio_anonymizer
import spacy.cli
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
import csv
import pprint
import collections
from typing import List, Iterable, Optional, Union, Dict
import pandas as pd
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine, RecognizerResult, DictAnalyzerResult
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import EngineResult

These commands install and run the Microsoft Presidio library to anonymise data.

The code runs fine when the notebook is executed through the Databricks UI, but when the notebook is called as a step in an Azure Data Factory pipeline, it gives the following error:

"runError": "ImportError: cannot import name dataclass_transform"

From trial and error in the Databricks UI, I determined that this error is caused by parts of the imported libraries being missing; the %pip install commands at the beginning of the code resolved it in Databricks notebooks.

I cannot work out why the same step fails when the notebook is called from an ADF pipeline.
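
For reference, a quick diagnostic is to print the package versions that each execution context actually resolves, since this ImportError usually points to a version mismatch between spaCy and its pydantic/typing_extensions dependencies (dataclass_transform lives in typing_extensions). A minimal sketch, run once from the notebook UI and once from the ADF-triggered job:

from importlib.metadata import version, PackageNotFoundError

# Print what is actually installed in this Python process; differing versions
# between the UI run and the ADF run would explain the differing behaviour.
for pkg in ["spacy", "pydantic", "typing-extensions",
            "presidio-analyzer", "presidio-anonymizer"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")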

2 Answers


  1. Chosen as BEST ANSWER

    The solution was that the libraries needed to be installed directly on the cluster via the Compute tab in the Databricks UI. I am unclear why the install commands failed when called from an Azure Data Factory pipeline; if anyone has a clear answer as to why, please expand on my answer.
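
    A possible explanation, though only my assumption: notebook-scoped %pip installs only affect the current Python process, and versions preinstalled on the job cluster (or already imported) can keep shadowing the newly installed ones. Before falling back to cluster-level installs, one thing worth trying is restarting the Python process after the installs with Databricks' dbutils.library.restartPython(), e.g.:

    %pip install presidio_analyzer presidio_anonymizer
    # Restart the Python interpreter so the freshly installed packages and their
    # upgraded dependencies are picked up. Note: nothing after this call runs in
    # the same cell, so do the spaCy model download and imports in a later cell.
    dbutils.library.restartPython()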


  2. I had a similar issue in my environment recently. It appears to be caused by spaCy version 3.5.0: explicitly pinning spaCy to version 3.3.0 (3.4.0 may also work) made it work again.
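
    If you go the pinning route, a minimal sketch (the version number comes from this answer, not from checking current releases); installing everything in one command lets pip resolve the Presidio packages against the pinned spaCy instead of upgrading it afterwards:

    %pip install spacy==3.3.0 presidio_analyzer presidio_anonymizer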
