I am having issues with a series of pipelines that build our data platform Spark databases hosted in Azure Synapse.
The pipelines host dataflows which have 'recreate table' enabled. The dataflows extract data and are supposed to recreate the tables each time the pipelines run. There is also a step at the start of the job that drops all the tables. However, the jobs fail intermittently at different stages, with errors that look like the one below (sensitive system details have been removed):
Operation on target failed: {"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'sinkname': Spark job failed in one of the cluster nodes while writing data in one of the partitions to sink, with following error message: Failed to rename VersionedFileStatus{VersionedFileStatus{path=abfss://synapsename.dfs.core.windows.net/synapse/workspaces/synapsename/warehouse/databasename.db/tablename/.name removed/_temporary/0/_temporary/idremoved/part-idremoved.snappy.parquet; isDirectory=false; length=636844; replication=1; blocksize=268435456; modification_time=1731778904698; access_time=0; owner=81aba2ef-674d-4bcb-a036-f4ab2ad78d3e; group=trusted-service-user; permission=rw-r-----; isSymlink=false; hasAcl=true; isEncrypted=false; isErasureCoded=false}; version='0x8DD0665F02661DC'} to abfss://[email protected]/synapse/workspaces/synapsename/warehouse/dataplatform","Details":null}
The failure can hit any Spark database table load. It might not occur at all the next day and then reoccur a few days later.
As a workaround, we go to the Synapse backend data lake storage, manually delete the Spark database table (the parquet files), and rerun the job, which then succeeds. We have also tried increasing the resources, including the Spark runtime.
Any thoughts, anyone?
2 Answers
It seems like you're dealing with an intermittent issue where Spark fails while renaming temporary files in your Data Lake, probably because of file locks or race conditions. Since manually deleting the table resolves it temporarily, the likely cause is leftover files or contention in the storage. You could try changing how you partition the data to reduce the chance of multiple processes writing to the same file at once, or scale up your Spark resources to handle more parallelism. It might also help to add a cleanup step before each run that clears out any old files or locks. Also check whether your Azure storage is being throttled or hitting other performance limits. Finally, enabling retries for those file operations could make the job more robust against these occasional failures.
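As a rough illustration of the retry idea, here is a minimal Python sketch of retrying a flaky operation with exponential backoff. The name `run_with_retries` and its parameters are made up for this example and are not part of any Synapse or Spark API. In a dataflow-based pipeline you would more likely rely on the retry settings of the activity itself; this only shows the pattern for code you control, such as a notebook step.

```python
import time


def run_with_retries(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() and retry on failure with exponential backoff.

    Hypothetical helper for illustration only.
    operation  -- zero-argument callable, e.g. a function that writes a table
    attempts   -- total number of tries, including the first one
    base_delay -- seconds to wait after the first failure; doubles each time
    sleep      -- injectable so the waiting can be replaced in tests
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            # Out of tries: surface the last error to the caller.
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

You would pass in a function that performs the write, for example `run_with_retries(lambda: write_table(df), attempts=4, base_delay=30)`, where `write_table` stands in for your own code. Retrying only helps if the failure is transient; if a stale file is blocking the rename, every attempt will fail until that file is removed.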
Set the concurrency to 1. Typically the culprit is the _temporary file.
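If leftover `_temporary` folders are the cause, a pre-run cleanup step could look roughly like the Python sketch below. The function names and the `list_dir` / `remove_dir` callables are hypothetical. The storage access is passed in so the logic stays independent of any particular client; in a Synapse notebook these callables might wrap `mssparkutils.fs.ls` and `mssparkutils.fs.rm`, but that wiring is an assumption and is not shown here.

```python
def find_temporary_dirs(root, list_dir):
    """Return the paths of all directories named `_temporary` under root.

    list_dir(path) is a caller-supplied function (hypothetical) that returns
    an iterable of (child_path, is_dir) tuples for the given directory.
    """
    found = []
    stack = [root]
    while stack:
        current = stack.pop()
        for child_path, is_dir in list_dir(current):
            if not is_dir:
                continue
            name = child_path.rstrip("/").rsplit("/", 1)[-1]
            if name == "_temporary":
                # Record the folder; no need to look inside it.
                found.append(child_path)
            else:
                stack.append(child_path)
    return sorted(found)


def clean_temporary_dirs(root, list_dir, remove_dir):
    """Remove every leftover `_temporary` directory and return their paths.

    remove_dir(path) is a caller-supplied function (hypothetical) that
    deletes a directory recursively.
    """
    leftovers = find_temporary_dirs(root, list_dir)
    for path in leftovers:
        remove_dir(path)
    return leftovers
```

Only run a cleanup like this when nothing else is writing to the same tables, since `_temporary` is also where an in-progress Spark write keeps its files. That is one more reason to keep the concurrency at 1.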