I’m starting a Databricks notebook from ADF to do some preprocessing tasks.
The notebook’s cluster is usually not running and should only run while the ADF pipeline is running.
But it takes several minutes for the compute cluster to start, which of course slows down the execution of the pipeline.
My question is whether it is possible to trigger the cluster start at an earlier stage of the ADF pipeline, so that it is already starting in the background while the earlier stages are still running. That way, I could speed up the pipeline overall.
I already searched the Databricks menu and the ADF menu and toolbars but didn’t find a solution.
Thanks for your help!
2 Answers
I think there is an option called Existing instance pool in the Databricks linked service in ADF.
Please refer to this video for more info: https://www.youtube.com/watch?v=VZggcUdIO14.
To use an existing instance pool, I think the pool first has to be created in the Databricks workspace, with some idle instances kept in it.
For info on creating pools, refer to this link: https://learn.microsoft.com/en-us/azure/databricks/clusters/instance-pools/create?source=recommendations
You can use a cluster pool. Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances. When a cluster is attached to a pool, its nodes are created from the pool’s idle instances. If the pool has no idle instances, it expands by allocating a new instance from the instance provider to accommodate the cluster’s request. When a cluster releases an instance, the instance returns to the pool and is free for another cluster to use. Only clusters attached to a pool can use that pool’s idle instances. You can check link
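To make the pool idea a bit more concrete, here is a minimal sketch in Python of the two request bodies involved: one to create a pool through the Databricks REST API (`POST /api/2.0/instance-pools/create`) and one cluster spec that draws its nodes from that pool. The helper names, the pool name, the node type, the runtime version and the pool ID are placeholders I made up for illustration. The sketch only builds the JSON and does not call the workspace, so treat it as a starting point rather than a tested setup.

```python
import json


def build_pool_request(name, node_type_id, min_idle, idle_timeout_minutes):
    """Build the body for POST /api/2.0/instance-pools/create.

    All argument values are supplied by the caller; the ones used in the
    example below are placeholders, not recommendations.
    """
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        # Number of instances the pool keeps warm even when no cluster uses them.
        "min_idle_instances": min_idle,
        # Idle instances above the minimum are released after this many minutes.
        "idle_instance_autotermination_minutes": idle_timeout_minutes,
    }


def build_cluster_spec(pool_id, spark_version, num_workers):
    """Build a cluster spec whose nodes come from the pool.

    node_type_id is left out on purpose: the pool determines the node type.
    """
    return {
        "instance_pool_id": pool_id,
        "spark_version": spark_version,
        "num_workers": num_workers,
    }


if __name__ == "__main__":
    # Placeholder values (assumptions): adjust to your workspace.
    pool_body = build_pool_request(
        name="adf-preprocessing-pool",
        node_type_id="Standard_DS3_v2",
        min_idle=2,
        idle_timeout_minutes=30,
    )
    cluster_spec = build_cluster_spec(
        pool_id="<pool-id-returned-by-create>",
        spark_version="13.3.x-scala2.12",
        num_workers=2,
    )
    print(json.dumps(pool_body, indent=2))
    print(json.dumps(cluster_spec, indent=2))
```

One trade-off to keep in mind: instances held by `min_idle_instances` stay allocated while idle, so Azure still bills the VMs even when the pipeline is not running. Setting it to 0 avoids that cost, but then the first cluster start has to wait for new instances again.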