
I am using Databricks for a specific workload. The workload involves approximately 10 to 200 dataframes that are read from and written to a storage location, and it can benefit from parallelism.
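For context, this is roughly how the per-dataframe work looks; the paths, names, and the `process_one` transform below are placeholders, not the real pipeline:

    # Minimal sketch of the per-dataframe read/transform/write loop, run in
    # parallel with a thread pool so Spark can schedule the jobs concurrently.
    # Paths and the transformation are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def process_one(name: str) -> None:
        df = spark.read.parquet(f"/mnt/raw/{name}")          # read one dataframe
        out = df.dropDuplicates()                             # placeholder transform
        out.write.mode("overwrite").parquet(f"/mnt/curated/{name}")

    names = [f"table_{i}" for i in range(50)]                 # 10 to 200 in practice
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(process_one, names))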

The constraint I have is cost optimization. Instances are billed for a minimum of 1 hr, so if a workload takes less than 1 hr I am losing money. Also, all jobs should be completed in less than 1 hr.

The Databricks way to cost optimization is autoscaling, but here is what happens (consider a constant instance type):

  1. If I don't use autoscaling and use only 1 worker, then for 50 dataframes the job takes 30 min, but for 200 dataframes it takes 2 hr. 2 hr is not acceptable, so it would make sense to increase the number of workers. If I increase the number of workers to 3, the 200-dataframe job takes 45 min, but the 50-dataframe job takes only 12 min to run. This is a problem because instances are billed for a minimum of 1 hr, so I am losing a lot of money on those 50-dataframe jobs.

  2. To overcome the above, one would say to use autoscaling, but what happens is: when the 50-dataframe job starts, after 5 min Databricks autoscales the cluster from 1 worker (I had set the minimum workers to 1) to 5 workers (Databricks scales up in steps of 4 workers). The job then finishes in under 15 min, so again I am losing money. This works like a charm for the larger workloads, but most of the jobs are small. Also, the 1 hr time limit is a hard limit, so no job should run for more than 1 hr.

Any thoughts on how to overcome this?

Here are some things I have tried or searched for:

  1. Setting the number of instances beforehand -> This won't be possible because the size of the workload can only be determined after the job starts.

  2. Manipulating the number of executors and cores per executor using
    .config("spark.executor.cores", "1").config("spark.executor.instances", "1")
    Didn't work (see the sketch after this list).

  3. Controlling autoscaling from the driver code -> Not possible on Databricks.
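Regarding attempt 2: on Databricks the SparkSession already exists by the time the driver code runs, so `.config(...)` calls at that point generally have no effect, which is likely why that attempt didn't work. Executor-level Spark confs are normally placed on the cluster definition instead. A rough sketch of where they would go, assuming an illustrative job-cluster spec (runtime version, node type, and values are placeholders, not recommendations):

    # Sketch only: Spark confs belong on the (job) cluster definition, not on
    # the driver's SparkSession.builder. All values below are placeholders.
    new_cluster = {
        "spark_version": "13.3.x-scala2.12",   # placeholder runtime
        "node_type_id": "i3.xlarge",           # placeholder instance type
        "num_workers": 1,
        "spark_conf": {
            "spark.executor.cores": "1",       # applied at cluster start
            "spark.executor.instances": "1",
        },
    }
    # This dict would be used as the "new_cluster" block of a job definition.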

P.S. -> I am scheduling my job in Databricks and the driver uses PySpark to run the workload.

2 Answers


  1. Databricks’ pay-as-you-go pricing is billed on a per-second usage granularity, not hourly.
    https://www.databricks.com/product/pricing

    Pricing documentation usually posts prices in the format of $x per DBU/hr, but this is just for simplicity and is consistent with other providers' pricing documentation, such as AWS; billable usage is per second.


    Also keep in mind your cloud costs, such as AWS EC2 and EBS costs, still apply, so to keep total cost of ownership (TCO) down you should use the most cost-effective but performant hardware for your workloads (if it takes 12 minutes to complete an hourly job, that’s great!).
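    To make the per-second point concrete, here is a rough back-of-the-envelope calculation; every rate below is an illustrative placeholder, not a published price:

        # Illustrative only: the DBU consumption rate, DBU price and instance
        # price are made-up placeholders; the point is per-second proration.
        nodes = 4                      # 1 driver + 3 workers
        runtime_hours = 12 / 60        # a 12-minute job, billed per second
        dbu_per_node_hour = 0.75       # placeholder DBU rate
        dbu_price = 0.15               # placeholder $/DBU
        instance_price = 0.30          # placeholder $/instance-hour

        dbu_cost = nodes * dbu_per_node_hour * dbu_price * runtime_hours
        cloud_cost = nodes * instance_price * runtime_hours
        print(round(dbu_cost, 3), round(cloud_cost, 3), round(dbu_cost + cloud_cost, 3))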

    Other tips for cost-effectiveness:

    1. Use spot instances. If the workload doesn't necessarily need 100% reliability, you can use "Spot with fallback to on-demand" in your cluster, which gets you a lower price whenever spot instances are available.
    2. Try the Databricks Photon engine. If your Spark transformations are compatible, Photon can greatly reduce the overall duration of the job (I am not a Databricks employee, but I have seen up to 3x in my experience). It costs more per DBU, but it consumes fewer DBUs, and if it reduces the runtime it usually reduces overall TCO. A cluster-spec sketch covering both tips follows below.
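    As a sketch of how both tips might look in a cluster spec on AWS (field names follow the Clusters API as I understand it; runtime version, node type, and counts are placeholders):

        # Sketch of a job cluster combining spot-with-fallback and Photon.
        # All concrete values are placeholders.
        new_cluster = {
            "spark_version": "13.3.x-scala2.12",
            "runtime_engine": "PHOTON",                # enable Photon
            "node_type_id": "i3.xlarge",
            "num_workers": 3,
            "aws_attributes": {
                "availability": "SPOT_WITH_FALLBACK",  # spot, fall back to on-demand
                "first_on_demand": 1,                  # keep the driver on-demand
            },
        }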

    Lastly, as a side note: Databricks' Delta Live Tables (DLT) has an "Enhanced Autoscaler" that is different from the traditional Workflows/Jobs autoscaler. In my experience the DLT Enhanced Autoscaler performs better, especially on scale-down events, whereas the regular Jobs/Workflows autoscaler may immediately scale up but hardly ever scales down.
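    For reference, a DLT pipeline opts into that mode in its cluster settings; a minimal sketch, with placeholder worker counts:

        # Sketch: DLT pipeline cluster with Enhanced Autoscaling enabled.
        # Worker counts are placeholders.
        clusters = [
            {
                "label": "default",
                "autoscale": {"min_workers": 1, "max_workers": 5, "mode": "ENHANCED"},
            }
        ]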

  2. Yes, your biggest misconception here is that Databricks charges per hour, which is not correct. Databricks charges per second, so a lot of your concerns don't apply.

    If cost savings are your goal, autoscaling may or may not be helpful. If you're running a production Databricks Job, a small fixed cluster can be cheaper than autoscaling, because autoscaling has to grow and shrink your cluster, and the new instances need time to "spin up." During that spin-up time you are paying but no processing is occurring.
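    In cluster-spec terms, the trade-off is roughly the following (worker counts are placeholders):

        # Fixed-size job cluster: no scale-up/scale-down delays to pay for.
        fixed = {"num_workers": 3}

        # Autoscaling cluster: only cheaper if the workload is spiky enough to
        # offset the time spent spinning instances up and down.
        autoscaling = {"autoscale": {"min_workers": 1, "max_workers": 5}}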

    We wrote a blog article that goes into the details of the cost performance of Databricks' autoscaling.
