
I am new to Databricks and am currently learning about managed tables.
I have created a managed table in Databricks and, on inspection, it is created in the below location:

dbfs:/user/hive/warehouse/demo.db/race_results_python

The Microsoft documentation states that this is a root directory.

So I have two questions:

  1. Does this mean that Databricks is storing tables in the default Storage Account created during the creation of the Databricks workspace?
  2. If the answer to the above question is yes, is it good practice to store tables here, or should we store them in a separate Storage Account?

For reference, I have looked at the documentation below:
https://learn.microsoft.com/en-us/azure/databricks/dbfs/root-locations

2 Answers


  1. As per this documentation, when you create a table registered to the hive_metastore, it is stored under the DBFS path /user/hive/warehouse.

    DBFS is backed by the default storage account associated with the Databricks workspace. This storage account is managed by Azure and linked to your workspace, and it automatically handles data and metadata storage for managed tables. A quick way to confirm where a given table lives is sketched below.
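    A minimal sketch, assuming a Databricks notebook where spark is available and that demo.race_results_python (from the question) is a Delta table; DESCRIBE DETAIL reports its storage location:

        # Look up where the table's data files actually live.
        location = (
            spark.sql("DESCRIBE DETAIL demo.race_results_python")
                 .select("location")
                 .first()["location"]
        )
        # For a managed hive_metastore table this prints something like
        # dbfs:/user/hive/warehouse/demo.db/race_results_python
        print(location)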

    If the answer to the above question is yes, is it good practice to store tables here, or should we store them in a separate Storage Account?

    For temporary or testing purposes it is OK to use the DBFS default location, but for large datasets, production workloads, and non-tabular files read or written by external systems, you should use external storage, either mounted to a DBFS path or registered in Unity Catalog (see the sketch below).
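    A minimal sketch of keeping table data in a separate storage account; the abfss:// container, storage account name, and table names are placeholders, and the cluster is assumed to already have credentials for that path:

        external_path = "abfss://data@yourstorageaccount.dfs.core.windows.net/demo/race_results"

        # Write the existing managed table's data out to the external path...
        (spark.table("demo.race_results_python")
              .write.format("delta")
              .mode("overwrite")
              .save(external_path))

        # ...then register an external table over those files.
        spark.sql(
            f"CREATE TABLE IF NOT EXISTS demo.race_results_ext USING DELTA LOCATION '{external_path}'"
        )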

    Refer to this documentation for best practices for DBFS and Unity Catalog.

    Generally, it is recommended to use an external storage account.

    The main difference between a managed and an external volume is the drop behaviour: when you drop a managed volume, the underlying metadata and data are deleted, but when you drop an external volume only the metadata is deleted and the underlying data remains unchanged in the external location.

    In this article, Databricks recommends using external volumes for non-tabular data files that are read or written by external systems in addition to Databricks. A sketch of creating one is shown below.
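    A minimal sketch of an external volume in Unity Catalog; the catalog/schema/volume names and the abfss:// path are placeholders, and an external location covering that path must already be configured by an admin:

        # Create an external volume for non-tabular files (placeholder names and path).
        spark.sql("""
            CREATE EXTERNAL VOLUME IF NOT EXISTS main.demo.raw_files
            LOCATION 'abfss://raw@yourstorageaccount.dfs.core.windows.net/demo/raw_files'
        """)

        # Files in the volume are then addressed through a /Volumes path.
        display(dbutils.fs.ls("/Volumes/main/demo/raw_files"))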

  2. Answering your two sub-questions individually below:

    Does this mean that Databricks is storing tables in the default Storage Account created during the creation of the Databricks workspace?

    • Yes. The tables are stored at the default location, /user/hive/warehouse, which lives in the root storage created with the workspace.

    If the answer to the above question is yes, is it good practice to store tables here, or should we store them in a separate Storage Account?

    • If you are using the default schema/database within the hive_metastore, then yes, the tables are stored in that default location. If you create a table without specifying an external location for the data, it is created as a managed table; if you provide an external path, it is created as an external table (a sketch of both follows).
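    A minimal sketch of the two cases in the default database; the table names and the abfss:// path are placeholders:

        # No LOCATION: managed table, data lands under dbfs:/user/hive/warehouse.
        spark.sql("""
            CREATE TABLE IF NOT EXISTS race_results_managed (driver STRING, points INT)
            USING DELTA
        """)

        # Explicit LOCATION: external table, data stays at the path you provide.
        spark.sql("""
            CREATE TABLE IF NOT EXISTS race_results_external (driver STRING, points INT)
            USING DELTA
            LOCATION 'abfss://data@yourstorageaccount.dfs.core.windows.net/demo/race_results_external'
        """)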

    • Now, if you are creating the tables in a custom schema/database that you have created within the hive_metastore catalog, you have two options (a sketch of both follows the list):

    1. Specify a location where all the tables within the schema/database will be stored (this can be an external location). In this case, any table created inside that schema/database without an explicit path is still a managed table, but its data lands at the location specified at the schema/database level.
    2. Don’t specify a location when creating the schema/database. It then points to the default storage location, and all tables you create within that database without specifying a path are stored at that default location as managed tables.
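    A minimal sketch of both options; the schema names and the abfss:// path are placeholders:

        # Option 1: schema with an explicit (external) location.
        spark.sql("""
            CREATE SCHEMA IF NOT EXISTS demo_ext
            LOCATION 'abfss://data@yourstorageaccount.dfs.core.windows.net/demo_ext.db'
        """)
        # Managed table, but its files land under the schema's LOCATION.
        spark.sql("CREATE TABLE IF NOT EXISTS demo_ext.race_results (driver STRING, points INT)")

        # Option 2: schema without a location, so managed tables go to
        # dbfs:/user/hive/warehouse/demo_default.db.
        spark.sql("CREATE SCHEMA IF NOT EXISTS demo_default")
        spark.sql("CREATE TABLE IF NOT EXISTS demo_default.race_results (driver STRING, points INT)")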

    Just to make sure you understand the difference between managed tables and external tables, I am describing it below:

    Managed tables are ones where dropping the table deletes both the table’s metadata in the hive_metastore and the actual data files, so you don’t need to handle deleting the data files separately if that is what your use case requires.

    External tables are ones where dropping the table deletes only the table’s metadata and leaves the actual data files in place. If you want to delete the data files of an external table, you need a separate process to remove them (see the sketch below).
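    A minimal sketch of the drop behaviour, reusing the hypothetical table names and path from the example above:

        spark.sql("DROP TABLE IF EXISTS race_results_managed")   # metadata and data files removed
        spark.sql("DROP TABLE IF EXISTS race_results_external")  # metadata removed, files remain

        # The external table's files are still there and must be cleaned up separately.
        dbutils.fs.rm(
            "abfss://data@yourstorageaccount.dfs.core.windows.net/demo/race_results_external",
            True,  # recursive delete
        )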
