I’m working with a Delta table in Databricks that is partitioned by a Year column, and the table also has a timestamp column. I want to optimize my queries to fetch data within a specific timestamp range (e.g., between 2019-01-21 and 2019-12-04):
df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))
I understand that partitioning can help with performance, but I’m not sure how to leverage it effectively when querying based on a timestamp range. Even though I’m not directly filtering on the Year partition column, I’d like to take advantage of partition pruning so that only the relevant partitions are read.

How can I ensure that the Year partition column is correlated with the timestamp data so that partition pruning works optimally for timestamp range queries?
2 Answers
Documentation for Delta table batch reads and writes mentions that Delta Lake can generate partition filters for a query whenever a partition column is defined as a generated column over a timestamp column (for example, YEAR(timestamp) or CAST(timestamp AS DATE)).
It means that if you have a table defined as:
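(A minimal sketch using the delta-spark Python API; the table name events and the id column are illustrative, and spark is the SparkSession that Databricks provides.)

from delta.tables import DeltaTable

# Create a Delta table where "year" is a generated column derived
# from the "timestamp" column, and partition by it.
(
    DeltaTable.createIfNotExists(spark)
    .tableName("events")
    .addColumn("id", "BIGINT")
    .addColumn("timestamp", "TIMESTAMP")
    .addColumn("year", "INT", generatedAlwaysAs="YEAR(timestamp)")
    .partitionedBy("year")
    .execute()
)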
then Databricks will be able to analyze filter conditions on the timestamp column and derive proper partition filters for year. With your example, it will add a partition filter year = 2019.
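With such a generated column in place, the filter from the question can be used as-is (a sketch; events is the illustrative table defined above):

from pyspark.sql.functions import col

df = spark.table("events")

# Delta analyzes the timestamp predicates and derives year = 2019,
# so only the 2019 partition is scanned.
df_filtered = df.filter(
    (col("timestamp") >= "2019-01-21") & (col("timestamp") <= "2019-12-04")
)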
To optimize timestamp range queries in Databricks on a Delta table partitioned by a Year column, make sure the Year partition column is correlated with the timestamp data. The code below shows one way to set this up.
Here is how you can do it with PySpark.
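(A minimal sketch; the sample rows and values below are illustrative, not from the original post.)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, year

spark = SparkSession.builder.getOrCreate()

# Illustrative data with a string timestamp, cast to TimestampType.
df = spark.createDataFrame(
    [(1, "2019-01-21 10:30:00"), (2, "2019-12-04 18:45:00")],
    ["id", "timestamp"],
).withColumn("timestamp", col("timestamp").cast("timestamp"))

# Convert the timestamp column to a date column and derive a
# year column to partition by.
df = (
    df.withColumn("date", to_date(col("timestamp")))
      .withColumn("year", year(col("timestamp")))
)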
In the above code, the timestamp column is converted to a date column and a year column is created for partitioning.
Next, write the DataFrame as a Delta table partitioned by the "year" column:
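(The path /tmp/delta/events is an assumed example location.)

# Write the DataFrame as a Delta table, physically partitioned by "year".
(
    df.write.format("delta")
    .mode("overwrite")
    .partitionBy("year")
    .save("/tmp/delta/events")
)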
Finally, read the Delta table and filter the data within the specified timestamp range:
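(Again assuming the example path; since this table’s year column is written manually rather than generated, the explicit year predicate is what enables pruning.)

# Read the Delta table back and filter on the timestamp range.
df_read = spark.read.format("delta").load("/tmp/delta/events")

df_filtered = df_read.filter(
    (col("timestamp") >= "2019-01-21") & (col("timestamp") <= "2019-12-04")
)

# Unless "year" is a generated column (see the first answer), add an
# explicit partition predicate so only the 2019 partition is read:
df_pruned = df_filtered.filter(col("year") == 2019)
df_pruned.show()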