
I have a Databricks DataFrame with

Columns: tno, data_value

Output of first Databricks run:

tno, data_value
1,hdjsjsjnsns
2,dhjdjdjsnsn
3,jdjsjsjsjsjjs

When I run the same notebook again after some time, it should generate:

tno, data_value
4,hdjsjsjnsns
5,dhjdjdjsnsn
6,jdjsjsjsjsjjs

Just like an Oracle or SQL Server "sequence".

I have tried row_number() and monotonically_increasing_id().

But those start again from 1 on every run.

So, think of it as a transactional_key. How can I achieve this in PySpark on Databricks?
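
For reference, this is roughly the pattern I have tried, using a dataframe my_df with the columns above:

    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql.window import Window

    # Both restart their numbering with each run: row_number() starts at 1 again
    # and monotonically_increasing_id() starts at 0 again, so neither continues
    # from where the previous run left off.
    w = Window.orderBy("data_value")
    df_rownum = my_df.withColumn("tno", row_number().over(w))
    df_mono = my_df.withColumn("tno", monotonically_increasing_id())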

2 Answers


  1. To achieve your requirement, you need to get the maximum tno value from the last run's dataframe and use it to transform the column in the current run.

    After the first notebook run, store the dataframe in a Databricks file. In the next run, read the data from that file into a dataframe and get the maximum value of your column, as sketched below.
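
    A minimal sketch of that read step. The Delta format and the path are assumptions for illustration, not something specified above; use whatever location and format you actually saved to:

    # Hypothetical location of the previous run's output
    prev_path = "dbfs:/mnt/demo/tno_data"

    # Load the dataframe that was written at the end of the last run
    my_df = spark.read.format("delta").load(prev_path)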

    After that, add an increasing id column my_id to the dataframe using row_number() starting from 1.

    from pyspark.sql.functions import col, lit, row_number, max as spark_max
    from pyspark.sql.window import Window

    # Highest tno produced by the previous run
    last_max = my_df.select(spark_max(my_df.tno).alias('last_max')).collect()[0]['last_max']
    print(last_max)

    # Fresh 1..n numbering for the rows of the current run
    window = Window.orderBy(col('tno'))
    df_id = my_df.withColumn('my_id', row_number().over(window))
    df_id.show()
    

    Now, add the last_max value to the my_id column, store the result in the tno column, and drop the extra my_id column.

    res_df = df_id.withColumn("tno", col("my_id") + lit(last_max)).drop("my_id")
    res_df.display()
    

    At the end, overwrite the same Databricks file with this dataframe so that the same process can be repeated in the next notebook run.
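
    A sketch of that overwrite step, reusing the hypothetical Delta path from the read sketch above:

    # Persist the re-numbered data; the next run will read this and pick up
    # the new maximum tno as its starting point
    res_df.write.format("delta").mode("overwrite").save(prev_path)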

    Result: the tno column now continues from the previous run's maximum, matching the expected output in the question.

    You can also store just the maximum value of the tno column in a file and use that in the next run, for example:
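
    A sketch of that variant; the path is again a hypothetical placeholder:

    # Persist only the current maximum tno as a tiny one-row table
    max_path = "dbfs:/mnt/demo/tno_max"
    res_df.select(spark_max("tno").alias("last_max")) \
        .write.format("delta").mode("overwrite").save(max_path)

    # In the next run, read this single value instead of scanning the full table
    last_max = spark.read.format("delta").load(max_path).collect()[0]["last_max"]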

    You can go through this reference by @ram.sankarasubramanian to know more about it.

  2. I have tried row_number() and monotonically_increasing_id().

    But those start again from 1 on every run.

    So, think of it as a transactional_key. How can I achieve this in PySpark on Databricks?

    For truly unique numbers and for storing IDs with Delta on Databricks, you can use identity columns.
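
    A minimal sketch of a Delta identity column; the table name demo.tno_data and the dataframe new_df are hypothetical placeholders. The counter lives with the table, so tno keeps incrementing across runs:

    # Delta/Databricks identity column: tno is generated automatically
    spark.sql("""
        CREATE TABLE IF NOT EXISTS demo.tno_data (
            tno BIGINT GENERATED ALWAYS AS IDENTITY,
            data_value STRING
        ) USING DELTA
    """)

    # Append only data_value; tno is assigned by the table itself
    # (new_df holds this run's incoming rows)
    new_df.select("data_value").write.mode("append").saveAsTable("demo.tno_data")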

    If you need something generic but guaranteed to be unique, you need either snowflake-style generation (Quality has a 160-bit one, not tested via PySpark) or a central server which reserves blocks of IDs (as Oracle sequences do) combined with mapPartitions.
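
    A rough sketch of the block-reservation idea. reserve_block() is a hypothetical stand-in for a call to whatever central service hands out contiguous ID ranges; nothing here is a ready-made API:

    def reserve_block(n):
        # Placeholder: in a real setup this would atomically reserve n
        # consecutive IDs from a central service (e.g. an RDBMS sequence).
        raise NotImplementedError("replace with a call to your ID service")

    def assign_ids(rows):
        rows = list(rows)
        if not rows:
            return
        start = reserve_block(len(rows))   # one reservation per partition
        for offset, row in enumerate(rows):
            yield (start + offset, row.data_value)

    with_ids = my_df.rdd.mapPartitions(assign_ids).toDF(["tno", "data_value"])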
