I have a Databricks DataFrame with the columns tno, data_value.
Output of first Databricks run:
tno, data_value
1,hdjsjsjnsns
2,dhjdjdjsnsn
3,jdjsjsjsjsjjs
When I run the same notebook again after some time, it should generate:
tno, data_value
4,hdjsjsjnsns
5,dhjdjdjsnsn
6,jdjsjsjsjsjjs
Just like an Oracle or SQL Server "Sequence"
I have tried row_number() and monotonically_increasing_id(),
but those start again from 1 on every run.
So think of it as a transactional key. How do I achieve this in PySpark on Databricks?
2 Answers
To achieve your requirement, you need to get the maximum value from the DataFrame of the last run and use it to transform the tno column in the current run.

After the first notebook run, store the DataFrame in a Databricks file. In the next run, read the data from that file into a DataFrame and get the maximum value (last_max) of the tno column.

After that, add an increasing id column my_id to the DataFrame using row_number(), starting from 1. Now, add the last_max value to the my_id column, store each result in the tno column, and drop the extra my_id column.

At the end, overwrite the DataFrame to the same Databricks file so that the same can be done on the next notebook run.
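A minimal PySpark sketch of that approach, assuming a Delta output location (output_path is a placeholder) and that df is the current run's DataFrame:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Placeholder location; point this at your own Delta path or table.
output_path = "dbfs:/FileStore/transaction_keys"

# Get the maximum tno from the previous run's output; fall back to 0
# on the very first run, when the path does not exist yet.
try:
    last_max = (spark.read.format("delta").load(output_path)
                     .agg(F.max("tno")).first()[0]) or 0
except Exception:
    last_max = 0

# df is the current run's DataFrame with the data_value column.
w = Window.orderBy("data_value")  # any deterministic ordering will do
df_with_key = (
    df.withColumn("my_id", F.row_number().over(w))     # 1, 2, 3, ...
      .withColumn("tno", F.col("my_id") + F.lit(last_max))
      .drop("my_id")
)

# Overwrite the same location so the next run can read the new maximum.
df_with_key.write.format("delta").mode("overwrite").save(output_path)
```

On the first run there is no previous output, so last_max falls back to 0 and the keys start at 1; on later runs they continue from the last stored maximum.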
You can also store just the maximum value of the tno column in a file and use that in the next run.

You can go through this reference by @ram.sankarasubramanian to know more about it.
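A small sketch of that variant, assuming a hypothetical file path and reusing df_with_key from the sketch above:

```python
# Hypothetical file holding only the last assigned tno.
max_path = "dbfs:/FileStore/tno_max.txt"

# Read the previous maximum (falls back to 0 on the first run, when the file is missing).
try:
    last_max = int(dbutils.fs.head(max_path))
except Exception:
    last_max = 0

# ... assign tno exactly as in the sketch above, then persist the new maximum ...
new_max = df_with_key.agg(F.max("tno")).first()[0]
dbutils.fs.put(max_path, str(new_max), overwrite=True)
```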
For truly unique numbers, and for storing ids with Delta on Databricks, you can use identity_columns.
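For example, a sketch using a Delta table with an identity column (table and column names are illustrative; requires a Databricks Runtime that supports GENERATED ALWAYS AS IDENTITY):

```python
# Create the target table once; tno is generated automatically by Delta.
spark.sql("""
    CREATE TABLE IF NOT EXISTS transactions (
        tno BIGINT GENERATED ALWAYS AS IDENTITY,
        data_value STRING
    ) USING DELTA
""")

# Each notebook run only inserts data_value; tno keeps increasing across runs.
# Note: identity values are unique and increasing, but not guaranteed consecutive.
df.select("data_value").createOrReplaceTempView("new_rows")
spark.sql("INSERT INTO transactions (data_value) SELECT data_value FROM new_rows")
```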
If you need something generic but guaranteed to be unique, you need either snowflake-style generation (Quality has a 160-bit one, not tested via PySpark) or a central server which reserves blocks (as Oracle does) combined with mapPartitions.
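A very simplified block-reservation sketch (block_size and last_max are assumptions; a real setup would reserve each block from a central store):

```python
# Each partition is handed its own block of ids, so ids are unique across
# partitions but may leave gaps, similar to Oracle sequences with CACHE.
block_size = 1_000_000          # must exceed the largest partition's row count
last_max = 0                    # would come from the central reservation service

def assign_ids(partition_index, rows):
    block_start = last_max + partition_index * block_size
    for i, row in enumerate(rows, start=1):
        yield (block_start + i, row["data_value"])

df_with_key = (
    df.rdd
      .mapPartitionsWithIndex(assign_ids)
      .toDF(["tno", "data_value"])
)
```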