I have a .csv file (in Azure Data Lake Storage) which looks approximately like this:
I want to create a notebook (PySpark (Python)) that can be used in Synapse Analytics (Integrate -> pipeline) in one of the pipelines.
The code in the notebook should split the 2nd column in two and convert all the rows to the GB unit, so that it looks like this:
Could you please help with the PySpark code? I am a beginner in Azure Synapse Analytics and not sure how to do it.
!! IMPORTANT: The constraint I have is that it all should be done in the same file (no new files should be created).
Thanks in advance
2 Answers
One way to do this is to use the split function:
Input:
Code:
Result:
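Since the original code is shown only as a screenshot, here is a minimal sketch of what the split-based approach can look like; the file path, the column name Consumed, and the decimal kb/mb-to-GB factors are assumptions to adjust to your data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a Synapse notebook

    # Assumed path and column name -- adjust to your container, account and schema
    path = "abfss://container@youraccount.dfs.core.windows.net/Raw/data.csv"
    df = spark.read.csv(path, header=True)

    # Split e.g. "670 kb" into a numeric part and a unit part
    parts = F.split(F.col("Consumed"), " ")
    df = (df
          .withColumn("Value", parts.getItem(0).cast("double"))
          .withColumn("Unit", parts.getItem(1)))

    # Convert every row to GB based on its unit (decimal factors; use 1024-based ones if preferred)
    df = df.withColumn(
        "Consumed_GB",
        F.when(F.col("Unit") == "kb", F.col("Value") / 1000000)
         .when(F.col("Unit") == "mb", F.col("Value") / 1000)
         .otherwise(F.col("Value")))

    df.show()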
I would not recommend overwriting the same file. It is always good practice to separate your stages. You could treat the files you are reading as raw files, e.g. saved in a .../Raw/ folder, and then write the newly generated files somewhere like a .../Preprocessed/ folder. It might also be a good idea to save the new file in a binary format like Parquet, both for compression/file size and because the data type of each column is saved in the file itself.
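For illustration, a possible write step under these conventions (the container, account, and folder names are placeholders, and df is the dataframe produced in the sketch above):

    # Hypothetical output location -- keep raw and processed data in separate folders
    output_path = "abfss://container@youraccount.dfs.core.windows.net/Preprocessed/data.parquet"

    # Parquet preserves each column's data type and compresses well compared with csv
    df.write.mode("overwrite").parquet(output_path)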
I read the csv file from my storage into a dataframe.
Here is my file:
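A sketch of that read step (the abfss:// path below is a placeholder for the actual storage location; spark is already available in a Synapse notebook):

    # Read the csv from ADLS Gen2 into a Spark dataframe
    df = spark.read.load(
        "abfss://container@youraccount.dfs.core.windows.net/folder/file.csv",
        format="csv",
        header=True)
    df.show()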
I created a new column called 'Unit' by splitting the consumed column:
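A sketch of that step, assuming the consumed column is named Consumed and holds values like "670 kb":

    from pyspark.sql import functions as F

    # Split e.g. "670 kb" into the numeric part (Consumed) and the unit part (Unit)
    df = (df
          .withColumn("Unit", F.split(F.col("Consumed"), " ").getItem(1))
          .withColumn("Consumed", F.split(F.col("Consumed"), " ").getItem(0).cast("double")))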
I converted the kb values of the consumed column into GB using the code below:
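A sketch of that conversion, using a decimal factor of 1,000,000 kb per GB (as implied by the 6.7E-4 result mentioned next):

    from pyspark.sql import functions as F

    # Convert kb rows to GB (670 kb -> 0.00067 GB); other units stay as they are
    df = df.withColumn(
        "Consumed",
        F.when(F.col("Unit") == "kb", F.col("Consumed") / 1000000)
         .otherwise(F.col("Consumed")))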
When I try the above, the 670 kb value comes out as 6.7E-4, i.e., it is displayed in scientific notation.
So, I formatted it using the command below:
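One way to do that formatting (format_number with 6 decimal places is an assumption):

    from pyspark.sql import functions as F

    # Format the value with a fixed number of decimals so it prints as 0.000670 instead of 6.7E-4
    df = df.withColumn("Consumed", F.format_number(F.col("Consumed"), 6))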
And selected the columns in the format below:
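For example, relabelling the unit as GB and displaying the result (the exact column selection depends on your schema):

    from pyspark.sql import functions as F

    # All values are now expressed in GB, so report the unit accordingly
    df = df.withColumn("Unit", F.lit("GB"))
    df.show()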
Output:
Mount the path using the procedure below:
Create a linked service for the path, then create the mount using the code below:
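A sketch of the mount step with mssparkutils (the linked service name, container, account, and mount point are placeholders):

    from notebookutils import mssparkutils

    # Mount the ADLS Gen2 container through the linked service
    mssparkutils.fs.mount(
        "abfss://container@youraccount.dfs.core.windows.net",
        "/mountname",
        {"linkedService": "MyLinkedService"})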
I overwrote the updated dataframe to the same filename using the code below:
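One possible way to overwrite the original csv in place is to write through the local mount path with pandas, which keeps a single file with the original name (the mount name and file name are placeholders):

    from notebookutils import mssparkutils

    # Resolve the local /synfs path of the mount for the current Spark job
    job_id = mssparkutils.env.getJobId()
    local_path = f"/synfs/{job_id}/mountname/file.csv"

    # Convert to pandas and overwrite the original csv (no extra part-files are created)
    df.toPandas().to_csv(local_path, index=False)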
Updated file: