I am trying to load multiple documents into the MongoDb collection using databricks pyspark and while loading I am using updateDate filed as well , but after loading I could see updateDate field data type is string instead date data type.
here i am using the code for timestamp.
import datetime
current_timestamp_utc = datetime.datetime.now(datetime.timezone.utc)
formatted_timestamp = current_timestamp_utc.strftime("%Y-%m-%dT%H:%M:%S")
timezone_offset = current_timestamp_utc.strftime("%z")
formatted_timestamp = formatted_timestamp + ".000" + timezone_offset[:-2] + ":" +
timezone_offset[-2:]
print(formatted_timestamp)
result : 2024-04-03T07:33:52.000+00:00
as per the result looks fine , but after loading into MongoDb , it is displaying as String instead Date.
So could you please help me on this how to load the documents with date data type.
I have used UpdateMany() method to change string to date data type and is this the write approach to proceed or is there any I/O or performance impact while using updatemany() method . Please suggest
2
Answers
You can get the current time directly using spark SQL date-time functions as follows:
If you notice the schema,
updateDate
is a string and you can convert it into a timestamp usingto_timestamp()
as follows:And now the
updateDate
is a timestamp adjusted for your spark session’s timezone (that’s why the+00:00
offset is gone now) – which, by the way, you can update usingspark.conf.set("spark.sql.session.timeZone", "<enter-timezone-here>")
.And if you want to add it to an existing dataframe as a new column you can do something like this:
The datatype you are looking for is a timestamp with a timezone in Spark. Now, you can try loading the dataset into MongoDB with this schema.
I have tried the below approach in Pyspark:
DataFrame.printSchema of DataFrame[id: int, date: string]
Results:
In the code above, I used the
withColumn()
function to create a new column called "date" and use theto_date()
function to convert the "date" column to the DateType.The format string "yyyy/MM/dd HH:mm:ss" specifies the format of the input date string.
However, As you mentiond you are loading files to MangoDB
As per the $toDate (aggregation)
$toDate
Converts a value to a date. If the value cannot be converted to a date,$toDate
errors. If the value is null or missing,$toDate
returns null.You can also try the below in the MangoDB in 4.2 version:
Reference: String Datatype to Date datatype conversion
$convert (aggregation)