Using AWS Firehose to ingest data into an Iceberg table managed by AWS Glue, I’m unable to insert timestamp data.
Firehose
I’m trying to insert data using the following script:
json_data = json.dumps(
{
"ADF_Record": {
"foo": "bar",
"baz": "2024-09-04T18:56:15.114"
},
"ADF_Metadata": {
"OTF_Metadata": {
"DestinationDatabaseName": "my_db",
"DestinationTableName": "my_table",
"Operation": "INSERT"
}
}
}
)
response = boto3.client("firehose").put_record(
DeliveryStreamName="my_stream",
Record={"Data": json_data.encode()}
)
Note that the baz
value corresponds to a timestamp of type TimestampType.withoutZone as referenced in the Firehose documentation.
Glue
- My table is of the Iceberg type.
- I did not define any additional SerDe library or SerDe parameters.
- The table schema is :
foo
:string
baz
:timestamp
Error
Whenever I try to insert data using this method, no data is delivered and I get this error on Firehose side :
Firehose is unable to convert column data in your record to the column type specified within the schema. Table: my_db.my_table
Things I tried
- Data is written when
baz
is removed from the payload (the pipeline seems functional without timestamp). - Switching to epoch format (
1725476175114000
) doesn’t help. Glue creates a new version of the table withbaz
asdate
and the written data is not legible. - Switching to TimestampType.WithZone results in the same error.
- Trying a SerDe library like
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
andtimestamp.formats
parameter doesn’t help. Glue creates a new version of the table and removes the SerDe parameters altogether.
I’m about to give up and just write timestamps as string. Any insight is appreciated!
2
Answers
I think I figured it out. Modifying the schema of your Glue table has no incidence on the schema of the underlying Iceberg table (i.e. metadata are not altered by Glue), it only goes one way. This resulted in an inconsistency between schemas :
baz
was adate
for Iceberg and atimestamp
for Glue.What worked was to delete
baz
with PyIceberg (an Athena query might do the trick too) and recreate it with thetimestamp
type. This will update the metadata on Iceberg side and the Glue schema altogether. Now,baz
is properly interpreted by Firehose, Glue and Iceberg.Seems like problem with format of date you sending to AWS Firehose.
You sending date with
T
in your time:2024-09-04T18:56:15.114
. But, as I know, it waiting for date format withoutT
, so try to change it:If this wouldn’t help you, try to remove milliseconds as well. If removing milliseconds will be helpful, you will need to configure baz column in your Glue table schema properly to allow milliseconds accepting.