skip to Main Content

Using AWS Firehose to ingest data into an Iceberg table managed by AWS Glue, I’m unable to insert timestamp data.

Firehose

I’m trying to insert data using the following script:

json_data = json.dumps(
    {
        "ADF_Record": {
            "foo": "bar",
            "baz": "2024-09-04T18:56:15.114"
        },
        "ADF_Metadata": { 
            "OTF_Metadata": {
                "DestinationDatabaseName": "my_db",
                "DestinationTableName": "my_table",
                "Operation": "INSERT"
            }
        }
    }
)

response = boto3.client("firehose").put_record(
    DeliveryStreamName="my_stream",
    Record={"Data": json_data.encode()}
)

Note that the baz value corresponds to a timestamp of type TimestampType.withoutZone as referenced in the Firehose documentation.

Glue

  • My table is of the Iceberg type.
  • I did not define any additional SerDe library or SerDe parameters.
  • The table schema is :
    • foo : string
    • baz : timestamp

Error

Whenever I try to insert data using this method, no data is delivered and I get this error on Firehose side :

Firehose is unable to convert column data in your record to the column type specified within the schema. Table: my_db.my_table

Things I tried

  • Data is written when baz is removed from the payload (the pipeline seems functional without timestamp).
  • Switching to epoch format (1725476175114000) doesn’t help. Glue creates a new version of the table with baz as date and the written data is not legible.
  • Switching to TimestampType.WithZone results in the same error.
  • Trying a SerDe library like org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe and timestamp.formats parameter doesn’t help. Glue creates a new version of the table and removes the SerDe parameters altogether.

I’m about to give up and just write timestamps as string. Any insight is appreciated!

2

Answers


  1. Chosen as BEST ANSWER

    I think I figured it out. Modifying the schema of your Glue table has no incidence on the schema of the underlying Iceberg table (i.e. metadata are not altered by Glue), it only goes one way. This resulted in an inconsistency between schemas : baz was a date for Iceberg and a timestamp for Glue.

    What worked was to delete baz with PyIceberg (an Athena query might do the trick too) and recreate it with the timestamp type. This will update the metadata on Iceberg side and the Glue schema altogether. Now, baz is properly interpreted by Firehose, Glue and Iceberg.


  2. Seems like problem with format of date you sending to AWS Firehose.
    You sending date with T in your time: 2024-09-04T18:56:15.114. But, as I know, it waiting for date format without T, so try to change it:

    json_data = json.dumps(
        {
            "ADF_Record": {
                "foo": "bar",
                "baz": "2024-09-04 18:56:15.114"
            },
        ...
    )
    

    If this wouldn’t help you, try to remove milliseconds as well. If removing milliseconds will be helpful, you will need to configure baz column in your Glue table schema properly to allow milliseconds accepting.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search