DynamoDB Table:
+----+----------+----------+
| id | Column A | Column B |
+----+----------+----------+
| 1  | "155.2"  | 400      |
| 2  | 100      | 200      |
| 3  | "455.2"  | 305.5    |
| 4  | "312.3"  | 350      |
+----+----------+----------+
Notice that in Column A the numbers are stored as strings, except for id = 2.
The following code is used to read the table contents into a DynamicFrame:
    # Assumes glueContext is an already initialized awsglue.context.GlueContext.
    def create_dynamic_frame(table_name):
        ddb_s3_bucket = "<some-s3-bucket>"
        ddb_table_arn = "<some-table-arn>"
        # Read through the DynamoDB export connector (table export to S3).
        connection_options = {
            "dynamodb.export": "ddb",
            "dynamodb.unnestDDBJson": True,
            "dynamodb.tableArn": ddb_table_arn,
            "dynamodb.s3.bucket": ddb_s3_bucket,
            "dynamodb.s3.prefix": "temporary/ddbexport/",
        }
        dynamic_frame = glueContext.create_dynamic_frame.from_options(
            connection_type="dynamodb",
            connection_options=connection_options,
            transformation_ctx="dynamic_frame",
        )
        return dynamic_frame

    dyf = create_dynamic_frame('test-table')
The output of show on the created DynamicFrame, dyf.toDF().show():
+----+----------+----------+
| id | Column A | Column B |
+----+----------+----------+
| 1  | null     | 400      |
| 2  | 100      | 200      |
| 3  | null     | 305.5    |
| 4  | null     | 350      |
+----+----------+----------+
The output of dyf.toDF().printSchema():
root
|-- id: string (nullable = true)
|-- Column A: string (nullable = true)
|-- Column B: string (nullable = true)
Notice that the string values in Column A are null. I was under the impression that Glue keeps both types in the column, and that you can then use resolveChoice to cast to whichever type you want.
Is there a way I can resolve the type in the Glue-DDB connector?
I tried to resolve the types using resolveChoice:
resolved_dyf = dyf.resolveChoice(specs = [("Column A", "cast:string")])
This did not work, since the values in dyf itself are null.
2 Answers
This seems to be a limitation of "connectionType": "dynamodb" when the AWS Glue DynamoDB export connector is used as the source.
Moreover, if we use the unnestDDBJson parameter, Glue is forced to evaluate a schema for the columns. If I do not use the unnestDDBJson parameter, all column values are kept as struct, and I could not find a resource explaining how to then resolve the type.
One workaround is to use "connectionType": "dynamodb" with the ETL connector as the source.
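As a rough sketch of that workaround, the connection options for the ETL connector could be built as below. The helper name build_ddb_etl_connection_options and the default read percentage are hypothetical, chosen for illustration; "dynamodb.input.tableName" and "dynamodb.throughput.read.percent" are the option keys the Glue documentation lists for the DynamoDB ETL connector. The Glue calls are left as comments because they need the Glue runtime, and whether Column A then shows up as a choice type should be checked with printSchema() rather than assumed.

```python
def build_ddb_etl_connection_options(table_name, read_percent="0.5"):
    """Build options for the Glue DynamoDB ETL connector (table scan, no S3 export).

    Hypothetical helper for illustration; read_percent is an assumed default.
    """
    return {
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": read_percent,
    }


options = build_ddb_etl_connection_options("test-table")

# Inside a Glue job, with glueContext already initialized, the options would be
# passed to the same from_options call used in the question:
#
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="dynamodb",
#     connection_options=options,
# )
# dyf.printSchema()  # check how Column A is typed before resolving it
# resolved_dyf = dyf.resolveChoice(specs=[("Column A", "cast:string")])
```

Note that the ETL connector reads by scanning the table, so it consumes read capacity on the table, which the export connector does not.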
When "unnestDDBJson" is used, Glue is forced to resolve and flatten the schema. I think you will need to avoid that in your case and do the flattening yourself after you have resolved the type.
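To illustrate what "doing it yourself" could look like, here is a minimal sketch in plain Python. It assumes each attribute arrives in DynamoDB JSON form, that is, as a small dict keyed by a type descriptor such as {"S": "155.2"} or {"N": "100"}. The function names unwrap_number and flatten_item are hypothetical, and how the records are actually exposed inside a Glue job without unnestDDBJson is not shown here and would need to be checked against the real schema.

```python
from decimal import Decimal


def unwrap_number(attr):
    """Return a Decimal from a DynamoDB-JSON attribute such as
    {"S": "155.2"} or {"N": "100"}; return None if no value is present."""
    if attr is None:
        return None
    for type_key in ("N", "S"):
        value = attr.get(type_key)
        if value is not None:
            # Raises decimal.InvalidOperation if the string is not numeric.
            return Decimal(str(value))
    return None


def flatten_item(item, numeric_columns):
    """Flatten one item, unwrapping only the listed numeric columns."""
    return {name: unwrap_number(item.get(name)) for name in numeric_columns}


# Rows 1 and 2 of the table in the question, written as DynamoDB JSON.
items = [
    {"Column A": {"S": "155.2"}, "Column B": {"N": "400"}},
    {"Column A": {"N": "100"}, "Column B": {"N": "200"}},
]
flat = [flatten_item(item, ["Column A", "Column B"]) for item in items]
```

The point of the design is that both the string-typed and number-typed variants are read before any schema is inferred, so neither is dropped to null.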