I am running an AWS Glue crawler on a CSV file. This CSV file has a string column containing alphanumeric values. The crawler is setting the data type for this column to INT (instead of string), which is causing my ETL to fail. Is there any way to force Glue to correct this? I don't want to define the schema manually in the crawler, as that defeats the purpose of automatic data cataloging.
2 Answers
CSV is always a tricky format to handle, especially when all of your columns are strings: the crawler cannot differentiate between the header row and data rows. To avoid this, use a Glue classifier. Create a classifier with the format set to CSV and Column headings set to Has headings, then add the classifier to the Glue crawler. Make sure to delete and recreate the crawler before re-running; an existing crawler will sometimes fail to pick up classifier changes. A minimal boto3 sketch follows.
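Here is a rough sketch of those two steps using boto3. The names (csv-has-headers, my-crawler), the IAM role ARN, and the S3 path are placeholders, not values from the original post:

```python
import boto3

glue = boto3.client("glue")

# Create a custom CSV classifier that treats the first row as a header.
# ContainsHeader="PRESENT" corresponds to "Has headings" in the console.
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-has-headers",
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)

# Attach the classifier when (re)creating the crawler.
glue.create_crawler(
    Name="my-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/my-prefix/"}]},
    Classifiers=["csv-has-headers"],
)
```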
Crawler Schema Detection
From the Glue developer docs:

During the first crawler run, the crawler reads either the first 1,000 records or the first megabyte of each file to infer the schema. The amount of data read depends on the file format and availability of a valid record.
To bypass this, create a new CSV classifier, set Column headings to Has headings, and add the classifier to the crawler.
If the classifier still does not create your AWS Glue table as you want, edit the table definition, adjust any inferred types to STRING, set the SchemaChangePolicy to LOG, and set the partitions output configuration to InheritFromTable for future crawler runs.
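As a sketch, those adjustments can also be scripted with boto3 (my_database, my_table, and my-crawler are placeholder names):

```python
import boto3

glue = boto3.client("glue")

# Fetch the current table definition from the Data Catalog.
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

# Force every inferred column type to string (or target just the offending column).
for col in table["StorageDescriptor"]["Columns"]:
    col["Type"] = "string"

# get_table returns read-only fields that update_table rejects, so keep only
# the writable ones when building TableInput.
writable_keys = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "ViewOriginalText", "ViewExpandedText",
    "TableType", "Parameters", "TargetTable",
}
table_input = {k: v for k, v in table.items() if k in writable_keys}

glue.update_table(DatabaseName="my_database", TableInput=table_input)

# Tell future crawler runs to log schema changes instead of applying them,
# and to inherit the partition schema from the table.
glue.update_crawler(
    Name="my-crawler",
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG",
    },
    Configuration=(
        '{"Version": 1.0, '
        '"CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}}}'
    ),
)
```

With the LOG policy in place, subsequent crawler runs record detected schema changes without overwriting the manually corrected column types.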