Data has headers like _col_0
, _col_1
, etc. I have a sample data file that has the correct column headers.
However, all the data is in snappy/parquet across ~250 files.
What is the easiest way to remap the column headers in Glue?
Thanks.
UPDATE:
So I tried John R’s comment below – I went in and edited the table schema in glue, to rename the columns, but when I query the data, any column I edited is now missing data.
I tried to re-run the glue job and it overwrote the edited schema (makes sense) but the data is back.
So editing the column name in the schema makes that data get dropped, or not applied to the column. Searched on google, but don’t see any related issues.
UPDATE #2:
If I rename the column back to it’s original name (which is in the snappy/parquet files) then the data comes back.
UPDATE #3:
I basically solved this by generating a view in Athena and renaming the _col_0… columns to the their correct names.
2
Answers
I would say that you can’t rename columns from tables stored in parquet because the schema is contained in the file itself. If you add a file with another header than glue will threat it in one of the 3 ways when running crawlers: ignore changes, add new columns or create new tables; depends on how it’s set up.
What you can do is recreate all the files in another bucket with the desired column names and create another table from it, then delete the old files if you want.
Altough this might sound a little difficulty it is not. You can leverage the Athena’s UNLOAD command (also available on Redshift) to do so. This is also a good moment to (re)partition your files if you see fit. I will leave an example here, you can than customize the command for your own needs. Check the docs for further reference: UNLOAD – Amazon Athena
Redshift UNLOAD is more robust, so if you need more options, like controling max file size, you can do it by leveraging Redshift Spectrum to query data directly from S3.
A similar approach to what @JohnRotenstein mentioned is to "edit" the table’s DDL by dropping and creating the table again in Athena, that would allow you to query the data with the new column names.
Now this can be useful depending on how your glue job updates the table later, you might ending overwritten that the next time you run your glue job.