
Data has headers like _col_0, _col_1, etc. I have a sample data file that has the correct column headers.

However, all the data is in snappy/parquet across ~250 files.

What is the easiest way to remap the column headers in Glue?

Thanks.

UPDATE:

So I tried John R's comment below: I went in and edited the table schema in Glue to rename the columns, but when I query the data, any column I edited is now missing its data.

I tried re-running the Glue job and it overwrote the edited schema (which makes sense), but the data is back.

So editing a column name in the schema causes that column's data to be dropped, or at least not mapped to the renamed column. I searched Google but don't see any related issues.

UPDATE #2:

If I rename the column back to its original name (which is in the snappy/parquet files), then the data comes back.

UPDATE #3:

I basically solved this by generating a view in Athena that renames the _col_0… columns to their correct names.
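Something along these lines, with the database, table, and target column names replaced by placeholders:

    CREATE OR REPLACE VIEW mydb.my_table_renamed AS
    SELECT
        _col_0 AS customer_id,   -- placeholder for the real column name
        _col_1 AS order_date     -- placeholder for the real column name
    FROM mydb.my_table;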

2 Answers


  1. I would say that you can’t rename columns in tables stored as Parquet, because the schema is contained in the files themselves. If you add a file with a different header, Glue will treat it in one of three ways when running crawlers: ignore changes, add new columns, or create new tables, depending on how it’s set up.

    What you can do is recreate all the files in another bucket with the desired column names and create another table from it, then delete the old files if you want.

    Although this might sound a little difficult, it is not. You can leverage Athena’s UNLOAD command (also available in Redshift) to do so. This is also a good moment to (re)partition your files if you see fit. I will leave an example here; you can then customize the command for your own needs. Check the docs for further reference: UNLOAD – Amazon Athena

    UNLOAD (SELECT _col_0 AS new_col0_name, _col_1 AS new_col1_name FROM table) 
    TO 's3://yourbucket/your_partitioned_table/' 
    WITH (format = 'PARQUET', compression = 'SNAPPY', partitioned_by = ARRAY['new_col1_name'])
    

    Redshift UNLOAD is more robust, so if you need more options, like controlling maximum file size, you can do it by leveraging Redshift Spectrum to query the data directly from S3.
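    For example, a rough sketch of the Redshift side, assuming the Parquet files are exposed to Redshift through a Spectrum external schema (the schema, table, role ARN, and column names below are placeholders):

    UNLOAD ('SELECT _col_0 AS new_col0_name, _col_1 AS new_col1_name FROM spectrum_schema.my_table')
    TO 's3://yourbucket/your_partitioned_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
    FORMAT AS PARQUET
    PARTITION BY (new_col1_name)
    MAXFILESIZE 256 MB;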

  2. A similar approach to what @JohnRotenstein mentioned is to "edit" the table’s DDL by dropping the table and creating it again in Athena, which would allow you to query the data with the new column names.

    1. Go to the Athena console and right-click the table name
    2. Select "Generate table DDL"; this gives you the code needed to recreate the table (copy the code and save it for later use)
    3. Right-click the table name and select "Delete table"
    4. In the query editor, paste the code from step 2, manually change the column names, and run it (see the sketch after this list)
    5. Query the table and you should see all the data with the new column names
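
    As a rough sketch, the DDL you edit in step 4 would look something like this (the database, bucket path, and new column names below are placeholders; keep the column types and SerDe settings from the generated DDL):

    CREATE EXTERNAL TABLE mydb.my_table (
      new_col0_name string,   -- placeholder name; keep the type from the generated DDL
      new_col1_name string    -- placeholder name; keep the type from the generated DDL
    )
    STORED AS PARQUET
    LOCATION 's3://yourbucket/path/to/parquet/'
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY');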

    Now whether this is useful depends on how your Glue job updates the table later; you might end up overwriting it the next time you run the Glue job.
