skip to Main Content

I have a JSON file in s3 that I’m trying to run queries against using Athena. When I run my Lambda that tries to call Athena, I get this error: SyntaxError: Unexpected token < in JSON at position 0 at JSON.parse (<anonymous>)

Any help would be greatly appreciated.

My code looks like this:

const athena_client = new AthenaClient({ region: process.env.AWS_REGION });

const sql = `SELECT * FROM customer_data`

const params = {
            QueryString: sql,
            WorkGroup: 'customer-data-workgroup-prod'
        };

const command = new StartQueryExecutionCommand(params);

console.log("command", command)

try {
        const data = await client.send(command);
        console.log("data response: ", data)
  } catch (error) {
        console.log("error", error)
  }

In s3, I have a file called movies.json, that just pulls from this test API – https://jsonmock.hackerrank.com/api/movies

My Glue Table Cloudformation template looks like this:

  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref GlueDatabase
    TableInput:
      Name: customer_data
      StorageDescriptor:
        Columns:
          - Name: createdAt
            Type: timestamp
          - Name: fullJson  
            Type: string
          - Name: data  
            Type: string  
          - Name: title
            Type: string                
        Location: 's3://customer-data-bucket-prod/'
        InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
        SerdeInfo:
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
      Parameters: {'classification': 'json'}
      TableType: "EXTERNAL_TABLE"

JSON:

{
  "page": 1,
  "per_page": 10,
  "total": 2770,
  "total_pages": 277,
  "data": [
    {
      "Title": "Waterworld",
      "Year": 1995,
      "imdbID": "tt0114898"
    },
    {
      "Title": "Waterworld",
      "Year": 1995,
      "imdbID": "tt0189200"
    },
    {
      "Title": "The Making of 'Waterworld'",
      "Year": 1995,
      "imdbID": "tt2670548"
    },
    {
      "Title": "Waterworld 4: History of the Islands",
      "Year": 1997,
      "imdbID": "tt0161077"
    },
    {
      "Title": "Waterworld",
      "Year": 1997,
      "imdbID": "tt0455840"
    },
    {
      "Title": "Waterworld",
      "Year": 1997,
      "imdbID": "tt0390617"
    },
    {
      "Title": "Swordquest: Waterworld",
      "Year": 1983,
      "imdbID": "tt2761086"
    },
    {
      "Title": "Behind the Scenes of the Most Fascinating Waterworld on Earth: The Great Backwaters, Kerala.",
      "Year": 2014,
      "imdbID": "tt5847056"
    },
    {
      "Title": "Louise's Waterworld",
      "Year": 1997,
      "imdbID": "tt0298417"
    },
    {
      "Title": "Waterworld",
      "Year": 2001,
      "imdbID": "tt0381702"
    }
  ]
}

2

Answers


  1. This is the Glue table schema to allow Athena parsing the text file correctly.

    "Columns": [
        {
            "Name": "page",
            "Type": "int"
        },
        {
            "Name": "per_page",
            "Type": "int"
        },
        {
            "Name": "total",
            "Type": "int"
        },
        {
            "Name": "total_pages",
            "Type": "int"
        },
        {
            "Name": "data",
            "Type": "array<struct<Title:string,Year:int,imdbID:string>>"
        }
    ]
    
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    
    "SerdeInfo": {
        "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
        "Parameters": {
            "paths": "data,page,per_page,total,total_pages"
        }
    }
    

    In Athena,

    SELECT data FROM customer_data;
    

    will return all movies in an array

    [{title=Waterworld, year=1995, imdbid=tt0114898}, {title=Waterworld, year=1995, imdbid=tt0189200}, {title=The Making of 'Waterworld', year=1995, imdbid=tt2670548}, {title=Waterworld 4: History of the Islands, year=1997, imdbid=tt0161077}, {title=Waterworld, year=1997, imdbid=tt0455840}, {title=Waterworld, year=1997, imdbid=tt0390617}, {title=Swordquest: Waterworld, year=1983, imdbid=tt2761086}, {title=Behind the Scenes of the Most Fascinating Waterworld on Earth: The Great Backwaters, Kerala., year=2014, imdbid=tt5847056}, {title=Louise's Waterworld, year=1997, imdbid=tt0298417}, {title=Waterworld, year=2001, imdbid=tt0381702}]
    

    If you want a regular table, create a view based on this query

    WITH w AS
      (SELECT DATA
       FROM customer_data)
    SELECT movie.title,
           movie.year,
           movie.imdbid
    FROM w,
         UNNEST (DATA) t(movie);
    

    which gives you the result

    #   title   year    imdbid
    1   Waterworld  1995    tt0114898
    2   Waterworld  1995    tt0189200
    3   The Making of 'Waterworld'  1995    tt2670548
    ...
    
    Login or Signup to reply.
  2. As @Isc also mentioned, you need to define the columns as described in the json data.

    Glue Table definition expects the column to be present in the data.Glue crawler crawls the data from external location using the Serde configuration and maps the data in the specified columns.
    Hence you need to ensure that right schema is defined. Based on that, it crawls.

    Your cloudformation template Glue::Table should look like this:

     Type: AWS::Glue::Table
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseName: !Ref GlueDatabase
        TableInput:
          Name: customer_data
          StorageDescriptor:
            Columns:
              - Name: page
                Type: int
              - Name: per_page  
                Type: int
              - Name: total  
                Type: int  
              - Name: total_pages
                Type: int     
              - Name: data
                Type: array<struct<Title:string,Year:int,imdbID:string>>           
            Location: 's3://customer-data-bucket-prod/'
            InputFormat: org.apache.hadoop.mapred.TextInputFormat
            OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            SerdeInfo:
              SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
          Parameters: {'classification': 'json'}
          TableType: "EXTERNAL_TABLE"
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search