I have a JSON file in s3 that I’m trying to run queries against using Athena. When I run my Lambda that tries to call Athena, I get this error: SyntaxError: Unexpected token < in JSON at position 0 at JSON.parse (<anonymous>)
Any help would be greatly appreciated.
My code looks like this:
const athena_client = new AthenaClient({ region: process.env.AWS_REGION });
const sql = `SELECT * FROM customer_data`
const params = {
QueryString: sql,
WorkGroup: 'customer-data-workgroup-prod'
};
const command = new StartQueryExecutionCommand(params);
console.log("command", command)
try {
const data = await client.send(command);
console.log("data response: ", data)
} catch (error) {
console.log("error", error)
}
In s3, I have a file called movies.json, that just pulls from this test API – https://jsonmock.hackerrank.com/api/movies
My Glue Table Cloudformation template looks like this:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref GlueDatabase
TableInput:
Name: customer_data
StorageDescriptor:
Columns:
- Name: createdAt
Type: timestamp
- Name: fullJson
Type: string
- Name: data
Type: string
- Name: title
Type: string
Location: 's3://customer-data-bucket-prod/'
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
SerdeInfo:
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
Parameters: {'classification': 'json'}
TableType: "EXTERNAL_TABLE"
JSON:
{
"page": 1,
"per_page": 10,
"total": 2770,
"total_pages": 277,
"data": [
{
"Title": "Waterworld",
"Year": 1995,
"imdbID": "tt0114898"
},
{
"Title": "Waterworld",
"Year": 1995,
"imdbID": "tt0189200"
},
{
"Title": "The Making of 'Waterworld'",
"Year": 1995,
"imdbID": "tt2670548"
},
{
"Title": "Waterworld 4: History of the Islands",
"Year": 1997,
"imdbID": "tt0161077"
},
{
"Title": "Waterworld",
"Year": 1997,
"imdbID": "tt0455840"
},
{
"Title": "Waterworld",
"Year": 1997,
"imdbID": "tt0390617"
},
{
"Title": "Swordquest: Waterworld",
"Year": 1983,
"imdbID": "tt2761086"
},
{
"Title": "Behind the Scenes of the Most Fascinating Waterworld on Earth: The Great Backwaters, Kerala.",
"Year": 2014,
"imdbID": "tt5847056"
},
{
"Title": "Louise's Waterworld",
"Year": 1997,
"imdbID": "tt0298417"
},
{
"Title": "Waterworld",
"Year": 2001,
"imdbID": "tt0381702"
}
]
}
2
Answers
This is the Glue table schema to allow Athena parsing the text file correctly.
In Athena,
will return all movies in an array
If you want a regular table, create a view based on this query
which gives you the result
As @Isc also mentioned, you need to define the columns as described in the json data.
Glue Table definition expects the column to be present in the data.Glue crawler crawls the data from external location using the Serde configuration and maps the data in the specified columns.
Hence you need to ensure that right schema is defined. Based on that, it crawls.
Your cloudformation template
Glue::Table
should look like this: