skip to Main Content

I have json files in S3 bucket generated by AWS Textract service and I’m using Athena to query data from those files. Every file has the same structure and I created a table in Athena where I have column "blocks" that is array of struct:

  "blocks": [{
      "BlockType": "LINE",
      "Id": "12345",
      "Text": "Text from document",
      "Confidence": 98.7022933959961,
      "Page": "1",
      "SourceLanguage": "de",
      "TargetLanguage": "en",
    },
    ...100+ blocks]

How can I query just for the "Text" property from every block that has one?

Thanks in advance!

2

Answers


  1. It looks like column stores array of rows, so you can process it as one (array functions):

    select transform(
         filter(block_column, t -> t.text is not null), 
         r => cast(row(r.text) as row(text varchar))) texts
    from table
    
    Login or Signup to reply.
  2. I have defined a table with exact schema of yours using sample JSON provided.

    _col0
    #   
    array(row(blocktype varchar, id varchar, text varchar, confidence double, page varchar, sourcelanguage varchar, targetlanguage varchar))
    

    I have used unnest operator to flatten the array of blocks and fetch the Text column from it using below query:

    select block.text from <table-name> CROSS JOIN UNNEST(blocks) as t(block)
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search