skip to Main Content

I am querying Athena thru lambda. Results are getting stored in csv format in S3 bucket.

The csv files has two columns – EventTime and instance id.

I am reading csv file via one of function in my lambda handler:

def read_instanceids(path):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
    obj = bucket.Object(key= path)
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    return lines**

Output:

[
  ""eventTime","instanceId"",
  ""2021-09-27T19:46:08Z","""i-0aa1f4dd"""",
  ""2021-09-27T21:04:13Z","""i-0465c287"""",
  ""2021-09-27T21:10:48Z","""i-08b75f79"""",
  ""2021-09-27T19:40:43Z","""i-0456700b"""",
  ""2021-03-29T21:58:40Z","""i-0724f99f"""",
  ""2021-03-29T23:27:44Z","""i-0fafbe64"""",
  ""2021-03-29T21:41:12Z","""i-0064a8552"""",
  ""2021-03-29T23:19:09Z","""i-07f5f08e5""""
]

I want to store only my instance ids in one array.

How I can achieve that. I cant use Pandas/Numpy.

If I am using get_query_results – and returning the response – its in the below format:

[
  {
    "Data": [
      {
        "VarCharValue": "eventTime"
      },
      {
        "VarCharValue": "instanceId"
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-09-23T22:36:15Z"
      },
      {
        "VarCharValue": ""i-053090803""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T21:58:40Z"
      },
      {
        "VarCharValue": ""i-0724f62a""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T21:41:12Z"
      },
      {
        "VarCharValue": ""i-552""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:19:09Z"
      },
      {
        "VarCharValue": ""i-07f4e5""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:03:09Z"
      },
      {
        "VarCharValue": ""i-0eb453""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T19:18:11Z"
      },
      {
        "VarCharValue": ""i-062120""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T18:15:26Z"
      },
      {
        "VarCharValue": ""i-0121a04""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:27:44Z"
      },
      {
        "VarCharValue": ""i-0f213""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T18:07:05Z"
      },
      {
        "VarCharValue": ""i-0ee19d8""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-28T14:49:22Z"
      },
      {
        "VarCharValue": ""i-04ad3c29""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-28T14:38:43Z"
      },
      {
        "VarCharValue": ""i-7c6166""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T19:13:42Z"
      },
      {
        "VarCharValue": ""i-07bc579d""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-29T19:47:34Z"
      },
      {
        "VarCharValue": ""i-0b8bc7df5""
      }
    ]
  }
  ]

3

Answers


  1. IF your list was valid, you can do:

    l = [ "eventTime",
          "instanceId",
          "2021-09-27T19:46:08Z",
          "i-0aa1f4dd",
          "2021-09-27T21:04:13Z",
          """i-0465c287""",
          "2021-09-27T21:10:48Z",
          """i-08b75f79""",
          "2021-09-27T19:40:43Z",
          """i-0456700b""",
          "2021-03-29T21:58:40Z",
          """i-0724f99f""",
          "2021-03-29T23:27:44Z",
          """i-0fafbe64""",
          "2021-03-29T21:41:12Z",
          """i-0064a8552""",
          "2021-03-29T23:19:09Z",
          """i-07f5f08e5""" ]
    print(l[2:][1::2])
    ['i-0aa1f4dd', 'i-0465c287', 'i-08b75f79', 'i-0456700b', 'i-0724f99f', 'i-0fafbe64', 'i-0064a8552', 'i-07f5f08e5']
    
    Login or Signup to reply.
  2. Python has csv module in standard library. https://docs.python.org/3/library/csv.html

    But in this use case, if instanceIds doesn’t contain comma you can split lines by comma, take second field and strip double quotes.

    def read_instanceids(path):
        s3 = boto3.resource('s3')
        bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
        obj = bucket.Object(key= path)
        response = obj.get()
        lines = response['Body'].read().decode('utf-8').split()
        return [line.split(',')[1].strip('"') for line in lines[1:]]
    
    Login or Signup to reply.
  3. You can use the result returned from Amazon Athena via get_query_results().

    If the data variable contains the JSON shown in your question, you can extract a list of the instances with:

    rows = [row['Data'][1]['VarCharValue'].replace('"', '') for row in data]
    print(rows)
    

    The output is:

    ['instanceId', 'i-053090803', 'i-0724f62a', 'i-552', 'i-07f4e5', 'i-0eb453', 'i-062120', 'i-0121a04', 'i-0f213', 'i-0ee19d8', 'i-04ad3c29', 'i-7c6166', 'i-07bc579d', 'i-0b8bc7df5']
    

    You can skip the column header by referencing: rows[1:]

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search