Amazon web services - Extract specific column from csv stored in S3

Dkashyap
July 20, 2022
166 views
3 votes
3 Answers

I am querying Athena thru lambda. Results are getting stored in csv format in S3 bucket.

The csv files has two columns – EventTime and instance id.

I am reading csv file via one of function in my lambda handler:

def read_instanceids(path):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
    obj = bucket.Object(key= path)
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    return lines**

Output:

[
  ""eventTime","instanceId"",
  ""2021-09-27T19:46:08Z","""i-0aa1f4dd"""",
  ""2021-09-27T21:04:13Z","""i-0465c287"""",
  ""2021-09-27T21:10:48Z","""i-08b75f79"""",
  ""2021-09-27T19:40:43Z","""i-0456700b"""",
  ""2021-03-29T21:58:40Z","""i-0724f99f"""",
  ""2021-03-29T23:27:44Z","""i-0fafbe64"""",
  ""2021-03-29T21:41:12Z","""i-0064a8552"""",
  ""2021-03-29T23:19:09Z","""i-07f5f08e5""""
]

I want to store only my instance ids in one array.

How I can achieve that. I cant use Pandas/Numpy.

If I am using get_query_results – and returning the response – its in the below format:

[
  {
    "Data": [
      {
        "VarCharValue": "eventTime"
      },
      {
        "VarCharValue": "instanceId"
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-09-23T22:36:15Z"
      },
      {
        "VarCharValue": ""i-053090803""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T21:58:40Z"
      },
      {
        "VarCharValue": ""i-0724f62a""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T21:41:12Z"
      },
      {
        "VarCharValue": ""i-552""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:19:09Z"
      },
      {
        "VarCharValue": ""i-07f4e5""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:03:09Z"
      },
      {
        "VarCharValue": ""i-0eb453""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T19:18:11Z"
      },
      {
        "VarCharValue": ""i-062120""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T18:15:26Z"
      },
      {
        "VarCharValue": ""i-0121a04""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-29T23:27:44Z"
      },
      {
        "VarCharValue": ""i-0f213""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T18:07:05Z"
      },
      {
        "VarCharValue": ""i-0ee19d8""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-28T14:49:22Z"
      },
      {
        "VarCharValue": ""i-04ad3c29""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-28T14:38:43Z"
      },
      {
        "VarCharValue": ""i-7c6166""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-03-30T19:13:42Z"
      },
      {
        "VarCharValue": ""i-07bc579d""
      }
    ]
  },
  {
    "Data": [
      {
        "VarCharValue": "2021-04-29T19:47:34Z"
      },
      {
        "VarCharValue": ""i-0b8bc7df5""
      }
    ]
  }
  ]

Answers

IF your list was valid, you can do:

l = [ "eventTime",
      "instanceId",
      "2021-09-27T19:46:08Z",
      "i-0aa1f4dd",
      "2021-09-27T21:04:13Z",
      """i-0465c287""",
      "2021-09-27T21:10:48Z",
      """i-08b75f79""",
      "2021-09-27T19:40:43Z",
      """i-0456700b""",
      "2021-03-29T21:58:40Z",
      """i-0724f99f""",
      "2021-03-29T23:27:44Z",
      """i-0fafbe64""",
      "2021-03-29T21:41:12Z",
      """i-0064a8552""",
      "2021-03-29T23:19:09Z",
      """i-07f5f08e5""" ]
print(l[2:][1::2])
['i-0aa1f4dd', 'i-0465c287', 'i-08b75f79', 'i-0456700b', 'i-0724f99f', 'i-0fafbe64', 'i-0064a8552', 'i-07f5f08e5']

- FilipHanes
- July 20, 2022 at 9:59 am
- 0 votes
0
Python has csv module in standard library. https://docs.python.org/3/library/csv.html

But in this use case, if instanceIds doesn’t contain comma you can split lines by comma, take second field and strip double quotes.
```
def read_instanceids(path):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
    obj = bucket.Object(key= path)
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    return [line.split(',')[1].strip('"') for line in lines[1:]]
```
Login or Signup to reply.

- JohnRotenstein
- July 20, 2022 at 12:00 pm
- 0 votes
0
You can use the result returned from Amazon Athena via get_query_results().

If the data variable contains the JSON shown in your question, you can extract a list of the instances with:
```
rows = [row['Data'][1]['VarCharValue'].replace('"', '') for row in data]
print(rows)
```
The output is:
```
['instanceId', 'i-053090803', 'i-0724f62a', 'i-552', 'i-07f4e5', 'i-0eb453', 'i-062120', 'i-0121a04', 'i-0f213', 'i-0ee19d8', 'i-04ad3c29', 'i-7c6166', 'i-07bc579d', 'i-0b8bc7df5']
```
You can skip the column header by referencing: rows[1:]
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Amazon web services – Extract specific column from csv stored in S3

Answers