
I am using the following code to read the iris dataset from an S3 bucket.

import pandas as pd
import s3fs

s3_path = 's3://h2o-public-test-data/smalldata/iris/iris.csv'

s3 = s3fs.S3FileSystem(anon=True)
with s3.open(s3_path, 'rb') as f:
    df = pd.read_csv(f, header = True)

However, the column names are just the contents of the first row of the dataset. How do I fix that?

2 Answers


  1. Change the following:

    1. s3_path can omit the s3:// prefix (s3fs accepts the path with or without it).
    2. iris.csv has no header row. If you need a file with a header, use iris_wheader.csv instead.
    3. In read_csv, header does not accept a boolean. Pass a row number (header=0 for the first row) or header=None when the file has no header row.

    Your final code should look something like this:

    import s3fs
    import pandas as pd
    
    s3 = s3fs.S3FileSystem(anon=True)
    
    with s3.open('h2o-public-test-data/smalldata/iris/iris_wheader.csv', 'rb') as f:
        df = pd.read_csv(f, header=0)
        print(df.head())
    

    Edit: You can directly read the file in pandas as follows:

    import pandas as pd
    
    df = pd.read_csv('s3://h2o-public-test-data/smalldata/iris/iris_wheader.csv', header=0, storage_options={
        "anon": True
    })
    print(df.head())
    

    You still need to have s3fs installed; you just don't need to open the file yourself.
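
To check what header=0 and header=None do without touching S3, here is a minimal sketch that reads from an in-memory buffer instead of the bucket. The sample rows and column names are made up for illustration and are not copied from the actual files.

```python
import io

import pandas as pd

# Illustrative stand-in for a CSV that starts with a header row (hypothetical data).
csv_with_header = (
    "sepal_len,sepal_wid,class\n"
    "5.1,3.5,Iris-setosa\n"
    "4.9,3.0,Iris-setosa\n"
)

# header=0: row 0 supplies the column names and is not counted as data.
df_named = pd.read_csv(io.StringIO(csv_with_header), header=0)

# header=None: every row is treated as data and the columns are numbered 0, 1, 2.
df_numbered = pd.read_csv(io.StringIO(csv_with_header), header=None)

print(list(df_named.columns))
print(df_named.shape, df_numbered.shape)
```

With header=None the header line ends up as the first data row, which is why that setting only makes sense for a file like iris.csv that has no header row.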

  2. See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for all the parameters.

    If your CSV does not contain the column names, you can use the names parameter to specify the names you want. In that case, leave header unset (it does not accept True anyway); the first row is then kept as data.

    df = pd.read_csv(file_path, names=['yan', 'tan', 'tetherer', 'mether', 'pip'])
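
As a rough, self-contained illustration of the names approach, the sketch below reads two header-less rows from an in-memory buffer. The rows and the column names are hypothetical placeholders; choose names that suit your data.

```python
import io

import pandas as pd

# Illustrative header-less rows (hypothetical data with five columns).
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# Hypothetical column names, one per column in the data.
cols = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]

# With names given and header left unset, the first row stays in the data.
df = pd.read_csv(io.StringIO(raw), names=cols)

print(df.shape)
print(df.head())
```

If the file does have a header row that you want to replace, passing header=0 together with names discards the original header line and uses your names instead.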
    