I am using the following bit of code to read the iris dataset from an S3 bucket:
import pandas as pd
import s3fs
s3_path = 's3://h2o-public-test-data/smalldata/iris/iris.csv'
s3 = s3fs.S3FileSystem(anon=True)
with s3.open(s3_path, 'rb') as f:
    df = pd.read_csv(f, header = True)
However, the column names are just the contents of the first row of the dataset. How do I fix that?
2 Answers
The following changes are required:
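- `s3://`
- `iris.csv` is a file without a header. If you need a file with a header, use the `iris_wheader.csv` file.
- The `header` parameter of `read_csv` does not accept a boolean value. Pass the integer index of the header row (for example `header=0`), or `None` if the file has no header row.

Your final code should look something like this: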
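One way this could be written, as a sketch rather than the answerer's exact code: `header=0` marks the first row as the header, and the S3 access is wrapped in a function (`load_from_s3`, a name chosen here for illustration) so that the parsing step can be tried on an in-memory sample without network access. The sample column names are illustrative and are not taken from the real file.

```python
import io

import pandas as pd


def read_csv_with_header(file_obj):
    """Read a CSV whose first row holds the column names."""
    # header=0 (an integer, not True) tells pandas that row 0 is the header.
    return pd.read_csv(file_obj, header=0)


def load_from_s3(s3_path):
    """Open a public S3 object anonymously and parse it (needs s3fs and network access)."""
    import s3fs  # imported lazily so the local demo below runs without it

    s3 = s3fs.S3FileSystem(anon=True)
    with s3.open(s3_path, 'rb') as f:
        return read_csv_with_header(f)


# Local stand-in for a file with a header row; the column names are illustrative.
sample = io.StringIO(
    "sepal_len,sepal_wid,petal_len,petal_wid,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)
df = read_csv_with_header(sample)
print(df.columns.tolist())
```

Against the real bucket, the call would be `load_from_s3('s3://h2o-public-test-data/smalldata/iris/iris_wheader.csv')`, assuming `iris_wheader.csv` sits in the same directory as `iris.csv`.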
Edit: You can directly read the file in pandas as follows:
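A sketch of such a direct read, assuming pandas 1.2 or later (where `read_csv` accepts `storage_options`) and assuming `iris_wheader.csv` lives in the same directory as `iris.csv`. The helper name `read_public_csv` is hypothetical, and the function is only defined here, not called, because calling it needs network access.

```python
import pandas as pd

# Assumed location of the header-bearing file (same directory as iris.csv).
S3_PATH = "s3://h2o-public-test-data/smalldata/iris/iris_wheader.csv"


def read_public_csv(path=S3_PATH):
    """Let pandas open the s3:// URL itself (s3fs must be installed)."""
    # storage_options is passed through to s3fs; anon=True requests
    # anonymous access, mirroring S3FileSystem(anon=True) in the question.
    return pd.read_csv(path, header=0, storage_options={"anon": True})
```

Calling `read_public_csv()` should then return the DataFrame without any explicit `s3.open(...)` step.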
You still need to install s3fs; you just no longer need to open the file yourself in order to access it.
See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for all the parameters.
If you don’t have a CSV with the column names, you can use the `names` parameter to specify the names you want. In that case, you do not need to set `header`.
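A small sketch of the `names` approach on a headerless file, using an in-memory sample instead of the S3 object so it runs offline. The column names in `COLUMNS` are hypothetical; pick whatever names suit your data.

```python
import io

import pandas as pd

# Hypothetical column names for the five iris columns.
COLUMNS = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]

# Stand-in for a headerless CSV such as iris.csv: every row is data.
headerless = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# With names= given, pandas treats the first row as data; header=None
# only makes that explicit.
df = pd.read_csv(headerless, header=None, names=COLUMNS)
print(df.shape)  # (2, 5)
```

The same `names=COLUMNS` argument can be passed when reading from an `s3://` path or from a file object opened with s3fs.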