I want to read multiple text files from an S3 bucket, which together make up a DataFrame of 10M records and 900 columns.
However, it is taking too long to read the data on an ml.p3.2xlarge instance. I would also like to know whether there is a better way to read the data that uses all the resources available on a large AWS instance.
Following is the code I am using right now.
import pandas as pd

# prefix_objs (S3 object iterator), bucket (boto3 Bucket) and col (list of the
# columns to load) are assumed to be defined earlier in the notebook.
def get_data(prefix_objs=prefix_objs, bucket=bucket, cols=col):
    prefix_df = []  # build the list inside the function to avoid a mutable default argument
    for i, obj in enumerate(prefix_objs, start=1):
        key = obj.key
        file_path = 's3://' + bucket.name + '/' + str(key)
        temp = pd.read_csv(file_path,
                           sep="|",
                           usecols=cols)  # use the cols parameter (the original read the global col)
        print("File No: {}".format(i))
        prefix_df.append(temp)
    return pd.concat(prefix_df)
2 Answers
Use the Python boto3 client, or its async counterpart aioboto3, to download the files from AWS S3 before reading them.
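A minimal sketch of that approach, assuming a placeholder bucket name and key prefix, unique file basenames, and the same "|" delimiter and col column list as in the question; the objects are downloaded in parallel with a plain boto3 client and a thread pool, then read with pandas:

import os
from concurrent.futures import ThreadPoolExecutor

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket_name = "my-bucket"   # placeholder bucket name
prefix = "my/prefix/"       # placeholder key prefix
local_dir = "/tmp/data"
os.makedirs(local_dir, exist_ok=True)

# List all object keys under the prefix.
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix)
        for obj in page.get("Contents", [])]

def download(key):
    # Assumes basenames are unique under the prefix.
    local_path = os.path.join(local_dir, os.path.basename(key))
    s3.download_file(bucket_name, key, local_path)
    return local_path

# Download in parallel; boto3 clients are thread-safe for this usage.
with ThreadPoolExecutor(max_workers=16) as pool:
    local_files = list(pool.map(download, keys))

# Read and concatenate; usecols limits parsing to the needed columns.
df = pd.concat(pd.read_csv(f, sep="|", usecols=col) for f in local_files)

Downloading the raw objects in parallel and parsing from local disk is usually faster than letting pandas stream each file over S3 one at a time.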
Is this in a SageMaker notebook instance or a SageMaker training job (an asynchronous job)?
For training jobs, see this guide.
For a notebook instance, you might get better performance by copying all the files locally first, using the CLI command:
aws s3 sync s3://... /tmp
and then reading them from disk. A side note: try switching from P3 to G5 instances. p3.2xlarge is a very old instance type; you will get better cost efficiency and more GPU RAM with a g5 instance. In addition, G5 has a local SSD drive, whereas p3.2xlarge uses a slower EBS volume.
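A minimal sketch of that workflow, assuming the files are synced to /tmp/data and use the same "|" delimiter and col column list as in the question; the bucket and prefix are placeholders:

import glob
import subprocess

import pandas as pd

# Sync the S3 prefix to local disk (placeholder bucket/prefix and local path).
subprocess.run(
    ["aws", "s3", "sync", "s3://my-bucket/my/prefix/", "/tmp/data"],
    check=True,
)

# Read every synced file from local disk and concatenate into one DataFrame.
files = sorted(glob.glob("/tmp/data/*"))
df = pd.concat(
    (pd.read_csv(f, sep="|", usecols=col) for f in files),  # col: the column list from the question
    ignore_index=True,
)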