I want to read multiple text files from an S3 bucket, which together make up a DataFrame of 10M records and 900 columns.
However, it is taking too long to read the data on an ml.p3.2xlarge instance. I would also like to know whether there is a better way to read the data that uses all the resources available on a large AWS instance.
Following is the code I am using right now.
import pandas as pd

# prefix_objs (S3 object iterator), bucket (boto3 Bucket) and col (list of the
# columns to load) are assumed to be defined earlier in the notebook.
def get_data(prefix_objs=prefix_objs, bucket=bucket, cols=col):
    prefix_df = []  # build the list inside the function to avoid a mutable default argument
    for i, obj in enumerate(prefix_objs, start=1):
        key = obj.key
        file_path = 's3://' + bucket.name + '/' + str(key)
        temp = pd.read_csv(file_path,
                           sep="|",
                           usecols=cols)  # use the cols parameter (the original read the global col)
        print("File No: {}".format(i))
        prefix_df.append(temp)
    return pd.concat(prefix_df)
2 Answers
Use the Python boto3 client, or its async counterpart aioboto3, to download the files from AWS S3 before reading them.
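A minimal sketch of that approach, assuming a placeholder bucket name and key prefix, unique file basenames, and the same "|" delimiter and col column list as in the question; the objects are downloaded in parallel with a plain boto3 client and a thread pool, then read with pandas:

import os
from concurrent.futures import ThreadPoolExecutor

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket_name = "my-bucket"   # placeholder bucket name
prefix = "my/prefix/"       # placeholder key prefix
local_dir = "/tmp/data"
os.makedirs(local_dir, exist_ok=True)

# List all object keys under the prefix.
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix)
        for obj in page.get("Contents", [])]

def download(key):
    # Assumes basenames are unique under the prefix.
    local_path = os.path.join(local_dir, os.path.basename(key))
    s3.download_file(bucket_name, key, local_path)
    return local_path

# Download in parallel; boto3 clients are thread-safe for this usage.
with ThreadPoolExecutor(max_workers=16) as pool:
    local_files = list(pool.map(download, keys))

# Read and concatenate; usecols limits parsing to the needed columns.
df = pd.concat(pd.read_csv(f, sep="|", usecols=col) for f in local_files)

Downloading the raw objects in parallel and parsing from local disk is usually faster than letting pandas stream each file over S3 one at a time.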
Is this in a SageMaker notebook instance or a SageMaker training job (an asynchronous job)?
For training jobs, see this guide.
For a notebook instance, you might get better performance by copying all the files locally first, using the CLI command:
aws s3 sync s3://... /tmp
and then reading them from disk. A side note: try switching from P3 to G5 instances. p3.2xlarge is a very old instance type; you will get better cost efficiency and more GPU RAM with a g5 instance. In addition, G5 has a local SSD drive, whereas p3.2xlarge uses a slower EBS volume.
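A minimal sketch of that workflow, assuming the files are synced to /tmp/data and use the same "|" delimiter and col column list as in the question; the bucket and prefix are placeholders:

import glob
import subprocess

import pandas as pd

# Sync the S3 prefix to local disk (placeholder bucket/prefix and local path).
subprocess.run(
    ["aws", "s3", "sync", "s3://my-bucket/my/prefix/", "/tmp/data"],
    check=True,
)

# Read every synced file from local disk and concatenate into one DataFrame.
files = sorted(glob.glob("/tmp/data/*"))
df = pd.concat(
    (pd.read_csv(f, sep="|", usecols=col) for f in files),  # col: the column list from the question
    ignore_index=True,
)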