
I know there are a few topics around this, but I am looking for a way to do this using the AWS CLI. I inherited a really old S3 bucket with over 30 million log files, and I need to get a specific month of log files (i.e., "2023-06*").

Initially I tried using Athena but received a timeout error, presumably from the large number of files it needs to go through.

I tried to copy the files locally using the following, but I realized it still goes through the entire folder and takes forever.

aws s3 cp s3://mybucket/S3-Accesslogs/ logs --exclude "*" --include "2023-06*" --recursive

Using the AWS CLI, is there a way to avoid going through the entire folder to find the matches? Something similar to "ls", or a combination of both ls and cp?

My problem is that I have over 30 million files.

2 Answers


  1. If you want to sync a single folder/directory from an S3 bucket to a directory on your machine:

    aws s3 sync "s3://bucket" "/var/local/path" --exclude "*" --include "*2023-06*"
    

    TIP: I would recommend creating a new folder for the landing directory on your machine, to avoid confusing pre-existing local files with the ones downloaded from S3.
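
    Applied to the bucket and prefix from the question, that would look something like the following (the local folder name is just an example). Note that --include/--exclude are applied after listing, so the whole prefix is still enumerated:

    # Land the June 2023 logs in a fresh local folder.
    mkdir -p ./s3-logs-2023-06
    aws s3 sync "s3://mybucket/S3-Accesslogs/" "./s3-logs-2023-06" --exclude "*" --include "2023-06*"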

  2. This is going to be a slow operation because the AWS CLI first Lists the objects and then performs the Copy.

    The ListObjects() API call returns a maximum of 1,000 objects per call, so listing a bucket of 30 million objects could take 30,000 API calls, and the CLI might actually run out of memory before starting the copy operation.
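
    The listing the CLI performs under the hood is ListObjectsV2; a single page of it can be reproduced with the low-level CLI (bucket and prefix taken from the question), which shows why 30 million objects means tens of thousands of calls:

    # One page of results (at most 1,000 keys); the CLI has to repeat this
    # with the returned continuation token until the whole prefix is listed.
    aws s3api list-objects-v2 \
        --bucket mybucket \
        --prefix "S3-Accesslogs/" \
        --max-keys 1000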

    An alternate approach would be:

    • Activate Amazon S3 Inventory, which can produce a daily or weekly CSV file listing all objects (see the sketch after this list)
    • Edit the output file so that it only contains the files you want (2023-06*)
    • Use that file to generate commands that copy only the desired files
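
    A minimal sketch of enabling the inventory from the CLI (the destination bucket, report prefix, and configuration ID here are hypothetical; the inventory can also be enabled from the S3 console):

    # The destination bucket needs a policy allowing s3.amazonaws.com to write the report,
    # and the first report can take up to 48 hours to be delivered.
    aws s3api put-bucket-inventory-configuration \
        --bucket mybucket \
        --id access-log-inventory \
        --inventory-configuration '{
            "Id": "access-log-inventory",
            "IsEnabled": true,
            "IncludedObjectVersions": "Current",
            "Schedule": { "Frequency": "Daily" },
            "Destination": {
                "S3BucketDestination": {
                    "Bucket": "arn:aws:s3:::my-inventory-reports",
                    "Format": "CSV",
                    "Prefix": "inventory"
                }
            }
        }'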

    I normally use Microsoft Excel to write a copy command that includes the filename. For example, if the first column has the filename, then I make a formula like ="aws s3 cp s3://bucket/"&A1&" ."

    Then, I copy the output of the formula and paste it into a Terminal window.
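
    A rough command-line equivalent of that Excel step, assuming the inventory CSV has the bucket name in the first column and the object key in the second (the exact columns depend on the fields chosen when enabling the inventory), might look like:

    # Keep only the June 2023 keys and turn each one into an "aws s3 cp" command.
    # File names here are hypothetical; tighten the grep pattern if needed.
    grep '2023-06' inventory.csv \
        | awk -F',' '{gsub(/"/, "", $2); print "aws s3 cp \"s3://mybucket/" $2 "\" ."}' \
        > copy-commands.txt

    # Review copy-commands.txt, then run it.
    bash copy-commands.txt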

    This will download one file at a time, which can be quite slow compared to normal AWS CLI copy commands that download multiple files at the same time. However, it should be quick enough for a relatively small number of files (a few thousand), and it completely avoids listing the contents of the bucket, since the Amazon S3 Inventory file is the source data.
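
    If the one-at-a-time downloads are too slow, the generated commands can also be run a few at a time; this is just a sketch assuming GNU xargs and the copy-commands.txt file from the sketch above:

    # Run up to 8 of the generated copy commands in parallel, one command per input line.
    xargs -P 8 -I {} sh -c '{}' < copy-commands.txt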
