I have an S3 bucket "bucket1" containing a directory named "dir1", and inside this directory there are multiple files. I just want to build a list of all the file names in this directory in my PySpark code, but I haven't been able to. I am completely new to PySpark, so any leads would be helpful. Do I need to create a Spark session for it? Also, I don't want to use libraries like boto3.
2 Answers
Without boto3, you'll need to open a Spark session. Make sure your AWS credentials are configured; you can also use an IAM role here if you'll be deploying this somewhere.
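For example, one way to list the file names with nothing but a Spark session is to go through the Hadoop FileSystem API that Spark already ships with. A minimal sketch, assuming the s3a connector and your credentials are already configured; note that `spark._jvm` and `spark._jsc` are internal accessors, so treat this as a workaround rather than a stable API:

```python
from pyspark.sql import SparkSession

# A Spark session is enough; no boto3 needed. Credentials come from the
# environment, Hadoop config (fs.s3a.access.key / fs.s3a.secret.key), or an IAM role.
spark = SparkSession.builder.appName("list-s3-files").getOrCreate()

# Reach the Hadoop FileSystem API through Spark's JVM gateway
hadoop_conf = spark._jsc.hadoopConfiguration()
path = spark._jvm.org.apache.hadoop.fs.Path("s3a://bucket1/dir1/")
fs = path.getFileSystem(hadoop_conf)

# listStatus returns one FileStatus per object under the prefix
file_names = [status.getPath().getName() for status in fs.listStatus(path)]
print(file_names)
```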
In terms of performance, it's better to use boto3. But if you'd like to stay in PySpark, you can use the function input_file_name().
Example:
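A minimal sketch of that approach, assuming the files under s3a://bucket1/dir1/ can be read as plain text (swap in the reader that matches your format). Keep in mind it reads the file contents just to recover the paths, which is why the boto3 route is usually faster:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("list-s3-files").getOrCreate()

# Read everything under the prefix; replace spark.read.text with the reader
# that matches the actual file format (csv, json, parquet, ...)
df = spark.read.text("s3a://bucket1/dir1/*")

# input_file_name() tags each row with the full path of the file it came from
paths = (df.withColumn("path", input_file_name())
           .select("path")
           .distinct()
           .collect())

file_names = [row.path for row in paths]
print(file_names)
```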