
I have an S3 bucket "bucket1" containing a directory named "dir1", which in turn holds multiple files. I just want to build a list of all the file names in this directory in my PySpark code, but I haven't been able to. I am completely new to PySpark, so any leads would be helpful. Do I need to create a Spark session for this? Also, I don't want to use libraries like boto3, etc.

2 Answers


  1. Without boto3, you'll need to create a Spark session. Remember to have your AWS credentials configured; you can also use an IAM role instead of access keys if you deploy this somewhere.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("List S3 Files") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.access.key", "access_key") \
        .config("spark.hadoop.fs.s3a.secret.key", "secret_key") \
        .getOrCreate()

    s3_directory_path = "s3a://bucket1/dir1/"

    # wholeTextFiles yields (path, content) pairs; keys() keeps only the paths
    file_paths = spark.sparkContext.wholeTextFiles(s3_directory_path).keys().collect()

    # strip the directory prefix, leaving just the file names
    file_names = [path.split("/")[-1] for path in file_paths]
    
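    Note that wholeTextFiles reads the contents of every file just to obtain the paths. If the files are large, a lighter option is to list the directory through Hadoop's FileSystem API. The sketch below assumes the same spark session and s3a configuration as above, and relies on PySpark's internal _jsc/_jvm handles:

    # Sketch: list names via Hadoop's FileSystem API instead of reading file contents.
    # Assumes the `spark` session above; `_jsc` and `_jvm` are internal PySpark handles.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    jvm = spark.sparkContext._jvm
    dir_path = jvm.org.apache.hadoop.fs.Path("s3a://bucket1/dir1/")
    fs = dir_path.getFileSystem(hadoop_conf)
    file_names = [status.getPath().getName()
                  for status in fs.listStatus(dir_path)
                  if status.isFile()]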
  2. In terms of performance, it's better to use boto3.

    But if you'd like to stick with PySpark, you can use the function input_file_name().

    Example:

    from pyspark.sql.functions import input_file_name

    # read the files, then tag each row with the path of the file it came from
    df = spark.read.json("s3a://bucket/folder")
    df = df.withColumn("file_path", input_file_name())

    # collect the distinct paths and keep only the file name portion
    file_paths = [row.file_path for row in df.select("file_path").distinct().collect()]
    file_names = [path.split("/")[-1] for path in file_paths]
    
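    This assumes the files in the directory are JSON. If they aren't, a format-agnostic sketch is possible with the binaryFile source (assuming Spark 3.0+, where it exposes a path column for each file):

    # Sketch: list file names without assuming a file format (Spark 3.0+).
    # The binaryFile source exposes path, modificationTime, length and content columns.
    df = spark.read.format("binaryFile").load("s3a://bucket1/dir1/")
    file_names = [row.path.split("/")[-1]
                  for row in df.select("path").distinct().collect()]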