I am trying to read a schema from a text file that lives in the same package as the code, but I cannot read that file from an AWS Glue job. I will use the schema to create a DataFrame with PySpark. I can load the file locally without any issue. I zip the code files into a .zip, place it in an S3 bucket, and reference it in the Glue job (the job wiring is sketched at the end of this question). Everything else works fine; no problem there. But when I try the code below, it does not work.

import os
from pathlib import Path

# Build a path relative to this module's location, then read the file
file_path = os.path.join(Path(os.path.dirname(os.path.relpath(__file__))), "verifications.txt")
multiline_data = None
with open(file_path, 'r') as data_file:
    multiline_data = data_file.read()
self.logger.info(f"Schema is {multiline_data}")

This code throws the following error:

Error Category: UNCLASSIFIED_ERROR; NotADirectoryError: [Errno 20] Not a directory: 'src.zip/src/ingestion/jobs/verifications.txt'  

I also tried with an absolute path, but it didn't help either. The same block of code works fine locally.

I also tried passing the "./verifications.txt" path directly, but no luck.

So how do I read this file?
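For reference, this is roughly how the zip gets attached to the job: Glue picks it up through the --extra-py-files special parameter. A sketch of that wiring via boto3, where the job name, role, bucket, and keys are all placeholders rather than my real values:

import boto3

glue = boto3.client("glue")

# All names below are placeholders; the package zip is passed
# to the job through the --extra-py-files special parameter.
glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": "my-glue-role",
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://your-bucket-name/scripts/main.py"},
        "DefaultArguments": {"--extra-py-files": "s3://your-bucket-name/src.zip"},
    },
)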

2 Answers

  1. AWS Glue runs your script in a managed environment: the .zip you upload is added to the Python path so that imports resolve, but the built-in open() cannot traverse into a zip archive, which is why you get NotADirectoryError. The code works on your local machine because the file sits on the filesystem there as a regular file. For such jobs, consider storing the file in S3.
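
    If you would rather keep verifications.txt bundled in the zip, pkgutil can often read it anyway: the zip importer implements the get_data() hook even though open() cannot look inside an archive. A minimal sketch, assuming the package path shown in the error (src.ingestion.jobs) is importable:

    import pkgutil

    # get_data() asks the package's loader for the file, so it also works
    # for packages imported from a zip archive; it returns None if missing.
    raw = pkgutil.get_data("src.ingestion.jobs", "verifications.txt")
    if raw is not None:
        multiline_data = raw.decode("utf-8")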

  2. As @Bogdan mentioned, the way to do this is to store the verifications.txt file in S3. Here's some example code using boto3:

    import boto3
    
    # Hardcoded S3 bucket/key (these are normally passed in as Glue Job params)
    s3_bucket = 'your-bucket-name'
    s3_key = 'path/to/verifications.txt'
    
    # Read data from S3 using boto3
    s3_client = boto3.client('s3')
    response = s3_client.get_object(Bucket=s3_bucket, Key=s3_key)
    multiline_data = response['Body'].read().decode('utf-8')
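
    If the file holds a DDL-style schema string (e.g. "id INT, name STRING"), PySpark's DataFrameReader.schema() accepts it directly. A minimal sketch under that assumption, with a placeholder source path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # .schema() accepts a DDL schema string, so the text read from S3
    # can be passed as-is; the source path below is a placeholder.
    df = spark.read.schema(multiline_data).json("s3://your-bucket-name/path/to/data/")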
    