I am trying to read a schema from a text file that lives in the same package as my code, but I cannot read that file from the AWS Glue job. I will use that schema to create a DataFrame with PySpark. I can load the file locally without issue. I zip the code files into a .zip, place it in an S3 bucket, and reference it in the Glue job. Everything else works fine; only the code below fails.
file_path = os.path.join(Path(os.path.dirname(os.path.relpath(__file__))), "verifications.txt")
multiline_data = None
with open(file_path, 'r') as data_file:
    multiline_data = data_file.read()
self.logger.info(f"Schema is {multiline_data}")
This code throws the below error:
Error Category: UNCLASSIFIED_ERROR; NotADirectoryError: [Errno 20] Not a directory: 'src.zip/src/ingestion/jobs/verifications.txt'
I also tried abs_path, but it didn’t help either. The same block of code works fine locally. I also tried passing "./verifications.txt" directly, with no luck.
So how do I read this file?
2 Answers
AWS Glue scripts run in a managed environment. The zip you supply is added to the Python path so that imports resolve, but the built-in open() cannot read a path inside a zip archive, which is why you get NotADirectoryError. The same code works on your local machine because there the file sits on the filesystem next to the code, not inside a zip. For jobs like this, store such files in S3 instead.
As @Bogdan mentioned, the way to do this is to store the verifications.txt file in S3. Here’s some example code using boto3.
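A minimal sketch of that approach. The bucket name, key, and helper-function name below are placeholders, not from the original post; it assumes the Glue job’s IAM role has s3:GetObject on the bucket, and that boto3 is available (it is in the Glue runtime).

```python
def read_schema_text(bucket, key, s3_client=None):
    """Return the contents of s3://<bucket>/<key> as a string.

    `s3_client` is injectable for testing; in a Glue job you can
    omit it and let the function create one.
    """
    if s3_client is None:
        import boto3  # available by default in the AWS Glue runtime
        s3_client = boto3.client("s3")
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read().decode("utf-8")


# In the Glue job, roughly (names are placeholders):
# schema_text = read_schema_text("my-etl-bucket", "schemas/verifications.txt")
# self.logger.info(f"Schema is {schema_text}")
# df = spark.read.schema(schema_text).csv(input_path)
```

Keeping the schema in S3 also means you can change it without rebuilding and re-uploading the code zip.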