Imagine you have a set of R scripts that form an ETL pipeline, and you want to run it as an AWS Glue job. AWS Glue supports Python and Scala.
Is it possible to call an R script as a Python subprocess (or via a bash script that wraps a set of R scripts) within an AWS Glue job running in a container with both Python and R dependencies?
If so, please outline the steps required and key considerations.
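For example, the bash wrapper I have in mind would be something as simple as the following (the script names are placeholders for the actual pipeline stages):

    #!/bin/bash
    # run_pipeline.sh -- run each stage of the R pipeline in order,
    # stopping on the first failure
    set -euo pipefail

    Rscript extract.R
    Rscript transform.R
    Rscript load.R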
2 Answers
It is not possible
While it is possible to run custom code in Glue, Glue is based on Spark, so only Scala and Python are supported. As for calling R as a Python subprocess, that does not appear to be an option either, as noted in the documentation:
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.
As @Isc commented, I would recommend using Docker with ECS to run batch ETL jobs written in R.
Since Glue doesn't natively support running R scripts, you can consider the following as an alternative:
Example folder structure
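A minimal layout might look like this (all file names are illustrative); the wrapper script is the container's entry point and simply calls the R scripts in order:

    etl-pipeline/
    ├── Dockerfile
    ├── run_pipeline.sh    # entry point: runs the R scripts in order
    ├── extract.R
    ├── transform.R
    └── load.R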
Example Dockerfile based on https://hub.docker.com/r/rocker/tidyverse
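A minimal sketch, assuming the layout above; the extra R packages listed are placeholders for whatever your scripts actually need:

    # Start from a Rocker image with R and the tidyverse preinstalled
    FROM rocker/tidyverse:latest

    # Install any additional R packages the pipeline needs (names are placeholders)
    RUN R -e "install.packages(c('DBI', 'aws.s3'), repos = 'https://cloud.r-project.org')"

    # Copy the pipeline into the image and make the wrapper executable
    WORKDIR /opt/etl
    COPY . .
    RUN chmod +x run_pipeline.sh

    # Run the whole pipeline when the container starts
    ENTRYPOINT ["./run_pipeline.sh"]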
Example commands to push the image to ECR
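Something along these lines, following the guide referenced below (account ID, region, and repository name are placeholders):

    # Authenticate Docker to your ECR registry (AWS CLI v2)
    aws ecr get-login-password --region us-east-1 | \
        docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

    # Create the repository (one-time)
    aws ecr create-repository --repository-name etl-pipeline --region us-east-1

    # Build, tag, and push the image
    docker build -t etl-pipeline .
    docker tag etl-pipeline:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/etl-pipeline:latest
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/etl-pipeline:latest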
Ref: https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html
Then follow this guide to set up AWS Batch on Fargate (which runs your container through ECS) and to create and execute a job: https://docs.aws.amazon.com/batch/latest/userguide/getting-started-fargate.html