Imagine you had a set of R scripts that form an ETL pipeline that you wanted to run as an AWS Glue job. AWS Glue supports Python and Scala.

Is it possible to call an R script as a Python subprocess (or a bash script that wraps a set of R scripts) within an AWS Glue job running in a container with Python and R dependencies?

If so, please outline the steps required and key considerations.



  1. It is not possible

    Although Glue can run custom code, it is built on Spark, so only Scala and Python are supported. As for calling R from a Python subprocess, that does not appear to be an option either, given this note in the documentation:

    Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

    As @Isc commented, I would recommend using Docker with ECS to run batch ETL jobs using R.
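If a Python entry point is still wanted around the R scripts in the Docker/ECS setup, the subprocess idea from the question could be sketched as below. This is only an illustration under assumptions: it presumes `Rscript` is on the `PATH` inside the container, and the helper names `build_rscript_command` and `run_r_script` are hypothetical, not part of any AWS or R library.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_rscript_command(script_path: str, args: Sequence[str] = ()) -> List[str]:
    """Return the argv list for running an R script with Rscript (hypothetical helper)."""
    # --vanilla skips saved workspaces and user profiles, so runs are reproducible.
    return ["Rscript", "--vanilla", script_path, *args]


def run_r_script(script_path: str, args: Sequence[str] = ()) -> str:
    """Run the R script and return its stdout; raise if R is missing or the script fails."""
    # Assumption: R is installed in the container image and Rscript is on PATH.
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found on PATH; is R installed in the image?")
    result = subprocess.run(
        build_rscript_command(script_path, args),
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Inside the container, calling `run_r_script("/scripts/rtest.R")` would run the script and return whatever it printed. Passing the command as a list rather than a shell string avoids quoting problems with script arguments.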

  2. As Glue doesn’t natively support running R scripts, you can consider the following as an alternative:

    1. Customise your own Docker image
    2. Push the image to ECR
    3. Configure the compute resources and schedule using AWS Batch

    Example folder structure

    ├── Dockerfile
    └── scripts
        └── rtest.R

    Example Dockerfile based on

    FROM rocker/tidyverse:4.2.2
    WORKDIR /scripts
    COPY scripts/ /scripts/
    RUN chmod 755 ./*
    # Install additional R libraries

    Example commands to push the image to ECR

    aws ecr get-login-password --region region | docker login --username AWS --password-stdin
    docker build -t rdev .
    docker tag rdev:latest
    docker push


    Then follow this guide to configure an ECS cluster on Fargate, then create and execute a job.
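Once a job queue and job definition exist, submitting the job can also be scripted. The following is a minimal sketch using boto3's AWS Batch client, not a definitive setup: the queue name, job definition name, and region are placeholders you would replace with your own, and the two helper functions are hypothetical.

```python
from typing import Any, Dict, Sequence


def build_submit_job_request(
    job_name: str,
    job_queue: str,
    job_definition: str,
    command: Sequence[str],
) -> Dict[str, Any]:
    """Build the keyword arguments for the Batch submit_job call (hypothetical helper)."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,            # assumed to exist already
        "jobDefinition": job_definition,  # assumed to point at the image pushed to ECR
        # Override the container command so the job runs the R script.
        "containerOverrides": {"command": list(command)},
    }


def submit_r_job(request: Dict[str, Any], region_name: str) -> str:
    """Submit the job to AWS Batch and return its job ID."""
    # Imported here so the request builder can be used without boto3 installed.
    import boto3

    client = boto3.client("batch", region_name=region_name)
    response = client.submit_job(**request)
    return response["jobId"]


# Placeholder names for illustration only; none of them come from the answer above.
request = build_submit_job_request(
    job_name="rtest-run",
    job_queue="rdev-queue",
    job_definition="rdev-job-def",
    command=["Rscript", "/scripts/rtest.R"],
)
```

Calling `submit_r_job(request, region_name="...")` with valid AWS credentials would then queue the job. Keeping the request construction separate from the API call makes it easy to inspect or log what will be submitted.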
