
I am trying to create a Glue job with this configuration: 'NumberOfWorkers': 2, 'WorkerType': 'G.1X'. Here's my code for job creation.

job_name = "glue_job"
submit_glue_job = AwsGlueJobOperator(
    task_id="glue_job",
    job_name=job_name,
    wait_for_completion=True,
    # num_of_dpus=10,
    retry_limit=0,
    script_location=f"s3://bucket/etl.py",
    s3_bucket=GLUE_EXAMPLE_S3_BUCKET,
    iam_role_name=GLUE_CRAWLER_ROLE.split("/")[-1],
    create_job_kwargs={
        'GlueVersion': '3.0', 'NumberOfWorkers': 2, 'WorkerType': 'G.1X',
        "DefaultArguments": {"--enable-glue-datacatalog": ''}
    },
)

and here’s the error:

when calling the CreateJob operation: 
Please do not set Allocated Capacity if using Worker Type and Number of Workers

I checked the official documentation to see whether Allocated Capacity is assigned any default value, but it's not.
Here's the source code link for the operator:
https://github.com/apache/airflow/blob/providers-amazon/3.2.0/airflow/providers/amazon/aws/operators/glue.py

2 Answers


  1. The documentation is a bit confusing, but I found a solution by reading the Airflow documentation here.

    The argument to pay attention to is:

    num_of_dpus (int | None) – Number of AWS Glue DPUs to allocate to this Job.

    This argument corresponds to MaxCapacity in the Glue documentation. You should uncomment num_of_dpus and remove both the NumberOfWorkers and WorkerType args, because they are causing the error. I was not able to find in the documentation why the error appears, but my best guess is that num_of_dpus sets MaxCapacity whether or not you set the argument yourself.
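
    If that guess is right, the request the hook ends up building would look roughly like the boto3 call below. This is only a sketch of the conflicting parameters, not the operator's actual internals; the role name, script path, and capacity value are placeholders:

        import boto3

        glue = boto3.client("glue")

        # Hypothetical reproduction: a capacity value filled in from num_of_dpus
        # lands in the same CreateJob request as the WorkerType/NumberOfWorkers
        # passed through create_job_kwargs.
        glue.create_job(
            Name="glue_job",
            Role="my-glue-role",  # placeholder IAM role
            Command={"Name": "glueetl", "ScriptLocation": "s3://bucket/etl.py"},
            GlueVersion="3.0",
            MaxCapacity=6,        # capacity presumably filled in from num_of_dpus
            WorkerType="G.1X",    # from create_job_kwargs
            NumberOfWorkers=2,    # from create_job_kwargs
        )
        # Glue rejects this combination with:
        # "Please do not set Allocated Capacity if using Worker Type and Number of Workers"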

    I discovered through trial and error that if you do not set num_of_dpus, MaxCapacity is automatically set to 6 and the worker type to G.1X. My working solution is below:

        submit_glue_job = AwsGlueJobOperator(
            task_id='submit_glue_job',
            job_name=glue_job_name,
            script_location=f's3://{bucket}/etl_script.py',
            iam_role_name=role_name,
            s3_bucket=bucket,
            retry_limit=0,
            num_of_dpus=10,
            create_job_kwargs={
                'GlueVersion': '3.0',
                'DefaultArguments': {
                    '--job-language': 'python',
                    '--enable-metrics': '',
                    '--enable-glue-datacatalog': '',
                    '--enable-auto-scaling': 'true',
                },
            },
        )
    

    While this solution works for the G.1X worker type, I do not think it would work for G.2X, because I have not yet found a way to set the worker type in Airflow.

  2. Which package version are you using for apache-airflow-providers-amazon?

    I have the same problem with version 2.5.0.
    It seems like upgrading the package version is the only option here to set the worker type and number of workers.

    The stable version does support worker type configuration.
    See the example here: https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/operators/glue.html#submit-an-aws-glue-job
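
    For what it's worth, on a recent provider version (where the operator is named GlueJobOperator) the worker settings can go through create_job_kwargs with num_of_dpus left unset. The sketch below follows the linked docs example; the bucket, role, and script names are placeholders:

        from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

        # Rough sketch for a newer apache-airflow-providers-amazon release;
        # bucket, role, and script names are placeholders.
        submit_glue_job = GlueJobOperator(
            task_id="submit_glue_job",
            job_name="glue_job",
            script_location="s3://my-bucket/etl_script.py",
            s3_bucket="my-bucket",
            iam_role_name="my-glue-role",
            create_job_kwargs={
                "GlueVersion": "3.0",
                "NumberOfWorkers": 2,
                "WorkerType": "G.1X",
            },
        )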
