
I am struggling to get a batch process to run on a GPU with AWS Batch.
This is my setup:

Compute environment:
  - Type: Managed
  - Provisioning model: EC2
  - Instance type: g4dn.xlarge
  - Status: Valid
  - State: Enabled
  - Min vCPUs: -
  - Desired vCPUs: -
  - Max vCPUs: 256

Job queue:
  - state: Enabled
  - status: Valid
  - priority: 100

Job Definition:
  - status: Active
  - Type: Container
  - Image: (the image that I have in ECR)
  - vCPUs: 3
  - Memory: 10240 MiB
  - Number of nodes: -

When I submit a job, it always stays in RUNNABLE.
It is not clear to me where I should look. Is it permissions related? I have tried different permissions with no luck.

How can I debug this?
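
For reference, the job definition boils down to roughly the following CLI call (the job definition name and the image URI are placeholders for my actual values):

    aws batch register-job-definition \
        --job-definition-name my-gpu-job \
        --type container \
        --container-properties '{
            "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>",
            "vcpus": 3,
            "memory": 10240
        }'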

2 Answers


  1. Chosen as BEST ANSWER

    Searching the internet, there are multiple reasons why this might happen, and the documentation is not clear about where to look.

    You should look at the EC2 Auto Scaling groups. There is an Auto Scaling group named after the compute environment, and all of the errors from starting EC2 instances are recorded in that group's scaling activity.

    In my case, it was that I did not have permission to spin up a GPU instance.
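
    If you prefer the CLI over the console, you can pull the same error messages with something like the following (the compute environment name is a placeholder):

    # Find the Auto Scaling group that AWS Batch created for the compute environment
    aws autoscaling describe-auto-scaling-groups \
        --query "AutoScalingGroups[?contains(AutoScalingGroupName, 'my-gpu-compute-env')].AutoScalingGroupName" \
        --output text

    # Failed launch attempts show up in that group's scaling activities
    aws autoscaling describe-scaling-activities \
        --auto-scaling-group-name "<name-from-previous-command>" \
        --query "Activities[].[StatusCode,StatusMessage]" \
        --output table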


  2. I faced the same issue with almost the same configuration as yours. Specifying an ECS GPU-optimised AMI inside the compute environment block resolved it for me.

    To find the ECS GPU-optimised AMI for your region, you can use the AWS CLI command below (the field is named imageId; see the official docs for details).

    aws ssm get-parameters-by-path \
        --path /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended \
        --region us-east-2 --output json
    

    Replace the region in the above command with your own, and if your profile is not the default one, add --profile your_profile_name.
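
    The image ID returned by that command then goes into the compute resources of the compute environment. A rough sketch with create-compute-environment (the names, role ARNs, subnet and security group IDs below are placeholders):

    aws batch create-compute-environment \
        --compute-environment-name my-gpu-compute-env \
        --type MANAGED \
        --service-role "arn:aws:iam::<account-id>:role/AWSBatchServiceRole" \
        --compute-resources '{
            "type": "EC2",
            "minvCpus": 0,
            "maxvCpus": 256,
            "instanceTypes": ["g4dn.xlarge"],
            "imageId": "<ami-id-from-the-ssm-command>",
            "subnets": ["subnet-xxxxxxxx"],
            "securityGroupIds": ["sg-xxxxxxxx"],
            "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole"
        }'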
