I am struggling to get a job to run on a GPU instance with AWS Batch.
I have set up the following:
Compute environment:
- Type: Managed
- Prov. model: EC2
- Instance type: g4dn.xlarge
- Status: Valid
- State: Enabled
- Min CPU: -
- Desired CPU: -
- Max CPU: 256
Job queue:
- state: Enabled
- status: Valid
- priority: 100
Job Definition:
- status: Active
- Type: Container
- Image: (the image that I have in ECR)
- CPU: 3
- Memory: 10240
- N of nodes: -
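
For reference, the job definition above is roughly what you would get from a CLI registration like this (the job definition name and image URI are placeholders for my actual values):

```
aws batch register-job-definition \
    --job-definition-name my-gpu-job \
    --type container \
    --container-properties '{
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-image:latest",
        "vcpus": 3,
        "memory": 10240
    }'
```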
When I submit a job, it always stays in the RUNNABLE state.
It is not clear to me where I should look. Is it permissions related? I have tried different permissions with no luck.
How can I debug this?
2 Answers
Searching the internet, I found that there are multiple reasons why this might happen, and the documentation is not clear about where to look.
You should look at the EC2 Auto Scaling groups. There is an Auto Scaling group named after the compute environment, and all of the errors from launching EC2 instances are recorded in that group's activity history.
In my case, it was that I did not have permission to spin up a GPU instance.
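
For example, with the AWS CLI you can pull the activity history of that group (the group name below is a placeholder; list the groups first to find the one matching your compute environment):

```
# List the Auto Scaling groups; the one created by Batch is named after the compute environment
aws autoscaling describe-auto-scaling-groups \
    --query "AutoScalingGroups[].AutoScalingGroupName"

# Show the scaling activities; instance launch errors appear in the StatusMessage field
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "<your-compute-environment-asg>"
```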
I faced the same issue with almost the same configuration as yours. Specifying an ECS GPU-optimised AMI inside the compute environment block resolved the issue for me.
To find the ECS GPU-optimised AMI for your region, you can use the AWS CLI command below (the corresponding compute environment field is named imageId; you can refer here for the official docs).
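Something along these lines, using the SSM parameter that AWS publishes for the Amazon Linux 2 GPU-optimised ECS AMI (the region here is just an example):

```
# Returns the recommended GPU-optimised ECS AMI ID for the given region
aws ssm get-parameters \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id \
    --region eu-west-1 \
    --query "Parameters[0].Value" \
    --output text
```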
Replace the region in the above command with your own, and if your profile is not the default one, add `--profile your_profile_name`.
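
Once you have the AMI ID, it goes into the imageId field of the compute environment's compute resources. A minimal sketch of creating such a compute environment with the CLI, with placeholder AMI ID, subnets, security group and roles (adjust everything to your own setup):

```
aws batch create-compute-environment \
    --compute-environment-name gpu-compute-env \
    --type MANAGED \
    --compute-resources '{
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["g4dn.xlarge"],
        "imageId": "ami-0123456789abcdef0",
        "subnets": ["subnet-xxxxxxxx"],
        "securityGroupIds": ["sg-xxxxxxxx"],
        "instanceRole": "ecsInstanceRole"
    }' \
    --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole
```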