How can I ensure I get a particular instance type in AWS Batch whilst making sure my container uses as much of that instance's resources as possible?
Naturally, if I’m paying for a particular instance type, I would like to ensure that I’m using as much of its resource as possible.
When allowing AWS Batch to choose instance types using its "optimal" setting, I end up in the unfortunate situation where it selects instances that far exceed what's defined for the job/container.
For example, if I define that the container should have 32 vCPUs and 128 GiB, Batch gives me an m6a.12xlarge instance with 48 vCPUs and 192 GiB. Thus I'm paying for 16 extra vCPUs and 64 GiB which I'm not using.
If I limit the compute environment to 32 vCPUs and 128 GiB and define the container to use all of this, the job seems to get stuck in the RUNNABLE state.
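For reference, the job definition side of that setup looks roughly like the boto3 sketch below (the job name and image are placeholders rather than my exact config; the MEMORY value is in MiB):

```python
import boto3

batch = boto3.client("batch")

# Job definition asking for the whole instance: 32 vCPUs and 128 GiB.
batch.register_job_definition(
    jobDefinitionName="my-job",          # placeholder name
    type="container",
    containerProperties={
        "image": "my-image:latest",      # placeholder image
        "resourceRequirements": [
            {"type": "VCPU", "value": "32"},
            {"type": "MEMORY", "value": str(128 * 1024)},  # 131072 MiB
        ],
    },
)
```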
Now, I guess this is because the host instance needs some resources reserved for itself rather than handing everything to the container. Is that correct?
I have done a load of internet searching on this and the information is really lacking.
I have a feeling it’s only memory and not cores that needs to be reserved (although I could be incorrect).
I've tried defining slightly lower memory (124 GiB) for the container (I would have assumed 4 GiB was more than enough to run the Docker daemon plus any other system processes in the underlying image). This seemed to start out trying to use an m6a.8xlarge before Batch realised it wouldn't work and switched to an m6a.12xlarge before starting the job.
2 Answers
I've had to work this out through trial and error, trying loads of different limits on the container resources.
In my specific case the jobs would actually start if I limited the instance types to m6a.8xlarge and gave the container 32 vCPUs (all of the instance) and 123 GiB of memory, i.e. leaving the host 5 GiB (I didn't bother going into fractions of a GiB).
I don't know how general this is and whether it will apply to others attempting to use m6a.8xlarge.
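For anyone wanting to reproduce this, the combination that worked for me looks roughly like the boto3 sketch below (the names, subnet, security group and instance role are placeholders, not my real values). The key points are pinning instanceTypes to a single type rather than "optimal", and requesting slightly less memory than the instance physically has:

```python
import boto3

batch = boto3.client("batch")

# Compute environment pinned to a single instance type instead of "optimal".
batch.create_compute_environment(
    computeEnvironmentName="m6a-8xlarge-only",          # placeholder name
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "instanceTypes": ["m6a.8xlarge"],
        "minvCpus": 0,
        "maxvCpus": 32,
        "subnets": ["subnet-0123456789abcdef0"],        # placeholder
        "securityGroupIds": ["sg-0123456789abcdef0"],   # placeholder
        "instanceRole": "ecsInstanceRole",              # placeholder
    },
)

# Job definition: all 32 vCPUs, but only 123 GiB of the instance's 128 GiB,
# leaving ~5 GiB for the ECS agent, Docker daemon and the OS. MEMORY is in MiB.
batch.register_job_definition(
    jobDefinitionName="my-job",                         # placeholder name
    type="container",
    containerProperties={
        "image": "my-image:latest",                     # placeholder image
        "resourceRequirements": [
            {"type": "VCPU", "value": "32"},
            {"type": "MEMORY", "value": str(123 * 1024)},  # 125952 MiB
        ],
    },
)
```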
From the AWS Batch docs on memory management:
Batch treats memory as a hard limit resource, so you need to allow for that bit of memory on each instance type in your job definitions.
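One way to see what that limit actually is on a given instance type is to look at what the ECS agent registers for the cluster behind the compute environment; the job's MEMORY requirement has to fit inside the registered value, which is lower than the instance's advertised memory. A rough boto3 sketch (the cluster name below is a placeholder):

```python
import boto3

ecs = boto3.client("ecs")

# The ECS cluster that backs the Batch compute environment (placeholder name).
cluster = "AWSBatch-my-ce-0123456789abcdef0"

arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
if arns:
    resp = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for ci in resp["containerInstances"]:
        registered = {r["name"]: r.get("integerValue") for r in ci["registeredResources"]}
        # MEMORY is the MiB the agent actually offers to jobs on this instance.
        print(ci["ec2InstanceId"],
              "CPU units:", registered.get("CPU"),
              "MEMORY MiB:", registered.get("MEMORY"))
```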