How can I trigger N executions of the same task with different parameters, so as to systematically sweep a 3D parameter space?
I have AWS ECS Fargate tasks defined. They accept multiple arguments on the command line or, when run as Docker images, as run arguments to the entrypoint. So the following will train and test my models for my ensemble setup "411":
docker run -it my_image -s train -a test -e 411
Now on ECS, the args are part of the task definition:
-s,train,-a,test,-e,411
Now I want to sweep over many variations of -e 411, and 4-5 other dimensions. For scalability reasons, I don’t want to sweep over the parameter space within my application (where, if I were to do it, it would really only be a set of nested for loops). The tasks all have the same (a) code, (b) Docker image, and (c) task settings, including CPU and RAM requirements; (d) they are completely independent of one another, (e) except for those 4-5 parameters which I want to vary/explore.
What’s the best way to do this, keeping the parameters neat in one place, having a very simple way to start these tasks, and keeping it easy to adapt for future changes?
- Define ~100 tasks, and write the parameters into those task definitions? Then I can start them by just calling the task, but maintaining the tasks is a nightmare (many tasks with many parameters to track and update).
- Start ~100 tasks with "overrides", each from the command line, using a separate for loop to kick all the tasks off? If so, where does that "starter batch job" run: on the local machine via the AWS CLI, or in yet another ECS container just to start the productive "task" containers?
- Define a "service", and hard-code the ~100 parameter variations into that service, so that the service starts the tasks? At least the parameters are all in one place, but the starting/scheduling becomes quite complex.
- Use the AWS Batch? How would I control the parameter space in there? I only see that I can replace one placeholder by one parameter, but not an option to create tasks for a loop over placeholders.
2 Answers
I would pick this approach:
You would use ECS Container Overrides to change the parameter for each ECS task.
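A rough sketch of such a script in Python with boto3, assuming the sweep is a plain Cartesian product of the flag values. The ensemble values other than 411, and the cluster, task definition, container name, and subnet arguments, are placeholders I made up for illustration; adjust them to your setup.

```python
import itertools

# Sweep axes kept in one place. Only -s, -a and -e appear in the question;
# the values other than "411" are hypothetical.
STAGES = ["train"]
ACTIONS = ["test"]
ENSEMBLES = ["411", "412", "413"]


def build_commands(stages, actions, ensembles):
    """Return one container command (list of args) per point in the grid."""
    return [
        ["-s", s, "-a", a, "-e", e]
        for s, a, e in itertools.product(stages, actions, ensembles)
    ]


def launch_all(commands, cluster, task_definition, container_name, subnets):
    """Start one Fargate task per command, overriding only the command."""
    import boto3  # imported here so the grid logic runs without AWS access

    ecs = boto3.client("ecs")
    for command in commands:
        ecs.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            launchType="FARGATE",
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": subnets,
                    # Assumption: tasks need a public IP to pull the image.
                    "assignPublicIp": "ENABLED",
                }
            },
            overrides={
                "containerOverrides": [
                    {"name": container_name, "command": command}
                ]
            },
        )


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    launch_all(
        build_commands(STAGES, ACTIONS, ENSEMBLES),
        cluster="my-cluster",
        task_definition="my-task",
        container_name="my-container",
        subnets=["subnet-00000000"],
    )
```

The task definition stays unchanged; each run only replaces the container command, so the parameter grid lives in the three lists at the top.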
Where you run this script to kick off your ECS tasks really doesn’t matter. A script that wraps the AWS CLI command, run from your local machine, would be just fine. If you need to run it from the cloud in some automated manner, I would look into doing it from an AWS Lambda function, using the AWS SDK instead of the CLI. I see you tagged the question with python, so I would do this with a Python AWS Lambda function and boto3.
If you are able and willing to change the code slightly, then another way to do this would be to use AWS Batch with an array job. It is similar to writing an ECS task definition, but in this case it is an AWS Batch job definition (use the single-node definition, not multi-node, which is meant for jobs with cross-instance communication needs such as MPI or large ML model training).
Batch array jobs inject an environment variable, AWS_BATCH_JOB_ARRAY_INDEX, into the container, which you can leverage to change the value of the -e parameter in your code.
The nice thing about this is that it is a single API request to Batch, and the managed service takes care of spawning the child tasks on ECS, respecting API limits, etc.
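One way to map the array index back to a point in the parameter space is sketched below, assuming the grid is a Cartesian product that both the submitting script and the container can import. The -x flag and its values, and the job name, queue, and job definition arguments, are hypothetical; only -e and AWS_BATCH_JOB_ARRAY_INDEX come from the discussion above.

```python
import itertools
import math
import os

# Parameter grid shared by the submitter and the container code.
# "-x" and all values except "411" are hypothetical placeholders.
GRID = {
    "-e": ["411", "412", "413"],
    "-x": ["a", "b"],
}


def params_for_index(index, grid):
    """Return the flag -> value mapping for one array index."""
    combos = list(itertools.product(*grid.values()))
    return dict(zip(grid.keys(), combos[index]))


def current_params(grid, environ=os.environ):
    """Inside the container: read the index that Batch injected."""
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return params_for_index(index, grid)


def submit_sweep(grid, job_name, job_queue, job_definition):
    """Submit one array job with one child per grid point."""
    import boto3  # imported here so the mapping logic runs without AWS access

    size = math.prod(len(values) for values in grid.values())
    batch = boto3.client("batch")
    # Batch requires an array size of at least 2.
    return batch.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,
        arrayProperties={"size": size},
    )
```

With this layout the container calls current_params(GRID) at startup instead of parsing -e from the command line, and the submitter makes a single submit_job call for the whole sweep.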