I would like to run my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume the earlier job to proceed further?
Question posted in Amazon Web Services
The official Amazon Web Services documentation can be found here.
2 Answers
According to the documentation here, the maximum allowed runtime is 28 days, not 5. Please check your configuration.
Update: you are right. According to the documentation here, the maximum runtime for a training job is 5 days. There are several things you can do: use a more powerful GPU (or multiple GPUs) to reduce training time, or save a checkpoint and restart training from there. In any case, 30 days is a very long training time (with the associated cost). Are you sure you need that?
You could in principle ask for a service quota increase from here, but as you can see, "Longest run time for a training job" is not adjustable. So I don't think you have any choice other than using checkpoints or more powerful GPUs.
Follow these steps:
1. Get the limit "Longest run time for a training job" raised to 2419200 seconds (28 days). (This can't be adjusted using Service Quotas in the AWS web console.)
2. Set max_run=2419200.
Also, the questions in @rok's answer are very relevant to consider.
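The checkpoint-and-restart approach suggested in the first answer can be sketched roughly as follows. This is a minimal illustration, not SageMaker's own API: the function names, the state.json file name, and the use of epochs as the unit of work are all hypothetical. It assumes the training script can serialize its state to a directory that survives between jobs; on SageMaker that would typically be /opt/ml/checkpoints, synced to S3 when checkpoint_s3_uri is set on the estimator.

```python
import json
import os
import tempfile

# 28 days expressed in seconds, the max_run value quoted in the answer above.
MAX_RUN_SECONDS = 28 * 24 * 60 * 60  # 2419200


def load_checkpoint(ckpt_dir):
    """Return the saved training state, or a fresh state if no checkpoint exists."""
    path = os.path.join(ckpt_dir, "state.json")  # hypothetical file name
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "weights": 0.0}


def save_checkpoint(ckpt_dir, state):
    """Write the state via a temp file and rename, so a killed job never leaves a partial file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(ckpt_dir, "state.json"))


def train(ckpt_dir, total_epochs, max_epochs_this_job):
    """Run at most max_epochs_this_job epochs, resuming from the last checkpoint."""
    state = load_checkpoint(ckpt_dir)
    end = min(total_epochs, state["epoch"] + max_epochs_this_job)
    while state["epoch"] < end:
        state["weights"] += 1.0  # stand-in for one real training epoch
        state["epoch"] += 1
        save_checkpoint(ckpt_dir, state)
    return state
```

Each training job would call train() with the same checkpoint directory. A job that hits its runtime limit is simply followed by a new job, which picks up at the last saved epoch instead of starting over.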