
I would like to train my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume the earlier job and continue training?

2 Answers


  1. I first wrote that, according to the documentation, the maximum allowed runtime is 28 days, not 5, and asked you to check your configuration. You are right: according to the documentation, the maximum runtime for a training job is 5 days. There are several things you can do: use more powerful (or multiple) GPUs to reduce training time, or save checkpoints and restart training from the latest one. In any case, 30 days is a very long training time (with the associated cost). Are you sure you need that?

    Actually, you could request a service quota increase, but as you can see, "Longest run time for a training job" is not adjustable. So I don't think you have any choice other than using checkpoints or more powerful GPUs.
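To make the checkpoint option concrete, here is a rough sketch of how a 30-day run could be split into consecutive jobs under the 5-day limit. It assumes each job is launched with the same checkpoint location so the next one can pick up where the previous one stopped. The helper names `jobs_needed` and `estimator_kwargs` and the bucket name are hypothetical; `max_run`, `checkpoint_s3_uri` and `checkpoint_local_path` are parameters of the SageMaker Python SDK `Estimator`.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
MAX_RUN_SECONDS = 5 * SECONDS_PER_DAY  # the 5-day limit discussed above


def jobs_needed(total_days, max_run_seconds=MAX_RUN_SECONDS):
    """Number of consecutive training jobs needed to cover total_days."""
    total_seconds = total_days * SECONDS_PER_DAY
    return math.ceil(total_seconds / max_run_seconds)


def estimator_kwargs(checkpoint_s3_uri, max_run_seconds=MAX_RUN_SECONDS):
    """Keyword arguments to reuse for every job in the chain (hypothetical helper).

    Pass them to sagemaker.estimator.Estimator(...) together with your own
    image_uri, role, instance_count and instance_type.
    """
    return {
        "max_run": max_run_seconds,
        # Same S3 location for every job, so each one sees earlier checkpoints.
        "checkpoint_s3_uri": checkpoint_s3_uri,
        # Default local directory that SageMaker syncs with checkpoint_s3_uri.
        "checkpoint_local_path": "/opt/ml/checkpoints",
    }


if __name__ == "__main__":
    print(jobs_needed(30))
    # "example-bucket" is a placeholder, not a real bucket.
    print(estimator_kwargs("s3://example-bucket/checkpoints/long-run"))
```

This only works if the training script itself loads the latest checkpoint on startup; otherwise each job starts from scratch.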

  2. Follow these steps:

    1. Open a support ticket to increase "Longest run time for a training job" to 2419200 seconds (28 days). This can't be adjusted using Service Quotas in the AWS web console.
    2. Using the SageMaker Python SDK, when creating an Estimator, set max_run=2419200.
    3. Implement resuming from checkpoints in your training script.
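Step 3 could look roughly like the sketch below. It is framework-agnostic and uses only the standard library: the "model state" is a plain dict saved as JSON, standing in for whatever your framework saves (for example a PyTorch state dict). The function names and the `epoch-N.json` file naming are assumptions for illustration. In a real SageMaker job you would point `checkpoint_dir` at the estimator's `checkpoint_local_path` (default `/opt/ml/checkpoints`), which SageMaker syncs with `checkpoint_s3_uri`.

```python
import json
import os
import tempfile


def latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) of the newest checkpoint, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    found = []
    for name in os.listdir(checkpoint_dir):
        if name.startswith("epoch-") and name.endswith(".json"):
            epoch_str = name[len("epoch-"):-len(".json")]
            if epoch_str.isdigit():
                found.append((int(epoch_str), os.path.join(checkpoint_dir, name)))
    return max(found) if found else None


def save_checkpoint(checkpoint_dir, epoch, state):
    """Write the checkpoint to a temp file, then rename, to avoid partial files."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, f"epoch-{epoch}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)


def train(checkpoint_dir, total_epochs):
    """Train up to total_epochs, resuming from the newest checkpoint if present."""
    state = {"epoch": 0, "weights": 0.0}
    latest = latest_checkpoint(checkpoint_dir)
    if latest is not None:
        with open(latest[1]) as f:
            state = json.load(f)
    for epoch in range(state["epoch"] + 1, total_epochs + 1):
        # Stand-in for one real epoch of training.
        state = {"epoch": epoch, "weights": state["weights"] + 1.0}
        save_checkpoint(checkpoint_dir, epoch, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        train(ckpt_dir, total_epochs=3)         # first job stops after epoch 3
        print(train(ckpt_dir, total_epochs=5))  # second job resumes at epoch 4
```

Saving at the end of every epoch is just the simplest choice here; for very long epochs you may prefer to checkpoint every N steps so that less work is lost when a job hits `max_run`.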

    Also, the questions in @rok’s answer are very relevant to consider.
