
I am trying to run a study using the optimize function with the default sampler and the median pruner.
Every run crashes, sometimes after one successful trial and sometimes without completing any.
The crash message is: Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

Expected behavior

The study runs without crashing.

Environment

  • Optuna version: 2.0.0
  • Python version: 3.8
  • OS: Qubes OS with a Debian 10 VM
  • (Optional) Other libraries and their versions: PyTorch 1.5.0+cpu

Error messages, stack traces, or logs

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

What causes such an error?

2 Answers


  1. One likely situation is that your process consumes a lot of memory and gets killed by the operating system’s OOM killer. You can monitor the memory consumption of your process using a tool like top and see if it uses a lot of memory.

    You can also run dmesg in the console and look for messages from the OOM killer in the output. The OOM killer will usually print there which process it killed. Check whether the process ID is the one of your process.

    If the process is indeed killed by the OOM killer, the only remedy is probably to reduce the memory consumption of the program (or get a bigger machine).

  2. You can play it safe and pass this to study.optimize:

    gc_after_trial=True

    There is also study.optimize(timeout=300), but I didn't manage to get that to work.
    There are also timeout and n_trials limits you can set if you are using TPESampler with HyperbandPruner.
    Lastly, what worked for me was lowering n_jobs (I had set it to -1, which tended to spike all my cores to 100% and finally just crash).
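
The suggestions in both answers can be combined into one rough sketch: log the process's peak memory after every trial, and run the study with conservative settings. This is only an illustration under some assumptions. The names peak_rss_mib, log_memory, OPTIMIZE_KWARGS and run_study are made up for this example, the numeric limits are placeholders, and the resource module is only available on Unix-like systems (which should cover a Debian VM).

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(study, trial):
    """Callback using the (study, trial) signature that study.optimize calls."""
    print(f"trial {trial.number}: peak RSS {peak_rss_mib():.1f} MiB")


# Conservative settings; the numbers are placeholders, not recommendations.
OPTIMIZE_KWARGS = dict(
    n_trials=20,          # stop after a fixed number of trials
    timeout=300,          # or after 300 seconds, whichever comes first
    n_jobs=1,             # no parallel trials, so fewer models in memory at once
    gc_after_trial=True,  # run garbage collection after each trial
    callbacks=[log_memory],
)


def run_study(objective):
    """Run a study with the settings above; `objective` is your own function."""
    import optuna  # imported here so the helpers above work without Optuna

    study = optuna.create_study(pruner=optuna.pruners.MedianPruner())
    study.optimize(objective, **OPTIMIZE_KWARGS)
    return study
```

If the logged peak keeps climbing from trial to trial until the process is killed, that points towards the OOM killer explanation from the first answer, and dmesg should confirm it.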
