I am trying to run a study using the optimize function with the default sampler and the median pruner.
Every run crashes, sometimes after one successful trial and sometimes without completing any.
The crash message is: Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
Expected behavior
running a study
Environment
- Optuna version: 2.0.0
- Python version: 3.8
- OS: Qubes OS with Debian 10 VM
- (Optional) Other libraries and their versions: PyTorch 1.5.0+cpu
Error messages, stack traces, or logs
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
What causes such an error?
2 Answers
One likely situation is that your process consumes a lot of memory and gets killed by the operating system's OOM killer. You can monitor the memory consumption of your process with a tool like top and see whether it uses a lot of memory.
You can also run dmesg in the console and look for messages from the OOM killer in the output. The OOM killer will usually print which process it killed; check whether that process ID is the one of your process.
If the process is indeed being killed by the OOM killer, the only remedy is probably to reduce the memory consumption of the program (or get a bigger machine).
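If watching top by hand is awkward, the process can report its own peak memory. The following is a minimal sketch using only Python's standard resource module (Unix only). The helper names peak_rss_mib and log_memory are made up for illustration, and the list allocation is just a stand-in for a memory-hungry trial.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(label: str) -> float:
    """Print and return the peak memory use so far, tagged with a label."""
    mib = peak_rss_mib()
    print(f"[{label}] peak RSS so far: {mib:.1f} MiB")
    return mib


before = log_memory("start")
data = [0] * 5_000_000  # stand-in for a memory-hungry trial
after = log_memory("after allocation")
```

Calling something like log_memory at the start and end of each trial should show whether memory keeps climbing from trial to trial, which would be consistent with the OOM killer explanation.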
you can play safe and use
There is study.optimize(timeout=300), but I didn't manage to get this to work.
There are also timeout and n_trials limits you can set if you are using TPESampler with HyperbandPruner.
Lastly, what worked for me was lowering n_jobs (I had set it to -1, which tends to spike all my cores to 100% and eventually crash).