
I’m running Ray on EC2. My workers run on c5.large instances, which have ~4 GB of RAM.

When I run many jobs, I see these error messages:

  File "python/ray/_raylet.pyx", line 631, in ray._raylet.execute_task
  File "/home/ubuntu/project/env/lib/python3.6/site-packages/ray/memory_monitor.py", line 126, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-43-111 is used (3.47 / 3.65 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
21183   0.21GiB ray::IDLE
21185   0.21GiB ray::IDLE
21222   0.21GiB ray::IDLE
21260   0.21GiB ray::IDLE
21149   0.21GiB ray::IDLE
21298   0.21GiB ray::IDLE
21130   0.21GiB ray::IDLE
21148   0.21GiB ray::IDLE
21225   0.21GiB ray::IDLE
21257   0.21GiB ray::IDLE

In addition, up to 0.0 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.

I am running my ray task with memory = 2000*1024*1024 and max_calls=1, so there should never be more than 2 processes on the box at the same time.

What are these ray::IDLE processes and how can I stop my workers from going OOM?

Using ray 0.8.1

3 Answers


  1. ray::IDLE processes are idle workers that Ray keeps in its worker pool so that it can avoid process startup time when a new task arrives. Each of them takes around 0.21 GiB of memory because even an idle worker needs some memory (for example, it has to run a Python interpreter).

    You can probably mitigate the problem in two ways.
    1. Set the num_cpus argument of ray.init to a lower value (like 2~3) so that only 2~3 worker processes are available.
    2. Take system memory into account. As you can see, Ray uses memory not just for tasks but also for its system components, such as the raylet and idle processes. If your machine has 4GB of memory and two tasks that each use 2GB are scheduled on it, you will hit an OOM problem because the extra processes consume extra memory.

    To avoid memory issues, you can either scale up your cluster (use a bigger machine or multiple machines), or reduce the memory usage of your task.
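    The arithmetic behind point 2 can be sketched with the numbers from the question. This is only a rough budget, not how Ray itself accounts for memory. It assumes the figures in the error message are all in GiB, that each task really uses its full 2000 MiB reservation, and that the 95% threshold applies to the node total. The function name `node_budget_gib` is made up for this illustration.

```python
GIB = 1024 ** 3

def node_budget_gib(total_gib, n_tasks, task_bytes, n_idle, idle_gib):
    """Return (requested, limit) in GiB for one node.

    requested: memory needed by n_tasks tasks plus n_idle idle workers
    limit:     95% of the node total, the threshold quoted in the error
    """
    requested = n_tasks * task_bytes / GIB + n_idle * idle_gib
    limit = 0.95 * total_gib
    return requested, limit

# Numbers taken from the question (assumed to be GiB throughout).
task_bytes = 2000 * 1024 * 1024   # the memory= value given to the task
total_gib = 3.65                  # node total from the error message
idle_gib = 0.21                   # per ray::IDLE process

tasks_only, limit = node_budget_gib(total_gib, 2, task_bytes, 0, idle_gib)
with_idle, _ = node_budget_gib(total_gib, 2, task_bytes, 10, idle_gib)

print(f"limit:             {limit:.2f} GiB")
print(f"2 tasks alone:     {tasks_only:.2f} GiB")
print(f"2 tasks + 10 idle: {with_idle:.2f} GiB")
```

    Under these assumptions, two full-size tasks already exceed the threshold on their own, before any idle workers or system processes are counted, which is why a bigger machine or a smaller per-task footprint is suggested.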

  2. Try ray.init(local_mode=True) to run Ray in a single process; it solved my low-memory issue.

  3. You can limit the port numbers that workers are allowed to use. For example,
    ray start --min-worker-port 10010 --max-worker-port 10011
    would only allow two workers. Note that (as of Ray 1.12) num-cpus does not limit the number of ray::IDLE workers.
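    If you want to derive the port range from a desired worker count, a small helper along these lines might do. It is a sketch that relies on the claim above, namely that each worker needs its own port and that the range is inclusive. The helper name `worker_port_flags` and the default starting port are hypothetical choices for this example. Other flags that `ray start` needs (such as `--head` or `--address`) are left out, as in the command above.

```python
def worker_port_flags(max_workers, first_port=10010):
    """Build the --min-worker-port / --max-worker-port arguments.

    Assumes one port per worker and an inclusive port range, so
    max_workers workers need ports first_port .. first_port + max_workers - 1.
    """
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    last_port = first_port + max_workers - 1
    return [
        "--min-worker-port", str(first_port),
        "--max-worker-port", str(last_port),
    ]

# Reproduce the two-worker command from the answer.
cmd = ["ray", "start"] + worker_port_flags(2)
print(" ".join(cmd))
# prints: ray start --min-worker-port 10010 --max-worker-port 10011
```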
