
I have two Windows servers (192.168.1.11 and 192.168.1.12) and am trying to run a Ray Docker container (image tag = 2.35.0-py312-gpu) on each server.

Steps

  1. I run these two commands to start the Ray processes, and I confirm that the dashboard at 192.168.1.11:8265 shows the worker node (192.168.1.12).

     # Run this on 192.168.1.11
     $ ray start --head --dashboard-host=0.0.0.0
     # Run this on 192.168.1.12
     $ ray start --address=192.168.1.11:6379 --node-ip-address=192.168.1.12

  2. However, about 30 seconds after I complete Step 1, the status of the worker node becomes DEAD.

  3. gcs_server.out contains the lines below. It seems that the head node fails to reach 192.168.1.12:39091.

[2024-09-13 04:23:52,090 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 4, status 4, response status 0, status message Deadline Exceeded, status details
[2024-09-13 04:23:57,115 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 3, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:00,115 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 2, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:03,116 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 1, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:06,116 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 0, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
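
One way to check this suspicion from the head node is to attempt a TCP connection to the reported address. The sketch below assumes bash and the coreutils `timeout` command are available in the container; `port_open` is a hypothetical helper name, and 39091 is simply the port from the log above, which will differ on each run.

```shell
# Return 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example: test the worker port reported in gcs_server.out.
if port_open 192.168.1.12 39091; then
  echo "reachable"
else
  echo "unreachable"
fi
```

If this prints "unreachable" while the worker is still running, the port is most likely blocked by the firewall or not published by Docker.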

Problem

The problem is that the port number (39091 in 192.168.1.12:39091) changes every time, and I can't find any way to specify it in the configuration docs (https://docs.ray.io/en/latest/ray-core/configure.html). I need to know the port in advance in order to set up Windows Defender Firewall and Docker's -p option.

Is there a good way to solve this problem?

2 Answers


  1. Chosen as BEST ANSWER

    Although this question may be off-topic, I found a solution and am sharing it here. (I will move to DevOps for further questions.)

    I have to pass the --node-manager-port option when starting the worker (192.168.1.12). The health check connects to this port, so fixing it lets me open it in Windows Defender Firewall and publish it with Docker's -p option ahead of time.
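
    As a minimal sketch of that fix, the worker could be started with its ports pinned, and the same ports published from the container. The port numbers (6380, 6381, 10002-10020) and the variable names are illustrative assumptions, not values from my actual setup. --object-manager-port, --min-worker-port, and --max-worker-port are other `ray start` options whose ports are otherwise random or span a wide range, so it may help to pin them as well. The script only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Illustrative port choices (assumptions); pick any free ports.
NODE_MANAGER_PORT=6380     # raylet port, used by the GCS health check
OBJECT_MANAGER_PORT=6381   # object transfer between nodes
MIN_WORKER_PORT=10002      # start of the worker process port range
MAX_WORKER_PORT=10020      # end of the worker process port range

# Command to run inside the worker container on 192.168.1.12.
RAY_CMD="ray start --address=192.168.1.11:6379 --node-ip-address=192.168.1.12 \
--node-manager-port=$NODE_MANAGER_PORT \
--object-manager-port=$OBJECT_MANAGER_PORT \
--min-worker-port=$MIN_WORKER_PORT --max-worker-port=$MAX_WORKER_PORT"

# Matching Docker publish flags for the worker container.
DOCKER_PORTS="-p $NODE_MANAGER_PORT:$NODE_MANAGER_PORT \
-p $OBJECT_MANAGER_PORT:$OBJECT_MANAGER_PORT \
-p $MIN_WORKER_PORT-$MAX_WORKER_PORT:$MIN_WORKER_PORT-$MAX_WORKER_PORT"

echo "$RAY_CMD"
echo "Docker publish flags: $DOCKER_PORTS"
```

    The same ports would also need inbound rules in Windows Defender Firewall. A narrow worker port range limits how many worker processes can run on the node at once, so widen it if tasks fail to start.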


  2. I have been facing a similar issue when trying to run the Ray head in a Docker container with this command:

    ray start --head --block --dashboard-host=0.0.0.0

    One thing that worked for me (I don’t know why yet) was removing the --dashboard-host command-line argument.

    My issue was slightly different, though: the raylet was failing, which caused Ray to exit with error code 1.
