I have two Windows servers (192.168.1.11 and 192.168.1.12) and am trying to run a Ray Docker container (image tag = 2.35.0-py312-gpu) on each server.
Steps
- I run these two commands to start the Ray process. I confirm 192.168.1.11:8265 (the dashboard) shows the worker node (192.168.1.12).
# Run this in 192.168.1.11
$ ray start --head --dashboard-host=0.0.0.0
# Run this in 192.168.1.12
$ ray start --address=192.168.1.11:6379 --node-ip-address=192.168.1.12
- However, about 30 seconds after I complete Step 1, the status of the worker node becomes DEAD.
- I find that gcs_server.out contains the lines below. It seems that the head node fails to access 192.168.1.12:39091.
[2024-09-13 04:23:52,090 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 4, status 4, response status 0, status message Deadline Exceeded, status details
[2024-09-13 04:23:57,115 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 3, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:00,115 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 2, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:03,116 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 1, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:06,116 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 0, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
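One way to confirm from the head node (or from inside its container) that the port named in the log really is unreachable is a plain TCP probe. The following is only a sketch: the helper name port_open is made up for illustration, and it assumes bash (for its built-in /dev/tcp redirection) and the coreutils timeout command are available.

```shell
#!/usr/bin/env bash
# port_open HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: returns 0 if a TCP connection can be opened,
# non-zero if it is refused or does not complete within the timeout.
port_open() {
  local host="$1" port="$2" wait="${3:-3}"
  # bash treats /dev/tcp/HOST/PORT as a request to open a TCP socket.
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage (the port is the one from the log above and changes on every run):
#   port_open 192.168.1.12 39091 && echo reachable || echo unreachable
```

If the probe fails while the worker is still alive, the firewall or Docker port mapping is the likely culprit rather than Ray itself.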
Problem
The problem is that the port number (39091 in 192.168.1.12:39091) changes every time, and I can’t find any way to specify this port in the documentation (https://docs.ray.io/en/latest/ray-core/configure.html). I need to know in advance which port will be used in order to set up Windows Defender Firewall and Docker’s -p option.
Is there a good way to solve this problem?
2 Answers
Although this question may be off-topic, I found a solution and am sharing it here. (I will move to DevOps for further questions.)
I have to use the --node-manager-port option when starting the worker (192.168.1.12). This port will be used for the health check.

I have been facing a similar issue when trying to run the Ray head as a Docker container with the command
ray start --head --block --dashboard-host=0.0.0.0
One thing that worked for me (I don’t know why yet) was removing the --dashboard-host command line argument. My issue was slightly different, though: the raylet was failing, causing Ray to exit with error code 1.
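To connect the --node-manager-port suggestion from the first answer back to the original goal (firewall rules and Docker’s -p option), here is a rough dry-run sketch that pins the worker’s ports and prints a matching docker run command instead of executing it. Only --node-manager-port comes from the answer. The port numbers 6380 and 6381, the additional --object-manager-port pin, and the image repository name rayproject/ray are assumptions for illustration. Other Ray ports (for example the worker port range) may also need pinning depending on the setup.

```shell
#!/usr/bin/env bash
# Dry run: builds and prints the command, does not start anything.
HEAD_ADDR="192.168.1.11:6379"   # head address from the question
WORKER_IP="192.168.1.12"        # worker address from the question
NODE_MANAGER_PORT=6380          # assumed free port; the head's health check targets this
OBJECT_MANAGER_PORT=6381        # assumed free port; pinned for the same reason

# Worker start command with fixed ports instead of random ones.
RAY_CMD="ray start --address=${HEAD_ADDR} --node-ip-address=${WORKER_IP} --node-manager-port=${NODE_MANAGER_PORT} --object-manager-port=${OBJECT_MANAGER_PORT} --block"

# Publish the same ports on the Docker host so the head node can reach them.
# The image repository name is an assumption; the tag is the one from the question.
DOCKER_CMD="docker run -p ${NODE_MANAGER_PORT}:${NODE_MANAGER_PORT} -p ${OBJECT_MANAGER_PORT}:${OBJECT_MANAGER_PORT} rayproject/ray:2.35.0-py312-gpu ${RAY_CMD}"

echo "${DOCKER_CMD}"
```

The same fixed port numbers can then be opened as inbound TCP rules in Windows Defender Firewall on 192.168.1.12.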