
I have two GCP VMs, and I run a Docker container on each of them.

I start each container with:
docker run --gpus all -it --rm --entrypoint /bin/bash -p 8000:8000 -p 7860:7860 -p 29500:29500 lf

I am trying out LLaMA-Factory.

In one container, I run:
FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR= MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml
where MASTER_ADDR is set to the external IP of the VM (omitted above).

In the other container, I run:
FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR= MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml

But I got this error:

[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank1]: Last error:
[rank1]: socketStartConnect: Connect to<49113> failed : Software caused connection abort
E0924 21:26:39.866000 140711615779968 torch/distributed/elastic/multiprocessing/] failed (exitcode: 1) local_rank: 0 (pid: 484) of binary: /usr/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 879, in main
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 870, in run
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/", line 263, in launch_agent
    raise ChildFailedError(
/workspace/LLaMA-Factory/src/llamafactory/ FAILED
Root Cause (first observed failure):
  time      : 2024-09-24_21:26:39
  host      : 71af1f49abe3
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 484)
  error_file: <N/A>
  traceback : To enable traceback see:

It seems that PyTorch is using the Docker container's IP instead of the GCP VM's external IP.

How can I fix this?



  1. Chosen as BEST ANSWER

    I forgot to add --network host when running the Docker container. After adding it, training works: the external IPs are used instead of the Docker IP.

  2. Run ip addr in your Docker container and check the output:

    3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 22:23:6b:28:6b:e0 brd ff:ff:ff:ff:ff:ff
    inet scope global docker0
    inet6 fe80::a402:65ff:fe86:bba6/64 scope link
       valid_lft forever preferred_lft forever

    Use that IP for your Docker container.
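
To illustrate the accepted answer, here is a minimal sketch of the docker run command from the question with --network host added. It assumes the same image name (lf) as in the question. With host networking the container shares the VM's network stack, so the -p mappings are dropped here (Docker discards published ports in host network mode). The script only prints the command so it can be reviewed before running it; the variable names IMAGE and DOCKER_CMD are made up for this example.

```shell
# Dry-run sketch: builds the docker command and prints it instead of running it.
# IMAGE is the image name used in the question; the variable names are illustrative.
IMAGE="lf"

# --network host lets the container use the VM's own network interfaces,
# so the -p 8000:8000 -p 7860:7860 -p 29500:29500 flags are left out.
DOCKER_CMD="docker run --gpus all -it --rm --network host --entrypoint /bin/bash $IMAGE"

# Print the command for review; paste it into the shell on each VM to run it.
echo "$DOCKER_CMD"
```

Start the container this way on both VMs, then run the two llamafactory-cli commands from the question unchanged. Traffic between the VMs on MASTER_PORT (29500) and on the ports NCCL opens for itself presumably still has to be allowed by the firewall.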
