I’m trying to build a fully dockerized deployment of slurm using docker stacks, but jobs don’t complete consistently. Does anyone have any idea why this might be?
Other than this problem, the system works: All the nodes come up, I can submit jobs, and they run. The problem I am having is that some jobs don’t complete properly. Right now it’s running on a single-node swarm.
I can submit a bunch of them with:
salloc -t 1 srun sleep 10
and I can watch them with squeue. Some of them complete after 10 seconds as expected, but most of them keep running until they hit the 1-minute timeout from -t 1.
The system consists of five docker services:
slurm-stack_mysql
slurm-stack_slurmdbd
slurm-stack_slurmctld
slurm-stack_c1
slurm-stack_c2
c1 and c2 are the worker nodes. All five services run the same Docker image (Dockerfile below) and are configured with the docker-compose.yml linked below.
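For orientation, here is a minimal sketch of how those five services are laid out in docker-compose.yml. The image name is a placeholder and nearly all settings are omitted; the real file is in the gist linked below.

version: "3.8"

services:
  mysql:                                  # accounting database
    image: slurm-docker-cluster:latest    # placeholder name; as noted above, all five services run the same image
  slurmdbd:                               # Slurm accounting daemon
    image: slurm-docker-cluster:latest
  slurmctld:                              # Slurm controller
    image: slurm-docker-cluster:latest
  c1:                                     # worker node
    image: slurm-docker-cluster:latest
  c2:                                     # worker node
    image: slurm-docker-cluster:latest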
Here are some things I’ve noticed and tried:
- I based the Dockerfile and docker-compose.yml on a docker-compose-based version (i.e., without stacks or swarm). That version works just fine: jobs complete as usual. So it seems like something in the transition to Docker Stacks is causing trouble. The original is here: https://github.com/giovtorres/slurm-docker-cluster
- I noticed in the logs that slurmdbd was getting "Error connecting slurm stream socket at 10.0.2.6:6817: Connection refused" errors; the failing connections were to an IP address that corresponded to the swarm load balancer. I managed to get rid of these by declaring all the services as global deployments in docker-compose.yml (see the sketch after this list). Other than eliminating the connection failures, it didn't seem to change anything. EDIT: @chris-becke pointed out that I was misusing global, so I've turned it off. No help, and the "connection refused" errors returned.
- When I do host c2, host c1, or host <service> for any of the services in my system from inside one of the containers, I always get back two IP addresses. One of them corresponds to what I see in the containers section of docker network inspect slurm-stack_default. The other is one lower (e.g., 10.0.38.12 and 10.0.38.11). If I run ip addr in one of the containers, the IP address it reports matches what's listed for that host in the output of docker network inspect.
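For reference, the global-deployment experiment from the second bullet amounted to adding a deploy block like the one below to each service in docker-compose.yml. This is a sketch rather than the exact file, and the image name is a placeholder.

version: "3.8"

services:
  slurmctld:
    image: slurm-docker-cluster:latest   # placeholder
    deploy:
      mode: global    # one task per swarm node, instead of the default replicated mode
  # the same deploy block was added to mysql, slurmdbd, c1, and c2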
Configuration Files
Here are all the configuration files for the system:
Dockerfile: https://gist.github.com/stevenjswanson/b819ab3a68cc7d9aea72099263ef10bd
docker-compose.yml: https://gist.github.com/stevenjswanson/4b50e085385a0ffcb0d6ffed9186ed02
slurm.conf: https://gist.github.com/stevenjswanson/d8c48fcd6b19b504fda3a32c34227878
slurmdb.conf: https://gist.github.com/stevenjswanson/84b31b5ae793379f16eff16678f75b47
install_slurm.sh: https://gist.github.com/stevenjswanson/bcd04828dbc69eb25acd48c3d4c8ef31
docker-entrypoint.sh: https://gist.github.com/stevenjswanson/0b3650a123fd93f54a1fd9b973ed2e65
I start it with docker stack deploy -c docker-compose.yml slurm-stack.
Example Logs
These are representative logs from a run where jobs don't finish consistently. In this case, jobs 2 (running on c2) and 3 (running on c1) don't complete correctly, but job 1 (running on c1) does.
slurmctld logs: https://gist.github.com/stevenjswanson/67ca4c76bc00200d52b2d05ab7bfb422
slurmdbd logs: https://gist.github.com/stevenjswanson/b49d9571dbf6b9160555db3a0867410f
c1 logs: https://gist.github.com/stevenjswanson/fab9ce8510804919fafe36804fd417f6
c2 logs: https://gist.github.com/stevenjswanson/dd03f5bdf77851115086801691410099
mysql logs: https://gist.github.com/stevenjswanson/d7cfb82adde9c260ea4673e2037363d1
Software Version Info
Slurm version info:
$ sinfo -V
slurm-wlm 21.08.5
Docker version information:
$ docker version
Client:
Version: 20.10.12
API version: 1.41
Go version: go1.17.3
Git commit: 20.10.12-0ubuntu4
Built: Mon Mar 7 17:10:06 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.22
API version: 1.41 (minimum version 1.12)
Go version: go1.18.9
Git commit: 42c8b31
Built: Thu Dec 15 22:25:49 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.14
GitCommit: 9ba4b250366a5ddde94bb7c9d1def331423aa323
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Linux version:
$ uname -a
Linux slurmctld 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Edit: For good measure, I rebuilt everything on a brand new cloud instance with the latest docker (24.0.5) and kernel (5.15.0-78). The results are the same.
2 Answers
I don't fully understand what was going on, but I believe the internal load balancer was to blame.
I set endpoint_mode: dnsrr for the hosts and the problem went away, as did the double IP addresses for each container. My guess is that the load balancer was distributing requests bound for the same container across the two IP addresses, and that was confusing Slurm somehow.
However, I still don't understand why there were two addresses in the first place.
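Concretely, this amounts to adding endpoint_mode: dnsrr to each service's deploy block in docker-compose.yml. A minimal sketch, assuming the services are declared as described in the question (the image name is a placeholder; the real file is linked above):

version: "3.8"

services:
  slurmctld:
    image: slurm-docker-cluster:latest   # placeholder
    deploy:
      endpoint_mode: dnsrr   # resolve the service name directly to task IPs instead of a VIP
  # the same deploy block goes on mysql, slurmdbd, c1, and c2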
docker creates a VIP, or virtual IP, associated with each service. This VIP will, in the case that multiple tasks exist, load balance between the healthy tasks. It also ensures that consumers are not affected by IP changes when tasks restart.
Each task container gets its own IP. Normally consumers are insulated from this duality: the service name is associated with the VIP, and tasks.<service> is the dnsrr entry associated with the 0, 1, or more IPs of the individual containers. However, docker also registers the hostname in its internal DNS, and here steps in a frequent antipattern that refuses to die: lots of compose files, for no reason at all, just love to declare a hostname the same as the service name.
This, as you have found, can have weird unintended side effects, as now the hostname AND the service name both resolve, resulting in a dnsrr lookup that returns both the VIP and the task IP, where really you just want one response.
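For illustration, this is the pattern being described; the service and image names are just examples, not taken from the linked files:

services:
  c1:
    image: slurm-docker-cluster:latest   # placeholder
    hostname: c1    # duplicates the service name, so both names resolve in docker's internal DNS

If that is what is happening here, removing the hostname line, or switching the service to endpoint_mode: dnsrr as in the other answer, should leave a single DNS answer per name.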