I’m working on a Node.js application using Socket.IO, deployed with Docker Swarm, and I want to be able to run multiple instances of the application service. But the application fails when there is more than one instance: the browser reports an error for every Socket.IO message, and the data that is supposed to arrive in those messages never does.
The Docker Stack file has four services:
- the Node.js application
- the Redis instance required for handling Socket.IO and sessions in a multi-node Socket.IO service. (Yes, I’ve read the Socket.IO documentation on this: the app implements the connect-redis session store and uses socket.io-redis for multi-node Socket.IO; a sketch of that wiring follows this list.)
- the database (MySQL)
- a reverse proxy – I’ve used both NGINX and Traefik
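For reference, the multi-node wiring in the app looks roughly like this (a minimal sketch, not my actual app.mjs; the session secret is a placeholder, and the Redis settings come from the environment variables shown in the stack file below):

const http = require('http');
const express = require('express');
const session = require('express-session');
const redis = require('redis');
const RedisStore = require('connect-redis')(session);
const redisAdapter = require('socket.io-redis');

const app = express();
const server = http.createServer(app);
const io = require('socket.io')(server);

// Connection settings come from the environment variables in the stack file
const redisOptions = {
    host: process.env.REDIS_ENDPOINT,
    password: process.env.REDIS_PASSWD
};

// connect-redis: the Express session store shared by every instance
app.use(session({
    store: new RedisStore({ client: redis.createClient(redisOptions) }),
    secret: 'SESSION-SECRET-HIDDEN',   // placeholder
    resave: false,
    saveUninitialized: false
}));

// socket.io-redis: routes broadcasts between instances over Redis pub/sub
io.adapter(redisAdapter({
    pubClient: redis.createClient(redisOptions),
    subClient: redis.createClient(redisOptions)
}));

server.listen(80);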
In Socket.IO there is a routine keep-alive request, such as a GET on /socket.io/?EIO=3&transport=polling&t=NLjcKJj&sid=X5UnuTjlYNJ4N8OsAAAH. These requests show up in the log file for the reverse proxy and are handled by the application; the debugging output from Engine.IO says it receives them.
Specifically:
2020-10-28T05:06:02.557Z Net read redis:6379 id 0
2020-10-28T05:06:02.557Z socket.io:socket socket connected - writing packet
2020-10-28T05:06:02.557Z socket.io:socket joining room X5UnuTjlYNJ4N8OsAAAH
2020-10-28T05:06:02.557Z socket.io:client writing packet {"type":0,"nsp":"/"}
2020-10-28T05:06:02.557Z socket.io:socket joined room [ 'X5UnuTjlYNJ4N8OsAAAH' ]
2020-10-28T05:06:02.656Z engine intercepting request for path "/socket.io/"
2020-10-28T05:06:02.656Z engine handling "GET" http request "/socket.io/?EIO=3&transport=polling&t=NLjcKJj&sid=X5UnuTjlYNJ4N8OsAAAH"
2020-10-28T05:06:02.656Z engine setting new request for existing client
2020-10-28T05:06:02.655Z engine intercepting request for path "/socket.io/"
2020-10-28T05:06:02.655Z engine handling "POST" http request "/socket.io/?EIO=3&transport=polling&t=NLjcKJh&sid=X5UnuTjlYNJ4N8OsAAAH"
2020-10-28T05:06:02.655Z engine unknown sid "X5UnuTjlYNJ4N8OsAAAH"
2020-10-28T05:06:02.774Z engine intercepting request for path "/socket.io/"
2020-10-28T05:06:02.774Z engine handling "GET" http request "/socket.io/?EIO=3&transport=polling&t=NLjcKLI&sid=X5UnuTjlYNJ4N8OsAAAH"
2020-10-28T05:06:02.774Z engine unknown sid "X5UnuTjlYNJ4N8OsAAAH"
2020-10-28T05:06:02.775Z engine intercepting request for path "/socket.io/"
2020-10-28T05:06:02.775Z engine handling "POST" http request "/socket.io/?EIO=3&transport=polling&t=NLjcKLJ&sid=X5UnuTjlYNJ4N8OsAAAH"
2020-10-28T05:06:02.775Z engine setting new request for existing client
2020-10-28T05:06:02.775Z socket.io:client client close with reason transport close
2020-10-28T05:06:02.775Z socket.io:socket closing socket - reason transport close
2020-10-28T05:09:14.955Z socket.io:client client close with reason ping timeout
2020-10-28T05:09:14.955Z socket.io:socket closing socket - reason ping timeout
The log message saying engine unknown sid "X5UnuTjlYNJ4N8OsAAAH" seems significant. It is saying the session ID is not known. But the sessions are shared between the nodes using Redis, so it is confusing why a session would be unknown when connect-redis is supposed to be sharing them. (As far as I can tell, the sid in these URLs is the Engine.IO connection ID, which lives in the memory of whichever instance handled the handshake; it is not the Express session ID that connect-redis stores in Redis.)
Another significant thing is the logging in the browser. The JavaScript console continuously reports these messages:
WebSocket connection to 'ws://DOMAIN-NAME/socket.io/?EIO=3&transport=websocket&sid=h2aFFkOvNZtFc1DcAAAI' failed: WebSocket is closed before the connection is established.
Failed to load resource: the server responded with a status of 400 (Bad Request)
The 400 error is reported as occurring with http://DOMAIN-NAME/socket.io/?EIO=3&transport=polling&t=NLjf5hB&sid=h2aFFkOvNZtFc1DcAAAI. For those requests the response body is:
{
"code": 1,
"message": "Session ID unknown"
}
That is obviously consistent with the unknown sid message earlier. I take it to mean the connection is being closed because the server does not recognize the session ID.
In the research I’ve done into this, I’ve learned that Docker Swarm distributes traffic in a round-robin fashion; that is, Docker Swarm acts as a round-robin load balancer. That explains the failure: the long-polling handshake is answered by one instance, and subsequent polling requests carrying that sid land on other instances that have never seen it. The success path with Socket.IO in such a case is to implement sticky sessions, so that every request for a given connection is routed to the instance that created it.
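As an aside, the Socket.IO documentation describes an alternative to sticky sessions: configure the client to use only the WebSocket transport, so there is no long-polling handshake to pin to an instance. A minimal sketch of that client option (I would rather keep the polling fallback, so I am pursuing sticky sessions instead):

// Client side: WebSocket only, skipping the HTTP long-polling handshake
const socket = io('http://DOMAIN-NAME/', {
    transports: [ 'websocket' ]
});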
I read somewhere that NGINX’s sticky-session support does not work for this situation, while Traefik’s does.
In NGINX I had this proxy configuration:
location / {
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_set_header X-NginX-Proxy false;
    proxy_pass http://todos;
    proxy_redirect off;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

upstream todos {
    ip_hash;
    server todo:80 fail_timeout=1s max_fails=3;
    keepalive 16;
}
That did not change the behavior: still unknown sid and so on. (In hindsight I suspect ip_hash could not help here, because the upstream has a single entry, todo, which resolves to the Swarm service’s virtual IP; Swarm’s mesh routing round-robins behind that VIP, where NGINX has no influence. See the sketch below.) Hence I’ve switched to Traefik, and I’m having trouble finding documentation on this aspect of Traefik. It’s my first time using Traefik, FWIW. I was able to implement HTTPS using Let’s Encrypt, but not the sticky sessions.
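A hypothetical direction I did not pursue with NGINX, for reference: Swarm can skip the VIP so that the todo hostname resolves to one address per task, giving ip_hash real candidates to choose among. endpoint_mode is a real Swarm option, but I have not verified that it fixes the stickiness here:

todo:
  deploy:
    replicas: 2
    # dnsrr = DNS round-robin: "todo" resolves to each task's own IP
    # instead of one virtual IP fronted by Swarm's mesh load balancer
    endpoint_mode: dnsrr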
To configure Traefik, I’m using command line arguments and Docker container labels such that the entire configuration is in the Docker Stack file.
traefik:
  image: traefik:v2.0
  restart: always
  ports:
    - "80:80"     # <== http
    - "8080:8080" # <== the dashboard runs on :8080
    - "443:443"   # <== https
  deploy:
    replicas: 1
    labels:
      #### Labels define the behavior and rules of the traefik proxy for this container ####
      - "traefik.enable=true" # <== Enable traefik on itself to view the dashboard and assign a subdomain to it
      - "traefik.http.routers.api.rule=Host(`monitor.DOMAIN-NAME`)" # <== Setting the domain for the dashboard
      - "traefik.http.routers.api.service=api@internal" # <== Enabling the api to be a service to access
      - "traefik.http.routers.api.entrypoints=web"
    placement:
      constraints:
        - "node.hostname==srv1"
  command:
    - "--providers.docker.swarmmode=true"
    - "--providers.docker.endpoint=unix:///var/run/docker.sock"
    - "--providers.docker.watch=true"
    - "--log.level=DEBUG"
    - "--accesslog=true"
    - "--tracing=true"
    - "--api.insecure=true" # <== Enabling the insecure api, NOT RECOMMENDED FOR PRODUCTION
    - "--api.dashboard=true" # <== Enabling the dashboard to view services, middlewares, routers, etc.
    - "--providers.docker=true" # <== Enabling docker as the provider for traefik
    - "--providers.docker.exposedbydefault=false" # <== Don't expose every container to traefik, only expose enabled ones
    - "--providers.docker.network=todo_webnet" # <== Operate on the docker network named todo_webnet
    - "--entrypoints.web.address=:80" # <== Defining an entrypoint for port :80 named web
    - "--entrypoints.web-secured.address=:443" # <== Defining an entrypoint for https on port :443 named web-secured
    - "--certificatesresolvers.mytlschallenge.acme.tlschallenge=false" # <== TLS-ALPN-01 disabled; the HTTP-01 challenge below is used instead
    - "--certificatesresolvers.mytlschallenge.acme.email=E-MAIL-ADDRESS@DOMAIN-NAME" # <== Setting the email for certs
    - "--certificatesresolvers.mytlschallenge.acme.storage=/letsencrypt/acme.json" # <== Defining the acme file to store certs
    - "--certificatesresolvers.mytlschallenge.acme.httpChallenge.entryPoint=web"
  volumes:
    - /home/ubuntu/letsencrypt:/letsencrypt # <== Volume for certs (TLS)
    - /var/run/docker.sock:/var/run/docker.sock # <== Volume for docker admin
  networks:
    - webnet
todo:
  image: robogeek/todo-app:first-dockerize-redis
  # ports:
  #   - "80:80"
  networks:
    - dbnet
    - webnet
    - redisnet
  deploy:
    replicas: 2
    labels:
      #### Labels define the behavior and rules of the traefik proxy for this container ####
      - "traefik.enable=true" # <== Enable traefik to proxy this container
      - "traefik.http.routers.todo.rule=Host(`DOMAIN-NAME`)" # <== Your domain name goes here for the http rule
      - "traefik.http.routers.todo.entrypoints=web" # <== Defining the entrypoint for http
      - "traefik.http.routers.todo.service=todo"
      - "traefik.http.services.todo.loadbalancer.healthcheck.port=80"
      - "traefik.http.services.todo.loadbalancer.sticky=true"
      - "traefik.http.services.todo.loadbalancer.server.port=80"
      - "traefik.http.routers.todo-secured.rule=Host(`DOMAIN-NAME`)" # <== Your domain name goes here for the https rule
      - "traefik.http.routers.todo-secured.entrypoints=web-secured" # <== Defining the entrypoint for https
      - "traefik.http.routers.todo-secured.service=todo"
      - "traefik.http.routers.todo-secured.tls=true"
      - "traefik.http.routers.todo-secured.tls.certresolver=mytlschallenge" # <== Defining the certresolver for https
      # - "traefik.http.routers.todo-app.middlewares=redirect@file" # <== Middleware to redirect to https
      # - "traefik.http.routers.nginx-secured.rule=Host(`example.com`)" # <== Your domain name for the https rule
      # - "traefik.http.routers.nginx-secured.entrypoints=web-secured" # <== Defining the entrypoint for https
  depends_on:
    - db
    - redis
  dns:
    - 8.8.8.8
    - 9.9.9.9
  environment:
    - SEQUELIZE_CONNECT=models/sequelize-mysql-docker.yaml
    - SEQUELIZE_DBHOST=db
    - SEQUELIZE_DBNAME=tododb
    - SEQUELIZE_DBUSER=dbuser
    - SEQUELIZE_DBPASSWD=PASS-WORD-HIDDEN
    - REDIS_ENDPOINT=redis
    - NODE_DEBUG=redis
    - REDIS_PASSWD=PASS-WORD-HIDDEN
    - DEBUG=todos:*,ioredis:*,socket.io:*,engine
  command: [ "./wait-for-it.sh", "-t", "0", "db:3306", "--", "node", "./app.mjs" ]
2 Answers
Looking on the Traefik forum I found this: https://community.traefik.io/t/sticky-sessions-dont-work/1949
Per the discussion there, I added a sticky-cookie label to the todo container, and now it works fine. I have scaled from 1 up to 4 containers so far and it is working great.
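The fix is Traefik v2’s sticky-cookie option on the service, along these lines (note it replaces the bare sticky=true label in the stack file above; the cookie name is arbitrary):

      - "traefik.http.services.todo.loadbalancer.sticky.cookie=true"
      - "traefik.http.services.todo.loadbalancer.sticky.cookie.name=StickyCookie"   # any cookie name works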
Just in case someone is running in HTTPS mode, this was my configuration, set within the labels section of the docker-compose file (sketched below). You can change StickyCookie to any value you want.
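A sketch of those labels, reusing the todo service name from the stack file above (sticky.cookie.secure restricts the cookie to HTTPS):

      - "traefik.http.services.todo.loadbalancer.sticky.cookie=true"
      - "traefik.http.services.todo.loadbalancer.sticky.cookie.secure=true"
      - "traefik.http.services.todo.loadbalancer.sticky.cookie.name=StickyCookie"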