EDIT 1: Tried with network_mode: host on the worker nodes; same result.
I am setting up a multi-node, multi-container Spark cluster in standalone mode:
1 node with 1 Spark master and X workers
docker-compose.yml for the master + worker node:
version: '2'
services:
  spark:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
    ports:
      - '8080:8080'
      - '4040:4040'
      - '7077:7077'
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
    deploy:
      mode: replicated
      replicas: 4
N nodes with 1…M workers
docker-compose.yml for the worker-only nodes:
version: '2'
services:
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://1.1.1.1:7077
    network_mode: host
    deploy:
      mode: replicated
      replicas: 4
I can see the correct number of workers registered on the Spark master web UI.
But when I submit a job on the master, the master logs fill up with:
spark_1 | 22/07/01 13:32:27 INFO Master: Removing executor app-20220701133058-0002/499 because it is EXITED
spark_1 | 22/07/01 13:32:27 INFO Master: Launching executor app-20220701133058-0002/530 on worker worker-20220701130135-172.18.0.4-35337
spark_1 | 22/07/01 13:32:27 INFO Master: Removing executor app-20220701133058-0002/501 because it is EXITED
spark_1 | 22/07/01 13:32:27 INFO Master: Launching executor app-20220701133058-0002/531 on worker worker-20220701132457-172.18.0.5-39517
spark_1 | 22/07/01 13:32:27 INFO Master: Removing executor app-20220701133058-0002/502 because it is EXITED
spark_1 | 22/07/01 13:32:27 INFO Master: Launching executor app-20220701133058-0002/532 on worker worker-20220701132457-172.18.0.2-43527
spark_1 | 22/07/01 13:32:27 INFO Master: Removing executor app-20220701133058-0002/505 because it is EXITED
spark_1 | 22/07/01 13:32:27 INFO Master: Launching executor app-20220701133058-0002/533 on worker worker-20220701130134-172.18.0.3-35961
spark_1 | 22/07/01 13:32:27 INFO Master: Removing executor app-20220701133058-0002/504 because it is EXITED
spark_1 | 22/07/01 13:32:27 INFO Master: Launching executor app-20220701133058-0002/534 on worker worker-20220701132453-172.18.0.5-40345
spark_1 | 22/07/01 13:32:28 INFO Master: Removing executor app-20220701133058-0002/506 because it is EXITED
spark_1 | 22/07/01 13:32:28 INFO Master: Launching executor app-20220701133058-0002/535 on worker worker-20220701132454-172.18.0.2-42907
spark_1 | 22/07/01 13:32:28 INFO Master: Removing executor app-20220701133058-0002/514 because it is EXITED
spark_1 | 22/07/01 13:32:28 INFO Master: Launching executor app-20220701133058-0002/536 on worker worker-20220701132442-172.18.0.2-41669
spark_1 | 22/07/01 13:32:28 INFO Master: Removing executor app-20220701133058-0002/503 because it is EXITED
spark_1 | 22/07/01 13:32:28 INFO Master: Launching executor app-20220701133058-0002/537 on worker worker-20220701132454-172.18.0.3-37011
spark_1 | 22/07/01 13:32:28 INFO Master: Removing executor app-20220701133058-0002/509 because it is EXITED
spark_1 | 22/07/01 13:32:28 INFO Master: Launching executor app-20220701133058-0002/538 on worker worker-20220701132455-172.18.0.4-42013
spark_1 | 22/07/01 13:32:28 INFO Master: Removing executor app-20220701133058-0002/507 because it is EXITED
spark_1 | 22/07/01 13:32:28 INFO Master: Launching executor app-20220701133058-0002/539 on worker worker-20220701132510-172.18.0.3-39097
spark_1 | 22/07/01 13:32:28 INFO Master: Removing executor app-20220701133058-0002/508 because it is EXITED
spark_1 | 22/07/01 13:32:28 INFO Master: Launching executor app-20220701133058-0002/540 on worker worker-20220701132510-172.18.0.2-40827
spark_1 | 22/07/01 13:32:28 INFO Master: Removing executor app-20220701133058-0002/513 because it is EXITED
Sample remote worker logs:
spark-worker_1 | 22/07/01 13:32:32 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=38385" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@63ab9653f1c0:38385" "--executor-id" "561" "--hostname" "172.18.0.4" "--cores" "1" "--app-id" "app-20220701133058-0002" "--worker-url" "spark://[email protected]:35337"
spark-worker_1 | 22/07/01 13:32:38 INFO Worker: Executor app-20220701133058-0002/561 finished with state EXITED message Command exited with code 1 exitStatus 1
spark-worker_1 | 22/07/01 13:32:38 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 561
spark-worker_1 | 22/07/01 13:32:38 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220701133058-0002, execId=561)
spark-worker_1 | 22/07/01 13:32:38 INFO Worker: Asked to launch executor app-20220701133058-0002/595 for API Bruteforce
spark-worker_1 | 22/07/01 13:32:38 INFO SecurityManager: Changing view acls to: spark
spark-worker_1 | 22/07/01 13:32:38 INFO SecurityManager: Changing modify acls to: spark
spark-worker_1 | 22/07/01 13:32:38 INFO SecurityManager: Changing view acls groups to:
spark-worker_1 | 22/07/01 13:32:38 INFO SecurityManager: Changing modify acls groups to:
spark-worker_1 | 22/07/01 13:32:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker_1 | 22/07/01 13:32:38 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=38385" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@63ab9653f1c0:38385" "--executor-id" "595" "--hostname" "172.18.0.4" "--cores" "1" "--app-id" "app-20220701133058-0002" "--worker-url" "spark://[email protected]:35337"
spark-worker_1 | 22/07/01 13:32:43 INFO Worker: Executor app-20220701133058-0002/595 finished with state EXITED message Command exited with code 1 exitStatus 1
spark-worker_1 | 22/07/01 13:32:43 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 595
spark-worker_1 | 22/07/01 13:32:43 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220701133058-0002, execId=595)
spark-worker_1 | 22/07/01 13:32:43 INFO Worker: Asked to launch executor app-20220701133058-0002/629 for API Bruteforce
spark-worker_1 | 22/07/01 13:32:43 INFO SecurityManager: Changing view acls to: spark
spark-worker_1 | 22/07/01 13:32:43 INFO SecurityManager: Changing modify acls to: spark
spark-worker_1 | 22/07/01 13:32:43 INFO SecurityManager: Changing view acls groups to:
spark-worker_1 | 22/07/01 13:32:43 INFO SecurityManager: Changing modify acls groups to:
spark-worker_1 | 22/07/01 13:32:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker_1 | 22/07/01 13:32:43 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=38385" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@63ab9653f1c0:38385" "--executor-id" "629" "--hostname" "172.18.0.4" "--cores" "1" "--app-id" "app-20220701133058-0002" "--worker-url" "spark://[email protected]:35337"
Throughput is very low, and CPU usage on the worker nodes reaches 100%.
I believe it has something to do with the Docker port mapping on the worker nodes, but I can't figure out which ports I need to expose on the worker containers. And if they are the same ports, how would I configure them for multiple containers on the same machine?
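From what I can tell, the standalone Worker binds a random RPC port unless SPARK_WORKER_PORT is set (the 35337, 39517, ... suffixes in the worker IDs above look like exactly those ports), and its web UI defaults to 8081. I assume something like the following would at least make the ports predictable per container, though I haven't verified that the bitnami image passes these variables through to the Worker:

version: '2'
services:
  spark-worker-1:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://1.1.1.1:7077
      - SPARK_WORKER_PORT=35001        # fixed RPC port instead of a random one
      - SPARK_WORKER_WEBUI_PORT=8081
  spark-worker-2:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://1.1.1.1:7077
      - SPARK_WORKER_PORT=35002        # a different port per container on the same host
      - SPARK_WORKER_WEBUI_PORT=8082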
2 Answers
I think you should add the ports in the docker-compose of the worker nodes.
I don't know if you ever solved this, but one thing I've seen is that the Spark master makes (outbound) connections to the workers it is managing.
As written, the docker-compose.yml for your workers does not expose any ports. So my guess is that the workers connect to the master and the master records this, which is why it sees the workers; however, when the master tries to connect back to the workers, there are no ports exposed and it times out.
So I'd expose the workers' ports in your docker-compose.yml as well.
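For illustration, a rough sketch of what that could look like on the worker-only nodes (untested; it assumes the image passes SPARK_WORKER_PORT and SPARK_WORKER_WEBUI_PORT through to the Worker so the published ports are predictable):

version: '2'
services:
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://1.1.1.1:7077
      - SPARK_WORKER_PORT=35000         # pin the RPC port so it can be published
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - '35000:35000'                   # worker RPC port, reachable from the master/driver
      - '8081:8081'                     # worker web UI (optional)

Note that a fixed host port mapping doesn't combine with replicas: 4 (each replica would need its own host port), so you would either declare one service per worker with distinct ports or stick with network_mode: host and give each worker a distinct SPARK_WORKER_PORT. Also, the executors the worker launches bind their own ports (spark.blockManager.port is random by default), so you may need to pin and open those as well.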