
Problem statement

I have a Python program that needs to launch a number of Singularity containers in parallel.

Is it possible to do this, exploiting all of the available hardware, using only built-in libraries (subprocess, concurrent.futures, etc.)?

The ‘host’ script runs on 1 CPU. It is launched by SLURM. The ‘host’ needs to launch the containers, wait for them to complete, do some analysis, repeat.

For example, if I have 40 containers each needing 2 CPUs, and two nodes each with 76 CPUs, then there should be something like:

Node 1 (76 CPUs)          Node 2 (76 CPUs)
------------------------  ------------------------
Host script (1 CPU)       3 containers (6 CPUs)
37 containers (74 CPUs)   70 spare CPUs
1 spare CPU

MWE

Singularity recipe (stress.def)

We use stress to fully utilise a given number of CPUs:

Bootstrap: docker
From: ubuntu:16.04

%post
apt update -y
apt install -y stress 

%runscript
    echo $(uname -n)
    stress "$@"

Build with singularity build stress.simg stress.def.
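Once built, the image can be tested directly: for example, singularity run stress.simg -c 2 -t 10s should print the hostname and then load 2 CPUs for 10 seconds, since the runscript forwards its arguments to stress.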

Python host script (main.py)

Spin up 40 containers, each running the stress image with 2 CPUs for 10s:

from subprocess import Popen

n_processes = 40
cpus_per_process = 2
stress_time = 10

command = [
    "singularity",
    "run",
    "stress.simg",
    "-c",
    str(cpus_per_process),
    "-t",
    f"{stress_time}s",
]
processes = [Popen(command) for i in range(n_processes)]

for p in processes:
    p.wait()
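
The same fan-out / wait pattern can also be written with concurrent.futures from the standard library. A sketch (this only restructures the launch code; it does not change where the containers run):

from concurrent.futures import ThreadPoolExecutor
from subprocess import run

n_processes = 40
cpus_per_process = 2
stress_time = 10

command = [
    "singularity",
    "run",
    "stress.simg",
    "-c",
    str(cpus_per_process),
    "-t",
    f"{stress_time}s",
]

# Threads are enough here: each worker simply blocks on its own
# container process, so the GIL is not a bottleneck.
with ThreadPoolExecutor(max_workers=n_processes) as pool:
    futures = [pool.submit(run, command) for _ in range(n_processes)]
    for future in futures:
        future.result()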

SLURM script

#!/bin/bash
#SBATCH -J stress
#SBATCH -A myacc
#SBATCH -p mypart
#SBATCH --output=%x_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=2
#SBATCH --time=24:00:00

python main.py

Results

The above runs on only one of the two nodes. Total execution time is around 20s, and the Singularity containers run in two batches: the first 38, then the last two.

As such, it does not have the desired effect.

2 Answers


  1. Chosen as BEST ANSWER

    It turns out my question stemmed from a misunderstanding of what should be handled by Singularity and what should be handled by SLURM.

    My mistake was thinking that Singularity could see and utilise other nodes; in reality, it can only see resources available on the current node.

    Solution:

    1. Allocate all the resources that the overall job will need in the SLURM script, treating each container as a new task. So in the above example, that means setting ntasks=40 and cpus-per-task=2.
    2. Launch the subprocesses with srun. This allows SLURM to allocate resources to the containers from the pool that has already been allotted to this particular job.

    Modified main.py:

    from subprocess import Popen
    
    n_processes = 40
    cpus_per_process = 2
    stress_time = 10
    
    command = [
        "srun", # <--------- MODIFICATION
        "singularity",
        "run",
        "stress.simg",
        "-c",
        str(cpus_per_process),
        "-t",
        f"{stress_time}s",
    ]
    processes = [Popen(command) for i in range(n_processes)]
    
    for p in processes:
        p.wait()
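
    Note: depending on the SLURM version and configuration, each srun may also need to be told explicitly to claim a single task from the allocation, so that the steps run side by side rather than each trying to use the whole allocation. A sketch of what that could look like (flag behaviour differs between SLURM releases, so check your site's documentation):

    command = [
        "srun",
        "--nodes=1",    # keep each step on a single node
        "--ntasks=1",   # one task (i.e. one container) per step
        "--exclusive",  # on newer SLURM releases, --exact
        "singularity",
        "run",
        "stress.simg",
        "-c",
        str(cpus_per_process),
        "-t",
        f"{stress_time}s",
    ]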
    

  2. Could you give this a try? Thanks.

    import subprocess
    
    # Define the number of instances, CPUs per instance, and stress duration
    num_instances = 15
    num_cpus = 1
    stress_time = 10
    
    # Create a list to store the subprocess instances
    processes = []
    
    print("Starting Each Process")
    
    # Run multiple instances of the Singularity container in parallel
    for _ in range(num_instances):
        # Define the Singularity command; the arguments after the image
        # are forwarded to stress by the image's runscript
        singularity_cmd = [
            "singularity", "run", "--contain", "stress.simg",
            "-c", str(num_cpus), "-t", f"{stress_time}s",
            # Add any additional arguments or commands here
        ]
    
        # Start the subprocess for each instance
        process = subprocess.Popen(singularity_cmd)
        processes.append(process)
    
    print("Waiting for Process to end")
    
    # Wait for all subprocesses to complete
    for process in processes:
        process.wait()
    
    print("All Processes Completed")
    