
I am defining two Nextflow processes. The first one, scatter(), creates two files. Then parallel() is spawned twice, once for each file.

Here is my setup.

// bug.nf
nextflow.enable.dsl = 2

workflow {
    main:
        scatter(params.config)

        scatter.out.configs
            | flatten
            | parallel
}

process scatter {
    container "python:3.11.8"

    input:
        path "config.txt"

    output:
        path "config*.txt", emit: configs

    script:
        """
        echo $PWD
        ls -hal /home/alex/my_cool_repo

        touch config1.txt
        touch config2.txt
        """
}

process parallel {
    container "python:3.11.8"
    
    input:
        path "config.txt"

    script:
        """
        echo $PWD
        ls -hal /home/alex/my_cool_repo
        """
}

// run command
nextflow run nextflow/bug.nf --config /home/alex/my_cool_repo/my_cool_repo/config/bla.txt

I would expect the ls output to look the same in all processes, but it does not.

Output from scatter() (truncated):

/home/alex/my_cool_repo
total 656K
drwxrwxr-x 16 1035 1036 4.0K Feb 17 13:20 .
drwxr-xr-x  3 root root 4.0K Feb 17 13:20 ..
-rw-rw-r--  1 1035 1036 3.3K Feb 17 11:09 .dockerignore
-rw-rw-r--  1 1035 1036 3.2K Feb  6 15:33 .gitignore
drwxrwxr-x  4 1035 1036 4.0K Feb 17 13:20 .nextflow
-rw-rw-r--  1 1035 1036 5.4K Feb 17 13:20 .nextflow.log
-rw-rw-r--  1 1035 1036    5 Jan 26 18:18 .python-version
drwxrwxr-x  6 1035 1036 4.0K Feb  7 14:20 .venv
drwxrwxr-x  2 1035 1036 4.0K Feb  6 13:28 .vscode
-rw-rw-r--  1 1035 1036  848 Feb 17 12:28 Dockerfile
-rw-rw-r--  1 1035 1036  627 Feb  6 15:33 README.md
drwxrwxr-x  3 1035 1036 4.0K Feb 17 12:55 nextflow
-rw-rw-r--  1 1035 1036 527K Feb 17 11:45 poetry.lock
-rw-rw-r--  1 1035 1036   32 Jan 26 18:18 poetry.toml
-rw-rw-r--  1 1035 1036 2.2K Feb 16 19:36 pyproject.toml
drwxrwxr-x  9 1035 1036 4.0K Feb  6 13:28 my_cool_repo
drwxrwxr-x  3 1035 1036 4.0K Feb 17 13:20 work

Output from the two parallel() processes:

/home/alex/my_cool_repo
total 12K
drwxr-xr-x 3 root root 4.0K Feb 17 13:20 .
drwxr-xr-x 3 root root 4.0K Feb 17 13:20 ..
drwxrwxr-x 5 1035 1036 4.0K Feb 17 13:20 work

Why are the outputs not the same?

Context: Instead of ls I actually would like to run poetry run ... but poetry gives the following error message for the parallel() processes: Poetry could not find a pyproject.toml file in /home/alex/my_cool_repo/work/f3/766313fbc5d6aeeb39f19193956ffd or its parents.

2 Answers


  1. Both of your processes declare their input as path "config.txt", which stages the incoming file under that fixed name instead of binding it to a variable. I would declare the input as path config in both processes, so that each one receives its file through a variable named config.

    To resolve this, you should use the same variable name in both processes. Here’s the corrected code:

    // bug.nf
    nextflow.enable.dsl = 2

    workflow {
        main:
            // run scatter once on the file passed via --config
            scatter(params.config)

            // emit each generated file separately so parallel runs once per file
            scatter.out.configs
                | flatten
                | parallel
    }

    process scatter {
        container "python:3.11.8"

        input:
            path config

        output:
            path "config*.txt", emit: configs

        script:
            """
            echo $PWD
            ls -hal /home/alex/my_cool_repo

            touch config1.txt
            touch config2.txt
            """
    }

    process parallel {
        container "python:3.11.8"

        input:
            path config

        script:
            """
            echo $PWD
            ls -hal /home/alex/my_cool_repo
            """
    }

    Now the variable config is used consistently in both the scatter and parallel processes. Whether the ls output then matches still depends on which host directories Nextflow mounts into each container.

  2. As user dbthorbur points out in his comment, the difference has to do with the directories mounted into your container.

    Your first process, scatter, takes an additional file input that is located somewhere else on your machine. So Nextflow needs to mount that location AND your work directory into the container used for scatter. Apparently it takes a common root(?) directory of both, which is why you find some additional files there.

    The second process parallel on the other hand only takes input from work, so only that directory gets mounted as volume for your container.

    Check out your .command.run scripts in the work-directories to see what actually gets mounted by docker (or podman?).
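
    To compare the mounts of all tasks at once, a small shell helper can print the -v arguments from every .command.run under the work directory. This is only a sketch: the function name list_task_mounts is made up for illustration, and it assumes the container is started with docker and that mounts appear as -v <src>:<dst> on the run line.

```shell
# Hypothetical helper: list the volume mounts recorded in each task's .command.run.
# Assumption: mounts are written as "-v <src>:<dst>" (Docker-style) without spaces in the paths.
list_task_mounts() {
    work_dir="${1:-work}"
    find "$work_dir" -name .command.run -type f | sort | while IFS= read -r run_script; do
        # one header line per task, followed by its distinct mounts
        echo "== $run_script"
        grep -o -- '-v [^ ]*' "$run_script" | sort -u
    done
}
```

    Running list_task_mounts work from the launch directory should make it visible if scatter mounts more than parallel does.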

    There are two ways to overcome the difference.

    • Use the directive stageInMode "copy" in scatter to get the behaviour of parallel in both processes,
      or
    • use the directive containerOptions "-v /home/alex/my_cool_repo:/home/alex/my_cool_repo" in parallel to get the current behaviour of scatter in both.
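
    If you would rather not edit bug.nf, either option can also be tried through a small config file passed with -c. The sketch below writes such a file; the helper name write_mount_config and the default file name mounts.config are made up for illustration, and only one of the two settings is needed at a time.

```shell
# Hypothetical helper: write a Nextflow config applying ONE of the two options.
#   copy  -> scatter copies its input into the work directory
#   mount -> parallel gets the repository mounted explicitly
write_mount_config() {
    mode="$1"
    out="${2:-mounts.config}"
    case "$mode" in
        copy)
            printf '%s\n' \
                "process {" \
                "    withName: 'scatter' {" \
                "        stageInMode = 'copy'" \
                "    }" \
                "}" > "$out"
            ;;
        mount)
            printf '%s\n' \
                "process {" \
                "    withName: 'parallel' {" \
                "        containerOptions = '-v /home/alex/my_cool_repo:/home/alex/my_cool_repo'" \
                "    }" \
                "}" > "$out"
            ;;
        *)
            echo "usage: write_mount_config copy|mount [file]" >&2
            return 1
            ;;
    esac
}
```

    For example, write_mount_config mount followed by nextflow run nextflow/bug.nf -c mounts.config --config /home/alex/my_cool_repo/my_cool_repo/config/bla.txt should apply the second option to parallel only.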