I am defining two Nextflow processes. The first one, scatter(), creates two files. Then parallel() is spawned twice, once for each file.
Here is my setup.
// bug.nf
nextflow.enable.dsl = 2

workflow {
    main:
    scatter(params.config)
    scatter.out.configs
        | flatten
        | parallel
}

process scatter {
    container "python:3.11.8"

    input:
    path "config.txt"

    output:
    path "config*.txt", emit: configs

    script:
    """
    echo $PWD
    ls -hal /home/alex/my_cool_repo
    touch config1.txt
    touch config2.txt
    """
}

process parallel {
    container "python:3.11.8"

    input:
    path "config.txt"

    script:
    """
    echo $PWD
    ls -hal /home/alex/my_cool_repo
    """
}
// run command
nextflow run nextflow/bug.nf --config /home/alex/my_cool_repo/my_cool_repo/config/bla.txt
The ls output from all processes should look the same, but it does not.
Output from scatter() (truncated):
/home/alex/my_cool_repo
total 656K
drwxrwxr-x 16 1035 1036 4.0K Feb 17 13:20 .
drwxr-xr-x 3 root root 4.0K Feb 17 13:20 ..
-rw-rw-r-- 1 1035 1036 3.3K Feb 17 11:09 .dockerignore
-rw-rw-r-- 1 1035 1036 3.2K Feb 6 15:33 .gitignore
drwxrwxr-x 4 1035 1036 4.0K Feb 17 13:20 .nextflow
-rw-rw-r-- 1 1035 1036 5.4K Feb 17 13:20 .nextflow.log
-rw-rw-r-- 1 1035 1036 5 Jan 26 18:18 .python-version
drwxrwxr-x 6 1035 1036 4.0K Feb 7 14:20 .venv
drwxrwxr-x 2 1035 1036 4.0K Feb 6 13:28 .vscode
-rw-rw-r-- 1 1035 1036 848 Feb 17 12:28 Dockerfile
-rw-rw-r-- 1 1035 1036 627 Feb 6 15:33 README.md
drwxrwxr-x 3 1035 1036 4.0K Feb 17 12:55 nextflow
-rw-rw-r-- 1 1035 1036 527K Feb 17 11:45 poetry.lock
-rw-rw-r-- 1 1035 1036 32 Jan 26 18:18 poetry.toml
-rw-rw-r-- 1 1035 1036 2.2K Feb 16 19:36 pyproject.toml
drwxrwxr-x 9 1035 1036 4.0K Feb 6 13:28 my_cool_repo
drwxrwxr-x 3 1035 1036 4.0K Feb 17 13:20 work
Output from the two parallel() processes:
/home/alex/my_cool_repo
total 12K
drwxr-xr-x 3 root root 4.0K Feb 17 13:20 .
drwxr-xr-x 3 root root 4.0K Feb 17 13:20 ..
drwxrwxr-x 5 1035 1036 4.0K Feb 17 13:20 work
Why are the outputs not the same?
Context: Instead of ls, I actually would like to run poetry run ..., but poetry gives the following error message for the parallel() processes: Poetry could not find a pyproject.toml file in /home/alex/my_cool_repo/work/f3/766313fbc5d6aeeb39f19193956ffd or its parents.
2 Answers
One thing to check is how the input is declared in the two processes. In the question, both scatter and parallel declare it as path "config.txt", which stages the file under that fixed name; declaring it as a named variable, path config, keeps the original file name and lets you refer to it as config in the script.
If you go that route, use the same variable name in both processes. Here is the code with that change:
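// bug.nf
nextflow.enable.dsl = 2

workflow {
    main:
    // DSL2 has no .into operator; pipe the emitted channel as in the question
    scatter(params.config)
    scatter.out.configs
        | flatten
        | parallel
}

process scatter {
    container "python:3.11.8"
    // remaining blocks as in the question, with the input declared as: path config
}

process parallel {
    container "python:3.11.8"
    // remaining blocks as in the question, with the input declared as: path config
}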
Now the variable config is used consistently in both the scatter and parallel processes. Whether this alone makes the output of the ls commands identical depends on which directories get mounted into the container.
As user dbthorbur points out in his comment, the difference has to do with the directories mounted into your container.
For your first process, scatter, you are using an additional file input that is located somewhere else on your machine. So Nextflow needs to mount that location AND your work directory into the container used for scatter. Apparently it takes a common root(?) directory of both, so that you find some additional files.
The second process, parallel, on the other hand only takes input from work, so only that directory gets mounted as a volume for your container.
Check out your .command.run scripts in the work directories to see what actually gets mounted by docker (or podman?).
There are two ways to overcome the difference.
Use stageInMode "copy" as a directive for scatter to get the behaviour of parallel in both processes,
or
use containerOptions "-v /home/alex/my_cool_repo:/home/alex/my_cool_repo" as a directive in parallel to get the current behaviour of scatter in both.
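For illustration, here is a rough sketch of where the two directives would go. It assumes Docker as the container engine (the -v flag is a Docker/Podman bind-mount option) and reuses the process bodies from the question in abbreviated form. The two options are alternatives, so in practice you would apply only one of them; they are shown together only to make the placement clear.

```nextflow
// Sketch only, not a tested pipeline: pick ONE of the two options below.

process scatter {
    container "python:3.11.8"

    // Option 1: copy input files into the task directory instead of linking
    // them, so the original location of the config file should no longer
    // need to be visible inside the container.
    stageInMode "copy"

    input:
    path "config.txt"

    output:
    path "config*.txt", emit: configs

    script:
    """
    ls -hal /home/alex/my_cool_repo
    touch config1.txt
    touch config2.txt
    """
}

process parallel {
    container "python:3.11.8"

    // Option 2: explicitly bind-mount the repository into the container,
    // so the repository files are visible here as well.
    containerOptions "-v /home/alex/my_cool_repo:/home/alex/my_cool_repo"

    input:
    path "config.txt"

    script:
    """
    ls -hal /home/alex/my_cool_repo
    """
}
```

For the poetry use case from the question, the second option is probably the relevant one, since poetry looks for pyproject.toml in the current directory and its parents, and the task directory lives under /home/alex/my_cool_repo/work. After a run, the .command.run file in each task directory should show the resulting container invocation, which is a quick way to confirm which volumes were actually mounted.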