
I have a bash script that applies different transformations/mappings to the columns of a TSV file. I am trying to parallelize the transformations using GNU parallel; however, my code hangs.

For simplicity, consider cat, the identity mapper (i.e. input -> output), and a TSV file of three columns (generated on the fly using paste and seq):

n=1000000
map=cat    # identity: inp -> out

rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
paste <(seq $n) <(seq $n) <(seq $n) \
    | tee >(cut -f1 | $map > tmp.col1.fifo) \
    | tee >(cut -f2 | $map > tmp.col2.fifo) \
    | cut -f3- \
    | paste tmp.col{1,2}.fifo - \
    | python -m tqdm > /dev/null

The above code works fine.

NOTE: python -m tqdm > /dev/null reports the processing speed (lines per second)

Next, we can parallelize the mapping tasks using GNU parallel’s --pipe --keep-order arguments. Here is a minimal parallel example that works:

seq 100 | parallel --pipe -k -j4 -N10 'cat && sleep 1'

Now, putting all these together, here is my code that maps the TSV columns in parallel:

n=1000000
map=cat   # identity map: inp -> out
rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
paste <(seq $n) <(seq $n) <(seq $n) \
  | tee >(cut -f1 | parallel --id jobA --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo) \
  | tee >(cut -f2 | parallel --id jobB --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo) \
  | cut -f3- \
  | paste tmp.col{1,2}.fifo - \
  | python -m tqdm > /dev/null

This code was supposed to work; however, it freezes.
Why does it freeze, and how can I unfreeze it?

Environment: Linux 5.15.0-116-generic, Ubuntu 22.04.4 LTS on x86_64

2 Answers


  1. A FIFO has a limited buffer size. Can you re-organize the script this way?

    #!/bin/bash
    
    n=${1-10}
    map=cat
    
    paste <(paste <(seq $n) <(seq $n) <(seq $n) | cut -f1 | parallel --id jobA --pipe -k -j4 -N1000 "$map") \
          <(paste <(seq $n) <(seq $n) <(seq $n) | cut -f2 | parallel --id jobA --pipe -k -j4 -N1000 "$map") \
          <(paste <(seq $n) <(seq $n) <(seq $n) | cut -f3 ) \
        | python -m tqdm > /dev/null
    

    Run with

    bash test.sh 100
    
  2. It is a race condition with the FIFOs – not GNU Parallel.

    Assume this:

    | tee >(cut -f1 | $map1 > tmp.col1.fifo) \
    | tee >(cut -f2 | $map2 > tmp.col2.fifo) \
    | cut -f3- \
    | paste tmp.col{1,2}.fifo - 
    

    Assume that $map1 prints very little and $map2 prints a lot.

    paste tries to read a line from tmp.col1.fifo, but there is nothing to read, so it blocks. $map2 prints a lot to tmp.col2.fifo and fills the FIFO, so it blocks, too. Now nothing drains tmp.col2.fifo (paste is stuck on tmp.col1.fifo), and everything upstream stalls behind the full FIFO: a deadlock.

    You have just been lucky that the race condition did not hit you earlier.
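    The blocking is easy to reproduce in isolation. Here is a minimal sketch (the FIFO name tmp.demo.fifo, the 2-second timeout, and the 1 MiB write size are arbitrary; the default pipe capacity on Linux is 64 KiB): a writer that pushes more data into a FIFO than its buffer holds, with nothing draining the other end, blocks forever, just like $map2 above.

```shell
# Reproduce a writer blocking on a full FIFO (tmp.demo.fifo is a scratch name).
rm -f tmp.demo.fifo
mkfifo tmp.demo.fifo

# Open the FIFO read-write on fd 3 so the writer's open() returns immediately,
# but never actually read from it.
exec 3<>tmp.demo.fifo

# Try to write 1 MiB: the first ~64 KiB fit into the pipe buffer, then the
# write blocks until timeout kills it (exit status 124).
timeout 2 head -c 1048576 /dev/zero > tmp.demo.fifo
status=$?
echo "writer exit status: $status"

exec 3<&-          # close the lone read end
rm -f tmp.demo.fifo
```

    In your pipeline, paste is the reader that never comes: it is stuck waiting on tmp.col1.fifo, so nothing empties tmp.col2.fifo.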

    You can of course use temporary files to solve this, but I have the feeling you are trying to avoid that.

    Maybe you can "increase" the size of the FIFO with a tool like mbuffer:

      | tee >(cut -f1 | parallel --pipe -k -j4 -N1000 "$map" | mbuffer -q -m6M -b5 > tmp.col1.fifo) \
      | tee >(cut -f2 | parallel --pipe -k -j4 -N1000 "$map" | mbuffer -q -m6M -b5 > tmp.col2.fifo) \
      | cut -f3- | mbuffer -q -m6M -b5 \
      | paste tmp.col{1,2}.fifo - \
      | python -m tqdm > /dev/null
    

    But unless you know the nature of your data is not going to change, then this is a fragile solution that just kicks the can a bit further down the road.

    How about this instead?

    n=1000000
    map=cat   # identity map: inp -> out
    rm -f tmp.col{1,2,3,4}.fifo
    mkfifo tmp.col{1,2,3,4}.fifo
    paste <(seq $n) <(seq $n) <(seq $n) | cut -f1 | parallel --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo &
    paste <(seq $n) <(seq $n) <(seq $n) | cut -f2 | parallel --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo &
    paste <(seq $n) <(seq $n) <(seq $n) | cut -f3 > tmp.col3.fifo &
    paste <(seq $n) <(seq $n) <(seq $n) > tmp.col4.fifo &
    paste tmp.col{1,2,3,4}.fifo | python -m tqdm > /dev/null
    

    You will run a few more pastes, but if CPU is not a problem, then this should give you no race conditions.

    (Also: --id (aka. --semaphore-name) is not used with --pipe but only with --semaphore. See https://www.gnu.org/software/parallel/parallel_options_map.pdf)

    (Also also: If you do not need exactly 1000 records per chunk (-N1000), then --block is faster.)
