
I am trying to run a function through multiple processes in bash (on CentOS).
The function does different filtering over curl:

STEP=50
function filtering() {
    local i="$1"
    curl -s -g -X GET 'https://url.com?filter='"$i"'_filter'
}

for (( i=0; i<=TOTAL; i+=STEP )); do
    filtering "$i" 
done
wait

It works when run as is. However, when I run the function in 10 parallel processes it behaves oddly, producing different, mostly unreliable results each time (judging by the numbers it produces compared to the serial filtering above). My code is:

PARALLEL_PROCESSES=10
ACTIVE_PROCESSES=0
for (( i=0; i<=TOTAL; i+=STEP )); do
    filtering "$i" > /dev/null 2>&1 &
    (( ACTIVE_PROCESSES++ ))
    if (( ACTIVE_PROCESSES >= PARALLEL_PROCESSES )); then
        wait -n 2> /dev/null
        (( ACTIVE_PROCESSES-- )) 
    fi
done
wait 
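One way to rule out interleaved or lost output in a pattern like the one above is to give each background job its own output file and concatenate the files in a fixed order afterwards. A minimal sketch of that idea, with a stand-in `filtering` function (the sleep and the `result_for_` values are illustrative, not from the real API):

```shell
#!/bin/bash
# Sketch: each job writes to its own file, so parallel jobs cannot
# interleave their output; the results are then collected in a
# deterministic order regardless of completion order.
STEP=50
TOTAL=200
PARALLEL_PROCESSES=4
outdir=$(mktemp -d)

filtering() {
    sleep "0.0$((RANDOM % 3))"   # simulate variable response time
    echo "result_for_$1"         # stand-in for the real curl call
}

ACTIVE_PROCESSES=0
for (( i=0; i<=TOTAL; i+=STEP )); do
    filtering "$i" > "$outdir/$i.out" &
    (( ACTIVE_PROCESSES += 1 ))
    if (( ACTIVE_PROCESSES >= PARALLEL_PROCESSES )); then
        wait -n
        (( ACTIVE_PROCESSES -= 1 ))
    fi
done
wait

# Concatenate in loop order, not completion order
results=$(for (( i=0; i<=TOTAL; i+=STEP )); do cat "$outdir/$i.out"; done)
printf '%s\n' "$results"
rm -rf "$outdir"
```

If the per-file outputs are correct but the combined stream was not, the problem is in how the parallel output is collected rather than in curl itself.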

When I check all the printed filters, everything looks fine, so I think curl doesn’t receive something correctly.
GNU parallel doesn’t handle my task either; I thought debugging my own parallelisation would be easier.

  1. TOTAL is defined programmatically before the iteration starts. It expresses the total count of the API filters, which I then split into chunks of 50. Sometimes TOTAL can be 10K or more. Each '$i'_filter' value consists of 50 filtering elements, while the API supports up to 100. The request length limit is not reached, and neither the WAF nor the rate limit blocks me.

  2. This function retrieves some data about (let’s say) candies. The data is parsed with jq, like jq -r '.data[] | .id, .name, '"$label"' | @csv'

  3. By unreliable results I mean inconsistently different results between two or more closely launched individual runs, while the serial approach always delivers the same correct results. This makes me suspect that something chaotic is being submitted to the filter during parallelisation.
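One way to test the "something chaotic is being submitted" suspicion is to log the filter value each parallel job actually receives, then compare the logged set against the expected serial sequence. A sketch of that check; the URL is the placeholder from the question and no network calls are made:

```shell
#!/bin/bash
# Sketch: record what each parallel job would request, then diff the
# set of submitted filters against the expected serial sequence.
STEP=50
TOTAL=200
log=$(mktemp)

filtering() {
    # Stand-in for the curl call: record what would be requested.
    # Each echo is a single short O_APPEND write, so concurrent
    # appends from the jobs do not corrupt the log.
    echo "https://url.com?filter=${1}_filter" >> "$log"
}

for (( i=0; i<=TOTAL; i+=STEP )); do
    filtering "$i" &
done
wait

expected=$(for (( i=0; i<=TOTAL; i+=STEP )); do
    echo "https://url.com?filter=${i}_filter"
done)

if [ "$(sort "$log")" = "$(sort <<<"$expected")" ]; then
    result="match"
else
    result="mismatch"
fi
echo "$result"
rm -f "$log"
```

If the sets match, every iteration submitted exactly the filter it was supposed to, and the inconsistency lies downstream (curl, the server, or how the responses are collected).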

I’d really appreciate practical advice on how to debug what is going on/wrong inside, and any alternative suggestions for achieving my goal.

2 Answers


  1. For debugging, try replacing the external calls with something you control.
    I don’t know what the "unreliable" results are; can you reproduce them with

    #!/bin/bash
    STEP=50
    TOTAL=2000
    filtering() {
      local j="$1"
      sleeptime="0.$((RANDOM%10))"
      sleep $sleeptime
      printf "Filter %s\n" "$j" # curl -s -g -X GET 'https://url.com?filter='"$j"'_filter'
    }
    
    PARALLEL_PROCESSES=30
    ACTIVE_PROCESSES=0
    for (( i=0; i<=TOTAL; i+=STEP )); do
      filtering "$i" & # > /dev/null 2>&1 &
      (( ACTIVE_PROCESSES++ ))
      if (( ACTIVE_PROCESSES >= PARALLEL_PROCESSES )); then
        wait -n 2> /dev/null
        (( ACTIVE_PROCESSES-- ))
      fi
    done
    

    When it does reproduce your problem, you can reduce the script to minimal complexity. Changing STEP to 1 and TOTAL to 10 might help.
    Things you might notice are that the number of steps differs between the non-parallel and the parallel approach (TOTAL % STEP versus ACTIVE_PROCESSES) and that the output order is ‘random’ in the parallel case.

    When it does not reproduce your problem, curl (or the remote site) is causing it. Besides curl -v, you might try switching to a request without GET, or to another website.
    You might find that the site (or your firewall or network) misbehaves when it receives many parallel requests.
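The reproduce-or-not test above can be made more precise by running the same stand-in job serially and in parallel and diffing the sorted outputs: if the sorted outputs match, only the ordering differs; if not, content is being lost or corrupted. A sketch (the `filtering` function and `row_` values are illustrative):

```shell
#!/bin/bash
# Sketch: compare serial and parallel runs of the same stand-in job.
filtering() {
    sleep "0.0$((RANDOM % 3))"   # simulate variable response time
    echo "row_$1"
}

STEP=1
TOTAL=9

# Serial run: deterministic order.
serial=$(for (( i=0; i<=TOTAL; i+=STEP )); do filtering "$i"; done)

# Parallel run: all jobs share the capture pipe; short echo lines
# are written atomically, so content should survive even if the
# order is scrambled.
parallel=$(
    for (( i=0; i<=TOTAL; i+=STEP )); do filtering "$i" & done
    wait
)

if [ "$(sort <<<"$serial")" = "$(sort <<<"$parallel")" ]; then
    echo "same content, order may differ"
else
    echo "content differs"
fi
```

With a stand-in like this the sorted outputs should match; if the real curl version of the same comparison does not, the discrepancy points at the requests or the server rather than at the shell-level parallelisation.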

  2. Not an answer but too long for a comment. Try running:

    step=50
    (( i = -step ))
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    (( i += step )); curl -s -g -X GET 'https://url.com?filter='"$i"'_filter' &
    

    then edit your question to tell us whether that produces the expected output and, if not, what is wrong with the output, what error messages you get, etc.
