
Environment

OS: Ubuntu 20.04, CentOS 8, macOS Catalina 10.15.7
Language: C, C++
Compiler: gcc (most recent versions for each OS)

Issue

I am using the POSIX library function wordexp to get shell-like expansion of strings.
The expansion works fine with one exception: when I set the IFS environment variable to something other than whitespace, for example ‘:’, word splitting is unaffected and continues to happen on whitespace only, regardless of the IFS value.

bash test

Man page for wordexp for Linux https://man7.org/linux/man-pages/man3/wordexp.3.html states:

  1. "The function wordexp() performs a shell-like expansion of the string…"
  2. "Field splitting is done using the environment variable $IFS. If it is not set, the field separators are space, tab and newline."

This is why I expected wordexp to behave the same way as bash in this respect.
On all the listed OSes, bash gave exactly the expected result when I changed the character set used for splitting:
Using default (IFS is not set)

    read -a words <<<"1 2:3 4:5"
    for word in "${words[@]}"; do echo "$word";  done

correctly splits on space and produces the result:

    1
    2:3
    4:5

while setting IFS to ‘:’

    IFS=':' read -a words <<<"1 2:3 4:5"
    for word in "${words[@]}"; do echo "$word";  done

correctly splits on ‘:’ and produces the result:

    1 2
    3 4
    5

C code test

But running the code below yields the same result regardless of whether the IFS environment variable is set:

C Code:

    #include <stdio.h>
    #include <wordexp.h>
    #include <stdlib.h>
    
    /* Expand str with wordexp() and print each resulting word on its own line. */
    static void expand(char const *title, char const *str)
    {
        printf("%s input: %s\n", title, str);
        wordexp_t exp;
        int rcode = 0;
        if ((rcode = wordexp(str, &exp, WRDE_NOCMD)) == 0) {
            printf("output:\n");
            for (size_t i = 0; i < exp.we_wordc; i++)
                printf("%s\n", exp.we_wordv[i]);
            wordfree(&exp);
        } else {
            printf("expand failed %d\n", rcode);
        }
    }
    
    int main()
    {
        char const *str = "1 2:3 4:5";
        
        expand("No IFS", str);
    
        int rcode = setenv("IFS", ":", 1);
        if ( rcode != 0 ) {
            perror("setenv IFS failed");
            return 1;
        }
    
        expand("IFS=':'", str);
    
        return 0;
    }

The result in all OSes is the same:

    No IFS input: 1 2:3 4:5
    output:
    1
    2:3
    4:5
    IFS=':' input: 1 2:3 4:5
    output:
    1
    2:3
    4:5

Note: the snippet above was created for this post. I also tested with more complex code that verified that the environment variable was indeed set properly.

Source code review

I looked at the source code for the wordexp function implementation available at https://code.woboq.org/userspace/glibc/posix/wordexp.c.html and it appears that it does use $IFS but perhaps inconsistently or maybe this is a bug.
Specifically:
In the body of wordexp, which starts on line 2229, it does read the value of the IFS environment variable and process it:
lines 2273 – 2276:

     /* Find out what the field separators are.
       * There are two types: whitespace and non-whitespace.
       */
      ifs = getenv ("IFS");

But later in the function it does not seem to use the $IFS value for word separation.
This looks like a bug unless "field separators" on line 2273 and "word separator" on line 2396 mean different things.
lines 2395 – 2398:

          default:
            /* Is it a word separator? */
            if (strchr (" \t", words[words_offset]) == NULL)
            {

In any case, the code seems to use only space or tab as a splitter, unlike bash, which respects the separators set in IFS.

Questions

  1. Am I missing something and there is a way to get wordexp to split on characters other than whitespace?
  2. If the split is only on whitespace, is this a bug in the
    • gcc library implementation or
    • in the Linux man page for wordexp where they claim that $IFS can be used to define splitters

Many thanks in advance for all your comments and insights!

Answers Summary and workaround

The accepted answer hinted at how to split on non-whitespace characters from $IFS: set $IFS, store the string you want to split in a temporary environment variable, and then call wordexp on an expansion of that variable. This is demonstrated in the updated code below.
While the behavior visible in the source code may not actually be a bug, it definitely looks like a questionable design decision to me…
Updated code:

    #include <stdio.h>
    #include <wordexp.h>
    #include <stdlib.h>
    
    /* Expand str with wordexp() and print each resulting word on its own line. */
    static void expand(char const *title, char const *str)
    {
        printf("%s input: %s\n", title, str);
        wordexp_t exp;
        int rcode = 0;
        if ((rcode = wordexp(str, &exp, WRDE_NOCMD)) == 0) {
            printf("output:\n");
            for (size_t i = 0; i < exp.we_wordc; i++)
                printf("%s\n", exp.we_wordv[i]);
            wordfree(&exp);
        } else {
            printf("expand failed %d\n", rcode);
        }
    }
    
    int main()
    {
        char const *str = "1 2:3 4:5";
        
        expand("No IFS", str);
    
        int rcode = setenv("IFS", ":", 1);
        if ( rcode != 0 ) {
            perror("setenv IFS failed");
            return 1;
        }
    
        expand("IFS=':'", str);
        
        rcode = setenv("FAKE", str, 1);
        if ( rcode != 0 ) {
            perror("setenv FAKE failed");
            return 2;
        }
    
        expand("FAKE", "${FAKE}");    
    
        return 0;
    }

which produces the result:

    No IFS input: 1 2:3 4:5
    output:
    1
    2:3
    4:5
    IFS=':' input: 1 2:3 4:5
    output:
    1
    2:3
    4:5
    FAKE input: ${FAKE}
    output:
    1 2
    3 4
    5
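
For reuse, the workaround can be wrapped in a small helper that restores IFS afterwards, so the change does not leak into the rest of the program. This is only a sketch: the function name split_on and the temporary variable name WORDEXP_SPLIT_TMP are made up for this example, and it assumes an implementation such as glibc where wordexp reads IFS from the environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wordexp.h>

/* Hypothetical helper: split `value` on the characters in `seps` by routing
 * it through an unquoted parameter expansion, as in the workaround above.
 * Returns 0 on success (the caller must then call wordfree(out)), or a
 * nonzero wordexp()-style error code on failure. */
static int split_on(const char *value, const char *seps, wordexp_t *out)
{
    /* Remember the current IFS (if any) so it can be restored afterwards. */
    const char *old = getenv("IFS");
    char *saved = old ? strdup(old) : NULL;
    if (old && !saved)
        return WRDE_NOSPACE;

    int rc = WRDE_NOSPACE;
    if (setenv("IFS", seps, 1) == 0 &&
        setenv("WORDEXP_SPLIT_TMP", value, 1) == 0)
        rc = wordexp("$WORDEXP_SPLIT_TMP", out, WRDE_NOCMD);

    /* Undo the environment changes regardless of the outcome. */
    unsetenv("WORDEXP_SPLIT_TMP");
    if (saved) {
        setenv("IFS", saved, 1);
        free(saved);
    } else {
        unsetenv("IFS");
    }
    return rc;
}
```

Calling split_on("1 2:3 4:5", ":", &w) should yield the same three words as the FAKE case above. Note that setenv is not thread-safe, and the value may still be subject to other shell-like processing that wordexp applies to expansion results, so treat this as a starting point rather than a general-purpose splitter.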


Answers


  1. Let’s naively assume POSIX is understandable and try to work with it. Let’s take wordexp() from POSIX:

    The words argument is a pointer to a string containing one or more words to be expanded. The expansions shall be the same as would be performed by the command line interpreter if words were the part of a command line representing the arguments to a utility. […]

    So let’s go to "the command line interpreter". From the POSIX Shell Command Language:

    2.1 Shell Introduction

    […]

    1. The shell breaks the input into tokens: words and operators; see Token Recognition.
      […….]

    2.3 Token Recognition

    […]

    1. If the current character is an unquoted <blank>, any token containing the previous character is delimited and the current character shall be discarded.
    2. If the previous character was part of a word, the current character shall be appended to that word.

    […]

    Basically the whole 2.3 Token Recognition section applies here. This is what wordexp() does: token recognition plus some expansions. And here is the most important part, about field splitting (emphasis mine):

    After parameter expansion (Parameter Expansion), command substitution (Command Substitution), and arithmetic expansion (Arithmetic Expansion), the shell shall scan the results of expansions and substitutions that did not occur in double-quotes for field splitting and multiple fields can result.

    IFS affects field splitting: it affects how the results of other expansions are split into words. IFS does not affect how the input string is split into tokens; that is still done on <blank>, meaning tab or space. Hence the behavior you are seeing.

    In other words, when you type IFS=: in your terminal, then you don’t start separating tokens by IFS, like echo:Hello:World, but still continue separating parts of commands using spaces.

    Anyway, the man page is correct… :p

    Am I missing something and there is a way to get wordexp to split on characters other than whitespace?

    No. If you want to have spaces in words, quote the arguments, as you would in the shell. "a b" "c d" "e".

    If the split is only on whitespace, is this a bug in the

    None :p
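
To illustrate the quoting advice in the answer above, here is a minimal sketch (the helper name count_words is made up for this example) that expands a string and reports how many words come out. With double quotes around "a b" and "c d", the embedded spaces should stay inside their words, just as they would on a shell command line.

```c
#include <stdio.h>
#include <wordexp.h>

/* Hypothetical helper: expand `input` with wordexp(), print each word in
 * brackets, and return the number of words (or -1 if wordexp() fails). */
static int count_words(const char *input)
{
    wordexp_t exp;
    if (wordexp(input, &exp, WRDE_NOCMD) != 0)
        return -1;

    int count = (int)exp.we_wordc;
    for (size_t i = 0; i < exp.we_wordc; i++)
        printf("[%s]\n", exp.we_wordv[i]);

    wordfree(&exp);
    return count;
}
```

For example, count_words("\"a b\" \"c d\" e") should report three words, while the unquoted count_words("a b c d e") should report five.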

  2. You’re comparing apples to oranges. wordexp() splits a string up into individual tokens the same way the shell does. The shell builtin read doesn’t follow the same algorithm; it just does word splitting. You should be comparing wordexp() to how the arguments to a script or shell function are parsed:

    #!/bin/sh
    
    printwords() {
        for arg in "$@"; do
            printf "%s\n" "$arg"
        done
    }
    
    echo "No IFS input: 1 2:3 4:5"
    printwords 1 2:3 4:5
    echo "IFS=':' input: 1 2:3 4:5"
    IFS=:
    printwords 1 2:3 4:5
    

    This produces

    No IFS input: 1 2:3 4:5
    1
    2:3
    4:5
    IFS=':' input: 1 2:3 4:5
    1
    2:3
    4:5
    

    just like the C program.


    Now, for the interesting bit. I couldn’t find it explicitly mentioned as such in the POSIX documentation with a quick scan, but the bash manual has this to say about word splitting:

    Note that if no expansion occurs, no splitting is performed.

    Let’s try a version that does parameter expansion in its arguments:

    #!/bin/sh
    
    printwords() {
        for arg in "$@"; do
            printf "%s\n" "$arg"
        done
    }
    
    foo=2:3
    printf "foo = %s\n" "$foo"
    printf "No IFS input: 1 \$foo 4:5\n"
    printwords 1 $foo 4:5
    printf "IFS=':' input: 1 \$foo 4:5\n"
    IFS=:
    printwords 1 $foo 4:5
    

    which, when run via shells like dash, ksh93 or bash (but not zsh unless you turn on the SH_WORD_SPLIT option), produces

    foo = 2:3
    No IFS input: 1 $foo 4:5
    1
    2:3
    4:5
    IFS=':' input: 1 $foo 4:5
    1
    2
    3
    4:5
    

    As you can see, the argument that has a parameter was subject to field splitting, but not the literal one. Making the same change to the string in your C program and running foo=2:3 ./wordexp prints out the same thing.
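
The C-side change described in the second answer could look roughly like the sketch below. It is not the answerer's actual program: the function names expand_count and compare_ifs are made up here, and foo is set with setenv inside the program instead of on the command line. It assumes an implementation such as glibc, where wordexp reads IFS from the environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <wordexp.h>

/* Expand `input`, print each resulting word, and return the word count
 * (or -1 if wordexp() reports an error). */
static int expand_count(const char *input)
{
    wordexp_t exp;
    if (wordexp(input, &exp, WRDE_NOCMD) != 0)
        return -1;

    int count = (int)exp.we_wordc;
    for (size_t i = 0; i < exp.we_wordc; i++)
        printf("%s\n", exp.we_wordv[i]);

    wordfree(&exp);
    return count;
}

/* Expand "1 $foo 4:5" with foo=2:3, first with IFS unset and then with
 * IFS=":", storing the two word counts. Returns 0 on success, -1 if the
 * environment could not be modified. */
static int compare_ifs(int *default_count, int *colon_count)
{
    if (setenv("foo", "2:3", 1) != 0 || unsetenv("IFS") != 0)
        return -1;
    *default_count = expand_count("1 $foo 4:5");

    if (setenv("IFS", ":", 1) != 0)
        return -1;
    *colon_count = expand_count("1 $foo 4:5");
    return 0;
}
```

On glibc this should mirror the shell script: three words while IFS is unset, and four once IFS is ":", because only the expanded $foo is field-split while the literal 4:5 is left alone.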
