Environment
OS: Ubuntu 20.04, CentOS 8, macOS Catalina 10.15.7
Language: C, C++
Compiler: gcc (most recent versions for each OS)
Issue
I am using the POSIX library function wordexp to get shell-like expansion of strings.
The expansion works fine with one exception: when I set the $IFS environment variable to something other than whitespace, for example ‘:’, it does not seem to affect word splitting, which continues to be done on whitespace only, regardless of the IFS value.
bash test
Man page for wordexp for Linux https://man7.org/linux/man-pages/man3/wordexp.3.html states:
- "The function wordexp() performs a shell-like expansion of the string…"
- "Field splitting is done using the environment variable $IFS. If it is not set, the field separators are space, tab and newline."
This is why I expected wordexp to behave the same way as bash in this respect.
On all the listed OSes, bash gave exactly the expected result when I changed the character set used for splitting:
Using default (IFS is not set)
read -a words <<<"1 2:3 4:5"
for word in "${words[@]}"; do echo "$word"; done
correctly splits on space and produces the result:
1
2:3
4:5
while setting IFS to ‘:’
IFS=':' read -a words <<<"1 2:3 4:5"
for word in "${words[@]}"; do echo "$word"; done
correctly splits on ‘:’ and produces the result:
1 2
3 4
5
C code test
But running the code below yields the same result regardless of whether the IFS environment variable is set:
C Code:
#include <stdio.h>
#include <wordexp.h>
#include <stdlib.h>

/* Expand str with wordexp() and print each resulting word on its own line. */
static void expand(char const *title, char const *str)
{
    printf("%s input: %s\n", title, str);
    wordexp_t exp;
    int rcode = 0;
    if ((rcode = wordexp(str, &exp, WRDE_NOCMD)) == 0) {
        printf("output:\n");
        for (size_t i = 0; i < exp.we_wordc; i++)
            printf("%s\n", exp.we_wordv[i]);
        wordfree(&exp);
    } else {
        printf("expand failed %d\n", rcode);
    }
}

int main()
{
    char const *str = "1 2:3 4:5";
    expand("No IFS", str);
    int rcode = setenv("IFS", ":", 1);
    if (rcode != 0) {
        perror("setenv IFS failed");
        return 1;
    }
    expand("IFS=':'", str);
    return 0;
}
The result in all OSes is the same:
No IFS input: 1 2:3 4:5
output:
1
2:3
4:5
IFS=':' input: 1 2:3 4:5
output:
1
2:3
4:5
As a note, the snippet above was created for this post. I did test with more complex code that verified that the environment variable was indeed set properly.
Source code review
I looked at the source code for the wordexp function implementation available at https://code.woboq.org/userspace/glibc/posix/wordexp.c.html and it appears that it does use $IFS but perhaps inconsistently or maybe this is a bug.
Specifically:
In the body of wordexp, which starts on line 2229, it does get the value of the IFS environment variable and processes it:
lines 2273 – 2276:
/* Find out what the field separators are.
* There are two types: whitespace and non-whitespace.
*/
ifs = getenv ("IFS");
But then later on in the function it does not seem to use the $IFS values for word separation.
This looks like a bug unless "field separators" on line 2273 and "word separator" on line 2396 mean different things.
lines 2395 – 2398:
default:
/* Is it a word separator? */
if (strchr (" \t", words[words_offset]) == NULL)
{
But in any case the code seems to use only space or tab as a splitter, unlike bash, which respects the splitter values set in IFS.
Questions
- Am I missing something and there is a way to get wordexp to split on characters other than whitespace?
- If the split is only on whitespace, is this a bug in
  - the C library implementation (glibc on Linux), or
  - the Linux man page for wordexp, which claims that $IFS can be used to define splitters?
Many thanks in advance for all your comments and insights!
Answers Summary and workaround
The accepted answer contained a hint on how to split on the non-whitespace characters in $IFS: set $IFS, store the string that you want to split in a temporary environment variable, and then call wordexp on a reference to that variable. This is demonstrated in the updated code below.
While the behavior visible in the source code may not actually be a bug, it definitely looks like a questionable design decision to me…
Updated code:
#include <stdio.h>
#include <wordexp.h>
#include <stdlib.h>

/* Expand str with wordexp() and print each resulting word on its own line. */
static void expand(char const *title, char const *str)
{
    printf("%s input: %s\n", title, str);
    wordexp_t exp;
    int rcode = 0;
    if ((rcode = wordexp(str, &exp, WRDE_NOCMD)) == 0) {
        printf("output:\n");
        for (size_t i = 0; i < exp.we_wordc; i++)
            printf("%s\n", exp.we_wordv[i]);
        wordfree(&exp);
    } else {
        printf("expand failed %d\n", rcode);
    }
}

int main()
{
    char const *str = "1 2:3 4:5";
    expand("No IFS", str);
    int rcode = setenv("IFS", ":", 1);
    if (rcode != 0) {
        perror("setenv IFS failed");
        return 1;
    }
    expand("IFS=':'", str);
    /* Put the string into a variable so that it goes through parameter
       expansion, whose result is subject to field splitting on IFS. */
    rcode = setenv("FAKE", str, 1);
    if (rcode != 0) {
        perror("setenv FAKE failed");
        return 2;
    }
    expand("FAKE", "${FAKE}");
    return 0;
}
which produces the result:
No IFS input: 1 2:3 4:5
output:
1
2:3
4:5
IFS=':' input: 1 2:3 4:5
output:
1
2:3
4:5
FAKE input: ${FAKE}
output:
1 2
3 4
5
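For reuse, the workaround can be wrapped in a small helper. The following is only a sketch under some assumptions: the function name split_with_ifs and the temporary variable name WORDEXP_TMP are made up for illustration, and the helper restores the previous IFS value after the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <wordexp.h>

/* Hypothetical helper (not part of any library): split `str` on the
 * characters in `ifs` by routing it through a temporary environment
 * variable, so that wordexp() field-splits the result of a parameter
 * expansion. Returns wordexp()'s result code, or -1 if the environment
 * could not be prepared. On success the caller must wordfree(out). */
static int split_with_ifs(const char *str, const char *ifs, wordexp_t *out)
{
    const char *old = getenv("IFS");
    char *saved_ifs = NULL;
    int rcode = -1;

    if (old != NULL) {
        saved_ifs = strdup(old);          /* remember the caller's IFS */
        if (saved_ifs == NULL)
            return -1;
    }

    /* WORDEXP_TMP is an arbitrary name chosen for this sketch. */
    if (setenv("IFS", ifs, 1) == 0 && setenv("WORDEXP_TMP", str, 1) == 0)
        rcode = wordexp("${WORDEXP_TMP}", out, WRDE_NOCMD);

    /* Put the environment back the way it was before the call. */
    unsetenv("WORDEXP_TMP");
    if (saved_ifs != NULL)
        setenv("IFS", saved_ifs, 1);
    else
        unsetenv("IFS");
    free(saved_ifs);
    return rcode;
}
```

Calling split_with_ifs("1 2:3 4:5", ":", &exp) should give the same three words as the FAKE run above. Note that the helper modifies the process environment while it runs, so it is not safe to call from several threads at once.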
Answers
Let’s naively assume POSIX is understandable and try to work with it. Let’s take wordexp() from POSIX:
So let’s go to "the command line interpreter". From the POSIX Shell Command Language:
Basically the whole section 2.3 Token Recognition applies here. This is what wordexp() does: token recognition plus some expansions. And also the most important stuff about field splitting, emphasis mine:
IFS affects field splitting: it affects how the results of other expansions are split into words. IFS does not affect how the string is split into tokens; that is still done using <blank>, that is, tab or space. Hence the behavior you are seeing.
In other words, when you type IFS=: in your terminal, you don’t start separating tokens by IFS, like echo:Hello:World, but still continue separating parts of commands using spaces.
Anyway, the man page is correct… :p
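"Am I missing something and there is a way to get wordexp to split on characters other than whitespace?"
No. If you want to have spaces in words, quote the arguments, as you would in the shell: "a b" "c d" "e".
"If the split is only on whitespace, is this a bug in the gcc library implementation or in the Linux man page for wordexp?"
None :p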
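The quoting advice above can be checked with a few lines of C. This is a minimal sketch; print_words is a hypothetical helper name used only here, and the brackets in the output just make embedded spaces visible.

```c
#include <stdio.h>
#include <wordexp.h>

/* Hypothetical helper: expand `input`, print one word per line inside
 * brackets, and return the number of words, or -1 if wordexp() fails. */
static int print_words(const char *input)
{
    wordexp_t exp;
    if (wordexp(input, &exp, WRDE_NOCMD) != 0)
        return -1;
    for (size_t i = 0; i < exp.we_wordc; i++)
        printf("[%s]\n", exp.we_wordv[i]);
    int count = (int)exp.we_wordc;
    wordfree(&exp);
    return count;
}
```

With the input "a b" "c d" "e" (double quotes included in the string passed to wordexp) this yields three words, while the unquoted a b c d e yields five.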
You’re comparing apples to oranges. wordexp() splits a string up into individual tokens the same way the shell does. The shell builtin read doesn’t follow the same algorithm; it just does word splitting. You should be comparing wordexp() to how the arguments to a script or shell function are parsed:
This produces
just like the C program.
Now, for the interesting bit. I couldn’t find it explicitly mentioned as such in the POSIX documentation with a quick scan, but the bash manual has this to say about word splitting:
Let’s try a version that does parameter expansion in its arguments:
which when run via shells like dash, ksh93 or bash (but not zsh unless you turn on the SH_WORD_SPLIT option), produces
As you can see, the argument that has a parameter was subject to field splitting, but not the literal one. Making the same change to the string in your C program and running foo=2:3 ./wordexp prints out the same thing.
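To make that last point concrete on the C side, here is a rough sketch that mixes a parameter with literal text. The helper name expand_with_var is hypothetical, and the sketch sets the variables with setenv() instead of on the command line, leaving them set afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <wordexp.h>

/* Hypothetical helper: set IFS to `ifs` and the variable `name` to `value`
 * in the process environment (both stay set afterwards), then expand
 * `input`. Returns wordexp()'s result code, or -1 if setenv() fails.
 * On success the caller must wordfree(out). */
static int expand_with_var(const char *ifs, const char *name,
                           const char *value, const char *input,
                           wordexp_t *out)
{
    if (setenv("IFS", ifs, 1) != 0 || setenv(name, value, 1) != 0)
        return -1;
    return wordexp(input, out, WRDE_NOCMD);
}
```

With IFS set to ":" and foo set to "2:3", expanding the string 1 $foo 4:5 gives the four words 1, 2, 3 and 4:5 on glibc: the value that came from $foo is split on the colon, while the literal 4:5 stays in one piece.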