The following code compiles on g++, clang, and Visual Studio:
#define HEX(hex_) 0x##hex_
int main()
{
return HEX(BadC0de);
}
as does this modification, using C++14 digit separators:
return HEX(1'Bad'C0de);
But this won’t compile on g++ or clang (it works on Visual Studio):
#define HEX(hex_) 0x##hex_
int main()
{
return HEX(A'Bad'C0de);
}
g++ output:
<source>:4:1: warning: multi-character character constant [-Wmultichar]
4 | return HEX(A'Bad'C0de);
| ^
<source>: In function 'int main()':
<source>:4:17: error: expected ';' before user-defined character literal
4 | return HEX(A'Bad'C0de);
| ^~~~~~~~~
<source>:1:25: note: in definition of macro 'HEX'
1 | #define HEX(hex_) 0x##hex_
| ^~~~
<source>:4:17: error: unable to find character literal operator 'operator""C0de' with 'int' argument
4 | return HEX(A'Bad'C0de);
| ^~~~~~~~~
<source>:1:25: note: in definition of macro 'HEX'
1 | #define HEX(hex_) 0x##hex_
| ^~~~
UPDATE: interestingly, the preprocessor output for this is
return 0xA'Bad'C0de;
which does compile, so obviously the standalone preprocessor is working differently here than the unified preprocessor.
This also fails on g++/clang, but with different errors:
return HEX(Bad'C0de);
g++ output:
<source>:4:19: warning: missing terminating ' character
4 | return HEX(Bad'C0de);
| ^
<source>:5:2: error: unterminated argument list invoking macro "HEX"
5 | }
| ^
<source>: In function 'int main()':
<source>:4:12: error: 'HEX' was not declared in this scope
4 | return HEX(Bad'C0de);
| ^~~
<source>:4:15: error: expected ';' at end of input
4 | return HEX(Bad'C0de);
| ^
| ;
<source>:4:15: error: expected '}' at end of input
<source>:3:1: note: to match this '{'
3 | {
| ^
UPDATE: the preprocessor stops before parsing the HEX() argument in this case.
I’d like to believe this is a g++ bug, but given how badly noncompliant Visual Studio’s preprocessor has historically been, perhaps that is wishful thinking. And in fact, that last program not only fails on g++, it also triggers an internal compiler error on Visual Studio (at least on godbolt.org)!
msvc output:
<source>(4): error C2001: newline in constant
<source>(4): fatal error C1057: unexpected end of file in macro expansion
Internal Compiler Error in Z:\opt\compiler-explorer\windows\19.00.24210\bin\amd64\cl.exe. You will be prompted to send an error report to Microsoft later.
INTERNAL COMPILER ERROR in 'Z:\opt\compiler-explorer\windows\19.00.24210\bin\amd64\cl.exe'
Please choose the Technical Support command on the Visual C++
Help menu, or open the Technical Support help file for more information
Naively, I would have expected all the compilers to just pass all text to the macro substitution before trying to interpret its meaning (it is a PRE-processor after all!); only after the ## concatenation would I expect the token to be examined for meaning. (Yes, I know that some basic parsing happens to match parentheses, brackets, etc. so that commas within them don’t split arguments, but I would not expect that to extend to any other language constructs.)
Does the standard have anything to say about these programs? Are they somehow non-conformant, or are they legal and the compilers are buggy?
2 Answers
This is one of those nasty holes in the spec. The preprocessor is defined (in the spec) in terms of "preprocessing tokens". The input is first split into a sequence of preprocessing tokens and then macro processing happens on that sequence.
Now the problem comes from the fact that 0xA'Bad'C0de is a single preprocessing token, but A'Bad'C0de is not: it is split into separate preprocessing tokens (A, 'Bad', and C0de; C++11 and later actually lex 'Bad'C0de as a single user-defined character literal, which is what the g++ diagnostic above reports), and the token-paste operator ## is defined to paste just the two tokens adjacent to it. Producing the result the question expects would make the tokenization phase depend on which macros have been defined and what they might do.
Fixing this would require non-trivial spec changes: the preprocessor would have to track directly adjacent preprocessing tokens separately from non-adjacent ones (those that have whitespace or comments between them), and the ## operator would have to paste additional directly adjacent tokens when that makes sense.
This would still have problems with things like HEX(A'B): how would you tell whether the ) should be part of a multichar character constant token or should end the macro argument list?

Preprocessing tokens notably include preprocessing numbers, which are a superset of regular numeric literals. 1'Bad'C0de is a valid preprocessing number: it begins with a decimal digit and conforms to the grammar of pp-number, so it is treated as a whole by ##. A'Bad'C0de, on the other hand, is not a preprocessing number, since it doesn’t begin with a digit.
This quirk is implemented correctly in g++ and clang, as well as in MSVC’s new preprocessor (which can be enabled with /Zc:preprocessor). MSVC’s traditional preprocessor is based on character buffers rather than preprocessing tokens, which is non-conforming.
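The pp-number rule can be checked at compile time. The following is a minimal sketch, not something taken from either answer: HEX is the macro from the question, the constant names kPasted and kPadded are hypothetical, and the leading-zero spelling is only an illustration that follows from the rule above (an argument that begins with a digit is lexed as one preprocessing number). It assumes a conforming C++14 preprocessor and a 32-bit unsigned int.

```cpp
// Sketch only: HEX comes from the question; kPasted and kPadded are
// hypothetical names introduced for illustration.
// Requires C++14 or later (digit separators).
#define HEX(hex_) 0x##hex_

// The argument begins with a digit, so it is a single pp-number and
// 0x pastes onto the whole thing, giving 0x1'Bad'C0de.
constexpr unsigned kPasted = HEX(1'Bad'C0de);

// A leading 0 also makes the argument a single pp-number, so the
// paste yields 0x0A'Bad'C0de, which has the value 0xABadC0de.
constexpr unsigned kPadded = HEX(0A'Bad'C0de);

static_assert(kPasted == 0x1BadC0de, "digit-led argument pastes as one token");
static_assert(kPadded == 0xABadC0deu, "leading zero keeps the argument one token");
```

If the translation unit compiles, both arguments were tokenized as single preprocessing numbers before ## ran. Dropping the leading 0 from the second call reproduces the failing case from the question on g++ and clang.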