
While porting legacy code to C++20, I replaced ordinary string literals (containing text expected to be UTF-8 encoded) with UTF-8 string literals (the ones prefixed with u8).

In doing so, I ran into an issue with the octal escape sequences I had used in the past to encode UTF-8 sequences byte by byte:

While
"\303\274" was the proper encoding of ü,
u8"\303\274" ended up as Ã¼.

I investigated this further and found the following on cppreference.com:

  1. For each numeric escape sequence, given v as the integer value represented by the octal or hexadecimal number comprising the sequence of digits in the escape sequence, and T as the string literal’s array element type (see the table above):

    • If v does not exceed the range of representable values of T, then the escape sequence contributes a single code unit with value v.

(Emphasis mine)

In my own words: In UTF-8 string literals, octal (\ooo) and hex (\xXX) escape sequences are interpreted as Unicode code points, similar to Unicode escape sequences (\uXXXX and \UXXXXXXXX).

Hence, this appeared reasonable to me: for UTF-8 string literals, Unicode escape sequences should be favored over the byte-wise octal sequences I had used in the past.
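
For ü, that replacement would look like this (a minimal sketch of my own, not taken from the actual legacy code; the variable names are just for illustration):

// legacy spelling: the UTF-8 code units of ü (U+00FC), written byte for byte
const char* const legacy = "\303\274";          // 0xc3 0xbc
// replacement: spell the code point and let the compiler emit the code units
const char8_t* const replacement = u8"\u00fc";  // 0xc3 0xbc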

Out of curiosity (and for the purpose of demonstration), I made a small test on coliru and was surprised to see that, with g++ -std=c++20, the octal sequences are still interpreted as single bytes. With the above quote in mind, I came to the conclusion:

MSVC seems to be correct, and g++ wrong.

I made an MCVE which I ran in my local Visual Studio 2019:

#include <iostream>
#include <string_view>

void dump(std::string_view text)
{
  const char digits[] = "0123456789abcdef";
  for (unsigned char c : text) {
    std::cout << ' '
      << digits[c >> 4]
      << digits[c & 0xf];
  }
}

#define DEBUG(...) std::cout << #__VA_ARGS__ << ";\n"; __VA_ARGS__

int main()
{
  DEBUG(const char* const text = "\344\270\255");
  DEBUG(dump(text));
  std::cout << '\n';
  DEBUG(const char8_t* const u8text = u8"\344\270\255");
  DEBUG(dump((const char*)u8text));
  std::cout << '\n';
  DEBUG(const char8_t* const u8textU = u8"\u4e2d");
  DEBUG(dump((const char*)u8textU));
  std::cout << '\n';
}

Output for MSVC:

const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 c3 a4 c2 b8 c2 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad

(Please note that the dumps of the 1st and 3rd literals are identical, while the 2nd one yields longer UTF-8 sequences because each octal escape is interpreted as a Unicode code point.)
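
To read the MSVC output for the 2nd literal, here is a small sketch of my own (an interpretation of the observed bytes, not a claim about MSVC internals): each octal escape value seems to be taken as a code point and UTF-8-encoded on its own.

#include <cstdio>
#include <initializer_list>

// Encodes a code point in the range [0x80, 0x7FF] as two UTF-8 code units.
void encode2(unsigned cp)
{
  std::printf(" %02x %02x", 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
}

int main()
{
  // Encoding each octal escape value (0344, 0270, 0255) individually
  // reproduces the MSVC dump: c3 a4 c2 b8 c2 ad.
  for (unsigned cp : {0344u, 0270u, 0255u}) encode2(cp);
  std::printf("\n");
}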

The same code run in Compiler Explorer, compiled with g++ (13.2):

const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 e4 b8 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad

The same code run in Compiler Explorer, compiled with clang (17.0.1):

const char* const text = "\344\270\255";
dump(text);
 e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
 e4 b8 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
 e4 b8 ad

Demo on Compiler Explorer

Is my conclusion correct that MSVC behaves correctly according to the C++ standard, in contrast to g++ and clang?


What I had found by web search beforehand:


Using hex escape sequences instead of octal sequences doesn’t change anything: Demo on Compiler Explorer.

I preferred the somewhat unusual octal sequences because they are limited to 3 digits, so no unrelated character can unintentionally extend them, in contrast to hex sequences.
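
An example of the pitfall I mean (my own illustration, not from the legacy code):

// A hex escape consumes every following hexadecimal digit, so "\xE4b8" is a
// single (out-of-range) escape rather than 0xE4 followed by "b8".
const char* const hexSplit = "\xE4" "b8"; // needs a literal split to end the escape
// An octal escape stops after at most three digits, so no split is needed.
const char* const octPlain = "\344b8";    // 0xE4 'b' '8'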


Update:

When I was about to file a bug report for MSVC, I realized that this had already been done:
escape sequences in unicode string literals are overencoded (non conforming => compiler bug)

2 Answers


  1. In my own words: In UTF-8 string literals, octal (\ooo) and hex (\xXX) escape sequences are interpreted as Unicode code points, similar to Unicode escape sequences (\uXXXX and \UXXXXXXXX).

    No, this is incorrect. In UTF-8, a code unit means an 8-bit unit (= byte), and a Unicode code point is represented by a sequence of one or more code units. Each octal escape sequence corresponds to a single code unit, which is different from Unicode escape sequences, which correspond to code points.

    So GCC and Clang are correct, and MSVC is faulty.
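
    For illustration, a short example of my own (not part of the original answer), assuming a conforming compiler; the array sizes include the terminating code unit:

    const char8_t a[] = u8"\303\274";     // two code units 0xC3 0xBC = one code point U+00FC (ü)
    const char8_t b[] = u8"\u00fc";       // one code point U+00FC -> two code units 0xC3 0xBC
    const char8_t c[] = u8"\u00c3\u00bc"; // two code points U+00C3 U+00BC -> four code units
    static_assert(sizeof(a) == 3 && sizeof(b) == 3 && sizeof(c) == 5);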

  2. These results are due to a known defect in the MSVC compiler that is described in the "Semantics of numeric-escape-sequences in UTF-8 literals" section of P2029 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals) as adopted for C++23. That paper clarified the intended behavior for numeric escape sequences in UTF-8 literals (and resolved the related CWG 1656 and CWG 2333 issues in doing so).

    Microsoft’s language conformance documentation does not claim conformance with that paper as of today (2024-01-24).
