While porting legacy code to C++20, I replaced ordinary string literals (containing expected UTF-8 encoded text) with UTF-8 string literals (the ones prefixed with u8).
Thereby, I ran into an issue with the octal escape sequences I had used in the past to encode UTF-8 sequences byte by byte: while "\303\274" was the proper encoding of ü, u8"\303\274" ended up as Ã¼.
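For reference, a minimal sketch (my addition, not part of the original port) showing that the two octal escapes really are the two UTF-8 code units of ü; it assumes a conforming compiler:

#include <cstdio>

int main()
{
    // "\303\274" encodes the bytes 0xc3 0xbc: the UTF-8 sequence for U+00FC (ü).
    const char text[] = "\303\274";
    for (const char* p = text; *p; ++p)
        std::printf("%02x ", static_cast<unsigned>(static_cast<unsigned char>(*p)));
    std::printf("\n");    // prints: c3 bc
}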
I investigated this further and found on cppreference.com:
For each numeric escape sequence, given v as the integer value represented by the octal or hexadecimal number comprising the sequence of digits in the escape sequence, and T as the string literal’s array element type (see the table above):
- If v does not exceed the range of representable values of T, then the escape sequence contributes a single code unit with value v.
(Emphasis mine)
In my own words: in UTF-8 string literals, octal (\ooo) and hex (\xXX) escape sequences are interpreted as Unicode code points, similar to the Unicode escape sequences (\uXXXX and \UXXXXXXXX).
Hence, this appeared reasonable to me: for UTF-8 string literals, Unicode escape sequences should be favored over the byte-wise octal sequences I used in the past.
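To illustrate that migration, a small sketch of my own (assuming a conforming compiler): the \u form names the code point, and the compiler itself produces the UTF-8 code units:

#include <cstdio>
#include <initializer_list>

int main()
{
    // The octal escapes name the two code units directly; the \u escape
    // names the code point U+00FC, which the compiler encodes as UTF-8.
    const char8_t byte_wise[]  = u8"\303\274";
    const char8_t code_point[] = u8"\u00fc";
    for (const char8_t* p : {byte_wise, code_point}) {
        for (; *p; ++p)
            std::printf("%02x ", static_cast<unsigned>(*p));
        std::printf("\n");    // both lines: c3 bc (on a conforming compiler)
    }
}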
Out of curiosity (and for the purpose of demonstration), I made a small test on coliru and was surprised to see that with g++ -std=c++20, the octal sequences are still interpreted as single bytes. With the above quote in mind, I came to the conclusion:
MSVC seems to be correct, and g++ wrong.
I made an MCVE which I ran in my local Visual Studio 2019:
#include <iostream>
#include <string_view>

// Prints each byte of text as two hexadecimal digits.
void dump(std::string_view text)
{
    const char digits[] = "0123456789abcdef";
    for (unsigned char c : text) {
        std::cout << ' '
                  << digits[c >> 4]
                  << digits[c & 0xf];
    }
}

// Echoes the statement as text before executing it.
#define DEBUG(...) std::cout << #__VA_ARGS__ << ";\n"; __VA_ARGS__

int main()
{
    DEBUG(const char* const text = "\344\270\255");
    DEBUG(dump(text));
    std::cout << '\n';
    DEBUG(const char8_t* const u8text = u8"\344\270\255");
    DEBUG(dump((const char*)u8text));
    std::cout << '\n';
    DEBUG(const char8_t* const u8textU = u8"\u4e2d");
    DEBUG(dump((const char*)u8textU));
    std::cout << '\n';
}
Output for MSVC:
const char* const text = "\344\270\255";
dump(text);
e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
c3 a4 c2 b8 c2 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
e4 b8 ad
(Please note that the dumps of the 1st and 3rd literals are identical, while the second literal results in a longer UTF-8 sequence because each octal escape is interpreted as a Unicode code point and encoded separately.)
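In fact, MSVC's bytes can be reproduced by hand (a sketch of my own): treating each octal value as a code point and encoding it separately as UTF-8 yields exactly c3 a4 c2 b8 c2 ad:

#include <cstdio>

int main()
{
    // MSVC treats \344 \270 \255 as the code points U+00E4, U+00B8, U+00AD
    // and encodes each one separately as a two-byte UTF-8 sequence.
    const unsigned code_points[] = {0344, 0270, 0255};   // 0xe4, 0xb8, 0xad
    for (unsigned cp : code_points) {
        // Two-byte UTF-8 encoding for code points in [0x80, 0x7ff]:
        std::printf("%02x %02x ", 0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    }
    std::printf("\n");    // c3 a4 c2 b8 c2 ad
}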
The same code, run in Compiler Explorer and compiled with g++ (13.2):
const char* const text = "\344\270\255";
dump(text);
e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
e4 b8 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
e4 b8 ad
The same code, run in Compiler Explorer and compiled with clang (17.0.1):
const char* const text = "\344\270\255";
dump(text);
e4 b8 ad
const char8_t* const u8text = u8"\344\270\255";
dump((const char*)u8text);
e4 b8 ad
const char8_t* const u8textU = u8"\u4e2d";
dump((const char*)u8textU);
e4 b8 ad
Is my conclusion correct that MSVC handles this correctly according to the C++ standard, in opposition to g++ and clang?
What I found by web search beforehand:
- C++20 with u8, char8_t and std::string
- Using UTF-8 string-literal prefixes portably between C++17 and C++20
Using hex escape sequences instead of octal sequences doesn’t change anything: Demo on Compiler Explorer.
I preferred the somewhat unusual octal sequences because they are limited to three digits: no unrelated character can extend them unintentionally, in contrast to hex sequences.
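To demonstrate that pitfall (my example, not from the original post): a hex escape greedily consumes every following hex digit, while an octal escape stops after three digits:

#include <cstdio>

int main()
{
    // Octal escapes stop after three digits, so 'a' cannot extend them:
    const char with_octal[] = "\303\274abc";        // c3 bc 61 62 63

    // In "\xC3\xBCabc" the trailing 'a', 'b', 'c' are hex digits and would
    // be swallowed into the second escape, putting its value out of range
    // for char (compilers reject this). Literal concatenation restores
    // the intended grouping:
    const char with_hex[] = "\xC3\xBC" "abc";       // c3 bc 61 62 63

    for (const char* s = with_octal; *s; ++s)
        std::printf("%02x ", static_cast<unsigned>(static_cast<unsigned char>(*s)));
    std::printf("\n");
    for (const char* s = with_hex; *s; ++s)
        std::printf("%02x ", static_cast<unsigned>(static_cast<unsigned char>(*s)));
    std::printf("\n");
}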
Update:
When I was about to file a bug report for MSVC, I realized that this had already been done:
escape sequences in unicode string literals are overencoded (non conforming => compiler bug)
2 Answers
No, this is incorrect. In UTF-8, a code unit is an 8-bit unit (a byte), and a Unicode code point is represented by a sequence of one or more code units. Each octal escape sequence corresponds to a single code unit, unlike a Unicode escape sequence, which corresponds to a code point.
So GCC and Clang are correct, and MSVC is faulty.
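A quick way to see the code unit versus code point distinction (a sketch of mine; the commented results assume a conforming compiler):

#include <iostream>
#include <string>

int main()
{
    // One code point (U+00FC), two UTF-8 code units:
    const std::u8string from_code_point = u8"\u00fc";
    std::cout << from_code_point.size() << '\n';    // 2

    // Each octal escape contributes exactly one code unit:
    const std::u8string from_code_units = u8"\303\274";
    std::cout << from_code_units.size() << '\n';    // 2 on GCC/Clang;
                                                    // MSVC's defect yields 4
    std::cout << std::boolalpha
              << (from_code_point == from_code_units) << '\n';  // true (conforming)
}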
These results are due to a known defect in the MSVC compiler that is described in the "Semantics of numeric-escape-sequences in UTF-8 literals" section of P2029 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals) as adopted for C++23. That paper clarified the intended behavior for numeric escape sequences in UTF-8 literals (and resolved the related CWG 1656 and CWG 2333 issues in doing so).
Microsoft’s language conformance documentation does not claim conformance with that paper as of today (2024-01-24).
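Until MSVC implements P2029, a portable workaround (my suggestion, not part of the answer) is to avoid numeric escapes in u8 literals altogether: either name the code point with \u, or spell out the code units in an explicit array:

#include <cstdio>

int main()
{
    // Option 1: name the code point; the compiler emits the UTF-8 code units.
    const char8_t a[] = u8"\u4e2d";

    // Option 2: spell out the UTF-8 code units of U+4E2D explicitly.
    const char8_t b[] = {0xe4, 0xb8, 0xad, 0};

    for (const char8_t* p = a; *p; ++p)
        std::printf("%02x ", static_cast<unsigned>(*p));
    std::printf("\n");    // e4 b8 ad
    for (const char8_t* p = b; *p; ++p)
        std::printf("%02x ", static_cast<unsigned>(*p));
    std::printf("\n");    // e4 b8 ad
}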