Difficulty getting grapheme lengths with ICU in C++ - Ubuntu

coreyp_1
November 25, 2022
109 views
0 votes
2 Answers

Finding examples for ICU is difficult, but here is what I’m trying to do. I need to be able to carve graphemes out of strings. In order to do this, I need to get the sequence of grapheme lengths in bytes from the string, so I’m trying to do this using a BreakIterator.

I decided to test this with 3 characters: $ is one byte in UTF-8, £ is two bytes in UTF-8, and 円 is three bytes in UTF-8.
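To make the byte arithmetic concrete, here is a minimal sketch that walks a UTF-8 string and records the byte offset at which each code point starts. It uses only the standard library, assumes well-formed UTF-8, and the function names `utf8SequenceLength` and `codePointByteOffsets` are hypothetical helpers for illustration. It counts code points, not graphemes; the two coincide only because every grapheme in the test string is a single code point.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Assumes well-formed input; returns 1 for an unexpected byte so callers always advance.
std::size_t utf8SequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;          // 0xxxxxxx: ASCII, e.g. '$'
  if ((lead >> 5) == 0x06) return 2;  // 110xxxxx: e.g. U+00A3
  if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx: e.g. U+5186
  if ((lead >> 3) == 0x1E) return 4;  // 11110xxx: outside the BMP
  return 1;
}

// Byte offset of every code point boundary, including the end of the string.
std::vector<std::size_t> codePointByteOffsets(const std::string& utf8) {
  std::vector<std::size_t> offsets;
  std::size_t pos = 0;
  while (pos < utf8.size()) {
    offsets.push_back(pos);
    pos += utf8SequenceLength(static_cast<unsigned char>(utf8[pos]));
  }
  offsets.push_back(utf8.size());
  return offsets;
}
```

For the test string, written as raw bytes `"$\xC2\xA3$\xE5\x86\x86$"`, this gives the boundaries 0, 1, 3, 4, 7, 8, which are the offsets I am hoping to get from the BreakIterator.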

I expected that calling iter->current() would return the byte offset within the string, but it does not. It returns an incrementing "count" of some sort, but that does not correspond to the code point position within the string, much less the overall grapheme length.

The documentation that I found as well as another SO question, however, implies that it should be returning the byte offset. In my example, the string is 8 bytes long, but only contains 5 graphemes. The loop stops (correctly) after processing the last grapheme but, as you can see from the output, iter->current() never increases by more than 1, even though the grapheme that it is processing is sometimes most certainly larger than one byte long.

Here is the code and output.

Setup:
Ubuntu 22.04 (WSL2)
ICU Installed via apt: icu-devtools, libicu-dev, and libicu70
My C++ program’s compile string:

g++ -std=c++20 main2.cpp `pkg-config --libs --cflags icu-i18n icu-uc icu-io`

Minimal file:

#include <memory>
#include <cassert>
#include <cstring>
#include <iostream>
#include <unicode/uconfig.h>
#include <unicode/ustring.h>
#include <unicode/brkiter.h>

using namespace std;

int main() {
  const char * s = "$\u00A3$\u5186$";
  UErrorCode err = U_ZERO_ERROR;
  unique_ptr<icu::BreakIterator> iter(icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), err));
  assert(U_SUCCESS(err));
  iter->setText(s);
  auto current = iter->current();
  while (iter->next() != icu::BreakIterator::DONE) {
    cout << current << endl;
    current = iter->current();
  }
  cout << current << endl;
  cout << "String length  : " << strlen(s) << endl;
  cout << "String contents: " << s << endl;
  return 0;
}

Output:

0
1
2
3
4
5
String length  : 8
String contents: $£$円$

I would have expected the list to be:

0
1
3
4
7
8

I’ve been staring at this for a few days… am I just missing something painfully obvious?

Tags: c++ grapheme-cluster icu

Answers

- decocijo
- November 27, 2022 at 6:26 pm
- 0 votes
The ICU BreakIterator returns the Unicode code points. If you want to get the Unicode code units you should use StringCharacterIterator which can return both. You’ll want to use StringCharacterIterator::next() to get the code units. StringCharacterIterator::next32() would give you the code points.


- NM
- November 27, 2022 at 8:12 pm
- 0 votes
In this case icu::BreakIterator cannot possibly tell you the byte offset, because it knows nothing about the bytes in your string. What happens is that s is implicitly converted to a UnicodeString object, which is then passed to setText. UnicodeString is always UTF-16 encoded, so the offsets you get back count UTF-16 code units rather than UTF-8 bytes. This is not what you want.
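As a rough illustration of why the offsets stop at 5 rather than 8, the sketch below counts how many UTF-16 code units a UTF-8 string occupies once converted. It is plain standard C++ and does not call ICU; `utf16UnitCount` is a hypothetical helper name, and the code assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts the UTF-16 code units needed to hold a well-formed UTF-8 string.
std::size_t utf16UnitCount(const std::string& utf8) {
  std::size_t units = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) == 0x80) continue;  // continuation byte, already counted via its lead byte
    units += (byte >= 0xF0) ? 2 : 1;      // 4-byte sequences become a surrogate pair
  }
  return units;
}
```

For the string in the question this returns 5, while `strlen` reports 8, which matches the offsets 0 through 5 that the iterator produced.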

You want to use the other overload of setText, the one that works with UText*. UText supports UTF-8. You need to create a UText explicitly.
```
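// Requires #include <unicode/utext.h>
UText* ut = utext_openUTF8(nullptr, s, strlen(s), &err); // C interface; the text is read as UTF-8
assert(U_SUCCESS(err));
iter->setText(ut, err); // the iterator keeps its own shallow clone of ut
assert(U_SUCCESS(err));
utext_close(ut); // the caller still owns ut; s itself must outlive the iterator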
```
This will print your desired output.