
Finding examples for ICU is difficult, but here is what I’m trying to do. I need to be able to carve graphemes out of strings. In order to do this, I need to get the sequence of grapheme lengths in bytes from the string, so I’m trying to do this using a BreakIterator.

I decided to test this with 3 characters. $ is one byte in UTF-8, £ is two bytes in UTF-8, and 円 is 3 bytes in UTF-8.

I expected that calling iter->current() would return the byte offset within the string, but it does not. It returns an incrementing "count" of some sort, but that does not correspond to the code point position within the string, much less the overall grapheme length.

The documentation that I found as well as another SO question, however, implies that it should be returning the byte offset. In my example, the string is 8 bytes long, but only contains 5 graphemes. The loop stops (correctly) after processing the last grapheme but, as you can see from the output, iter->current() never increases by more than 1, even though the grapheme that it is processing is sometimes most certainly larger than one byte long.

Here is the code and output.

Setup:
Ubuntu 22.04 (WSL2)
ICU Installed via apt: icu-devtools, libicu-dev, and libicu70
My C++ program’s compile string:

g++ -std=c++20 main2.cpp `pkg-config --libs --cflags icu-i18n icu-uc icu-io`

Minimal file:

#include <memory>
#include <cassert>
#include <cstring>
#include <iostream>
#include <unicode/uconfig.h>
#include <unicode/ustring.h>
#include <unicode/brkiter.h>

using namespace std;

int main() {
  const char * s = "$\u00A3$\u5186$";
  UErrorCode err = U_ZERO_ERROR;
  unique_ptr<icu::BreakIterator> iter(icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), err));
  assert(U_SUCCESS(err));
  iter->setText(s);
  auto current = iter->current();
  while (iter->next() != icu::BreakIterator::DONE) {
    cout << current << endl;
    current = iter->current();
  }
  cout << current << endl;
  cout << "String length  : " << strlen(s) << endl;
  cout << "String contents: " << s << endl;
  return 0;
}

Output:
[Image: output of code execution, showing that iter->current() only increases by one each time through the loop]

I would have expected the list to be:

0
1
3
4
7
8
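
The expected list can be cross-checked without ICU. The sketch below is only an illustration: utf8_boundaries is a hypothetical helper, not an ICU function, and it assumes well-formed UTF-8. It marks code point boundaries rather than grapheme boundaries, so it agrees with the list above only because every grapheme in this test string is a single code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of ICU): byte offsets of every code point
// boundary in a well-formed UTF-8 string, including 0 and the total length.
std::vector<std::size_t> utf8_boundaries(std::string_view s) {
  // Precondition: the text must not start in the middle of a code point.
  assert(s.empty() || (static_cast<unsigned char>(s[0]) & 0xC0) != 0x80);

  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i < s.size(); ++i) {
    // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
    if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
      offsets.push_back(i);
    }
  }
  offsets.push_back(s.size());  // end-of-text boundary
  return offsets;
}
```

Called on the test string written with explicit bytes, "$\xC2\xA3$\xE5\x86\x86$", this yields 0, 1, 3, 4, 7, 8.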

I’ve been staring at this for a few days… am I just missing something painfully obvious?

2 Answers


  1. The ICU BreakIterator returns positions as indices into the text it iterates over, which here is UTF-16, so for these characters they coincide with code point positions rather than byte offsets. If you want to get the Unicode code units you should use StringCharacterIterator, which can return both: StringCharacterIterator::next() gives you the code units, and StringCharacterIterator::next32() gives you the code points.

  2. In this case icu::BreakIterator cannot possibly tell you the byte offset because it has no idea about the bytes in your string. What happens is that s is implicitly converted to a UnicodeString object, which is then passed to setText. UnicodeString is always UTF-16 encoded, so the positions you get back are UTF-16 code unit indices. This is not what you want.

    You want to use the other overload of setText, the one that works with UText*. UText supports UTF-8. You need to create a UText explicitly.

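    // Requires #include <unicode/utext.h>
    UText* ut = utext_openUTF8(nullptr, s, strlen(s), &err); // C interface; s must outlive the iteration
    assert(U_SUCCESS(err));
    iter->setText(ut, err); // boundaries are now byte offsets into s
    assert(U_SUCCESS(err));
    // ... run the loop as before, then release the UText:
    // utext_close(ut);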
    

    This will print your desired output.
