
I’m trying to print the first 30 characters of some UTF-8 strings, and I noticed that Java’s String.substring() returns some funky strings. I’ve boiled it down to a single character:

I expect "🤣" to be a String of length 1, and String.substring() not to cut it in the middle. Why is my expectation not met? Java thinks it has length 2.

I’m pretty sure the UTF-8 encoding for 🤣 (U+1F923, "Rolling on the Floor Laughing") is:

0xF0 0x9F 0xA4 0xA3

And so I expect this tiny program:

import java.nio.charset.StandardCharsets;

public class Foo {

  public static void main(String[] args){
    String str = "🤣";
    // These are the UTF-8 bytes for "ROLLING ON THE FLOOR LAUGHING"
    byte[] raw = {(byte)0xf0, (byte)0x9f, (byte)0xa4, (byte)0xa3};
    String str2 = new String(raw, StandardCharsets.UTF_8);
    System.out.println(str.equals(str2));
    System.out.println(str.length());
    System.out.println(str.substring(0,1));
  }
}

To print out:

true
1
🤣

But in fact it prints out:

true
2
?

Am I doing something wrong?

I’ve tried a custom Java 11.0.20.1 build and these standard Ubuntu packages, with the same results:

$ javac -version
javac 19.0.2

$ java -version
openjdk version "19.0.2" 2023-01-17
OpenJDK Runtime Environment (build 19.0.2+7-Ubuntu-0ubuntu322.04)
OpenJDK 64-Bit Server VM (build 19.0.2+7-Ubuntu-0ubuntu322.04, mixed mode, sharing)

Python 3 does what I expect:

$ python3 -c 'print(len("🤣"))'
1

$ python3 -c 'print("🤣"[0])'
🤣


Answers


  1. Chosen as BEST ANSWER

    Since what I really wanted was the first N "chars" of a string, I'll post a solution here for that. But since that isn't reflected in the question title, I'll accept the other answer, even though this is what I ended up using:

    import java.text.BreakIterator;
    
    public class Foo {
    
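      // Note: despite the "codepoint" naming, BreakIterator.getCharacterInstance()
      // steps over user-perceived characters, each of which may span several code points.
      public static String utf8SafeStringPrefix(String input, int numPrefixCodepoints) {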
        StringBuilder stringBuilder = new StringBuilder(input.length());
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(input);
        int start = it.first();
        int end = it.next();
    
        int numAppendedCodepoints = 0;
        while (end != BreakIterator.DONE) {
          String codepoint = input.substring(start,end);
          stringBuilder.append(codepoint);
          numAppendedCodepoints++;
          if (numAppendedCodepoints == numPrefixCodepoints) {
            break;
          }
          start = end;
          end = it.next();
        }
        return stringBuilder.toString();
      }
    
      public static void main(String[] args) {
        String str = "this-is-a-🤣-character";
        System.out.println(utf8SafeStringPrefix(str, 10));
        System.out.println(utf8SafeStringPrefix(str, 11));
        System.out.println(utf8SafeStringPrefix(str, 12));
      }
    }
    
    

    Which prints out:

    this-is-a-
    this-is-a-🤣
    this-is-a-🤣-
    

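  2. Java stores strings as UTF-16 (or close to it). 🤣 (U+1F923) lies outside the Basic Multilingual Plane, so it takes up two UTF-16 code units, a surrogate pair (0xd83e, 0xdd23). That is why its length is 2, and why the first code unit on its own is an unpaired surrogate that doesn’t make any sense and prints as "?".

    String.length() and String.substring() work on individual UTF-16 code units (char values), not on code points or full Unicode characters.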
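
As a rough illustration of the answer above, here is a minimal sketch of the difference between what length() counts and the number of Unicode code points, using only standard String methods (codePointCount, offsetByCodePoints, codePointAt). The class name CodePointDemo and the helper prefixByCodePoints are hypothetical names made up for this example. The emoji is built from its code point number so the result does not depend on the source file encoding.

```java
public class CodePointDemo {

  // Hypothetical helper: returns the first n code points of s,
  // or all of s if it contains fewer than n code points.
  static String prefixByCodePoints(String s, int n) {
    int available = s.codePointCount(0, s.length());
    // Translate a code point count into a char (UTF-16 code unit) index.
    int end = s.offsetByCodePoints(0, Math.min(n, available));
    return s.substring(0, end);
  }

  public static void main(String[] args) {
    // U+1F923, built from its code point number.
    String laugh = new String(Character.toChars(0x1F923));

    System.out.println(laugh.length());                           // 2 (UTF-16 code units)
    System.out.println(laugh.codePointCount(0, laugh.length()));  // 1 (code point)
    System.out.println(Integer.toHexString(laugh.charAt(0)));     // d83e (high surrogate)
    System.out.println(Integer.toHexString(laugh.charAt(1)));     // dd23 (low surrogate)
    System.out.println(Integer.toHexString(laugh.codePointAt(0))); // 1f923

    String str = "this-is-a-" + laugh + "-character";
    // 11 code points occupy 12 chars, because the emoji needs two of them.
    System.out.println(prefixByCodePoints(str, 11).length());     // 12
  }
}
```

This only keeps surrogate pairs intact. A user-perceived character made of several code points (for example a letter followed by a combining accent) can still be split, which is the case the BreakIterator approach in the first answer is aimed at.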
