I’m trying to print the first 30 characters of some UTF-8 strings, and I’ve noticed that Java’s String.substring()
is returning some funky strings. I’ve boiled it down to the following.
I’m expecting "🤣" to be a String of length 1, and I’m expecting String.substring()
not to cut it in the middle. Why is my expectation not met? Java thinks it has length 2.
I’m pretty sure (1 2) the UTF-8 encoding for 🤣 (U+1F923) "Rolling On the Floor Laughing" is:
0xF0 0x9F 0xA4 0xA3
And so I expect this tiny program:
import java.nio.charset.StandardCharsets;

public class Foo {
    public static void main(String[] args) {
        String str = "🤣";
        // These are the UTF-8 bytes for "ROLLING ON THE FLOOR LAUGHING"
        byte[] raw = {(byte) 0xf0, (byte) 0x9f, (byte) 0xa4, (byte) 0xa3};
        String str2 = new String(raw, StandardCharsets.UTF_8);
        System.out.println(str.equals(str2));
        System.out.println(str.length());
        System.out.println(str.substring(0, 1));
    }
}
To print out:
true
1
🤣
But in fact it prints out:
true
2
?
Am I doing something wrong?
I’ve tried a custom Java 11.0.20.1 build and the standard Ubuntu packages below, with the same results:
$ javac -version
javac 19.0.2
$ java -version
openjdk version "19.0.2" 2023-01-17
OpenJDK Runtime Environment (build 19.0.2+7-Ubuntu-0ubuntu322.04)
OpenJDK 64-Bit Server VM (build 19.0.2+7-Ubuntu-0ubuntu322.04, mixed mode, sharing)
python3 does what I expect:
$ python3 -c 'print(len("🤣"))'
1
$ python3 -c 'print("🤣"[0])'
🤣
Answers
Since what I really wanted was the first N "characters" of a string, I'll post a solution for that here. But since that isn't reflected in the question title, I'll accept the other answer, even though this approach is what I ended up using.
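A minimal sketch of one way to do this, assuming it is enough to truncate on code-point boundaries so a surrogate pair is never split (the class and helper names are illustrative; codePointCount and offsetByCodePoints are standard String methods):

public class FirstChars {
    // Illustrative helper: keep the first n code points of s,
    // never cutting a surrogate pair in half.
    static String firstCodePoints(String s, int n) {
        int total = s.codePointCount(0, s.length());
        int end = s.offsetByCodePoints(0, Math.min(n, total));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String str = "🤣abc";
        System.out.println(firstCodePoints(str, 1)); // 🤣
        System.out.println(firstCodePoints(str, 2)); // 🤣a
    }
}

This handles surrogate pairs like 🤣, but not multi-code-point grapheme clusters (skin-tone modifiers, flags, ZWJ sequences); for those you would need a grapheme-aware approach such as java.text.BreakIterator.getCharacterInstance().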
Java stores strings as UTF-16 encoded (or close to it). 🤣 is a single code point, U+1F923, but it is encoded as two UTF-16 code units: the surrogate pair 0xD83E, 0xDD23. Hence the reason its length is 2 and the first code unit on its own doesn’t make any sense.
Java’s String methods such as length(), charAt() and substring() work on individual char values (UTF-16 code units), not on full Unicode code points.
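A small sketch illustrating the two views of the same string (the class name is illustrative; charAt(), codePointAt() and codePointCount() are standard String methods):

public class Surrogates {
    public static void main(String[] args) {
        String str = "🤣";
        // char-based view: two UTF-16 code units (a surrogate pair)
        System.out.println(str.length());                                           // 2
        System.out.printf("%04x %04x%n", (int) str.charAt(0), (int) str.charAt(1)); // d83e dd23
        // code-point-based view: a single code point, U+1F923
        System.out.println(str.codePointCount(0, str.length()));                    // 1
        System.out.printf("%x%n", str.codePointAt(0));                              // 1f923
    }
}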