Unicode combining and graphemes -- how two identical Java Strings can be different
I invite you to look closely at the following snippet of Java code:
public class Test1 {
  public static void main (String[] args) {
    String s1 = "Café";
    String s2 = "Café";
    System.out.println ("s1=" + s1);
    System.out.println ("s2=" + s2);
    System.out.println ("Equal? " + s1.equals(s2));
  }
}
And at the output it produces when compiled and executed:
$ java Test1
s1=Café
s2=Café
Equal? false
Unless you're already familiar with the issues surrounding combining characters in Unicode -- in which case you already know the punchline -- this result will likely strike you as odd, if not bizarre. We are comparing two Java String objects for textual equality, and yet the comparison is false. How can that be?
If you think I'm cheating -- and I assure you I'm not -- I'd suggest cutting-and-pasting the snippet of Java code into a text editor or IDE, and running it yourself. If you do copy the code, you may see what the problem is. Or maybe not -- it really depends on your system, font settings, etc.
Looking at the source with a hexadecimal editor will show the problem; the bytes to look at are the ones that make up the encoding of the word "Café" in each string literal.
00000040  0a 20 20 53 74 72 69 6e 67 20 73 31 20 3d 20 22  |.  String s1 = "|
00000050  43 61 66 c3 a9 22 3b 0a 20 20 53 74 72 69 6e 67  |Caf..";.  String|
00000060  20 73 32 20 3d 20 22 43 61 66 65 cc 81 22 3b 0a  | s2 = "Cafe..";.|
00000070  20 20 53 79 73 74 65 6d 2e 6f 75 74 2e 70 72 69  |  System.out.pri|
You'll see that these two identical-looking strings actually have the symbol "é" encoded differently.
These byte sequences are the UTF-8 encodings of the characters that render as "é" in my sample code. They aren't simply variant byte-level encodings of the same Unicode code point -- if they were, both literals would decode to the same character, and if a mere encoding difference broke string comparison, we'd have good reason to suspect a bug in the Java JVM.
No -- these two instances of the symbol "é" are, in fact, completely different Unicode character sequences, which just happen to look the same.
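If you'd rather not rely on copy-and-paste at all, the two literals can equally well be written using Unicode escape sequences -- an alternative rendering of the same strings that makes the difference visible in plain ASCII:

String s1 = "Caf\u00E9";    // the bytes c3 a9 are the UTF-8 form of the code point U+00E9
String s2 = "Cafe\u0301";   // the bytes cc 81 are the UTF-8 form of the code point U+0301
System.out.println (s1.equals (s2));   // still prints false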
For the record, and although it is not at all obvious from the code or the hexadecimal, s1 contains the Unicode symbol for "e with an acute accent" (U+00E9), while s2 contains the Unicode symbol for the English letter 'e' (U+0065), followed by the symbol "combining acute accent" (U+0301).
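We can confirm this from within Java itself. Here is one way to do it -- a quick sketch using String.codePoints(), which needs Java 9 or later -- with s1 and s2 written using the escape sequences shown earlier:

// s1 = "Caf\u00E9", s2 = "Cafe\u0301", as above
System.out.println (s1.length());   // prints 4 -- the 'é' is a single char
System.out.println (s2.length());   // prints 5 -- plain 'e' plus the combining accent
s1.codePoints().forEach (cp -> System.out.printf ("U+%04X ", cp));
System.out.println();   // U+0043 U+0061 U+0066 U+00E9
s2.codePoints().forEach (cp -> System.out.printf ("U+%04X ", cp));
System.out.println();   // U+0043 U+0061 U+0066 U+0065 U+0301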
It turns out that Unicode defines two completely different ways to encode certain characters. Many accented characters have their own specific code points: "e with an acute accent" is one such. These symbols usually have numeric values between 128 and 255, and were inherited by Unicode from the single-byte encoding ISO-8859-1. This, presumably, was for backward compatibility.
However, if every possible combination of every accent, applied to every letter in every language, were represented individually in Unicode, the Unicode character set would be even more bloated than it currently is (about 140 000 characters at the time of writing).
So Unicode defines the concept of a grapheme cluster: a sequence of one or more Unicode symbols that together make up a single displayed character -- in the cases that concern us here, a base letter followed by one or more combining marks. That a displayed character can be formed from multiple Unicode symbols comes as a surprise to many people.
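Java does provide some support for working at this level: the character-boundary analysis in java.text.BreakIterator groups a base letter with its combining marks. Here is a rough sketch of counting displayed characters rather than char values -- the exact behaviour can depend on the JDK version:

import java.text.BreakIterator;

public class GraphemeCount {
  public static void main (String[] args) {
    String s = "Cafe\u0301";   // 'e' followed by a combining acute accent
    System.out.println ("char length: " + s.length());   // 5

    // Character-boundary analysis treats the base letter and its
    // combining mark as a single unit
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText (s);
    int displayed = 0;
    it.first();
    while (it.next() != BreakIterator.DONE)
      displayed++;
    System.out.println ("displayed characters: " + displayed);   // typically 4
  }
}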
The fact is that the JVM is correctly reporting that my two Strings are different -- although they result in the same symbol on screen, they are different at the Unicode level. This oddity does not only affect string comparison -- it affects any operation in which characters are compared with one another. If I try to find the first occurrence of the string "Café" in the larger string "Hard Rock Café", whether the search succeeds depends on how the "é" is encoded in each string.
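To make that concrete, here is a small example in which the larger string uses the precomposed form and the search string the decomposed one:

String haystack = "Hard Rock Caf\u00E9";   // precomposed 'é' (U+00E9)
String needle = "Cafe\u0301";              // 'e' followed by U+0301
System.out.println (haystack.indexOf (needle));    // -1 -- not found
System.out.println (haystack.contains (needle));   // false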
This isn't an obscure problem -- there are a great many European letters (among other things) that can be composed from different grapheme clusters. All these symbols are widely used. In principle, Java's use of Unicode ought to mean that developers don't have to worry about this kind of thing; but clearly we do.
The solution, thankfully, is simple -- provided the developer knows to apply it. The method java.text.Normalizer.normalize() will reduce grapheme clusters to single characters (where possible), or expand single characters into clusters. The Normalizer.Form argument selects which normalization is to be done -- so long as the same normalization is applied to all the strings to be compared, the comparison should be valid.
So we can do this:
String s1Norm = Normalizer.normalize (s1, Form.NFD);
String s2Norm = Normalizer.normalize (s2, Form.NFD);
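For completeness, here is one way the complete test might look with normalization applied -- a sketch rather than the original program, using NFC here, though NFD would do equally well so long as both strings get the same treatment:

import java.text.Normalizer;
import java.text.Normalizer.Form;

public class Test2 {
  public static void main (String[] args) {
    String s1 = "Caf\u00E9";    // precomposed 'é'
    String s2 = "Cafe\u0301";   // 'e' plus a combining acute accent

    // Normalize both strings to the same form before comparing them
    String s1Norm = Normalizer.normalize (s1, Form.NFC);
    String s2Norm = Normalizer.normalize (s2, Form.NFC);

    System.out.println ("Equal? " + s1Norm.equals (s2Norm));   // Equal? true
  }
}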
It's worth asking why, if Java's lack of Unicode normalization
potentially gives such
odd results, normalization isn't applied automatically before string
comparison. I suppose the pragmatic answer is that this process takes
time and memory, and isn't always necessary. However, it's also the
case that a simple equals()
method doesn't really capture
the various meanings of 'equality' that exist.
We've certainly seen this kind of thing before in Java. Comparing two
strings using the ==
operator almost never gives the
result that the developer expects -- unless the developer has had to
troubleshoot an application where it caused a problem. 'Equality'
is a less straightforward concept than it might appear, even when
applied to two pieces of text.