Stuff Yaron Finds Interesting

Technology, Politics, Food, Finance, etc.

Unicode, Windows and C++

Unicode, Windows, C++: it should all be good, right? The answer seems to be that, well, actually, yeah, it is. But there are a lot of details. I do my best to walk through some of them below.

Disclaimer

Although I am an employee of the Microsoft Corporation I know exactly nothing about how Windows supports Unicode. In fact I'm writing this article as a desperate bid to figure out the details so I don't screw up my code. Nothing in this article is supported by Microsoft; in fact, nothing in this article is probably even right. You have been warned.

The problem

This article started with a simple question - what's the 'right' way to handle arbitrary Unicode text in Windows using C++? Our scenario is that we are reading in UTF-8 encoded files that could potentially contain arbitrary Unicode content and we need to query against the values. By query I mean things like equality testing, range checking, etc.
In C# land I never really gave any of this any thought because whatever sins C# may or may not commit with Unicode, they have never impacted my daily life: C# and its string class have supported Unicode from day 1. But in C++ things are very, very different, because support for Unicode in C++ in general and Windows in particular is, to be polite, confusing. The following is my attempt to grapple with the issues in the context I care about, which is writing a program that will only run on new versions of Windows and is not primarily UX oriented.

A quick history lesson

In the beginning in Windows there was pretty much just ASCII (no, I'm not going to talk about EBCDIC and its predecessors). Which was mostly just simple characters. Then it expanded to ANSI which added some simple graphics. And yes, later it would expand to a pseudo form of ISO 8859-1, but let's ignore that. But quickly in Windows land it was clear that support was needed for multiple languages. The 'solution' was code pages. Code pages typically supported some kind of single byte, multi byte or variable multi byte (i.e. a character could be encoded with a variable number of bytes) encoding. So a character string would be interpreted differently depending on what code page Windows was told to apply to it. This led to all sorts of fun interoperability disasters as the same byte string had different meanings depending on the local code page. Before anyone blames Microsoft too much, do keep in mind that in the early days various countries came up with their own random overlapping encoding schemes, and woe be unto Microsoft if it didn't support a country's preferred choice. So as much of an interoperability and consistency nightmare as code pages might have been, such was the state of the world.
Finally Unicode showed up and promised to come up with a single encoding for characters that would encode all characters in the world. The original Unicode standard felt it could do all of this in exactly 2 bytes and so defined the UCS-2 encoding. And for a tiny period of time life was good. Microsoft quickly jumped on UCS-2 and changed Windows to support it. No more code pages except for compatibility scenarios.
Unfortunately it quickly became apparent that 2 bytes weren't enough. This led to the introduction by the Unicode consortium of an expanded set of code points; in fact, there could now be up to 1,112,064 code points. There were several answers to how to actually encode these code points, only one of which concerns me because it's the one Windows natively supports: UTF-16.

UTF-16 encoding and matching sub-strings

The Wikipedia article on UTF-16 is actually quite good so I recommend giving it a read. But the upshot is as follows. UCS-2 encoded what is called the Basic Multilingual Plane (BMP). When Unicode was expanded the new characters were put into 'higher' planes. All characters from the BMP are put into the values 0x0000-0xD7FF and 0xE000-0xFFFF. Characters in higher planes have to be encoded using 4 bytes, in what's called a surrogate pair.
UTF-16 uses an interesting encoding trick that makes comparison checking and parsing less scary. The lead surrogate (i.e. the first 2 bytes in a surrogate pair) can only be the values 0xD800-0xDBFF and the trailing surrogate (the last 2 bytes) only the values 0xDC00-0xDFFF. Now compare our three ranges: they don't overlap. This means that one can always test whether a UTF-16 2 byte sequence is a BMP code point, a lead surrogate or a trailing surrogate. It also means that if one is looking for substrings it is impossible to get synchronization problems so long as you properly stay on 2 byte boundaries. In other words when you are running through a UTF-16 encoded string it's fine to just iterate 2 bytes at a time. So long as you are comparing against another UTF-16 string there will be no synchronization, overlap or other problems. It will just work. Yeah!
Note however that this only works for ordinal comparisons, more on that later.

What about displaying Unicode in console apps?

I am generally a server side guy so typically most of my code is tested inside of a console app. It's useful therefore to be able to see Unicode characters in the console app. A guy named Jim Beveridge has a blog article with a bunch of really awesome links to high quality data about Unicode and the console. The short answer is that if one is using primarily wprintf (with UTF-16) or printf (with UTF-8) then one can call _setmode to set one's environment correctly. The only issue then with outputting to the console is that depending on how one's console is set up it might not be using a font that can display Unicode characters properly. In general one might not care, but if you do there is something to be done. First, go to any cmd window and right click on the title bar and select Defaults. Then change your font to something like Lucida Console.
Reading online you will find lots of stuff about the console and the chcp command, which lets you change a console's code page. If using _setmode on any modern version of Windows one can safely ignore chcp (including suggestions to set chcp to UTF-8). The reason is that the console is reasonably Unicode aware and _setmode gives the hint that will cause the console to display the characters as Unicode.

What’s up with the little boxes I see in the console output?

When testing characters far from the character set native to the local copy of Windows it's likely one will see lots of boxes rather than characters in the console output. These boxes are interesting because what they represent are Unicode characters that the current console font doesn't have a glyph for. The character codes are correct, and if you copy and paste them out of the console and into an app like Notepad you will actually see the correct characters (most likely). The reason for this is something called font substitution.
Font substitution happens when a program trying to display Unicode code points realizes that the current font doesn’t have a matching glyph. In that case the program will search through other fonts looking for a font that does have a glyph and will use that font instead. The console window doesn’t use font substitution but other programs like Notepad, Word and Visual Studio do.
I actually was playing with a surrogate pair code point which neither Notepad nor Word could properly display but Visual Studio 2012 could, go figure.

Mixing printf and wprintf

Don’t. I tried various tricks such as resetting _setmode but it didn’t work. Once you output with one of these two you have to stick with it. This is particularly fun when dealing with 3rd party programs using printf or wprintf.

TCHAR vs wchar_t vs char16_t

Microsoft's official guidance is to use Unicode for everything but to do so via macros. One uses types like TCHAR and LPCTSTR that the compiler will then resolve, based on a define, to either the old code page based APIs or the Unicode APIs.
For new apps that only need limited interaction with old APIs there isn't much benefit to TCHAR. In theory it might provide future proofing should Windows or Unicode change, but honestly UTF-32 (which Linux uses) is really just silly. It provides no substantive benefits over UTF-16 since real world Unicode has to deal with fun things like combining marks, which makes being able to index code points directly essentially useless. And no, precomposed characters won't save you since there are diacritic combinations that don't have precomposed forms. Microsoft's article on string normalization gives, at the end, an example of a diacritic combination without a precomposed form.
So my general feeling is that it's clearer to just use Unicode types and methods directly. The counter-example to this is if one wants to use UTF-8 and leave open the option to switch to UTF-16 without a major code rewrite. In that case TCHAR really can help. One can use TCHAR and its associated functions and set the compile option either to Unicode or to code pages with the code page set to UTF-8.
But for our purposes I think if we are using UTF-8 it would only be in cases where we know it can really help us (in other cases, such as dealing with Asian or Indian characters from the BMP, it's actually more expensive in terms of space than UTF-16 since it will use 3-4 bytes to UTF-16's 2). So my guess is that we would have intentional UTF-16 and UTF-8 code paths rather than setting a macro. So my recommendation is that we just use the *W and *A functions directly rather than going through TCHAR, so there is no confusion about what is being used where.
Another consideration is portability. Microsoft's preferred datatype for UTF-16 is wchar_t. But as explained in some detail here, wchar_t doesn't have a portable definition. So bringing wchar_t code to another platform can have interesting effects. The C++11 standard solved this problem by introducing the char16_t and char32_t types and associated string functions. The latest releases of Visual Studio support C++11 and those types. And, in theory, one should be able to use a char16_t anywhere one can use a wchar_t on Windows since they encode exactly the same values.
But Microsoft is really focused on wchar_t for the moment at least and I figure we’ll just use it since portability across OS’s isn’t a major concern for my current project.

Encoding Unicode characters, L and \u vs \U vs \x

To create a UTF-16 string the format is: wstring foo = L"bar"; or wchar_t *foo = L"bar";. The L prefix marks the string as being a 'wide' string, a.k.a. UTF-16 in Windows.
But what if I want to test characters that aren’t in the normal western character set? One way is to just cut and paste the character and save the C++ file as a UTF-8 file. But I’m way too wimpy to try that. So instead I tend to use escape sequences.
#include <fcntl.h>  // for _O_U16TEXT
#include <io.h>     // for _setmode
#include <cstdio>   // for _fileno and wprintf
#include <string>
using namespace std;

_setmode(_fileno(stdout), _O_U16TEXT);
wstring smallUEncoding = L"\u548C";
wstring largeUEncoding = L"\U0000548C";
wstring xEncoding = L"\x548C";
wprintf(L"Do they all match? %s\n", 
  smallUEncoding.compare(largeUEncoding) == 0 && largeUEncoding.compare(xEncoding) == 0 ? 
  L"Yes!" : L"No!");
U+548C is the representation for a Chinese character that means something like harmony or peace. It's part of the BMP, which will prove to be important in a second. One way to encode U+548C in C++ is to use \u548C. But \u can only be used with BMP characters, whose code points are all guaranteed to fit in exactly 4 hex digits. One could also use \U, which is intended for characters outside of the BMP and must be exactly 8 hex digits long. Last but not least one can use \x548C. The big difference between \u and \x is that \u is the Unicode code point while \x represents the actual UTF-16 2 byte hex value. As mentioned previously all BMP characters have the same Unicode code point value as their UTF-16 value. But now let's drag up a surrogate pair.
_setmode(_fileno(stdout), _O_U16TEXT);
wstring utf32SurrogatePair = L"\U0002a6d6";
wstring utf16SurrogatePair = L"\xd869\xded6";
wprintf(L"Do they match? %s\n", utf32SurrogatePair.compare(utf16SurrogatePair) == 0 ? L"Yes!" : L"No!");
The Unicode code point U+2A6D6 (which I stole from here when I was looking for surrogate pairs) is encoded as a surrogate pair. That means it's outside of the BMP. It also helps to illustrate the differences between \u, \U and \x. Because this value is outside of the BMP it can't be represented with \u. After all, \u takes exactly 4 hex digits and U+2A6D6 requires 5. So we can use the \U form which takes exactly 8 hex digits; we just have to prepend 0s to make it the right length. But notice the \x value: it's D869 DED6, which isn't anything like 2A6D6. The reason is that, as mentioned above, \x represents the exact UTF-16 encoding. For characters outside of the BMP there is an algorithm, described for this character in the second link in this paragraph. The end result is D869 DED6 which is then encoded as \xd869\xded6.

Indexing into strings, ++

In ASCII land life was pretty straightforward. If you wanted to compare two values you just compared their bytes. One byte == one character. The most complicated thing one had to deal with was casing. In most foreign languages however things get much, much more complicated. To understand these issues it's worth reading Microsoft's official guidance on string normalization. This brings out a lot of the complications in dealing with Unicode for sorting or equality checking. The real issue is that a glyph, the thing on the screen, can be made up of more than one Unicode code point. So even if one is using UTF-32 or restricts UTF-16 to only supporting the BMP one still can't meaningfully index into a Unicode string. The reason is that code point 3 may or may not map to a glyph that a user can see. This is due to the combining marks already mentioned in the previous section. So the fact that UTF-16 throws in the complexity of surrogate pairs just doesn't matter; it is already the case that the only sane way to index through a Unicode string is in order, using a function like CharNext.
This brings us to ++. In the case of wchar_t I actually just looked at the assembly and it is incrementing the counter by 2. Due to both surrogate pairs and combining marks this is a pretty useless operation. With one exception, ordinal compares. If one is comparing code points directly to each other then one can walk through a UTF-16 or even a UTF-8 string and do a full match or even substring match with confidence so long as one honors the 2 byte boundaries and ++ will do that automatically.

Measuring the size of a wstring

_setmode(_fileno(stdout), _O_U16TEXT);
wstring wstr = L"\u00C4A\u0308\U0002a6d6";
wprintf(L"UTF-16 String: %s, string size: %d\n", wstr.c_str(), (int)wstr.size());
U+00C4 is a precomposed character. A and U+0308 are two code points that when precomposed would be U+00C4. And U+2A6D6 is a character outside of the BMP and thus needs a surrogate pair to encode. The value returned by size is 5 even though there are only 4 code points. This actually makes sense if you remember that size is just going to report how many UTF-16 values there are, which is 5. The total byte size is then 5 * sizeof(wchar_t), which in Windows is 5 * 2 = 10 bytes (not including the terminating \0 character). So when dealing with size or length one has to check the function definition, but typically it will return UTF-16 values, not glyphs (i.e. the characters people would actually see).

But... UTF-16 characters are so big... can’t I use UTF-8?

If your character data is primarily western characters then UTF-16 will double your data size. Unless you are dealing with enormous amounts of information this shouldn't matter and one can just use UTF-16. But if you are dealing with prodigious amounts of data in memory the overhead can really add up. There is, however, some hope. Windows supports UTF-8 as a code page option. So in theory one can use the 'A' version of APIs (i.e. the old code page based APIs) and set the code page to UTF-8. The Microsoft documentation claims that this will support all of UTF-8, which means it supports all of Unicode, and it will do so, for western characters, mostly using one byte. However it means that characters can be more than one byte; in fact, they can be up to 4 bytes long.
But even here some of the same comparison tricks as UTF-16 do work. The first byte in a sequence is guaranteed to start either with a 0 bit or with 2 or more 1 bits. So one can always uniquely identify a starting byte. Following bytes always start with 10 (which can never appear in a starting byte) and the starting byte's encoding tells one how many following bytes there are. So if one is randomly dropped into a UTF-8 string one will have to back up at most 3 bytes from the current byte to find a 'start byte' and then one can start processing.
Equally importantly for doing substring matches one can just go byte by byte, at least for ordinal comparison. The encoding format guarantees that it’s impossible for two different code points to have the same or even overlapping encodings in UTF-8.
Note also that if you need to switch between UTF-8 and UTF-16 you can use the MultiByteToWideChar and WideCharToMultiByte functions, which will do a 100% lossless conversion.
_setmode(_fileno(stdout), _O_U16TEXT);
wstring wstr = L"\u00C4A\u0308\xd869\xded6";
wprintf(L"UTF-16 String: %s, string size: %d\n", wstr.c_str(), (int)wstr.size());
// Passing -1 as the source length means "process through the terminating null",
// so the sizes returned below include room for that null.
int utf8Size = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);
string str(utf8Size, 0);
WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &str[0], utf8Size, NULL, NULL);
int utf16Size = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
wstring returnedStr(utf16Size, 0);
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &returnedStr[0], utf16Size);
returnedStr.resize(utf16Size - 1); // drop the copied terminating null before comparing
wprintf(L"Do they match? %s", returnedStr.compare(wstr) == 0 ? L"Yes!" : L"Ooops!");

So what about sorting and comparisons?

I already pointed to Microsoft's page on string normalization which hits most of the key issues. But let's just say, the situation is difficult. Sorting (and its sibling, comparison) is really, really, REALLY hard. If you don't believe me just take a gander at Unicode Technical Standard #10 which defines the algorithms used for collation in Unicode. Generally if humans aren't involved I almost always just use ordinal sorting and comparison. This means I just order things by Unicode code point values and compare by code points. What's nice about this approach is that it's simple, fast and invariant (i.e. as Unicode itself changes my code and its outputs won't change).
Which is all fine and good right up until a human is involved. Then pretty much it’s a total disaster. Sorting by code points produces what to any normal human is a random ordering. The problem is that the ordering people expect depends on multiple variables including what language they speak and what the context of the information is (e.g. in some cultures different kinds of sorting are used for different purposes). So pretty much the only sane way to do a non-ordinal Unicode sort is with a lot of data from the user telling you what their expectations are.
Thankfully Microsoft does provide a function that is intended to handle most of the usual variations: CompareStringEx. Besides the infinite number of options, a really fun aspect of Unicode in general and CompareStringEx in particular is - IT CHANGES. That's right, the Unicode standard is evolving, and when it does, things like sort rules can actually change. Changes can also occur when Microsoft fixes implementation bugs. This means that if you sort some values on machine A using Unicode version X and then try to process them on machine B using Unicode version Y, bad things are going to happen. The security holes alone this can cause are left as an exercise for the reader. Microsoft provides some help with this problem via the GetNLSVersionEx function. See here for a bunch of the details on how to compare versions.
The key point is that when recording data, if one is going to depend on it being sorted, then one had better specify all the parameters that went into its sorting or things can easily break.
