wchar_t Is a Historical Accident

At first glance, it looks like portable C and C++ programs should use wchar_t for text. It’s portable, and it’s Unicode; what more could you want? It turns out that wchar_t is a bit of a historical accident, and it’s only useful for calling Windows API functions. Avoid wchar_t everywhere else, especially if you are writing portable code.

“But wchar_t is portable!” you say.

Unfortunately, the portable things you can do with wchar_t are not very useful, and the useful things you can do with it are not portable.

Note: To be clear, this article is going to gloss over a lot of the deeper problems with text processing. Questions like, “What is a character?” and, “What encoding should we use?” are topics in their own right.

A Bit of History

Unicode 1.0 is published in 1991 after about four years of work. It defines 7,161 characters, which grow to 34,233 by the time 1.1 is released in 1993. Most of these characters are CJK Unified Ideographs.

These early versions are 16-bit, using an encoding called UCS-2, giving 2¹⁶ (65,536) possible characters, minus a few special code points like U+FFFE. During this era the Unicode version of the Win32 API appears, as well as Sun’s new programming language, Java. 16-bit character types are a no-brainer, since Unicode is obviously the way of the future. Everyone is happy because they can work with text written in nearly any language, and they don’t have to recompile their programs to do it.

The new Windows API looks like this:

// “ANSI” version uses a Windows code page for the filename.
HANDLE CreateFileA(const char *lpFileName, DWORD dwDesiredAccess,
                   DWORD dwShareMode,
                   LPSECURITY_ATTRIBUTES lpSecurityAttributes,
                   DWORD dwCreationDisposition,
                   DWORD dwFlagsAndAttributes,
                   HANDLE hTemplateFile);

// Unicode version uses UCS-2 (later UTF-16) for the filename.
HANDLE CreateFileW(const wchar_t *lpFileName, DWORD dwDesiredAccess,
                   DWORD dwShareMode,
                   LPSECURITY_ATTRIBUTES lpSecurityAttributes,
                   DWORD dwCreationDisposition,
                   DWORD dwFlagsAndAttributes,
                   HANDLE hTemplateFile);

// CreateFile will be an alias for the Unicode or Windows code page
// version of the function, depending on the project build settings.
// New projects should define UNICODE globally and only use Unicode
// versions of functions.
#ifdef UNICODE
# define CreateFile CreateFileW
#else
# define CreateFile CreateFileA
#endif

In 1996, Unicode is expanded by a factor of 17 to make room for future characters. UCS-2 is no longer viable; it is superseded by UTF-8, UTF-16, and UTF-32.

As a refresher, this is what the encodings look like:

Character                               UCS-2   UTF-8        UTF-16     UTF-32
Latin Capital Letter A (U+0041 A)       0041    41           0041       00000041
Greek Capital Letter Delta (U+0394 Δ)   0394    CE 94        0394       00000394
CJK Unified Ideograph (U+904E 過)       904E    E9 81 8E     904E       0000904E
Musical Symbol G Clef (U+1D11E 𝄞)       (none)  F0 9D 84 9E  D834 DD1E  0001D11E

UTF-8 takes 1-4 bytes per code point, UTF-16 takes 2 or 4, and UTF-32 always takes 4 bytes.
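
If you want to check those code-unit counts yourself, here’s a quick sketch (mine, not part of the original table) that measures U+1D11E using the C++11 literal prefixes introduced later in this article; it assumes a C++11 through C++17 compiler, since C++20 changes the type of u8 literals.

#include <cstdio>
#include <string>

int main() {
    // The same character, U+1D11E MUSICAL SYMBOL G CLEF, in each encoding.
    const char     *s8  = u8"\U0001D11E"; // bytes F0 9D 84 9E
    const char16_t *s16 = u"\U0001D11E";  // units D834 DD1E
    const char32_t *s32 = U"\U0001D11E";  // unit  0001D11E
    // char_traits<>::length counts code units up to the terminator.
    std::printf("UTF-8: %zu, UTF-16: %zu, UTF-32: %zu code units\n",
                std::char_traits<char>::length(s8),
                std::char_traits<char16_t>::length(s16),
                std::char_traits<char32_t>::length(s32));
    // Prints "UTF-8: 4, UTF-16: 2, UTF-32: 1 code units".
    return 0;
}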

Note the gap in the table. Characters beyond U+FFFF, like 𝄞, 🌍, and 😭, simply cannot be represented in UCS-2. Switching from UCS-2 to UTF-32 would bloat program memory usage and create major API incompatibility problems, so Java and Windows switch to UTF-16 instead.

What Happened to Linux and macOS?

Mac and Linux systems don’t have major APIs that use wchar_t.

Mac OS X 10.0, the first consumer version of what is now macOS, is released in 2001. It provides new APIs that will eventually replace the older Macintosh APIs. The Cocoa GUI framework uses Unicode everywhere, storing strings in the NSString class, which hides the details of its encoding (which can vary from string to string!) and forces the programmer to explicitly specify string encodings when converting from C strings. Here’s a sample from the Foundation framework, the lower-level part of Cocoa:

@interface NSData
// Almost all Cocoa APIs take NSString instances instead of C
// strings, so no char or wchar_t.
+ (instancetype)dataWithContentsOfFile:(NSString *)path;
@end

@interface NSString
// When you construct an NSString instance, it will be obvious which
// encoding you’re using.
- (instancetype)initWithUTF8String:(const char *)cString;
- (instancetype)initWithCString:(const char *)cString
                       encoding:(NSStringEncoding)encoding;
// The ‘unichar’ type is UTF-16.
- (instancetype)initWithCharacters:(const unichar *)characters
                            length:(NSUInteger)length;
@end

// This is always UTF-16 regardless of what wchar_t is.
typedef unsigned short unichar;

System calls on macOS, like open and chdir, consume char *, but since these calls weren’t available prior to Mac OS X, they don’t need to be backwards-compatible with existing classic Mac OS programs. These functions consume UTF-8 strings. The operating system translates them to the encoding that the filesystem uses; for HFS+, this means normalizing the string with a variant of Unicode normalization form D and encoding the result in UTF-16.
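
As a minimal sketch (not production code), the path handed to open is just a char string of UTF-8 bytes; wchar_t never appears, and the OS takes care of the filesystem’s own encoding:

#include <fcntl.h>

int open_greek_file(void) {
    // "γειά.txt" spelled out as UTF-8 bytes in an ordinary char string.
    const char *path = "\xce\xb3\xce\xb5\xce\xb9\xce\xac.txt";
    return open(path, O_RDONLY); // the OS normalizes the name for HFS+
}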

Meanwhile, Linux slowly moves to using UTF-8 everywhere, but this is a popular convention rather than a decision enforced by the operating system. Linux system calls treat filenames as opaque sequences of bytes, only treating “/” and NUL specially. The interpretation of filenames as characters is left to userspace and can be configured, but UTF-8 is the default almost everywhere. Linux filesystems, like ext2, faithfully reproduce whatever byte string the user supplies as a filename, regardless of whether that string is valid Unicode.
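
As a quick illustration (a throwaway sketch, not from any real program), the kernel will happily accept a filename containing the byte 0xFF, which can never appear in well-formed UTF-8:

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // Not valid UTF-8, but Linux only rejects '/' and NUL in names.
    int fd = open("not-unicode-\xff", O_CREAT | O_WRONLY, 0644);
    if (fd >= 0)
        close(fd);
    return 0;
}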

On both Linux and macOS, none of the important APIs use wchar_t, so that decision is left up to the C standard library. The operating system simply doesn’t care what wchar_t is. On both platforms, wchar_t ends up being UTF-32.

Evolution of C and C++

The C and C++ committees recognize that wchar_t has become somewhat less useful. Developers need a portable way to write strings with specific Unicode encodings. Three new ways to write string literals appear, for UTF-8, UTF-16, and UTF-32.

// This will use a different encoding depending on what platform you
// compile this for. Maybe your platform is set up to use UTF-8,
// maybe not. Maybe this won’t even compile!
const char *str = "γειά σου κόσμος";

// This will generally use some kind of Unicode encoding, but the
// exact encoding will be different on different platforms. On
// Windows, UTF-16. On Linux and Mac, UTF-32.
const wchar_t *wstr = L"γειά σου κόσμος";

// Always UTF-8. (Under C++20 the literal’s type changes to
// const char8_t *.)
const char *u8str = u8"γειά σου κόσμος";

// Always UTF-16.
const char16_t *u16str = u"γειά σου κόσμος";

// Always UTF-32.
const char32_t *u32str = U"γειά σου κόσμος";

The wchar_t type sticks around for compatibility, but it’s clear that there’s no other reason to use it. Its existence is a historical accident.

Everything is Painful

In short, wchar_t is useful on Windows for calling UTF-16 APIs, but on Linux and macOS, it’s not only a completely different encoding but it’s not even useful! No sane developer would choose to use UTF-16 on Windows and then turn around and try to get the same program running in UTF-32 on Linux and macOS, but that’s exactly what you get with wchar_t.

Let’s suppose you’ve ignored this advice and started using wchar_t in your program.

Here’s a snippet for parsing an escape sequence in JSON. Remember that JSON represents code points above U+FFFF as escaped UTF-16 surrogate pairs, but we need to convert those escapes differently depending on whether we are building for Windows.

// Parse a JSON string
std::wstring out;
switch (x) {
case '\\':
  // Parse escape sequence
  unsigned char c = *ptr++;
  switch (c) {
  case 'u':
    // Unicode escape sequence \uXXXX
    unsigned codepoint = readHex(4);
    if (codepoint >= 0xd800 && codepoint < 0xdc00) {
      // Combine surrogate pair
      if (*ptr++ != '\\' || *ptr++ != 'u')
        error("expected low surrogate");
      unsigned hi = codepoint, lo = readHex(4);
      codepoint = 0x10000 + ((hi & 0x3ff) << 10) + (lo & 0x3ff);
    } else if (codepoint >= 0xdc00 && codepoint < 0xe000) {
      error("unexpected low surrogate");
    }
#ifdef _WIN32
    // Windows uses UTF-16
    if (codepoint >= 0x10000) {
      out.push_back(0xd800 + ((codepoint - 0x10000) >> 10));
      out.push_back(0xdc00 + (codepoint & 0x3ff));
    } else {
      out.push_back(codepoint);
    }
#else
    // Everyone else uses UTF-32
    out.push_back(codepoint);
#endif

Here’s a snippet for writing a std::wstring to an HTML document, using entities to produce ASCII-only output. We need to parse the std::wstring differently depending on whether it is UTF-16 or UTF-32.

// Escape HTML entities, ASCII-only output
std::wstring text = ...;
std::string out;
for (auto p = text.begin(), e = text.end(); p != e; ++p) {
  switch (*p) {
  case '<': out.append("&lt;"); break;
  case '>': out.append("&gt;"); break;
  case '&': out.append("&amp;"); break;
  case '\'': out.append("&apos;"); break;
  case '"': out.append("&quot;"); break;
  default:
    if (*p > 0x7f) {
      unsigned codepoint;
#ifdef _WIN32
      // High surrogate: combine it with the low surrogate that follows.
      if (*p >= 0xd800 && *p < 0xdc00) {
        wchar_t c1 = p[0], c2 = p[1];
        ++p;
        codepoint = 0x10000 + ((c1 & 0x3ff) << 10) + (c2 & 0x3ff);
      } else {
        codepoint = *p;
      }
#else
      codepoint = *p;
#endif
      out.append("&#");
      out.append(std::to_string(codepoint));
      out.append(";");
    } else {
      out.push_back(static_cast<char>(*p));
    }
    break;
  }
}

Other problems are just as bad. If you need to find grapheme cluster boundaries or do collation with ICU, you’ll need to convert your wchar_t strings to UTF-16 (except on Windows), and then convert the results back to wchar_t afterwards.
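
A rough sketch of that conversion dance might look like this (the helper name and buffer sizing are mine, and error handling is minimal; u_strFromUTF32 is ICU’s real conversion routine):

#include <unicode/ustring.h>
#include <string>
#include <vector>

std::vector<UChar> to_icu(const std::wstring &ws) {
#if defined(_WIN32)
    // On Windows wchar_t is already UTF-16; just copy the code units.
    return std::vector<UChar>(ws.begin(), ws.end());
#else
    // Elsewhere wchar_t is UTF-32, so reinterpret the buffer as UChar32
    // and let ICU convert it to UTF-16.
    std::vector<UChar> out(ws.size() * 2 + 1);
    int32_t len = 0;
    UErrorCode err = U_ZERO_ERROR;
    u_strFromUTF32(out.data(), (int32_t)out.size(), &len,
                   (const UChar32 *)ws.data(), (int32_t)ws.size(), &err);
    out.resize(U_SUCCESS(err) ? len : 0);
    return out;
#endif
}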

Is this a problem with Windows? No! The problem is that you’ve decided to write a program that uses UTF-16 internally on Windows and UTF-32 on other platforms. This is going to bite you every time you need to parse text, every time you need to put text on the screen, and every time you encode data in a text format like JSON, HTML, or XML.

Making Everything Way Worse

As an aside, there is an alternative to wchar_t that is actually worse. You can use both wchar_t and char on Windows, depending on your build settings.

Microsoft’s passion for backwards-compatibility led to the invention of the _T() macro, along with a bunch of other macros for selecting the Unicode or non-Unicode versions of functions and data structures, so that the same program can be compiled to use either Unicode or a Windows code page depending on how it is built.

static const TCHAR WindowClass[] = _T("myWClass"),
                   Title[] = _T("My Win32 App");

HWND hWnd = CreateWindow(WindowClass, Title, WS_OVERLAPPEDWINDOW,
                         CW_USEDEFAULT, CW_USEDEFAULT, 640, 480,
                         NULL, NULL, hInstance, NULL);

The _T() macro turns text into either wchar_t or char, depending on the build settings. Likewise, CreateWindow is a macro which is either CreateWindowA or CreateWindowW depending on build configuration, and they have different type signatures.

If you’re stuck with legacy code, this may be the only way to get it running with Unicode. If you write new code this way, you are insane. Unfortunately, an enormous amount of documentation and sample code is floating around that recommends this path, and new developers get tricked into thinking that it is some kind of “best practice”. It is not.

Fortunately, there is a better way.

Just Choose Your Encodings

Nobody said Unicode is easy, but it’s much easier if you choose your encodings instead of shoehorning your code into the mess that is wchar_t.

UTF-8 is a popular choice for data processing and web applications, and UTF-16 is used by LibICU, Cocoa, and Win32.

Just make sure that whatever encoding you use, you translate it into UTF-8 when you call open on macOS, and into UTF-16 when you call CreateFileW on Windows.
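
For example, here is a rough sketch of how a program that keeps UTF-8 internally might convert at the boundary (the helper name and flags are illustrative choices, not the only ones):

#include <string>
#if defined(_WIN32)
#include <windows.h>

HANDLE open_for_reading(const std::string &utf8Path) {
    // Convert UTF-8 to UTF-16 right at the Win32 boundary.
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, nullptr, 0);
    if (n <= 0)
        return INVALID_HANDLE_VALUE;
    std::wstring widePath(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, &widePath[0], n);
    widePath.resize(n - 1); // n counts the terminator; drop the extra one
    return CreateFileW(widePath.c_str(), GENERIC_READ, FILE_SHARE_READ,
                       nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
}
#else
#include <fcntl.h>

int open_for_reading(const std::string &utf8Path) {
    // On macOS and Linux the UTF-8 bytes go straight to the system call.
    return open(utf8Path.c_str(), O_RDONLY);
}
#endif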