Supporting different languages

I got an email on Friday from a German guy called Bernard. He uses the wiki note-taking app that I wrote as a way of playing with the Windows Mobile SDK (that anyone else uses it was a surprise in itself!).

He asked if I’d add support for accented characters, as he (unsurprisingly, being German!) wanted to use German characters in his notes. That was an easy enough fix – just add a lookup table to the wiki markup parser which replaces each accented character with the equivalent HTML character reference (such as &uuml; for ü).
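
For illustration, here is a minimal sketch of the lookup-table idea in Python. It is not the app's actual code: the table contents and the names ENTITY_TABLE and escape_accents are made up for this example, and a real table would need an entry for every character it is meant to handle.

```python
# Hypothetical lookup table: a handful of German characters mapped to
# their named HTML character references.
ENTITY_TABLE = {
    "ä": "&auml;",
    "ö": "&ouml;",
    "ü": "&uuml;",
    "Ä": "&Auml;",
    "Ö": "&Ouml;",
    "Ü": "&Uuml;",
    "ß": "&szlig;",
}


def escape_accents(text: str) -> str:
    """Replace every character found in ENTITY_TABLE; leave the rest alone."""
    return "".join(ENTITY_TABLE.get(ch, ch) for ch in text)


print(escape_accents("Grüße aus München"))
# prints: Gr&uuml;&szlig;e aus M&uuml;nchen
```

The obvious weakness is that anything missing from the table passes through untouched, so this approach only ever covers the characters somebody thought to list.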

Hurrah – I could feel suitably smug for making it a little less English-centric.

A guy called Alex brought me back down to earth on Saturday morning with an email pointing out that when he uses my wiki note-taking app (wow – how many people are using this??), the Chinese characters it displays in ‘View’ mode are not the ones he entered in ‘Edit’ mode. Chinese? Eeek… this isn’t something I knew anything about.

A bit of research (with Alex’s help) showed that the answer was actually pretty simple. Joel Spolsky has a brilliant introduction to the topic which helped me get my head around it.

Slightly concerningly, it opens with a threat:

I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

A little scary. Still, he lives in the US, so I’m probably safe.

The aim of the article is to:

fill you in on exactly what every working programmer should know. All that stuff about “plain text == ascii == characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

A little harsh? Still, any smugness I might have felt after fixing Bernard’s problem was now long gone 🙂

To get back to the problem Alex was having – the wrong characters were being displayed in the WebBrowser component (an embedded Internet Explorer control) used in the wiki’s ‘View’ mode. To quote from Joel’s article some more:

Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends.
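
A small Python sketch shows what Joel means. The same bytes are decoded twice, once as UTF-8 and once as Windows-1252. The sample text and the choice of Windows-1252 as the wrong guess are assumptions for the demonstration; I don't know which encoding the browser control actually fell back to in Alex's case.

```python
text = "中文"  # sample text, chosen only for this demonstration

# Encoding turns characters into bytes; each of these characters
# takes three bytes in UTF-8.
raw = text.encode("utf-8")

# Decoding with the encoding that was actually used gives the text back.
as_utf8 = raw.decode("utf-8")

# Decoding the very same bytes with a wrong guess gives something else.
# errors="replace" keeps the sketch from raising on undefined byte values.
as_cp1252 = raw.decode("cp1252", errors="replace")

print(len(raw))           # 6
print(as_utf8 == text)    # True
print(as_cp1252 == text)  # False
print(len(as_cp1252))     # 6: one wrong character per byte
```

Nothing in the bytes themselves says which reading is right, which is why whoever displays them has to be told.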

This was the problem – the HTML that my parser was spitting out didn’t identify an encoding. The answer seems to be easy. I can either:

  1. Tell users to right-click on the WebBrowser control, and use the ‘Encoding’ menu to choose something that makes it look right. This should work, but it’s a little icky. Or:
  2. Add something to the header of the HTML file that the Wiki markup parser produces which identifies UTF-8 as the encoding:
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

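It seems that UTF-8 is a good choice for any web app that has to handle non-European characters, since it can represent every Unicode character. The declaration only helps, of course, if the HTML really is encoded as UTF-8.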
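
Option 2 could be sketched in Python along these lines. The function name build_page and the exact page skeleton are assumptions made for the example, not what the wiki parser really emits. The point is that the declared charset and the encoding used for the bytes are chosen together.

```python
def build_page(body_html: str) -> bytes:
    """Wrap parser output in a page that declares UTF-8, then encode it as UTF-8."""
    page = (
        "<html>\n"
        "<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "</head>\n"
        "<body>\n"
        f"{body_html}\n"
        "</body>\n"
        "</html>\n"
    )
    # The bytes handed to the browser control must match the declared charset.
    return page.encode("utf-8")


html_bytes = build_page("<p>中文 and Grüße</p>")
print(b"charset=utf-8" in html_bytes)  # True
```

Keeping the meta tag and the encode call in one function makes it harder for the two to drift apart later.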

This has been interesting. Globalization/Localization wasn’t at the front of my mind when hacking together an app for my own personal use, so I think I’ve got a reasonable excuse for not looking into this earlier. But I’m glad I can do a few simple things to make it a little better.
