I got an email on Friday from a German guy called Bernard. He uses my wiki note-taking app that I wrote to play with the Windows Mobile SDK (that in itself was a surprise!).
He asked if I’d add support for accented characters to it, as he (unsurprisingly, being German!) wanted to use German characters in his notes. That was an easy enough fix – just add a lookup table to the wiki markup parser which replaces each accented character with its HTML entity equivalent.
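For what it’s worth, here’s a minimal sketch of that kind of lookup table. I’m assuming the app is written in C#, and the class and method names are hypothetical – the real parser isn’t shown in this post:

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: replace accented characters with their HTML entities.
static class EntityLookup
{
    // Lookup table for the German characters Bernard needed.
    static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'ä', "&auml;" }, { 'ö', "&ouml;" }, { 'ü', "&uuml;" },
        { 'Ä', "&Auml;" }, { 'Ö', "&Ouml;" }, { 'Ü', "&Uuml;" },
        { 'ß', "&szlig;" }
    };

    public static string Encode(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            string entity;
            if (Map.TryGetValue(c, out entity))
                sb.Append(entity);   // known accented character: emit the entity
            else
                sb.Append(c);        // anything else passes through unchanged
        }
        return sb.ToString();
    }
}
```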
Hurrah – I could feel suitably smug for making it a little less English-centric.
A guy called Alex brought me back down to earth on Saturday morning with an email pointing out that when he uses my wiki note-taking app (wow – how many people are using this??), the Chinese characters it displays in ‘View’ mode are different from the ones he enters in ‘Edit’ mode. Chinese? Eeek… this isn’t something I knew about.
A bit of research (with Alex’s help) showed that the answer was actually pretty simple. Joel Spolsky has a brilliant introduction to the topic which helped me get my head around it.
Slightly concerningly…
I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.
A little scary. Still, he lives in the US, so I’m probably safe.
The aim of the article is to:
fill you in on exactly what every working programmer should know. All that stuff about “plain text == ascii == characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.
A little harsh? Still, any smugness I might have felt after fixing Bernard’s problem was now long gone 🙂
To get back to the problem Alex was having – the wrong characters were being displayed in the WebBrowser component (an embedded Internet Explorer control) used in the wiki’s ‘View’ mode. To quote from Joel’s article some more:
Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends.
This was the problem – the HTML that my parser was spitting out didn’t identify an encoding. The fix seems easy enough. I can either:
- Tell users to right-click on the WebBrowser control, and use the ‘Encoding’ menu to choose something that makes it look right. This should work, but it’s a little icky. Or:
- Add a <meta> tag to the head of the HTML that the wiki markup parser produces, identifying UTF-8 as the encoding:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
It seems that UTF-8 is a good choice for any web app that uses characters beyond plain ASCII, whether they’re accented European letters or Chinese.
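Of course, declaring UTF-8 in the meta tag only works if the bytes actually are UTF-8. Here’s a rough sketch of how the generated page could be written out – again assuming C#, and assuming the parser writes its output to a file for the WebBrowser control to load (the class, method, and file path here are made up for illustration):

```csharp
using System.IO;
using System.Text;

class HtmlWriter
{
    public static void WriteView(string body, string path)
    {
        // StreamWriter with Encoding.UTF8 writes a UTF-8 byte stream (with a
        // byte-order mark), so the bytes on disk match the charset declared
        // in the <meta> tag below.
        using (StreamWriter writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.WriteLine("<html>");
            writer.WriteLine("<head>");
            writer.WriteLine("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
            writer.WriteLine("</head>");
            writer.WriteLine("<body>");
            writer.WriteLine(body);
            writer.WriteLine("</body>");
            writer.WriteLine("</html>");
        }
    }
}
```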
This has been interesting. Globalization/Localization wasn’t at the front of my mind when hacking together an app for my own personal use, so I think I’ve got a reasonable excuse for not looking into this earlier. But I’m glad I can do a few simple things to make it a little better.