I got an email on Friday from a German guy called Bernard. He uses my wiki note-taking app that I wrote to play with the Windows Mobile SDK (that in itself was a surprise!).
He asked if I’d add support for accented characters to it, as he (unsurprisingly, being German!) wanted to use German characters in his notes. That was an easy enough fix – just add a lookup table to the wiki markup parser which replaces each accented character with its HTML entity equivalent.
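For what it’s worth, here’s a minimal sketch of that kind of lookup table. I’m assuming the app is written in C#, and the class and method names are hypothetical – the real parser isn’t shown in this post:

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: replace accented characters with their HTML entities.
static class EntityLookup
{
    // Lookup table for the German characters Bernard needed.
    static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'ä', "&auml;" }, { 'ö', "&ouml;" }, { 'ü', "&uuml;" },
        { 'Ä', "&Auml;" }, { 'Ö', "&Ouml;" }, { 'Ü', "&Uuml;" },
        { 'ß', "&szlig;" }
    };

    public static string Encode(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            string entity;
            if (Map.TryGetValue(c, out entity))
                sb.Append(entity);   // known accented character: emit the entity
            else
                sb.Append(c);        // anything else passes through unchanged
        }
        return sb.ToString();
    }
}
```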
Hurrah – I could feel suitably smug for making it a little less English-centric.
A guy called Alex brought me back down to earth on Saturday morning with an email pointing out that when he uses my wiki note-taking app (wow – how many people are using this??), the Chinese characters it displays in ‘View’ mode are different from the ones he enters in ‘Edit’ mode. Chinese? Eeek… this isn’t something I knew about.
A bit of research (with Alex’s help) showed that the answer was actually pretty simple. Joel Spolsky has a brilliant introduction to the topic which helped me get my head around it.
Slightly concerningly…
I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.
A little scary. Still, he lives in the US, so I’m probably safe.
The aim of the article is to:
fill you in on exactly what every working programmer should know. All that stuff about “plain text == ascii == characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.
A little harsh? Still, any smugness I might have felt after fixing Bernard’s problem was now long gone 🙂
To get back to the problem Alex was having – the wrong characters were being displayed in the WebBrowser component (an embedded Internet Explorer control) used in the wiki’s ‘View’ mode. To quote from Joel’s article some more:
Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends.
This was the problem – the HTML that my parser was spitting out didn’t identify an encoding. The fix seems easy enough. I can either:
- Tell users to right-click on the WebBrowser control, and use the ‘Encoding’ menu to choose something that makes it look right. This should work, but it’s a little icky. Or:
- Add a <meta> tag to the head of the HTML that the wiki markup parser produces, identifying UTF-8 as the encoding:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
It seems that UTF-8 is a good choice for any web app that uses characters beyond plain ASCII, whether they’re accented European letters or Chinese.
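Of course, declaring UTF-8 in the meta tag only works if the bytes actually are UTF-8. Here’s a rough sketch of how the generated page could be written out – again assuming C#, and assuming the parser writes its output to a file for the WebBrowser control to load (the class, method, and file path here are made up for illustration):

```csharp
using System.IO;
using System.Text;

class HtmlWriter
{
    public static void WriteView(string body, string path)
    {
        // StreamWriter with Encoding.UTF8 writes a UTF-8 byte stream (with a
        // byte-order mark), so the bytes on disk match the charset declared
        // in the <meta> tag below.
        using (StreamWriter writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.WriteLine("<html>");
            writer.WriteLine("<head>");
            writer.WriteLine("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
            writer.WriteLine("</head>");
            writer.WriteLine("<body>");
            writer.WriteLine(body);
            writer.WriteLine("</body>");
            writer.WriteLine("</html>");
        }
    }
}
```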
This has been interesting. Globalization/Localization wasn’t at the front of my mind when hacking together an app for my own personal use, so I think I’ve got a reasonable excuse for not looking into this earlier. But I’m glad I can do a few simple things to make it a little better.