Posting to Twitter… carefully

I’ve recently picked up the code for my Windows Mobile Twitter client again.

It was originally written back in April as a hackday idea. The code posts Twitter updates using a variation on the twitter-from-curl approach of HTTP POSTing “status=MyTweet” to the Twitter update URL.

I started with the update URL, and appended the message I wanted to tweet. This is fine for a quick hackday demo, but it did mean that you could end up with a URL like:

http://twitter.com/statuses/update.xml?status=Hello (twitter) world! Special chars = a problem?

This fails if the tweet contains accented characters, or characters that have a special meaning in URLs, such as + ? / and &.

I was encouraged by a number of users to have another look at this, which I’ve done now, and hopefully version 1.1 solves the problems.

A quick Google showed that a number of other Twitter apps share at least some of the problems that mine had, so I thought I’d share the fix here.

Step 1 – Url-encode

Okay, so technically I did this on the night of HackDay (it might have been a hack, but even I know that spaces in a URL aren’t the best idea!) but I’m including it here for completeness.

My app was written in C++, so I used InternetCanonicalizeUrl to turn the tweet-posting URL into something a little safer – a percent-encoded URI. For example, this turned spaces in the message into %20.
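For anyone who isn’t on Windows, here is a minimal sketch of what percent-encoding involves. This is not what InternetCanonicalizeUrl does internally (it leaves more characters alone than this does). percentEncode is a hypothetical helper of my own naming, and it assumes the input is already a string of bytes in whatever encoding the server expects.

```cpp
#include <string>

// Hypothetical helper: percent-encode every byte that is not an
// RFC 3986 "unreserved" character (letters, digits, '-', '.', '_', '~').
// The input is treated as raw bytes, not as characters.
std::string percentEncode(const std::string& bytes)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 3);   // worst case: every byte is escaped

    for (unsigned char c : bytes)
    {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';

        if (unreserved)
        {
            out += static_cast<char>(c);
        }
        else
        {
            out += '%';
            out += hex[c >> 4];      // high nibble
            out += hex[c & 0x0F];    // low nibble
        }
    }
    return out;
}
```

With this, percentEncode("Hello (twitter) world!") gives Hello%20%28twitter%29%20world%21, which is safe to put after status= in the URL.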

Step 2 – Specifying a content type

A few twitter users such as @reVoid and @michaelmcmillan reported problems when they tried to tweet ÆØÅ characters.

After a bit of experimentation, the answer turned out to be to add charset=utf-8 to the HTTP headers I send when I post.

TCHAR header[]  = TEXT("Content-Type:application/x-www-form-urlencoded;charset=utf-8");

A quick play with accented characters like é and á seemed to suggest that this would get accents working.

Step 3 – UTF-8 encoding

This appeared to fix most accented characters. But then @walti pointed out that my code still broke when posting certain characters with German umlauts. Either the next character after each umlaut would be lost when posted, or sometimes the whole remainder of the tweet after an umlaut would be lost.

Googling showed that other apps such as Twitpic, Ping.fm and betwittered shared this bug.

I didn’t figure this one out for myself, but the answer was given to me by a helpful user on the Twitter API Google Group.

It seems that url encoding isn’t sufficient. You also need to UTF-8 encode the characters. For example, when posting ö, it wasn’t enough for me to url-encode it and send it as %F6 (its single-byte Latin-1 value). It needed to be sent as %C3%B6, which is the url-encoded form of the two bytes that represent ö in UTF-8.
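To show where %C3%B6 comes from, here is a rough sketch of the general rule: UTF-8 encode the character first, then url-encode each of the resulting bytes. The function names are hypothetical, and the sketch assumes it is given a valid Unicode code point (no surrogates, nothing above 0x10FFFF). It escapes every byte, including plain ASCII, which is valid but more verbose than it needs to be.

```cpp
#include <string>

// Append one byte to the output as a %XX escape.
static void appendEscaped(std::string& out, unsigned char byte)
{
    static const char hex[] = "0123456789ABCDEF";
    out += '%';
    out += hex[byte >> 4];
    out += hex[byte & 0x0F];
}

// Hypothetical helper: UTF-8 encode one code point and escape each byte.
// Assumes cp is a valid Unicode scalar value.
std::string escapeCodePointAsUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        // 1 byte: 0xxxxxxx
        appendEscaped(out, static_cast<unsigned char>(cp));
    }
    else if (cp < 0x800)
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        appendEscaped(out, static_cast<unsigned char>(0xC0 | (cp >> 6)));
        appendEscaped(out, static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        appendEscaped(out, static_cast<unsigned char>(0xE0 | (cp >> 12)));
        appendEscaped(out, static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        appendEscaped(out, static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        appendEscaped(out, static_cast<unsigned char>(0xF0 | (cp >> 18)));
        appendEscaped(out, static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
        appendEscaped(out, static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        appendEscaped(out, static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For ö (code point 0xF6) this returns %C3%B6. Every character from 0x80 to 0x7FF takes the two-byte branch, which is why the accented Latin-1 characters all end up with a %C2 or %C3 prefix.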

I wrote a quick-and-dirty encoder to handle this:

//---------------------------------------------------
// doing a little UTF-8 encoding...
//
// lpszEncTweetMessage is the url-encoded tweet (dwNewSz characters).
// Escapes for accented characters are rewritten as the url-encoded
// UTF-8 bytes of the same character:
//   %Ax, %Bx     becomes  %C2%Ax, %C2%Bx
//   %Cx to %Fx   becomes  %C3%8x to %C3%Bx
//---------------------------------------------------

// an escape can at most double in length; +1 leaves room for the terminator
LPWSTR utf8encodedTweet = new TCHAR[dwNewSz * 3 + 1];
newptr = 0;
for (int i=0; i < dwNewSz; i++)
{
    if (lpszEncTweetMessage[i] == '%')
    {
        if (lpszEncTweetMessage[i+1] == 'A' || lpszEncTweetMessage[i+1] == 'B')
        {
            // add the %C2 prefix only; the original "%Ax" or "%Bx"
            // is copied unchanged by the normal path below
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = 'C';
            utf8encodedTweet[newptr++] = '2';
        }
        else if (lpszEncTweetMessage[i+1] == 'C')
        {
            // %Cx becomes %C3%8x; skip ahead to x, which is copied below
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = 'C';
            utf8encodedTweet[newptr++] = '3';
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = '8';
            i += 2;
        }
        else if (lpszEncTweetMessage[i+1] == 'D')
        {
            // %Dx becomes %C3%9x
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = 'C';
            utf8encodedTweet[newptr++] = '3';
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = '9';
            i += 2;
        }
        else if (lpszEncTweetMessage[i+1] == 'E')
        {
            // %Ex becomes %C3%Ax
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = 'C';
            utf8encodedTweet[newptr++] = '3';
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = 'A';
            i += 2;
        }
        else if (lpszEncTweetMessage[i+1] == 'F')
        {
            // %Fx becomes %C3%Bx
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = 'C';
            utf8encodedTweet[newptr++] = '3';
            utf8encodedTweet[newptr++] = '%';
            utf8encodedTweet[newptr++] = 'B';
            i += 2;
        }
    }

    // copy the current character (the low hex digit, if we skipped ahead)
    utf8encodedTweet[newptr++] = lpszEncTweetMessage[i];
}
utf8encodedTweet[newptr] = '\0';

Is that it?

I think that covers all the bases...

No doubt someone will point out otherwise before too long. 🙂

One Response to “Posting to Twitter… carefully”

  1. Dave Mc says:

    Thanks mate. Incidentally, I’m developing for BlackBerry devices and the little class they provide to do URL encoding can also do UTF-8 encoding…so long as you tell it to do so, which I now am.