Machine Learning for Kids outage report

Machine Learning for Kids was unavailable for most of 29th May 2018. I wanted to share what happened and what I’m doing about it.

The site is hosted in IBM Cloud. I have run multiple instances of the site in parallel for reliability and availability since I first launched the site. However, all of these were deployed into the same location – the “US South” region in Dallas, US.

At approximately 10am on 29th May (UK time), a major routing failure hit applications running in the US South region in IBM Cloud. This meant that although the Machine Learning for Kids application instances were still running, the requests from people’s web browsers weren’t getting routed to them. The application was essentially cut off from the outside world.

The application kept running throughout the day, and its logs show that virtually no web requests reached it. This situation persisted until very late in the evening (UK time).

Routing was restored before midnight, and the application seems to be accessible again now.

I have no way of knowing how many people tried to access the site during this time. I know of many who emailed me to ask what was going on, but I imagine there will be many others who just gave up without telling me. To all of them, I’d like to say that I am very, very sorry for any inconvenience that this will have caused. I know what it’s like to plan an activity to run with a school class only to have it derailed without notice or warning – and I feel very disappointed and frustrated to know that I caused this for some. I want schools and code clubs to feel that they can rely on the site as a resource.

Although this was started by an infrastructure failure, I made the site vulnerable to this sort of outage by putting all of the instances of the application into the same physical region. I’m working on fixing this now. In future, I will run instances of the application in multiple regions – to start with, US (Dallas) and UK (London). I’m setting up DNS failover using Cloudflare so that web requests will be automatically routed to a working region. In the event of another IBM Cloud region outage like the one I saw here, the site should remain accessible as long as at least one of the IBM Cloud regions is still functional.
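
For anyone curious what "failover" means in practice, here is a rough sketch of the idea in Python. It is not how Cloudflare implements it, and it is not the site's actual configuration: the region names, the health check URLs and the function names are all made up for illustration. The principle is to check each region in priority order and send traffic to the first one that responds.

```python
from typing import Callable, Optional, Sequence, Tuple
from urllib.request import urlopen

# Hypothetical regions and health check URLs, in priority order.
# These are placeholders, not the site's real endpoints.
REGIONS: Sequence[Tuple[str, str]] = (
    ("us-dallas", "https://us.example.invalid/health"),
    ("uk-london", "https://uk.example.invalid/health"),
)


def http_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError (including HTTP error statuses) and socket timeouts
        # are subclasses of OSError, so an unreachable region lands here.
        return False


def pick_region(
    regions: Sequence[Tuple[str, str]],
    check: Callable[[str], bool] = http_check,
) -> Optional[str]:
    """Return the name of the first region whose health check passes."""
    for name, url in regions:
        if check(url):
            return name
    # No region responded: nothing to fail over to.
    return None


if __name__ == "__main__":
    # Simulate the Dallas routing outage: only the London check succeeds.
    def dallas_down(url: str) -> bool:
        return url.startswith("https://uk.")

    print(pick_region(REGIONS, check=dallas_down))
```

With both regions healthy, the first one in the list is chosen. With Dallas unreachable, as simulated above, requests would go to London instead. A real DNS failover service runs checks like this continuously from many locations and updates the DNS answers itself.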

This is something I should’ve done months ago. I’m sorry that it took a major failure like today’s to push me to do it.

There might be some intermittent weird behaviour over the next 24 hours as I transfer management of the tool’s DNS to Cloudflare. I’ll do my best to keep this to an absolute minimum.

If you have any questions about any of this, please don’t hesitate to get in touch.
