Curated sample training datasets for Machine Learning for Kids

Machine Learning for Kids now includes a curated collection of training datasets, so that children can create different types of machine learning projects.



The tool lets children make things using machine learning. The principle I’ve worked to is that children train their own machine learning models, as doing this is a great way to teach them about how this tech works.

Preparing their own training data is a useful exercise, but it is time-consuming. The project worksheets I’ve written so far all assume that the student will prepare the training data within a single lesson. This has limited the kinds of ML projects I’ve been able to include.

This new feature enables some more ambitious projects.

I can now prepare training datasets that it wouldn’t be practical to expect a school child to create in a lesson.

As an example of this, I’ve started by adding a dataset about Titanic passengers. Students can use this to train a model that predicts how likely a passenger would have been to survive the sinking of the Titanic, given some details about them (e.g. male/female, how much they spent on their ticket, age, etc.).



This is a very quick and easy model to train, and it performs very well: computers can easily learn to spot the sorts of patterns you’d expect, like “women and children first”. But it’s not the sort of project I’ve been able to do with students before, as getting a child to type in hundreds of passenger records as training examples wouldn’t be practical.
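To give a feel for the kind of pattern a computer can pick up here, this is a minimal Python sketch. It is not how Machine Learning for Kids trains its models, and the records are invented for illustration rather than taken from the real passenger data. The names TOY_PASSENGERS, group_of, train and predict are all hypothetical. It just learns the survival rate for a few coarse groups and uses that as its prediction.

```python
from collections import defaultdict

# Invented records for illustration only; NOT real Titanic passenger data.
# Each record is (sex, age, survived).
TOY_PASSENGERS = [
    ("female", 29, True),
    ("female", 8, True),
    ("female", 45, False),
    ("male", 6, True),
    ("male", 35, False),
    ("male", 50, False),
    ("male", 22, False),
    ("male", 40, True),
]


def group_of(sex, age):
    """Bucket a passenger into a coarse group (the age cut-off is an assumption)."""
    return "child" if age < 16 else sex


def train(records):
    """Learn the observed survival rate for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for sex, age, survived in records:
        tally = counts[group_of(sex, age)]
        tally[0] += int(survived)
        tally[1] += 1
    return {group: s / t for group, (s, t) in counts.items()}


def predict(model, sex, age):
    """Return the estimated chance of survival for a new passenger."""
    return model[group_of(sex, age)]


model = train(TOY_PASSENGERS)
```

Calling predict(model, "male", 30) returns the share of adult men in the toy records who survived. A real model would use more of the available details, such as the ticket price, and far more examples.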

This feature also enables some different types of projects.

If a student picks one of these template projects, they get their own copy of the training data in their workspace, which they can modify or add to. They can use it as a bootstrap for their own projects, not just as a finished set.

I’ve got ideas for things I could do with this. For example, I could provide intentionally biased datasets as starter templates. Students could train a model using one as-is and try it out in a Scratch project. Then the exercise could be to work out for themselves that their ML model is biased, review the training data to figure out why it’s biased, and then fix their copy.

Improving a biased ML model is an interesting idea for a student project, but it’s been difficult to do this well. Asking a student to intentionally create biased training data, and then asking them to fix it, is a slightly artificial exercise. It sort of works, but the fact that the student has to put the problems in the training data in the first place reduces the challenge of finding and fixing them.

Now I have a mechanism for crafting datasets with particular problems, and setting students the challenge of finding and fixing them.
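One simple way a student might start reviewing a dataset is to count how many examples there are for each label. The Python sketch below does that, assuming the training data is a list of (text, label) pairs. The examples, the function names and the 25% threshold are all made up for illustration, and an uneven split between labels is only one of the ways a dataset can be biased.

```python
from collections import Counter

# Invented training examples for illustration: (text, label).
TOY_TRAINING_DATA = [
    ("what a lovely day", "happy"),
    ("I love this", "happy"),
    ("this is brilliant", "happy"),
    ("best day ever", "happy"),
    ("that was great fun", "happy"),
    ("I feel awful", "sad"),
]


def label_balance(examples):
    """Return each label's share of the training examples."""
    counts = Counter(label for _text, label in examples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def underrepresented(examples, threshold=0.25):
    """List the labels whose share falls below the (arbitrary) threshold."""
    shares = label_balance(examples)
    return sorted(label for label, share in shares.items() if share < threshold)


sparse_labels = underrepresented(TOY_TRAINING_DATA)
```

Here sparse_labels comes out as ["sad"], because only one of the six examples has that label. Fixing the copy would then mean adding more examples for that label, or removing some of the others.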

What other datasets could I add?

The work I’ve put into this so far has been to build the mechanism for hosting these datasets, and allowing students to copy them into their own projects.

I’ve put a few quick datasets in there to get started: photos of cats and dogs, headlines from UK newspapers, and the Titanic passengers data.

Next, I want to start fleshing these out.

Can you think of any (ideally public domain) datasets that would enable some interesting machine learning projects for school age children?

Please give me a shout if you’ve got any ideas. Emails, tweets or new GitHub issues are all welcome. (Or even throw me a pull request if you’ve got the data to hand!)
