{"id":5669,"date":"2025-10-18T19:01:12","date_gmt":"2025-10-18T19:01:12","guid":{"rendered":"https:\/\/dalelane.co.uk\/blog\/?p=5669"},"modified":"2026-03-14T21:22:52","modified_gmt":"2026-03-14T21:22:52","slug":"introducing-llm-benchmarks-using-scratch","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=5669","title":{"rendered":"Introducing LLM benchmarks using Scratch"},"content":{"rendered":"<p><strong>In this post, I want to share <a href=\"https:\/\/machinelearningforkids.co.uk\/#!\/worksheets?worksheet=Benchmark\">a recent worksheet<\/a> I wrote for <a href=\"https:\/\/machinelearningforkids.co.uk\">Machine Learning for Kids<\/a>. It is perhaps a little on the technical side, but I think there is an interesting idea in here. <\/strong><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/worksheet.png?raw=true\" style=\"border: thin black solid; width: 100%; max-width: 650px;\"\/><\/p>\n<h3>The lesson behind this project<\/h3>\n<p>The idea for this project was to get students thinking about the differences between different language models.<\/p>\n<p>There isn&#8217;t a single &#8220;best&#8221; model that is best at every task. Each model can be good at some tasks, and less good at others.<\/p>\n<p>The best model for a specific task isn&#8217;t necessarily going to be the largest and most complex model. 
Smaller and simpler models can be better at some tasks than larger models.<\/p>\n<p>And we can identify how good each model is at a specific task by testing it at that task.<\/p>\n<p><!--more--><\/p>\n<h3>The project<\/h3>\n<p>In this project, students create a mini benchmark in <a href=\"https:\/\/scratch.mit.edu\">Scratch<\/a>, and test a variety of language models to see the answers that the models give to a specific type of question.<\/p>\n<p>Students compare the different language models along two main dimensions:<\/p>\n<ul>\n<li><strong>accuracy<\/strong> &#8211; how many questions the model gives a correct response to<\/li>\n<li><strong>time<\/strong> &#8211; how quickly the model generates responses<\/li>\n<\/ul>\n<p>They use Scratch to create tests that measure these, and also to generate visualisations that give them a deeper insight into each model&#8217;s behaviour.<\/p>\n<h3>The task<\/h3>\n<p>The task I chose for this project is to ask language models to answer maths questions &#8211; specifically simple addition sums (e.g. 
&#8220;What is 12 + 74?&#8221;).<\/p>\n<p>I chose this task for several reasons:<\/p>\n<ul>\n<li>It is easy for students to understand<\/li>\n<li>It is easy to generate a large number of test questions using Scratch code<\/li>\n<li>It is easy to automate checking the model&#8217;s answer in Scratch code, without needing to create a ground truth<\/li>\n<li>It is a task that some smaller and simpler models can outperform larger and more complex models at.<\/li>\n<\/ul>\n<h3>The models<\/h3>\n<p>The models that students use in the project are:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/HuggingFaceTB\/SmolLM2-135M-Instruct\">SmolLM2-135M-Instruct<\/a> (Hugging Face)<\/li>\n<li><a href=\"https:\/\/huggingface.co\/Qwen\/Qwen2.5-0.5B-Instruct\">Qwen2.5-0.5B-Instruct<\/a> (Alibaba)<\/li>\n<li><a href=\"https:\/\/huggingface.co\/TinyLlama\/TinyLlama-1.1B-Chat-v1.0\">TinyLlama-1.1B-Chat-v1.0<\/a> (Singapore Uni of Technology &amp; Design)<\/li>\n<li><a href=\"https:\/\/huggingface.co\/meta-llama\/Llama-3.2-1B-Instruct\">Llama-3.2-1B-Instruct<\/a> (Meta)<\/li>\n<li><a href=\"https:\/\/huggingface.co\/microsoft\/phi-1_5\">phi-1_5<\/a> (Microsoft)<\/li>\n<li><a href=\"https:\/\/huggingface.co\/stabilityai\/stablelm-2-zephyr-1_6b\">stablelm-2-zephyr-1_6b<\/a> (Stability AI)<\/li>\n<li><a href=\"https:\/\/huggingface.co\/google\/gemma-2-2b-it\">gemma-2-2b-it<\/a> (Google)<\/li>\n<li><a href=\"https:\/\/huggingface.co\/togethercomputer\/RedPajama-INCITE-Chat-3B-v1\">RedPajama-INCITE-Chat-3B-v1<\/a> (Together)<\/li>\n<\/ul>\n<p>The approach is that students divide up the models between themselves, so collectively the class can test all of the models. 
Smaller classes or code clubs might only test a subset of the models, but still follow the same process.<\/p>\n<p>I selected these models to give a range of sizes, and the <a href=\"https:\/\/machinelearningforkids.co.uk\">Machine Learning for Kids<\/a> site displays interactive graphs to show the relative size and complexity of each model.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/models.png?raw=true\" style=\"border: thin black solid; width: 100%; max-width: 650px;\"\/><\/p>\n<h3>Accuracy<\/h3>\n<p>When I&#8217;ve tested the project, I&#8217;ve seen accuracy results like this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/accuracy.png?raw=true\" style=\"border: thin black solid; width: 100%; max-width: 650px;\"\/><br \/>\n<small>percentage of questions that the model gave the correct answer to<\/small><\/p>\n<p>The actual test results that students get are less important than the opportunity for them to compare the accuracy results from their own testing with the model size and complexity. At the least, they&#8217;ll see that the performance of each model is different. Hopefully they&#8217;ll see for themselves that larger and more complex models are not always better at every task than smaller models.<\/p>\n<p>(Gemma is obviously very good at this, but I was most impressed with how Qwen performed given its size!)<\/p>\n<h3>Temperature and Top-P<\/h3>\n<p>In the worksheet, I also get students to experiment with temperature and Top-P settings. 
This is a bit of a tangent, given the objectives for this project, and that I already <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5538\">covered this in other worksheets<\/a>, but I think it&#8217;s interesting.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/temperature.png?raw=true\" style=\"border: thin black solid; width: 100%; max-width: 650px;\"\/><br \/>\n<small>percentage of questions that the model gave the correct answer to<\/small><\/p>\n<p>My expectation, and what I observed when I tested the project, is that lower temperature and Top-P settings will result in a small improvement in accuracy. The <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5538\">Language models<\/a> worksheet went into the intuition for why in more depth than this project does &#8211; but I still thought it was a good lesson to revisit here.<\/p>\n<h3>Visualisations<\/h3>\n<p>Scratch is a fun tool that enables students to easily create data visualisations. In this project, students use Scratch to plot the questions that are submitted to the language model on a graph.<\/p>\n<p>For example, if the question &#8220;What is 10 + 1000?&#8221; is generated, it would be plotted here:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/graph.png?raw=true\" style=\"width: 100%; max-width: 650px;\"\/><br \/>\n<small>(<em>I&#8217;m using a logarithmic scale to encourage students to see the model&#8217;s performance with a wider range of inputs&#8230; hopefully this won&#8217;t confuse them!<\/em>)<\/small><\/p>\n<p>The point is coloured green if the model gives the correct answer, and red if the model gives an incorrect answer.<\/p>\n<p>This underlines just how good Gemma is at this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/gemma-low.png?raw=true\" style=\"width: 100%; max-width: 650px;\"\/><br \/>\n<small>my results from a test with 
Gemma<\/small><\/p>\n<p>And it shows how badly models like RedPajama performed:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/redpajama-high.png?raw=true\" style=\"width: 100%; max-width: 650px;\"\/><br \/>\n<small>my results from a test with RedPajama<\/small><\/p>\n<p>More importantly, I wanted this to help students get a better idea of the model&#8217;s behaviour than they can get from an overall accuracy score alone.<\/p>\n<p>For example, the relatively low accuracy score that I got when I tried testing Phi is not as interesting as seeing that it gives correct answers for sums with small numbers, but then gets things wrong with larger numbers.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/phi-low.png?raw=true\" style=\"width: 100%; max-width: 650px;\"\/><br \/>\n<small>my results from a test with Phi<\/small><\/p>\n<p>Intuitively, this seems reasonable for a language model that is treating the sums as regular English sentences to predict the most likely next token. Small sums like &#8220;What is 2 + 3?&#8221; are presumably relatively common in general documents, and large sums like &#8220;What is 9128 + 1724?&#8221; are perhaps unlikely to show up at all.<\/p>\n<p>Qwen similarly seemed to struggle with very large numbers, but was still giving correct answers into the low-thousands.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/qwen-low.png?raw=true\" style=\"width: 100%; max-width: 650px;\"\/><br \/>\n<small>my results from a test with Qwen<\/small><\/p>\n<p>A little surprisingly, Tiny Llama seems to do better with larger numbers as the first number in a sum (e.g. &#8220;What is 4832 + 5?&#8221; which it tends to answer correctly) than it does with larger numbers as the second number in a sum (e.g. 
&#8220;What is 5 + 4832?&#8221; which it tends to answer incorrectly).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-10-18-benchmark\/tinyllama-low.png?raw=true\" style=\"width: 100%; max-width: 650px;\"\/><br \/>\n<small>my results from a test with Tiny Llama<\/small><\/p>\n<p>Overall, the aim is to get students to use Scratch to fire a lot of questions at their language model, and be creative in finding ways to display the answers that the model gives. And then to compare, consider, and discuss the different behaviours of each of the models that they are testing.<\/p>\n<p><iframe loading=\"lazy\" width=\"450\" height=\"270\" src=\"https:\/\/www.youtube.com\/embed\/EDWqGDeRP0U?si=C6YZKhC07CmRgF5g\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><br \/>\n<small><a href=\"https:\/\/youtu.be\/EDWqGDeRP0U\">youtu.be\/EDWqGDeRP0U<\/a><\/small><\/p>\n<h3>Accuracy isn&#8217;t the only important thing<\/h3>\n<p>I wanted the worksheet to help make the point that accuracy isn&#8217;t the only consideration when AI projects choose a model. Consider model A with an accuracy of 60% in testing and model B with an accuracy of 90% in testing.<\/p>\n<p>I use the model size (number of parameters) as a proxy for model cost to get students to consider whether they would still choose model B if model B was six times more expensive to use than model A. 
For some jobs, where cost is important, maybe 60% accuracy is good enough?<\/p>\n<p>Scratch has built-in timer blocks, which make it easy for students to time how long their projects take to run &#8211; so I similarly ask them to consider when faster models might be a better choice, even if their accuracy is a little lower.<\/p>\n<h3>Learning about real benchmarks<\/h3>\n<p>Obviously, in the interest of keeping things simple and accessible, students are running a trivially simple test in this project. I&#8217;ve included some pointers to <a href=\"https:\/\/ibm.biz\/benchmark\">descriptions of real LLM benchmarks<\/a> &#8211; and hopefully that will make more sense for students after they&#8217;ve gone through this taster first.<\/p>\n<h3>Feedback is welcome!<\/h3>\n<p>As I said at the top, this is certainly one of my drier and more technical worksheets. I hope there is still something in here that students will enjoy, but I haven&#8217;t had a chance to try it with a class yet. I&#8217;m sure that I&#8217;ll come back and improve it once I&#8217;ve had the opportunity to see how a class reacts to it, but &#8211; as always &#8211; if you try it with a class or code club, I&#8217;d love to hear what works and what could be improved.<\/p>\n<p>You can find the worksheet at <a href=\"https:\/\/machinelearningforkids.co.uk\/#!\/worksheets?worksheet=Benchmark\">MachineLearningForKids.co.uk\/worksheets<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, I want to share a recent worksheet I wrote for Machine Learning for Kids. It is perhaps a little on the technical side, but I think there is an interesting idea in here. 
The lesson behind this project The idea for this project was to get students thinking about the differences between [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":5670,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[580,587,536],"class_list":["post-5669","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech","tag-machine-learning","tag-mlforkids-tech","tag-scratch"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5669","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5669"}],"version-history":[{"count":1,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5669\/revisions"}],"predecessor-version":[{"id":5892,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5669\/revisions\/5892"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/media\/5670"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5669"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5669"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5669"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}