{"id":3403,"date":"2016-07-15T19:43:52","date_gmt":"2016-07-15T19:43:52","guid":{"rendered":"http:\/\/dalelane.co.uk\/blog\/?p=3403"},"modified":"2016-07-16T01:50:53","modified_gmt":"2016-07-16T01:50:53","slug":"normalised-discounted-cumulative-gain","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=3403","title":{"rendered":"Normalised Discounted Cumulative Gain"},"content":{"rendered":"<p><strong>A ramble about accuracy compared with NDCG scores for evaluating effectiveness of information retrieval. It&#8217;s an introduction, so if you already know what they are, you can skip this.<\/strong><\/p>\n<p>A couple of weeks ago, <a href=\"http:\/\/dalelane.co.uk\/blog\/?p=3381\">we released a new tool<\/a> for training the <a href=\"http:\/\/www.ibm.com\/watson\/developercloud\/retrieve-rank.html\">IBM Watson Retrieve and Rank service<\/a> (a search service that uses machine learning to train it how to sort search results into the best order).<\/p>\n<p>This afternoon, I <a href=\"https:\/\/twitter.com\/dalelane\/status\/753986491848744960\">deployed a collection of small updates to the tool <\/a> and thought I&#8217;d make a few notes about what&#8217;s changed.<\/p>\n<p>Most of them are a bunch of incremental updates and minor bug fixes. <\/p>\n<p><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-upload.png\" style=\"border: thin solid black; margin-top: 5px; margin-bottom: 5px;\"\/><\/p>\n<p>For example, support for a wider range of <a href=\"https:\/\/bluemix.net\">Bluemix<\/a> environments, support for larger document cluster sizes, displaying the amount of disk space left in a cluster, and so on.<\/p>\n<p>One update in particular I thought was more interesting, and worth explaining.<\/p>\n<p><!--more-->It&#8217;s something I&#8217;ve wanted to fix for a while, but didn&#8217;t get the time to before. 
I <a href=\"http:\/\/dalelane.co.uk\/blog\/?p=3381\">mentioned in my last post<\/a> that the tool represents a minimum viable product, and that there were lots of great ideas and many possible features that there wasn&#8217;t time to include. But two things in particular bugged me &#8211; two things that I wish I&#8217;d managed to get done in time for our initial release. And this is one of them.<\/p>\n<h2>Evaluation<\/h2>\n<p>The tool needs to guide users through training the service. It also needs to evaluate how the training is going. This helps the tool decide how to drive the training next: the types of training task to give the user, the types of questions to focus on, and so on.<\/p>\n<p>It&#8217;s also something we explain to the user, so that they have a feel for how their training is going, and the impact that their work is having.<\/p>\n<p>We built this tool for someone who isn&#8217;t an expert in information retrieval or machine learning. That means we can&#8217;t show them a detailed dashboard with technical data. We can&#8217;t expect them to be able to interpret a learning curve. We decided to distill the results of the tests we run for them into a single figure. One metric that would represent a summarised view of the performance of the system they&#8217;ve trained.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-accuracy.png\" style=\"border: thin solid black; margin-top: 5px; margin-bottom: 5px;\"\/><\/p>\n<h2>Accuracy<\/h2>\n<p>For the initial release, that metric was &#8216;accuracy@5&#8217;.<\/p>\n<p>That is to say, the percentage of the questions where the system returned a correct answer in one of the first five responses.<\/p>\n<p>I could argue that this is good because it&#8217;s easiest to understand for non-technical users. 
I could argue that typical use cases for search services like Retrieve and Rank will return a short list of answers, so if it&#8217;s returning something relevant in the first five answers then it&#8217;s doing a good job, making that a useful thing to measure. But really the biggest benefit was that it was very quick and easy to implement.<\/p>\n<h3>The problems with accuracy<\/h3>\n<p>However, there&#8217;s a lot I don&#8217;t like about it.<\/p>\n<p>For example, imagine that for half the questions in the test, the service returns a list of answers like this:<\/p>\n<ol>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> A terrible answer\n<\/li>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> A truly awful answer\n<\/li>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> A ridiculous nonsensical answer\n<\/li>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> An unbelievably stupid answer\n<\/li>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> A partial but correct answer\n<\/li>\n<li>&#8230;\n<\/li>\n<\/ol>\n<p>The accuracy@5 score is <strong>50%<\/strong>.<br \/>\nHalf of the questions in the test met the requirement.<\/p>\n<p>Now imagine you spend a few days diligently rating answers to questions. You do all the training tasks the tool prepares for you to continue training.<\/p>\n<p>The test is run again. 
Imagine that for half the questions in the test, the service returns:<\/p>\n<ol>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> An amazing perfect answer\n<\/li>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> Another brilliant detailed answer\n<\/li>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> A fantastic answer\n<\/li>\n<li> <img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> A very good answer\n<\/li>\n<li> <img decoding=\"async\" 
src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-star.png\" height=16 width=16\/> A partial, but still correct and relevant answer\n<\/li>\n<li> &#8230;\n<\/li>\n<\/ol>\n<p>The accuracy@5 score is <strong>50%<\/strong>.<br \/>\nHalf of the questions in the test met the requirement.<\/p>\n<p>If that&#8217;s the only figure you look at, it looks like you haven&#8217;t achieved anything in all the training work you&#8217;ve done since the last test. It looks like you haven&#8217;t improved anything.<\/p>\n<p>The metric fails to really describe the experience that someone would get from using the service, because it doesn&#8217;t differentiate between an answer that&#8217;s just okay and an answer that&#8217;s perfect.<br \/>\nAnd it doesn&#8217;t differentiate between returning the answer at the top of the list, and returning an answer further down the list.<br \/>\nAnd it doesn&#8217;t reflect the difference between easy questions and hard questions.<br \/>\nAnd it doesn&#8217;t acknowledge questions where a correct answer isn&#8217;t even in any of your documents.<br \/>\nAnd so on.<\/p>\n<p>If the tool is only going to show a single metric, these sorts of issues are a problem. It&#8217;s not enough to let users know how their training is going. 
And this is what I mean by something that has bugged me.<\/p>\n<h2>Normalised Discounted Cumulative Gain<\/h2>\n<p>Now that the tool is out, I&#8217;ve had some time to try a variety of alternative approaches, and I&#8217;ve settled on using NDCG scores.<\/p>\n<p>NDCG stands for <strong>Normalised Discounted Cumulative Gain<\/strong>, and addresses most of the problems I had with using accuracy scores.<\/p>\n<p>The easiest way to explain it is to start at the end, and work backwards&#8230;<\/p>\n<h3>Gain<\/h3>\n<p>You assign a value that represents the amount of gain you get from an answer that the system returns, based on how relevant it is. The more relevant, the more gain. And there&#8217;s no gain from an incorrect answer.<\/p>\n<p>The tool is already guiding users to rate the answers they are shown as part of the training, so this maps to gain values very easily.<\/p>\n<p>For example, I could say:<\/p>\n<ul>\n<li> perfect answers &#8211; gain of <strong>3<\/strong>\n<\/li>\n<li> good answers &#8211; gain of <strong>2<\/strong>\n<\/li>\n<li> answers that are at least in the right topic area &#8211; gain of <strong>1<\/strong>\n<\/li>\n<li> incorrect answers &#8211; <strong>zero<\/strong> gain\n<\/li>\n<\/ul>\n<p>There are different gain functions you can come up with, but this is what I&#8217;ve gone with for the tool, because it maps so neatly to the relevance labels we&#8217;re asking the users to collect already.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-stars.png\" style=\"border: thin solid black; margin-top: 5px; margin-bottom: 5px;\"\/><\/p>\n<p>This means we&#8217;re rewarding a system that returns perfect answers more than one that returns answers that are just okay.<\/p>\n<h3>Cumulative gain<\/h3>\n<p>You add up the gain for the different answers that the system returns.<\/p>\n<p>For example, if you&#8217;re interested in the first five answers that it returns to a question:<\/p>\n<ol>\n<li> perfect 
answer\n<\/li>\n<li> another perfect answer\n<\/li>\n<li> a good answer\n<\/li>\n<li> a good answer\n<\/li>\n<li> an incorrect answer\n<\/li>\n<\/ol>\n<p>The cumulative gain is<br \/>\n<strong>3 + 3 + 2 + 2 + 0 = 10<\/strong><\/p>\n<p>(For the tool, I&#8217;ve gone with the cumulative gain for the first ten answers to every question, but you get the idea).<\/p>\n<p>This means we&#8217;re rewarding a system that returns a range of good answers more than one that returns a good answer together with a load of rubbish.<\/p>\n<h3>Discounted cumulative gain<\/h3>\n<p>You discount the gain that an answer has, the further down the list it&#8217;s returned.<\/p>\n<p>For example:<\/p>\n<ol>\n<li> gain  (answer in first place gets the full gain score)\n<\/li>\n<li> gain \/ 2 (answer second in the list gets half of its gain score)\n<\/li>\n<li> gain \/ 3 (answer third in the list gets a third of its gain score)\n<\/li>\n<li> gain \/ 4 (answer fourth in the list gets a quarter of its gain score)\n<\/li>\n<li> gain \/ 5 (answer fifth in the list gets a fifth of its gain score)\n<\/li>\n<\/ol>\n<p>In other words, using the same example as before&#8230;<\/p>\n<ol>\n<li> perfect answer\n<\/li>\n<li> another perfect answer\n<\/li>\n<li> a good answer\n<\/li>\n<li> a good answer\n<\/li>\n<li> an incorrect answer\n<\/li>\n<\/ol>\n<p>The discounted cumulative gain (DCG) could be<br \/>\n<strong>3 + (3 \/ 2) + (2 \/ 3) + (2 \/ 4) + (0 \/ 5) = 5.7<\/strong><\/p>\n<p>There are different discounting functions you can use. The one I&#8217;ve described above is a little harsh. 
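That harsh 1/rank discount can be sketched in a couple of lines of Python (a hypothetical illustration, not the tool's code; the function name is my own), reproducing the worked example:

```python
def dcg_linear(gains):
    """Discounted cumulative gain where the answer at rank i
    (1-based) only contributes gain / i."""
    return sum(gain / rank for rank, gain in enumerate(gains, start=1))

# perfect, perfect, good, good, incorrect -> 3 + 3/2 + 2/3 + 2/4 + 0/5
print(round(dcg_linear([3, 3, 2, 2, 0]), 1))  # 5.7
```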
Halving the gain just because something is in 2nd place instead of 1st in a list of answers is probably too extreme a discount.<\/p>\n<p>A more common function, and what I&#8217;ve gone with for the tool, is log<sub>2<\/sub><\/p>\n<ol>\n<li> gain\n<\/li>\n<li> gain \/ log<sub>2<\/sub> 2\n<\/li>\n<li> gain \/ log<sub>2<\/sub> 3\n<\/li>\n<li> gain \/ log<sub>2<\/sub> 4\n<\/li>\n<li> gain \/ log<sub>2<\/sub> 5\n<\/li>\n<\/ol>\n<p>e.g. The DCG would be<br \/>\n<strong>3 + (3 \/ log<sub>2<\/sub> 2) + (2 \/ log<sub>2<\/sub> 3) + (2 \/ log<sub>2<\/sub> 4) + (0 \/ log<sub>2<\/sub> 5) = 7.3<\/strong><\/p>\n<p>This means that we&#8217;re rewarding systems that return good answers higher in the list.<\/p>\n<h3>Normalised discounted cumulative gain<\/h3>\n<p>You normalise the score compared with the best possible score.<\/p>\n<p>There might not be a perfect answer in any of your documents to one question.<br \/>\nAnother question might have loads of perfect answers in several documents.<\/p>\n<p>It&#8217;s not really fair to compare their scores &#8211; it&#8217;s not necessarily as easy to find perfect answers to one question as it is to find perfect answers to another.<\/p>\n<p>So instead, we normalise them, compared with the DCG you&#8217;d get for the best possible known answers in the perfect order.<\/p>\n<p>For example, imagine a system with only five known possible answers returned:<\/p>\n<ol>\n<li> An incorrect answer (gain = 0)\n<\/li>\n<li> A good answer (gain = 2)\n<\/li>\n<li> A perfect answer (gain = 3)\n<\/li>\n<li> A relevant answer (gain = 1)\n<\/li>\n<li> A perfect answer (gain = 3)\n<\/li>\n<\/ol>\n<p>The DCG would be<br \/>\n<strong>0 + (2 \/ log<sub>2<\/sub> 2) + (3 \/ log<sub>2<\/sub> 3) + (1 \/ log<sub>2<\/sub> 4) + (3 \/ log<sub>2<\/sub> 5) = 5.7<\/strong><\/p>\n<p>The ideal DCG (iDCG) can be found by getting the score for:<\/p>\n<ol>\n<li> A perfect answer (gain = 3)\n<\/li>\n<li> A perfect answer (gain = 3)\n<\/li>\n<li> A good answer (gain = 
2)\n<\/li>\n<li> A relevant answer (gain = 1)\n<\/li>\n<li> An incorrect answer (gain = 0)\n<\/li>\n<\/ol>\n<p>The iDCG would be<br \/>\n<strong>3 + (3 \/ log<sub>2<\/sub> 2) + (2 \/ log<sub>2<\/sub> 3) + (1 \/ log<sub>2<\/sub> 4) + (0 \/ log<sub>2<\/sub> 5) = 7.8<\/strong><\/p>\n<p>The NDCG is <strong>DCG \/ iDCG<\/strong> and so would be<br \/>\n<strong>5.7 \/ 7.8 = 0.7<\/strong><\/p>\n<p>The final score is in a predictable range &#8211; from 0.0 to 1.0.<\/p>\n<p>0.0 for a system that performs terribly, to 1.0 for a system that returns the best possible answers in the best possible order (because DCG and iDCG will be the same, and DCG\/iDCG = 1).<\/p>\n<p>This means that we&#8217;re rewarding a system based on how well it performs relative to what is actually possible.<\/p>\n<h2>Explaining performance<\/h2>\n<p>Most of the users we&#8217;ve built the tool for won&#8217;t be aware of any of this. They won&#8217;t know what NDCG means or how the score was computed. And that&#8217;s fine. We&#8217;re showing them a single number and telling them that a bigger number means it&#8217;s doing better and a lower number is worse. <\/p>\n<p>(That said, we&#8217;ve put &#8220;NDCG&#8221; in the UI so some knowledgeable users will have an idea what it means, and the curious users can always Google it to find out.)<\/p>\n<p>Regardless, what we needed was to distill the performance of their trained system into a single metric. We needed one number that gives them a feel for how their training is going, factoring in the relevancy of the answers it&#8217;s returning and the order that they&#8217;re in. 
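Putting all the steps above together, the whole calculation can be sketched in a few lines of Python (an illustrative sketch using the gain mapping and log2 discount described above, not the tool's actual implementation):

```python
import math

def dcg(gains):
    """Log-discounted cumulative gain: the first answer keeps its full
    gain, and the answer at rank i >= 2 contributes gain / log2(i)."""
    return sum(gain if rank == 1 else gain / math.log2(rank)
               for rank, gain in enumerate(gains, start=1))

def ndcg(gains):
    """Normalise against the same answers returned in the ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# The worked example: incorrect, good, perfect, relevant, perfect
print(round(ndcg([0, 2, 3, 1, 3]), 1))  # 0.7
```

Because the score is normalised per question, a system that returns the best known answers in the best order scores 1.0 regardless of how hard the question is.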
And of all the different metrics I&#8217;ve tried, I&#8217;m happiest with this one.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/160715-scores.png\" style=\"border: thin solid black; margin-top: 5px; margin-bottom: 5px;\"\/><\/p>\n<p>With this afternoon&#8217;s updates, users who start training new rankers will see the performance of their ranker in NDCG scores.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A ramble about accuracy compared with NDCG scores for evaluating effectiveness of information retrieval. It&#8217;s an introduction, so if you already know what they are, you can skip this. A couple of weeks ago, we released a new tool for training the IBM Watson Retrieve and Rank service (a search service that uses machine learning [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,4],"tags":[569,505],"class_list":["post-3403","post","type-post","status-publish","format-standard","hentry","category-code","category-ibm","tag-ibm","tag-watson"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3403","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3403"}],"version-history":[{"count":0,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3403\/revisions"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3403"}],"wp:term":[{"taxonomy":"category",
"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3403"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}