{"id":5304,"date":"2024-10-14T20:13:25","date_gmt":"2024-10-14T20:13:25","guid":{"rendered":"https:\/\/dalelane.co.uk\/blog\/?p=5304"},"modified":"2026-04-02T17:23:32","modified_gmt":"2026-04-02T17:23:32","slug":"analysing-wikipedia-edits-with-ibm-event-processing","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=5304","title":{"rendered":"Analysing Wikipedia edits with IBM Event Processing"},"content":{"rendered":"<p><strong>In this post, I&#8217;ll share a demo I gave today to explain some of the processing nodes in the palette of <a href=\"https:\/\/www.ibm.com\/products\/event-automation\/event-processing\">IBM Event Processing<\/a>.<\/strong><\/p>\n<p>I&#8217;ve found that demonstrations of Event Processing are easier to understand when I don&#8217;t need to explain the stream of events I&#8217;m processing in the first place. This means I&#8217;m always looking for interesting real-world event streams that are widely understood, as they can make for the most effective demos.<\/p>\n<p>With this in mind, today I tried explaining a few of the Event Processing nodes by using them with a live stream of events representing pages that are being created and edited in the English Wikipedia.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-topic.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-topic.png\"\/><\/a><br \/>\n<a style=\"font-size: 0.6em; font-style: italic;\" target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-topic.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p>Each event contains:<\/p>\n<ul>\n<li>title of the page<\/li>\n<li>who made the edit (user ID if logged in, or IP address if anonymous)<\/li>\n<li>was this the creation of a new page, or an edit 
of an existing page?<\/li>\n<\/ul>\n<p>Every edit on Wikipedia results in an event on the Kafka topic, so there are typically a few events a second. It&#8217;s not a super-high-throughput topic in Kafka terms, but there are enough events to try out interesting ideas.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-eventsource.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-eventsource.png\"\/><\/a><br \/>\n<a style=\"font-size: 0.6em; font-style: italic;\" target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-eventsource.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p>Here are a few of the demos I gave today. <\/p>\n<p>This is by no means an exhaustive list of what you could do with this data, but it was enough to let me show what the most commonly-used tools in the palette can do.<\/p>\n<p><!--more--><\/p>\n<hr \/>\n<h3>How many Wikipedia edits are made per day?<\/h3>\n<p>The <strong>Aggregate<\/strong> node lets us easily count how many edits we can see in the event stream.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-edits-per-day.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-edits-per-day.png?rlkey=zmeja2dac795vs8ru8trmh3dr&#038;st=a3n39ogm&#038;raw=1\"\/><\/a><br \/>\n<a style=\"font-size: 0.6em; font-style: italic;\" target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-edits-per-day.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p><strong>Aggregate node:<\/strong> <code style=\"color: #770000; 
font-weight: bold;\">edits per day<\/code><\/p>\n<ul>\n<li>Time window: 1 day<\/li>\n<li>Aggregate function: COUNT<\/li>\n<\/ul>\n<hr \/>\n<h3>Which Wikipedia pages were edited the most times each day?<\/h3>\n<p>Using the <strong>Aggregate<\/strong> node together with a <strong>Top-n<\/strong> node lets us count things, and then keep the ones with the highest counts.<\/p>\n<p>For example, for each day, we can see which three pages had the most edit events.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-most-edited-pages.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-most-edited-pages.png\"\/><\/a><br \/>\n<a style=\"font-size: 0.6em; font-style: italic;\" target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-most-edited-pages.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p><strong>Aggregate node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">edits per page<\/code><\/p>\n<ul>\n<li>Time window: 1 day<\/li>\n<li>Aggregate function: COUNT<\/li>\n<li>Group by: title<\/li>\n<\/ul>\n<p><strong>Top-n node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">pages with most edits<\/code><\/p>\n<ul>\n<li>Number of results to keep: 3<\/li>\n<li>Ordered by: number of edits (descending)<\/li>\n<\/ul>\n<hr \/>\n<h3>Who made the most edits on Wikipedia each day?<\/h3>\n<p>Adding a <strong>Filter<\/strong> node before the <strong>Aggregate<\/strong> node means we only count the events that are relevant to our query &#8211; and then the <strong>Top-n<\/strong> node lets us keep the results with the highest counts.<\/p>\n<p>For example, for each day, we can see which logged-in users produced the most edit events.<\/p>\n<p><a target=\"_blank\" 
href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-users-most-edits.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-users-most-edits.png\"\/><\/a><br \/>\n<a target=\"_blank\" style=\"font-size: 0.6em; font-style: italic;\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-users-most-edits.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p><strong>Filter node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">ignore anon users &amp; bots<\/code><\/p>\n<ul>\n<li>userid &lt;&gt; 0 (<em>Wikipedia uses 0 to indicate anonymous users<\/em>)<\/li>\n<li>user &lt;&gt; &#8216;bot name&#8217; (repeat this for the most popular bots, such as &#8220;Citation bot&#8221;, &#8220;InternetArchiveBot&#8221;, &#8220;WikiCleanerBot&#8221;, etc.)<\/li>\n<\/ul>\n<p><strong>Aggregate node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">edits per user<\/code><\/p>\n<ul>\n<li>Time window: 1 day<\/li>\n<li>Aggregate function: COUNT<\/li>\n<li>Group by: user<\/li>\n<\/ul>\n<p><strong>Top-n node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">users with most edits<\/code><\/p>\n<ul>\n<li>Number of results to keep: 3<\/li>\n<li>Ordered by: number of edits (descending)<\/li>\n<\/ul>\n<p><\/p>\n<h3>Where are most of the anonymous Wikipedia editors?<\/h3>\n<p>For example, for each day, we can see which IP address made the most anonymous edits.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-location-most-anon-edits.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-location-most-anon-edits.png\"\/><\/a><br \/>\n<a target=\"_blank\" 
style=\"font-size: 0.6em; font-style: italic;\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-location-most-anon-edits.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p><strong>Filter node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">anonymous users<\/code><\/p>\n<ul>\n<li>userid = 0 (<em>Wikipedia uses 0 to indicate anonymous users<\/em>)<\/li>\n<\/ul>\n<p><strong>Aggregate node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">count edits per location<\/code><\/p>\n<ul>\n<li>Time window: 1 day<\/li>\n<li>Aggregate function: COUNT<\/li>\n<li>Group by: user<\/li>\n<\/ul>\n<p><strong>Top-n node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">locations with most anon edits<\/code><\/p>\n<ul>\n<li>Number of results to keep: 1<\/li>\n<li>Ordered by: number of edits (descending)<\/li>\n<\/ul>\n<hr \/>\n<h3>How many anonymous Wikipedia editors have an IPv6 address?<\/h3>\n<p>Using a <strong>Transform<\/strong> node lets us derive new properties from the existing event attributes.<\/p>\n<p>For example, using regular expressions on the IP address in the events for anonymous edits lets us recognise and count the number of edits from IPv4 and IPv6 addresses.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-anon-edits-iptype-count.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-anon-edits-iptype-count.png\"\/><\/a><br \/>\n<a target=\"_blank\" style=\"font-size: 0.6em; font-style: italic;\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-anon-edits-iptype-count.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p><strong>Filter node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">anonymous 
users<\/code><\/p>\n<ul>\n<li>userid = 0 (<em>Wikipedia uses 0 to indicate anonymous users<\/em>)<\/li>\n<\/ul>\n<p><strong>Transform node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">check IP address type<\/code><\/p>\n<ul>\n<li>isIPv4 = <br \/><code style=\"color: #770000;\">IF(REGEXP(`user`, '\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b'), 1, 0)<\/code><\/li>\n<li>isIPv6 = <br \/><code style=\"color: #770000;\">IF(REGEXP(`user`, '\\b([0-9a-fA-F]{1,4}:){7}([0-9a-fA-F]{1,4})\\b'), 1, 0)<\/code><\/li>\n<\/ul>\n<p><strong>Aggregate node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">count<\/code><\/p>\n<ul>\n<li>Time window: 1 day<\/li>\n<li>Aggregate function: SUM isIPv4<\/li>\n<li>Aggregate function: SUM isIPv6<\/li>\n<\/ul>\n<p><\/p>\n<p>A <strong>Transform<\/strong> node also lets us transform results into a form that is easier to consume.<\/p>\n<p>For example, we can take the raw counts from the previous example and turn them into percentages.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-anon-edits-iptype-percentage.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-anon-edits-iptype-percentage.png\"\/><\/a><br \/>\n<a target=\"_blank\" style=\"font-size: 0.6em; font-style: italic;\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-anon-edits-iptype-percentage.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p><strong>Transform node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">calculate percentages<\/code><\/p>\n<ul>\n<li>IPv4 edits (%) = <br \/><code style=\"color: #770000;\">ROUND(100 * CAST(`edits from IPv4 addresses` AS DOUBLE) \/ (`edits from IPv4 addresses` + `edits from IPv6 addresses`), 0)<\/code><\/li>\n<li>IPv6 edits (%) = <br \/><code style=\"color: 
#770000;\">ROUND(100 * CAST(`edits from IPv6 addresses` AS DOUBLE) \/ (`edits from IPv4 addresses` + `edits from IPv6 addresses`), 0)<\/code><\/li>\n<\/ul>\n<hr \/>\n<h3>Which new Wikipedia pages received the most edits in the first hour after creation?<\/h3>\n<p>Using an <strong>Interval join<\/strong> lets us make time-based correlations between event streams.<\/p>\n<p>For example, if we split the Wikipedia events into events about creation of new pages, and events about edits of existing pages, we can correlate to see which new pages received the most edits.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-new-pages-most-edits.png\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 100%; max-width: 800px; border: thin black solid;\" src=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/thumbs\/screenshot-new-pages-most-edits.png\"\/><\/a><br \/>\n<a target=\"_blank\" style=\"font-size: 0.6em; font-style: italic;\" href=\"https:\/\/images.dalelane.co.uk\/2024-10-14-flinkwiki\/screenshot-new-pages-most-edits.png\" rel=\"noopener\">Click on the image for a higher-resolution screenshot<\/a><\/p>\n<p><strong>Filter node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">new page<\/code><\/p>\n<ul>\n<li>type = &#8216;new&#8217;<\/li>\n<\/ul>\n<p><strong>Filter node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">edits<\/code><\/p>\n<ul>\n<li>type = &#8216;edit&#8217;<\/li>\n<\/ul>\n<p><strong>Interval join node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">edits of new pages<\/code><\/p>\n<ul>\n<li>Join condition: <code style=\"color: #770000;\">`new pages`.`title` = `edits`.`title`<\/code><\/li>\n<li>Time window: 1 hour from new pages event_time<\/li>\n<\/ul>\n<p><strong>Aggregate node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">number of edits per page<\/code><\/p>\n<ul>\n<li>Time window: 1 day<\/li>\n<li>Aggregate function: 
COUNT<\/li>\n<li>Group by: page title<\/li>\n<\/ul>\n<p><strong>Top-n node:<\/strong> <code style=\"color: #770000; font-weight: bold;\">new pages with the most edits in the first hour<\/code><\/p>\n<ul>\n<li>Number of results to keep: 1<\/li>\n<li>Ordered by: number of edits (descending)<\/li>\n<\/ul>\n<hr \/>\n<h3>Want to try this for yourself?<\/h3>\n<p>If you&#8217;d like to recreate this demo for yourself, I have instructions for how to get access to this stream of events at <a href=\"https:\/\/github.com\/dalelane\/kafka-demos\/blob\/master\/README.md#wikipedia-edits\">github.com\/dalelane\/kafka-demos<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, I&#8217;ll share a demo I gave today to explain some of the processing nodes in the palette of IBM Event Processing. I&#8217;ve found that demonstrations of Event Processing are easier to understand when I don&#8217;t need to explain the stream of events I&#8217;m processing in the first place. This means I&#8217;m always 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,4],"tags":[593,583,584],"class_list":["post-5304","post","type-post","status-publish","format-standard","hentry","category-code","category-ibm","tag-apachekafka","tag-ibmeventstreams","tag-kafka"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5304","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5304"}],"version-history":[{"count":2,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5304\/revisions"}],"predecessor-version":[{"id":5948,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5304\/revisions\/5948"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5304"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5304"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5304"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}