{"id":5713,"date":"2025-11-03T17:34:37","date_gmt":"2025-11-03T17:34:37","guid":{"rendered":"https:\/\/dalelane.co.uk\/blog\/?p=5713"},"modified":"2026-03-14T21:20:51","modified_gmt":"2026-03-14T21:20:51","slug":"flink-sql-aggregate-functions","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=5713","title":{"rendered":"Flink SQL aggregate functions"},"content":{"rendered":"<p><strong>In this post, I want to share a couple of very quick and simple examples for how to use <code style=\"background-color: #FFFFC0; color: #770000; padding: 0.2em 0.4em; font-weight: bold;\">LISTAGG<\/code> and <code style=\"background-color: #FFFFC0; color: #770000; padding: 0.2em 0.4em; font-weight: bold;\">ARRAY_AGG<\/code> in Flink SQL.<\/strong><\/p>\n<p><em>This started as an answer I gave to a colleague asking about how to output collections of events from Flink SQL. I&#8217;ve removed the details and used this post to share a more general version of my answer, so I can point others to it in future.<\/em><\/p>\n<h3>Windowed aggregations<\/h3>\n<p>One of the great things in Flink is that it makes it easy to do time-based aggregations on a stream of events.<\/p>\n<p>Using this is one of the first things that I see people try when they start playing with Flink. The third tutorial we give to new Event Processing users is to take a stream of order events and <a href=\"https:\/\/ibm.github.io\/event-automation\/tutorials\/guided\/tutorial-3\">count the number of orders per hour<\/a>.<\/p>\n<p>In Flink SQL, that looks like:<\/p>\n<pre style=\"border: thin #AA0000 solid; color: #770000; background-color: #ffffc0; padding: 1em; overflow-y: scroll; overflow-x: scroll; font-size: 0.85em; white-space: pre;\">\nSELECT\n    COUNT (*) AS `number of orders`,\n    window_start,\n    window_end,\n    window_time\nFROM\n    TABLE (\n        TUMBLE (\n            TABLE orders,\n            DESCRIPTOR (ordertime),\n            INTERVAL '1' HOUR\n        )\n    )\nGROUP BY\n    window_start,\n    window_end,\n    window_time\n<\/pre>\n<p>In our low-code UI, it looks like this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-11-03-aggregates\/screenshot-1.png?raw=true\" style=\"width: 100%; max-width: 600px; border: thin black solid;\"\/><\/p>\n<p>However you do it, the result is a flow that emits a result at the end of every hour, with a count of how many order events were observed during the last hour.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-11-03-aggregates\/screenshot-2.png?raw=true\" style=\"width: 100%; max-width: 600px; border: thin black solid;\"\/><\/p>\n<p>But what if you don&#8217;t just want a count of the orders?<\/p>\n<p>What if you want the collection of the actual order events emitted at the end of every hour?<\/p>\n<p>To dream up a scenario using this stream of order events:<\/p>\n<blockquote style=\"font-weight: bold;\"><p>At the end of each hour, emit a list of all products ordered during the last hour, so the warehouse pickers can prepare those items for delivery.<\/p><\/blockquote>\n<p>This is where some of the other <a href=\"https:\/\/nightlies.apache.org\/flink\/flink-docs-master\/docs\/dev\/table\/functions\/systemfunctions\/#aggregate-functions\">aggregate functions in Flink SQL<\/a> can help.<\/p>\n<h3>LISTAGG<\/h3>\n<p>If you just want a single property (e.g. the name \/ description of the product that was ordered) from all of the events that you collect within each hourly window, then <code style=\"background-color: #FFFFC0; color: #770000; padding: 0.2em 0.4em; font-weight: bold;\">LISTAGG<\/code> can help.<\/p>\n<p>For example:<\/p>\n<pre style=\"border: thin #AA0000 solid; color: #770000; background-color: #ffffc0; padding: 1em; overflow-y: scroll; overflow-x: scroll; font-size: 0.85em; white-space: pre;\">\nSELECT\n    LISTAGG (description) AS `products to pick`,\n    window_start,\n    window_end,\n    window_time\nFROM\n    TABLE (\n        TUMBLE (\n            TABLE orders,\n            DESCRIPTOR (ordertime),\n            INTERVAL '1' HOUR\n        )\n    )\nGROUP BY\n    window_start,\n    window_end,\n    window_time\n<\/pre>\n<p>That gives you a concatenated string, with a comma-separated list of all of the product descriptions from each of the events within each hour.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-11-03-aggregates\/screenshot-3.png?raw=true\" style=\"width: 100%; max-width: 600px; border: thin black solid;\"\/><\/p>\n<p><em>You can use a different separator, but it&#8217;s a comma by default.<\/em><\/p>\n<h3>ARRAY_AGG<\/h3>\n<p>If you want to output an object, with some or even all of the properties (e.g. the name <strong>and<\/strong> quantity of the products that was ordered) from all of the events that you collect within each hourly window, then <code style=\"background-color: #FFFFC0; color: #770000; padding: 0.2em 0.4em; font-weight: bold;\">ARRAY_AGG<\/code> can help.<\/p>\n<p>For example:<\/p>\n<pre style=\"border: thin #AA0000 solid; color: #770000; background-color: #ffffc0; padding: 1em; overflow-y: scroll; overflow-x: scroll; font-size: 0.85em; white-space: pre;\">\nSELECT\n    ARRAY_AGG (\n        CAST (\n            ROW (description, quantity)\n                AS\n            ROW &lt;description STRING, quantity INT&gt;\n        )\n    ) AS `products to pick`,\n    window_start,\n    window_end,\n    window_time\nFROM\n    TABLE (\n        TUMBLE (\n            TABLE orders,\n            DESCRIPTOR (ordertime),\n            INTERVAL '1' HOUR\n        )\n    )\nGROUP BY\n    window_start,\n    window_end,\n    window_time\n<\/pre>\n<p><em>The CAST isn&#8217;t necessary, but it lets you give names to the properties instead of the default names you get such as <code style=\"background-color: #FFFFC0; color: #770000; padding: 0.2em 0.4em; font-weight: bold;\">EXPR$0<\/code>, so downstream processing is easier.<\/em><\/p>\n<p>In each hour, an event is emitted that contains an array of objects, made up of properties from the events in that hour.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-11-03-aggregates\/screenshot-4.png?raw=true\" style=\"width: 100%; max-width: 600px; border: thin black solid;\"\/><\/p>\n<p>And naturally you could add additional <code style=\"background-color: #FFFFC0; color: #770000; padding: 0.2em 0.4em; font-weight: bold;\">GROUP BY<\/code>, for example, if you wanted a separate pick list event for each region:<\/p>\n<pre style=\"border: thin #AA0000 solid; color: #770000; background-color: #ffffc0; padding: 1em; overflow-y: scroll; overflow-x: scroll; font-size: 0.85em; white-space: pre;\">\nSELECT\n    region,\n    ARRAY_AGG (\n        CAST (\n            ROW (description, quantity)\n                AS\n            ROW &lt;description STRING, quantity INT&gt;\n        )\n    ) AS `products to pick`,\n    window_start,\n    window_end,\n    window_time\nFROM\n    TABLE (\n        TUMBLE (\n            TABLE orders,\n            DESCRIPTOR (ordertime),\n            INTERVAL '1' HOUR\n        )\n    )\nGROUP BY\n    window_start,\n    window_end,\n    window_time,\n    region\n<\/pre>\n<p>This outputs five events at the end of every hour, one with a list of each of the orders for the NA, SA, EMEA, APAC, ANZ regions.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2025-11-03-aggregates\/screenshot-5.png?raw=true\" style=\"width: 100%; max-width: 600px; border: thin black solid;\"\/><\/p>\n<h4>Try it yourself<\/h4>\n<p>I&#8217;ve used the &#8220;Loosehanger&#8221; orders data generator in these examples, so if you want to try it for yourself, you can find it at<br \/>\n<a href=\"https:\/\/github.com\/IBM\/kafka-connect-loosehangerjeans-source\">github.com\/IBM\/kafka-connect-loosehangerjeans-source<\/a>.<\/p>\n<p>If you want to try it in Event Processing, the instructions for setting up the tutorial environment can be found at<br \/>\n<a href=\"https:\/\/ibm.github.io\/event-automation\/tutorials\/guided\/tutorial-0\">ibm.github.io\/event-automation\/tutorials<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, I want to share a couple of very quick and simple examples for how to use LISTAGG and ARRAY_AGG in Flink SQL. This started as an answer I gave to a colleague asking about how to output collections of events from Flink SQL. I&#8217;ve removed the details and used this post to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[610],"class_list":["post-5713","post","type-post","status-publish","format-standard","hentry","category-code","tag-flink"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5713","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5713"}],"version-history":[{"count":1,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5713\/revisions"}],"predecessor-version":[{"id":5885,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5713\/revisions\/5885"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5713"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5713"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5713"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}