{"id":6001,"date":"2026-04-11T16:28:05","date_gmt":"2026-04-11T16:28:05","guid":{"rendered":"https:\/\/dalelane.co.uk\/blog\/?p=6001"},"modified":"2026-04-13T19:49:51","modified_gmt":"2026-04-13T19:49:51","slug":"how-many-kafka-events-will-flink-process-per-second","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=6001","title":{"rendered":"&#8220;How many Kafka events will Flink process per second?&#8221;"},"content":{"rendered":"<p>\nI&#8217;m often asked this. The specific question varies, but it&#8217;s typically some variation of asking how quickly a single CPU of Flink processes events from a Kafka topic.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 6px; padding-right: 6px; color: black; border-radius: 6px; background-color: #d5d6d7; border: 1px solid #004a59;\">\n<p>\n<em>Why &#8220;per CPU&#8221;? Maybe because enterprise software is typically charged per CPU? Maybe because I tend to talk to people who run everything in Kubernetes, who think of running software in terms of <a href=\"https:\/\/kubernetes.io\/docs\/concepts\/configuration\/manage-resources-containers\/\">requests \/ limits<\/a>? 
Not sure, but the question tends to be framed from the perspective of asking how much processing they can expect to get from a CPU.<\/em>\n<\/p>\n<\/blockquote>\n<p>\nI try to avoid doing the engineer thing of answering &#8220;<a href=\"https:\/\/trishagee.github.io\/post\/it_depends\/\">it depends<\/a>&#8221;&#8230; but&#8230; it really <strong>does<\/strong> depend!\n<\/p>\n<p>\n<a href=\"https:\/\/knowyourmeme.com\/memes\/depends-on-the-context\"><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/depends.jpg\" style=\"border: thin black solid; width: 100%; max-width: 600px;\"\/><\/a>\n<\/p>\n<p>\nThat is the motivation behind this post: to give me something I can point at as an illustration of the degree to which Flink&#8217;s performance varies (and a taste of the range of interrelated factors that influence it).\n<\/p>\n<p><!--more--><\/p>\n<h3>How different can it be?<\/h3>\n<p>\nI&#8217;ve mentioned before that, for now, my go-to examples when demo&#8217;ing Flink SQL are <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5806\">Flink SQL flows that process click tracking events in a retail scenario<\/a>. They are easy to explain, easy to understand, and let me show a variety of Flink SQL features.\n<\/p>\n<p>\nUsing that Kafka topic and those sorts of flows as a basis, how fast do the Flink jobs I demo consume events (if I set the CPU limit to 1 for the job manager and task manager)?\n<\/p>\n<p>\nI ran those flows a bunch of times, <a href=\"#details\">changing some configuration options<\/a> along the way, to see what sort of range of speeds I could get as a result. 
When I plot each Flink job onto an x-axis based on the number of records it processed per second, I get:\n<\/p>\n<div class=\"flink-chart-embed\">\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-00.png\" style=\"max-width: 100%; height: auto; display: block;\"\/>\n<\/div>\n<p>\nLook at the range there.\n<\/p>\n<p>\nAt one end, one of the flows consumed approximately 1,340 events per second.<\/p>\n<p>\nAt the other end, another job consumed approximately 44,320 events per second.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 8px; padding-right: 8px; color: black; font-size: 1.01em; border-radius: 8px; background-color: #d1ecf1; border: 1px solid #bee5eb;\">\n<p>\n<strong>TLDR<\/strong> &#8211; If you take away nothing else from this post, that&#8217;s the message. All of those flows were Flink running in the same place, consuming the same type of data, from the same Kafka cluster, using the same amount of CPU and memory.\n<\/p>\n<p>\nThe <strong>throughput varied massively depending on the type of processing the flow was doing, the type of data it was consuming, and how Flink was configured<\/strong>.\n<\/p>\n<\/blockquote>\n<p>\nThese results are specific to my environment, workload, and the configurations I tried. (And I took some questionable shortcuts that I describe below!) I&#8217;m sharing this as an interesting illustration of how variable Flink performance can be, and as a reminder of the value of testing and benchmarking.\n<\/p>\n<p>\nIn other words, I guess I&#8217;m still saying <a href=\"https:\/\/trishagee.github.io\/post\/it_depends\/\">it depends<\/a>. 
\ud83d\ude09\n<\/p>\n<h3>Digging into the differences<\/h3>\n<p>\nIf you hover over the points in the visualisation above, the tooltip gives an overview of what I did in that Flink job.\n<\/p>\n<p>\nSeeing all of them in one group is helpful to get a feel for the distribution, but it doesn&#8217;t make it easy to see patterns. While I&#8217;ve got this data, it&#8217;s worth poking at it a little closer.\n<\/p>\n<p>\nFor example, let me pick out two of the Flink job runs. These two Flink jobs both:<\/p>\n<ul>\n<li>included an interval join (with a one-hour interval) between the click tracking topic and a second low-frequency topic <\/li>\n<li>included a session window (using a one-hour session gap)<\/li>\n<li>performed the same processing<\/li>\n<li>consumed the same events from the same source topics<\/li>\n<\/ul>\n<p>The only thing I changed between these two runs was the configuration option I used for the <a href=\"https:\/\/nightlies.apache.org\/flink\/flink-docs-release-2.2\/docs\/ops\/state\/state_backends\/\">state backend<\/a> (where Flink held its state): either in memory using a <strong>hashmap<\/strong>, or in a persistent key-value store using <strong>RocksDB<\/strong>.\n<\/p>\n<p>\nPutting these two Flink jobs side-by-side on a <a href=\"https:\/\/datavizproject.com\/data-type\/slope-chart\/\">slope chart<\/a>, I get:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=45000&g0=interval+join%3D1+hour%7Cwindowed+state%3D1+hour%7Cformat%3Davro%7Ccompression%3Dcompressed%7Cbatching%3Dbatched&n=2&c0=state+backend%3Drocksdb&c1=state+backend%3Dhashmap&l0=RocksDB&l1=hashmap\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-01.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script 
src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;g0=interval+join%3D1+hour%7Cwindowed+state%3D1+hour%7Cformat%3Davro%7Ccompression%3Dcompressed%7Cbatching%3Dbatched&#038;n=2&#038;c0=state+backend%3Drocksdb&#038;c1=state+backend%3Dhashmap&#038;l0=RocksDB&#038;l1=hashmap\"><\/script>\n<\/div>\n<p>\nThe job I ran using the hashmap state backend went a lot faster than the job I ran using the RocksDB state backend.\n<\/p>\n<h3>Impact of the state backend<\/h3>\n<p>\nWhat if I add all the other <strong>stateful Flink jobs<\/strong> I ran to that slope chart?\n<\/p>\n<p>\nIn other words, here are <strong>all the Flink jobs I ran that included an interval join, or a session window, or both<\/strong>.\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=45000&g0=%21interval+join%3Dnone&g1=%21windowed+state%3Dnone&n=2&c0=state+backend%3Drocksdb&c1=state+backend%3Dhashmap&l0=RocksDB&l1=hashmap\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-02.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;g0=%21interval+join%3Dnone&#038;g1=%21windowed+state%3Dnone&#038;n=2&#038;c0=state+backend%3Drocksdb&#038;c1=state+backend%3Dhashmap&#038;l0=RocksDB&#038;l1=hashmap\"><\/script>\n<\/div>\n<blockquote style=\"margin: 0; padding-left: 6px; padding-right: 6px; color: black; border-radius: 6px; background-color: #d5d6d7; border: 1px solid #004a59;\">\n<p>\nSome of these jobs consumed Avro events, some consumed JSON. Some consumed events that were compressed, some consumed uncompressed events. And so on. 
Each line is a different Flink job or configuration.\n<\/p>\n<p>\nThe <strong><a href=\"https:\/\/datavizproject.com\/data-type\/slope-chart\/\">slope chart connects Flink jobs that were implemented and configured identically<\/a><\/strong>, except for the state backend.\n<\/p>\n<\/blockquote>\n<p>\nAll of these jobs above were stateful in some way. Some consumed from only a single topic using a session window &#8211; others consumed from two topics correlating with an interval join, in addition to using a session window.\n<\/p>\n<p>\nBut what if I draw the slope chart with the <strong>stateless Flink jobs<\/strong> I ran?\n<\/p>\n<p>\n<em>For the purposes of this post, by &#8220;stateless&#8221; I mean Flink SQL that included no interval join and no windowed aggregates (rather than more general use of state in Flink or the Kafka consumers).<\/em>\n<\/p>\n<p>\nHere are <strong>Flink jobs I ran that did not make any use of state<\/strong> and used complex string functions (e.g. multiple uses of string parsing, regular expressions, etc.) 
on Avro format click-tracking events:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=45000&g0=interval+join%3Dnone%7Cwindowed+state%3Dnone%7Cformat%3Davro%7Cstateless+processing%3Dcomplex+string+functions&n=2&c0=state+backend%3Drocksdb&c1=state+backend%3Dhashmap&l0=RocksDB&l1=hashmap\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-03.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;g0=interval+join%3Dnone%7Cwindowed+state%3Dnone%7Cformat%3Davro%7Cstateless+processing%3Dcomplex+string+functions&#038;n=2&#038;c0=state+backend%3Drocksdb&#038;c1=state+backend%3Dhashmap&#038;l0=RocksDB&#038;l1=hashmap\"><\/script>\n<\/div>\n<blockquote style=\"margin: 0; padding-left: 6px; padding-right: 6px; color: black; border-radius: 6px; background-color: #d5d6d7; border: 1px solid #004a59;\">\n<p>\nGreen lines mean the number of records processed per second increased. Red lines mean the number of records processed per second decreased.\n<\/p>\n<p>\n<em>I&#8217;m not doing formal performance testing here. For one thing, I ran all of this on my development cluster, which is a shared environment. In the absence of an isolated environment, variation between runs is inevitable.<\/em>\n<\/p>\n<\/blockquote>\n<p>\nThe choice of state backend made no significant difference to the speed of the Flink jobs I ran that didn&#8217;t use state.\n<\/p>\n<p>\nSo far, so obvious, right? When I ran Flink SQL jobs that made heavy use of state, they ran faster using a memory-based state store than they did using a disk-based state store. When I ran Flink SQL jobs that didn&#8217;t use state, it didn&#8217;t make much difference either way. 
I would&#8217;ve predicted that, but it&#8217;s reassuring to see it confirmed.\n<\/p>\n<p>\nIt wasn&#8217;t all so obvious. When I add <strong>all of the stateless Flink jobs I ran<\/strong> onto the chart:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=45000&g0=interval+join%3Dnone%7Cwindowed+state%3Dnone&n=2&c0=state+backend%3Drocksdb&c1=state+backend%3Dhashmap&l0=RocksDB&l1=hashmap\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-04.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;g0=interval+join%3Dnone%7Cwindowed+state%3Dnone&#038;n=2&#038;c0=state+backend%3Drocksdb&#038;c1=state+backend%3Dhashmap&#038;l0=RocksDB&#038;l1=hashmap\"><\/script>\n<\/div>\n<p>\nIn other words, this shows every stateless job: not just the ones using Avro source topics, and not just the ones that included complex string functions.\n<\/p>\n<p>\nThese results are a little harder to explain. I think this still shows that the state backend had a mostly minimal impact, but there was a small collection of jobs that seemed to go faster, which I haven&#8217;t properly understood.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 8px; padding-right: 8px; color: black; font-size: 1.01em; border-radius: 8px; background-color: #d1ecf1; border: 1px solid #bee5eb;\">\n<p>\nAll the stateful Flink jobs I ran went faster when I used a hashmap backend. Does this mean I would always use hashmap for stateful jobs?\n<\/p>\n<p>\nNot necessarily. For one thing, some Flink jobs hold so much state that it&#8217;s just not feasible to give them enough memory. 
For other jobs, the amount of state varies so much that it&#8217;s not efficient to give them enough memory to guarantee they can always hold their entire state.\n<\/p>\n<p>\nThere are other considerations, too &#8211; some of which I&#8217;ll get to below.\n<\/p>\n<\/blockquote>\n<h3>Impact of the processing complexity<\/h3>\n<p>\nLooking at the stateless jobs a little closer, I ran a few different types of stateless Flink SQL jobs on my click tracking events. I created Flink jobs that did:<\/p>\n<ul>\n<li><strong>no processing<\/strong> &#8211; only deserialized source events and sinked them<\/li>\n<li><strong>simple<\/strong> &#8211; only performed numeric operations on source events<\/li>\n<li><strong>complex<\/strong> &#8211; performed complex string functions, including multiple uses of string parsing, regular expressions, etc.<\/li>\n<\/ul>\n<p>Looking at the <strong>impact of the type of processing<\/strong> on these stateless jobs:<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=20000&ymax=42000&g0=interval+join%3Dnone%7Cwindowed+state%3Dnone&n=2&c0=stateless+processing%3Dnone&c1=stateless+processing%3Dsimple+numeric+function&l0=no%2520processing&l1=simple\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-05.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=20000&#038;ymax=42000&#038;g0=interval+join%3Dnone%7Cwindowed+state%3Dnone&#038;n=2&#038;c0=stateless+processing%3Dnone&#038;c1=stateless+processing%3Dsimple+numeric+function&#038;l0=no%2520processing&#038;l1=simple\"><\/script>\n<\/div>\n<div class=\"slope-chart-container\">\n<!--<a 
href=\"slope-chart-generator.html?ymin=20000&ymax=42000&g0=interval+join%3Dnone%7Cwindowed+state%3Dnone&n=2&c0=stateless+processing%3Dnone&c1=stateless+processing%3Dcomplex+string+functions&l0=no%2520processing&l1=complex\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-06.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=20000&#038;ymax=42000&#038;g0=interval+join%3Dnone%7Cwindowed+state%3Dnone&#038;n=2&#038;c0=stateless+processing%3Dnone&#038;c1=stateless+processing%3Dcomplex+string+functions&#038;l0=no%2520processing&#038;l1=complex\"><\/script>\n<\/div>\n<p>\nThe more CPU-intensive string parsing functions in the &#8220;complex&#8221; jobs had a noticeably larger impact on the speed of the jobs that I ran.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 8px; padding-right: 8px; color: black; font-size: 1.01em; border-radius: 8px; background-color: #d1ecf1; border: 1px solid #bee5eb;\">\n<p>\nAny processing reduced the number of events that Flink processed per second compared with not doing any work at all.\n<\/p>\n<p>\nNumerical functions were unsurprisingly efficient and had only a minimal impact on the job&#8217;s speed.\n<\/p>\n<p>\nWhen I included string parsing and regular expression functions in my stateless jobs, I reduced the number of events my Flink jobs processed per second by nearly 25%.\n<\/p>\n<\/blockquote>\n<h3>Impact of the format of source events<\/h3>\n<p>\nI created separate versions of the source topics that the Flink jobs could consume from: <strong>JSON<\/strong> and <strong>Avro<\/strong>.\n<\/p>\n<p>\nComparing the performance of Flink jobs that were identical other than whether they consumed from the JSON or Avro topic:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a 
href=\"slope-chart-generator.html?ymin=0&ymax=45000&n=2&c0=format%3Djson&c1=format%3Davro&l0=JSON&l1=Avro\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-07.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;n=2&#038;c0=format%3Djson&#038;c1=format%3Davro&#038;l0=JSON&#038;l1=Avro\"><\/script>\n<\/div>\n<p>\nFor this setup, my Flink jobs that consumed Avro events almost always processed events more quickly than the Flink jobs that consumed JSON events. But the benefit that Avro brought was not consistent across all types of jobs that I ran.\n<\/p>\n<p>\nFor example, looking at <strong>stateful jobs<\/strong> (jobs that included an interval join and\/or session window aggregates) that consumed from topics where the Kafka producers had <strong>not used compression<\/strong>:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=25000&ymax=43000&g0=compression%3Duncompressed%7C%21interval+join%3Dnone&g1=compression%3Duncompressed%7C%21windowed+state%3Dnone&n=2&c0=format%3Djson&c1=format%3Davro&l0=JSON&l1=Avro\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-08.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=25000&#038;ymax=43000&#038;g0=compression%3Duncompressed%7C%21interval+join%3Dnone&#038;g1=compression%3Duncompressed%7C%21windowed+state%3Dnone&#038;n=2&#038;c0=format%3Djson&#038;c1=format%3Davro&#038;l0=JSON&#038;l1=Avro\"><\/script>\n<\/div>\n<p>\nIn other words, in these jobs, the Flink jobs in the left category were consuming a large amount of whitespace in multi-line, 
pretty-printed JSON events, with a large number of JSON keys repeated multiple times in the consumed data. This was JSON at its worst, and showed the most significant benefit from using Avro.\n<\/p>\n<p>\nCompare that with these <strong>stateful Flink jobs<\/strong> consuming from topics where Kafka producers had <strong>used compression and large batch sizes<\/strong>:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=25000&ymax=43000&g0=compression%3Dcompressed%7C%21interval+join%3Dnone%7Cbatching%3Dbatched&g1=compression%3Dcompressed%7C%21windowed+state%3Dnone%7Cbatching%3Dbatched&n=2&c0=format%3Djson&c1=format%3Davro&l0=JSON&l1=Avro\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-09.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=25000&#038;ymax=43000&#038;g0=compression%3Dcompressed%7C%21interval+join%3Dnone%7Cbatching%3Dbatched&#038;g1=compression%3Dcompressed%7C%21windowed+state%3Dnone%7Cbatching%3Dbatched&#038;n=2&#038;c0=format%3Djson&#038;c1=format%3Davro&#038;l0=JSON&#038;l1=Avro\"><\/script>\n<\/div>\n<p>\nAvro topics still showed a performance benefit for most jobs, but this was more modest. When Kafka producers had used compression and large batch sizes, this had mitigated some of the costs of JSON.\n<\/p>\n<p>\nLooking at the <strong>impact of source format on stateless Flink jobs that did no processing<\/strong> (i.e. 
jobs that only deserialized source events and sinked them):\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=25000&ymax=43000&g0=interval+join%3Dnone%7Cwindowed+state%3Dnone%7Cstateless+processing%3Dnone&n=2&c0=format%3Djson&c1=format%3Davro&l0=JSON&l1=Avro\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-10.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=25000&#038;ymax=43000&#038;g0=interval+join%3Dnone%7Cwindowed+state%3Dnone%7Cstateless+processing%3Dnone&#038;n=2&#038;c0=format%3Djson&#038;c1=format%3Davro&#038;l0=JSON&#038;l1=Avro\"><\/script>\n<\/div>\n<p>\nAgain, the Flink jobs almost always processed events from Avro topics at a faster rate than JSON. Here too, the benefit was more modest.\n<\/p>\n<p>\nDeserializing Avro events typically requires less CPU than deserializing JSON events. These Flink jobs would not have been CPU constrained, so the benefit they gained from the CPU cycles freed up by cheaper deserialization was perhaps reduced.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 8px; padding-right: 8px; color: black; font-size: 1.01em; border-radius: 8px; background-color: #d1ecf1; border: 1px solid #bee5eb;\">\n<p>\nThis is a difficult aspect to try and discuss in isolation.\n<\/p>\n<p>\nIn theory, processing Avro events rather than JSON brings network transfer benefits (my JSON events were larger than their Avro equivalents, so Flink&#8217;s Kafka consumer could consume more Avro events more quickly than JSON). 
In a network-constrained environment, this could be hugely important.\n<\/p>\n<p>\nBut in a low-latency network environment, or if the Flink job&#8217;s processing is so complex that the source is backpressured, this benefit is reduced &#8211; it potentially just increases the time the source spends backpressured.\n<\/p>\n<p>\nIn theory, processing Avro events rather than JSON brings CPU benefits (deserializing JSON requires more CPU than Avro).\n<\/p>\n<p>\nBut if the Flink job is not CPU constrained, then this benefit is reduced &#8211; it potentially just increases the time that the CPU spends idle.\n<\/p>\n<p>\nUsing Avro instead of JSON seemed to consistently improve performance. But in some runs, other factors diminished the degree to which it improved performance.\n<\/p>\n<\/blockquote>\n<h3>Impact of producer batching<\/h3>\n<p>\nKafka brokers cannot send partial batches from a topic, so the size of batches that a producer sends to a topic will influence the amount of data that a Flink job will receive when it fetches from its Kafka source.\n<\/p>\n<p>\nTo see the impact of producer batching, I set up versions of the click tracking topics where producers used:<\/p>\n<ul>\n<li><strong>&#8220;micro&#8221;<\/strong> &#8211; tiny, single-message batches <br \/><code style=\"font-size: 0.95em; background-color: #FFFFC0; color: #770000; padding: 3px; font-weight: 600;\">batch.size: 0<\/code><\/li>\n<li><strong>&#8220;large&#8221;<\/strong> &#8211; larger-than-default batches <br \/><code style=\"font-size: 0.95em; background-color: #FFFFC0; color: #770000; padding: 3px; font-weight: 600;\">batch.size: 409600, linger.ms: 10000<\/code><\/li>\n<\/ul>\n<p>\nFlink jobs consuming from the &#8220;micro&#8221; topics will have received a large number of single-message batches in each poll.\n<\/p>\n<p>\nFlink jobs consuming from the &#8220;large&#8221; topics will have received a couple of multiple-message batches in each poll.\n<\/p>\n<div 
class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=45000&n=2&c0=batching%3Dunbatched&c1=batching%3Dbatched&l0=micro&l1=large\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-11.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;n=2&#038;c0=batching%3Dunbatched&#038;c1=batching%3Dbatched&#038;l0=micro&#038;l1=large\"><\/script>\n<\/div>\n<p>\nThe impact of this on the rate at which the Flink jobs I ran consumed events was mixed. Two thirds of the jobs went faster with the larger batches; one third went slower.\n<\/p>\n<p>\nThere were Flink jobs that always went faster with the larger batches, such as <strong>stateful jobs using a hashmap state backend to consume compressed batches<\/strong>:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=24000&ymax=45000&g0=%21windowed+state%3Dnone%7Ccompression%3Dcompressed%7Cstate+backend%3Dhashmap&g1=compression%3Dcompressed%7Cstate+backend%3Dhashmap%7C%21interval+join%3Dnone&n=2&c0=batching%3Dunbatched&c1=batching%3Dbatched&l0=micro&l1=large\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-12.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=24000&#038;ymax=45000&#038;g0=%21windowed+state%3Dnone%7Ccompression%3Dcompressed%7Cstate+backend%3Dhashmap&#038;g1=compression%3Dcompressed%7Cstate+backend%3Dhashmap%7C%21interval+join%3Dnone&#038;n=2&#038;c0=batching%3Dunbatched&#038;c1=batching%3Dbatched&#038;l0=micro&#038;l1=large\"><\/script>\n<\/div>\n<p>\nThere were Flink jobs where it didn&#8217;t help, such as <strong>stateless jobs that consumed 
uncompressed batches<\/strong>:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=24000&ymax=45000&g0=windowed+state%3Dnone%7Ccompression%3Duncompressed%7Cstate+backend%3Drocksdb%7Ccheckpoints%3Dnone%7Cinterval+join%3Dnone&n=2&c0=batching%3Dunbatched&c1=batching%3Dbatched&l0=micro&l1=large\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-13.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=24000&#038;ymax=45000&#038;g0=windowed+state%3Dnone%7Ccompression%3Duncompressed%7Cstate+backend%3Drocksdb%7Ccheckpoints%3Dnone%7Cinterval+join%3Dnone&#038;n=2&#038;c0=batching%3Dunbatched&#038;c1=batching%3Dbatched&#038;l0=micro&#038;l1=large\"><\/script>\n<\/div>\n<h3>Impact of producer compression<\/h3>\n<p>\nIf producers compress the batches they send to Kafka, Kafka consumers need to decompress the batches before deserializing the events. 
This could reduce the network traffic the Flink job needs to wait for when receiving events from its source, however it also introduces a CPU overhead for Flink to decompress each batch it receives.\n<\/p>\n<p>\nTo see the impact of producer compression, I set up versions of the click tracking topics where producers used:<\/p>\n<ul>\n<li><strong>&#8220;uncompressed&#8221;<\/strong> &#8211; no compression <code style=\"font-size: 0.95em; background-color: #FFFFC0; color: #770000; padding: 3px; font-weight: 600;\">compression.type: none<\/code><\/li>\n<li><strong>&#8220;compressed&#8221;<\/strong> &#8211; compression using <code style=\"font-size: 0.95em; background-color: #FFFFC0; color: #770000; padding: 3px; font-weight: 600;\">compression.type: snappy<\/code><\/li>\n<\/ul>\n<p>\nLooking at the <strong>impact of compression<\/strong> on the Flink jobs that consumed from the <strong>&#8220;micro&#8221;-batch<\/strong> topics:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=40000&g0=batching%3Dunbatched&n=2&c0=compression%3Duncompressed&c1=compression%3Dcompressed&l0=uncompressed&l1=compressed\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-14.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=40000&#038;g0=batching%3Dunbatched&#038;n=2&#038;c0=compression%3Duncompressed&#038;c1=compression%3Dcompressed&#038;l0=uncompressed&#038;l1=compressed\"><\/script>\n<\/div>\n<p>\nThe compressed topics slowed down almost all of these Flink jobs.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 6px; padding-right: 6px; color: black; border-radius: 6px; background-color: #d5d6d7; border: 1px solid #004a59;\">\n<p>\n<em>Compression is applied at a per-batch level. 
Its effectiveness is dependent on the presence of repeated data in the batch. <\/em>\n<\/p>\n<\/blockquote>\n<p>\nCompressing single-message batches is inefficient. It won&#8217;t have reduced the data size (it could even have increased it).\n<\/p>\n<p>\nIt still introduced a CPU overhead for consumers needing to decompress data before deserializing, and this can be seen in the impact that it had on how many click tracking events my Flink jobs processed per second from the &#8220;compressed&#8221; topics.\n<\/p>\n<p>\nLooking at the <strong>impact of compression<\/strong> on Flink jobs that consumed from the <strong>&#8220;large&#8221;-batch<\/strong> topics when using a <strong>hashmap<\/strong> state backend:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=20000&ymax=45000&g0=batching%3Dbatched%7Cstate+backend%3Dhashmap&n=2&c0=compression%3Duncompressed&c1=compression%3Dcompressed&l0=uncompressed&l1=compressed\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-15.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=20000&#038;ymax=45000&#038;g0=batching%3Dbatched%7Cstate+backend%3Dhashmap&#038;n=2&#038;c0=compression%3Duncompressed&#038;c1=compression%3Dcompressed&#038;l0=uncompressed&#038;l1=compressed\"><\/script>\n<\/div>\n<p>\nLarge batches compress very well, so any additional CPU overhead of decompressing the batches that Flink incurred appeared to be outweighed in these jobs by the benefits of the more efficient network I\/O.\n<\/p>\n<p>\nSurprisingly, I didn&#8217;t see that consistent benefit when looking at the <strong>impact of compression<\/strong> on Flink jobs that consumed from the <strong>&#8220;large&#8221;-batch topics<\/strong> when using a <strong>RocksDB<\/strong> state 
backend:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=45000&g0=batching%3Dbatched%7Cstate+backend%3Drocksdb&n=2&c0=compression%3Duncompressed&c1=compression%3Dcompressed&l0=uncompressed&l1=compressed\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-16.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;g0=batching%3Dbatched%7Cstate+backend%3Drocksdb&#038;n=2&#038;c0=compression%3Duncompressed&#038;c1=compression%3Dcompressed&#038;l0=uncompressed&#038;l1=compressed\"><\/script>\n<\/div>\n<p>\nPerhaps some of these Flink jobs were sufficiently CPU bottlenecked that the additional overhead of decompression became a more significant cost. There may be other explanations. Because I only collected a single metric (the number of records in the Kafka source) for these jobs, I can&#8217;t get any deeper understanding of the behaviour.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 8px; padding-right: 8px; color: black; font-size: 1.01em; border-radius: 8px; background-color: #d1ecf1; border: 1px solid #bee5eb;\">\n<p>\nOkay, configuring Kafka producers to use compression <strong>and<\/strong> a batch.size of 0 is contrived, and not something you are likely to come across.\n<\/p>\n<p>\nI&#8217;m using this to illustrate that the number of records a Flink job can process per second is influenced by the configuration choices made by the Kafka producer that created the records your Flink job is processing (even when you don&#8217;t change any configuration of your Flink job).\n<\/p>\n<p>\nWhat I&#8217;m showing here is an extreme example of this.\n<\/p>\n<\/blockquote>\n<h3>Impact of using interval joins<\/h3>\n<p>\nSome of the Flink jobs I ran included an interval join with a second, 
low-throughput topic. For those jobs, I was <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5806\">correlating the high-throughput click stream topic with a second topic of new customer registrations<\/a>.\n<\/p>\n<p>\nComparing the Flink jobs I ran that:<\/p>\n<ul>\n<li><strong>&#8220;no join&#8221;<\/strong> &#8211; did not include an interval join<\/li>\n<li><strong>&#8220;join&#8221;<\/strong> &#8211; included an interval join <\/li>\n<\/ul>\n<p>(but were otherwise consuming the same events from the same topics and doing the same processing, configured in the same way), I get:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=45000&n=2&c0=interval+join%3Dnone&c1=interval+join%3D15+minutes%26interval+join%3D1+hour&l0=no%2520join&l1=join\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-17.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;n=2&#038;c0=interval+join%3Dnone&#038;c1=interval+join%3D15+minutes%26interval+join%3D1+hour&#038;l0=no%2520join&#038;l1=join\"><\/script>\n<\/div>\n<p>\nI saw a small, but mostly repeated, reduction in the rate that Flink could consume the click tracking events when I included an interval join.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 6px; padding-right: 6px; color: black; border-radius: 6px; background-color: #d5d6d7; border: 1px solid #004a59;\">\n<p>\n<em>The top group are all Flink jobs using a hashmap state backend.<\/em>\n<\/p>\n<p>\n<em>The bottom group are all Flink jobs with a RocksDB state backend. 
<\/em>\n<\/p>\n<\/blockquote>\n<p>\nIt&#8217;s worth noting that the impact here won&#8217;t solely have come from the JOIN itself &#8211; the Flink jobs I ran that included an interval join needed a second Kafka source for the second topic. As well as running an additional Kafka consumer, the Flink job had another topic to deserialize events for. While I intentionally chose a very low-throughput second topic to minimise this, the impact won&#8217;t have been nothing.\n<\/p>\n<p>\nI also tried two different options for the interval join duration, to see the <strong>impact of interval duration<\/strong> on the number of records processed per second.\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=45000&n=2&c0=interval+join%3D15+minutes&c1=interval+join%3D1+hour&l0=15%2520mins&l1=1%2520hour\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-18.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=45000&#038;n=2&#038;c0=interval+join%3D15+minutes&#038;c1=interval+join%3D1+hour&#038;l0=15%2520mins&#038;l1=1%2520hour\"><\/script>\n<\/div>\n<p>\nI tried two variations: using a 15 minute interval, and using a 1 hour interval. 
I expect that the Flink job runs on the right will have stored roughly 4 times as much state as the job runs on the left.\n<\/p>\n<p>\nThis didn&#8217;t seem to make much difference.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 8px; padding-right: 8px; color: black; font-size: 1.01em; border-radius: 8px; background-color: #d1ecf1; border: 1px solid #bee5eb;\">\n<p>\nAcross the variety of Flink jobs that I ran, I saw jobs using an interval join process click tracking events more slowly than jobs that didn&#8217;t, but this reduction was small.\n<\/p>\n<p>\nThe duration of the interval join didn&#8217;t appear to make a significant difference. This is perhaps because on my topics, the matches to be found mostly occurred within the smaller 15 minute interval anyway, so the longer interval would not have significantly changed the output. As a result, the impact primarily came from consuming and deserializing events from a second topic, and having to do a lookup on every click tracking event &#8211; all of which was work that was needed whatever the size of the state that the lookup was done in.\n<\/p>\n<\/blockquote>\n<h3>Impact of using session window aggregates<\/h3>\n<p>\nSome of the Flink jobs I ran aggregated click tracking events within a session window, computing <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5806\">aggregates for click events by the same user that were part of the same session<\/a>.<\/p>\n<ul>\n<li><strong>&#8220;no window&#8221;<\/strong> &#8211; did not include any windows<\/li>\n<li><strong>&#8220;window&#8221;<\/strong> &#8211; used a session window to aggregate click events with the same sessionid<\/li>\n<\/ul>\n<p>\nComparing the <strong>impact of including session window aggregates<\/strong> on Flink jobs that I ran that used a <strong>hashmap<\/strong> state backend:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a 
href=\"slope-chart-generator.html?ymin=20000&ymax=45000&g0=state+backend%3Dhashmap&n=2&c0=windowed+state%3Dnone&c1=windowed+state%3D15+minutes%26windowed+state%3D1+hour&l0=no%2520window&l1=window\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-19.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=20000&#038;ymax=45000&#038;g0=state+backend%3Dhashmap&#038;n=2&#038;c0=windowed+state%3Dnone&#038;c1=windowed+state%3D15+minutes%26windowed+state%3D1+hour&#038;l0=no%2520window&#038;l1=window\"><\/script>\n<\/div>\n<p>\nMy Flink jobs using a session window consistently consumed click tracking events at a faster rate than the equivalent jobs without a session window (when I used a hashmap state backend).\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 6px; padding-right: 6px; color: black; border-radius: 6px; background-color: #d5d6d7; border: 1px solid #004a59;\">\n<p>\nThe Flink jobs without a session window were outputting something for every click tracking event input. 
The Flink jobs using a session window only produced output when each session window closed.\n<\/p>\n<p>\nFor example, if Bob spends ten minutes clicking around the retail site, generating thirty separate click events, the Flink jobs without a session window would output 30 events about Bob, the Flink jobs with a session window would output 1 event about Bob.\n<\/p>\n<\/blockquote>\n<p>\nPerhaps the reduced amount of output needing to be serialized freed up resource to allow the Flink jobs to process the click tracking events more quickly, even after the cost of maintaining the session state and computing the aggregates.\n<\/p>\n<p>\nComparing the <strong>impact of including session window aggregates<\/strong> on Flink jobs that I ran that used a <strong>RocksDB<\/strong> state backend:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=5000&ymax=32500&g0=state+backend%3Drocksdb&n=2&c0=windowed+state%3Dnone&c1=windowed+state%3D15+minutes%26windowed+state%3D1+hour&l0=no%2520window&l1=window\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-20.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=5000&#038;ymax=32500&#038;g0=state+backend%3Drocksdb&#038;n=2&#038;c0=windowed+state%3Dnone&#038;c1=windowed+state%3D15+minutes%26windowed+state%3D1+hour&#038;l0=no%2520window&#038;l1=window\"><\/script>\n<\/div>\n<p>\nMy Flink jobs using a session window consistently consumed click tracking events at a slower rate than the equivalent jobs without a session window.\n<\/p>\n<p>\nThese jobs should have benefited from the same reduction in output that the hashmap jobs did. 
I assume that this was outweighed by the increased cost of updating a disk-based state store in response to every click tracking event, and having to retrieve the state from a disk-based state store when computing aggregates.\n<\/p>\n<p>\nI also tried two different options for the session window gap, to see the <strong>impact of session window size<\/strong> on the number of records processed per second.\n<\/p>\n<p>\nLooking first at the impact on Flink jobs using a <strong>hashmap<\/strong> state backend:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=24000&ymax=45000&g0=state+backend%3Dhashmap&n=2&c0=windowed+state%3D15+minutes&c1=windowed+state%3D1+hour&l0=15%2520mins&l1=1%2520hour\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-21.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=24000&#038;ymax=45000&#038;g0=state+backend%3Dhashmap&#038;n=2&#038;c0=windowed+state%3D15+minutes&#038;c1=windowed+state%3D1+hour&#038;l0=15%2520mins&#038;l1=1%2520hour\"><\/script>\n<\/div>\n<p>\n<em>I tried two variations: using a 15 minute session window gap, and using a 1 hour session window gap.<\/em>\n<\/p>\n<p>\nThe longer session window caused a relatively small, but consistent, reduction in the rate at which click tracking events were processed.\n<\/p>\n<p>\nThe jobs will have produced mostly the same output either way. The difference is that the jobs on the right will have waited for a longer time for the session gap to be exceeded before computing and outputting the aggregates.\n<\/p>\n<p>\nLooking at the <strong>impact of session window size<\/strong> on the Flink jobs that used a <strong>RocksDB<\/strong> state backend:\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a 
href=\"slope-chart-generator.html?ymin=0&ymax=10000&g0=state+backend%3Drocksdb&n=2&c0=windowed+state%3D15+minutes&c1=windowed+state%3D1+hour&l0=15%2520mins&l1=1%2520hour\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-22.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=10000&#038;g0=state+backend%3Drocksdb&#038;n=2&#038;c0=windowed+state%3D15+minutes&#038;c1=windowed+state%3D1+hour&#038;l0=15%2520mins&#038;l1=1%2520hour\"><\/script>\n<\/div>\n<p>\nThe impact here was more mixed.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 8px; padding-right: 8px; color: black; font-size: 1.01em; border-radius: 8px; background-color: #d1ecf1; border: 1px solid #bee5eb;\">\n<p>\nUsing a session window significantly reduced the amount of output that my Flink jobs produced for this scenario.\n<\/p>\n<p>\nThis is perhaps why I saw an increase in number of records they processed per second when the cost of state this required was cheap (using a hashmap state backend) but a decrease in the number of records processed per second where the cost of using state was expensive (using a RocksDB state backend).\n<\/p>\n<\/blockquote>\n<h3>Impact of checkpointing<\/h3>\n<p>\nLooking still at the results for the most stateful Flink jobs that I ran (the jobs that included both an interval join and a session window), I ran them with:<\/p>\n<ul>\n<li><strong>&#8220;no checkpoints&#8221;<\/strong> &#8211; checkpointing was disabled<\/li>\n<li><strong>&#8220;checkpoints&#8221;<\/strong> &#8211; checkpoints were created every 30 seconds<\/li>\n<\/ul>\n<p>\nThe <strong>impact of enabling checkpoints<\/strong> on <strong>stateful jobs<\/strong> that used a <strong>hashmap<\/strong> state backend.\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a 
href=\"slope-chart-generator.html?ymin=24000&ymax=44000&g0=%21interval+join%3Dnone%7C%21windowed+state%3Dnone%7Cstate+backend%3Dhashmap&n=2&c0=checkpoints%3Dnone%7Ccheckpoint+storage%3Dnot+used&c1=checkpoints%3D30+seconds%7Ccheckpoint+storage%3Dlocal%26checkpoints%3D30+seconds%7Ccheckpoint+storage%3Dremote&l0=no%2520checkpoints&l1=checkpoints\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-23.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=24000&#038;ymax=44000&#038;g0=%21interval+join%3Dnone%7C%21windowed+state%3Dnone%7Cstate+backend%3Dhashmap&#038;n=2&#038;c0=checkpoints%3Dnone%7Ccheckpoint+storage%3Dnot+used&#038;c1=checkpoints%3D30+seconds%7Ccheckpoint+storage%3Dlocal%26checkpoints%3D30+seconds%7Ccheckpoint+storage%3Dremote&#038;l0=no%2520checkpoints&#038;l1=checkpoints\"><\/script>\n<\/div>\n<p>\nAs I expected, there was an impact in requesting that Flink periodically save its state to files. 
Checkpoints bring resiliency across crashes and restarts, so I consider them worth the cost, but it was interesting to see the degree of impact it had on these particular click-track processing jobs.\n<\/p>\n<p>\nLooking at the <strong>impact of enabling checkpoints<\/strong> on <strong>stateful jobs<\/strong> that used a <strong>RocksDB<\/strong> state backend.\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=20000&g0=%21interval+join%3Dnone%7C%21windowed+state%3Dnone%7Cstate+backend%3Drocksdb&n=2&c0=checkpoints%3Dnone%7Ccheckpoint+storage%3Dnot+used&c1=checkpoints%3D30+seconds%7Ccheckpoint+storage%3Dlocal%26checkpoints%3D30+seconds%7Ccheckpoint+storage%3Dremote&l0=no%2520checkpoints&l1=checkpoints\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-24.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=20000&#038;g0=%21interval+join%3Dnone%7C%21windowed+state%3Dnone%7Cstate+backend%3Drocksdb&#038;n=2&#038;c0=checkpoints%3Dnone%7Ccheckpoint+storage%3Dnot+used&#038;c1=checkpoints%3D30+seconds%7Ccheckpoint+storage%3Dlocal%26checkpoints%3D30+seconds%7Ccheckpoint+storage%3Dremote&#038;l0=no%2520checkpoints&#038;l1=checkpoints\"><\/script>\n<\/div>\n<p>\nThese jobs went significantly faster when I enabled checkpoints. 
I wasn&#8217;t expecting that.\n<\/p>\n<p>\nWhen I changed the checkpoint interval, from the 30 seconds shown above, to <strong>15 minutes<\/strong> (still showing <strong>stateful jobs<\/strong> that used a <strong>RocksDB<\/strong> state backend):\n<\/p>\n<div class=\"slope-chart-container\">\n<!--<a href=\"slope-chart-generator.html?ymin=0&ymax=20000&g0=%21interval+join%3Dnone%7C%21windowed+state%3Dnone%7Cstate+backend%3Drocksdb&n=2&c0=checkpoints%3Dnone%7Ccheckpoint+storage%3Dnot+used&c1=checkpoints%3D15+minutes%7Ccheckpoint+storage%3Dlocal%26checkpoints%3D15+minutes%7Ccheckpoint+storage%3Dremote&l0=no%2520checkpoints&l1=15%2520min%2520checkpoints\">edit<\/a>--><br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-25.png\" class=\"slope-chart-placeholder\" style=\"max-width: 450px;\"\/><br \/>\n<script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/slope-chart-embed.js?ymin=0&#038;ymax=20000&#038;g0=%21interval+join%3Dnone%7C%21windowed+state%3Dnone%7Cstate+backend%3Drocksdb&#038;n=2&#038;c0=checkpoints%3Dnone%7Ccheckpoint+storage%3Dnot+used&#038;c1=checkpoints%3D15+minutes%7Ccheckpoint+storage%3Dlocal%26checkpoints%3D15+minutes%7Ccheckpoint+storage%3Dremote&#038;l0=no%2520checkpoints&#038;l1=15%2520min%2520checkpoints\"><\/script>\n<\/div>\n<p>\nThere was still a speed benefit from creating checkpoints (which still surprised me), but checkpointing less frequently reduced the benefit. 
This was the opposite of what I expected &#8211; I&#8217;d incorrectly assumed that (if you don&#8217;t mind the increased recovery time) checkpointing less frequently would tend to reduce the impact of doing checkpointing.\n<\/p>\n<p>\nThis sent me down a fascinating rabbit-hole&#8230; &#8220;<a href=\"https:\/\/flink.apache.org\/2018\/01\/30\/managing-large-state-in-apache-flink-an-intro-to-incremental-checkpointing\/\">Managing Large State in Apache Flink: An Intro to Incremental Checkpointing<\/a>&#8221; talks about incremental checkpointing triggering &#8220;a flush in RocksDB, forcing all memtables into sstables on disk, and hard-linked in a local temporary directory&#8221;.\n<\/p>\n<p>\nAnd the <a href=\"https:\/\/github.com\/facebook\/rocksdb\/wiki\/RocksDB-Tuning-Guide\">RocksDB tuning guide<\/a> has a whole section on the impact of <a href=\"https:\/\/github.com\/facebook\/rocksdb\/wiki\/RocksDB-Tuning-Guide#tuning-flushes-and-compactions\">flushes and compactions<\/a> on RocksDB performance. There is a whole world of <a href=\"https:\/\/nightlies.apache.org\/flink\/flink-docs-master\/docs\/ops\/state\/large_state_tuning\/#tuning-rocksdb-or-forst\">tuning info for RocksDB in Flink<\/a> that I haven&#8217;t properly explored, so I can&#8217;t claim deep expertise here.\n<\/p>\n<p>\nUltimately, for the small number of Flink job runs I did as part of this, the performance of the more stateful jobs benefited from checkpointing. Perhaps this was because the incremental checkpointing was triggering RocksDB compaction and memtable flushes, which prevented the LSM tree from fragmenting or bloating as much as it did when I ran the Flink job with checkpointing disabled. But with only a single metric (the number of records per second in the Kafka source), there&#8217;s no way to know without a proper analysis.\n<\/p>\n<p>\nThis may well have been an edge case that applies for my specific types of flows or the click tracking events I was processing. 
I don&#8217;t assume that this is a generally applicable tuning recommendation, but I was still pleased to learn some new Flink stuff from doing this!\n<\/p>\n<p><script src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/v1\/all-runs-summary-embedded.js\"><\/script><\/p>\n<h3>Summary<\/h3>\n<h4>What about&#8230;?<\/h4>\n<p>\nThere are so many aspects I didn&#8217;t explore.\n<\/p>\n<p>\nPerhaps most importantly, I didn&#8217;t explore the impact of parallelism. All jobs I ran had a max parallelism of 1 on a single CPU core, consuming from single partition topics. This constraint must have been influential. For the Flink jobs that appear to have been bottlenecked on I\/O (e.g. stateful jobs using a RocksDB state backend) there will have been time where the CPU was blocking on disk access. Increasing the parallelism would allow this idle time to be used on computations for other sub-tasks. It would have been interesting to see the impact this would have on throughput, even if I didn&#8217;t increase the CPU limit available to the task manager pod. This may well have outweighed the patterns I did observe.\n<\/p>\n<p>\nThere are other aspects I could&#8217;ve tried too, but even with the few factors I did try the number of permutations got out of hand!\n<\/p>\n<h4>What I did see<\/h4>\n<p>\nUltimately, this was never meant to be exhaustive. I picked a few aspects that would be simple to alter, and that I suspected would have an easily noticeable impact. 
I did see some patterns in the performance of the Flink jobs that I ran:<\/p>\n<ul>\n<li>my stateful jobs ran faster when I used a hashmap state backend than when I used RocksDB (for stateless jobs, it made minimal difference)<\/li>\n<li>my stateless jobs ran slower as I used more complex functions<\/li>\n<li>my jobs consuming Avro data from topics ran faster than jobs consuming JSON data &#8211; but the amount of impact this had varied<\/li>\n<li>my jobs consuming from topics that had events in large, compressed batches went faster than jobs consuming uncompressed batches when using a hashmap state backend<\/li>\n<li>my jobs that included interval joins ran a bit slower than similar jobs without a join (the interval duration didn&#8217;t make much difference)<\/li>\n<li>my jobs that computed aggregates over a session window were slower (than similar jobs that didn&#8217;t use a session window) when using a RocksDB state backend, but faster than similar non-window jobs when using a hashmap state backend<\/li>\n<li>enabling checkpointing slowed down my jobs that used a hashmap state backend but sped up the jobs using a RocksDB state backend<\/li>\n<\/ul>\n<p>\nAll of this resulted in a wide spread in terms of how quickly my Flink jobs processed the same type of data.\n<\/p>\n<div class=\"flink-chart-embed\">\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/screenshot-00.png\" style=\"max-width: 100%; height: auto; display: block;\"\/>\n<\/div>\n<h4>What this means<\/h4>\n<p>\nBut to be clear, this isn&#8217;t a benchmark. These results are specific to my jobs processing events from my topics. Don&#8217;t mistake this for a generic performance report.\n<\/p>\n<p>\nAt any rate, I&#8217;m not claiming that the ~45,000 events per second I saw in the fastest jobs I ran is particularly high or impressive. 
As I noted above, I didn&#8217;t run any of this on isolated hardware &#8211; it was all run on a low-spec virtualised development cluster, on a platform that I share with many other developers.\n<\/p>\n<p>\nTo avoid needing to alter memory limits between different jobs, I gave every job the same ridiculously large amount of memory. That probably masked some bottlenecks, and maybe even introduced other problems. This was a quick-and-dirty demonstration in a shared noisy environment, that&#8217;s all.\n<\/p>\n<p>\nPlease don&#8217;t misunderstand my intention here when I point out patterns I think are interesting in the data I happened to see in these particular jobs. I&#8217;m absolutely not claiming any of these as definitive rules or that they would apply to other types of Flink jobs processing other types of events or data.\n<\/p>\n<p>\nMore importantly, please don&#8217;t misinterpret anything here as tuning recommendations for your Flink job with your data. I&#8217;m not saying that setting option X or pulling lever Y will always make a Flink job run faster. I&#8217;m trying to say almost the <strong>opposite<\/strong> of that.\n<\/p>\n<blockquote style=\"margin: 0; padding-left: 8px; padding-right: 8px; color: black; font-size: 1.01em; border-radius: 8px; background-color: #d1ecf1; border: 1px solid #bee5eb;\">\n<p>\nI&#8217;m saying that understanding how many Kafka events your Flink job will process per second will depend on an understanding of:<\/p>\n<ul>\n<li>the types of processing you include in your SQL<\/li>\n<li>how you tune and configure Flink <\/li>\n<li>a whole bunch of things you probably can&#8217;t control from Flink &#8211; such as the format of the events you&#8217;re processing, whether the events are compressed, how they are batched, and much more besides<\/li>\n<\/ul>\n<\/blockquote>\n<p>\nThings that might reduce CPU usage in one area might be hugely beneficial for some jobs. 
For jobs that aren&#8217;t CPU bottlenecked in that area, they might have negligible impact.\n<\/p>\n<p>\nAs I described in my post on <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5906\">how I deploy Flink jobs in Kubernetes<\/a>, part of deploying your job is to look at the metrics from your job running in your environment, and to use that to inform how you configure it.\n<\/p>\n<p>\nIf you really want to know the answer to &#8220;<strong>How many Kafka events will Flink process per second?<\/strong>&#8221;, there is no substitute for benchmarking and testing in a representative environment using a representative workload.\n<\/p>\n<p>\nFlink is highly tunable, but the configuration options available to us are context sensitive. An understanding of your event data and your processing characteristics is essential to identifying sensible configuration &#8211; in a way that blanket &#8220;turn on option X&#8221; rules won&#8217;t cover.\n<\/p>\n<p style=\"margin-top: 70px;\">\n<hr \/>\n<hr \/>\n<p><a name=\"details\"><\/a><\/p>\n<h3>How I did this<\/h3>\n<p>\nMy starting point was the sorts of <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5806\">Flink SQL flows I typically create in demos using click-tracking events<\/a>. I created a few variations to give me jobs with\/without interval joins, session windows, different types of stateless processing, etc. 
I ended up with seven basic types of Flink SQL flow.\n<\/p>\n<p>\nI deployed these flows to the same Kubernetes cluster where my Kafka cluster was, using the <a href=\"https:\/\/nightlies.apache.org\/flink\/flink-kubernetes-operator-docs-main\/\">Flink Kubernetes Operator<\/a> and <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5906\">the approach I talked through last month<\/a>.\n<\/p>\n<p>\nAll Flink jobs consumed the same click-tracking event data from single-partition topics on the same Kafka cluster, and all Flink jobs were run with a parallelism of 1.\n<\/p>\n<p>\nAll Flink jobs consumed from the earliest offset on the source topic, and ran until they had consumed three days (72 hours) of events. The topics had over 84 hours of events, so the Flink jobs were never allowed to catch up with the latest offset &#8211; they were consuming historical retained events as fast as they could. (The first 5 minutes of each Flink job activity was treated as startup and warmup time, and ignored for all of this.)\n<\/p>\n<p>\nAll Flink jobs were run individually &#8211; no other jobs were running at the same time.\n<\/p>\n<p>\nAll Flink jobs were run on the same k8s worker node, a worker node which didn&#8217;t host any of the Kafka brokers &#8211; to try and keep network latency between Flink and Kafka roughly consistent.\n<\/p>\n<p>\nAll Flink jobs were run on a task manager with CPU limited to a single CPU core, and a very high amount of memory available (20gb).\n<\/p>\n<p>\nAll Flink jobs used the same version of Flink (2.2).\n<\/p>\n<p>\nWhen checkpointing was used (see below), incremental checkpointing was always enabled. When RocksDB was used (see below), it always used local disk storage.\n<\/p>\n<p>\nI used numRecordsIn metrics from Flink for the click tracking Kafka source, averaged over the duration of the job, as a simplified representative proxy for how fast the Flink job was going. 
(For jobs that were also consuming a second topic to include an interval join, I ignored the metrics for this second Kafka source).\n<\/p>\n<h3>How I used Generative AI for this post<\/h3>\n<h4>Orchestrating running the Flink jobs<\/h4>\n<p>\nI used <a href=\"https:\/\/www.ibm.com\/products\/bob\">an AI IDE<\/a> to write the script that automated:<\/p>\n<ul>\n<li>generating a configmap with a set of configuration options to test<\/li>\n<li>starting the Flink job (by applying k8s custom resources for the FlinkDeployment and supporting configmaps and PVCs)<\/li>\n<li>polling the Flink REST API to:\n<ul>\n<li>collect and average the metrics<\/li>\n<li>monitor the watermark for the click tracking event source to identify when the job had reached the target timestamp<\/li>\n<\/ul>\n<\/li>\n<li>deleting and cleaning up the Flink job (by deleting the FlinkDeployment and supporting other custom resources)<\/li>\n<li>writing the results to a CSV file<\/li>\n<li>repeating this for the different permutations of configuration options to be tested<\/li>\n<\/ul>\n<p>\nThis meant I could leave this running in the background for a few weeks and forget about it, especially useful <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=5993\">as I went on holiday<\/a>!\n<\/p>\n<h4>Creating the visualisations<\/h4>\n<p>\nI also used <a href=\"https:\/\/www.ibm.com\/products\/bob\">an AI IDE<\/a> to vibe-code a tool to generate slope charts &#8211; charts that I used to explore the data and then to illustrate this post.\n<\/p>\n<p>\nThe chart generator is a single-page HTML+JavaScript tool, that I&#8217;m the only user of, so I didn&#8217;t pay very much attention to the implementation other than to check it generates accurate charts.\n<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/images.dalelane.co.uk\/2026-04-10-flinkperf\/chart-generator.jpg\" style=\"border: thin black solid; width: 100%; max-width: 600px;\"\/><\/p>\n<h4>Why AI?<\/h4>\n<p>\nI don&#8217;t think either of these are things 
that I couldn&#8217;t have done myself. But, it was quicker to let AI do them. GenAI meant that the jump from my idle thought of &#8220;I wonder what difference it&#8217;d make if I ran this Flink SQL job a bunch of times with different configs&#8221; to actually doing it was so small that this was an easy thing to kick off before heading home one Friday afternoon.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m often asked this. The specific question varies, but it&#8217;s typically some variation of asking how quickly a single CPU of Flink processes events from a Kafka topic. Why &#8220;per CPU&#8221;? Maybe because enterprise software is typically charged per CPU? Maybe because I tend to talk to people who run everything in Kubernetes, who think [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6003,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[615,610],"class_list":["post-6001","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-code","tag-apacheflink","tag-flink"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/6001","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6001"}],"version-history":[{"count":11,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/6001\/revisions"}],"predecessor-version":[{"id":6017,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/6001\/revisions\/6017"}],"wp:featuredm
edia":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/media\/6003"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6001"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6001"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6001"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}