Archive for the ‘code’ Category

“How many Kafka events will Flink process per second?”

Saturday, April 11th, 2026

I’m often asked this. The specific question varies, but it’s typically some variation of asking how quickly a single CPU of Flink processes events from a Kafka topic.

Why “per CPU”? Maybe because enterprise software is typically charged per CPU? Maybe because I tend to talk to people who run everything in Kubernetes, who think of running software in terms of requests / limits? Not sure, but the question tends to be framed from the perspective of asking how much processing they can expect to get from a CPU.

I try to avoid doing the engineer thing of answering “it depends”… but… it really does depend!

That is the motivation behind this post: to give me something I can point at as an illustration of the degree to which Flink’s performance varies (and a taste of the range of interrelated factors that influence it).

(more…)

Extending Flink SQL

Sunday, March 29th, 2026

In this post, I’ll share examples of how writing user-defined functions (UDFs) extends what is possible using built-in Flink SQL functions alone.

I’ll share examples of how UDFs can:

(more…)

Processing JSON with Kafka Connect

Wednesday, February 18th, 2026

In this post, I’ll share examples of how to process JSON data in a Kafka Connect pipeline, and explain the schema format that Kafka uses to describe JSON events. 

Using sink connectors

Kafka Connect sink connectors let you send the events on your Kafka topics to external systems. I’ve talked about this before, but to recap, the structure looks a bit like this:

Imagine that you have this JSON event on a Kafka topic. 

{
    "id": 12345678,
    "message": "Hello World",
    "isDemo": true
}

How should you configure Kafka Connect to send that somewhere? 

It depends…
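As one illustration of what that configuration can look like, here is a minimal sketch of sink connector settings for the case where the topic holds plain JSON like the event above, with no embedded schema. The connector name, class, and topic are placeholders for your own values; the converter classes are the standard ones that ship with Kafka Connect.

name=demo-json-sink
# placeholder - substitute the class for your sink connector
connector.class=com.example.DemoSinkConnector
topics=DEMO.EVENTS
# events are plain JSON with no embedded schema, so use the
# JSON converter with schemas disabled
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
key.converter=org.apache.kafka.connect.storage.StringConverter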

(more…)

Improving support for older computers and mobile devices on Machine Learning for Kids

Friday, January 16th, 2026

In this post, I want to share some changes I’ve been making to how I train models in Machine Learning for Kids.

(more…)

Flink SQL examples with click tracking events

Monday, January 12th, 2026

In this post, I introduce a few core Flink SQL functions using worked examples of processing a stream of click tracking events from a retail website.

I find that a practical, real-world(ish) example can help to explain how to use Flink SQL in a way that abstract descriptions, such as processing coloured blocks, sometimes don’t quite achieve.

I’ll use this post to give examples of my most-used Flink SQL functions, in the context of a retail scenario: a stream of events from customers on the website for a clothing retailer.

Note: I used Event Processing to create the flows, as the assistants in the canvas helped me create examples quickly. Everything I’ve created is standard Apache Flink SQL, so you don’t need to have Event Processing to try these examples.

(more…)

Flink SQL aggregate functions

Monday, November 3rd, 2025

In this post, I want to share a couple of very quick and simple examples for how to use LISTAGG and ARRAY_AGG in Flink SQL.

This started as an answer I gave to a colleague asking about how to output collections of events from Flink SQL. I’ve removed the details and used this post to share a more general version of my answer, so I can point others to it in future.

Windowed aggregations

One of the great things in Flink is that it makes it easy to do time-based aggregations on a stream of events.

Using this is one of the first things that I see people try when they start playing with Flink. The third tutorial we give to new Event Processing users is to take a stream of order events and count the number of orders per hour.

In Flink SQL, that looks like:

SELECT
    COUNT (*) AS `number of orders`,
    window_start,
    window_end,
    window_time
FROM
    TABLE (
        TUMBLE (
            TABLE orders,
            DESCRIPTOR (ordertime),
            INTERVAL '1' HOUR
        )
    )
GROUP BY
    window_start,
    window_end,
    window_time

In our low-code UI, it looks like this:

However you do it, the result is a flow that emits an event at the end of every hour, with a count of how many order events were observed during that hour.

But what if you don’t just want a count of the orders?

What if you want the collection of the actual order events emitted at the end of every hour?

To dream up a scenario using this stream of order events:

At the end of each hour, emit a list of all products ordered during the last hour, so the warehouse pickers can prepare those items for delivery.

This is where some of the other aggregate functions in Flink SQL can help.

LISTAGG

If you just want a single property (e.g. the name / description of the product that was ordered) from all of the events that you collect within each hourly window, then LISTAGG can help.

For example:

SELECT
    LISTAGG (description) AS `products to pick`,
    window_start,
    window_end,
    window_time
FROM
    TABLE (
        TUMBLE (
            TABLE orders,
            DESCRIPTOR (ordertime),
            INTERVAL '1' HOUR
        )
    )
GROUP BY
    window_start,
    window_end,
    window_time

That gives you a concatenated string: a comma-separated list of all of the product descriptions from the events within each hour.

You can use a different separator, but it’s a comma by default.
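For example, to use a semicolon instead, pass the separator as an optional second argument to LISTAGG (this is the same query as above, with only that one change):

SELECT
    LISTAGG (description, ';') AS `products to pick`,
    window_start,
    window_end,
    window_time
FROM
    TABLE (
        TUMBLE (
            TABLE orders,
            DESCRIPTOR (ordertime),
            INTERVAL '1' HOUR
        )
    )
GROUP BY
    window_start,
    window_end,
    window_time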

ARRAY_AGG

If you want to output an object with some or even all of the properties (e.g. the name and quantity of the products that were ordered) from all of the events that you collect within each hourly window, then ARRAY_AGG can help.

For example:

SELECT
    ARRAY_AGG (
        CAST (
            ROW (description, quantity)
                AS
            ROW <description STRING, quantity INT>
        )
    ) AS `products to pick`,
    window_start,
    window_end,
    window_time
FROM
    TABLE (
        TUMBLE (
            TABLE orders,
            DESCRIPTOR (ordertime),
            INTERVAL '1' HOUR
        )
    )
GROUP BY
    window_start,
    window_end,
    window_time

The CAST isn’t necessary, but it lets you give names to the properties instead of the default generated names such as EXPR$0, which makes downstream processing easier.

In each hour, an event is emitted that contains an array of objects, made up of properties from the events in that hour.
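To make that concrete, the payload of each emitted event looks something like this (the values here are purely illustrative, using the field names given in the CAST above):

{
    "products to pick": [
        { "description": "Denim jacket", "quantity": 2 },
        { "description": "Wool scarf", "quantity": 1 }
    ],
    "window_start": "2025-11-03 09:00:00",
    "window_end": "2025-11-03 10:00:00",
    "window_time": "2025-11-03 09:59:59.999"
}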

And naturally you could add additional GROUP BY columns, for example, if you wanted a separate pick list event for each region:

SELECT
    region,
    ARRAY_AGG (
        CAST (
            ROW (description, quantity)
                AS
            ROW <description STRING, quantity INT>
        )
    ) AS `products to pick`,
    window_start,
    window_end,
    window_time
FROM
    TABLE (
        TUMBLE (
            TABLE orders,
            DESCRIPTOR (ordertime),
            INTERVAL '1' HOUR
        )
    )
GROUP BY
    window_start,
    window_end,
    window_time,
    region

This outputs five events at the end of every hour: one for each of the NA, SA, EMEA, APAC, and ANZ regions, each containing the list of orders for that region.

Try it yourself

I’ve used the “Loosehanger” orders data generator in these examples, so if you want to try it for yourself, you can find it at
github.com/IBM/kafka-connect-loosehangerjeans-source.

If you want to try it in Event Processing, the instructions for setting up the tutorial environment can be found at
ibm.github.io/event-automation/tutorials.

Using time series models with IBM Event Automation

Tuesday, July 22nd, 2025

Intro

graphic of an e-bike hire park

Imagine you run a city e-bike hire scheme.

Let’s say that you’ve instrumented your bikes so you can track their location and battery level.

When a bike is on the move, it emits periodic updates to a Kafka topic, and you use these events for a range of maintenance, logistics, and operations reasons.

You also have other Kafka topics, such as a stream of events with weather sensor readings covering the area of your bike scheme.

Do you know how to use predictive models to forecast the likely demand for bikes in the next few hours?

Could you compare these forecasts with the actual usage that follows, and use this to identify unusual demand?

Time series models

A time series is how a machine learning engineer or data scientist would describe a dataset that consists of data values, ordered sequentially over time and labelled with timestamps.

A time series model is a specific type of machine learning model that can analyze this type of sequential time series data. These models are used to predict future values and to identify anomalies.

For those of us used to working with Kafka topics, the machine learning definition of a “time series” sounds exactly like our definition of a Kafka topic. Kafka topics are a sequentially ordered set of data values, each labelled with a timestamp.

(more…)

How to use kafka-console-consumer.sh to view the contents of Apache Avro-encoded events

Thursday, June 12th, 2025

kafka-console-consumer.sh is one of the most useful tools in the Kafka user’s toolkit. But if your topic has Avro-encoded events, the output can be a bit hard to read.

You don’t have to put up with that, as the tool has a formatter plugin framework. With the right plugin, you can get nicely formatted output from your Avro-encoded events.

With this in mind, I’ve written a new Avro formatter for a few common Avro situations. You can find it at:

github.com/IBM/kafka-avro-formatters

The README includes instructions on how to add it to your Kafka console command, and how to configure it to find your schema.

(more…)