How to write your first Avro schema

Any time there is more than one developer using a Kafka topic, they will need a way to agree on the shape of the data that will go into messages. The most common way to document the schema of messages in Kafka is to use the Apache Avro serialization system.

This post is a beginner’s guide to writing your first Avro schema, with a few tips for how to use it in your Kafka apps.

Strings

Let’s start simple.

A schema that describes a message with a single string field.

So you want to have messages a bit like this:

{
    "something" : "Hello"
}

I’ve got one field, and it’s a string called something.

You’d document that like this:
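(A sketch – the namespace is just an example, but the record name Type1 is what the generated Java class below will be called.)

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type1",
    "fields": [
        { "name": "something", "type": "string" }
    ]
}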

Then you can use that in your Kafka Producer apps by creating a message to send like this:
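(A sketch – it assumes you already have a KafkaProducer configured with an Avro serializer, called producer here, and a topic name of your own choosing.)

Type1 message = new Type1();
message.setSomething("Hello");

producer.send(new ProducerRecord<>("MY.TOPIC", message));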

You don’t have to create the Type1 class – Avro will generate that for you, based on the schema definition. I’ll explain how to do that at the end of the post. For now, the point is that once you’ve defined the schema, it will be represented as a Java class with set methods for each of the fields.

In a Kafka Consumer, you’d use getters instead of setters, but it’s basically the same sort of thing.

Other primitive data types

Once you get that idea, it’s pretty simple to guess how to use other data types in your schemas.
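For example, a schema using several of Avro’s primitive types could look something like this (the record and field names are just examples):

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type2",
    "fields": [
        { "name": "myString",  "type": "string" },
        { "name": "myInt",     "type": "int" },
        { "name": "myLong",    "type": "long" },
        { "name": "myFloat",   "type": "float" },
        { "name": "myDouble",  "type": "double" },
        { "name": "myBoolean", "type": "boolean" },
        { "name": "myBytes",   "type": "bytes" }
    ]
}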

You’d use that in a Kafka Producer like this:
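(Another sketch, using the same assumed producer as before.)

Type2 message = new Type2();
message.setMyString("Hello");
message.setMyInt(123);
message.setMyLong(9999999999L);
message.setMyFloat(1.5f);
message.setMyDouble(3.14159);
message.setMyBoolean(true);
// bytes fields map to java.nio.ByteBuffer in the generated class
message.setMyBytes(ByteBuffer.wrap(new byte[] { 0x01, 0x02 }));

producer.send(new ProducerRecord<>("MY.TOPIC", message));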

Constants / Enums

Enums are super useful. For example, maybe you want a field in your Kafka messages that will always be a compass direction.
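A schema for that could look something like this (the record name is just an example):

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type3",
    "fields": [
        {
            "name": "direction",
            "type": {
                "type": "enum",
                "name": "Direction",
                "symbols": [ "NORTH", "SOUTH", "EAST", "WEST" ]
            }
        }
    ]
}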

You can use these constants when creating messages for your Kafka Producer like this:
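(The enum is generated as a Java enum called Direction – another sketch, using the same assumed producer.)

Type3 message = new Type3();
message.setDirection(Direction.NORTH);

producer.send(new ProducerRecord<>("MY.TOPIC", message));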

Arrays

It’s not all just individual values though. Lists of things can be defined like this:
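(A sketch – I’ve used an array of ints here, but the items can be any Avro type.)

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type4",
    "fields": [
        {
            "name": "myScores",
            "type": {
                "type": "array",
                "items": "int"
            }
        }
    ]
}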

In Java, this would map to normal Lists.
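For example, something like this with the same assumed producer:

Type4 message = new Type4();
message.setMyScores(Arrays.asList(10, 20, 30));

producer.send(new ProducerRecord<>("MY.TOPIC", message));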

Maps

To allow an arbitrary set of keys, use a Map.
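Keys in an Avro map are always strings, so you only specify the type of the values. A sketch:

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type5",
    "fields": [
        {
            "name": "myLabels",
            "type": {
                "type": "map",
                "values": "string"
            }
        }
    ]
}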

This would be a Map in Java, too.
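For example (this sketch assumes your Avro build is configured to generate java.lang.String for Avro strings – by default the generated code uses CharSequence):

Map<String, String> labels = new HashMap<>();
labels.put("colour", "red");
labels.put("size", "large");

Type5 message = new Type5();
message.setMyLabels(labels);

producer.send(new ProducerRecord<>("MY.TOPIC", message));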

Combining all of these

Combining primitives, enums, arrays and maps lets you describe a wide variety of message structures.

For example, what about a map, where every value is an enum?
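Reusing the compass direction enum from before, that could look something like this:

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type6",
    "fields": [
        {
            "name": "windDirections",
            "type": {
                "type": "map",
                "values": {
                    "type": "enum",
                    "name": "Direction",
                    "symbols": [ "NORTH", "SOUTH", "EAST", "WEST" ]
                }
            }
        }
    ]
}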

You’d use that sort of schema like this:
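(A sketch, with the same assumptions as the earlier map example.)

Map<String, Direction> windDirections = new HashMap<>();
windDirections.put("london", Direction.NORTH);
windDirections.put("paris", Direction.EAST);

Type6 message = new Type6();
message.setWindDirections(windDirections);

producer.send(new ProducerRecord<>("MY.TOPIC", message));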

Or what about a map, where every value is an array of things?
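Something like this, perhaps:

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type7",
    "fields": [
        {
            "name": "scoresByPlayer",
            "type": {
                "type": "map",
                "values": {
                    "type": "array",
                    "items": "int"
                }
            }
        }
    ]
}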

Translating these to Java is straightforward:
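(Another sketch, with the same assumptions as before.)

Map<String, List<Integer>> scores = new HashMap<>();
scores.put("alice", Arrays.asList(10, 20, 30));
scores.put("bob", Arrays.asList(5, 15));

Type7 message = new Type7();
message.setScoresByPlayer(scores);

producer.send(new ProducerRecord<>("MY.TOPIC", message));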

Union

Maybe you want one field to be a string or an integer? Or another field to be some sort of number, but one that could be either an integer or a decimal?

You can specify lists of types, so that your schema says a value of any of these types is valid.
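For example, a sketch with both of those fields:

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type8",
    "fields": [
        { "name": "myId",     "type": [ "string", "int" ] },
        { "name": "myNumber", "type": [ "int", "double" ] }
    ]
}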

That gives you more options, for example:
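(In the generated Java class, union fields are typed as Object, so any of the listed types is accepted. Another sketch with the same assumed producer.)

Type8 firstMessage = new Type8();
firstMessage.setMyId("abc-123");    // a string this time
firstMessage.setMyNumber(42);       // an int this time

Type8 secondMessage = new Type8();
secondMessage.setMyId(9876);        // an int is valid too
secondMessage.setMyNumber(3.14);    // and so is a double

producer.send(new ProducerRecord<>("MY.TOPIC", firstMessage));
producer.send(new ProducerRecord<>("MY.TOPIC", secondMessage));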

Optional fields

The other thing this “union” support for lists of types is useful for is defining optional fields.

If you include null in the list of types, the field is optional. (It also means that any field without null in its list of types is required.)

Doing this means I’ve got one required string field, and one optional one:
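(A sketch – giving the optional field a default of null is a common convention.)

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type9",
    "fields": [
        { "name": "requiredField", "type": "string" },
        { "name": "optionalField", "type": [ "null", "string" ], "default": null }
    ]
}

In the generated Java class, the optional field can simply be left unset (or set to null).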

Logical types

In addition to all of these data types, you can use a logical type to describe how a field’s value should be interpreted. There are a bunch of pre-defined logical types.

For example, you can encode a date as a number in a variety of ways. So adding a logicalType property is a way of documenting which one you are using.

A logical type of date means the number of days since the Unix epoch.

So you could use that like this:
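For example, a sketch with one field for a date and one for a timestamp (the field names are just examples):

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type10",
    "fields": [
        {
            "name": "orderDate",
            "type": { "type": "int", "logicalType": "date" }
        },
        {
            "name": "orderTime",
            "type": { "type": "long", "logicalType": "timestamp-millis" }
        }
    ]
}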

A logical type of timestamp-millis means the number of milliseconds since the Unix epoch.

That is your way of documenting that the dates in your messages actually look something more like this:
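(Both of these values represent 1 October 2020.)

{
    "orderDate" : 18536,
    "orderTime" : 1601510400000
}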

Custom types

You can also add your own logical type definitions:
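For example, a sketch that annotates a string field with a made-up logical type called phone-number:

{
    "namespace": "demo.schemas",
    "type": "record",
    "name": "Type11",
    "fields": [
        {
            "name": "phoneNumber",
            "type": { "type": "string", "logicalType": "phone-number" }
        }
    ]
}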

These need a little more work to set up, as you need to provide your custom logical type definition. That can be something like this:
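(A sketch that extends Avro’s LogicalType class, matching the made-up phone-number type above.)

import org.apache.avro.LogicalType;
import org.apache.avro.Schema;

public class PhoneNumberLogicalType extends LogicalType {

    public static final String NAME = "phone-number";

    public PhoneNumberLogicalType() {
        super(NAME);
    }

    @Override
    public void validate(Schema schema) {
        super.validate(schema);
        // a phone number only makes sense on top of a string
        if (schema.getType() != Schema.Type.STRING) {
            throw new IllegalArgumentException("phone-number can only be used with a string");
        }
    }
}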

You need to register each logical type in your applications, but that’s just a one-liner:
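Something like this, using the names from the sketch above:

LogicalTypes.register(PhoneNumberLogicalType.NAME, schema -> new PhoneNumberLogicalType());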

Once you’ve registered it, you can use your own custom logical types like this:
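(Without a custom Conversion class the field is still handled as its underlying string type, so the logical type mainly acts as documentation and validation. Another sketch with the same assumed producer.)

Type11 message = new Type11();
message.setPhoneNumber("+44 20 7946 0123");

producer.send(new ProducerRecord<>("MY.TOPIC", message));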

Using an Avro schema with Kafka

Now you know how to write a schema to define the data that you will be putting on your Kafka topics. So where will you store your schemas?

The easiest way to manage them is to use a schema registry.

If you use IBM Event Streams as your Kafka solution, one of the many things it adds to Apache Kafka to make it easier to use is a Schema Registry, which not only stores schemas for you, but also lets you manage different versions of schemas, enforce compatibility between schema versions, integrate with your client apps so they automatically fetch schema definitions on demand, and much more.

If you upload your schemas into the Event Streams schema registry, it will generate all the Java classes you need, with helpful getters and setters – so you can start using them as I did in the examples above. You can download JAR files for the schemas you want through the UI, or use the integrated Maven server to download them automatically on demand.

If you’re not using Event Streams, you can manually generate the Java classes you need based on Avro schema files yourself as described in the Avro documentation.
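The command looks something like this (the avro-tools version and file names here are just examples):

java -jar avro-tools-1.11.3.jar compile schema Type1.avsc ./src/main/java/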

Putting your schemas in a Schema Registry that’s integrated with the rest of your Kafka management is very helpful. Instead of thinking of the records on your Kafka topics as raw data, it means you can start to treat them as semantic data.

Whatever Kafka distribution you choose, the idea of a schema registry is worth considering. The schemas it stores form a contract between your developers, letting your producers promise what they will send to topics, and letting your consumers know what to expect.


2 Responses to “How to write your first Avro schema”

  1. Srinivasan Ragavan says:

    Nice article 🙂

  2. Zhicheng Lai says:

    Best article on Avro