{"id":3781,"date":"2019-07-20T22:31:13","date_gmt":"2019-07-20T22:31:13","guid":{"rendered":"https:\/\/dalelane.co.uk\/blog\/?p=3781"},"modified":"2019-07-20T22:43:19","modified_gmt":"2019-07-20T22:43:19","slug":"how-to-write-your-first-avro-schema","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=3781","title":{"rendered":"How to write your first Avro schema"},"content":{"rendered":"<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190721-avro.png\"\/><\/p>\n<p>Any time there is more than one developer using a Kafka topic, they will need a way to agree on the shape of the data that will go into messages. The most common way to document the schema of messages in Kafka is to use the <a href=\"https:\/\/avro.apache.org\/\">Apache Avro<\/a> serialization system.<\/p>\n<p><strong>This post is a beginner&#8217;s guide to writing your first Avro schema, and a few tips for how to use it in your Kafka apps.<\/strong> <\/p>\n<p><!--more--><\/p>\n<h3>Strings<\/h3>\n<p>Let&#8217;s start simple. <\/p>\n<p>A schema that describes a message with a single string field. <\/p>\n<p>So you want to have messages a bit like this:<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.7em\">{\r\n    \"something\" : \"Hello\"\r\n}<\/pre>\n<p>I&#8217;ve got one field, and it&#8217;s a string called <code>something<\/code>. <\/p>\n<p>You&#8217;d document that like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/47e6ab8ce7c502a0f474c9e2ce5914fe.js\"><\/script><\/p>\n<p>Then you can use that in your Kafka Producer apps by creating a message to send like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/9c4ea72c8c2525223d48a6b9210f8567.js\"><\/script><\/p>\n<p>You don&#8217;t have to create the <code>Type1<\/code> class &#8211; Avro will generate that for you, based on the schema definition. I&#8217;ll explain how to do that at the end of the post. 
For now, the point is, once you&#8217;ve defined the schema, that will be represented as a Java class with <code>set<\/code> methods for each of the fields. <\/p>\n<p>In a Kafka Consumer, you&#8217;d use getters instead of setters, but it&#8217;s basically the same sort of thing.<\/p>\n<h3>Other primitive data types<\/h3>\n<p>Once you get that idea, it&#8217;s pretty simple to guess how to use other data types in your schemas. <\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/2e702980ee9e7880ab76580268aff553.js\"><\/script><\/p>\n<p>You&#8217;d use that in a Kafka Producer like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/a6bacee633593a6696fa87af2c78ec0d.js\"><\/script><\/p>\n<h3>Constants \/ Enums<\/h3>\n<p>Enums are super useful. For example, maybe you want a field in your Kafka messages that will always be a compass direction.<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/79508e000fff74f7225b9d14e90521d4.js\"><\/script><\/p>\n<p>You can use these constants when creating messages for your Kafka Producer like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/1b3525933e0d23d0836d487f128a362d.js\"><\/script><\/p>\n<h3>Arrays<\/h3>\n<p>It&#8217;s not all just individual values though. Lists of things can be defined like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/aa350332df2a9a5476f1a45744cc33f3.js\"><\/script><\/p>\n<p>In Java, this would map to normal <code>List<\/code>s. <\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/18934aaa7e008ae16e30092b1e0deab3.js\"><\/script><\/p>\n<h3>Maps<\/h3>\n<p>To allow an arbitrary set of keys, use a Map. 
<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/d63bfa4e9fbdcd45fc4e5592c76747a5.js\"><\/script><\/p>\n<p>This would be a <code>Map<\/code> in Java, too.<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/926b2dfff2e8ef129d2e613fab166d4c.js\"><\/script><\/p>\n<h3>Combining all of these<\/h3>\n<p>Combining primitives, enums, arrays and maps lets you describe a wide variety of schemas. <\/p>\n<p>For example, what about a map, where every value is an enum?<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/b7a0c8b302c4645a61889383a7c7f086.js\"><\/script><\/p>\n<p>You&#8217;d use that sort of schema like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/0942ce2947281e3e4ac23dbf0197bb9a.js\"><\/script><\/p>\n<p>Or what about a map, where every value is an array of things?<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/ae70dc75e7a6a9631034dc60dbab2b73.js\"><\/script><\/p>\n<p>Translating these to Java is straightforward:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/a47cde6d66c6db642c79263adcf2cdb6.js\"><\/script><\/p>\n<h3>Union<\/h3>\n<p>Maybe you want one field to be a string or an integer? Or another field to be some sort of number, which could be an integer or a decimal?<\/p>\n<p>You can specify lists of types, so that your schema says a value of any of these types is valid. <\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/3bb02437547223931c96eac4064223d0.js\"><\/script><\/p>\n<p>That gives you more options, for example:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/35808f31ef0e941e686c8f3691aa50ad.js\"><\/script><\/p>\n<h3>Optional fields<\/h3>\n<p>The other thing you can use this &#8220;union&#8221; support for is defining optional fields. <\/p>\n<p>If you have <code>null<\/code> in the list of types, then that means the field can be optional. 
(It also means that all fields that don&#8217;t have a <code>null<\/code> type are therefore required).<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/360d349dff170d5d0e24a841e772b2dd.js\"><\/script><\/p>\n<p>Doing this means I&#8217;ve got one required string field, and one optional one:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/918a1da2cace73aee6cf42d84bf329e4.js\"><\/script><\/p>\n<h3>Logical types<\/h3>\n<p>In addition to all of these data types, you can also use an additional logical type to describe how the field value should be interpreted. There are a bunch of pre-defined logical types. <\/p>\n<p>For example, you can encode a date as a number in a variety of ways. So adding a <code>logicalType<\/code> property is a way of documenting which way you are using.<\/p>\n<p>A logical type of <code>date<\/code> means number of days since epoch.<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/d9fef854d28b1d09910434e6ffe360da.js\"><\/script><\/p>\n<p>So you could use that like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/0e9aa210d9703a2f401060418385a6bf.js\"><\/script><\/p>\n<p>A logical type of <code>timestamp-millis<\/code> means number of milliseconds since epoch (not to be confused with <code>time-millis<\/code>, which is a time of day in milliseconds after midnight).<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/a532eb6b6d7f8e2d62af5b9987bd46fe.js\"><\/script><\/p>\n<p>That is your way of describing that dates in your messages are defined something more like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/3d74f36af9aa0876c114ad1db605917d.js\"><\/script><\/p>\n<h3>Custom types<\/h3>\n<p>You can also add your own logical type definitions:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/81c779f1ad73028a36089c43c93fabbd.js\"><\/script><\/p>\n<p>These need a little more work to set up &#8211; as you need to add your custom logical type definition. 
That can be something like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/35416f3f05dfed135d437a029afbee41.js\"><\/script><\/p>\n<p>You need to register each logical type in your applications, but that&#8217;s just a one-liner:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/16c8fb3068ecd4b6dfbf7482ad03c889.js\"><\/script><\/p>\n<p>Once you&#8217;ve registered it, you can use your own custom logical types like this:<\/p>\n<p><script src=\"\/\/gist.github.com\/dalelane\/c7afc88551152e13aa68a73edf3910d9.js\"><\/script><\/p>\n<h2>Using an Avro schema with Kafka<\/h2>\n<p>Now you know how to write a schema to define the data that you will be putting on your Kafka topics. So where will you store your schemas? <\/p>\n<p>The easiest way to manage them is to use a schema registry. <\/p>\n<p>If you use <a href=\"https:\/\/www.ibm.com\/cloud\/event-streams\">IBM Event Streams<\/a> as your Kafka solution, one of the many things it adds to <a href=\"https:\/\/kafka.apache.org\">Apache Kafka<\/a> to make it easier to use is a <a href=\"https:\/\/ibm.github.io\/event-streams\/schemas\/overview\/\">Schema Registry<\/a> that not only <a href=\"https:\/\/ibm.github.io\/event-streams\/schemas\/creating\/\">stores schemas for you<\/a>, but also lets you manage different versions of schemas, <a href=\"https:\/\/ibm.github.io\/event-streams\/schemas\/overview\/#versions-and-compatibility\">enforce compatibility<\/a> of schema versions, <a href=\"https:\/\/ibm.github.io\/event-streams\/schemas\/setting-java-apps\/\">integrate with your client apps<\/a> so they automatically fetch the schema definitions on demand, and much more. <\/p>\n<p><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190721-schemaregistry.png\"\/><\/p>\n<p>If you upload your schemas into the Event Streams schema registry, it will generate all the Java classes you need, with helpful getters and setters &#8211; so you can start using them as I did in the examples above. 
You can download JAR files for the schemas you want through the UI, or use the integrated Maven server to download them automatically on demand.<\/p>\n<p>If you&#8217;re not using Event Streams, you can generate the Java classes you need from Avro schema files yourself <a href=\"https:\/\/avro.apache.org\/docs\/1.8.1\/gettingstartedjava.html\">as described in the Avro documentation<\/a>.<\/p>\n<p>Putting your schemas in a Schema Registry that&#8217;s integrated with the rest of your Kafka management is very helpful. Instead of thinking of the records on your Kafka topics as raw data, it means you can start to treat them as semantic data.  <\/p>\n<p><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190721-eventstreams.png\"\/><\/p>\n<p>Whatever Kafka distribution you choose, the idea of a schema registry is worth considering. The schemas it stores form a contract between your different developers, letting your producers promise what they will send to topics, and letting your consumers know what to expect.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Any time there is more than one developer using a Kafka topic, they will need a way to agree on the shape of the data that will go into messages. The most common way to document the schema of messages in Kafka is to use the Apache Avro serialization system. 
This post is a beginner&#8217;s [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3782,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[595,593,594,583,584],"class_list":["post-3781","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-code","tag-apacheavro","tag-apachekafka","tag-avro","tag-ibmeventstreams","tag-kafka"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3781","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3781"}],"version-history":[{"count":0,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3781\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/media\/3782"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3781"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3781"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3781"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}