{"id":3832,"date":"2019-08-06T01:02:34","date_gmt":"2019-08-06T01:02:34","guid":{"rendered":"https:\/\/dalelane.co.uk\/blog\/?p=3832"},"modified":"2019-08-06T01:02:34","modified_gmt":"2019-08-06T01:02:34","slug":"sql-queries-on-kafka-topics-using-apache-hive","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=3832","title":{"rendered":"SQL queries on Kafka topics using Apache Hive"},"content":{"rendered":"<p><a href=\"https:\/\/hive.apache.org\/\">Apache Hive<\/a> is open source data warehouse software built on top of Hadoop. It gives you an SQL-like interface to a wide variety of databases, filesystems, and other systems.<\/p>\n<p>One of the Hive storage handlers is a <a href=\"https:\/\/github.com\/apache\/hive\/tree\/master\/kafka-handler\">Kafka storage handler<\/a>, which lets you create a Hive &#8220;external table&#8221; based on a Kafka topic.<\/p>\n<p>And once you&#8217;ve created a Hive table based on a Kafka topic, you can run SQL queries based on attributes of the messages on that topic.<\/p>\n<p><strong>I was having a play with Hive this evening, as a way of running SQL queries against messages on my Kafka topics. 
In this post, I&#8217;ll share a few queries that I tried.<\/strong><\/p>\n<p><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/wp-content\/uploads\/2019\/08\/190805-kafka-hive.png\" width=\"450\"\/><br \/>\n<!--more--><\/p>\n<h3>Setting up Hive<\/h3>\n<p>If you want to run Hive properly, follow the <a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/GettingStarted\">Getting Started instructions in the Hive documentation<\/a>.<\/p>\n<p>But if you just want to give it a quick try:<\/p>\n<h4>Get Hadoop<\/h4>\n<ol>\n<li>Download a <a href=\"https:\/\/hadoop.apache.org\/releases.html\">release of Hadoop<\/a><\/li>\n<li>Unzip<\/li>\n<li><code>export HADOOP_HOME=\/location\/you\/unzipped\/hadoop<\/code><\/li>\n<\/ol>\n<h4>Get Hive<\/h4>\n<ol>\n<li>Download a <a href=\"https:\/\/hive.apache.org\/downloads.html\">release of Hive<\/a><\/li>\n<li>Unzip<\/li>\n<li><code>export HIVE_HOME=\/location\/you\/unzipped\/hive<\/code><\/li>\n<li>Download the <a href=\"https:\/\/mvnrepository.com\/artifact\/org.apache.hive\/kafka-handler\">kafka-handler jar<\/a> and save it in <code>$HIVE_HOME\/lib\/<\/code><\/li>\n<\/ol>\n<h4>Set up Hive<\/h4>\n<p>Prepare folders for use by Hadoop.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; white-space: pre-wrap; word-wrap: normal;\">$ $HADOOP_HOME\/bin\/hadoop fs -mkdir       \/tmp\n$ $HADOOP_HOME\/bin\/hadoop fs -mkdir       \/user\/hive\/warehouse\n$ $HADOOP_HOME\/bin\/hadoop fs -chmod g+w   \/tmp\n$ $HADOOP_HOME\/bin\/hadoop fs -chmod g+w   \/user\/hive\/warehouse\n<\/pre>\n<p>Initialise the metastore schema.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; white-space: pre; word-wrap: normal;\">$HIVE_HOME\/bin\/schematool -dbType derby -initSchema\n<\/pre>\n<h4>Run Hive<\/h4>\n<p>For quick development\/testing purposes, running the Hive server and the CLI all in a single process is the simplest:<\/p>\n<pre style=\"border: thin 
solid silver; background-color: #eeeeee; padding: 0.8em; color: black; white-space: pre; word-wrap: normal;\">$HIVE_HOME\/bin\/beeline -u jdbc:hive2:\/\/\n<\/pre>\n<h3>Prepare a Kafka topic<\/h3>\n<p>I&#8217;m using <a href=\"https:\/\/www.ibm.com\/cloud\/event-streams\">IBM Event Streams<\/a> because I work on it, but you could use plain Apache Kafka.<\/p>\n<p>First, I created a topic <code>DALE.TOPIC<\/code>, and I produced five quick test messages to it.<\/p>\n<p><a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-test-topic.png\" border=\"0\"><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-test-topic-small.png\" style=\"border: thin black solid\"\/><\/a><br \/>\n<a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-test-topic.png\" border=\"0\"><em><small>Click to enlarge<\/small><\/em> <\/a><\/p>\n<p>Each message is a JSON string, with a key called <code>message<\/code> and a different value.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; white-space: pre; word-wrap: normal;\">{\"message\":\"Hello World\"}\n{\"message\":\"This is a test\"}\n{\"message\":\"These are messages\"}\n{\"message\":\"Lorem Ipsum\"}\n{\"message\":\"Lorem ipsum dolor sit amet\"}\n<\/pre>\n<p>Next, I downloaded a Java truststore, and created an API key, to let Hive make a secure connection to my Kafka cluster.<\/p>\n<p><a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-connection-auth.png\" border=\"0\"><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-connection-auth-small.png\" style=\"border: thin black solid\"\/><\/a><br \/>\n<a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-connection-auth.png\" border=\"0\"><em><small>Click to enlarge<\/small><\/em> <\/a><\/p>\n<p>Then, I got the set of connection properties needed to connect an application to my 
topic.<\/p>\n<p><a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-connection-properties.png\" border=\"0\"><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-connection-properties-small.png\" style=\"border: thin black solid\"\/><\/a><br \/>\n<a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-connection-properties.png\" border=\"0\"><em><small>Click to enlarge<\/small><\/em> <\/a><\/p>\n<p>That was enough info to create the external table in Hive, using Hive&#8217;s <code>beeline<\/code> CLI:<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">Connected to: Apache Hive (version 3.1.1)\nDriver: Hive JDBC (version 3.1.1)\nTransaction isolation: TRANSACTION_REPEATABLE_READ\nBeeline version 3.1.1 by Apache Hive\n0: jdbc:hive2:\/\/&gt; <\/font><strong>CREATE EXTERNAL TABLE test_kafka<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>(`message` string)<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>TBLPROPERTIES<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>(\"kafka.topic\" = \"DALE.TOPIC\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.bootstrap.servers\"=\"9.30.245.82:32342\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.sasl.jaas.config\"=\"org.apache.kafka.common.security.plain.PlainLoginModule required username='token' password='9UfXe0KkabBqjXiU9NdGnvtEPVrqJ_tvLo-ZqnQy8Gww';\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.sasl.mechanism\"=\"PLAIN\",<\/strong>\n<font color=\"#777777\">. . . . . . . . 
&gt;<\/font> <strong> \"kafka.consumer.security.protocol\"=\"SASL_SSL\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.ssl.protocol\"=\"TLSv1.2\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.ssl.truststore.location\"=\"\/root\/es-cert.jks\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.ssl.truststore.password\"=\"password\"<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> );<\/strong>\n<font color=\"#777777\">OK\nNo rows affected<\/font>\n\n<\/pre>\n<p>Notice that the first two table properties I&#8217;ve used are <code>kafka.topic<\/code> with my topic name, and <code>kafka.bootstrap.servers<\/code> with the bootstrap address for my cluster.<\/p>\n<p>The rest are the properties I copied from the Event Streams UI, prefixed with <code>kafka.consumer.<\/code> because I&#8217;m using <code>SELECT<\/code> queries to read from my topic.<\/p>\n<h3>Simple SQL queries<\/h3>\n<p>I can use a <code>SELECT<\/code> query to fetch my five messages from the topic.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">0: jdbc:hive2:\/\/&gt; <\/font><strong>SELECT * from test_kafka;<\/strong>\n<font color=\"#777777\">OK<\/font>\n<strong>+-----------------------------+-------------------+-------------------------+----------------------+-------------------------+\n|     test_kafka.message      | test_kafka.__key  | test_kafka.__partition  | test_kafka.__offset  | test_kafka.__timestamp  |\n+-----------------------------+-------------------+-------------------------+----------------------+-------------------------+\n| Hello World                 |                   | 0                       | 0                    | 1565042407338           |\n| This is a test              |                   | 1                       | 0                  
  | 1565042419083           |\n| These are messages          |                   | 2                       | 0                    | 1565042429234           |\n| Lorem Ipsum                 |                   | 0                       | 1                    | 1565042458894           |\n| Lorem ipsum dolor sit amet  |                   | 1                       | 1                    | 1565042612579           |\n+-----------------------------+-------------------+-------------------------+----------------------+-------------------------+<\/strong>\n<font color=\"#777777\">5 rows selected<\/font>\n\n<\/pre>\n<p>Hive gives me the field I gave it for the table (the &#8220;message&#8221; string attribute that I put in every Kafka message), and metadata fields for the key, partition, offset and timestamp.<\/p>\n<p>I can use SQL <code>WHERE<\/code> clauses to filter based on this metadata.<\/p>\n<p>For example, to get messages between two timestamps:<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">0: jdbc:hive2:\/\/&gt;<\/font> <strong>SELECT * FROM test_kafka <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>WHERE <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  `__timestamp` &gt; 1565042420000 <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>AND <\/strong>\n<font color=\"#777777\">. . . . . . . . 
&gt;<\/font> <strong>  `__timestamp` &lt; 1565042520000;<\/strong>\n<font color=\"#777777\">OK<\/font>\n<strong>+---------------------+-------------------+-------------------------+----------------------+-------------------------+\n| test_kafka.message  | test_kafka.__key  | test_kafka.__partition  | test_kafka.__offset  | test_kafka.__timestamp  |\n+---------------------+-------------------+-------------------------+----------------------+-------------------------+\n| These are messages  |                   | 2                       | 0                    | 1565042429234           |\n| Lorem Ipsum         |                   | 0                       | 1                    | 1565042458894           |\n+---------------------+-------------------+-------------------------+----------------------+-------------------------+<\/strong>\n<font color=\"#777777\">2 rows selected<\/font>\n\n<\/pre>\n<h3>A more interesting Kafka topic<\/h3>\n<p>Those queries worked because I&#8217;d used the same field <code>message<\/code> in every Kafka message.<\/p>\n<p>I&#8217;ve written before that a better way of <a href=\"https:\/\/dalelane.co.uk\/blog\/?p=3781\">ensuring consistency in Kafka messages is to use a schema<\/a>.<\/p>\n<p>And Avro schemas work well with Hive queries.<\/p>\n<p>I created a schema to define messages about Mario games, and uploaded it to the Event Streams Schema Registry.<\/p>\n<p><a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-schema.png\" border=\"0\"><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-schema-small.png\" style=\"border: thin black solid\"\/><\/a><br \/>\n<a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-schema.png\" border=\"0\"><em><small>Click to enlarge<\/small><\/em> <\/a><\/p>\n<p>Each Kafka message on my topic will be about a Mario game, and have attributes with the name of the game, the year it was released, and the number of copies 
sold.<\/p>\n<p>I created a new <code>MARIO.TOPIC<\/code> topic to use the schema with.<\/p>\n<p>Then I produced messages to represent <a href=\"https:\/\/nintendo.fandom.com\/wiki\/List_of_best-selling_Mario_games\">stats about a bunch of Mario games<\/a>.<\/p>\n<p><a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-topic.png\" border=\"0\"><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-topic-small.png\" style=\"border: thin black solid\"\/><\/a><br \/>\n<a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-topic.png\" border=\"0\"><em><small>Click to enlarge<\/small><\/em> <\/a><\/p>\n<p>The Event Streams UI integrates with the Schema Registry, so if I click on a couple of the messages, you can see the parsed message fields.<\/p>\n<p><a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-kart.png\" border=\"0\"><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-kart-small.png\" style=\"border: thin black solid\"\/><\/a><br \/>\n<a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-kart.png\" border=\"0\"><em><small>Click to enlarge<\/small><\/em> <\/a><\/p>\n<p><a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-odyssey.png\" border=\"0\"><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-odyssey-small.png\" style=\"border: thin black solid\"\/><\/a><br \/>\n<a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-mario-odyssey.png\" border=\"0\"><em><small>Click to enlarge<\/small><\/em> <\/a><\/p>\n<p>Now I had some test data suitable for trying out more interesting queries with.<\/p>\n<h3>Setting up Hive to use Avro schemas<\/h3>\n<p>To use the schema, I needed to download a serdes jar that can fetch schemas on-demand from the Event Streams Schema Registry. 
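<\/p>\n<p>The schema itself is simple &#8211; something along these lines, with a <code>title<\/code> string, a <code>releaseYear<\/code> int and a <code>salesInMillions<\/code> float (the record name and namespace here are just illustrative):<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; white-space: pre; word-wrap: normal;\">{\n  \"namespace\": \"mario.games\",\n  \"type\": \"record\",\n  \"name\": \"MarioGame\",\n  \"fields\": [\n    { \"name\": \"title\", \"type\": \"string\" },\n    { \"name\": \"releaseYear\", \"type\": \"int\" },\n    { \"name\": \"salesInMillions\", \"type\": \"float\" }\n  ]\n}\n<\/pre>\n<p>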
I used the Event Streams UI to download the jars that I need.<\/p>\n<p><a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-serdes.png\" border=\"0\"><img decoding=\"async\" src=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-serdes-small.png\" style=\"border: thin black solid\"\/><\/a><br \/>\n<a href=\"\/\/dalelane.co.uk\/blog\/post-images\/190805-hive\/190805-hive-serdes.png\" border=\"0\"><em><small>Click to enlarge<\/small><\/em> <\/a><\/p>\n<p>I copied the Event Streams serdes jars to <code>$HIVE_HOME\/auxlib<\/code>. (I needed to create that folder first, and then restart my Hive process).<\/p>\n<p>And I needed to add a <code>kafka.serdes.class<\/code> table property when creating the external table in Hive.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">Connected to: Apache Hive (version 3.1.1)\nDriver: Hive JDBC (version 3.1.1)\nTransaction isolation: TRANSACTION_REPEATABLE_READ\nBeeline version 3.1.1 by Apache Hive\n0: jdbc:hive2:\/\/&gt; <\/font><strong>CREATE EXTERNAL TABLE mario_kafka<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>(`title` string, `releaseYear` int, `salesInMillions` float)<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>TBLPROPERTIES<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>(\"kafka.topic\" = \"MARIO.TOPIC\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.bootstrap.servers\"=\"9.30.245.82:32342\",<\/strong>\n<font color=\"#777777\">. . . . . . . . 
&gt;<\/font> <strong> \"kafka.consumer.sasl.jaas.config\"=\"org.apache.kafka.common.security.plain.PlainLoginModule required username='token' password='9UfXe0KkabBqjXiU9NdGnvtEPVrqJ_tvLo-ZqnQy8Gww';\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.sasl.mechanism\"=\"PLAIN\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.security.protocol\"=\"SASL_SSL\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.ssl.protocol\"=\"TLSv1.2\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.ssl.truststore.location\"=\"\/root\/es-cert.jks\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.consumer.ssl.truststore.password\"=\"password\",<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> \"kafka.serdes.class\"=\"com.ibm.eventstreams.serdes.EventStreamsSerdes\"<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong> );<\/strong>\n<font color=\"#777777\">OK\nNo rows affected<\/font>\n\n<\/pre>\n<h3>SQL queries with filters<\/h3>\n<p>That means I can do queries like getting all Mario games from the last five years.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">0: jdbc:hive2:\/\/&gt;<\/font> <strong>SELECT * FROM mario_kafka <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>WHERE <\/strong>\n<font color=\"#777777\">. . . . . . . . 
&gt;<\/font> <strong> `releaseYear` &gt; 2013;<\/strong>\n<font color=\"#777777\">OK<\/font>\n<strong>+-------------------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+\n|          mario_kafka.title          | mario_kafka.releaseyear  | mario_kafka.salesinmillions  | mario_kafka.__key  | mario_kafka.__partition  | mario_kafka.__offset  | mario_kafka.__timestamp  |\n+-------------------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+\n| Super Mario Odyssey                 | 2017                     | 14.44                        | NULL               | 0                        | 3                     | 1565041149793            |\n| Mario Kart 8                        | 2014                     | 8.44                         | NULL               | 0                        | 7                     | 1565041149795            |\n| Super Mario Maker                   | 2015                     | 4.0                          | NULL               | 0                        | 13                    | 1565041149798            |\n| Mario Party 10                      | 2015                     | 2.21                         | NULL               | 0                        | 19                    | 1565041149800            |\n| Super Mario Party                   | 2018                     | 6.4                          | NULL               | 2                        | 8                     | 1565041149796            |\n| New Super Mario Bros. 
U Deluxe      | 2019                     | 3.31                         | NULL               | 2                        | 13                    | 1565041149798            |\n| Mario Tennis Aces                   | 2018                     | 2.64                         | NULL               | 2                        | 15                    | 1565041149799            |\n| Mario Kart 8 Deluxe (NS)            | 2017                     | 16.69                        | NULL               | 1                        | 3                     | 1565041149793            |\n| Super Mario Maker for Nintendo 3DS  | 2016                     | 2.01                         | NULL               | 1                        | 21                    | 1565041149801            |\n+-------------------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+<\/strong>\n<font color=\"#777777\">9 rows selected<\/font>\n\n<\/pre>\n<p>Or all Mario games that have sold more than 20 million copies.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">0: jdbc:hive2:\/\/&gt;<\/font> <strong>SELECT * FROM mario_kafka<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>WHERE<\/strong>\n<font color=\"#777777\">. . . . . . . . 
&gt;<\/font> <strong> `salesInMillions` &gt; 20;<\/strong>\n<font color=\"#777777\">OK<\/font>\n<strong>+----------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+\n|     mario_kafka.title      | mario_kafka.releaseyear  | mario_kafka.salesinmillions  | mario_kafka.__key  | mario_kafka.__partition  | mario_kafka.__offset  | mario_kafka.__timestamp  |\n+----------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+\n| Mario Kart Wii             | 2008                     | 37.2                         | NULL               | 0                        | 0                     | 1565041149791            |\n| Mario Kart DS              | 2005                     | 23.6                         | NULL               | 0                        | 1                     | 1565041149792            |\n| New Super Mario Bros.      | 2006                     | 30.8                         | NULL               | 2                        | 0                     | 1565041149791            |\n| Super Mario World          | 1990                     | 20.61                        | NULL               | 2                        | 1                     | 1565041149792            |\n| Super Mario Bros           | 1985                     | 40.24                        | NULL               | 1                        | 0                     | 1565041149783            |\n| New Super Mario Bros. 
Wii  | 2009                     | 30.26                        | NULL               | 1                        | 1                     | 1565041149791            |\n+----------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+<\/strong>\n<font color=\"#777777\">6 rows selected<\/font>\n\n<\/pre>\n<h3>SQL queries with aggregates<\/h3>\n<p>Or get the total number of games sold for Mario games grouped by the year of release.<\/p>\n<p>Notice how Hive is handling this by setting up a map-reduce job to handle the <code>SUM<\/code>.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">0: jdbc:hive2:\/\/&gt;<\/font> <strong>SELECT <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  `releaseYear`, <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  SUM(`salesInMillions`) AS totalSales <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>FROM mario_kafka <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  GROUP BY `releaseYear`;<\/strong>\n<font color=\"#777777\">Query ID = root_20190805145028_58130aad-b433-4dbe-aed5-8b7fbfea57bb\nTotal jobs = 1\nLaunching Job 1 out of 1\nNumber of reduce tasks not specified. 
Estimated from input data size: 1\nIn order to change the average load for a reducer (in bytes):\n  set hive.exec.reducers.bytes.per.reducer=&lt;number&gt;\nIn order to limit the maximum number of reducers:\n  set hive.exec.reducers.max=&lt;number&gt;\nIn order to set a constant number of reducers:\n  set mapreduce.job.reduces=&lt;number&gt;\nJob running in-process (local Hadoop)\n2019-08-05 14:50:31,181 Stage-1 map = 100%,  reduce = 0%\n2019-08-05 14:50:33,194 Stage-1 map = 100%,  reduce = 100%\nEnded Job = job_local1519121693_0002\nMapReduce Jobs Launched:\nStage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS\nTotal MapReduce CPU Time Spent: 0 msec\nOK<\/font>\n<strong>+--------------+---------------------+\n| releaseyear  |     totalsales      |\n+--------------+---------------------+\n| 1983         | 2.2799999713897705  |\n| 1985         | 40.2400016784668    |\n| 1986         | 2.6500000953674316  |\n| 1988         | 24.74000072479248   |\n| 1989         | 18.139999389648438  |\n| 1990         | 30.80000066757202   |\n| 1992         | 22.250000476837158  |\n| 1993         | 10.550000190734863  |\n| 1994         | 5.190000057220459   |\n| 1995         | 4.119999885559082   |\n| 1996         | 23.919999837875366  |\n| 1998         | 2.700000047683716   |\n| 1999         | 9.020000219345093   |\n| 2000         | 6.779999852180481   |\n| 2001         | 17.170000076293945  |\n| 2002         | 11.320000171661377  |\n| 2003         | 17.759999990463257  |\n| 2004         | 18.59000015258789   |\n| 2005         | 28.050000429153442  |\n| 2006         | 33.53999924659729   |\n| 2007         | 35.00000071525574   |\n| 2008         | 38.46000075340271   |\n| 2009         | 34.390000343322754  |\n| 2010         | 9.18999981880188    |\n| 2011         | 34.540000200271606  |\n| 2012         | 23.869999885559082  |\n| 2013         | 13.190000057220459  |\n| 2014         | 8.4399995803833     |\n| 2015         | 6.210000038146973   |\n| 2016         | 2.009999990463257   |\n| 
2017         | 31.130000114440918  |\n| 2018         | 9.040000200271606   |\n| 2019         | 3.309999942779541   |\n+--------------+---------------------+<\/strong>\n<font color=\"#777777\">33 rows selected<\/font>\n\n<\/pre>\n<p>If I count how many Mario games were released in each year, and sort that by the number of releases, this causes Hive to need a second map-reduce job.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">0: jdbc:hive2:\/\/&gt;<\/font> <strong>SELECT <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  `releaseYear`, <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  COUNT(*) AS numReleases <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>FROM mario_kafka <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>GROUP BY `releaseYear` <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>ORDER BY numReleases DESC;<\/strong>\n<font color=\"#777777\">Query ID = root_20190805145411_dcf88f12-49ec-4a61-aea3-299fb9d964f6\nTotal jobs = 2\nLaunching Job 1 out of 2\nNumber of reduce tasks not specified. 
Estimated from input data size: 1\nIn order to change the average load for a reducer (in bytes):\n  set hive.exec.reducers.bytes.per.reducer=&lt;number&gt;\nIn order to limit the maximum number of reducers:\n  set hive.exec.reducers.max=&lt;number&gt;\nIn order to set a constant number of reducers:\n  set mapreduce.job.reduces=&lt;number&gt;\nJob running in-process (local Hadoop)\n2019-08-05 14:54:14,191 Stage-1 map = 100%,  reduce = 0%\n2019-08-05 14:54:16,197 Stage-1 map = 100%,  reduce = 100%\nEnded Job = job_local1092517259_0006\nLaunching Job 2 out of 2\nNumber of reduce tasks determined at compile time: 1\nIn order to change the average load for a reducer (in bytes):\n  set hive.exec.reducers.bytes.per.reducer=&lt;number&gt;\nIn order to limit the maximum number of reducers:\n  set hive.exec.reducers.max=&lt;number&gt;\nIn order to set a constant number of reducers:\n  set mapreduce.job.reduces=&lt;number&gt;\nJob running in-process (local Hadoop)\n2019-08-05 14:54:17,511 Stage-2 map = 100%,  reduce = 100%\nEnded Job = job_local116294658_0007\nMapReduce Jobs Launched:\nStage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS\nStage-Stage-2:  HDFS Read: 0 HDFS Write: 0 SUCCESS\nTotal MapReduce CPU Time Spent: 0 msec\nOK<\/font>\n<strong>+--------------+--------------+\n| releaseyear  | numreleases  |\n+--------------+--------------+\n| 2004         | 5            |\n| 2003         | 5            |\n| 2013         | 5            |\n| 2007         | 5            |\n| 2000         | 4            |\n| 2012         | 4            |\n| 2011         | 4            |\n| 2005         | 4            |\n| 1999         | 3            |\n| 2006         | 3            |\n| 1992         | 3            |\n| 1990         | 3            |\n| 1996         | 3            |\n| 2002         | 3            |\n| 2001         | 3            |\n| 2008         | 2            |\n| 2015         | 2            |\n| 2017         | 2            |\n| 1988         | 2            |\n| 2010         | 
2            |\n| 2009         | 2            |\n| 2018         | 2            |\n| 2019         | 1            |\n| 2016         | 1            |\n| 2014         | 1            |\n| 1998         | 1            |\n| 1995         | 1            |\n| 1994         | 1            |\n| 1993         | 1            |\n| 1989         | 1            |\n| 1986         | 1            |\n| 1985         | 1            |\n| 1983         | 1            |\n+--------------+--------------+<\/strong>\n<font color=\"#777777\">33 rows selected<\/font>\n\n<\/pre>\n<h3>Creating views<\/h3>\n<p>I can also create a view to make it easier to refer to subsets of the messages on my Kafka topic.<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.8em; color: black; overflow-x: scroll\"><font color=\"#777777\">0: jdbc:hive2:\/\/&gt;<\/font> <strong>SELECT * from mario_kafka <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>WHERE `title` LIKE '%Party%';<\/strong>\n<font color=\"#777777\">OK<\/font>\n<strong>+---------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+\n|     mario_kafka.title     | mario_kafka.releaseyear  | mario_kafka.salesinmillions  | mario_kafka.__key  | mario_kafka.__partition  | mario_kafka.__offset  | mario_kafka.__timestamp  |\n+---------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+\n| Mario Party DS            | 2007                     | 9.31                         | NULL               | 0                        | 6                     | 1565041149795            |\n| Mario Party 10            | 2015                     | 2.21                         | NULL               | 0                        | 19                    | 1565041149800            |\n| Mario 
Party 7             | 2005                     | 1.86                         | NULL               | 0                        | 21                    | 1565041149801            |\n| Mario Party 8             | 2007                     | 8.85                         | NULL               | 2                        | 6                     | 1565041149795            |\n| Super Mario Party         | 2018                     | 6.4                          | NULL               | 2                        | 8                     | 1565041149796            |\n| Mario Party 9             | 2012                     | 2.73                         | NULL               | 2                        | 14                    | 1565041149798            |\n| Mario Party 4             | 2002                     | 2.21                         | NULL               | 2                        | 17                    | 1565041149800            |\n| Mario Party 5             | 2003                     | 2.0                          | NULL               | 2                        | 19                    | 1565041149800            |\n| Mario Party 3             | 2000                     | 1.91                         | NULL               | 2                        | 21                    | 1565041149801            |\n| Mario Party 6             | 2004                     | 1.65                         | NULL               | 2                        | 22                    | 1565041149801            |\n| Mario Party: Island Tour  | 2013                     | 1.14                         | NULL               | 2                        | 26                    | 1565041149803            |\n| Mario Party               | 1998                     | 2.7                          | NULL               | 1                        | 15                    | 1565041149799            |\n| Mario Party 2             | 1999                     | 2.48                         | NULL               | 1                  
      | 16                    | 1565041149799            |\n+---------------------------+--------------------------+------------------------------+--------------------+--------------------------+-----------------------+--------------------------+<\/strong>\n<font color=\"#777777\">13 rows selected\n0: jdbc:hive2:\/\/&gt;\n0: jdbc:hive2:\/\/&gt;<\/font>\n<font color=\"#777777\">0: jdbc:hive2:\/\/&gt;<\/font> <strong>CREATE VIEW mario_party_games<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>AS <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  SELECT `title`, `releaseYear`, `salesInMillions`<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  AS `added` FROM mario_kafka<\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>  WHERE `title` LIKE '%Party%';<\/strong>\n<font color=\"#777777\">OK\nNo rows affected\n0: jdbc:hive2:\/\/&gt;<\/font>\n<font color=\"#777777\">0: jdbc:hive2:\/\/&gt;<\/font> <strong> SELECT * FROM mario_party_games;<\/strong>\n<font color=\"#777777\">OK<\/font>\n<strong>+---------------------------+--------------------------------+--------------------------+\n|  mario_party_games.title  | mario_party_games.releaseyear  | mario_party_games.added  |\n+---------------------------+--------------------------------+--------------------------+\n| Mario Party DS            | 2007                           | 9.31                     |\n| Mario Party 10            | 2015                           | 2.21                     |\n| Mario Party 7             | 2005                           | 1.86                     |\n| Mario Party 8             | 2007                           | 8.85                     |\n| Super Mario Party         | 2018                           | 6.4                      |\n| Mario Party 9             | 2012                           | 2.73                     |\n| Mario Party 4             | 2002                           | 2.21                     
|\n| Mario Party 5             | 2003                           | 2.0                      |\n| Mario Party 3             | 2000                           | 1.91                     |\n| Mario Party 6             | 2004                           | 1.65                     |\n| Mario Party: Island Tour  | 2013                           | 1.14                     |\n| Mario Party               | 1998                           | 2.7                      |\n| Mario Party 2             | 1999                           | 2.48                     |\n+---------------------------+--------------------------------+--------------------------+<\/strong>\n<font color=\"#777777\">13 rows selected\n0: jdbc:hive2:\/\/&gt;\n0: jdbc:hive2:\/\/&gt;\n0: jdbc:hive2:\/\/&gt;\n0: jdbc:hive2:\/\/&gt; <\/font><strong>SELECT * FROM mario_party_games <\/strong>\n<font color=\"#777777\">. . . . . . . . &gt;<\/font> <strong>ORDER BY `releaseYear`;<\/strong>\n<font color=\"#777777\">Query ID = root_20190805152257_eb7a4e36-36b4-4e1d-86c2-9bbf5999115b\nTotal jobs = 1\nLaunching Job 1 out of 1\nNumber of reduce tasks determined at compile time: 1\nIn order to change the average load for a reducer (in bytes):\n  set hive.exec.reducers.bytes.per.reducer=&lt;number&gt;\nIn order to limit the maximum number of reducers:\n  set hive.exec.reducers.max=&lt;number&gt;\nIn order to set a constant number of reducers:\n  set mapreduce.job.reduces=&lt;number&gt;\nJob running in-process (local Hadoop)\n2019-08-05 15:23:00,367 Stage-1 map = 100%,  reduce = 0%\n2019-08-05 15:23:02,373 Stage-1 map = 100%,  reduce = 100%\nEnded Job = job_local1254056661_0008\nMapReduce Jobs Launched:\nStage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS\nTotal MapReduce CPU Time Spent: 0 msec\nOK<\/font>\n<strong>+---------------------------+--------------------------------+--------------------------+\n|  mario_party_games.title  | mario_party_games.releaseyear  | mario_party_games.added  
|\n+---------------------------+--------------------------------+--------------------------+\n| Mario Party               | 1998                           | 2.7                      |\n| Mario Party 2             | 1999                           | 2.48                     |\n| Mario Party 3             | 2000                           | 1.91                     |\n| Mario Party 4             | 2002                           | 2.21                     |\n| Mario Party 5             | 2003                           | 2.0                      |\n| Mario Party 6             | 2004                           | 1.65                     |\n| Mario Party 7             | 2005                           | 1.86                     |\n| Mario Party DS            | 2007                           | 9.31                     |\n| Mario Party 8             | 2007                           | 8.85                     |\n| Mario Party 9             | 2012                           | 2.73                     |\n| Mario Party: Island Tour  | 2013                           | 1.14                     |\n| Mario Party 10            | 2015                           | 2.21                     |\n| Super Mario Party         | 2018                           | 6.4                      |\n+---------------------------+--------------------------------+--------------------------+<\/strong>\n<font color=\"#777777\">13 rows selected\n0: jdbc:hive2:\/\/&gt;<\/font>\n\n<\/pre>\n<h3>Next steps<\/h3>\n<p>There are lots more things that Hive can do with Kafka. For example, I&#8217;ve only been using it as a consumer and haven&#8217;t even started getting Hive queries to produce to my topics.<\/p>\n<p>But this was a good first intro to the kinds of thing that Hive can enable.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Hive is open source data warehouse software built on top of Hadoop. It gives you an SQL-like interface to a wide variety of databases, filesystems, and other systems. 
One of the Hive storage handlers is a Kafka storage handler, which lets you create a Hive &#8220;external table&#8221; based on a Kafka topic. And once [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3849,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[597,593,582,596,583,584],"class_list":["post-3832","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-code","tag-apachehive","tag-apachekafka","tag-eventstreams","tag-hive","tag-ibmeventstreams","tag-kafka"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3832","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3832"}],"version-history":[{"count":0,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3832\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/media\/3849"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3832"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3832"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3832"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}