Kevin Boone

Using Apache Avro for passing Java objects through a message broker

Introduction

Passing structured data, such as instances of Java or C++ classes, across network infrastructure has long been a significant problem in information technology. We have programming languages that allow us to represent highly complex structured data but, in the end, we communicate this data by flattening it into a byte stream, and then converting it back again. Doing this in an efficient, secure way has long been problematic, and adding the requirement for cross-platform and cross-language compatibility makes for a very difficult software engineering exercise.

Probably the best-known general-purpose format for representing structured data is XML. XML is human-readable -- up to a point -- and portable, and can be validated for correctness against a schema. However, it's very inefficient in its use of storage and network bandwidth. JSON is also widely used; like XML, it's human-readable, but not hugely efficient. At the opposite end of the portability and security spectrum is Java's built-in serialization mechanism. It's easy to use, and is built right into the JVM (for the time being, at least), but it has raised many security concerns over the years. Java serialization has no means of validation, and it makes it very difficult to handle situations where Java applications change but still need to exchange the same data.

JMS is a Java API for passing messages through some sort of message broker or router. It has built-in support for Java primitive data types -- String, int, etc. -- but no support for structured data. To pass Java objects using JMS, the objects must be flattened to a byte array. So using JMS does nothing to help with passing structured data, even in an all-Java environment, unless you want to rely on the JVM's built-in serialization; and nobody does.

Apache Avro is a set of tools for serializing structured objects into a flattened representation, suitable for storing on disk or passing over a network. There are implementations for several different programming languages, although Java is probably the most widely-used. Avro is schema-based, and offers a well-defined way to pass common data between different applications that might themselves change. Avro is very compact, compared to XML, but not at all readable -- Avro data can really only be interpreted by Avro-compatible applications.

Avro offers an effective method for passing structured Java objects through a message broker using JMS. The basic principle, as I'll explain, is to use Avro to serialize Java objects into arrays of bytes that can be stored in a JMS BytesMessage object. Avro is also widely used with Apache Kafka, but that's for another day.

There are many published examples of using Avro to serialize data to files using its built-in I/O handlers. However, there's very little documentation about using it with messaging systems other than Kafka. Kafka's built-in support for Avro hides much of the complexity, and that doesn't make Avro itself any easier to understand.

In this article, I'll explain how to use Avro in what seems to me to be the simplest possible way, to pass objects of a single Java class from a sender to a receiver via a message broker. I'll explain what's going on in the Avro API, how the schema is handled, and what data ends up being passed over the network. Full source code of the example is available in my GitHub repository.

To run the example you'll need an AMQP (1.x)-compatible message broker, like Apache Artemis, or the Red Hat product based on it, AMQ 7. The example uses the Qpid JMS library to provide AMQP support. It's not very difficult to modify the code to use a different messaging protocol than AMQP, but I've chosen AMQP 1.x because it's now so widespread.

I'm assuming that the reader is reasonably familiar with Java middleware development, and particularly with the use of Maven. As in all my articles, I'll be using nothing more complex than command-line tools and a text editor to create the application.

About the example

For the purposes of demonstration, my example uses a Java class that models the properties of cartoon bears (because why not?). The Bear class itself is generated from an Avro schema, which I'll describe later. Avro schemas can be very complex, but mine is about as simple as I can make it. The sender application creates various instances of class Bear, serializes them using Avro, and posts them to a named queue in the message broker.

The receiver application waits for messages on the same broker queue and, as each one arrives, deserializes it back into an instance of class Bear.

Both the sender and receiver have to work with the same schema (or, at least, compatible schemas) and, for simplicity, mine is stored in a simple JSON file. In practice, it's increasingly common to use some sort of registry to hold these schemas, so that distributed applications always have access to matching schemas. Note that, unlike XML or JSON, Avro won't work at all without a schema -- not only is this important from a security point of view, it allows for a very compact data representation, as I'll demonstrate later.

Basic principles

There are many ways to use Avro. In this article I'll be using two particular Avro classes: SpecificDatumWriter and SpecificDatumReader. These serialize and deserialize individual instances of a particular Java class. From a JMS perspective, I'll be using instances of javax.jms.BytesMessage to hold the flattened byte arrays. BytesMessage is a generic message type that is not interpreted or understood by the JMS system itself; it's just a carrier for application-specific data.

My application uses Maven to build the executables, as most Java applications now do. There is a specific Maven plug-in that generates Java classes from the Avro schema, as I'll explain later. However, I do need to stress that this code generation is not a necessary step -- Avro requires that the Java classes match the schema, but it doesn't require that they were generated specifically from the schema.

The schema

My application works with objects of class Bear, which is structured essentially like this:

public class Bear 
  {
  CharSequence name;
  CharSequence location;
  //...
  public CharSequence getName() {...}
  public void setName (CharSequence name) {...}

  public CharSequence getLocation() {...}
  public void setLocation(CharSequence location) {...}
  //...
  }

However, my example generates the class from an Avro schema, which looks like this:

{
  "namespace": "net.kevinboone.apacheintegration.avrotest",
  "type": "record",
  "name": "Bear",
  "fields":
    [
    {"name": "name", "type": "string"},
    {"name": "location", "type": ["string", "null"]}
    ]
}

I hope that the connection between the schema and the Java class is relatively clear. Obviously, both will probably be much more complex in a real application.

I think it's somewhat ironic that Avro, a serialization tool, uses JSON -- a competing serialization format -- for its schemas. That's the disadvantage, I guess, of Avro's use of a serialized format that is not human-readable.

My application uses a Maven plug-in to turn the schema into a Java class (or, in a real application, a set of classes). The plug-in is set up in pom.xml like this:

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.10.1</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/..</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>

Because I'm using the same schema for both the sender and receiver modules, the schema file bear.avsc is placed in the directory above both modules. The sourceDirectory element in the plug-in therefore points at the directory above the one that contains the module's code. The plug-in reads .avsc files from this directory and writes Java to src/main/java, so the generated code ends up alongside the main Java code for the module. The schema namespace becomes the Java package name, and the schema name becomes the class name, so my generated class will be net.kevinboone.apacheintegration.avrotest.Bear.
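
To make that layout concrete, here is a rough sketch of the directory structure; the project and module names are illustrative, and may not match those in the repository exactly.

avro-jms-example/
  bear.avsc                  <- the shared Avro schema
  sender/
    pom.xml                  <- contains the avro-maven-plugin configuration
    src/main/java/           <- generated Bear.java ends up here
  receiver/
    pom.xml
    src/main/java/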

The sender

In the sender, I create various instances of class Bear, like this:

 Bear paddington = new Bear();
 paddington.setName ("Paddington");
 paddington.setLocation ("32 Windsor Gardens");

Let's see how to convert paddington to a byte array.

The first step is to load the schema from its JSON file. In practice, we might be loading it from some kind of registry, but in this example it's just a file on disk.

Schema schema = new Schema.Parser().parse (new File (SCHEMA_FILE));

From the schema we create an Avro SpecificDatumWriter, intended for a particular class:

bearDatumWriter = new SpecificDatumWriter<Bear>(schema);

The SpecificDatumWriter knows how to convert the Java object, but it doesn't, by itself, know how to write the result. For that we need to create a BinaryEncoder, which knows how to write binary data. The encoder is linked to an OutputStream, like this:

ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = null;
encoder = EncoderFactory.get().binaryEncoder (out, encoder);

Notice that the factory method that creates a BinaryEncoder can take an existing instance of BinaryEncoder as input. If this is null, as it will be when the program starts, then a new instance is created. Thereafter, the factory will just re-initialize the existing encoder with a new stream, rather than creating a new one. This is important in a practical application, because creating these artefacts is costly, and they can use substantial memory.
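
To make the reuse pattern explicit, here is a minimal sketch of a send loop; the bears collection and the sendBytes() helper are hypothetical, but the encoder handling follows the calls shown above.

BinaryEncoder encoder = null;  // reused across the whole loop

for (Bear bear : bears)  // 'bears' is a hypothetical collection of Bear objects
  {
  ByteArrayOutputStream out = new ByteArrayOutputStream ();
  // Only the first call allocates a new encoder; later calls just
  // re-initialize the existing one against the new stream
  encoder = EncoderFactory.get().binaryEncoder (out, encoder);
  bearDatumWriter.write (bear, encoder);
  encoder.flush ();
  byte[] bytes = out.toByteArray ();
  out.close ();
  sendBytes (bytes);  // hypothetical helper that wraps the bytes in a BytesMessage and sends it
  }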

Now, we use the encoder and the datum writer together, to encode each object:

bearDatumWriter.write (paddington, encoder);
encoder.flush();
byte[] bytes = out.toByteArray();
out.close();

What we're left with is a byte array bytes that contains the Avro representation of the data in the paddington object. We can pass this to the message broker by putting the byte array in a BytesMessage:

MessageProducer producer = //... 
BytesMessage message = session.createBytesMessage ();
message.writeBytes (bytes);
producer.send (message);

and the message is away. I'm not describing the JMS part of the process in much detail here, because this article is primarily about Avro. There are many examples of sending simple messages using JMS.
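
For readers who do want the JMS plumbing, here's a rough sketch using the Qpid JMS client; the broker URL and queue name are placeholders, and the code in the repository may be organized differently.

// The Qpid JMS client (org.apache.qpid.jms) provides an AMQP 1.0 ConnectionFactory
ConnectionFactory factory = new org.apache.qpid.jms.JmsConnectionFactory ("amqp://localhost:5672");
Connection connection = factory.createConnection ();
Session session = connection.createSession (false, Session.AUTO_ACKNOWLEDGE);
Queue queue = session.createQueue ("bears");  // assumed queue name
MessageProducer producer = session.createProducer (queue);

BytesMessage message = session.createBytesMessage ();
message.writeBytes (bytes);  // 'bytes' is the Avro-encoded Bear from the previous step
producer.send (message);

connection.close ();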

One point to note, though: the Avro SpecificDatumWriter works most efficiently when it's fed a stream of objects to convert. In my example, we have to flush the encoder and re-initialize it for each message, because I'm assuming that we want to pass objects one at a time over JMS, with each object in a separate message. However, there's no problem with placing multiple objects in the same JMS message, if the application allows for this, and doing so uses Avro more efficiently (see the sketch below).
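
As a sketch of that batching idea (my own variation, not part of the original example), several objects can be written through one encoder before the buffer is flushed and packed into a single BytesMessage:

ByteArrayOutputStream out = new ByteArrayOutputStream ();
encoder = EncoderFactory.get().binaryEncoder (out, encoder);
for (Bear bear : batch)  // 'batch' is a hypothetical list of Bear objects
  bearDatumWriter.write (bear, encoder);
encoder.flush ();

BytesMessage message = session.createBytesMessage ();
message.writeBytes (out.toByteArray ());
producer.send (message);

The receiver then has to know how many objects to expect in each message, or be able to detect the end of the data itself.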

The receiver

You probably won't be surprised to find that receiving is the mirror opposite of sending. We read the schema as before, and then create a datum reader object from it:

bearDatumReader = new SpecificDatumReader<Bear>(schema);

This, of course, is the counterpart of SpecificDatumWriter. Then we wait for a JMS message, which arrives in the form of a BytesMessage. We unpack the byte array from it like this:

BytesMessage message = (BytesMessage)consumer.receive();
byte[] bytes = new byte[(int)message.getBodyLength()];
message.readBytes (bytes);

Now we'll decode the byte array, using a BinaryDecoder, which is the counterpart of the BinaryEncoder we used for writing.

decoder = DecoderFactory.get().binaryDecoder (bytes, decoder);
Bear bear = bearDatumReader.read (null, decoder);
// Process the Java object...

As with the sender implementation, we re-initialize the decoder for each new message. Re-initializing it is more efficient than creating a new one but, if the application allows, passing a number of objects in the same message is more efficient still.
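
Putting the receiving pieces together, a sketch of the receive loop might look like the following; the use of BinaryDecoder.isEnd() to detect the end of a batch is my own addition, so check it against the Avro version you're using.

BinaryDecoder decoder = null;  // reused across messages

while (true)
  {
  BytesMessage message = (BytesMessage)consumer.receive ();
  byte[] bytes = new byte[(int)message.getBodyLength ()];
  message.readBytes (bytes);

  // Re-initialize the decoder over the new byte array
  decoder = DecoderFactory.get().binaryDecoder (bytes, decoder);
  while (!decoder.isEnd ())  // read every object packed into this message
    {
    Bear bear = bearDatumReader.read (null, decoder);
    // Process the Java object...
    }
  }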

Avro on the wire

Here's a hex dump of the actual binary data that is encoded by Avro, and sent to the message broker.

00000000  14 50 61 64 64 69 6e 67  74 6f 6e 00 24 33 32 20  |.Paddington.$32 |
00000010  57 69 6e 64 73 6f 72 20  47 61 72 64 65 6e 73     |Windsor Gardens|

The text strings are understandable -- they are just UTF-8 encoded. The few bytes of binary data are framing: 14 is the zig-zag-encoded length of "Paddington" (10 characters), 00 selects the "string" branch of the union type used for the location field, and 24 is the length of "32 Windsor Gardens" (18 characters). In a more complex data structure, more scaffolding information will be needed, but the Avro representation will always be more compact than XML or JSON.

To a large extent, it's the use of the schema that makes Avro so compact. When we write complex data using XML or JSON, the data carries its own structure, to a large extent. Each object that is sent will contain information about field names, and the nesting of the elements will reflect the nesting of the original data. There's no straightforward way to tell, just from an Avro byte dump, what the structure of the underlying Java objects is -- that's where the schema is required.

Discussion

That wasn't so bad, was it? The good thing is that it doesn't get any more complicated (for the developer), however complex the data structure is. Moreover, it's relatively straightforward to pass data between different programming languages, because the Avro data format is language-independent.

Unlike XML and JSON, Avro is fully dependent on a schema for all the data it sends and receives -- the data is meaningless without the schema. This makes Avro well-suited for machine-to-machine communication of objects whose structure is known in advance. It also makes it difficult to break a receiving application by sending it carefully-crafted, invalid data. This was certainly not true for Java serialization. That is, Avro is comparatively good at ensuring that the data is meaningful at both ends of the communication channel.

The reliance on a schema, and the need for all parties to a transaction to see the same schema, means that centralizing schema storage is a pressing concern. To do this, we might use a schema registry like Apicurio; I outline how to do that in a separate article.