Data Serialization + RPC with Avro & Ruby

By Ilya Grigorik on February 16, 2010

Any programmer or project worth their salt needs to invent their own serialization, and if they are serious, an RPC framework - or, at least, that is what it seems like. Between, Protocol Buffers, Thrift, BERT, BSON, or even plain JSON, there is no shortage of choices and architectural decisions packed into each one. For that reason, when Doug Cutting (one of the lead developers on Hadoop) first proposed Avro in April of 2009, a healthy dose of skepticism was in order, after all, both Thrift and PB already had thriving communities - why reinvent the wheel? Having said that, the proposal passed and since then Avro has been making good progress.

Reviewing the latest benchmarks shows Avro as fully competitive in both speed and size of the output data to PB and Thrift. Though, neither speed nor size, while critical components, were the motivating reasons for Avro. Interestingly enough, Avro was designed by Doug Cutting with the goal of making it more friendly to dynamic environments (Python, Ruby, Pig, Hive, etc) where code generation is often an unnecessary and an unwanted step. Unlike PB, or Thrift, Avro stores its schema in plain JSON, as part of the output file, which makes it extremely easy to parse (JSON parsers are easy and abundant) and avoid the need for extra IDL definition stubs and compilers (though if you really want to, Avro can generate code stubs as well).

Embedding IDL with Avro

The decision to embed the Avro data schema alongside the binary packed data opens up a number of interesting use cases. First, dynamic frameworks such as Pig, Hive, or any other Hadoop infrastructure (the goal of Avro is to become the standard data exchange and RPC protocol for all of Hadoop), can load and process data on the fly, without looking for or invoking an IDL compiler. Additionally, having the original schema also allow us to do “data projection”: if the reader is only interested in a subset of the data, then it can selectively parse it out of the stream, allowing for faster processing and easy "versioning" support out of the box. Let’s take a look at a simple example:

SCHEMA = <<-JSON
{ "type": "record",
  "name": "User",
  "fields" : [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "verified", "type": "boolean", "default": "false"}
  ]}
JSON

file = File.open('data.avr', 'wb')
schema = Avro::Schema.parse(SCHEMA)
writer = Avro::IO::DatumWriter.new(schema)
dw = Avro::DataFile::Writer.new(file, writer, schema)
dw << {"username" => "john", "age" => 25, "verified" => true}
dw << {"username" => "ryan", "age" => 23, "verified" => false}
dw.close

Avro specification provides all the primitives types you would expect (string, bool, double, etc.), and also a number of complex types such as records, enums, arrays, maps, unions, and fixed. Also, default values and sort order can be applied for some of the types. The schema itself is a JSON document, which you can peek at in the header of any serialized Avro file, which means that when it comes to reading the data, we don’t need to know anything about the data itself, or alternatively, only read an available subset:

# read all data from avro file 
file = File.open('data.avr', 'r+')
dr = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
dr.each { |record| p record }

# extract the username only from the avro serialized file
READER_SCHEMA = <<-JSON
{ "type": "record",
  "name": "User",
  "fields" : [
    {"name": "username", "type": "string"}
 ]}
JSON

reader = Avro::IO::DatumReader.new(nil, Avro::Schema.parse(READER_SCHEMA))
dr = Avro::DataFile::Reader.new(file, reader)
dr.each { |record| p record }

RPC with Ruby and Avro

The RPC piece of Avro is also pretty straight forward: the protocol is defined as an Avro schema, where both the inputs and the methods (along side with the request / response input and output parameters) are provided inline. In principle, this also means that given the right framework, different clients could easily negotiate different data-formatting on the fly, without having to worry about versioning or conditional code paths. A simple Mail protocol with Avro:

{
  "namespace": "example.proto",
  "protocol": "Mail",

  "types": [{"name": "Message", "type": "record", "fields": [
        {"name": "to", "type": "string"},
        {"name": "from", "type": "string"},
        {"name": "body", "type": "string"}]
       }],

   "messages": {
      "replay": { "response": "string", "request": [] },
      "send": { "response": "string", "request": [{"name": "message", "type": "Message"}] }
  }
}

Here we defined a "Mail" protocol, which takes as input a record of type “Message”, which in turn, includes three strings: to, from, and body. Additionally, we defined two available methods (send and replay), which our server and client can use, as well as, their input and output parameters. Check out the full Ruby implementations of the server and client in the repo.

Avro, Ruby and Hadoop

The Ruby implementation of Avro still needs a lot of love and polish (it currently has a distinct Python smell to it), but given the growing adoption of Hadoop and the rising popularity of all the adjacent frameworks, Avro is definitely here to stay and for good reasons. So, before you invent another serialization framework, grab the source, build the gem, and give it a try.

Ilya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify — follow on Twitter.

Data Serialization + RPC with Avro & Ruby

Embedding IDL with Avro

RPC with Ruby and Avro

Avro, Ruby and Hadoop

High Performance Browser Networking - O'Reilly