Avro: Intro

This lesson explains the Avro serialization system.

Avro

Avro is a data serialization system that provides fast, compact binary serialization of data as well as support for remote procedure calls (RPC). Its defining feature is that the schema is always embedded in an Avro file, so a reader can decode the file without knowing the schema beforehand. The name Avro is borrowed from a defunct British aircraft manufacturer.
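To make the embedded schema concrete, here is a minimal Python sketch of the header of an Avro object container file as the Avro specification lays it out: the four magic bytes (Obj followed by 0x01), a metadata map whose avro.schema entry holds the schema as JSON text, and a 16-byte sync marker. The helper names (build_header, read_schema) and the sample Example schema are assumptions made for illustration, the all-zero sync marker is a placeholder for the random one real files use, and production code would normally rely on an Avro library rather than hand-rolled parsing.

```python
import io
import json

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def write_long(out, n):
    """Write n as an Avro long: zigzag encoding, then a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    while z > 0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))


def read_long(inp):
    """Read one zigzag, varint-encoded long."""
    z, shift = 0, 0
    while True:
        byte = inp.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1)
        shift += 7


def write_bytes(out, data):
    """Write a length-prefixed byte string (Avro's bytes/string encoding)."""
    write_long(out, len(data))
    out.write(data)


def build_header(schema):
    """Build a container-file header that embeds the schema (hypothetical helper)."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": b"null",  # the "null" codec means uncompressed data blocks
    }
    out = io.BytesIO()
    out.write(MAGIC)
    write_long(out, len(meta))      # one map block holding every entry
    for key, value in meta.items():
        write_bytes(out, key.encode("utf-8"))
        write_bytes(out, value)
    write_long(out, 0)              # a zero count ends the map
    out.write(bytes(16))            # placeholder sync marker (random in real files)
    return out.getvalue()


def read_schema(data):
    """Recover the schema from the header bytes alone, with no prior knowledge of it."""
    inp = io.BytesIO(data)
    if inp.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(inp)
        if count == 0:
            break
        if count < 0:               # a negative count is followed by a byte size
            count = -count
            read_long(inp)
        for _ in range(count):
            key = inp.read(read_long(inp)).decode("utf-8")
            meta[key] = inp.read(read_long(inp))
    return json.loads(meta["avro.schema"])


# Illustrative schema, not part of the lesson's example.
schema = {"type": "record", "name": "Example",
          "fields": [{"name": "id", "type": "long"}]}
header = build_header(schema)
recovered = read_schema(header)
print(recovered == schema)
```

The reader function receives nothing but bytes, yet it ends up with the full schema, which is what lets Avro tooling open a file it has never seen before.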

A producer that writes records to an Avro file must specify the schema that describes the structure of each record. That schema is expressed in JSON. However, a higher-level abstraction, Avro IDL, lets developers write schemas in a more human-readable form. We’ll explore IDL in a later lesson. For now, let’s see an example using a JSON schema.

Example

We will create an Avro file consisting of records that each represent a car. But first, let’s define the schema for each record. A schema can contain primitive types (such as boolean, int, and long) and complex types (such as map, array, and enum). A type written as a JSON array, such as ["string", "null"], is a union: the field may hold either a string or null. The JSON schema for our car record looks like this:

{
  "namespace": "datajek.io.avro",
  "type": "record",
  "name": "Car",
  "fields": [
    {
      "name": "make",
      "type": "string"
    },
    {
      "name": "model",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "year",
      "type": [
        "int",
        "null"
      ]
    },
    {
      "name": "horsepower",
      "type": [
        "int",
        "null"
      ]
    }
  ]
}
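To see what the schema permits, the following Python sketch loads the same schema with the standard json module and checks plain dictionaries against it. It is a simplified illustration rather than Avro's own validation: the check_record helper and the sample records are hypothetical, only the primitive types and unions used by Car are handled, an absent key is treated as null, and the 32-bit range of int is not enforced.

```python
import json

CAR_SCHEMA = json.loads("""
{
  "namespace": "datajek.io.avro",
  "type": "record",
  "name": "Car",
  "fields": [
    {"name": "make", "type": "string"},
    {"name": "model", "type": ["string", "null"]},
    {"name": "year", "type": ["int", "null"]},
    {"name": "horsepower", "type": ["int", "null"]}
  ]
}
""")

# Python types accepted for the Avro primitives that appear in the Car schema.
PYTHON_TYPES = {"string": str, "int": int, "null": type(None)}


def matches(value, avro_type):
    """Return True if value fits avro_type (a primitive name or a union list)."""
    if isinstance(avro_type, list):  # a JSON array denotes a union of types
        return any(matches(value, branch) for branch in avro_type)
    if avro_type == "int" and isinstance(value, bool):
        return False  # bool is a subclass of int in Python, but not an Avro int
    return isinstance(value, PYTHON_TYPES[avro_type])


def check_record(record, schema):
    """Return the names of fields whose values do not fit the schema."""
    bad_fields = []
    for field in schema["fields"]:
        value = record.get(field["name"])  # assumption: an absent key counts as null
        if not matches(value, field["type"]):
            bad_fields.append(field["name"])
    return bad_fields


# Hypothetical sample records.
ok = {"make": "Porsche", "model": "911", "year": 2019, "horsepower": None}
bad = {"make": None, "model": "911", "year": "2019", "horsepower": 443}

print(check_record(ok, CAR_SCHEMA))   # no offending fields
print(check_record(bad, CAR_SCHEMA))  # make is null, year is a string
```

Only make rejects null here, because it is the one field whose type is not a union with null.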

Using the above schema, we can create an .avro file consisting of car records. With Avro's generic Java API, each record is an instance of the GenericRecord interface, and its fields are written as key/value pairs. Representing records with GenericRecord costs us type safety: field values come back as Object, so reading a record back from the Avro file requires casts. Alternatively, we can use avro-tools to generate Java classes from the schema and work with the generated code instead of generic records.
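The same trade-off can be sketched outside Java. In the Python illustration below, a plain dictionary stands in for a generic record and a dataclass stands in for a class generated from the schema. Both are analogies chosen for this sketch rather than Avro APIs, and the Car class is written by hand, not produced by avro-tools.

```python
from dataclasses import dataclass
from typing import Optional

# Generic style: fields are looked up by name, so nothing ties a key to a type
# until the value is actually used.
generic_car = {"make": "Porsche", "model": "911", "year": 2019, "horsepower": 443}
year = generic_car["year"]
if not isinstance(year, int):  # this runtime check plays the role of a Java cast
    raise TypeError("year is not an int")


# Typed style: a hand-written stand-in for a generated class. The field names and
# types mirror the Car schema, so editors and type checkers can catch mistakes.
@dataclass
class Car:
    make: str
    model: Optional[str] = None
    year: Optional[int] = None
    horsepower: Optional[int] = None


typed_car = Car(**generic_car)
print(year, typed_car.year)

# A misspelled field name behaves differently in each style.
print(generic_car.get("horsepwer"))     # the lookup quietly yields None
print(hasattr(typed_car, "horsepwer"))  # the typed class has no such attribute
```

With the dictionary, a misspelled key surfaces only when the code runs, if at all. With the typed class, the set of fields is fixed up front, which is the benefit the generated Java classes provide.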
