Avro – A Serialization Framework – Part 1

Introduction

This is the first part in a multi-part series about Apache Avro, a language-neutral data serialization system. Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages (currently C, C++, C#, Java, PHP, Python, and Ruby). Its functionality is similar to that of other marshaling systems such as Thrift and Protocol Buffers. The main differentiators of Avro include the following:

Dynamic typing: The Avro implementation always keeps data and its corresponding schema together. As a result, marshaling/unmarshaling operations do not require either code generation or static data types. This also allows generic data processing.

Untagged data: Because it keeps data and schema together, Avro marshaling/unmarshaling does not require type/size information or manually assigned IDs to be encoded in data. As a result, Avro serialization produces a smaller output.

Enhanced versioning support: When schemas change, Avro has access to both the writer’s and the reader’s schema, which enables differences to be resolved symbolically based on field names.

Many articles, blogs, and tutorials explain Avro within the Hadoop ecosystem, but Avro is also widely used outside Hadoop, for example as a binary message exchange format in real-time streaming applications built on Apache Kafka, where the producer and consumer can agree upon a message format.

The focus of this series is to understand Avro’s capabilities and interoperability in non-Hadoop ecosystems. The series is broken down into the following parts:

  • Avro – A Serialization Framework (this blog)
  • Avro with Python
  • Avro Interoperability with Java and Python
  • Avro Schema Design
  • Avro Schema Resolution & Projection

Use Case

Let’s define a use case, starting from the basics, and develop it over the course of this multi-part series. The use case is the well-known e-commerce shopping cart, which contains users, products, categories, orders, order line items, and shipping addresses. It is used here to explain the different Avro data types, and in later blogs in this series to explain more advanced concepts such as schema design, resolution, and projection.

What we want to do:

  • Create a simple Avro Schema and a corresponding data file in JSON format.
  • Convert the JSON file into binary Avro, and from binary Avro to JSON file using Avro Tools.
  • Create a Java program that reads a CSV file and converts it into binary Avro, then use Avro Tools to create the JSON file.

Solution

Before solving our use case, let’s get a few prerequisites out of the way. Fortunately, Avro needs only a couple of jar files to get started.

Prerequisites:

  • JDK 1.6+
  • Avro Jars: This blog series uses Avro 1.7.5

Avro Core: http://repo1.maven.org/maven2/org/apache/avro/avro/1.7.5/avro-1.7.5.jar

Avro Tools: http://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.7.5/avro-tools-1.7.5.jar

Avro Downloads Page: http://avro.apache.org/releases.html#Download

  • Verify:
    • Download the Avro core and tools jars into a directory
    • Run the Avro Tools jar to see what’s included (see the command after this list)
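Assuming both jars were downloaded into the current working directory, running the tools jar with no arguments prints its list of available subcommands:

    java -jar avro-tools-1.7.5.jar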

Create an Avro Schema and a corresponding data file in JSON format

  • Create an Avro schema: The schema defines a simple e-commerce product structure that exercises both primitive and complex Avro data types (a sketch of the schema appears after this list).
    • Primitive types: long, string, int, null, boolean, float, and double
    • Complex types: record, enum, array, union, and fixed (map is not used in this schema).
      • The Product schema is of type record, which is a collection of named fields of any type
      • The product_description field is defined as a union, so it can hold either a string or null; the default is an empty string
      • The product_status field must contain one of the values in the enum, and the default value is “AVAILABLE”
      • The product_category field contains an array of categories of type string
      • The product_hash field is of the fixed complex type, which holds a fixed number of bytes (here 5); in this example it is populated from a string
      • More details on primitive and complex types can be found at http://avro.apache.org/docs/1.7.6/spec.html#schema_complex
  • Create JSON data that corresponds to the above schema:
    • The JSON data (sketched after this list) corresponds to the schema above and shows how each of the complex types is expressed.
    • One particular challenge with this data is covered in the Challenges section below.
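A minimal version of such a schema (product.avsc) might look like the following; the namespace, the enum symbols, and the extra fields covering the remaining primitive types (price, shipping_weight, quantity_available, is_featured) are illustrative assumptions:

    {
      "type": "record",
      "name": "Product",
      "namespace": "com.example.avro",
      "fields": [
        {"name": "product_id", "type": "long"},
        {"name": "product_name", "type": "string"},
        {"name": "product_description", "type": ["string", "null"], "default": ""},
        {"name": "product_status",
         "type": {"type": "enum", "name": "ProductStatus",
                  "symbols": ["AVAILABLE", "OUT_OF_STOCK", "ONLY_FEW_LEFT"]},
         "default": "AVAILABLE"},
        {"name": "product_category", "type": {"type": "array", "items": "string"}},
        {"name": "product_hash", "type": {"type": "fixed", "name": "ProductHash", "size": 5}},
        {"name": "price", "type": "float"},
        {"name": "shipping_weight", "type": "double"},
        {"name": "quantity_available", "type": "int"},
        {"name": "is_featured", "type": "boolean"}
      ]
    }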
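Corresponding JSON data (product.json), again a sketch matching the illustrative schema above; note the {"string": ...} wrapper on the union field, which is discussed under Challenges:

    {
      "product_id": 1001,
      "product_name": "Hugo Xy",
      "product_description": {"string": "Hugo Xy Men 100 ml"},
      "product_status": "ONLY_FEW_LEFT",
      "product_category": ["fragrance", "perfume"],
      "product_hash": "XY123",
      "price": 54.99,
      "shipping_weight": 1.2,
      "quantity_available": 12,
      "is_featured": true
    }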

Challenges:

  • When converting JSON to Avro and vice versa using the Java JSON encoder, Avro expects union values in the JSON data to be wrapped in a name/value pair naming the branch type, as the product_description field shows:

Valid: "product_description": {"string": "Hugo Xy Men 100 ml"}
Invalid: "product_description": "Hugo Xy Men 100 ml"

  • The Avro Java tool will throw an error if the JSON data uses the invalid form
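The exact message depends on the Avro version, but with Avro 1.7.x it looks roughly like this:

    Exception in thread "main" org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_STRING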

Convert the JSON file into binary Avro, and from binary Avro to JSON file using Avro Tools

  • Convert from JSON to binary Avro:
    • Use Avro Tools to convert the JSON data to binary Avro
      • Put product.avsc in a schemas directory under any working directory
      • Put product.json in an input directory
      • Create an output directory for the tool to write the Avro file to
      • Execute the fromjson command (sketched after this list), which is fairly self-explanatory
    • Verify that the command executed successfully by checking the output directory for the Avro file; on success, the command returns without any errors.
  • Peek into the binary Avro:
    • Let’s take a look inside the binary Avro file to understand what’s in there (see the commands after this list). Avro always writes the schema that was used into the binary Avro file it creates.
    • Because the schema used to write the data is always available when the data is read, Avro data itself is not tagged with type information. This is what enables Avro’s “dynamic typing” and “untagged data” features and, unlike other data serialization systems, makes code generation optional.
    • This also results in a very compact encoding, since encoded values do not need to be tagged with a field identifier.
  • Convert from binary Avro to JSON:
    • It’s time to do the reverse and get the JSON data back from the binary Avro we created. Use Avro Tools to convert the binary Avro to JSON by specifying the binary Avro file location and the location for the JSON output (see the command after this list).
    • On successful execution, the command creates a JSON file that is very similar to the original JSON input file.
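The fromjson conversion can be run as follows; the directory layout matches the steps above, and the output file name is an assumption:

    java -jar avro-tools-1.7.5.jar fromjson --schema-file schemas/product.avsc \
        input/product.json > output/product.avro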
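To peek at the file, one option is the getschema subcommand, which reads the schema back out of the file header; dumping the first few hundred bytes also shows the JSON schema sitting in plain text at the top of the binary file:

    java -jar avro-tools-1.7.5.jar getschema output/product.avro
    head -c 300 output/product.avro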
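The reverse conversion uses the tojson subcommand; the output file name is again an assumption:

    java -jar avro-tools-1.7.5.jar tojson output/product.avro > output/product_from_avro.json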

Create a Java program that reads the CSV file and converts it into binary Avro, then use Avro Tools to create the JSON file

  • Create a CSV file that contains product information:

Note: The product_category value is enclosed in double quotes so that the commas inside it are not treated as field separators. A sample row is shown below.
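A sample row, matching the illustrative schema sketched earlier (all values are made up):

    1001,Hugo Xy,Hugo Xy Men 100 ml,ONLY_FEW_LEFT,"fragrance,perfume",XY123,54.99,1.2,12,true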

  • Create a Java program that reads the CSV file and the Avro schema, and writes an Avro file (sketched after this list):
      • The OpenCSV library was chosen to process the CSV file; it is available from http://repo1.maven.org/maven2/net/sf/opencsv/opencsv/2.3/opencsv-2.3.jar
      • The Java program needs the locations of the Avro schema and the CSV file as input, and an output location for the binary Avro file. These locations are hard-coded in the program but could be passed as arguments if desired.
      • The program uses Avro’s GenericRecord, which leverages Avro’s dynamic typing: there is no need to create classes from the schema, and no code generation is required.
      • DatumWriter, DataFileWriter, and GenericRecord are the important Avro classes used to associate the Avro schema with the output file and to append the data read from the CSV to the writer.
      • Observe how product_category is split into a collection, and how the special Avro class GenericData.Fixed is used when the schema contains fields of type fixed.
      • To run the program, put the Avro core jar and the OpenCSV jar on the classpath.
      • Avro Tools can then be used to deserialize the generated binary Avro into a JSON file, using the same tojson command shown earlier.
  • Challenges:
    • Avro expects each field to be written with the proper data type; otherwise a ClassCastException is thrown. For example, product_id is declared as long in the schema, but if a String is supplied during serialization, an exception like the first one shown after this list is thrown.
    • Fields of type array must be converted into collections to avoid the second exception shown below.
    • Fields of type fixed should use the GenericData.Fixed class, with the String converted into bytes, to avoid the third exception shown below.
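A minimal sketch of such a program, assuming the illustrative schema sketched earlier, the directory layout used above, and OpenCSV 2.3 (the class name CsvToAvro and the CSV column order are assumptions):

    import java.io.File;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;

    import au.com.bytecode.opencsv.CSVReader;

    public class CsvToAvro {
        public static void main(String[] args) throws Exception {
            // Locations are hard-coded, as described above.
            Schema schema = new Schema.Parser().parse(new File("schemas/product.avsc"));
            CSVReader csv = new CSVReader(new FileReader("input/product.csv"));

            DatumWriter<GenericRecord> datumWriter =
                new GenericDatumWriter<GenericRecord>(schema);
            DataFileWriter<GenericRecord> fileWriter =
                new DataFileWriter<GenericRecord>(datumWriter);
            // Associates the schema with the output file; the schema is
            // written into the file header.
            fileWriter.create(schema, new File("output/product.avro"));

            String[] row;
            while ((row = csv.readNext()) != null) {
                GenericRecord product = new GenericData.Record(schema);
                // Each value must be converted to the type the schema
                // declares (see the Challenges section).
                product.put("product_id", Long.parseLong(row[0]));
                product.put("product_name", row[1]);
                product.put("product_description", row[2]);
                product.put("product_status", row[3]);

                // Array fields must be collections, not raw strings.
                List<String> categories = new ArrayList<String>();
                for (String c : row[4].split(",")) {
                    categories.add(c);
                }
                product.put("product_category", categories);

                // Fixed fields need GenericData.Fixed wrapping the raw bytes.
                Schema hashSchema = schema.getField("product_hash").schema();
                product.put("product_hash",
                    new GenericData.Fixed(hashSchema, row[5].getBytes()));

                product.put("price", Float.parseFloat(row[6]));
                product.put("shipping_weight", Double.parseDouble(row[7]));
                product.put("quantity_available", Integer.parseInt(row[8]));
                product.put("is_featured", Boolean.parseBoolean(row[9]));

                fileWriter.append(product);
            }
            fileWriter.close();
            csv.close();
        }
    }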
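The exact exception text depends on the Avro version, but the three failure modes look roughly like the following (with Avro 1.7.x the primitive writers cast numeric values through java.lang.Number, hence the first message):

    Wrong type for product_id (long in the schema, String supplied):
        java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Number

    Raw string written to the array field product_category:
        java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Collection

    Raw string written to the fixed field product_hash:
        java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.generic.GenericFixed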

Conclusion

  • Avro has a lot of potential and use outside of the Hadoop ecosystem, and it is a good solution for language interoperability.
  • Avro’s compact binary encoding, together with the schema stored at the top of the binary file, makes it well suited for message payloads that can easily be transported over different types of message queues. It is widely used as a message payload format in Apache Kafka.
  • Avro also provides basic RPC functionality with a higher level of schema abstraction.
  • Later parts in this series will focus on Avro’s schema resolution, where the writer and reader can use different versions of a schema and still process the message properly, enabling schema evolution, resolution, and projection.
