Avro with Python – Part 2

Introduction

This is the second part of a multi-part series on Apache Avro in the Python landscape. In many organizations, it is common to research, prototype, and test new ideas in a domain-specific computing language like MATLAB or R, and then later port those ideas into a larger production system written in, say, Java, C#, or C++. Many Python libraries, such as NumPy, pandas, Matplotlib, SciPy, and others, are well suited for big data analysis and are gaining popularity. Without further ado, let's see how Python and Avro play together. It's important to have a basic understanding of Avro, which is covered in the first blog in this series.

Use Case

Let's reuse the e-commerce shopping cart example discussed in the first blog and see how the same thing can be done in Python. An illustrative product schema is sketched below.
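
For reference, here is an illustrative schemas/product.avsc used throughout the examples that follow. The field names and types are assumptions made for this walkthrough; the actual schema from the first blog may differ.

    {
      "namespace": "com.example.shop",
      "type": "record",
      "name": "Product",
      "fields": [
        {"name": "id",        "type": "int"},
        {"name": "name",      "type": "string"},
        {"name": "price",     "type": "double"},
        {"name": "available", "type": "boolean"}
      ]
    }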

What we want to do:

  • Ensure Python is set up
  • Install Snappy Compression and Avro Tools
  • Write a Python program that creates the binary Avro
  • Write a Python program that reads the binary Avro and outputs JSON
  • Use Python Avro Tools to filter and read the binary Avro

Solution

Ensure Python is set up:

  • There are many blogs and articles explaining how to install Python; a couple of them are linked below. For this use case, we use Ubuntu and Python 2.7.5.

http://askubuntu.com/questions/101591/how-do-i-install-python-2-7-2-on-ubuntu

http://heliumhq.com/docs/installing_python_2.7.5_on_ubuntu
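
A quick sanity check of the interpreter (the exact patch version naturally depends on how Python was installed):

    $ python --version
    Python 2.7.5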

Install Snappy Compression and Avro Tools:

  • Avro for Python requires Snappy for its compression codec. Installing Snappy and the Avro tools is a breeze; a consolidated sketch of the commands follows this list.
  • Install the Snappy compression toolkit.
  • Install the Avro tools:
    • Option 1: build and install from the Apache Avro source release.
    • Option 2: install using pip or easy_install.
    • Verify the installation.
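
Here is a minimal sketch of the commands involved, assuming Ubuntu with apt-get and pip available; package names can vary across releases, so treat this as a guide rather than a definitive recipe:

    # Snappy native library and development headers (Ubuntu package name may vary by release)
    $ sudo apt-get install libsnappy-dev

    # Option 1: build and install the avro Python package from the Apache source release
    # Option 2: install from PyPI using pip (or easy_install)
    $ sudo pip install python-snappy avro

    # Verify: both modules should import without errors
    $ python -c "import avro, snappy"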

Write a Python program that creates the binary Avro:
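
Below is a minimal sketch of the writer. The file names, record values, and schema path are assumptions tied to the illustrative schema shown earlier:

    # write_products.py -- serialize a few product records into a Snappy-compressed Avro file
    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    # Parse the writer's schema (Python 2 avro API)
    schema = avro.schema.parse(open("schemas/product.avsc").read())

    # Open the output file with the snappy codec (requires python-snappy)
    writer = DataFileWriter(open("products.avro", "wb"), DatumWriter(), schema, codec="snappy")
    writer.append({"id": 1, "name": "Tennis Racket", "price": 49.99, "available": True})
    writer.append({"id": 2, "name": "Tennis Balls", "price": 9.50, "available": False})
    writer.close()

Running the script with python write_products.py should leave a products.avro file in the working directory.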

  • Verify that the output file (products.avro in the sketch above) has been created.
  • Peek into the binary Avro:
    • Let's take a look inside the binary Avro file to understand what's in it. Avro always writes the schema that was used to create the file into the file itself (see the sketch after this list).
    • Avro binary data is always serialized along with its schema. Because the schema used to write the data is always available when the data is read, the data itself is not tagged with type information. This is what gives Avro its "dynamic typing" and "untagged data" characteristics and, unlike many other data serialization systems, makes code generation optional.
    • This also results in a very compact encoding, since encoded values do not need to be tagged with a field identifier.
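
As a sketch of that peek, the writer's schema (and the codec) can be pulled straight out of the container file's header metadata; the file name matches the writer sketch above:

    # peek_schema.py -- print the metadata embedded in the Avro container file
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(open("products.avro", "rb"), DatumReader())
    print(reader.get_meta("avro.schema"))  # the writer's schema, stored as JSON in the header
    print(reader.get_meta("avro.codec"))   # the compression codec, e.g. snappy
    reader.close()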

Write a Python program that reads the binary Avro and outputs JSON
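
A minimal sketch of the reader, using the products.avro file produced above. Note that no schema needs to be supplied: the reader picks up the writer's schema from the file itself.

    # read_products.py -- deserialize the Avro file and print each record as JSON
    import json
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(open("products.avro", "rb"), DatumReader())
    for product in reader:  # each record comes back as a Python dict
        print(json.dumps(product))
    reader.close()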

  • Verify that each record prints as a JSON document.

Use Python Avro Tools to filter and read the binary Avro

  • The Python Avro package ships a handy "avro cat" command-line executable that lets you select only certain fields from a binary Avro file, and it also provides a filtering option that can be used to query against those fields and pick out specific records.
  • The command below displays all products that are available (see the sketch after this list).
  • The command below displays all products whose price is greater than 10.
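
The commands below are a sketch against the assumed products.avro file. The option names (--fields, --filter) and the record variable r used in the filter expressions are assumptions; confirm them against avro cat --help on your installation.

    # Project only selected fields from the binary Avro file
    $ avro cat products.avro --fields name,price

    # Display all products that are available (the filter is a Python expression over each record)
    $ avro cat products.avro --filter "r['available']"

    # Display all products whose price is greater than 10
    $ avro cat products.avro --filter "r['price'] > 10"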

Conclusion

  • Avro has good support for Python, and serialization/deserialization without code generation is a big productivity boost for developers.
  • Avro's schema resolution and interoperability make it a natural choice when Java and Python components need to exchange data.
  • A later post in this series will focus on Avro's schema resolution, where the writer and the reader can use different versions of a schema and still process messages correctly, enabling schema evolution, resolution, and projection.
