2021-05-03
I work on astronomical data pipelines, and a lot of our data gets shipped around in Avro. It’s often a lot of data, like many terabytes, and with complex schemas. And because I’m working on astronomy software, nearly everything is written in Python, the lingua franca of the scientific computing world.
I was surprised initially at how slow Avro libraries are. Deserializing a single message of some of the data I work with took 10ms with the official Avro Python library, which is truly glacial. An alternative library, fastavro, performed much better, at around 1.5ms - but that still seemed way too slow.
Avro’s specification is refreshingly straightforward and clear: an Avro record is encoded as a simple concatenation of each of its fields. Those fields could be other records - in which case, again, simple concatenation is the rule. Or they could be “primitive” fields, like strings (length-prefixed UTF-8), ints or longs (zig-zag varints, like in Protobuf), or floats or doubles (IEEE 754, little-endian). There are a few other cases (maps, arrays, enums, fixeds) but believe me - they are all simple.
In the end, all you really need to know is that Avro messages are generally simple concatenations of their fields [1]. There are primitive types with clear rules for deserialization.
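To make that concrete, here is what a few of those primitive rules look like in plain Python. This is a toy sketch for illustration, not code from any of the libraries mentioned here:

import io
import struct

def read_long(src):
    # Avro ints and longs are zig-zag varints (like Protobuf).
    n, shift = 0, 0
    while True:
        byte = src.read(1)[0]
        n |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)  # undo zig-zag: 0, -1, 1, -2, 2, ...

def read_string(src):
    # A long length prefix, then that many bytes of UTF-8.
    return src.read(read_long(src)).decode("utf-8")

def read_double(src):
    # Eight bytes, IEEE 754, little-endian.
    return struct.unpack("<d", src.read(8))[0]

buf = io.BytesIO(b"\x06\x06abc")
print(read_long(buf))    # 3 (zig-zag encoding writes 3 as 0x06)
print(read_string(buf))  # 'abc' (length 3, then the raw bytes)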
So what could be slowing us down?
Avro data always includes the schema used to write it, so existing Avro libraries deal with schema changes by always reading with the writer’s schema that ships alongside the data - that way they always know how to deserialize it.
The libraries generally do this by holding the (parsed) schema in memory while deserializing data. They walk through the schema’s structure, keeping track of which type to expect next from the byte stream, and march forward. This works, but it’s ultimately a real drag on performance.
There’s a lot of overhead because libraries tend to work by using big switches over all the possible branches that a schema could take as they step through the schema’s graph, like this:
cpdef _read_data(fo, writer_schema, ...):
    record_type = extract_record_type(writer_schema)
    # [...]
    try:
        if record_type == "null":
            data = read_null(fo)
        elif record_type == "string":
            data = read_utf8(fo)
        elif record_type == "int" or record_type == "long":
            data = read_long(fo)
        elif record_type == "float":
            data = read_float(fo)
        elif record_type == "double":
            data = read_double(fo)
        elif record_type == "boolean":
            data = read_boolean(fo)
        elif record_type == "bytes":
            data = read_bytes(fo)
        elif record_type == "fixed":
            data = read_fixed(fo, writer_schema)
        # ... you get the idea. It keeps going.
This happens every time for every message - even when the schema contains no ambiguity, libraries re-analyze the schema on every message they deserialize. This is probably the dominant drag on performance.
Now, if you happened to have a schema in front of you, and you were hand-writing an Avro decoder, you wouldn’t have this problem. You could just decode each value in sequence, right in a row - read_int, read_string, read_int, read_bool - without switching on anything, since you already know the right order:
def decode_astro_message(fo):
    record = {}
    record["candidateId"] = read_string(fo)
    record["diaSourceId"] = read_long(fo)
    record["ra"] = read_double(fo)
    record["ra_err"] = read_float(fo)
    record["dec"] = read_double(fo)
    # ... etc etc
    return record
But of course, we can’t do that in general for an arbitrary, user-provided schema that we only discover at runtime while unpacking an Avro file… or can we?
You could construct a string holding the sequence of function calls you want to make - read_int(src), then read_string(src), etc. You could construct that in a single pass over the Avro schema document the first time you encounter it. You could pass that entire big string you have constructed to eval, making a Python function on the fly, and you could just call that dynamic, runtime-created function for each message. Voila!
This seemed too foolish to work, but it performed pretty well and was surprisingly simple to implement as a sketch (it’s under 100 lines).
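Something in that spirit might look like this minimal sketch - not avroc’s actual code, just an illustration assuming flat records of primitive fields and reader helpers like the read_long/read_string/read_double toys above. It uses exec rather than eval, but the trick is the same: source text in, callable out.

READERS = {"string": "read_string", "long": "read_long",
           "double": "read_double", "float": "read_float"}

def compile_decoder(schema, helpers):
    # `schema` is a parsed Avro record schema (a dict); `helpers` maps the
    # reader names above to actual functions. Flat records of primitives
    # only - just enough to show the technique.
    lines = ["def decode(src):", "    record = {}"]
    for field in schema["fields"]:
        lines.append(f'    record[{field["name"]!r}] = {READERS[field["type"]]}(src)')
    lines.append("    return record")
    namespace = dict(helpers)
    exec("\n".join(lines), namespace)  # source text becomes a real function object
    return namespace["decode"]

schema = {"type": "record", "name": "Example", "fields": [
    {"name": "id", "type": "long"},
    {"name": "ra", "type": "double"},
]}
decode = compile_decoder(schema, {"read_long": read_long, "read_double": read_double})
# decode(io.BytesIO(some_encoded_bytes)) -> {"id": ..., "ra": ...}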
This actually does shockingly well, really. It handily beat fastavro’s Cythonized code, running about 20-50% faster. So I decided to run with the idea, and make a real library which can compile Avro schemas into executable serializers and deserializers.
Of course, using eval seemed pretty weird. More significantly, manipulating strings got unwieldy: I had to deal with careful indentation, and manage variable names very carefully in my generated Python code. So I switched to generating a Python AST with the standard library’s ast module. Structures created with ast can still be evaluated directly at runtime to create executable function objects, so this worked just as well - but was a lot easier to work on.
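Here is the same compile_decoder sketch redone with ast nodes instead of string concatenation - again just an illustration of the technique, not avroc’s real code generator. It builds one assignment statement per field, splices those statements into a parsed function skeleton, and compiles the whole module into a callable (reusing the READERS table from the previous snippet):

import ast

def field_statement(field_name, reader_name):
    # AST for:  record["<field_name>"] = <reader_name>(src)
    return ast.Assign(
        targets=[ast.Subscript(
            value=ast.Name(id="record", ctx=ast.Load()),
            slice=ast.Constant(value=field_name),
            ctx=ast.Store(),
        )],
        value=ast.Call(
            func=ast.Name(id=reader_name, ctx=ast.Load()),
            args=[ast.Name(id="src", ctx=ast.Load())],
            keywords=[],
        ),
    )

def compile_decoder(schema, helpers):
    module = ast.parse("def decode(src):\n    record = {}\n    return record")
    func = module.body[0]
    stmts = [field_statement(f["name"], READERS[f["type"]]) for f in schema["fields"]]
    func.body[1:1] = stmts  # splice in between `record = {}` and `return record`
    ast.fix_missing_locations(module)  # new nodes need line numbers before compiling
    namespace = dict(helpers)
    exec(compile(module, "<generated>", "exec"), namespace)
    return namespace["decode"]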
I started out just focusing on deserialization, but decided early on to also support serialization, as well as Avro’s Schema Resolution rules which let a reader reshape data into a (compatible) schema of its own choosing.
The writer wasn’t too hard, but the resolving reader took a lot more thought than I expected, resulting in over 1,000 lines of (heavily commented) code.
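To give a flavor of why it took thought, here is the kind of schema pair the resolving reader has to reconcile (a hypothetical example; the field names are made up). Data written with writer_schema must come out shaped like reader_schema: the writer’s mag field is read and discarded, the reader’s survey field is filled in from its default, and ra is promoted from float to double.

writer_schema = {
    "type": "record", "name": "Source",
    "fields": [
        {"name": "ra", "type": "float"},
        {"name": "mag", "type": "float"},
    ],
}
reader_schema = {
    "type": "record", "name": "Source",
    "fields": [
        {"name": "ra", "type": "double"},
        {"name": "survey", "type": "string", "default": "unknown"},
    ],
}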
I also learned that Avro supports referencing types by name, which permits recursive schemas. Of course, naive compilation of a recursive schema would blow up my compiler - it would never finish its “simple concatenation” of decode calls. Recursive schemas are rare, but I wound up writing code to detect them and compile separate encoders and decoders for the recursive components; this made my code generators a lot more complex, all to handle a rare case.
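For example, this (made-up, but valid) schema describes a linked list of longs, where the next field refers back to the enclosing record by name - a naive compiler would recurse forever trying to flatten it:

linked_list_schema = {
    "type": "record",
    "name": "LongList",
    "fields": [
        {"name": "value", "type": "long"},
        {"name": "next", "type": ["null", "LongList"]},
    ],
}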
The name of the library I made is avroc [2]. You can pip install avroc today to give it a shot. I think it’s pretty good - I was careful when writing it to follow the spec closely, and it has several hundred tests.
Mostly, though, I thought this was an eye-opening programming experience for me. I had never really used eval to actually do anything useful - I mostly knew it as, well, evil. And while I remember vaguely from SICP that “code is data, data is code”, I had never really felt it before quite so clearly; the duality between Avro schemas and Python ASTs became extremely apparent as I worked on this, in a fascinating way.
Finally, it opened my eyes to a particular power of interpreted languages which is, I think, underappreciated. If I had wanted to do something similar in, say, Go, I would have needed to build an entire virtual machine. I would have needed to come up with my own instruction definitions, and fed them into a machine which chomped through “instructions”.
I’m basically doing the same thing here… but my VM is the Python interpreter, and all I need to feed it is Python code. Pretty cool.
[1] Well, almost all simple. The glaring exception is unions, which are encoded rather like tagged unions: the union’s definition lists a sequence of possible types, and the encoding is the index of the actual type that was used, followed immediately by an encoded value. Decoding this, then, always requires branching.
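For instance, a decoder for the union ["null", "double"] has to read the branch index before it knows what to decode - a sketch, using the toy read_long/read_double helpers from earlier:

def read_null_or_double(src):
    branch = read_long(src)   # the union writes its branch index as a long
    if branch == 0:
        return None           # branch 0 is "null": no bytes follow
    return read_double(src)   # branch 1 is "double": eight more bytes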
[2] Because it’s an avro compiler, so avroc? I’m open to naming suggestions…