2021-05-03
I work on astronomical data pipelines, and a lot of our data gets shipped around in Avro. It’s often a lot of data, like many terabytes, and with complex schemas. And because I’m working on astronomy software, nearly everything is written in Python, the lingua franca of the scientific computing world.
I was surprised initially at how slow Avro libraries are. Deserializing a single message of some of the data I work with took 10ms with the official Avro Python library, which is truly glacial. An alternative library, fastavro, performed much better, at around 1.5ms - but that still seemed way too slow.
Avro’s specification is refreshingly straightforward and clear: an Avro record is encoded as a simple concatenation of each of its fields. Those fields could be other records - in which case, again, simple concatenation is the rule. Or they could be “primitive” fields, like strings (length-prefixed UTF-8), ints or longs (zig-zag varints, like in Protobuf), or floats or doubles (IEEE 754, little-endian). There are a few other cases (maps, arrays, enums, fixeds) but believe me - they are all simple.
In the end, all you really need to know is that Avro messages are generally simple concatenations of their fields [1]. There are primitive types with clear rules for deserialization.
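To make that concrete, here is what a few of those primitive rules look like in plain Python. This is a toy sketch for illustration, not code from any of the libraries mentioned here:

import io
import struct

def read_long(src):
    # Avro ints and longs are zig-zag varints (like Protobuf).
    n, shift = 0, 0
    while True:
        byte = src.read(1)[0]
        n |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)  # undo zig-zag: 0, -1, 1, -2, 2, ...

def read_string(src):
    # A long length prefix, then that many bytes of UTF-8.
    return src.read(read_long(src)).decode("utf-8")

def read_double(src):
    # Eight bytes, IEEE 754, little-endian.
    return struct.unpack("<d", src.read(8))[0]

buf = io.BytesIO(b"\x06\x06abc")
print(read_long(buf))    # 3 (zig-zag encoding writes 3 as 0x06)
print(read_string(buf))  # 'abc' (length 3, then the raw bytes)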
So what could be slowing us down?
Avro data always includes the schema used to write it, so existing Avro libraries deal with schema changes by always reading with the writer’s schema that ships alongside the data - that way they always know how to deserialize it.
The libraries generally do this by holding the (parsed) schema in memory while deserializing data. They walk through the schema’s structure, keeping track of which type to expect next from the byte stream, and march forward. This works, but it’s ultimately a real drag on performance.
There’s a lot of overhead because libraries tend to work by using big switches over all the possible branches that a schema could take as they step through the schema’s graph, like this:
cpdef _read_data(fo, writer_schema, ...):
    record_type = extract_record_type(writer_schema)
    # [...]
    try:
        if record_type == "null":
            data = read_null(fo)
        elif record_type == "string":
            data = read_utf8(fo)
        elif record_type == "int" or record_type == "long":
            data = read_long(fo)
        elif record_type == "float":
            data = read_float(fo)
        elif record_type == "double":
            data = read_double(fo)
        elif record_type == "boolean":
            data = read_boolean(fo)
        elif record_type == "bytes":
            data = read_bytes(fo)
        elif record_type == "fixed":
            data = read_fixed(fo, writer_schema)
        # ... you get the idea. It keeps going.
This happens every time for every message - even when the schema contains no ambiguity, libraries re-analyze the schema on every message they deserialize. This is probably the dominant drag on performance.
Now, if you happened to have a schema in front of you, and you were hand-writing an Avro decoder, you wouldn’t have this problem. You could just decode each value in sequence, right in a row - read_int, read_string, read_int, read_bool - without switching on anything, since you already know the right order:
def decode_astro_message(fo):
    record = {}
    record["candidateId"] = read_string(fo)
    record["diaSourceId"] = read_long(fo)
    record["ra"] = read_double(fo)
    record["ra_err"] = read_float(fo)
    record["dec"] = read_double(fo)
    # ... etc etc
    return record
But of course, we can’t do that in general for an arbitrary, user-provided schema that we only discover at runtime while unpacking an Avro file… or can we?
You could construct a string holding the sequence of function calls you want to make - read_int(src), then read_string(src), etc. You could construct that in a single pass over the Avro schema document the first time you encounter it. You could pass that entire big string you have constructed to eval, making a Python function on the fly, and you could just call that dynamic, runtime-created function for each message. Voila!
This seemed too foolish to work, but it performed pretty well and was surprisingly simple to implement as a sketch (it’s under 100 lines).
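Something in that spirit might look like this minimal sketch - not avroc’s actual code, just an illustration assuming flat records of primitive fields and reader helpers like the read_long/read_string/read_double toys above. It uses exec rather than eval, but the trick is the same: source text in, callable out.

READERS = {"string": "read_string", "long": "read_long",
           "double": "read_double", "float": "read_float"}

def compile_decoder(schema, helpers):
    # `schema` is a parsed Avro record schema (a dict); `helpers` maps the
    # reader names above to actual functions. Flat records of primitives
    # only - just enough to show the technique.
    lines = ["def decode(src):", "    record = {}"]
    for field in schema["fields"]:
        lines.append(f'    record[{field["name"]!r}] = {READERS[field["type"]]}(src)')
    lines.append("    return record")
    namespace = dict(helpers)
    exec("\n".join(lines), namespace)  # source text becomes a real function object
    return namespace["decode"]

schema = {"type": "record", "name": "Example", "fields": [
    {"name": "id", "type": "long"},
    {"name": "ra", "type": "double"},
]}
decode = compile_decoder(schema, {"read_long": read_long, "read_double": read_double})
# decode(io.BytesIO(some_encoded_bytes)) -> {"id": ..., "ra": ...}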
This actually does shockingly well, really. It handily beat fastavro’s Cythonized code, running about 20-50% faster. So I decided to run with the idea, and make a real library which can compile Avro schemas into executable serializers and deserializers.
Of course, using eval seemed pretty weird. More significantly, manipulating strings got unwieldy: I had to deal with careful indentation, and manage variable names very carefully in my generated Python code. So I switched to generating a Python AST with the standard library’s ast module. Structures created with ast can still be evaluated directly at runtime to create executable function objects, so this worked just as well - but was a lot easier to work on.
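Here is the same compile_decoder sketch redone with ast nodes instead of string concatenation - again just an illustration of the technique, not avroc’s real code generator. It builds one assignment statement per field, splices those statements into a parsed function skeleton, and compiles the whole module into a callable (reusing the READERS table from the previous snippet):

import ast

def field_statement(field_name, reader_name):
    # AST for:  record["<field_name>"] = <reader_name>(src)
    return ast.Assign(
        targets=[ast.Subscript(
            value=ast.Name(id="record", ctx=ast.Load()),
            slice=ast.Constant(value=field_name),
            ctx=ast.Store(),
        )],
        value=ast.Call(
            func=ast.Name(id=reader_name, ctx=ast.Load()),
            args=[ast.Name(id="src", ctx=ast.Load())],
            keywords=[],
        ),
    )

def compile_decoder(schema, helpers):
    module = ast.parse("def decode(src):\n    record = {}\n    return record")
    func = module.body[0]
    stmts = [field_statement(f["name"], READERS[f["type"]]) for f in schema["fields"]]
    func.body[1:1] = stmts  # splice in between `record = {}` and `return record`
    ast.fix_missing_locations(module)  # new nodes need line numbers before compiling
    namespace = dict(helpers)
    exec(compile(module, "<generated>", "exec"), namespace)
    return namespace["decode"]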
I started out just focusing on deserialization, but decided early on to also support serialization, as well as Avro’s Schema Resolution rules which let a reader reshape data into a (compatible) schema of its own choosing.
The writer wasn’t too hard, but the resolving reader took a lot more thought than I expected, resulting in over 1,000 lines of (heavily commented) code.
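To give a flavor of why it took thought, here is the kind of schema pair the resolving reader has to reconcile (a hypothetical example; the field names are made up). Data written with writer_schema must come out shaped like reader_schema: the writer’s mag field is read and discarded, the reader’s survey field is filled in from its default, and ra is promoted from float to double.

writer_schema = {
    "type": "record", "name": "Source",
    "fields": [
        {"name": "ra", "type": "float"},
        {"name": "mag", "type": "float"},
    ],
}
reader_schema = {
    "type": "record", "name": "Source",
    "fields": [
        {"name": "ra", "type": "double"},
        {"name": "survey", "type": "string", "default": "unknown"},
    ],
}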
I also learned that Avro supports referencing types by name, which permits recursive schemas. Of course, naive compilation of a recursive schema would blow up my compiler - it would never finish its “simple concatenation” of decode calls. Recursive schemas are rare, but I wound up writing code to detect them and compile separate encoders and decoders for the recursive components; this made my code generators a lot more complex, all to handle a rare case.
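For example, this (made-up, but valid) schema describes a linked list of longs, where the next field refers back to the enclosing record by name - a naive compiler would recurse forever trying to flatten it:

linked_list_schema = {
    "type": "record",
    "name": "LongList",
    "fields": [
        {"name": "value", "type": "long"},
        {"name": "next", "type": ["null", "LongList"]},
    ],
}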
The name of the library I made is avroc [2]. You can pip install avroc today to give it a shot. I think it’s pretty good - I was careful when writing it to follow the spec closely, and it has several hundred tests.
Mostly, though, I thought this was an eye-opening programming experience for me. I had never really used eval to actually do anything useful - I mostly knew it as, well, evil. And while I remember vaguely from SICP that “code is data, data is code”, I had never really felt it before quite so clearly; the duality between Avro schemas and Python ASTs became extremely apparent as I worked on this, in a fascinating way.
Finally, it opened my eyes to a particular power of interpreted languages which is, I think, underappreciated. If I had wanted to do something similar in, say, Go, I would have needed to build an entire virtual machine. I would have needed to come up with my own instruction definitions, and fed them into a machine which chomped through “instructions”.
I’m basically doing the same thing here… but my VM is the Python interpreter, and all I need to feed it is Python code. Pretty cool.
[1] Well, almost all simple. The glaring exception is unions, which are encoded rather like tagged unions: the union’s definition lists a sequence of possible types, and the encoding is the index of the actual type that was used, followed immediately by an encoded value. Decoding this, then, always requires branching.
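For instance, a decoder for the union ["null", "double"] has to read the branch index before it knows what to decode - a sketch, using the toy read_long/read_double helpers from earlier:

def read_null_or_double(src):
    branch = read_long(src)   # the union writes its branch index as a long
    if branch == 0:
        return None           # branch 0 is "null": no bytes follow
    return read_double(src)   # branch 1 is "double": eight more bytes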
[2] Because it’s an avro compiler, so avroc? I’m open to naming suggestions…