Quivr: Improving DataFrames For Software Engineering

2023-10-03

Over the last four months, I have been building a Python library for working with data. The library is called quivr, and I think of it as an answer to a bunch of problems I ran into while trying to build robust, maintainable systems and services on top of Numpy and Pandas data structures.

quivr provides structured, composable Tables instead of DataFrames or plain Python objects. There are a couple of ways you could think of this thing: as dataclasses that hold columns of data rather than scalar fields, or as DataFrames with a fixed, enforced schema.

It looks something like this. This is a real example from a scientific library for asteroid research that I contribute to:


from typing import Self  # on Python < 3.11, import Self from typing_extensions

import numpy as np
import quivr as qv

# Times, CoordinateCovariances, and Origin are other qv.Table subclasses
# defined elsewhere in the same library.

class CartesianCoordinates(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()
    vx = qv.Float64Column()
    vy = qv.Float64Column()
    vz = qv.Float64Column()
    time = Times.as_column()
    covariance = CoordinateCovariances.as_column()
    origin = Origin.as_column()
    frame = qv.StringAttribute()

    @property
    def r(self) -> np.ndarray:
        """
        Position vector.
        """
        return np.array(self.table.select(["x", "y", "z"]))

    @property
    def r_mag(self) -> np.ndarray:
        """
        Magnitude of the position vector.
        """
        return np.linalg.norm(self.r, axis=1)

    @property
    def r_hat(self) -> np.ndarray:
        """
        Unit vector in the direction of the position vector.
        """
        return self.r / self.r_mag[:, None]
        
    def rotate(
        self, rotation_matrix: np.ndarray, frame_out: str
    ) -> Self:
        """
        Rotate Cartesian coordinates and their covariances by a given rotation 
        matrix into a new reference frame.

        Covariance matrices are also rotated. Rotations will sometimes result
        in covariance matrix elements very near zero but not exactly zero. Any
        elements that are smaller than +-1e-25 are rounded down to 0.
        """
        ... # Math occurs

I’ll explain more of what quivr does, but first, I’ll explain more of the motivation.

DataFrames are hard to build software around

Pandas DataFrames are a wonderful tool for data exploration. But sometimes, after a bunch of exploration, it comes time to engineer a more serious software system on top of your work. In my work, this happens when we take scruffy prototype research ideas (like novel algorithms for asteroid detection), decide they're worth really pursuing, and set out to build long-running data analysis pipelines or APIs that toughen up the prototypes.

The scruffy prototypes are invariably built with the Python data ecosystem’s tools, which means that Pandas DataFrames are usually central. We’re often taking a collection of Jupyter Notebooks and refactoring them into services or pipelines.

Problem 1: Flexibility

The flexibility of DataFrames (which made them so useful during the exploration phase!) becomes a liability when trying to build a more robust system. It can be hard to establish any invariants, or even be confident in expectations like “when this function runs, the ‘time’ column will not have any nulls.”

It's common to add columns to dataframes that are computed from other values. This makes the program's flow extremely stateful, and function signatures give you no way to set expectations. You know, this sort of thing:

df["mean_exposure_time"] = df.groupby("exposure_id")["mjd"].transform("mean")

If any downstream callers want to make use of the mean_exposure_time column on the df DataFrame, they’re forced to either check before each use or live dangerously.
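In practice, "check before each use" looks something like this hypothetical downstream function (summarize_exposures is illustrative, not from any real codebase):

import pandas as pd

def summarize_exposures(df: pd.DataFrame) -> pd.Series:
    # Nothing in the signature promises this column exists, so we have to
    # check at runtime or risk a KeyError deep inside the computation.
    if "mean_exposure_time" not in df.columns:
        raise ValueError("df must have a 'mean_exposure_time' column")
    return df.groupby("exposure_id")["mean_exposure_time"].first()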

Managing complexity is the main challenge when working on many (all?) sorts of software systems. The more that is possible at any point in your program, the more challenging this becomes. As the possible states multiply, the surface area for bugs grows very quickly, and it gets much harder to modify the code because implicit expectations can turn innocuous changes into vicious bugs.

Problem 2: Namespacing

Namespaces, as the Zen of Python says, are one honking great idea. Encapsulating behavior under namespaces lets us build expressive and clear APIs.

In Python, there are a few genres of namespaces, but one of the most familiar is classes and methods. But DataFrames aren't designed to be subclassed, so you end up with tricky APIs with no effective namespacing - stuff like this:


def chi2_diff_positions(df1, df2) -> float:
    """
    Returns the chi-squared statistic for the differences in positions
    between two dataframes. The dataframes should have the same length
    and both have "x", "y", and "z" columns.
    """
    ...

def chi2_diff_velocities(df1, df2) -> float:
    """
    Returns the chi-squared statistic for the differences in velocities
    between two dataframes. The dataframes should have the same length
    and both have "vx", "vy", and "vz" columns.
    """
    ...

The API here can only be described, really, with docstrings and with function names because of the flexibility of dataframes. Using a package written in this style is challenging and error-prone. At function callsites it can be difficult to detect problems. And composing these functions amplifies the difficulty.

In a Jupyter notebook environment, people usually work around these problems by constantly printing the state of their dataframe. print(df.head()) and print(df.describe()) pepper the analysis to visually confirm that the data is shaped properly. Of course, that doesn't translate well to non-interactive settings.

Problem 3: Memory usage

I'm not going to belabor this point, because it is well-known and documented elsewhere. But Pandas DataFrames are not designed to be memory efficient. They cache and duplicate data aggressively to make interactivity work well.

This, again, is a design decision that makes sense within DataFrames’ domain of data exploration. But for large datasets, multiplying the memory usage by several factors is very costly, requiring larger host machines to run computation.
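To make that concrete, here is a small illustration (the calls are standard pandas APIs, but the sizes are approximate and just for flavor):

import numpy as np
import pandas as pd

df = pd.DataFrame({"mjd": np.random.rand(10_000_000)})
print(df.memory_usage(deep=True).sum())  # roughly 80 MB of float64 data

# Many operations return full copies, so peak memory multiplies quickly:
df_sorted = df.sort_values("mjd")  # allocates roughly another 80 MB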

There are other projects tackling this particular problem (Vaex and Polars come to mind, as well as Pandas 2.0), which shows that it is a common concern.

Standard Python Objects Aren’t Sufficient

When I first saw the difficulties with Pandas DataFrames, I reached for standard library tools like dataclasses and lists. This sort of thing:

from __future__ import annotations  # allows the forward references in the type hints

import dataclasses

@dataclasses.dataclass
class Position:
    x: float
    y: float
    z: float
    
    def chi2_diff(self, other: Position) -> float:
        ...

@dataclasses.dataclass
class Velocity:
    vx: float
    vy: float
    vz: float
    
    def chi2_diff(self, other: Velocity) -> float:
        ...
        
p1 = [Position(...), Position(...), Position(...)]
p2 = [Position(...), Position(...), Position(...)]

def chi2_diff_positions(p1: list[Position], p2: list[Position]) -> float:
    if len(p1) != len(p2):
        raise ValueError("must be same length")
    for v1, v2 in zip(p1, p2):
        ...

This is an improvement in some ways over the Pandas structures. But it has one very significant problem: it is dog slow.

Each of the dataclass instances is stored separately from all the others, and overhead from Python objects and lists entirely dominates the execution time for numeric operations when you use this sort of structure. In many domains, that’s totally fine, but when writing data-intensive applications (like scientific algorithms) it’s unacceptable.

Pandas helped us out of this problem by organizing the data in columnar form with Numpy. When data is organized this way, the processor can rip through it very efficiently, and numpy provides vectorized operations that can make things very quick indeed.

There’s an adage from the data-oriented design world that you should use “structs of arrays, not arrays of structs,” and this is a realization of that insight.
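Here is a minimal sketch of that difference in pure Python and numpy (timings vary by machine; the point is the data layout, not the exact numbers):

import time

import numpy as np

N = 1_000_000

# Array of structs: one Python tuple per row
aos = [(float(i), float(i) * 2.0, float(i) * 3.0) for i in range(N)]

# Struct of arrays: one contiguous array per field
soa_x = np.arange(N, dtype=np.float64)
soa_y = np.arange(N, dtype=np.float64) * 2.0
soa_z = np.arange(N, dtype=np.float64) * 3.0

start = time.perf_counter()
total = sum(x + y + z for x, y, z in aos)  # pure-Python loop over objects
print("array of structs:", time.perf_counter() - start)

start = time.perf_counter()
total = (soa_x + soa_y + soa_z).sum()  # vectorized over whole columns
print("struct of arrays:", time.perf_counter() - start)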

Enter quivr

I’ll cut to the chase and show what a quivr solution to the above would look like:

from __future__ import annotations

import quivr as qv
import numpy as np
import numpy.typing as npt

# compute_chi2 is a helper defined elsewhere.


class Positions(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()

    def chi2_diff(self, other: Positions) -> npt.NDArray[np.float64]:
        residuals = [
            self.x.to_numpy() - other.x.to_numpy(),
            self.y.to_numpy() - other.y.to_numpy(),
            self.z.to_numpy() - other.z.to_numpy(),
        ]

        return compute_chi2(residuals)


class Velocities(qv.Table):
    vx = qv.Float64Column()
    vy = qv.Float64Column()
    vz = qv.Float64Column()

    def chi2_diff(self, other: Velocities) -> npt.NDArray[np.float64]:
        residuals = [
            self.vx.to_numpy() - other.vx.to_numpy(),
            self.vy.to_numpy() - other.vy.to_numpy(),
            self.vz.to_numpy() - other.vz.to_numpy(),
        ]

        return compute_chi2(residuals)

These quivr.Table subclasses define their columns and allow methods to be attached, like dataclasses do. But they hold columnar arrays of data, like DataFrames do. They capture the benefits of both sides!

I think the API feels pretty straightforward to use. You can access a column named “foo” on a table “tab” via tab.foo.
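For example, constructing a table and pulling a column out might look like this (a minimal sketch using quivr's from_kwargs constructor; the values are illustrative):

positions = Positions.from_kwargs(
    x=[1.0, 2.0, 3.0],
    y=[4.0, 5.0, 6.0],
    z=[7.0, 8.0, 9.0],
)

positions.x             # the x column, as Arrow data
positions.x.to_numpy()  # as a plain numpy array for numpy-style math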

So far, quivr’s Tables have worked very well for us. We’re finding that they reduce complexity greatly. They’re much easier to test than Pandas spaghetti, but without making any performance compromises.

Once we started doing this, we found room for more features that expand upon these benefits.

Composition

Quivr supports composition of tables. This means that tables can be mixed together to build larger structures. For example:


class CartesianCoordinates(qv.Table):
    position = Positions.as_column()
    velocity = Velocities.as_column()
    origin = Positions.as_column()

These can be, in turn, embedded in other larger structures:


class Observations(qv.Table):
    coords = CartesianCoordinates.as_column()
    timestamp = qv.TimestampColumn(unit="ms")
    observer = qv.StringColumn()

You get the same dot-delimited attribute access API:


obs1 = Observations.from_parquet(...)
obs2 = Observations.from_parquet(...)

obs1.coords.position.chi2_diff(obs2.coords.position)
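Construction composes the same way. A sketch, building the nested tables first and passing them in as column values (all values illustrative):

coords = CartesianCoordinates.from_kwargs(
    position=Positions.from_kwargs(x=[1.0], y=[2.0], z=[3.0]),
    velocity=Velocities.from_kwargs(vx=[0.1], vy=[0.2], vz=[0.3]),
    origin=Positions.from_kwargs(x=[0.0], y=[0.0], z=[0.0]),
)

observations = Observations.from_kwargs(
    coords=coords,
    timestamp=[1696291200000],  # milliseconds, matching unit="ms" above
    observer=["hypothetical-observatory"],
)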

Arrow

Quivr’s data is backed by Apache Arrow.

Arrow is a specification for a memory layout that’s designed for efficient computation. Arrow provides Arrays, which are carefully designed structures to hold sequences of values. The values in the sequence can be numbers, strings, or even more complex structures like variable-length lists or structs.

Arrow has been taking the data ecosystem by storm over the last several years. By using a precise specification, Arrow has opened the door to much better interoperability, which helps when implementing tight numerical kernels as non-Python extensions, like with Rust or C.

Arrow’s Python implementation, pyarrow, is a relatively low-level library for manipulating these Arrow Arrays. Compared with Numpy, Arrow’s arrays have a less fluid computational API, but can represent a broader range of types, and are easier to serialize and work with across languages.
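To give a feel for the level pyarrow operates at (these are real pyarrow calls, kept deliberately simple):

import pyarrow as pa
import pyarrow.compute as pc

xs = pa.array([1.0, 2.0, None], type=pa.float64())

xs.null_count   # 1: nullability is part of the memory layout itself
pc.add(xs, xs)  # compute kernels exist, but the API is lower-level than numpy's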

The Columns associated with a quivr Table are Arrow Arrays. Arrow’s support for compound data types, like structs, is how composition works internally while still maintaining columnar, efficient data.
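You can see this by inspecting the underlying pyarrow Table through the .table attribute (the same attribute used in the first example). Using the coords table constructed above, the schema looks roughly like this:

print(coords.table.schema)
# position: struct<x: double, y: double, z: double>
# velocity: struct<vx: double, vy: double, vz: double>
# origin: struct<x: double, y: double, z: double>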

Serialization/Deserialization

Because Arrow is the backend for quivr, quivr’s tables inherit powerful tools for serialization and deserialization.

The best supported of these is Parquet, since the official Python library for Parquet is actually a submodule of pyarrow.

Arrow's design minimizes copies when loading data, which also makes it feasible to use quivr with datasets much larger than memory by loading batches in from disk.
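A round trip looks like this (a sketch, assuming a to_parquet writer that mirrors the from_parquet reader shown earlier):

observations.to_parquet("observations.parquet")
restored = Observations.from_parquet("observations.parquet")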

And more!

There are a bunch more quivr features that we've found useful, including:

- pandera-like validation of input data
- Scalar attributes that can be attached to Tables
- An indexing system called "Linkages" for querying cross-Table relationships efficiently

Try it out

quivr is ready to use today. To install, just pip install quivr. The docs are online and the code is BSD-licensed and on GitHub.

I'd love to hear any feedback you have. If you encounter issues or have ideas, please open a GitHub issue on the quivr repository. Or go ahead and drop me a line; you can reach me at [email protected].