2023-10-03

Over the last four months, I have been working on a Python library for working with data. The library is called quivr, and I think of it as an answer to a bunch of problems that I found when trying to build robust, maintainable systems and services on top of Numpy and Pandas data structures.

`quivr` provides structured, composable Tables instead of DataFrames or plain Python objects. There are a couple of ways you could think of this thing:

- Quivr’s Tables are like DataFrames, but with predefined schemas and method attachment.
- Quivr is like `dataclasses`, but for tabular datasets, giving you “struct-of-array”-style computation.
- Quivr is like an ORM for Apache Arrow data tables.

It looks something like this - this is a real example from a scientific library for work on asteroids which I contribute to:

```
from typing import Self

import numpy as np
import quivr as qv

# Times, CoordinateCovariances, and Origin are other Table classes
# defined elsewhere in the library.

class CartesianCoordinates(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()
    vx = qv.Float64Column()
    vy = qv.Float64Column()
    vz = qv.Float64Column()
    time = Times.as_column()
    covariance = CoordinateCovariances.as_column()
    origin = Origin.as_column()
    frame = qv.StringAttribute()

    @property
    def r(self) -> np.ndarray:
        """
        Position vector.
        """
        return np.array(self.table.select(["x", "y", "z"]))

    @property
    def r_mag(self) -> np.ndarray:
        """
        Magnitude of the position vector.
        """
        return np.linalg.norm(self.r, axis=1)

    @property
    def r_hat(self) -> np.ndarray:
        """
        Unit vector in the direction of the position vector.
        """
        return self.r / self.r_mag[:, None]

    def rotate(
        self, rotation_matrix: np.ndarray, frame_out: str
    ) -> Self:
        """
        Rotate Cartesian coordinates and their covariances by a given rotation
        matrix into a new reference frame.

        Covariance matrices are also rotated. Rotations will sometimes result
        in covariance matrix elements very near zero but not exactly zero. Any
        elements that are smaller than +-1e-25 are rounded down to 0.
        """
        # Math occurs ...
```

I’ll explain more of what `quivr` does, but first, I’ll explain more of the motivation.

Pandas DataFrames are a wonderful tool for data exploration. But sometimes, after a bunch of exploration, it comes time to engineer a more serious software system on top of your work. In my work, this happens when we take scruffy, prototype-stage scientific research ideas (like novel algorithms for asteroid detection), decide they’re worth really pursuing, and set out to build long-running data analysis pipelines or APIs that toughen up the prototypes.

The scruffy prototypes are invariably built with the Python data ecosystem’s tools, which means that Pandas DataFrames are usually central. We’re often taking a collection of Jupyter Notebooks and refactoring them into services or pipelines.

The flexibility of DataFrames (which made them so useful during the exploration phase!) becomes a liability when trying to build a more robust system. It can be hard to establish any invariants, or even be confident in expectations like “when this function runs, the ‘time’ column will not have any nulls.”

It’s common to add columns to dataframes which are computed from other values. This makes the program’s flow extremely stateful, and there is no way in function signatures to set expectations. You know, this sort of thing:

`df["mean_exposure_time"] = df.groupby("exposure_id")["mjd"].transform("mean")`

If any downstream callers want to make use of the `mean_exposure_time` column on the `df` DataFrame, they’re forced to either check before each use or live dangerously.
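The defensive version of that check might look something like this - a hypothetical downstream function (the names here are illustrative, not from any real codebase):

```python
import pandas as pd


def summarize_exposures(df: pd.DataFrame) -> pd.Series:
    # The function's signature can't express "df must already have a
    # mean_exposure_time column", so it has to check at runtime and
    # hope the error message is enough to point at the upstream step.
    if "mean_exposure_time" not in df.columns:
        raise ValueError(
            "expected a 'mean_exposure_time' column; "
            "did an upstream step forget to compute it?"
        )
    # One row per exposure: the derived column is constant within a group.
    return df.groupby("exposure_id")["mean_exposure_time"].first()
```

Every caller that touches the derived column needs a check like this, or it silently inherits the risk.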

Managing complexity is the main challenge when working on many (all?) sorts of software systems. The more that is possible at any point in your program, the more challenging this becomes. As the possible states multiply, the surface area for bugs grows very quickly, and it gets much harder to modify the code because implicit expectations can turn innocuous changes into vicious bugs.

Namespaces, as the Zen of Python says, are one honking great idea. Encapsulating behavior under namespaces lets us build expressive and clear APIs.

In Python, there are a few genres of namespaces, but one of the most familiar is classes and methods. But DataFrames don’t support subclassing, so you end up with tricky APIs with no effective namespacing - stuff like this:

```
def chi2_diff_positions(df1, df2) -> float:
    """
    Returns the chi-squared statistic for the differences in positions
    between two dataframes. The dataframes should have the same length
    and both have "x", "y", and "z" columns.
    """
    ...

def chi2_diff_velocities(df1, df2) -> float:
    """
    Returns the chi-squared statistic for the differences in velocities
    between two dataframes. The dataframes should have the same length
    and both have "vx", "vy", and "vz" columns.
    """
    ...
```

The API here can only be described, really, with docstrings and with function names because of the flexibility of dataframes. Using a package written in this style is challenging and error-prone. At function callsites it can be difficult to detect problems. And composing these functions amplifies the difficulty.

In a Jupyter notebook environment, people usually work around these problems by constantly printing the state of their dataframe. `print(df.head())` and `print(df.describe())` pepper the analysis to make sure visually that data is shaped properly. Of course, that doesn’t translate well to non-interactive settings.

I’m not going to belabor this point, because it is well-known and documented elsewhere. But Pandas DataFrames are not designed to be memory efficient. They cache and duplicate data aggressively to make interactivity work well.

This, again, is a design decision that makes sense within DataFrames’ domain of data exploration. But for large datasets, multiplying the memory usage by several factors is very costly, requiring larger host machines to run computation.

There are other projects tackling this particular problem (Vaex and Polars come to mind, as well as Pandas 2.0), which shows that it is a common concern.

When I first saw the difficulties with Pandas DataFrames, I reached for standard library tools like dataclasses and lists. This sort of thing:

```
from __future__ import annotations

import dataclasses

@dataclasses.dataclass
class Position:
    x: float
    y: float
    z: float

    def chi2_diff(self, other: Position) -> float:
        ...

@dataclasses.dataclass
class Velocity:
    vx: float
    vy: float
    vz: float

    def chi2_diff(self, other: Velocity) -> float:
        ...

p1 = [Position(...), Position(...), Position(...)]
p2 = [Position(...), Position(...), Position(...)]

def chi2_diff_positions(p1: list[Position], p2: list[Position]) -> float:
    if len(p1) != len(p2):
        raise ValueError("must be same length")
    for v1, v2 in zip(p1, p2):
        ...
```

This is an improvement in some ways over the Pandas structures. But
it has one very significant problem: it is *dog* slow.

Each of the dataclass instances is stored separately from all the others, and overhead from Python objects and lists entirely dominates the execution time for numeric operations when you use this sort of structure. In many domains, that’s totally fine, but when writing data-intensive applications (like scientific algorithms) it’s unacceptable.

Pandas helped us out of this problem by organizing the data in columnar form with Numpy. When organized in this way, the processor can rip through the data to compute very efficiently, and numpy provides vectorized instructions that can make things very quick indeed.

There’s an adage from the data-oriented design world that you should use “structs of arrays, not arrays of structs,” and this is a realization of that insight.
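To make the adage concrete, here is a small sketch (the names are illustrative) contrasting the two layouts for the same computation:

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


# Array of structs: one Python object per point. Arithmetic happens one
# attribute access at a time, entirely in the interpreter.
@dataclass
class Point:
    x: float
    y: float
    z: float


def r_mag_aos(points: list[Point]) -> list[float]:
    return [(p.x**2 + p.y**2 + p.z**2) ** 0.5 for p in points]


# Struct of arrays: one contiguous array per field. Arithmetic happens
# over whole columns at once, inside compiled NumPy kernels.
def r_mag_soa(x: np.ndarray, y: np.ndarray, z: np.ndarray) -> np.ndarray:
    return np.sqrt(x**2 + y**2 + z**2)
```

Both compute the same magnitudes, but the struct-of-arrays version touches memory sequentially and dispatches a handful of vectorized operations instead of millions of attribute lookups.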

I’ll cut to the chase and show what a quivr solution to the above would look like:

```
from __future__ import annotations

import quivr as qv
import numpy as np
import numpy.typing as npt

class Positions(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()

    def chi2_diff(self, other: Positions) -> npt.NDArray[np.float64]:
        residuals = [
            self.x.to_numpy() - other.x.to_numpy(),
            self.y.to_numpy() - other.y.to_numpy(),
            self.z.to_numpy() - other.z.to_numpy(),
        ]
        return compute_chi2(residuals)

class Velocities(qv.Table):
    vx = qv.Float64Column()
    vy = qv.Float64Column()
    vz = qv.Float64Column()

    def chi2_diff(self, other: Velocities) -> npt.NDArray[np.float64]:
        residuals = [
            self.vx.to_numpy() - other.vx.to_numpy(),
            self.vy.to_numpy() - other.vy.to_numpy(),
            self.vz.to_numpy() - other.vz.to_numpy(),
        ]
        return compute_chi2(residuals)
```

These `quivr.Table` subclasses define their columns and allow methods to be attached, like dataclasses do. But they hold arrays of data, like DataFrames do. They capture the benefits of both sides!

I think the API feels pretty straightforward to use. You can access a column named “`foo`” on a table “`tab`” via `tab.foo`.

So far, `quivr`’s Tables have worked very well for us. We’re finding that they reduce complexity greatly. They’re much easier to test than Pandas spaghetti, but without making any performance compromises.

Once we started doing this, we found room for more features that expand upon these benefits.

Quivr supports composition of tables. This means that tables can be mixed together to build larger structures. For example:

```
class CartesianCoordinates(qv.Table):
    position = Positions.as_column()
    velocity = Velocities.as_column()
    origin = Positions.as_column()
```

These can be, in turn, embedded in other larger structures:

```
class Observation(qv.Table):
    coords = CartesianCoordinates.as_column()
    timestamp = qv.TimestampColumn(unit="ms")
    observer = qv.StringColumn()
```

You get the same dot-delimited attribute access API:

```
obs1 = Observation.from_parquet(...)
obs2 = Observation.from_parquet(...)

obs1.coords.position.chi2_diff(obs2.coords.position)
```

Quivr’s data is backed by Apache Arrow.

Arrow is a specification for a memory layout that’s designed for
efficient computation. Arrow provides *Arrays*, which are
carefully designed structures to hold sequences of values. The values in
the sequence can be numbers, strings, or even more complex structures
like variable-length lists or structs.

Arrow has been taking the data ecosystem by storm over the last several years. By using a precise specification, Arrow has opened the door to much better interoperability, which helps when implementing tight numerical kernels as non-Python extensions, like with Rust or C.

Arrow’s Python implementation, `pyarrow`, is a relatively low-level library for manipulating these Arrow Arrays. Compared with Numpy, Arrow’s arrays have a less fluid computational API, but can represent a broader range of types, and are easier to serialize and work with across languages.

The Columns associated with a `quivr` Table are Arrow Arrays. Arrow’s support for compound data types, like structs, is how composition works internally while still maintaining columnar, efficient data.

Because Arrow is the backend for `quivr`, `quivr`’s tables inherit powerful tools for serialization and deserialization.

The best supported of these is Parquet, since the official Python library for Parquet is actually a submodule of Pyarrow.

Arrow’s design minimizes copies when loading data, which also makes it feasible to use `quivr` with data much larger than would fit in memory, by loading it from disk in batches.

There are a bunch more `quivr` features that we’ve found useful, including:

- pandera-like validation of input data
- Scalar attributes that can be attached to Tables
- An indexing system called “Linkages” for querying cross-Table relationships efficiently

`quivr` is ready to use today. To install, just `pip install quivr`. The docs are online and the code is BSD-licensed and on GitHub.

I’d love to hear any feedback you have. If you encounter issues or have ideas, please open a GitHub Issue on the quivr repository. Or go ahead and drop me a line; you can reach me at [email protected].