100× Aggregates

TL;DR

Removed Python object materialisation from common aggregates (SUM/MIN/MAX).
Implemented type-specific native methods in Cython over Draken buffers.
Dense numeric columns now run ~10–100× faster (1M-row SUM: ~100–500 ms → ~1–5 ms).
Rolled out incrementally: fast native paths with safe Python fallbacks.

The problem

As we're working through refactoring Opteryx as we remove Arrow, the first pass was 'get it to work'; we're now looking at performance stats and seeing where working isn't good enough.

Aggregates stood out as we looked at how we performed compared to other engines.

It turns out we were burning and churning CPU and memory turning raw bytes into Python objects so we could aggregate them.

Say that out loud. It sounds ridiculous. It felt worse when we measured it.

For a 1M-row int64 column the old flow was effectively: materialise a million Python integers, then iterate over them under the GIL.

The fix

We moved aggregation into native, type-specific code.

Each vector now exposes sum(), min() and max() where it makes sense. The implementations live in Cython and operate directly on Draken buffers — no Python objects, no extra allocations, just tight loops over memory.

Conceptually, the inner loop looks like:

for i in range(n):
    if null_bitmap and not bitmap_is_valid(null_bitmap, i):
        continue
    total += data[i]

Nothing clever — just doing the obvious thing in the right place.

What made it work

A few patterns made this both fast and safe.

Nulls Nulls are tracked with a packed bitmap (one bit per row). The check is cheap: a couple of pointer reads and bit ops. It stays in cache and doesn’t disrupt the loop too much.

Encoding-aware execution We don’t force everything into a dense representation:

DENSE → direct pointer loop (fastest)
CONSTANT → multiply once, no loop
DICTIONARY → aggregate values, weight by codes
RLE → operate on runs without expanding to rows

Dense numeric columns take the fast path, but everything else still behaves correctly without materialising Python objects.

Being explicit about types

Not every aggregate makes sense for every type. We leaned into that.

BoolVector.sum() → counts true values
TimestampVector.sum() → NotImplementedError
StringVector.min() → lexicographic minimum
ArrayVector.sum() → NotImplementedError

No guessing, no implicit behaviour. If it doesn’t make sense, we say so.

10–100×? Really?

For a 1M-row int64 column:

Old path: 1M allocations + interpreter + GIL → ~100–500 ms
New path: one tight loop, no allocations → ~1–5 ms

Exact numbers vary with nulls and encoding, but the pattern holds:

dense numeric → ~100×
encoded / sparse → ~10×

The big win is removing object creation and interpreter overhead entirely.

What this actually is

This isn’t a clever algorithm.

It’s removing a very expensive, repeated translation - this was one of the key drivers for writing Draken.

Same lesson as the memory model work: if you keep converting representations in the hot path, you’ll keep paying for it. Stop doing that, and things get fast quickly.

Engineering takeaways

Measure first — the hot path isn’t always where you expect
Keep tight loops out of Python
Be explicit about types and encodings
Prefer simple, direct implementations
If something doesn’t make sense for a type, fail loudly

-- Justin