100× Aggregates
TL;DR
- Removed Python object materialisation from common aggregates (SUM/MIN/MAX).
- Implemented type-specific native methods in Cython over Draken buffers.
- Dense numeric columns now run ~10–100× faster (1M-row SUM: ~100–500 ms → ~1–5 ms).
- Rolled out incrementally: fast native paths with safe Python fallbacks.
The problem
As we're working through refactoring Opteryx as we remove Arrow, the first pass was 'get it to work'; we're now looking at performance stats and seeing where working isn't good enough.
Aggregates stood out as we looked at how we performed compared to other engines.
It turns out we were burning and churning CPU and memory turning raw bytes into Python objects so we could aggregate them.
Say that out loud. It sounds ridiculous. It felt worse when we measured it.
For a 1M-row int64 column the old flow was effectively: materialise a million Python integers, then iterate over them under the GIL.
The fix
We moved aggregation into native, type-specific code.
Each vector now exposes sum(), min() and max() where it makes sense. The implementations live in Cython and operate directly on Draken buffers — no Python objects, no extra allocations, just tight loops over memory.
Conceptually, the inner loop looks like:
for i in range(n):
if null_bitmap and not bitmap_is_valid(null_bitmap, i):
continue
total += data[i]Nothing clever — just doing the obvious thing in the right place.
What made it work
A few patterns made this both fast and safe.
Nulls Nulls are tracked with a packed bitmap (one bit per row). The check is cheap: a couple of pointer reads and bit ops. It stays in cache and doesn’t disrupt the loop too much.
Encoding-aware execution We don’t force everything into a dense representation:
- DENSE → direct pointer loop (fastest)
- CONSTANT → multiply once, no loop
- DICTIONARY → aggregate values, weight by codes
- RLE → operate on runs without expanding to rows
Dense numeric columns take the fast path, but everything else still behaves correctly without materialising Python objects.
Being explicit about types
Not every aggregate makes sense for every type. We leaned into that.
BoolVector.sum()→ countstruevaluesTimestampVector.sum()→NotImplementedErrorStringVector.min()→ lexicographic minimumArrayVector.sum()→NotImplementedError
No guessing, no implicit behaviour. If it doesn’t make sense, we say so.
10–100×? Really?
For a 1M-row int64 column:
- Old path: 1M allocations + interpreter + GIL → ~100–500 ms
- New path: one tight loop, no allocations → ~1–5 ms
Exact numbers vary with nulls and encoding, but the pattern holds:
- dense numeric → ~100×
- encoded / sparse → ~10×
The big win is removing object creation and interpreter overhead entirely.
What this actually is
This isn’t a clever algorithm.
It’s removing a very expensive, repeated translation - this was one of the key drivers for writing Draken.
Same lesson as the memory model work: if you keep converting representations in the hot path, you’ll keep paying for it. Stop doing that, and things get fast quickly.
Engineering takeaways
- Measure first — the hot path isn’t always where you expect
- Keep tight loops out of Python
- Be explicit about types and encodings
- Prefer simple, direct implementations
- If something doesn’t make sense for a type, fail loudly
-- Justin