10x Faster Memory Management: Optimising Opteryx's Core Memory Pool
TL;DR
A small, surgical change to the memory pool produced a 10x improvement in allocation/commit throughput. We moved metadata tracking out of Python and into a compact C++ structure, preserved the public API, and avoided a large rewrite. The result: much higher throughput, lower variance, and no behavioural changes for users.
The problem
The MemoryPool is central to query execution: it allocates buffers, manages lifetime, supports zero‑copy reads, and compacts segments. For years the pool tracked segment metadata using Python dicts. That was simple and readable — but slow.
In a tight allocate→read→release loop (the exact pattern used across query plans and streaming workloads), Python hash-table lookups and object overhead dominated the hot path: the metadata lookups themselves were the bottleneck.
The change
We did three things, incrementally and carefully:
- Replaced the Python `dict` used for metadata with a C++ `unordered_map<int64_t, SegmentMetadata>`.
- Moved metadata into a compact C struct (`SegmentMetadata`) with no Python object overhead.
- Kept the public Python API identical; `used_segments` remains a lazily-evaluated Python dict for compatibility.
The key principle was minimalism: replace just the slow part and keep everything else stable.
Why this works
- Metadata access is performance‑critical but implementation‑local. Users call the same APIs; they do not rely on Python `dict` semantics for internal bookkeeping.
- Moving metadata to C++ removes Python interpreter and object costs from the hot path.
- Keeping the public API stable means tests, consumers, and integrations continue to work without change.
We also retained Python RLock for synchronization because C++ template types cannot be embedded in Cython classes in our current layout — a pragmatic compromise that keeps thread-safety intact.
Results
Benchmarks (small allocations: 50k commits of 100 bytes):
- Old implementation: 12,839 ops/sec
- New implementation: 134,104 ops/sec
- Improvement: 10.4x faster
This is a meaningful change, not a micro‑tweak — it shifts the envelope for memory‑bound workloads and reduces variance introduced by the Python runtime.
Where it matters
- Read cache: we're planning to use the MemoryPool as a read-caching layer as part of continual IO-stack improvements — enabling hot-block reuse, reducing physical IO, and improving tail latency for common queries.
- Morsel exchange: during the execution-engine rewrite the pool will act as the morsel exchange between operators, enabling efficient, zero-copy morsel handoffs and clearer ownership boundaries for execution stages.
- Zero‑copy flows: lower latency between producers and consumers when memory handoffs are fast and predictable.
- Classic Opteryx: historically the MemoryPool served as the buffer pool; these planned uses extend that role into caching and operator exchange while preserving the same minimal, native hot path and public API.
How we approached it
This was not a rewrite. The steps were:
- Profile to confirm the real bottleneck (dict lookups and object churn).
- Design a minimal C++ metadata representation and choose a container (`unordered_map<int64_t, SegmentMetadata>`).
- Implement the C++ layer behind the existing Cython/Python bindings.
- Preserve the Python-facing API and lazy compatibility layers.
- Run the full test-suite and benchmark under representative loads.
The result was surgical: small, reviewable changes with a large impact.
The broader lesson
Optimising a mature codebase usually works best as a targeted, incremental effort. Identify the true hot path, replace the implementation with a low‑overhead equivalent, and keep the surrounding behaviour stable. You get the performance gains without the risk and cost of a full rewrite.
If you’re struggling with latency or throughput in a Python project, look for implementation details that are purely internal state — those are often the best places to move into faster languages without changing your public contract.