10x Faster Memory Management: Optimising Opteryx's Core Memory Pool
TL;DR
A small, surgical change to the memory pool produced a 10x improvement in allocation/commit throughput. We moved metadata tracking out of Python and into a compact C++ structure, preserved the public API, and avoided a large rewrite. The result: much higher throughput, lower variance, and no behavioural changes for users.
The problem
The MemoryPool is central to query execution: it allocates buffers, manages lifetime, supports zero‑copy reads, and compacts segments. For years the pool tracked segment metadata using Python dicts. That was simple and readable — but slow.
In a tight allocate→read→release loop (the exact pattern used across query plans and streaming workloads), Python hash-table lookups and object overhead dominated the hot path: the metadata lookups themselves were the bottleneck.
The change
We did three things, incrementally and carefully:
- Replaced the Python `dict` used for metadata with a C++ `unordered_map<int64_t, SegmentMetadata>`.
- Moved metadata into a compact C struct (`SegmentMetadata`) with no Python object overhead.
- Kept the public Python API identical; `used_segments` remains a lazily-evaluated Python dict for compatibility.
The key principle was minimalism: replace just the slow part and keep everything else stable.
Why this works
- Metadata access is performance‑critical but implementation‑local. Users call the same APIs; they do not rely on Python `dict` semantics for internal bookkeeping.
- Moving metadata to C++ removes Python interpreter and object costs from the hot path.
- Keeping the public API stable means tests, consumers, and integrations continue to work without change.
We also retained Python RLock for synchronization because C++ template types cannot be embedded in Cython classes in our current layout — a pragmatic compromise that keeps thread-safety intact.
Results
Benchmarks (small allocations: 50k commits of 100 bytes):
- Old implementation: 12,839 ops/sec
- New implementation: 134,104 ops/sec
- Improvement: 10.4x faster
This is a meaningful change, not a micro‑tweak — it shifts the envelope for memory‑bound workloads and reduces variance introduced by the Python runtime.
Where it matters
- Read cache: we're planning to use the MemoryPool as a read-caching layer as part of continual IO-stack improvements — enabling hot-block reuse, reducing physical IO, and improving tail latency for common queries.
- Morsel exchange: during the execution-engine rewrite the pool will act as the morsel exchange between operators, enabling efficient, zero-copy morsel handoffs and clearer ownership boundaries for execution stages.
- Zero‑copy flows: lower latency between producers and consumers when memory handoffs are fast and predictable.
- Classic Opteryx: historically the MemoryPool served as the buffer pool; these planned uses extend that role into caching and operator exchange while preserving the same minimal, native hot path and public API.
How we approached it
This was not a rewrite. The steps were:
- Profile to confirm the real bottleneck (dict lookups and object churn).
- Design a minimal C++ metadata representation and choose a container (`unordered_map<int64_t, SegmentMetadata>`).
- Implement the C++ layer behind the existing Cython/Python bindings.
- Preserve the Python-facing API and lazy compatibility layers.
- Run the full test-suite and benchmark under representative loads.
The result was surgical: small, reviewable changes with a large impact.
The broader lesson
Optimising a mature codebase usually works best as a targeted, incremental effort. Identify the true hot path, replace the implementation with a low‑overhead equivalent, and keep the surrounding behaviour stable. You get the performance gains without the risk and cost of a full rewrite.
If you’re struggling with latency or throughput in a Python project, look for implementation details that are purely internal state — those are often the best places to move into faster languages without changing your public contract.