Redesigning the String Header — Prefix or Hash?
TL;DR
- The textbook long-string layout (length + prefix + pointer) is elegant, but not ideal for our workload.
- Our hot paths are equality-heavy (
JOIN,GROUP BY,DISTINCT,IN) and often hit shared prefixes. - We now store long strings as
[length][xxhash32][prefix][arena offset]. - That gives us better early rejection for inequality checks, without widening the header.
The context
Opteryx stores every string in a fixed 16-byte slot.
This is the format often called Umbra, or German Strings: short values inline, longer values out-of-line.
That fixed-width slot is non-negotiable. It keeps vector operations predictable and cache behaviour sane.
The real design work is deciding what information those 16 bytes should carry.
The textbook long-string layout
The usual long-string layout is straightforward:
[length 4B][prefix 4B][pointer 8B]
It’s a good default.
You get:
- length for cheap mismatch checks
- a prefix shortcut for early divergence
- a pointer to the full bytes
If your data diverges early, prefix checks avoid lots of arena reads.
If your data shares common starts (https://, CVE-, or repeated tenant IDs), prefix checks stop helping quickly.
That is exactly where our workloads spend time.
Why we didn’t keep the textbook version
Because our bottleneck is not usually "find lexical order quickly".
Our bottleneck is "prove equality quickly, millions of times".
In Opteryx, the dominant string work is:
- hash join probes
- grouping key comparisons
- dedup checks
INmembership checks
Those are equality decisions.
So I feel the right question is: what metadata helps us reject non-matches earliest, with the fewest memory touches?
A prefix helps sometimes.
A hash helps more often.
And we had one more lever: our arena reference can be a 32-bit offset (morsel-local), not a full 64-bit pointer.
That gives us an opportunity to free 4 bytes in the same 16-byte budget.
What we changed
For long strings, the header is now:
[length 4B][xxhash32 4B][prefix 4B][arena offset 4B]
So, same 16 bytes. Different payload.
At comparison time, we can reject in stages:
- length mismatch → reject
- hash mismatch → reject
- prefix mismatch → reject
- only then touch arena bytes
That sequence is cheap, branch-light, and friendly to batch execution.
This is the key difference from the textbook implementation: we use the fixed-width slot to carry two inequality filters (hash + prefix), not one.
Short strings stay simple
For length <= 12, nothing changes:
[length 4B][inline bytes 12B]
No arena access.
No indirection.
No drama.
Where the gain appears
This pays off where we expected:
JOINon string keysGROUP BYstring dimensionsDISTINCTon string columnsIN (...)filters over string literals
These paths do massive numbers of inequality checks.
The redesigned long-string header lets us eliminate most expensive byte fetches before they start.
The broader point
The structure didn’t change.
The information density did.
Textbook layouts are great starting points. They’re not sacred.
When workload shape is clear, you should spend your 16 bytes on the questions your engine asks most often.
For us, that question is equality.
So that’s what we optimised.
— Justin