Building a specialized hash table to beat Abseil
TL;DR
Our custom hash table achieves higher build throughput than Abseil (8333.33 vs 5000.00 Mops/s) but lags on the probe path (500.00 vs 641.03 Mops/s). The goal remains clear: beat Abseil end-to-end through targeted micro-optimisations.
Motivation
In query engines, hash sets are used for DISTINCT, ANTI JOIN and SEMI JOIN operations. The performance profile varies significantly:
- Build-heavy workloads: One large ingest followed by many smaller probes
- Probe-heavy workloads: Many lookups on a static set
Our query engine is mostly single-threaded, and we wanted control over the memory layout and allocation patterns that dominate real workloads. We built a custom hash table with one goal: beat Abseil, both in build time and overall latency.
What we built
Our custom hash table design centres on three key primitives:
- Specialized build path: bulk allocation patterns and batch inserts reduce per-element overhead
- Compact memory layout: sized for construction up front, giving better allocation locality during bulk inserts
- Probe path: a tight loop with explicit memory accesses; currently favours simplicity and a build-friendly layout
The build path benefits from reduced branching and contiguous allocations. The probe path, however, remains the bottleneck: it is where our implementation still lags behind Abseil.
Benchmark methodology
Hardware & environment
- Machine: M5 Pro Mac, 18 cores, 64GB RAM
- OS: macOS (development environment)
Principles
- Run in development environment (results subject to noise)
- Fix CPU frequency scaling where possible
- Use pinning/affinity to avoid cross-core migration noise
- Separate build and probe measurements to avoid interference
- Repeat runs (N ≥ 10) and report median + IQR; keep raw outputs
Benchmark results
| Implementation | Build mean | Build Mops/s | Probe mean | Probe Mops/s | Probe/Build time ratio |
|---|---|---|---|---|---|
| Our impl | 0.12 ms | 8333.33 | 2.00 ms | 500.00 | ~16.7 |
| Abseil | 0.20 ms | 5000.00 | 1.56 ms | 641.03 | ~7.8 |
Observations:
- Our impl builds 40% faster (0.12 ms vs 0.20 ms), while Abseil's probe throughput is ~28% higher (641.03 vs 500.00 Mops/s)
- The probe/build time ratio (16.7 vs 7.8) shows our impl is skewed further toward build-heavy workloads than Abseil
- All results come from a development environment and are subject to noise; full reproducibility will require dedicated hardware
Probe analysis
The probe path is where the performance gap persists. Profiling revealed:
- Hotspot: probe loop memory access at offset X — high L1/L2 misses
- Branch mispredicts occur in fallback path after N probes
- Abseil's layout provides better locality for the probe path in current inputs
Interpretation: Cache locality and prefetching matter much more for the probe path than aggressive build optimisations. Small changes to layout or prefetch strategy could close most of the gap.
Optimisations attempted
We've tested several approaches:
- Memory layout tweaks: moved frequently accessed probe metadata adjacent to keys; modest gains
- Prefetching: explicit software prefetch of candidate buckets before access; moderate improvements
- Probing strategy: compared linear, quadratic, and hopscotch-like probing; quadratic remained fastest
- Hash-function tuning: aligned the hash with power-of-two table sizes; negligible impact
- Cache-friendly batching: small batched probes to amortise misses; mixed results
Best so far: the combination of layout adjustments and prefetching reduced probe latency by ~30%.
Conclusion
For query-engine workloads that favour build-heavy and repeated-use patterns, a specialized hash table is a practical and high-impact optimisation. Our implementation delivers faster build times but lags on the probe path. With targeted micro-optimisations — layout adjustments, prefetching, and probing strategy tuning — we're closing the probe gap. The goal is clear: beat Abseil end-to-end.
What's next
Immediate (this week)
- Finalise and validate the benchmark harness; ensure runs are reproducible
- Run automated sweeps over representative workloads
- Implement the best optimisation (layout + prefetch) and re-benchmark
Short-term (1–2 weeks)
- Add stress tests, varied input shapes, and memory pressure tests
- Document results and publish the blog post with graphs
Medium-term
- Remove Abseil dependency in the hash set code path after parity/benefit is verified
- Explore whether probe-heavy use cases warrant a different optimisation strategy
Call to action for readers
If you have experience with hash table benchmarks or are working on similar performance issues, we'd love to hear about it. Share results, PRs, and contributions if you find improvements.
