Building a specialized hash table to beat Abseil
TL;DR
Our custom hash table achieves higher build throughput than Abseil (8333.33 vs 5000.00 Mops/s) but lags on the probe path (500.00 vs 641.03 Mops/s). The goal remains clear: beat Abseil end-to-end through targeted micro-optimisations.
Motivation
In query engines, hash sets are used for DISTINCT, ANTI JOIN and SEMI JOIN operations. The performance profile varies significantly:
- Build-heavy workloads: One large ingest followed by many smaller probes
- Probe-heavy workloads: Many lookups on a static set
Our query engine is mostly single-threaded, and we wanted control over the memory layout and allocation patterns that dominate real workloads. We built a custom hash table with one goal: beat Abseil, both in build time and overall latency.
What we built
Our custom hash table design centres on three key primitives:
- Specialized build path: bulk allocation patterns and batch inserts reduce per-element overhead
- Compact memory layout: sized for construction up front, giving better allocation locality during bulk inserts
- Probe path: a tight loop with explicit memory accesses; currently favours simplicity and a build-friendly layout
The build path benefits from reduced branching and contiguous allocations. The probe path, however, remains the bottleneck: it is where our implementation still lags behind Abseil.
Benchmark methodology
Hardware & environment
- Machine: M5 Pro Mac, 18 cores, 64GB RAM
- OS: macOS (development environment)
Principles
- Run in development environment (results subject to noise)
- Fix CPU frequency scaling where possible
- Use pinning/affinity to avoid cross-core migration noise
- Separate build and probe measurements to avoid interference
- Repeat runs (N ≥ 10) and report median + IQR; keep raw outputs
Benchmark results
| Implementation | Build mean | Build Mops/s | Probe mean | Probe Mops/s | Probe/Build time ratio |
|---|---|---|---|---|---|
| Our impl | 0.12 ms | 8333.33 | 2.00 ms | 500.00 | ~16.7 |
| Abseil | 0.20 ms | 5000.00 | 1.56 ms | 641.03 | ~7.8 |
Observations:
- Our impl builds 40% faster (0.12 ms vs 0.20 ms), while Abseil's probe throughput is ~28% higher (641.03 vs 500.00 Mops/s)
- The probe/build time ratio (16.7 vs 7.8) shows our impl is skewed further toward build-heavy workloads than Abseil
- All results come from a development environment and are subject to noise; full reproducibility will require dedicated hardware
Probe analysis
The probe path is where the performance gap persists. Profiling revealed:
- Hotspot: probe loop memory access at offset X — high L1/L2 misses
- Branch mispredicts occur in fallback path after N probes
- Abseil's layout provides better locality for the probe path in current inputs
Interpretation: Cache locality and prefetching matter much more for the probe path than aggressive build optimisations. Small changes to layout or prefetch strategy could close most of the gap.
Optimisations attempted
We've tested several approaches:
- Memory layout tweaks: moved frequently accessed probe metadata adjacent to keys; modest gains
- Prefetching: explicit software prefetch of candidate buckets before access; moderate improvements
- Probing strategy: compared linear, quadratic, and hopscotch-like probing; quadratic remained fastest
- Hash-function tuning: aligned the hash with power-of-two table sizes; negligible impact
- Cache-friendly batching: small batched probes to amortise misses; mixed results
Best so far: the combination of layout adjustments and prefetching reduced probe latency by ~30%.
Conclusion
For query-engine workloads that favour build-heavy and repeated-use patterns, a specialized hash table is a practical and high-impact optimisation. Our implementation delivers faster build times but lags on the probe path. With targeted micro-optimisations — layout adjustments, prefetching, and probing strategy tuning — we're closing the probe gap. The goal is clear: beat Abseil end-to-end.
What's next
Immediate (this week)
- Finalise and validate the benchmark harness; ensure runs are reproducible
- Run automated sweeps over representative workloads
- Implement the best optimisation (layout + prefetch) and re-benchmark
Short-term (1–2 weeks)
- Add stress tests, varied input shapes, and memory pressure tests
- Document results and publish the blog post with graphs
Medium-term
- Remove Abseil dependency in the hash set code path after parity/benefit is verified
- Explore whether probe-heavy use cases warrant a different optimisation strategy
Call to action for readers
If you have experience with hash table benchmarks or are working on similar performance issues, we'd love to hear about it. Share results, PRs, and contributions if you find improvements.
