Hydra DB · BEAM 1M Benchmark Report

Introduction

This report benchmarks Hydra DB on BEAM 1M [1, 3], a purpose-built evaluation of long-term AI memory at the one-million-token scale. BEAM tests ten distinct memory capabilities - including temporal reasoning, cross-session coherence, and contradiction resolution - making it the most comprehensive publicly available evaluation at this context length.

Hydra DB achieved an overall average of 82% across all ten dimensions, with standout results in Temporal Reasoning (91%), Event Ordering (92%), and Preference Following (96%). For context, these results are benchmarked against Hindsight, which has publicly claimed state-of-the-art performance on BEAM, making it the natural reference point.

The results validate Hydra DB's core thesis: modeling knowledge as a versioned, time-aware graph produces measurable gains in the memory dimensions that matter most in production.

The BEAM benchmark

BEAM (Benchmark for Evaluation of AI Memory) is a structured dataset of 100 conversations distributed across four context-length tiers - 128K, 500K, 1M, and 10M tokens. Conversations span general tasks, coding, and mathematics, each constructed to include multi-turn reasoning chains, cross-session dependencies, follow-up questions, and information that must be updated, reconciled, or retrieved from temporally distant points.

2.1 · Ten memory dimensions

Each dimension targets a specific failure mode observed in long-context AI systems:

Abstention - correctly acknowledging when information is unavailable.
Contradiction Resolution - resolving conflicting information appropriately.
Event Ordering - accurately tracking the sequence of events.
Information Extraction - retrieving relevant details accurately from large contexts.
Instruction Following - adhering to user-specified instructions over time.
Knowledge Update - updating knowledge when new facts supersede older ones.
Multi-Session Reasoning - reasoning coherently across distinct sessions.
Preference Following - retaining and applying user preferences.
Summarization - accurately compressing information from long contexts.
Temporal Reasoning - reasoning correctly about time-dependent information.

Architecture

Hydra DB models stored knowledge as a versioned, relational, time-aware graph rather than isolated text fragments. Three integrated components drive its BEAM results:

Sliding Window Inference

A lightweight model resolves references and extracts meaning, converting ambiguous statements into self-contained, independently retrievable facts.

Git-Style Versioned Graph

Updates are appended as new edges rather than overwriting existing ones. All historical states are preserved with timestamps, enabling accurate time-sensitive answers.
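To make the append-only idea concrete, here is a minimal sketch of a timestamped edge log with an "as of" lookup. It is not Hydra DB's implementation: the names `Edge`, `VersionedGraph`, `append`, `as_of`, and `history` are hypothetical, and a flat in-memory list stands in for whatever storage the real system uses.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """One immutable fact: subject --relation--> obj, recorded at a point in time."""
    subject: str
    relation: str
    obj: str
    recorded_at: datetime


class VersionedGraph:
    """Hypothetical append-only edge log: updates add edges, nothing is overwritten."""

    def __init__(self) -> None:
        self._edges: list[Edge] = []

    def append(self, subject: str, relation: str, obj: str, recorded_at: datetime) -> None:
        # An update is just another edge; older edges stay in the log.
        self._edges.append(Edge(subject, relation, obj, recorded_at))

    def as_of(self, subject: str, relation: str, when: datetime) -> Optional[str]:
        """Return the value that was current at `when`, or None if nothing was recorded yet."""
        candidates = [
            e for e in self._edges
            if e.subject == subject and e.relation == relation and e.recorded_at <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda e: e.recorded_at).obj

    def history(self, subject: str, relation: str) -> list[Edge]:
        """Return every recorded state for the pair, oldest first."""
        matching = [e for e in self._edges if e.subject == subject and e.relation == relation]
        return sorted(matching, key=lambda e: e.recorded_at)


graph = VersionedGraph()
graph.append("user", "lives_in", "Berlin", datetime(2024, 1, 10))
graph.append("user", "lives_in", "Lisbon", datetime(2025, 3, 2))
```

Under these assumptions, `graph.as_of("user", "lives_in", datetime(2024, 6, 1))` returns the earlier value while a later date returns the update, which is the behaviour time-sensitive questions rely on.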

Multi-Stage Retrieval

The retriever expands queries, combines dense and sparse retrieval, traverses graph relationships, and applies multi-stage reranking.
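The report does not say how the dense and sparse result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below purely as an illustration. The function name, the fact identifiers, and the constant `k = 60` are assumptions, not details taken from Hydra DB.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of ids; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical fact ids returned by a dense (embedding) and a sparse (keyword) retriever.
dense_hits = ["f3", "f1", "f7"]
sparse_hits = ["f1", "f9", "f3"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Items that appear in both lists (`f1`, `f3`) rise above items found by only one retriever, which is the usual motivation for fusing the two before graph traversal and reranking.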

The same architecture that was evaluated on LongMemEval-s is applied here at the million-token scale.

Methodology

To enable a meaningful external comparison, we adopted the same evaluation configuration used in Hindsight's published results - answer-generation and LLM-judge prompts taken directly from Hindsight's original benchmark configuration, without modification. GPT-5.4 served as the evaluation judge across all ten dimensions. Using a shared configuration means differences reflect genuine performance rather than methodological variation.

Results

5.1 · Multi-dimensional profile

The shaded area for Hydra DB extends beyond Hindsight across the majority of dimensions, with the most pronounced expansion in Temporal Reasoning, Information Extraction, and Multi-Session Reasoning.

Figure 1. 10-dimension profile: radar comparison across all ten BEAM 1M memory dimensions. Legend: Hydra DB · 82%, Hindsight · 74%.

Memory dimension	Hindsight	Hydra DB	Δ
Temporal Reasoning	60	91	+31
Information Extraction	61	80	+19
Multi-Session Reasoning	46	60	+14
Event Ordering	81	92	+11
Contradiction Resolution	59	66	+7
Summarization	84	88	+4
Abstention	90	89	−1
Instruction Following	93	92	−1
Preference Following	97	96	−1
Knowledge Update	66	63	−3
Overall average	74	82	+8

Table 1. Side-by-side scores. Positive Δ indicates a Hydra DB advantage.
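The arithmetic behind Table 1 can be re-derived with a short script. The per-dimension scores are copied from the table. Treating the overall average as an unweighted mean of the ten dimension scores is an assumption, since the report does not state how it is computed.

```python
# (Hindsight, Hydra DB) scores in percent, copied from Table 1.
scores = {
    "Temporal Reasoning": (60, 91),
    "Information Extraction": (61, 80),
    "Multi-Session Reasoning": (46, 60),
    "Event Ordering": (81, 92),
    "Contradiction Resolution": (59, 66),
    "Summarization": (84, 88),
    "Abstention": (90, 89),
    "Instruction Following": (93, 92),
    "Preference Following": (97, 96),
    "Knowledge Update": (66, 63),
}

# Positive delta means a Hydra DB advantage, in percentage points.
deltas = {name: hydra - hindsight for name, (hindsight, hydra) in scores.items()}

# Assumed aggregation: unweighted mean over the ten dimensions.
hindsight_mean = sum(hindsight for hindsight, _ in scores.values()) / len(scores)
hydra_mean = sum(hydra for _, hydra in scores.values()) / len(scores)

hydra_leads = [name for name, delta in deltas.items() if delta > 0]
```

Under that assumption the unrounded means are 73.7 and 81.7, which round to the reported 74 and 82, and Hydra DB leads on six of the ten dimensions.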

5.2 · Analysis of key findings

Hydra DB's most substantial advantages appear in dimensions requiring reasoning over time or across structurally distant information:

Temporal Reasoning (+31pp). The versioned knowledge graph natively preserves the temporal context of every stored fact, enabling accurate answers about when information was introduced or changed - something flat retrieval cannot replicate.
Information Extraction (+19pp). The sliding-window inference pipeline ensures retrieved facts are self-contained and contextually enriched, reducing failures from ambiguous or underspecified fragments.
Multi-Session Reasoning (+14pp). By preserving graph relationships across sessions rather than treating each as an isolated pool, coherence is maintained over long time horizons.
Event Ordering (+11pp). Timestamped graph edges reconstruct event sequences with high fidelity, even when events are distributed across distant parts of the context.

On four dimensions the gap is within one to three points, with Hindsight marginally ahead (Abstention, Instruction Following, Knowledge Update, Preference Following) - capabilities less sensitive to temporal and structural reasoning, where both systems perform comparably.

Conclusion

Hydra DB achieves a state-of-the-art overall score of 82% on BEAM 1M, outperforming the Hindsight baseline by 8 points. The advantage is concentrated in memory dimensions that require temporal awareness, cross-session coherence, and retrieval of structurally related information - precisely the capabilities Hydra DB's architecture is designed to address.

Treating stored knowledge as a versioned, time-aware graph yields measurable improvements over flat or stateless retrieval at scale.

References

BEAM: Benchmark for Evaluation of AI Memory. Tavakoli, M. et al. (2025). arXiv:2510.27246
BEAM Dataset Repository. Tavakoli, M. et al. - GitHub (2025). github.com/mohammadtavakoli78/BEAM
Hydra DB: A Context Engine for Long-Term AI Memory. Technical White Paper (2026). benchmarks.hydradb.com/HydraDB.pdf
Hindsight: Long-Term Memory for AI Systems. (2026). arXiv:2512.12818 · agentmemorybenchmark.ai

Cite this work

@techreport{hydradb-beam2026,
  title  = {BEAM 1M: Memory at the Million-Token Scale},
  author = {{Hydra DB Research Team}},
  institution = {Hydra DB},
  address = {San Francisco, California, USA},
  year   = {2026},
  note   = {BEAM 1M overall 82\% (state of the art)},
  url    = {https://benchmarks.hydradb.com/beam.pdf}
}
