HYDRADB Research
Benchmark Report · Long-Term Memory

BEAM 1M - Memory at the Million-Token Scale

A ten-dimension evaluation of long-term AI memory at one million tokens. Hydra DB scores 82% overall against Hindsight's 74%, led by the temporal and cross-session dimensions its architecture is built for.

  • 82% · Overall average across 10 dimensions
  • +8 · Points over Hindsight (74%)
  • +31 · Temporal reasoning vs Hindsight
  • 1M · Token context tier evaluated
01

Introduction

This report benchmarks Hydra DB on BEAM 1M [1, 3], a purpose-built evaluation of long-term AI memory at the one-million-token scale. BEAM tests ten distinct memory capabilities - including temporal reasoning, cross-session coherence, and contradiction resolution - which makes it the most comprehensive publicly available evaluation at this context length.

Hydra DB achieved an overall average of 82% across all ten dimensions, with standout results in Temporal Reasoning (91%), Event Ordering (92%), and Preference Following (96%). For context, these results are benchmarked against Hindsight, which has publicly claimed state-of-the-art performance on BEAM, making it the natural reference point.

The results validate Hydra DB's core thesis: modeling knowledge as a versioned, time-aware graph produces measurable gains in the memory dimensions that matter most in production.
02

The BEAM benchmark

BEAM (Benchmark for Evaluation of AI Memory) is a structured dataset of 100 conversations distributed across four context-length tiers - 128K, 500K, 1M, and 10M tokens. Conversations span general tasks, coding, and mathematics, each constructed to include multi-turn reasoning chains, cross-session dependencies, follow-up questions, and information that must be updated, reconciled, or retrieved from temporally distant points.

2.1 · Ten memory dimensions

Each dimension targets a specific failure mode observed in long-context AI systems:

  • Abstention - correctly acknowledging when information is unavailable.
  • Contradiction Resolution - resolving conflicting information appropriately.
  • Event Ordering - accurately tracking the sequence of events.
  • Information Extraction - retrieving relevant details accurately from large contexts.
  • Instruction Following - adhering to user-specified instructions over time.
  • Knowledge Update - updating knowledge when new facts supersede older ones.
  • Multi-Session Reasoning - reasoning coherently across distinct sessions.
  • Preference Following - retaining and applying user preferences.
  • Summarization - accurately compressing information from long contexts.
  • Temporal Reasoning - reasoning correctly about time-dependent information.
03

Architecture

Hydra DB models stored knowledge as a versioned, relational, time-aware graph rather than isolated text fragments. Three integrated components drive its BEAM results:

  • Sliding Window Inference - A lightweight model resolves references and extracts meaning, converting ambiguous statements into self-contained, independently retrievable facts.
  • Git-Style Versioned Graph - Updates are appended as new edges rather than overwriting existing ones. All historical states are preserved with timestamps, enabling accurate time-sensitive answers.
  • Multi-Stage Retrieval - Expands queries, combines dense and sparse retrieval, traverses graph relationships, and applies multi-stage reranking.

The same architecture was evaluated on LongMemEval-s and is applied here at the million-token scale.
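
The report does not say how the dense and sparse result lists are merged before reranking. As one common option, the sketch below uses reciprocal rank fusion (RRF). This is a generic technique swapped in for illustration, not Hydra DB's confirmed method. The function name, the document IDs, and the constant k = 60 are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense (embedding) and a sparse (keyword) retriever.
dense_hits = ["fact_a", "fact_b", "fact_c"]
sparse_hits = ["fact_b", "fact_d", "fact_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['fact_b', 'fact_a', 'fact_d', 'fact_c']
```

Facts that rank well in both lists rise to the top, while facts found by only one retriever are still kept as candidates for a later reranking stage.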
04

Methodology

To enable a meaningful external comparison, we adopted the same evaluation configuration used in Hindsight's published results - answer-generation and LLM-judge prompts taken directly from Hindsight's original benchmark configuration, without modification. GPT-5.4 served as the evaluation judge across all ten dimensions. Using a shared configuration means differences reflect genuine performance rather than methodological variation.

05

Results

5.1 · Multi-dimensional profile

The shaded area for Hydra DB extends beyond Hindsight across the majority of dimensions, with the most pronounced expansion in Temporal Reasoning, Information Extraction, and Multi-Session Reasoning.

Figure 1. 10-dimension profile. Radar comparison across all ten BEAM 1M memory dimensions (Hydra DB · 82%, Hindsight · 74%).
Memory dimension            Hindsight   Hydra DB      Δ
Temporal Reasoning                 60         91    +31
Information Extraction             61         80    +19
Multi-Session Reasoning            46         60    +14
Event Ordering                     81         92    +11
Contradiction Resolution           59         66     +7
Summarization                      84         88     +4
Abstention                         90         89     −1
Instruction Following              93         92     −1
Preference Following               97         96     −1
Knowledge Update                   66         63     −3
Overall average                    74         82     +8
Table 1. Side-by-side scores. Positive Δ indicates a Hydra DB advantage.
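
Assuming the overall average is an unweighted mean of the ten dimension scores (the report does not state the aggregation explicitly), the deltas and overall figures in Table 1 can be rechecked with a short sketch. The scores are transcribed from the table; the variable names are illustrative.

```python
# Scores transcribed from Table 1: dimension -> (Hindsight, Hydra DB), in percent.
scores = {
    "Temporal Reasoning": (60, 91),
    "Information Extraction": (61, 80),
    "Multi-Session Reasoning": (46, 60),
    "Event Ordering": (81, 92),
    "Contradiction Resolution": (59, 66),
    "Summarization": (84, 88),
    "Abstention": (90, 89),
    "Instruction Following": (93, 92),
    "Preference Following": (97, 96),
    "Knowledge Update": (66, 63),
}

# Per-dimension delta: positive means a Hydra DB advantage.
deltas = {name: hydra - hindsight for name, (hindsight, hydra) in scores.items()}

# Unweighted means across the ten dimensions (an assumed aggregation).
hindsight_mean = sum(hindsight for hindsight, _ in scores.values()) / len(scores)
hydra_mean = sum(hydra for _, hydra in scores.values()) / len(scores)

for name, delta in deltas.items():
    print(f"{name:<26}{delta:+d}")
print(hindsight_mean, hydra_mean)  # 73.7 81.7
```

Under that assumption the means come out at 73.7 and 81.7, which round to the reported 74% and 82%.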

5.2 · Analysis of key findings

Hydra DB's most substantial advantages appear in dimensions requiring reasoning over time or across structurally distant information:

  • Temporal Reasoning (+31pp). The versioned knowledge graph natively preserves the temporal context of every stored fact, enabling accurate answers about when information was introduced or changed - something flat retrieval cannot replicate.
  • Information Extraction (+19pp). The sliding-window inference pipeline ensures retrieved facts are self-contained and contextually enriched, reducing failures from ambiguous or underspecified fragments.
  • Multi-Session Reasoning (+14pp). Hydra DB preserves graph relationships across sessions rather than treating each session as an isolated pool, which maintains coherence over long time horizons.
  • Event Ordering (+11pp). Timestamped graph edges reconstruct event sequences with high fidelity, even when events are distributed across distant parts of the context.
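
The idea behind the temporal and event-ordering bullets above, appending timestamped edges instead of overwriting, can be illustrated with a minimal sketch. This is not Hydra DB's implementation: the `Edge` and `VersionedGraph` classes, their method names, and the sample facts are hypothetical, and a real system would add indexing, persistence, and richer relations.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    value: str
    recorded_at: datetime


class VersionedGraph:
    """Append-only edge log: updates add new edges, nothing is overwritten."""

    def __init__(self) -> None:
        self._edges = []

    def add(self, subject: str, relation: str, value: str, recorded_at: datetime) -> None:
        self._edges.append(Edge(subject, relation, value, recorded_at))

    def as_of(self, subject: str, relation: str, when: datetime) -> Optional[str]:
        """Return the value that was current at `when`, or None if none existed yet."""
        candidates = [
            e for e in self._edges
            if e.subject == subject and e.relation == relation and e.recorded_at <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda e: e.recorded_at).value

    def timeline(self, subject: str) -> list:
        """All edges for a subject in chronological order."""
        return sorted(
            (e for e in self._edges if e.subject == subject),
            key=lambda e: e.recorded_at,
        )


graph = VersionedGraph()
graph.add("user", "lives_in", "Berlin", datetime(2024, 1, 10))
graph.add("user", "lives_in", "Lisbon", datetime(2024, 6, 2))

print(graph.as_of("user", "lives_in", datetime(2024, 3, 1)))   # Berlin
print(graph.as_of("user", "lives_in", datetime(2024, 12, 1)))  # Lisbon
print([e.value for e in graph.timeline("user")])               # ['Berlin', 'Lisbon']
```

Because the older edge is never deleted, the same store can answer both "where does the user live now?" and "where did the user live in March?", and the sorted timeline gives the event order directly.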

On four dimensions the gap is within one to three points, with Hindsight marginally ahead (Abstention, Instruction Following, Knowledge Update, Preference Following). These capabilities are less sensitive to temporal and structural reasoning, and both systems perform comparably on them.

06

Conclusion

Hydra DB achieves a state-of-the-art overall score of 82% on BEAM 1M, outperforming the Hindsight baseline by 8 points. The advantage is concentrated in memory dimensions that require temporal awareness, cross-session coherence, and retrieval of structurally related information - precisely the capabilities Hydra DB's architecture is designed to address.

Treating stored knowledge as a versioned, time-aware graph yields measurable improvements over flat or stateless retrieval at scale.

References

  1. BEAM: Benchmark for Evaluation of AI Memory. Tavakoli, M. et al. (2025). arXiv:2510.27246
  2. BEAM Dataset Repository. Tavakoli, M. et al. - GitHub (2025). github.com/mohammadtavakoli78/BEAM
  3. Hydra DB: A Context Engine for Long-Term AI Memory. Technical White Paper (2026). benchmarks.hydradb.com/HydraDB.pdf
  4. Hindsight: Long-Term Memory for AI Systems. (2026). arXiv:2512.12818 · agentmemorybenchmark.ai

Cite this work

@techreport{hydradb-beam2026,
  title  = {BEAM 1M: Memory at the Million-Token Scale},
  author = {{Hydra DB Research Team}},
  institution = {Hydra DB},
  address = {San Francisco, California, USA},
  year   = {2026},
  note   = {BEAM 1M overall 82\% (state of the art)},
  url    = {https://benchmarks.hydradb.com/beam.pdf}
}