Introduction
This report benchmarks Hydra DB on BEAM 1M, a purpose-built evaluation of long-term AI memory at the one-million-token scale. BEAM tests ten distinct memory capabilities - including temporal reasoning, cross-session coherence, and contradiction resolution - representing the most comprehensive publicly available evaluation at this context length.
Hydra DB achieved an overall average of 82% across all ten dimensions, with standout results in Temporal Reasoning (91%), Event Ordering (92%), and Preference Following (96%). For context, we compare these results against Hindsight, which has publicly claimed state-of-the-art performance on BEAM and is therefore the natural reference point.
The BEAM benchmark
BEAM (Benchmark for Evaluation of AI Memory) is a structured dataset of 100 conversations distributed across four context-length tiers - 128K, 500K, 1M, and 10M tokens. Conversations span general tasks, coding, and mathematics, each constructed to include multi-turn reasoning chains, cross-session dependencies, follow-up questions, and information that must be updated, reconciled, or retrieved from temporally distant points.
2.1 · Ten memory dimensions
Each dimension targets a specific failure mode observed in long-context AI systems:
- Abstention - correctly acknowledging when information is unavailable.
- Contradiction Resolution - resolving conflicting information appropriately.
- Event Ordering - accurately tracking the sequence of events.
- Information Extraction - retrieving relevant details accurately from large contexts.
- Instruction Following - adhering to user-specified instructions over time.
- Knowledge Update - updating knowledge when new facts supersede older ones.
- Multi-Session Reasoning - reasoning coherently across distinct sessions.
- Preference Following - retaining and applying user preferences.
- Summarization - accurately compressing information from long contexts.
- Temporal Reasoning - reasoning correctly about time-dependent information.
Architecture
Hydra DB models stored knowledge as a versioned, relational, time-aware graph rather than isolated text fragments. Three integrated components drive its BEAM results:
Methodology
To enable a meaningful external comparison, we adopted the same evaluation configuration used in Hindsight's published results - answer-generation and LLM-judge prompts taken directly from Hindsight's original benchmark configuration, without modification. GPT-5.4 served as the evaluation judge across all ten dimensions. Using a shared configuration means differences reflect genuine performance rather than methodological variation.
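The shape of such a shared evaluation loop can be sketched as follows. This is a minimal illustration, not the harness used for this report: `Item`, `evaluate`, `toy_judge`, and `toy_system` are hypothetical names, the sample questions are invented, and the judge is a substring-match stand-in for an LLM judge, assumed here to return a score between 0 and 1 per question. The point is only that the same questions and the same judge are applied to every system, so per-dimension scores are directly comparable.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Item(NamedTuple):
    dimension: str   # one of the ten memory dimensions
    question: str
    reference: str   # reference answer the judge compares against


# Hypothetical judge signature: (question, reference, answer) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def evaluate(items: Iterable[Item],
             answer_fn: Callable[[str], str],
             judge: Judge) -> Dict[str, float]:
    """Score one memory system: mean judge score per dimension, in percent."""
    per_dim = defaultdict(list)
    for item in items:
        answer = answer_fn(item.question)
        per_dim[item.dimension].append(judge(item.question, item.reference, answer))
    return {dim: 100 * sum(s) / len(s) for dim, s in per_dim.items()}


def toy_judge(question: str, reference: str, answer: str) -> float:
    """Stand-in for an LLM judge: 1.0 if the reference appears in the answer."""
    return 1.0 if reference.lower() in answer.lower() else 0.0


# Invented sample data, for illustration only.
items = [
    Item("Temporal Reasoning", "When did the deadline move?", "March 2"),
    Item("Temporal Reasoning", "When was the API first mentioned?", "January 10"),
    Item("Abstention", "What is the user's shoe size?", "not mentioned"),
]


def toy_system(question: str) -> str:
    canned = {
        "When did the deadline move?": "It moved on March 2.",
        "When was the API first mentioned?": "Some time in February.",
        "What is the user's shoe size?": "That was not mentioned in the conversation.",
    }
    return canned[question]


scores = evaluate(items, toy_system, toy_judge)
print(scores)
```

Swapping `toy_system` for a second system while keeping `items` and `toy_judge` fixed mirrors the design choice described above: only the memory system varies between runs.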
Results
5.1 · Multi-dimensional profile
The shaded area for Hydra DB extends beyond Hindsight's on six of the ten dimensions, with the most pronounced expansion in Temporal Reasoning, Information Extraction, and Multi-Session Reasoning. Scores are percentages; Δ is the difference in percentage points.
| Memory dimension | Hindsight | Hydra DB | Δ |
|---|---|---|---|
| Temporal Reasoning | 60 | 91 | +31 |
| Information Extraction | 61 | 80 | +19 |
| Multi-Session Reasoning | 46 | 60 | +14 |
| Event Ordering | 81 | 92 | +11 |
| Contradiction Resolution | 59 | 66 | +7 |
| Summarization | 84 | 88 | +4 |
| Abstention | 90 | 89 | −1 |
| Instruction Following | 93 | 92 | −1 |
| Preference Following | 97 | 96 | −1 |
| Knowledge Update | 66 | 63 | −3 |
| Overall average | 74 | 82 | +8 |
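As a quick consistency check, the overall averages can be recomputed from the per-dimension scores. The sketch below assumes the overall figure is an unweighted mean of the ten dimension scores (the report does not state the weighting explicitly); `SCORES` and `summarize` are illustrative names.

```python
# Per-dimension scores copied from the table: (Hindsight, Hydra DB).
SCORES = {
    "Temporal Reasoning": (60, 91),
    "Information Extraction": (61, 80),
    "Multi-Session Reasoning": (46, 60),
    "Event Ordering": (81, 92),
    "Contradiction Resolution": (59, 66),
    "Summarization": (84, 88),
    "Abstention": (90, 89),
    "Instruction Following": (93, 92),
    "Preference Following": (97, 96),
    "Knowledge Update": (66, 63),
}


def summarize(scores):
    """Return (Hindsight mean, Hydra DB mean, per-dimension deltas)."""
    n = len(scores)
    hindsight_avg = sum(h for h, _ in scores.values()) / n
    hydra_avg = sum(y for _, y in scores.values()) / n
    deltas = {name: y - h for name, (h, y) in scores.items()}
    return hindsight_avg, hydra_avg, deltas


hindsight_avg, hydra_avg, deltas = summarize(SCORES)
print(f"Hindsight: {hindsight_avg:.1f}  Hydra DB: {hydra_avg:.1f}")
print("Dimensions where Hindsight leads:",
      [name for name, d in deltas.items() if d < 0])
```

Under that assumption the unrounded means are 73.7 and 81.7, which round to the 74 and 82 reported in the last row.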
5.2 · Analysis of key findings
Hydra DB's most substantial advantages appear in dimensions requiring reasoning over time or across structurally distant information:
- Temporal Reasoning (+31pp). The versioned knowledge graph natively preserves the temporal context of every stored fact, enabling accurate answers about when information was introduced or changed - something flat retrieval cannot replicate.
- Information Extraction (+19pp). The sliding-window inference pipeline ensures retrieved facts are self-contained and contextually enriched, reducing failures from ambiguous or underspecified fragments.
- Multi-Session Reasoning (+14pp). By preserving graph relationships across sessions rather than treating each as an isolated pool, coherence is maintained over long time horizons.
- Event Ordering (+11pp). Timestamped graph edges reconstruct event sequences with high fidelity, even when events are distributed across distant parts of the context.
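To make the idea of versioned, timestamped facts concrete, here is a minimal sketch of a store that keeps every version of a fact and can answer "what was true at time t". It is an illustration of the general technique only, not Hydra DB's implementation; `FactVersion`, `VersionedFactStore`, and the sample facts are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FactVersion:
    subject: str
    relation: str
    value: str
    valid_from: datetime  # when this version was stated


class VersionedFactStore:
    """Keeps all versions of each (subject, relation) fact, ordered by time."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, str], List[FactVersion]] = {}

    def assert_fact(self, subject: str, relation: str, value: str, at: datetime) -> None:
        # Older versions are retained rather than overwritten.
        versions = self._versions.setdefault((subject, relation), [])
        versions.append(FactVersion(subject, relation, value, at))
        versions.sort(key=lambda v: v.valid_from)

    def value_at(self, subject: str, relation: str, at: datetime) -> Optional[str]:
        """Return the value that was current at time `at`, or None if unknown."""
        current = None
        for version in self._versions.get((subject, relation), []):
            if version.valid_from > at:
                break
            current = version.value
        return current

    def history(self, subject: str, relation: str) -> List[Tuple[datetime, str]]:
        """Return all versions in chronological order."""
        return [(v.valid_from, v.value)
                for v in self._versions.get((subject, relation), [])]


store = VersionedFactStore()
store.assert_fact("user", "preferred_language", "Python", datetime(2025, 1, 10))
store.assert_fact("user", "preferred_language", "Rust", datetime(2025, 3, 2))

print(store.value_at("user", "preferred_language", datetime(2025, 2, 1)))  # Python
print(store.value_at("user", "preferred_language", datetime(2025, 4, 1)))  # Rust
print(store.history("user", "preferred_language"))
```

Because older versions are never discarded, the same structure can answer both "what is the current value" and "when did it change", which a store holding only the latest text fragment cannot do.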
On four dimensions (Abstention, Instruction Following, Preference Following, Knowledge Update) Hindsight leads by one to three points - these are capabilities less sensitive to temporal and structural reasoning, where both systems perform comparably.
Conclusion
Hydra DB achieves a state-of-the-art overall score of 82% on BEAM 1M, outperforming the Hindsight baseline by 8 points. The advantage is concentrated in memory dimensions that require temporal awareness, cross-session coherence, and retrieval of structurally related information - precisely the capabilities Hydra DB's architecture is designed to address.
References
- BEAM: Benchmark for Evaluation of AI Memory. Tavakoli, M. et al. (2025). arXiv:2510.27246
- BEAM Dataset Repository. Tavakoli, M. et al. - GitHub (2025). github.com/mohammadtavakoli78/BEAM
- Hydra DB: A Context Engine for Long-Term AI Memory. Technical White Paper (2026). benchmarks.hydradb.com/HydraDB.pdf
- Hindsight: Long-Term Memory for AI Systems. (2026). arXiv:2512.12818 · agentmemorybenchmark.ai
Cite this work
@techreport{hydradb-beam2026,
title = {BEAM 1M: Memory at the Million-Token Scale},
author = {{Hydra DB Research Team}},
institution = {Hydra DB},
address = {San Francisco, California, USA},
year = {2026},
note = {BEAM 1M overall 82\% (state of the art)},
url = {https://benchmarks.hydradb.com/beam.pdf}
}