As Large Language Models transition from ephemeral chat interfaces to persistent agentic systems, two foundational problems remain unsolved. First, agents have no continuity across sessions: every interaction begins from zero, with no awareness of prior decisions, no history of how facts have changed, and no understanding of the user they serve. Second, enterprise knowledge bases remain trapped in flat vector indexes built from fragmented chunks that reduce all knowledge to similarity scores, collapsing the relationships essential for reasoning. We present Hydra DB, a context engine that models knowledge as versioned, relational, time-aware state. Agents ingest operational history and knowledge bases; Hydra DB resolves entities, versions every transition, and preserves the full relational record of how context evolved. At its core it combines a Sliding Window Inference Pipeline for contextually self-contained ingestion with a Git-style versioned knowledge graph that makes every state transition a first-class, addressable record. On LongMemEval-s, Hydra DB reaches a state-of-the-art 90.79%, outperforming all established baselines with robust, model-agnostic performance across Gemini 3.0 Pro, GPT-5.2, and GPT-5 Mini.
Introduction
The current trajectory of AI is defined by a shift from stateless, single-turn interactions to autonomous, multi-turn agentic sessions. While the industry has rapidly expanded LLM context windows, this "brute force" scaling introduces extremely high computational costs [1] and remains susceptible to the "lost-in-the-middle" phenomenon [2], where models increasingly fail to use information buried in the middle of a long context.
Beyond these issues, long-running interactions suffer from context rot [3] - a gradual degradation in the usefulness of earlier context as conversations grow. As irrelevant, stale, or weakly related information accumulates, models struggle to distinguish salient facts from noise, leading to diminished recall, incorrect reasoning, and unstable behavior even before hard context limits are reached.
More fundamentally, extending the context window solves the wrong problem. An agent with a longer context window is still stateless: it has no memory of what it learned yesterday, no record of how a user's situation has changed, and no structured model of the relationships that make knowledge useful.
Standard RAG implementations fail in production for two structural reasons:
- Semantic fragmentation. Naive chunking isolates facts - when an entity is introduced in one segment and updated in another, the system fails to synthesize a unified identity, producing "hallucinations of omission."
- Temporal stagnation. A flat, chronology-agnostic index cannot tell an obsolete fact from a current one - a policy updated six months ago and today's version sit side by side, indistinguishable by any similarity metric.
They also flatten sentiment. "I absolutely hate React; it's a nightmare to debug" is stored as a neutral fact, losing the intensity that should shape every future recommendation.
Hydra DB approaches this as a data-modeling problem rather than a context-length problem. We replace the flat vector index of standard RAG [5] with a Composite Context protocol that fuses a Git-style temporal graph for relational integrity with a high-dimensional vector substrate for semantic breadth.
Methodology
Standard RAG treats operational history, knowledge-base entries, and structured records as a flat collection of text chunks, indexed by an embedding model and retrieved by cosine similarity alone. This fails in production because agent context is simultaneously semantic (what something means), relational (how it connects to everything else), and chronological (when it was true). Flattening all three into a similarity score produces a system that actively degrades as the knowledge base grows.
2.1 · Ontological structure vs. the flat-index problem
A fundamental assumption in standard RAG is that semantic proximity implies relevance. This is false in production. Two facts can be semantically distant yet causally linked (a user's career switch and their subsequent relocation), or semantically close yet factually orthogonal ("I love Python" and "I used to love Python"). Vector stores cannot distinguish these cases - they reduce all knowledge to a high-dimensional soup where the only retrieval primitive is cosine similarity. Microsoft Research has shown this baseline "struggles to connect the dots" when answers require traversing disparate information [9].
Hydra DB indexes knowledge as a typed graph of entities and relationships. Each entity - a person, project, system, preference, or decision - is a first-class node. Each relationship carries a semantic type (WORKS_AT, PREFERS, CAUSED_BY, BLOCKED_BY), a natural-language context string, and temporal metadata. This enables deterministic multi-hop traversal that is impossible in a flat index.
Because decision traces are encoded as structured edges rather than buried in free text, Hydra DB can synthesize conclusions from graph topology itself. A pattern such as user REJECTED cloud-vendor-A, REJECTED cloud-vendor-B, OPTIMIZES_FOR data-sovereignty lets the system infer a preference that was never explicitly stated. These graph-derived conclusions propagate as enriched signals across the whole pipeline - the more the graph is traversed, the more latent structure it surfaces.
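To make the typed-graph idea concrete, the following is a minimal sketch of typed edges and a deterministic multi-hop lookup. It is not Hydra DB's implementation: the `Edge` and `TypedGraph` names, the `follow` method, the ISO date strings, and the `WORKS_ON` relation are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str    # semantic type, e.g. PREFERS or BLOCKED_BY
    target: str
    context: str     # natural-language context string
    valid_from: str  # temporal metadata (ISO date; format is an assumption)


class TypedGraph:
    """Hypothetical in-memory graph; a stand-in for a real graph store."""

    def __init__(self) -> None:
        self._out = defaultdict(list)

    def add(self, edge: Edge) -> None:
        self._out[edge.source].append(edge)

    def follow(self, start: str, relations: List[str]) -> List[List[Edge]]:
        """Return every path from `start` whose edge types match `relations` in order."""
        frontier = [(start, [])]
        for relation in relations:
            next_frontier = []
            for node, path in frontier:
                for edge in self._out.get(node, []):
                    if edge.relation == relation:
                        next_frontier.append((edge.target, path + [edge]))
            frontier = next_frontier
        return [path for _, path in frontier]


graph = TypedGraph()
graph.add(Edge("alice", "WORKS_ON", "project-a", "Alice leads Project A", "2024-01-10"))
graph.add(Edge("project-a", "BLOCKED_BY", "issue-b", "Project A waits on Issue B", "2024-03-02"))
graph.add(Edge("alice", "PREFERS", "python", "Alice prefers Python", "2023-06-01"))

# Two-hop question: what is blocking the projects Alice works on?
blockers = [path[-1].target for path in graph.follow("alice", ["WORKS_ON", "BLOCKED_BY"])]
```

Because the traversal is constrained by relation type rather than similarity, the same query always returns the same paths, which is the property the section contrasts with a flat index.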
2.2 · Temporally-aware context graph
Standard RAG suffers the State Confusion Problem. A user says in 2022, "I live in New York because I work at startup XYZ." In 2024: "I live in London because I now work at Meta; I moved to be closer to my parents." A vector store either retrieves both without knowing which is current, or overwrites the earlier fact - losing the timeline and the reasoning behind it. Hydra DB treats the knowledge graph as an immutable, append-only ledger, like a Git commit history where every transition is a versioned, addressable commit.
Iterative resolution loops vector-search every incoming chunk and ask an LLM whether to overwrite. Semantic similarity ≠ factual redundancy, so this causes false-positive deletes that purge history - and triggering a reasoning step per chunk is an O(N) latency trap that doesn't scale.
Hydra DB never mutates. Moving from "NYC" to "London" commits a new edge with fresh temporal metadata. Zero data loss, and the agent gains a temporally-aware decision tree it can query: "Where did I live last year, and why did I move?"
A relationship between entities u and v is not a single static edge but a time-ordered sequence of state changes - versioned commits. Each edge is a tuple $e_k = (u, v, r_k, t_{commit}, t_{valid}, C_{meta})$, where $r_k$ is the semantic relation, $t_{commit}$ the ingestion time, $t_{valid}$ the real-world validity ("in 1999"), and $C_{meta}$ the contextual metadata. When a fact changes, a new edge $e_k$ is appended rather than overwriting $e_{k-1}$. The current relational state is the most recent commit valid at query time.
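The append-only behaviour described above can be sketched as a small ledger. This is an illustration under stated assumptions rather than Hydra DB's storage layer: `EdgeCommit`, `EdgeLedger`, and `state_at` are hypothetical names, commits are grouped by (source, relation) so that a changed target such as New York to London counts as a new version, and the month and day values are invented for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class EdgeCommit:
    source: str      # entity u
    target: str      # entity v
    relation: str    # semantic relation
    t_commit: date   # ingestion time
    t_valid: date    # start of real-world validity
    meta: str        # contextual metadata, e.g. the stated reason


@dataclass
class EdgeLedger:
    commits: List[EdgeCommit] = field(default_factory=list)

    def commit(self, edge: EdgeCommit) -> None:
        # Append only: earlier commits are never modified or removed.
        self.commits.append(edge)

    def state_at(self, source: str, relation: str, when: date) -> Optional[EdgeCommit]:
        """Most recent commit for (source, relation) that is valid at `when`."""
        candidates = [
            e for e in self.commits
            if e.source == source and e.relation == relation and e.t_valid <= when
        ]
        return max(candidates, key=lambda e: (e.t_valid, e.t_commit), default=None)


ledger = EdgeLedger()
ledger.commit(EdgeCommit("user", "New York", "LIVES_IN",
                         date(2022, 5, 1), date(2022, 5, 1), "works at startup XYZ"))
ledger.commit(EdgeCommit("user", "London", "LIVES_IN",
                         date(2024, 3, 1), date(2024, 3, 1), "now works at Meta; closer to parents"))

current = ledger.state_at("user", "LIVES_IN", date(2024, 6, 1))
last_year = ledger.state_at("user", "LIVES_IN", date(2023, 6, 1))
```

Both commits remain in the ledger, so a question about where the user lived earlier, and why they moved, can be answered from `last_year.target` and `current.meta` without any history having been overwritten.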
2.3 · Sliding Window Inference Pipeline
Standard chunking creates blind segments, where a chunk loses dependencies on its neighbors.
Recursive character splitting rendered nearly 40% of chunks semantically invisible. "I hate that framework" is useless if "React" was named a few chunks earlier - vector search can never map "that framework" to "React." Larger overlap windows only inflated token cost.
Each segment is enriched against a lookback/lookahead window by a lightweight model that resolves references and extracts persistent preferences, producing a self-contained chunk.
We partition a session into base segments and build a context window $W_i$ around each segment, with lookback and lookahead horizons $h_{prev}$ and $h_{next}$, then enrich it via a transformation $f_\theta$.
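A minimal sketch of the windowing step follows. The function names (`build_windows`, `enrich_session`, `toy_enrich`) are hypothetical, and `toy_enrich` is only a string-matching stand-in for the lightweight enrichment model, which in practice would be an LLM call; the sketch shows the data flow, not the actual enrichment logic.

```python
from typing import Callable, Dict, List


def build_windows(segments: List[str], h_prev: int, h_next: int) -> List[Dict]:
    """Pair each base segment with its lookback and lookahead neighbours."""
    windows = []
    for i, segment in enumerate(segments):
        windows.append({
            "index": i,
            "segment": segment,
            "lookback": segments[max(0, i - h_prev):i],
            "lookahead": segments[i + 1:i + 1 + h_next],
        })
    return windows


def enrich_session(segments: List[str], h_prev: int, h_next: int,
                   enrich: Callable[[Dict], str]) -> List[str]:
    """Apply the enrichment function to every window, one output per segment."""
    return [enrich(window) for window in build_windows(segments, h_prev, h_next)]


def toy_enrich(window: Dict) -> str:
    # Toy stand-in for the enrichment model: resolve "that framework" to the
    # most recent framework named in the lookback segments.
    known_frameworks = ("React", "Vue", "Angular")
    text = window["segment"]
    if "that framework" in text:
        for previous in reversed(window["lookback"]):
            for name in known_frameworks:
                if name in previous:
                    return text.replace("that framework", name)
    return text


session = [
    "We migrated the dashboard to React.",
    "The build got slower.",
    "I hate that framework.",
]
enriched = enrich_session(session, h_prev=2, h_next=1, enrich=toy_enrich)
```

With a lookback of two segments the last chunk becomes self-contained; with a lookback of one, the reference to React is out of reach and the chunk stays ambiguous, which illustrates why the horizon sizes matter.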
2.4 · Bio-mimetic context consolidation
The assumption that "more data is better" breaks down: unbounded growth increases retrieval latency and causes semantic drift, where outdated records surface ahead of current context. We are experimenting with a Bio-Mimetic Decay Engine inspired by synaptic pruning and the Ebbinghaus forgetting curve, augmented with reinforcement. A record's retention score combines initial salience, temporal decay, and a reinforcement boost on each successful retrieval.
High-impact facts (a medical allergy) receive higher salience than low-impact facts (a coffee order). Each retrieval resets and elevates the decay curve, so high-signal records resist eviction regardless of age. Records that fall below threshold demote through a tiered storage architecture before eventual eviction.
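One way such a score could be shaped is sketched below. The exact formula is not given here, so the functional form (salience times a reinforcement factor times exponential decay), the parameter names `decay_rate` and `boost`, the tier names, and every threshold are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    salience: float      # initial salience, assumed to lie in (0, 1]
    last_access: float   # time of creation or last retrieval, in days
    retrievals: int = 0  # number of successful retrievals so far


def retention(record: MemoryRecord, now: float,
              decay_rate: float = 0.1, boost: float = 0.5) -> float:
    """Assumed form: salience * (1 + boost * retrievals) * exp(-decay_rate * age)."""
    age = max(0.0, now - record.last_access)
    return record.salience * (1.0 + boost * record.retrievals) * math.exp(-decay_rate * age)


def on_retrieval(record: MemoryRecord, now: float) -> None:
    # A successful retrieval restarts the decay clock and raises the curve.
    record.last_access = now
    record.retrievals += 1


def tier(score: float, hot_threshold: float = 0.5, warm_threshold: float = 0.1) -> str:
    """Map a retention score to a storage tier; thresholds are illustrative."""
    if score >= hot_threshold:
        return "hot"
    if score >= warm_threshold:
        return "warm"
    return "cold"


allergy = MemoryRecord(salience=1.0, last_access=0.0)
coffee = MemoryRecord(salience=0.2, last_access=0.0)

before = retention(allergy, now=30.0)
on_retrieval(allergy, now=30.0)
after = retention(allergy, now=30.0)
```

Under these assumptions the high-salience record always outranks the low-salience one at equal age, and a single retrieval lifts it back above the demotion threshold regardless of how old it is.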
2.5 · The high-dimensional vector substrate
While the graph maintains relational integrity, semantic recall relies on a multi-field hybrid schema. For every record we index three representations: raw content ($v_{content}$), sparse keywords ($v_{sparse}$), and latent context ($v_{latent}$). Crucially, $v_{latent}$ is the vectorization of the enriched output from Section 2.3 - so resolved dependencies are physically embedded into the search space.
A user asks "Why is the app behaving strangely?" but the relevant record says "Error 503: Service Unavailable." Standard embeddings place these far apart - no lexical or immediate semantic overlap.
By embedding the contextual implications of a chunk rather than its raw text, we pre-compute the answer - letting an abstract query latch onto the meaning of an event even when its literal description is obscure.
2.6 · Recall pipeline
At query time, Hydra DB runs a multi-stage pipeline that combines hybrid semantic search with the versioned graph. The query is treated as a semantic seed, expanded into diverse reformulations, scored across complementary signal paths, then fused and reranked.
- Adaptive query expansion. The query is a semantic seed: Φ(q) projects it into N diverse reformulations, run in parallel.
- Weighted hybrid search. Weighted rank fusion at the database level over three complementary signal paths.
- Graph-augmented retrieval. Query entities are matched, then traversed over bounded variable-length paths.
- Chunk-level graph expansion. Pre-linked entities on each vector chunk expand into their graph neighborhood N(c).
- Triple-tier reranking. Three reranked streams merge into the final context window.
Adaptive query expansion (multi-query)
Hydra DB treats the user query q as a semantic seed rather than a fixed string, applying an LLM-based projection function Φ(q) to generate N semantically diverse reformulations.
Each captures a distinct interpretation of intent - paraphrases, temporal concretizations, domain-specific restatements. "What did I do last week?" may expand to:
- "Projects worked on in the last 7 days"
- "Commits pushed during the previous week"
- "Meetings or tasks completed last week"
All expansions execute in parallel, ensuring high recall even when relevant records differ significantly in surface phrasing from the original query.
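The fan-out can be sketched as follows. This is a hedged illustration: `expand_query` replaces the LLM projection with a fixed lookup so the example runs offline, `keyword_search` is a toy retriever over an invented corpus, and keeping the original query among the variants is an assumption rather than something stated above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Invented records, used only to make the sketch runnable.
CORPUS = {
    "rec-1": "commits pushed to the billing repo",
    "rec-2": "meetings about the roadmap",
    "rec-3": "vacation photos",
}
STOPWORDS = {"the", "to", "in", "on"}


def expand_query(query: str) -> List[str]:
    # Stand-in for the LLM projection: a fixed lookup of reformulations.
    canned = {
        "What did I do last week?": [
            "Projects worked on in the last 7 days",
            "Commits pushed during the previous week",
            "Meetings or tasks completed last week",
        ]
    }
    return [query] + canned.get(query, [])


def keyword_search(text: str) -> List[str]:
    """Toy retriever: return record IDs sharing a non-stopword with the text."""
    words = set(text.lower().split()) - STOPWORDS
    return [rid for rid, body in CORPUS.items() if words & set(body.split())]


def search_all(query: str, search: Callable[[str], List[str]]) -> List[str]:
    """Run every reformulation in parallel; merge hits, de-duplicated, first seen first."""
    variants = expand_query(query)
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        per_variant = list(pool.map(search, variants))  # map preserves variant order
    seen, merged = set(), []
    for hits in per_variant:
        for hit in hits:
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged


hits = search_all("What did I do last week?", keyword_search)
```

In this toy setup the original phrasing matches nothing, while two of the reformulations each surface a relevant record, which is the recall effect the paragraph describes.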
Weighted hybrid search · the retrieval equation
Unlike systems that rely on cosine similarity alone, Hydra DB performs weighted rank fusion at the database level, combining three complementary signal paths - primary dense, secondary (inferred) dense, and sparse lexical.
The primary dense signal captures direct semantic similarity; the secondary dense signal captures implicit meaning not explicitly stated; and the sparse signal ensures rare but critical tokens - project IDs, issue numbers, usernames - strongly influence retrieval, preventing drift toward loosely related but incorrect records.
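The retrieval equation itself is not reproduced here, so the sketch below substitutes a well-known technique, weighted reciprocal rank fusion, to show how three ranked lists might be combined. The signal names, the weights, and the smoothing constant `k = 60` are illustrative assumptions, not Hydra DB's actual parameters.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rank_fusion(rankings: Dict[str, List[str]],
                         weights: Dict[str, float],
                         k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each signal adds weight / (k + rank) to a record's score."""
    scores = defaultdict(float)
    for signal, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weights[signal] / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


rankings = {
    "dense_primary": ["doc-a", "doc-b", "doc-c"],   # direct semantic similarity
    "dense_inferred": ["doc-b", "doc-a"],           # implicit, enriched meaning
    "sparse": ["doc-c", "doc-b"],                   # rare tokens such as issue IDs
}
weights = {"dense_primary": 0.5, "dense_inferred": 0.3, "sparse": 0.2}

fused = weighted_rank_fusion(rankings, weights)
```

A record that appears in all three lists (here `doc-b`) can outrank the top hit of any single signal, which is the practical argument for fusing ranks rather than trusting one similarity score.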
Graph-augmented retrieval · entity-based search
In parallel with hybrid vector retrieval, Hydra DB runs a graph pass over the versioned context graph. Entities E are extracted from q; exact name matching is followed by bounded, variable-length path traversal.
For each path, structured context is built by concatenating node, relation, and time, and the result is reranked by a cross-encoder applied to the query and that context.
This captures relational dependencies and temporal sequences absent from any single text chunk - e.g. "Project A is blocked by Issue B."
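The graph pass can be illustrated with a short sketch: enumerate bounded paths from each matched entity, serialize each path into a context string, and rerank. All names here are hypothetical, the edge data is invented, and `overlap_scorer` is a token-overlap stand-in for the cross-encoder, which a real system would replace with a learned model.

```python
import re
from typing import Callable, Dict, List, Tuple

# (source, relation, target, timestamp); all values are illustrative.
Edge = Tuple[str, str, str, str]


def bounded_paths(edges: List[Edge], start: str, max_hops: int) -> List[List[Edge]]:
    """Enumerate cycle-free paths of 1..max_hops edges starting at `start`."""
    outgoing: Dict[str, List[Edge]] = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)
    paths, stack = [], [(start, [])]
    while stack:
        node, path = stack.pop()
        if path:
            paths.append(path)
        if len(path) == max_hops:
            continue
        visited = {start} | {e[2] for e in path}
        for edge in outgoing.get(node, []):
            if edge[2] not in visited:
                stack.append((edge[2], path + [edge]))
    return paths


def path_context(path: List[Edge]) -> str:
    """Concatenate node, relation, and time for every hop on the path."""
    return " ; ".join(f"{s} -[{r} @ {t}]-> {o}" for s, r, o, t in path)


def overlap_scorer(query: str, context: str) -> float:
    # Stand-in for a cross-encoder: count shared lowercase word tokens.
    tokens = lambda text: set(re.findall(r"\w+", text.lower()))
    return float(len(tokens(query) & tokens(context)))


def graph_search(query: str, entities: List[str], edges: List[Edge],
                 scorer: Callable[[str, str], float],
                 max_hops: int = 2, top_k: int = 3) -> List[str]:
    contexts = [path_context(p) for e in entities for p in bounded_paths(edges, e, max_hops)]
    return sorted(contexts, key=lambda c: scorer(query, c), reverse=True)[:top_k]


EDGES = [
    ("Project A", "BLOCKED_BY", "Issue B", "2024-03-02"),
    ("Issue B", "CAUSED_BY", "Deploy 42", "2024-03-01"),
    ("Project A", "OWNED_BY", "Alice", "2024-01-10"),
]
results = graph_search("Why is Project A blocked by Issue B?", ["Project A"], EDGES,
                       overlap_scorer, max_hops=2, top_k=2)
```

The hop bound keeps traversal cost predictable, and serializing relation and timestamp into the context string is what lets the reranker see the ordering of events rather than only the entity names.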
Chunk-level graph expansion
Beyond query-anchored entity search, a second-stage expansion avoids post-hoc entity extraction from vector results. During ingestion each chunk c is pre-linked to its entities E(c); at retrieval the system explores their adjacent neighborhoods to depth n.
Each expanded path becomes structured context and is reranked independently.
This recovers implicit relational context that is semantically adjacent to high-confidence vector chunks but was not surfaced by query-entity matching - and it happens before context assembly, eliminating query-time entity extraction.
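A rough sketch of the expansion step is shown below. It assumes that pre-linked entities are stored as a mapping from chunk ID to entity names and that only outgoing edges are followed; both are assumptions, as are the function names and the sample data.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (source, relation, target)


def neighborhood(adjacency: Dict[str, List[Triple]], seeds: Set[str], depth: int) -> List[Triple]:
    """Breadth-first expansion from the seed entities out to `depth` hops."""
    seen_nodes, seen_edges, result = set(seeds), set(), []
    queue = deque((seed, 0) for seed in sorted(seeds))
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue
        for triple in adjacency.get(node, []):
            if triple not in seen_edges:
                seen_edges.add(triple)
                result.append(triple)
            if triple[2] not in seen_nodes:
                seen_nodes.add(triple[2])
                queue.append((triple[2], dist + 1))
    return result


def expand_chunks(chunk_ids: List[str], chunk_entities: Dict[str, Set[str]],
                  adjacency: Dict[str, List[Triple]], depth: int = 1) -> Dict[str, List[Triple]]:
    """Attach the graph neighborhood of each chunk's pre-linked entities."""
    return {cid: neighborhood(adjacency, chunk_entities.get(cid, set()), depth)
            for cid in chunk_ids}


ADJACENCY = {
    "Project A": [("Project A", "BLOCKED_BY", "Issue B")],
    "Issue B": [("Issue B", "CAUSED_BY", "Deploy 42")],
}
CHUNK_ENTITIES = {"chunk-1": {"Project A"}, "chunk-2": set()}

one_hop = expand_chunks(["chunk-1", "chunk-2"], CHUNK_ENTITIES, ADJACENCY, depth=1)
two_hop = expand_chunks(["chunk-1"], CHUNK_ENTITIES, ADJACENCY, depth=2)
```

Because the entity links are looked up rather than extracted, the only query-time work is the bounded traversal, which matches the claim that no entity extraction runs at retrieval.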
Triple-tier reranking with graph-vector fusion
The final window fuses three independently reranked streams: vector-retrieved chunks, query entity-matched graph paths, and chunk-expansion graph paths. The vector stream is reranked at this stage.
Graph candidates are already reranked; the final context window merges chunk-expansion results with their vector chunks and presents entity-based graph results separately.
In this merge, ⊕ attaches each chunk's expansion context 𝒩(c), and $k_1$ and $k_2$ control the number of merged vector-expansion pairs and independent graph paths, respectively. Combining these stages, Hydra DB retrieves not merely similar text but the correct factual and relational state - the architecture directly underpinning the 90.79% result on LongMemEval-s.
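The final assembly step might look like the sketch below. It assumes each stream arrives as (text, rerank score) pairs and that expansion contexts are keyed by chunk text; the function name, the `related:` and `[graph]` markers, and all sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def assemble_context(
    vector_chunks: List[Tuple[str, float]],  # (chunk text, rerank score)
    expansions: Dict[str, List[str]],        # chunk text -> expansion contexts
    graph_paths: List[Tuple[str, float]],    # (path context, rerank score)
    k1: int,
    k2: int,
) -> List[str]:
    """Keep the top-k1 chunks, each with its expansion, plus the top-k2 graph paths."""
    top_chunks = sorted(vector_chunks, key=lambda item: item[1], reverse=True)[:k1]
    top_paths = sorted(graph_paths, key=lambda item: item[1], reverse=True)[:k2]

    window = []
    for text, _ in top_chunks:
        related = expansions.get(text, [])
        # Attach the chunk's graph expansion directly beneath the chunk itself.
        block = text if not related else text + "\n  related: " + "; ".join(related)
        window.append(block)
    # Entity-matched graph paths are presented as a separate section.
    window.extend(f"[graph] {context}" for context, _ in top_paths)
    return window


window = assemble_context(
    vector_chunks=[("Error 503 on checkout", 0.91),
                   ("Team lunch on Friday", 0.12),
                   ("Deploy 42 rolled out", 0.77)],
    expansions={"Error 503 on checkout": ["checkout-service CAUSED_BY Deploy 42"]},
    graph_paths=[("Project A BLOCKED_BY Issue B", 0.66), ("Alice WORKS_AT Acme", 0.20)],
    k1=2,
    k2=1,
)
```

Keeping expansion context attached to its originating chunk, while listing query-matched paths separately, preserves the provenance of each piece of evidence in the prompt.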
Results
We evaluate on LongMemEval-s [6], a benchmark for long-term interactive memory spanning 500 answerable questions across six capability categories. Gemini 3.0 Pro serves as the primary model and LLM-as-a-judge; we additionally evaluate on GPT-5.2 and GPT-5 Mini to demonstrate model-agnostic performance.
We use the LongMemEval-s variant: 500 question-conversation stacks averaging over 115,000 tokens each (roughly 50 continuous sessions). We chose it over LoCoMo, whose 16k–26k average length does not stress the lost-in-the-middle regime of production histories. Data is ingested session-by-session to mimic asynchronous agent workflows, with Gemini 3.0 Pro as the LLM-as-a-judge under strict question-specific prompting (see the Appendix).
| Category | What it tests | # Q |
|---|---|---|
| Single-session extraction | Recall explicit facts amid noise | 70 |
| Single-session preference | Retain preferences in a session | 30 |
| Single-session assistant | Recall assistant-introduced facts | 56 |
| Multi-session reasoning | Combine facts across sessions | 133 |
| Temporal reasoning | Reason over chronology | 133 |
| Knowledge updates | Overwrite outdated facts | 78 |
| Total answerable | + abstention safety set | 500 |
3.1 · Performance on LongMemEval-s
Using Gemini 3.0 Pro, Hydra DB achieves 90.79% overall - a 5.59-point absolute improvement over the strongest competing system (85.20%) and a 30.59-point gain over the full-context baseline (60.2%). It reaches a perfect 100% on single-session user and assistant extraction, and leads or ties every category; the only tie is multi-session reasoning, where it matches Supermemory at 76.69%.
| Category | Hydra DB | Supermemory | Zep | Full-context | Mem0-oss |
|---|---|---|---|---|---|
| Single-session (User) | 100.00 | 98.57 | 92.9 | 81.4 | 38.71 |
| Single-session (Assistant) | 100.00 | 98.21 | 80.4 | 94.6 | 8.93 |
| Single-session (Preference) | 96.67 | 70.00 | 56.7 | 20.0 | 40.00 |
| Knowledge Update | 97.43 | 89.74 | 83.3 | 78.2 | 52.56 |
| Temporal Reasoning | 90.97 | 81.95 | 62.4 | 45.1 | 25.56 |
| Multi-session Reasoning | 76.69 | 76.69 | 57.9 | 44.3 | 20.30 |
| Overall | 90.79 | 85.20 | 71.2 | 60.2 | 29.07 |
3.2 · Cross-model generalization
A central architectural hypothesis is that good context design should reduce dependence on raw model capacity. Evaluated on the compact GPT-5 Mini, Hydra DB maintains 85.80% overall - approaching the Gemini 3.0 Pro reference (90.79%) and slightly exceeding the best competitor's flagship-model result (85.20%), while preserving near-perfect single-session extraction (98.59% user, 96.36% assistant).
The intermediate-scale GPT-5.2 lands at 84.73% overall with perfect user recall (100%), while the compact GPT-5 Mini reaches 85.80% and exceeds GPT-5.2 on preference extraction (93.10% vs. 89.66%). Consistency across the capacity spectrum indicates that context quality is governed primarily by ingestion design and temporal indexing rather than raw model capacity.
| Category | Gemini 3.0 Pro | GPT-5 Mini | GPT-5.2 |
|---|---|---|---|
| Single-session (User) | 100.00 | 98.59 | 100.00 |
| Single-session (Assistant) | 100.00 | 96.36 | 98.18 |
| Single-session (Preference) | 96.67 | 93.10 | 89.66 |
| Knowledge Update | 97.43 | 92.31 | 91.03 |
| Temporal Reasoning | 90.97 | 85.71 | 83.46 |
| Multi-session Reasoning | 76.69 | 66.37 | 64.60 |
| Overall | 90.79 | 85.80 | 84.73 |
Conclusion
Across three independent benchmarks, the same architectural thesis holds: treating stored knowledge as a versioned, time-aware graph - rather than a flat or stateless index - yields measurable gains precisely where production agents fail today. Hydra DB reaches state-of-the-art 90.79% on LongMemEval-s, 82% on BEAM 1M, and reliably surfaces correct evidence inside dense financial filings, all while remaining robust across backbone models of very different scale.
As AI deployments operate over ever-longer interaction histories, the importance of robust long-term memory architecture will only grow. The advantage is concentrated in temporal awareness, cross-session coherence, and the retrieval of structurally related information - the capabilities Hydra DB was built to deliver.
Appendix A · Evaluation prompts
The complete prompt templates used to evaluate Hydra DB on LongMemEval-s. The pipeline has two stages: (1) answer generation from retrieved context, and (2) answer comparison with an LLM-as-a-judge using question-type-specific scoring rubrics.
A.1 · Answer generation
Each question type receives type-specific instructions inside a shared structure:
{TYPE_SPECIFIC_INSTRUCTION}
Question: {question}
Question Date: {question_date}
Context:
{retrieved_context}
Instructions:
- Answer based on the provided context
- If information is insufficient, clearly state "I don't know"
- Be direct and factual
- Do not make up information
Provide your response in the following format:
Reasoning: [Your step-by-step reasoning]
Answer: [Your final answer]
Type-specific instructions:
You are answering a question about information the USER mentioned in a previous conversation. Provide a direct, factual answer based on what the user stated.
You are answering a question about information YOU (the assistant) provided in a previous conversation. Recall and provide the specific information you gave to the user.
You are answering a question that requires PERSONALIZATION based on the user's preferences. Generate a response that actively utilizes the user's stated preferences, likes, dislikes, or personal information. Do not just state the preferences - USE them to provide a personalized recommendation or response. The provided context might not directly answer the question but try drawing conclusions based on the context to generate the answer.
You are answering a question that requires synthesizing information from MULTIPLE conversation sessions. Combine, aggregate, or compare information across different sessions to form a complete answer.
You are answering a question about information that may have CHANGED over time. Provide the MOST RECENT/CURRENT information. If there were updates or changes, use the latest state. You may acknowledge previous states but prioritize the current information.
You are answering a question that involves TIME or temporal reasoning. Pay attention to dates, timestamps, durations, and time-relative references. Perform any necessary date/time calculations to arrive at the answer.
You are answering a question where the information may NOT be available in the conversation history. If you cannot find the requested information, clearly state that you don't know or that the information is not available. Do NOT make up or hallucinate an answer.
A.2 · Answer comparison (LLM-as-a-judge)
The comparison stage uses question-type-specific rubrics over a shared base template:
Compare the generated answer with the expected ground truth answer.
Question: {question}
{TYPE_SPECIFIC_NOTE}
Generated Answer: {generated_answer}
Expected Answer (Ground Truth): {expected_answer}
{TYPE_SPECIFIC_SCORING}
**Key Principle:** If the expected answer's core information appears
in the generated answer, mark is_correct=true and score=1.0. Minor
wording differences don't matter.
Respond ONLY with this JSON (no other text):
{
"is_correct": <true or false>,
"correctness_score": <0.0 to 1.0>,
"explanation": "<one sentence explaining score>",
"key_matches": [<list of matched facts>],
"key_misses": [<list of missing facts from expected>]
}
Default rubric · factual recall
| Condition | is_correct | score |
|---|---|---|
| Generated contains expected answer (exact or semantic match) | true | 1.0 |
| Generated contains expected answer + extra correct details | true | 1.0 |
| Generated partially correct (some key facts match) | false | 0.3–0.7 |
| Generated says "don’t know" but expected has answer | false | 0.0 |
| Generated gives wrong answer | false | 0.0 |
Preference personalization
| Condition | is_correct | score |
|---|---|---|
| Response satisfies the rubric and uses the user's personal info correctly | true | 1.0 |
| Response acknowledges preferences but doesn't fully utilize them | false | 0.3–0.9 |
| Response ignores user preferences entirely | false | 0.0 |
| Response contradicts user preferences | false | 0.0 |
Knowledge update
| Condition | is_correct | score |
|---|---|---|
| Generated contains the UPDATED/CURRENT answer | true | 1.0 |
| Generated contains updated answer + mentions previous state | true | 1.0 |
| Generated ONLY contains outdated information | false | 0.0 |
| Generated confuses old and new information | false | 0.2–0.4 |
Temporal reasoning
| Condition | is_correct | score |
|---|---|---|
| Generated contains correct temporal answer | true | 1.0 |
| Generated has temporal errors | false | 0.0–0.9 |
Abstention (safety)
| Condition | is_correct | score |
|---|---|---|
| System correctly abstains or says "I don’t know" | true | 1.0 |
| System indicates uncertainty or lack of information | true | 1.0 |
| System explicitly states the information is not available | true | 1.0 |
| System makes up an answer or hallucinates | false | 0.0 |
| System provides a confident but wrong answer | false | 0.0 |
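Since the judge is asked to reply with JSON only, a harness would typically validate each reply before scoring. The sketch below checks a reply against the five fields of the base template; `parse_judge_response` and `accuracy` are hypothetical helpers, and computing accuracy as the share of `is_correct` verdicts is an assumption about how the headline numbers are aggregated.

```python
import json
from typing import List

# Field names come from the base template; the type checks are assumptions.
REQUIRED_FIELDS = {
    "is_correct": bool,
    "correctness_score": (int, float),
    "explanation": str,
    "key_matches": list,
    "key_misses": list,
}


def parse_judge_response(raw: str) -> dict:
    """Parse the judge's JSON reply and check it against the template's fields."""
    verdict = json.loads(raw)
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in verdict:
            raise ValueError(f"missing field: {name}")
        if not isinstance(verdict[name], expected_type):
            raise ValueError(f"wrong type for field: {name}")
    score = verdict["correctness_score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not 0.0 <= score <= 1.0:
        raise ValueError("correctness_score must be a number in [0.0, 1.0]")
    return verdict


def accuracy(verdicts: List[dict]) -> float:
    """Percentage of verdicts marked correct."""
    return 100.0 * sum(1 for v in verdicts if v["is_correct"]) / len(verdicts)


reply = ('{"is_correct": true, "correctness_score": 1.0, '
         '"explanation": "Core fact matches.", '
         '"key_matches": ["London"], "key_misses": []}')
verdict = parse_judge_response(reply)
```

Rejecting malformed replies up front keeps a formatting failure by the judge from being silently counted as an incorrect answer.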
Appendix B · Model configuration
For the primary evaluation results (Section 3) we used:
- Answer generation model: Gemini 3.0 Pro
- Judge model: Gemini 3.0 Pro
- Temperature: model-specific optimal temperature (determined empirically)
- JSON mode: enabled for judge responses to ensure structured output
For the robustness analysis (cross-model generalization) we additionally evaluated with:
- GPT-5.2 (answer generation and judging)
- GPT-5 Mini (answer generation and judging)
All evaluations used identical prompt templates with only the backbone model changed, ensuring a fair comparison across model scales.
References
- [1] LLMs: Bigger Is Not Always Better. Rigoni, T. - Ampere Computing Blog (2024). amperecomputing.com
- [2] Lost in the Middle: How Language Models Use Long Contexts. Liu, N.F. et al. (2023). arXiv:2307.03172
- [3] Context Rot: How Increasing Input Tokens Impacts LLM Performance. Hong, K., Troynikov, A., Huber, J. (2025). research.trychroma.com
- [4] Introducing Contextual Retrieval. Ford, D. - Anthropic Engineering (2024). anthropic.com
- [5] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Lewis, P. et al. (2021). arXiv:2005.11401
- [6] LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. Wu, D. et al. (2025). arXiv:2410.10813
- [7] Evaluating Very Long-Term Conversational Memory of LLM Agents. Maharana, A. et al. (2024). arXiv:2402.17753
- [8] Lies, Damn Lies, & Statistics: Is Mem0 Really SOTA in Agent Memory? Chalef, D., Rasmussen, P. (2025). blog.getzep.com
- [9] From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Edge, D. et al. (2024). arXiv:2404.16130
- [10] Supermemory: State-of-the-Art Agent Memory on LongMemEval. Daga, S., Sreedhar, S., Shah, D. (2026). supermemory.ai/research
- [11] Zep: A Temporal Knowledge Graph Architecture for Agent Memory. Rasmussen, P. et al. (2025). arXiv:2501.13956
- [12] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. Chhikara, P. et al. (2025). arXiv:2504.19413
- [13] BEAM: Benchmark for Evaluation of AI Memory. Tavakoli, M. et al. (2025). arXiv:2510.27246
- [14] FinanceBench: A New Benchmark for Financial Question Answering. Islam, P. et al. (2023). arXiv:2311.11944
Cite this work
@techreport{hydradb2026,
title = {Hydra DB: Beyond Flat Embeddings for Production AI Agents},
author = {Ratnaparkhi, Soham and Srivastava, Nishkarsh and
Garg, Aadil and Garg, Pratham},
institution = {Hydra DB},
address = {San Francisco, California, USA},
year = {2026},
note = {LongMemEval-s 90.79\% (SOTA)},
url = {https://benchmarks.hydradb.com/HydraDB.pdf}
}