Technical Paper · HYDRA DB

Beyond Flat Embeddings for Production AI Agents

A context engine that models knowledge as versioned, relational, time-aware state - so agents can answer not just what is true now, but what was true, when it changed, and why.

90.79% · Overall accuracy (LongMemEval-s · SOTA)
+5.0 · Points over strongest competing system
100% · Single-session extraction recall
85.8% · On GPT-5 Mini (compact-model config)
Abstract

As Large Language Models transition from ephemeral chat interfaces to persistent agentic systems, two foundational problems remain unsolved. First, agents have no continuity across sessions: every interaction begins from zero, with no awareness of prior decisions, no history of how facts have changed, and no understanding of the user they serve. Second, enterprise knowledge bases remain trapped in flat vector indexes built from fragmented chunks that reduce all knowledge to similarity scores, collapsing the relationships essential for reasoning. We present Hydra DB, a context engine that models knowledge as versioned, relational, time-aware state. Agents ingest operational history and knowledge bases; Hydra DB resolves entities, versions every transition, and preserves the full relational record of how context evolved. At its core it combines a Sliding Window Inference Pipeline for contextually self-contained ingestion with a Git-style versioned knowledge graph that makes every state transition a first-class, addressable record. On LongMemEval-s, Hydra DB reaches a state-of-the-art 90.79%, outperforming all established baselines with robust, model-agnostic performance across Gemini 3.0 Pro, GPT-5.2, and GPT-5 Mini.

agent context engine · versioned knowledge graph · temporal reasoning · knowledge base retrieval · LongMemEval-s · SOTA
01

Introduction

The current trajectory of AI is defined by a shift from stateless, single-turn interactions to autonomous, multi-turn agentic sessions. While the industry has rapidly expanded LLM context windows, this "brute force" scaling introduces extremely high computational costs [1] and remains susceptible to the "lost-in-the-middle" phenomenon [2], where models make poorer use of information placed in the middle of long inputs.

Beyond these issues, long-running interactions suffer from context rot [3] - a gradual degradation in the usefulness of earlier context as conversations grow. As irrelevant, stale, or weakly related information accumulates, models struggle to distinguish salient facts from noise, leading to diminished recall, incorrect reasoning, and unstable behavior even before hard context limits are reached.

More fundamentally, extending the context window solves the wrong problem. An agent with a longer context window is still stateless: it has no memory of what it learned yesterday, no record of how a user's situation has changed, and no structured model of the relationships that make knowledge useful.

Standard RAG implementations fail in production for two structural reasons:

  • Semantic fragmentation. Naive chunking isolates facts - when an entity is introduced in one segment and updated in another, the system fails to synthesize a unified identity, producing "hallucinations of omission."
  • Temporal stagnation. A flat, chronology-agnostic index cannot tell an obsolete fact from a current one - a policy updated six months ago and today's version sit side by side, indistinguishable by any similarity metric.

They also flatten sentiment. "I absolutely hate React; it's a nightmare to debug" is stored as a neutral fact, losing the intensity that should shape every future recommendation.

When agent context is structured at the database level, inference quality becomes largely model-agnostic - the same architecture holds from GPT-5 Mini to Gemini 3.0 Pro.

Hydra DB approaches this as a data-modeling problem rather than a context-length problem. We replace the flat vector index of standard RAG [5] with a Composite Context protocol that fuses a Git-style temporal graph for relational integrity with a high-dimensional vector substrate for semantic breadth.

02

Methodology

Standard RAG treats operational history, knowledge-base entries, and structured records as a flat collection of text chunks, indexed by an embedding model and retrieved by cosine similarity alone. This fails in production because agent context is simultaneously semantic (what something means), relational (how it connects to everything else), and chronological (when it was true). Flattening all three into a similarity score produces a system that actively degrades as the knowledge base grows.

Composite Context · architecture
Sliding Window Inference
Segments enriched with surrounding context; references resolved into self-contained, independently retrievable facts.
Git-Style Versioned Graph
Updates appended as new timestamped edges rather than overwriting - every historical state preserved and addressable.
Multi-Stage Retrieval
Query expansion, hybrid dense + sparse search, graph traversal, and multi-stage reranking surface causally related context.
The Composite Context protocol replaces a flat index with a fused graph + vector substrate spanning ingestion, storage, and recall.

2.1 · Ontological structure vs. the flat-index problem

A fundamental assumption in standard RAG is that semantic proximity implies relevance. This is false in production. Two facts can be semantically distant yet causally linked (a user's career switch and their subsequent relocation), or semantically close yet factually orthogonal ("I love Python" and "I used to love Python"). Vector stores cannot distinguish these cases - they reduce all knowledge to a high-dimensional soup where the only retrieval primitive is cosine similarity. Microsoft Research has shown this baseline "struggles to connect the dots" when answers require traversing disparate information [9].

Hydra DB indexes knowledge as a typed graph of entities and relationships. Each entity - a person, project, system, preference, or decision - is a first-class node. Each relationship carries a semantic type (WORKS_AT, PREFERS, CAUSED_BY, BLOCKED_BY), a natural-language context string, and temporal metadata. This enables deterministic multi-hop traversal impossible in a flat index:

"Why is the authentication service behaving differently since last month?" - the graph traverses auth-service → user-db → migration-v2 → alice → schema-change-ticket, recovering the full causal chain without any of these hops being co-located in embedding space.

Because decision traces are encoded as structured edges rather than buried in free text, Hydra DB can synthesize conclusions from graph topology itself. A pattern such as user REJECTED cloud-vendor-A, REJECTED cloud-vendor-B, OPTIMIZES_FOR data-sovereignty lets the system infer a preference that was never explicitly stated. These graph-derived conclusions propagate as enriched signals across the whole pipeline - the more the graph is traversed, the more latent structure it surfaces.
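The multi-hop traversal described above can be illustrated with a small sketch. It assumes an in-memory list of typed edges and a bounded breadth-first search; the relation names on the example edges (DEPENDS_ON, CHANGED_BY, AUTHORED_BY, FILED, OWNED_BY) and the function names are hypothetical and are not taken from Hydra DB itself.

```python
from collections import deque

# Hypothetical typed edges (source, relation, target) mirroring the example chain.
EDGES = [
    ("auth-service", "DEPENDS_ON", "user-db"),
    ("user-db", "CHANGED_BY", "migration-v2"),
    ("migration-v2", "AUTHORED_BY", "alice"),
    ("alice", "FILED", "schema-change-ticket"),
    ("auth-service", "OWNED_BY", "platform-team"),
]


def build_adjacency(edges):
    """Index edges by source node for fast neighbor lookup."""
    adj = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((rel, dst))
    return adj


def find_path(adj, start, goal, max_hops):
    """Breadth-first search for the shortest typed path within max_hops edges."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        hops = len(path) // 2  # path alternates node, relation, node, ...
        if hops >= max_hops:
            continue
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel, nxt]))
    return None


adj = build_adjacency(EDGES)
chain = find_path(adj, "auth-service", "schema-change-ticket", max_hops=4)
print(" -> ".join(chain))
```

The traversal is deterministic: it follows explicit edges, so none of the intermediate nodes need to be close to the query in embedding space.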

2.2 · Temporally-aware context graph

Standard RAG suffers the State Confusion Problem. A user says in 2022, "I live in New York because I work at startup XYZ." In 2024: "I live in London because I now work at Meta; I moved to be closer to my parents." A vector store either retrieves both without knowing which is current, or overwrites the earlier fact - losing the timeline and the reasoning behind it. Hydra DB treats the knowledge graph as an immutable, append-only ledger, like a Git commit history where every transition is a versioned, addressable commit.

Challenge · destructive updates

Iterative resolution loops vector-search every incoming chunk and ask an LLM whether to overwrite. Semantic similarity ≠ factual redundancy, so this causes false-positive deletes that purge history - and triggering a reasoning step per chunk is an O(N) latency trap that doesn't scale.

Solution · append-only log

Hydra DB never mutates. Moving from "NYC" to "London" commits a new edge with fresh temporal metadata. Zero data loss, and the agent gains a temporally-aware decision tree it can query: "Where did I live last year, and why did I move?"

A relationship between entities u and v is not a single static edge but a time-ordered sequence of state changes - versioned commits. Each edge is a tuple:

$$ e_k = \big( r_k,\ t_{\text{commit}},\ t_{\text{valid}},\ C_{\text{meta}} \big) \tag{1} $$

where $r_k$ is the semantic relation, $t_{\text{commit}}$ the ingestion time, $t_{\text{valid}}$ the real-world validity ("in 1999"), and $C_{\text{meta}}$ the contextual metadata. When a fact changes, a new edge $e_k$ is appended rather than overwriting $e_{k-1}$. The current relational state is the most recent commit valid at query time:

$$ \Delta\mathrm{State}(u,v) = \mathrm{SortByTime}\big( E(u,v) \big), \qquad t \le t_{\text{now}} \tag{2} $$
Figure 1. Temporal-State graph topology. Subject: Alice. current_housing_location: "NYC" (temporal: 2022; context: work at startup XYZ) → "London" (temporal: 2024; context: now at Meta, closer to parents). dietary_preference: "Omnivore" (temporal: 2021; context: …) → "Vegan" (temporal: 2025).
Each subject relation is a commit chain. Superseded states (greyed) are preserved, not deleted - enabling differential reasoning over how and why context changed.
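A minimal sketch of the append-only commit chain follows, assuming an in-memory log keyed by subject and relation. The class and method names are hypothetical, the `value` field (the target of the relation) is added for illustration, and the exact dates are assumptions, since the example in the text gives only years.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EdgeVersion:
    relation: str   # semantic relation r_k
    t_commit: date  # ingestion time
    t_valid: date   # start of real-world validity
    context: str    # contextual metadata C_meta
    value: str      # target of the relation (illustrative addition)


class VersionedEdgeLog:
    """Append-only log: commits are added, never mutated or deleted."""

    def __init__(self):
        self._log = {}  # (subject, relation) -> list of EdgeVersion

    def commit(self, subject, version):
        self._log.setdefault((subject, version.relation), []).append(version)

    def history(self, subject, relation):
        """Every recorded state, ordered by validity time."""
        versions = self._log.get((subject, relation), [])
        return sorted(versions, key=lambda e: e.t_valid)

    def state_at(self, subject, relation, as_of):
        """Most recent commit whose validity starts on or before as_of."""
        valid = [e for e in self.history(subject, relation) if e.t_valid <= as_of]
        return valid[-1] if valid else None


log = VersionedEdgeLog()
# Day and month are placeholders; the source example specifies years only.
log.commit("alice", EdgeVersion("current_housing_location", date(2022, 1, 1),
                                date(2022, 1, 1), "work at startup XYZ", "NYC"))
log.commit("alice", EdgeVersion("current_housing_location", date(2024, 1, 1),
                                date(2024, 1, 1), "now at Meta, closer to parents", "London"))

last_year = log.state_at("alice", "current_housing_location", date(2023, 6, 1))
current = log.state_at("alice", "current_housing_location", date(2025, 1, 1))
print(last_year.value, "->", current.value, "because:", current.context)
```

Because nothing is overwritten, "where did I live last year, and why did I move?" reduces to two lookups over the same chain: one with an earlier `as_of`, one with the current date.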

2.3 · Sliding Window Inference Pipeline

Standard chunking creates blind segments, where a chunk loses dependencies on its neighbors.

Challenge · orphaned pronouns

Recursive character splitting rendered nearly 40% of chunks semantically invisible. "I hate that framework" is useless if "React" was named a few chunks earlier - vector search can never map "that framework" to "React." Larger overlap windows only inflated token cost.

Solution · window enrichment

Each segment is enriched against a lookback/lookahead window by a lightweight model that resolves references and extracts persistent preferences, producing a self-contained chunk.

We partition a session into base segments and build a context window $W_i$ with horizons $h_{\text{prev}}$ and $h_{\text{next}}$, then enrich via a transformation $f_\theta$:

$$ c_i = f_\theta( s_i \mid W_i ) = \{\, T_{\text{res}},\ P_{\text{map}},\ s_i \,\} \tag{5} $$
Figure 2. Sliding-window enrichment. Input segments $s_{i-5} \dots s_i$: "User: Marine Biologist" … "I moved to the office." → $f_\theta$ over window $W_i$: entity resolution ($T_{\text{res}}$) + preference mapping ($P_{\text{map}}$) → self-contained output $c'_i$: "The user (Marine Biologist) moved to the office."
The enriched chunk embeds resolved entities so it is independently retrievable, even when the original statement was ambiguous.
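The windowing step can be sketched as follows. This is an illustration only: the function names are hypothetical, the toy rule-based resolver stands in for the lightweight model, and only reference resolution is shown (preference mapping is omitted).

```python
KNOWN_FRAMEWORKS = ("React", "Vue", "Angular")  # toy vocabulary, an assumption


def build_windows(segments, h_prev, h_next):
    """Pair each segment with its lookback and lookahead neighbors."""
    windows = []
    for i, seg in enumerate(segments):
        lo = max(0, i - h_prev)
        hi = min(len(segments), i + h_next + 1)
        windows.append((seg, segments[lo:i], segments[i + 1:hi]))
    return windows


def toy_resolver(segment, context):
    """Stand-in for the enrichment model: resolve 'that framework' from context."""
    referent = None
    for text in context:
        for name in KNOWN_FRAMEWORKS:
            if name in text:
                referent = name  # later mentions override earlier ones
    if referent and "that framework" in segment:
        return segment.replace("that framework", referent)
    return segment


def enrich(segment, lookback, lookahead, resolver):
    """Return a self-contained chunk: resolved text plus the original segment."""
    resolved = resolver(segment, lookback + lookahead)
    return {"resolved": resolved, "original": segment}


segments = [
    "I've been building the dashboard in React.",
    "The deadline is next Friday.",
    "I hate that framework.",
]
chunks = [enrich(seg, prev, nxt, toy_resolver)
          for seg, prev, nxt in build_windows(segments, h_prev=2, h_next=1)]
print(chunks[2]["resolved"])
```

With a lookback of one segment instead of two, the window no longer reaches the sentence that names React and the reference stays unresolved, which is the trade-off the horizons control.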

2.4 · Bio-mimetic context consolidation

The assumption that "more data is better" breaks down: unbounded growth causes retrieval latency and semantic drift, where outdated records surface ahead of current context. We are experimenting with a Bio-Mimetic Decay Engine inspired by synaptic pruning and the Ebbinghaus forgetting curve, augmented with reinforcement. A record's retention score combines initial salience, temporal decay, and a reinforcement boost on each successful retrieval:

$$ R(m,t) = I_{\text{salience}} \cdot e^{-\lambda \Delta t} + \sigma \sum_{i} \frac{1}{t - t_{\text{access}_i}} \tag{6} $$

High-impact facts (a medical allergy) receive higher salience than low-impact facts (a coffee order). Each retrieval resets and elevates the decay curve, so high-signal records resist eviction regardless of age. Records that fall below threshold demote through a tiered storage architecture before eventual eviction.
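As a rough numerical illustration of the retention score, the sketch below assumes the reinforcement term is a sum of 1/(t - t_access) over past retrievals, measures time in days, and uses made-up salience, decay, and boost values. None of these constants come from Hydra DB.

```python
import math


def retention_score(salience, age, access_ages, decay_rate, boost):
    """Salience decayed exponentially with age, plus a boost per past retrieval.

    access_ages holds the time elapsed since each retrieval (t - t_access),
    so recent retrievals contribute more than old ones.
    """
    decay = salience * math.exp(-decay_rate * age)
    reinforcement = boost * sum(1.0 / a for a in access_ages if a > 0)
    return decay + reinforcement


# Illustrative values only: 30-day-old records, decay rate 0.05 per day.
allergy = retention_score(salience=0.9, age=30, access_ages=[], decay_rate=0.05, boost=0.3)
coffee = retention_score(salience=0.2, age=30, access_ages=[], decay_rate=0.05, boost=0.3)
coffee_recent = retention_score(salience=0.2, age=30, access_ages=[2], decay_rate=0.05, boost=0.3)

print(f"allergy={allergy:.3f} coffee={coffee:.3f} coffee_retrieved={coffee_recent:.3f}")
```

Under these assumptions the high-salience record outlasts the low-salience one at equal age, and a single recent retrieval lifts a low-salience record well above its unreinforced score.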

2.5 · The high-dimensional vector substrate

While the graph maintains relational integrity, semantic recall relies on a multi-field hybrid schema. For every record we index three representations: raw content ($v_{\text{content}}$), sparse keywords ($v_{\text{sparse}}$), and latent context ($v_{\text{latent}}$). Crucially, $v_{\text{latent}}$ is the vectorization of the enriched output from Section 2.3 - so resolved dependencies are physically embedded into the search space.

Challenge · vocabulary mismatch

A user asks "Why is the app behaving strangely?" but the relevant record says "Error 503: Service Unavailable." Standard embeddings place these far apart - no lexical or immediate semantic overlap.

Solution · latent semantic bridging

By embedding the contextual implications of a chunk rather than its raw text, we pre-compute the answer - letting an abstract query latch onto the meaning of an event even when its literal description is obscure.

2.6 · Recall pipeline

At query time, Hydra DB runs a multi-stage pipeline that combines hybrid semantic search with the versioned graph. The query is treated as a semantic seed, expanded into diverse reformulations, scored across complementary signal paths, then fused and reranked.

Recall pipeline · five stages

  1. Adaptive query expansion. The query is a semantic seed: Φ(q) projects it into N diverse reformulations, run in parallel. (Φ(q) → N queries · paraphrase · temporal concretization)
  2. Weighted hybrid search. Weighted rank fusion at the database level over three complementary signal paths. (Dense · v_content; Dense · v_inferred; BM25 · v_sparse)
  3. Entity-based graph search. Query entities are matched, then traversed over bounded variable-length paths. (entity match · path *1..n · cross-encoder rerank)
  4. Chunk-level graph expansion. Pre-linked entities on each vector chunk expand into their graph neighborhood N(c). (pre-linked E(c) · neighborhood N(c))
  5. Triple-tier reranking & fusion. Three reranked streams merge into the final context window. (graph-vector fusion · TopK merge)

Query expansion, weighted hybrid retrieval, entity-based graph search, chunk-level expansion, and multi-stream fusion run as a single staged pipeline.

Adaptive query expansion (multi-query)

Hydra DB treats the user query q as a semantic seed rather than a fixed string, applying an LLM-based projection function Φ(q) to generate N semantically diverse reformulations:

$$ Q' = \{\, q_1, q_2, \dots, q_N \,\} $$

Each captures a distinct interpretation of intent - paraphrases, temporal concretizations, domain-specific restatements. "What did I do last week?" may expand to:

  • "Projects worked on in the last 7 days"
  • "Commits pushed during the previous week"
  • "Meetings or tasks completed last week"

All expansions execute in parallel, ensuring high recall even when relevant records differ significantly in surface phrasing from the original query.
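The fan-out can be sketched with a thread pool. In this illustration `expand_query` is a placeholder that returns canned reformulations instead of calling an LLM, and `search` is a toy token-overlap retriever; both names, the stopword list, and the corpus are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

STOPWORDS = {"the", "on", "in", "or", "to", "i", "did", "do", "what"}


def expand_query(query):
    """Placeholder for the LLM projection: returns canned reformulations."""
    canned = {
        "What did I do last week?": [
            "Projects worked on in the last 7 days",
            "Commits pushed during the previous week",
            "Meetings or tasks completed last week",
        ]
    }
    return [query] + canned.get(query, [])


def tokens(text):
    return {t for t in text.lower().split() if t not in STOPWORDS}


def search(query, corpus):
    """Toy retriever: records sharing at least two content tokens with the query."""
    q = tokens(query)
    return [doc for doc in corpus if len(q & tokens(doc)) >= 2]


def multi_query_recall(query, corpus):
    """Run every reformulation in parallel and merge results without duplicates."""
    queries = expand_query(query)
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        result_lists = list(pool.map(lambda q: search(q, corpus), queries))
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in merged:
                merged.append(doc)
    return merged


corpus = [
    "Pushed 3 commits to the billing repo",
    "Completed the onboarding tasks on Tuesday",
    "Bought groceries",
]
hits = multi_query_recall("What did I do last week?", corpus)
print(hits)
```

In this toy corpus the original phrasing retrieves nothing on its own; the relevant records surface only through the reformulations, which is the recall benefit the expansion step is meant to provide.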

Weighted hybrid search · the retrieval equation

Unlike systems that rely on cosine similarity alone, Hydra DB performs weighted rank fusion at the database level, combining three complementary signal paths - primary dense, secondary (inferred) dense, and sparse lexical:

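$$ S_{\text{retrieval}}(q,c) = x \cdot \mathrm{sim}(q, v_{\text{content}}) + y \cdot \mathrm{sim}(q, v_{\text{inferred}}) + \alpha \cdot \mathrm{BM25}(q, v_{\text{sparse}}) \tag{7} $$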

The primary dense signal captures direct semantic similarity; the secondary dense signal captures implicit meaning not explicitly stated; and the sparse signal ensures rare but critical tokens - project IDs, issue numbers, usernames - strongly influence retrieval, preventing drift toward loosely related but incorrect records.
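A small numeric sketch of this weighted combination follows. It assumes cosine similarity for both dense signals and a lexical score already normalized to [0, 1]; the weights, the two-dimensional vectors, and the record layout are illustrative assumptions rather than Hydra DB's actual values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_score(q_vec, record, lexical_score, x=0.5, y=0.3, alpha=0.2):
    """Weighted sum of primary dense, inferred dense, and sparse lexical signals."""
    return (x * cosine(q_vec, record["v_content"])
            + y * cosine(q_vec, record["v_inferred"])
            + alpha * lexical_score)


query_vec = [1.0, 0.0]

# Raw text is far from the query, but the inferred (enriched) vector is close.
error_record = {"v_content": [0.0, 1.0], "v_inferred": [1.0, 0.0]}
# Neither representation is close to the query.
unrelated_record = {"v_content": [0.0, 1.0], "v_inferred": [0.0, 1.0]}

score_error = hybrid_score(query_vec, error_record, lexical_score=0.0)
score_unrelated = hybrid_score(query_vec, unrelated_record, lexical_score=0.0)
score_with_id_match = hybrid_score(query_vec, unrelated_record, lexical_score=1.0)

print(score_error, score_unrelated, score_with_id_match)
```

The first record is retrieved only because of the inferred signal, and the last call shows how an exact token match (for example an issue number) can lift a record that the dense signals alone would miss.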

Graph-augmented retrieval · entity-based search

In parallel with hybrid vector retrieval, Hydra DB runs a graph pass over the versioned context graph. Entities E are extracted from q; exact name matching is followed by bounded, variable-length path traversal:

$$ P_{\text{graph}} = \mathrm{Path}\big( E_{\text{start}} \xrightarrow{\;*1..n\;} E_{\text{end}} \big) \tag{8} $$

For each path, structured context is built by concatenating node, relation, and time:

$$ \mathrm{context}_{\text{graph}}(p) = \mathrm{concat}\big( \text{node}_{\text{name}},\ \text{relation}_{\text{context}},\ \text{temporal}_{\text{details}} \big) \tag{9} $$

and reranked by a cross-encoder applied to the query and that context:

$$ S_{\text{graph}}(p) = S_{\text{semantic}}\big( q,\ \mathrm{context}_{\text{graph}}(p) \big) \tag{10} $$

This captures relational dependencies and temporal sequences absent from any single text chunk - e.g. "Project A is blocked by Issue B."
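The graph pass can be sketched in three small steps: enumerate bounded paths from a matched entity, flatten each path into a context string, and score it against the query. In this illustration the token-overlap scorer is a stand-in for the cross-encoder, and the edges, relation names, and timestamps are hypothetical.

```python
# Hypothetical edges: (source, relation, target, timestamp).
EDGES = [
    ("project-a", "BLOCKED_BY", "issue-b", "2024-03"),
    ("issue-b", "ASSIGNED_TO", "alice", "2024-03"),
    ("project-a", "OWNED_BY", "bob", "2023-11"),
]


def enumerate_paths(edges, start, max_hops):
    """All simple paths of 1..max_hops edges leaving start."""
    paths = []

    def walk(node, path, visited):
        if len(path) == max_hops:
            return
        for src, rel, dst, when in edges:
            if src == node and dst not in visited:
                new_path = path + [(src, rel, dst, when)]
                paths.append(new_path)
                walk(dst, new_path, visited | {dst})

    walk(start, [], {start})
    return paths


def path_context(path):
    """Concatenate node names, relations, and temporal details into one string."""
    return "; ".join(f"{s} {r} {d} ({w})" for s, r, d, w in path)


def overlap_score(query, context):
    """Toy relevance score (fraction of query tokens present); not a cross-encoder."""
    cleaned = context.lower().replace("(", " ").replace(")", " ").replace(";", " ")
    q = set(query.lower().split())
    return len(q & set(cleaned.split())) / len(q) if q else 0.0


query = "what is project-a blocked_by"
paths = enumerate_paths(EDGES, "project-a", max_hops=2)
ranked = sorted(paths, key=lambda p: overlap_score(query, path_context(p)), reverse=True)
for p in ranked:
    print(path_context(p))
```

A production system would replace `overlap_score` with a learned reranker; the structure of the pass (match, traverse, serialize, rerank) is what the sketch is meant to show.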

Chunk-level graph expansion

Beyond query-anchored entity search, a second-stage expansion avoids post-hoc entity extraction from vector results. During ingestion each chunk c is pre-linked to its entities E(c); at retrieval the system explores their adjacent neighborhoods to depth n:

$$ \mathcal{N}(c) = \bigcup_{e \in E(c)} \mathrm{Path}\big( e \xrightarrow{\;*1..n\;} \big) \tag{11} $$

Each expanded path becomes structured context and is reranked independently:

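$$ \mathrm{context}_{\text{expansion}}(p) = \mathrm{concat}\big( \text{node}_{\text{name}},\ \text{relation}_{\text{context}},\ \text{temporal}_{\text{details}} \big) \tag{12} $$
$$ S_{\text{expansion}}(p) = S_{\text{semantic}}\big( q,\ \mathrm{context}_{\text{expansion}}(p) \big) \tag{13} $$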

This recovers implicit relational context that is semantically adjacent to high-confidence vector chunks but was not surfaced by query-entity matching - and it happens before context assembly, eliminating query-time entity extraction.
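A minimal sketch of chunk-level expansion, assuming chunk-to-entity links were stored at ingestion time. For brevity it collects the set of edges reachable within the depth bound rather than materializing each path; the chunk ID, entity names, and relations are hypothetical.

```python
# Hypothetical links written at ingestion time: chunk id -> entities it mentions.
CHUNK_ENTITIES = {"chunk-17": ["migration-v2"]}

# Hypothetical adjacency list: node -> [(relation, neighbor), ...].
ADJ = {
    "migration-v2": [("AUTHORED_BY", "alice"), ("MODIFIES", "user-db")],
    "alice": [("FILED", "schema-change-ticket")],
}


def neighborhood(adj, entities, depth):
    """Union over pre-linked entities of all edges reachable within depth hops."""
    found = set()
    for entity in entities:
        frontier = {entity}
        visited = {entity}
        for _ in range(depth):
            next_frontier = set()
            for node in frontier:
                for rel, dst in adj.get(node, []):
                    found.add((node, rel, dst))
                    if dst not in visited:
                        visited.add(dst)
                        next_frontier.add(dst)
            frontier = next_frontier
    return found


one_hop = neighborhood(ADJ, CHUNK_ENTITIES["chunk-17"], depth=1)
two_hop = neighborhood(ADJ, CHUNK_ENTITIES["chunk-17"], depth=2)
print(sorted(two_hop))
```

No entity extraction runs at query time here: the lookup starts from links that already exist on the retrieved chunk.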

Triple-tier reranking with graph-vector fusion

The final window fuses three independently reranked streams: vector-retrieved chunks, query entity-matched graph paths, and chunk-expansion graph paths. For the vector stream:

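$$ S^{\text{vs}}_{\text{rerank}}(c) = \gamma \cdot S_{\text{semantic}}(c) + (1-\gamma) \cdot S_{\text{lexical}}(c) \tag{14} $$
$$ S^{\text{vs}}_{\text{final}}(c) = \beta \cdot S^{\text{vs}}(c) + (1-\beta) \cdot S^{\text{vs}}_{\text{rerank}}(c) \tag{15} $$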

Graph candidates are already reranked; the final context window merges chunk-expansion results with their vector chunks and presents entity-based graph results separately:

$$ \mathcal{C}_{\text{final}} = \mathrm{TopK}_1\big( C^{\text{vs}}_{\text{final}} \oplus C_{\text{expansion}},\ k_1 \big) \cup \mathrm{TopK}_2\big( C_{\text{graph}},\ k_2 \big) \tag{16} $$

where $\oplus$ attaches each chunk's expansion context $\mathcal{N}(c)$, and $k_1$/$k_2$ control the number of merged vector-expansion pairs and independent graph paths. Combining these stages, Hydra DB retrieves not merely similar text but the correct factual and relational state - the architecture directly underpinning the 90.79% result on LongMemEval-s.
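The fusion step can be sketched as below, assuming the rerank score blends semantic and lexical scores with weight γ, the final vector score blends the retrieval and rerank scores with weight β, and each selected chunk carries its expansion context alongside it. The weights, field names, and sample scores are illustrative assumptions.

```python
def rerank_score(semantic, lexical, gamma):
    """Blend of semantic and lexical rerank signals."""
    return gamma * semantic + (1 - gamma) * lexical


def final_vector_score(retrieval, rerank, beta):
    """Blend of the first-stage retrieval score and the rerank score."""
    return beta * retrieval + (1 - beta) * rerank


def assemble_context(vector_chunks, expansions, graph_paths, k1, k2,
                     gamma=0.7, beta=0.4):
    """Top-k1 vector chunks with attached expansions, plus top-k2 graph paths."""
    scored = []
    for chunk in vector_chunks:
        rr = rerank_score(chunk["semantic"], chunk["lexical"], gamma)
        score = final_vector_score(chunk["retrieval"], rr, beta)
        scored.append((score, chunk["id"], expansions.get(chunk["id"], [])))
    scored.sort(key=lambda item: item[0], reverse=True)
    top_graph = sorted(graph_paths, key=lambda p: p["score"], reverse=True)[:k2]
    return {
        "vector_with_expansion": [(cid, exp) for _, cid, exp in scored[:k1]],
        "graph": [p["context"] for p in top_graph],
    }


vector_chunks = [
    {"id": "a", "retrieval": 0.9, "semantic": 0.2, "lexical": 0.1},
    {"id": "b", "retrieval": 0.6, "semantic": 0.9, "lexical": 0.8},
    {"id": "c", "retrieval": 0.3, "semantic": 0.3, "lexical": 0.3},
]
expansions = {"b": ["migration-v2 AUTHORED_BY alice"]}
graph_paths = [
    {"context": "project-a BLOCKED_BY issue-b", "score": 0.8},
    {"context": "project-a OWNED_BY bob", "score": 0.4},
]

window = assemble_context(vector_chunks, expansions, graph_paths, k1=2, k2=1)
print(window)
```

In this example chunk "a" has the highest first-stage retrieval score but drops below "b" once the rerank signal is blended in, which is the effect the second blend is meant to capture.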

03

Results

We evaluate on LongMemEval-s [6], a benchmark for long-term interactive memory spanning 500 answerable questions across six capability categories. Gemini 3.0 Pro serves as the primary model and LLM-as-a-judge; we additionally evaluate on GPT-5.2 and GPT-5 Mini to demonstrate model-agnostic performance.

We use the LongMemEval-s variant: 500 question-conversation stacks averaging over 115,000 tokens each (roughly 50 continuous sessions). We chose it over LoCoMo, whose 16k–26k average length does not stress the lost-in-the-middle regime of production histories. Data is ingested session-by-session to mimic asynchronous agent workflows, with Gemini 3.0 Pro as the LLM-as-a-judge under strict question-specific prompting (see the Appendix).

| Category | What it tests | # Q |
|---|---|---|
| Single-session extraction | Recall explicit facts amid noise | 70 |
| Single-session preference | Retain preferences in a session | 30 |
| Single-session assistant | Recall assistant-introduced facts | 56 |
| Multi-session reasoning | Combine facts across sessions | 133 |
| Temporal reasoning | Reason over chronology | 133 |
| Knowledge updates | Overwrite outdated facts | 78 |
| Total | answerable + abstention safety set | 500 |

Table 1. LongMemEval-s evaluation categories and question distribution.

3.1 · Performance on LongMemEval-s

Using Gemini 3.0 Pro, Hydra DB achieves 90.79% overall - an absolute improvement of more than 5 points over the strongest competing system (85.20%) and more than 30 points over the full-context baseline (60.2%). It reaches a perfect 100% on single-session user and assistant extraction, and leads or ties every category (Multi-session Reasoning is tied with Supermemory at 76.69%).

Figure 3. Accuracy by category (series: Hydra DB, Supermemory, Zep, Full-context, Mem0-oss). Per-category accuracy on LongMemEval-s. Hydra DB (orange) leads across extraction, preference, temporal, and knowledge-update tasks.
| Category | Hydra DB | Supermemory | Zep | Full-context | Mem0-oss |
|---|---|---|---|---|---|
| Single-session (User) | 100.00 | 98.57 | 92.9 | 81.4 | 38.71 |
| Single-session (Assistant) | 100.00 | 98.21 | 80.4 | 94.6 | 8.93 |
| Single-session (Preference) | 96.67 | 70.00 | 56.7 | 20.0 | 40.00 |
| Knowledge Update | 97.43 | 89.74 | 83.3 | 78.2 | 52.56 |
| Temporal Reasoning | 90.97 | 81.95 | 62.4 | 45.1 | 25.56 |
| Multi-session Reasoning | 76.69 | 76.69 | 57.9 | 44.3 | 20.30 |
| Overall | 90.79 | 85.20 | 71.2 | 60.2 | 29.07 |

Table 2. Performance comparison on LongMemEval-s (accuracy, %). Hydra DB & Supermemory [10] on Gemini 3.0 Pro; Zep [11] & full-context on GPT-4o; Mem0-oss [12] on Gemini 3.0 Pro.
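As a quick consistency check, the overall figure can be compared with a question-weighted mean of the per-category accuracies. The sketch below assumes the overall score is computed that way and that the category counts in Table 1 map onto the Table 2 rows as shown; both are assumptions, not statements from the evaluation protocol.

```python
# Question counts from Table 1 (assumed mapping onto the Table 2 categories).
counts = {
    "single_session_user": 70,
    "single_session_assistant": 56,
    "single_session_preference": 30,
    "knowledge_update": 78,
    "temporal_reasoning": 133,
    "multi_session_reasoning": 133,
}

# Hydra DB per-category accuracy (%) from Table 2.
accuracy = {
    "single_session_user": 100.00,
    "single_session_assistant": 100.00,
    "single_session_preference": 96.67,
    "knowledge_update": 97.43,
    "temporal_reasoning": 90.97,
    "multi_session_reasoning": 76.69,
}

total_questions = sum(counts.values())
weighted_overall = sum(counts[c] * accuracy[c] for c in counts) / total_questions

print(total_questions, round(weighted_overall, 2))
```

Under these assumptions the weighted mean lands within rounding distance of the reported 90.79% overall score.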
Figure 4. Category coverage profile (series: Hydra DB, Supermemory, Zep, Full-context, Mem0-oss). Per-category coverage on LongMemEval-s. Hydra DB (orange) holds a near-maximal, low-variance profile across every axis, while baselines collapse on preference, temporal, and multi-session reasoning.

3.2 · Cross-model generalization

A central architectural hypothesis is that good context design should reduce dependence on raw model capacity. Evaluated on the compact GPT-5 Mini, Hydra DB maintains 85.80% overall - approaching the Gemini 3.0 Pro reference and matching the best competitor's flagship-model result (85.20%), while preserving near-perfect single-session extraction (98.59% user, 96.36% assistant).

Figure 5. Stability across backbone models. Across Gemini 3.0 Pro, GPT-5.2, and GPT-5 Mini, Hydra DB stays within ~6 points - the compact-model config meets the strongest competitor's flagship score. The gains come from context design, not model scale.

The intermediate-scale GPT-5.2 lands at 84.73% overall with perfect user recall (100%), and the compact GPT-5 Mini at 85.80% - even exceeding GPT-5.2 on preference extraction (93.10% vs. 89.66%). Consistency across the capacity spectrum supports the hypothesis that context quality is governed by ingestion design and temporal indexing rather than raw model capacity.

| Category | Gemini 3.0 Pro | GPT-5 Mini | GPT-5.2 |
|---|---|---|---|
| Single-session (User) | 100.00 | 98.59 | 100.00 |
| Single-session (Assistant) | 100.00 | 96.36 | 98.18 |
| Single-session (Preference) | 96.67 | 93.10 | 89.66 |
| Knowledge Update | 97.40 | 92.31 | 91.03 |
| Temporal Reasoning | 90.97 | 85.71 | 83.46 |
| Multi-session Reasoning | 76.69 | 66.37 | 64.60 |
| Overall | 90.79 | 85.80 | 84.73 |

Table 3. Hydra DB across backbone model scales on LongMemEval-s (accuracy, %) - only modest degradation as capacity decreases.
04

Conclusion

Across three independent benchmarks, the same architectural thesis holds: treating stored knowledge as a versioned, time-aware graph - rather than a flat or stateless index - yields measurable gains precisely where production agents fail today. Hydra DB reaches state-of-the-art 90.79% on LongMemEval-s, 82% on BEAM 1M, and reliably surfaces correct evidence inside dense financial filings, all while remaining robust across backbone models of very different scale.

As AI deployments operate over ever-longer interaction histories, the importance of robust long-term memory architecture will only grow. The advantage is concentrated in temporal awareness, cross-session coherence, and the retrieval of structurally related information - the capabilities Hydra DB was built to deliver.

Appendix A

Evaluation prompts

The complete prompt templates used to evaluate Hydra DB on LongMemEval-s. The pipeline has two stages: (1) answer generation from retrieved context, and (2) answer comparison with an LLM-as-a-judge using question-type-specific scoring rubrics.

A.1 · Answer generation

Each question type receives type-specific instructions inside a shared structure:

Base template
{TYPE_SPECIFIC_INSTRUCTION}

Question: {question}
Question Date: {question_date}

Context:
{retrieved_context}

Instructions:
- Answer based on the provided context
- If information is insufficient, clearly state "I don't know"
- Be direct and factual
- Do not make up information

Provide your response in the following format:
Reasoning: [Your step-by-step reasoning]
Answer: [Your final answer]

Type-specific instructions:

Single-session · user information
You are answering a question about information the USER mentioned
in a previous conversation. Provide a direct, factual answer based
on what the user stated.
Single-session · assistant information
You are answering a question about information YOU (the assistant)
provided in a previous conversation. Recall and provide the specific
information you gave to the user.
Single-session · preference extraction
You are answering a question that requires PERSONALIZATION based on
the user's preferences. Generate a response that actively utilizes
the user's stated preferences, likes, dislikes, or personal information.
Do not just state the preferences - USE them to provide a personalized
recommendation or response. The provided context might not directly
answer the question but try drawing conclusions based on the context
to generate the answer.
Multi-session reasoning
You are answering a question that requires synthesizing information
from MULTIPLE conversation sessions. Combine, aggregate, or compare
information across different sessions to form a complete answer.
Knowledge updates
You are answering a question about information that may have CHANGED
over time. Provide the MOST RECENT/CURRENT information. If there were
updates or changes, use the latest state. You may acknowledge previous
states but prioritize the current information.
Temporal reasoning
You are answering a question that involves TIME or temporal reasoning.
Pay attention to dates, timestamps, durations, and time-relative
references. Perform any necessary date/time calculations to arrive
at the answer.
Abstention (safety)
You are answering a question where the information may NOT be available
in the conversation history. If you cannot find the requested information,
clearly state that you don't know or that the information is not available.
Do NOT make up or hallucinate an answer.
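A minimal sketch of how the base template and a type-specific instruction could be combined, and how the "Reasoning / Answer" reply could be split back into fields. The function names and the sample question, date, and context are hypothetical; only the template text is taken from the listing above.

```python
BASE_TEMPLATE = """{TYPE_SPECIFIC_INSTRUCTION}

Question: {question}
Question Date: {question_date}

Context:
{retrieved_context}

Instructions:
- Answer based on the provided context
- If information is insufficient, clearly state "I don't know"
- Be direct and factual
- Do not make up information

Provide your response in the following format:
Reasoning: [Your step-by-step reasoning]
Answer: [Your final answer]"""

KNOWLEDGE_UPDATE_INSTRUCTION = (
    "You are answering a question about information that may have CHANGED "
    "over time. Provide the MOST RECENT/CURRENT information."
)


def build_prompt(instruction, question, question_date, retrieved_context):
    """Fill the shared template with a type-specific instruction and inputs."""
    return BASE_TEMPLATE.format(
        TYPE_SPECIFIC_INSTRUCTION=instruction,
        question=question,
        question_date=question_date,
        retrieved_context=retrieved_context,
    )


def parse_response(text):
    """Split a reply on its last 'Answer:' marker into reasoning and answer."""
    head, sep, tail = text.rpartition("Answer:")
    if not sep:
        return {"reasoning": text.strip(), "answer": None}
    reasoning = head.replace("Reasoning:", "", 1).strip()
    return {"reasoning": reasoning, "answer": tail.strip()}


# Sample inputs are invented for illustration.
prompt = build_prompt(
    KNOWLEDGE_UPDATE_INSTRUCTION,
    question="Where does the user live?",
    question_date="2025-01-15",
    retrieved_context="2022: lives in New York. 2024: moved to London.",
)
parsed = parse_response("Reasoning: The latest record says London.\nAnswer: London")
print(parsed["answer"])
```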

A.2 · Answer comparison (LLM-as-a-judge)

The comparison stage uses question-type-specific rubrics over a shared base template:

Judge base template
Compare the generated answer with the expected ground truth answer.

Question: {question}
{TYPE_SPECIFIC_NOTE}

Generated Answer: {generated_answer}

Expected Answer (Ground Truth): {expected_answer}

{TYPE_SPECIFIC_SCORING}

**Key Principle:** If the expected answer's core information appears
in the generated answer, mark is_correct=true and score=1.0. Minor
wording differences don't matter.

Respond ONLY with this JSON (no other text):
{
  "is_correct": <true or false>,
  "correctness_score": <0.0 to 1.0>,
  "explanation": "<one sentence explaining score>",
  "key_matches": [<list of matched facts>],
  "key_misses": [<list of missing facts from expected>]
}
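The judge's JSON reply can be validated before it is counted. The sketch below checks the five fields named in the template and aggregates accuracy as the fraction of `is_correct` verdicts; that aggregation rule and the function names are assumptions, since the template itself does not say how per-question verdicts are combined.

```python
import json

REQUIRED_FIELDS = {
    "is_correct": bool,
    "correctness_score": (int, float),
    "explanation": str,
    "key_matches": list,
    "key_misses": list,
}


def parse_judgement(raw):
    """Parse and validate one judge response; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"wrong type for field: {key}")
    score = data["correctness_score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not 0.0 <= score <= 1.0:
        raise ValueError("correctness_score must be a number in [0.0, 1.0]")
    return data


def accuracy(judgements):
    """Fraction of verdicts marked correct (assumed aggregation rule)."""
    if not judgements:
        return 0.0
    return sum(1 for j in judgements if j["is_correct"]) / len(judgements)


good = parse_judgement(
    '{"is_correct": true, "correctness_score": 1.0, '
    '"explanation": "Matches the expected city.", '
    '"key_matches": ["London"], "key_misses": []}'
)
bad = parse_judgement(
    '{"is_correct": false, "correctness_score": 0.0, '
    '"explanation": "Outdated answer.", '
    '"key_matches": [], "key_misses": ["London"]}'
)
print(accuracy([good, bad]))
```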

Default rubric · factual recall

| Condition | is_correct | score |
|---|---|---|
| Generated contains expected answer (exact or semantic match) | true | 1.0 |
| Generated contains expected answer + extra correct details | true | 1.0 |
| Generated partially correct (some key facts match) | false | 0.3–0.7 |
| Generated says "don’t know" but expected has answer | false | 0.0 |
| Generated gives wrong answer | false | 0.0 |

Table A1. Default scoring rubric for factual recall (single-session user/assistant, multi-session reasoning).

Preference personalization

| Condition | is_correct | score |
|---|---|---|
| Response satisfies the rubric and uses the user's personal info correctly | true | 1.0 |
| Response acknowledges preferences but doesn't fully utilize them | false | 0.3–0.9 |
| Response ignores user preferences entirely | false | 0.0 |
| Response contradicts user preferences | false | 0.0 |

Table A2. Preference questions score personalization, not literal matching. The response need not reflect every rubric point - only recall and use the user's info to personalize.

Knowledge update

| Condition | is_correct | score |
|---|---|---|
| Generated contains the UPDATED/CURRENT answer | true | 1.0 |
| Generated contains updated answer + mentions previous state | true | 1.0 |
| Generated ONLY contains outdated information | false | 0.0 |
| Generated confuses old and new information | false | 0.2–0.4 |

Table A3. Knowledge-update scoring prioritizes recency - correct as long as the updated answer is present, even alongside the previous state.

Temporal reasoning

| Condition | is_correct | score |
|---|---|---|
| Generated contains correct temporal answer | true | 1.0 |
| Generated has temporal errors | false | 0.0–0.9 |

Table A4. Temporal-reasoning scoring for dates, durations, and time calculations.

Abstention (safety)

| Condition | is_correct | score |
|---|---|---|
| System correctly abstains or says "I don’t know" | true | 1.0 |
| System indicates uncertainty or lack of information | true | 1.0 |
| System explicitly states the information is not available | true | 1.0 |
| System makes up an answer or hallucinates | false | 0.0 |
| System provides a confident but wrong answer | false | 0.0 |

Table A5. Abstention scoring - any honest form of "I don't know" is correct; the system must not fabricate.
Appendix B

Model configuration

For the primary evaluation results (Section 3) we used:

  • Answer generation model: Gemini 3.0 Pro
  • Judge model: Gemini 3.0 Pro
  • Temperature: model-specific optimal temperature (determined empirically)
  • JSON mode: enabled for judge responses to ensure structured output

For the robustness analysis (cross-model generalization) we additionally evaluated with:

  • GPT-5.2 (answer generation and judging)
  • GPT-5 Mini (answer generation and judging)

All evaluations used identical prompt templates with only the backbone model changed, ensuring a fair comparison across model scales.


References

  1. LLMs: Bigger Is Not Always Better. Rigoni, T. - Ampere Computing Blog (2024). amperecomputing.com
  2. Lost in the Middle: How Language Models Use Long Contexts. Liu, N.F. et al. (2023). arXiv:2307.03172
  3. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Hong, K., Troynikov, A., Huber, J. (2025). research.trychroma.com
  4. Introducing Contextual Retrieval. Ford, D. - Anthropic Engineering (2024). anthropic.com
  5. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Lewis, P. et al. (2021). arXiv:2005.11401
  6. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. Wu, D. et al. (2025). arXiv:2410.10813
  7. Evaluating Very Long-Term Conversational Memory of LLM Agents. Maharana, A. et al. (2024). arXiv:2402.17753
  8. Lies, Damn Lies, & Statistics: Is Mem0 Really SOTA in Agent Memory? Chalef, D., Rasmussen, P. (2025). blog.getzep.com
  9. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Edge, D. et al. (2024). arXiv:2404.16130
  10. Supermemory: State-of-the-Art Agent Memory on LongMemEval. Daga, S., Sreedhar, S., Shah, D. (2026). supermemory.ai/research
  11. Zep: A Temporal Knowledge Graph Architecture for Agent Memory. Rasmussen, P. et al. (2025). arXiv:2501.13956
  12. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. Chhikara, P. et al. (2025). arXiv:2504.19413
  13. BEAM: Benchmark for Evaluation of AI Memory. Tavakoli, M. et al. (2025). arXiv:2510.27246
  14. FinanceBench: A New Benchmark for Financial Question Answering. Islam, P. et al. (2023). arXiv:2311.11944

Cite this work

@techreport{hydradb2026,
  title  = {Hydra DB: Beyond Flat Embeddings for Production AI Agents},
  author = {Ratnaparkhi, Soham and Srivastava, Nishkarsh and
            Garg, Aadil and Garg, Pratham},
  institution = {Hydra DB},
  address = {San Francisco, California, USA},
  year   = {2026},
  note   = {LongMemEval-s 90.79\% (SOTA)},
  url    = {https://benchmarks.hydradb.com/HydraDB.pdf}
}