Hydra DB · Beyond Flat Embeddings for Production AI Agents

Abstract

As Large Language Models transition from ephemeral chat interfaces to persistent agentic systems, two foundational problems remain unsolved. First, agents have no continuity across sessions: every interaction begins from zero, with no awareness of prior decisions, no history of how facts have changed, and no understanding of the user they serve. Second, enterprise knowledge bases remain trapped in flat vector indexes built from fragmented chunks that reduce all knowledge to similarity scores, collapsing the relationships essential for reasoning. We present Hydra DB, a context engine that models knowledge as versioned, relational, time-aware state. Agents ingest operational history and knowledge bases; Hydra DB resolves entities, versions every transition, and preserves the full relational record of how context evolved. At its core it combines a Sliding Window Inference Pipeline for contextually self-contained ingestion with a Git-style versioned knowledge graph that makes every state transition a first-class, addressable record. On LongMemEval-s, Hydra DB reaches a state-of-the-art 90.79%, outperforming all established baselines with robust, model-agnostic performance across Gemini 3.0 Pro, GPT-5.2, and GPT-5 Mini.

Keywords: agent context engine · versioned knowledge graph · temporal reasoning · knowledge base retrieval · LongMemEval-s · SOTA

Introduction

The current trajectory of AI is defined by a shift from stateless, single-turn interactions to autonomous, multi-turn agentic sessions. While the industry has rapidly expanded LLM context windows, this "brute force" scaling introduces extremely high computational costs [1] and remains susceptible to the "lost-in-the-middle" phenomenon [2], where models recall information placed in the middle of a long context less reliably than information near its start or end.

Beyond these issues, long-running interactions suffer from context rot [3] - a gradual degradation in the usefulness of earlier context as conversations grow. As irrelevant, stale, or weakly related information accumulates, models struggle to distinguish salient facts from noise, leading to diminished recall, incorrect reasoning, and unstable behavior even before hard context limits are reached.

More fundamentally, extending the context window solves the wrong problem. An agent with a longer context window is still stateless: it has no memory of what it learned yesterday, no record of how a user's situation has changed, and no structured model of the relationships that make knowledge useful.

Standard RAG implementations fail in production for two structural reasons:

Semantic fragmentation. Naive chunking isolates facts - when an entity is introduced in one segment and updated in another, the system fails to synthesize a unified identity, producing "hallucinations of omission."
Temporal stagnation. A flat, chronology-agnostic index cannot tell an obsolete fact from a current one - a policy updated six months ago and today's version sit side by side, indistinguishable by any similarity metric.

They also flatten sentiment. "I absolutely hate React; it's a nightmare to debug" is stored as a neutral fact, losing the intensity that should shape every future recommendation.

When agent context is structured at the database level, inference quality becomes largely model-agnostic - the same architecture holds from GPT-5 Mini to Gemini 3.0 Pro.

Hydra DB approaches this as a data-modeling problem rather than a context-length problem. We replace the flat vector index of standard RAG [5] with a Composite Context protocol that fuses a Git-style temporal graph for relational integrity with a high-dimensional vector substrate for semantic breadth.

Methodology

Standard RAG treats operational history, knowledge-base entries, and structured records as a flat collection of text chunks, indexed by an embedding model and retrieved by cosine similarity alone. This fails in production because agent context is simultaneously semantic (what something means), relational (how it connects to everything else), and chronological (when it was true). Flattening all three into a similarity score produces a system that actively degrades as the knowledge base grows.

Composite Context · architecture

Sliding Window Inference

Segments enriched with surrounding context; references resolved into self-contained, independently retrievable facts.

Git-Style Versioned Graph

Updates appended as new timestamped edges rather than overwriting - every historical state preserved and addressable.

Multi-Stage Retrieval

Query expansion, hybrid dense + sparse search, graph traversal, and multi-stage reranking surface causally related context.

The Composite Context protocol replaces a flat index with a fused graph + vector substrate spanning ingestion, storage, and recall.

2.1 · Ontological structure vs. the flat-index problem

A fundamental assumption in standard RAG is that semantic proximity implies relevance. This is false in production. Two facts can be semantically distant yet causally linked (a user's career switch and their subsequent relocation), or semantically close yet factually orthogonal ("I love Python" and "I used to love Python"). Vector stores cannot distinguish these cases - they reduce all knowledge to a high-dimensional soup where the only retrieval primitive is cosine similarity. Microsoft Research has shown this baseline "struggles to connect the dots" when answers require traversing disparate information [9].

Hydra DB indexes knowledge as a typed graph of entities and relationships. Each entity - a person, project, system, preference, or decision - is a first-class node. Each relationship carries a semantic type (WORKS_AT, PREFERS, CAUSED_BY, BLOCKED_BY), a natural-language context string, and temporal metadata. This enables deterministic multi-hop traversal impossible in a flat index:

"Why is the authentication service behaving differently since last month?" - the graph traverses auth-service → user-db → migration-v2 → alice → schema-change-ticket, recovering the full causal chain without any of these hops being co-located in embedding space.

Because decision traces are encoded as structured edges rather than buried in free text, Hydra DB can synthesize conclusions from graph topology itself. A pattern such as user REJECTED cloud-vendor-A, REJECTED cloud-vendor-B, OPTIMIZES_FOR data-sovereignty lets the system infer a preference that was never explicitly stated. These graph-derived conclusions propagate as enriched signals across the whole pipeline - the more the graph is traversed, the more latent structure it surfaces.
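As a rough illustration of the multi-hop traversal described above, the following sketch stores typed edges in a plain adjacency map and runs a bounded breadth-first search. It is not Hydra DB's implementation: the relation names `DEPENDS_ON`, `CHANGED_BY`, `AUTHORED_BY`, `FILED`, and `OWNED_BY`, the `platform-team` node, and the helpers `build_adjacency` and `find_path` are all hypothetical; only the node names along the main chain come from the example in the text.

```python
from collections import deque

# Hypothetical typed edges (source, relation, target). The relation names
# and the "platform-team" node are invented for this sketch.
EDGES = [
    ("auth-service", "DEPENDS_ON", "user-db"),
    ("user-db", "CHANGED_BY", "migration-v2"),
    ("migration-v2", "AUTHORED_BY", "alice"),
    ("alice", "FILED", "schema-change-ticket"),
    ("auth-service", "OWNED_BY", "platform-team"),
]


def build_adjacency(edges):
    """Group outgoing (relation, target) pairs by source node."""
    adjacency = {}
    for source, relation, target in edges:
        adjacency.setdefault(source, []).append((relation, target))
    return adjacency


def find_path(adjacency, start, goal, max_hops):
    """Breadth-first search for the shortest typed path of at most max_hops edges."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:
            continue  # hop budget exhausted on this branch
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


adjacency = build_adjacency(EDGES)
path = find_path(adjacency, "auth-service", "schema-change-ticket", max_hops=4)
for source, relation, target in path:
    print(f"{source} -[{relation}]-> {target}")
```

The hop bound keeps traversal cost predictable; with `max_hops=3` the same query returns `None` because the causal chain needs four edges.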

2.2 · Temporally-aware context graph

Standard RAG suffers the State Confusion Problem. A user says in 2022, "I live in New York because I work at startup XYZ." In 2024: "I live in London because I now work at Meta; I moved to be closer to my parents." A vector store either retrieves both without knowing which is current, or overwrites the earlier fact - losing the timeline and the reasoning behind it. Hydra DB treats the knowledge graph as an immutable, append-only ledger, like a Git commit history where every transition is a versioned, addressable commit.

Challenge · destructive updates

Iterative resolution loops vector-search every incoming chunk and ask an LLM whether to overwrite. Semantic similarity ≠ factual redundancy, so this causes false-positive deletes that purge history - and triggering a reasoning step per chunk is an O(N) latency trap that doesn't scale.

Solution · append-only log

Hydra DB never mutates. Moving from "NYC" to "London" commits a new edge with fresh temporal metadata. Zero data loss, and the agent gains a temporally-aware decision tree it can query: "Where did I live last year, and why did I move?"

A relationship between entities u and v is not a single static edge but a time-ordered sequence of state changes - versioned commits. Each edge is a tuple:

e_k = ( r_k, t_commit, t_valid, C_meta )

(1)

where r_k is the semantic relation, t_commit the ingestion time, t_valid the real-world validity ("in 1999"), and C_meta the contextual metadata. When a fact changes, a new edge e_k is appended rather than overwriting e_{k−1}. The current relational state is the most recent commit valid at query time:

ΔState(u,v) = SortByTime( E(u,v) ), t ≤ t_now

(2)

Figure 1. Temporal-state graph topology. Each subject relation is a commit chain. Superseded states (greyed) are preserved, not deleted - enabling differential reasoning over how and why context changed.
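The append-only behaviour of Equations (1) and (2) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's storage layer: the class names `EdgeCommit` and `EdgeHistory`, the `target` field, the `LIVES_IN` relation, and the rule of breaking ties by `t_commit` are all hypothetical additions for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class EdgeCommit:
    relation: str        # r_k, the semantic relation
    target: str          # hypothetical: the entity the edge points at
    t_commit: datetime   # ingestion time
    t_valid: datetime    # start of real-world validity
    meta: str            # C_meta, free-text context


@dataclass
class EdgeHistory:
    commits: list = field(default_factory=list)

    def append(self, commit):
        # Append-only: earlier commits are never mutated or removed.
        self.commits.append(commit)

    def state_at(self, t_query):
        """Most recent commit whose validity starts at or before t_query."""
        valid = [c for c in self.commits if c.t_valid <= t_query]
        if not valid:
            return None
        return max(valid, key=lambda c: (c.t_valid, c.t_commit))


history = EdgeHistory()
history.append(EdgeCommit("LIVES_IN", "New York", datetime(2022, 1, 10),
                          datetime(2022, 1, 1), "works at startup XYZ"))
history.append(EdgeCommit("LIVES_IN", "London", datetime(2024, 5, 2),
                          datetime(2024, 5, 1), "works at Meta; closer to parents"))

print(history.state_at(datetime(2023, 6, 1)).target)
print(history.state_at(datetime(2025, 1, 1)).target)
```

Because both commits survive, a question such as "Where did I live last year, and why did I move?" can be answered from `meta` on the earlier and later commits rather than from a single overwritten value.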

2.3 · Sliding Window Inference Pipeline

Standard chunking creates blind segments, where a chunk loses dependencies on its neighbors.

Challenge · orphaned pronouns

Recursive character splitting rendered nearly 40% of chunks semantically invisible. "I hate that framework" is useless if "React" was named a few chunks earlier - vector search can never map "that framework" to "React." Larger overlap windows only inflated token cost.

Solution · window enrichment

Each segment is enriched against a lookback/lookahead window by a lightweight model that resolves references and extracts persistent preferences, producing a self-contained chunk.

We partition a session into base segments and build a context window W_i with horizons h_prev and h_next, then enrich via a transformation f_θ:

c′_i = f_θ( s_i | W_i ) = { T_res, P_map, s_i }

(5)

Figure 2. Sliding-window enrichment. Input segments s_i−5 … s_i ("User: Marine Biologist" … "I moved to the office.") pass through f_θ over window W_i, which performs entity resolution (T_res) and preference mapping (P_map), producing the self-contained chunk c′_i: "The user (Marine Biologist) moved to the office." The enriched chunk embeds resolved entities so it is independently retrievable, even when the original statement was ambiguous.
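The windowing step can be sketched as follows. The real f_θ is a lightweight model; here it is replaced by a toy string-substitution resolver so the example stays runnable, and preference mapping (P_map) is omitted. The names `build_windows`, `toy_resolver`, `enrich`, and `KNOWN_FRAMEWORKS` are hypothetical.

```python
KNOWN_FRAMEWORKS = ("React", "Vue", "Angular")  # hypothetical entity list


def build_windows(segments, h_prev, h_next):
    """Pair each base segment s_i with its window W_i of neighbouring segments."""
    windows = []
    for i, segment in enumerate(segments):
        start = max(0, i - h_prev)
        end = min(len(segments), i + h_next + 1)
        windows.append((segment, segments[start:end]))
    return windows


def toy_resolver(segment, window):
    """Stand-in for f_theta: replace 'that framework' with the last framework named in the window."""
    mentioned = [name for text in window for name in KNOWN_FRAMEWORKS if name in text]
    if "that framework" in segment and mentioned:
        return segment.replace("that framework", mentioned[-1])
    return segment


def enrich(segments, h_prev, h_next):
    """Return one enriched record per segment, keeping the original text alongside."""
    return [
        {"T_res": toy_resolver(segment, window), "s_i": segment}
        for segment, window in build_windows(segments, h_prev, h_next)
    ]


session = [
    "I have been building the dashboard in React.",
    "The deadline is Friday.",
    "I hate that framework.",
]
enriched = enrich(session, h_prev=2, h_next=0)
print(enriched[2]["T_res"])
```

With `h_prev=2` the third segment sees the first one and resolves to "I hate React."; with `h_prev=1` the mention falls outside the window and the segment is left unchanged, which is the blind-segment failure the pipeline is meant to avoid.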

2.4 · Bio-mimetic context consolidation

The assumption that "more data is better" breaks down: unbounded growth causes retrieval latency and semantic drift, where outdated records surface ahead of current context. We are experimenting with a Bio-Mimetic Decay Engine inspired by synaptic pruning and the Ebbinghaus forgetting curve, augmented with reinforcement. A record's retention score combines initial salience, temporal decay, and a reinforcement boost on each successful retrieval:

R(m,t) = I_salience · e^(−λΔt) + σ · Σ_i 1 / ( t − t_access_i )

(6)

High-impact facts (a medical allergy) receive higher salience than low-impact facts (a coffee order). Each retrieval resets and elevates the decay curve, so high-signal records resist eviction regardless of age. Records that fall below threshold demote through a tiered storage architecture before eventual eviction.
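A small numeric sketch of the retention score follows. It assumes one particular reading of Equation (6): Δt is the record's age, and the reinforcement term sums the reciprocal of the time elapsed since each past retrieval. The function name, parameter values, and time units are hypothetical.

```python
import math


def retention_score(salience, decay_rate, age, sigma, access_gaps):
    """R(m, t) = I_salience * exp(-lambda * dt) + sigma * sum(1 / (t - t_access_i)).

    access_gaps holds (t - t_access_i) for each past retrieval; non-positive
    gaps are skipped to avoid division by zero.
    """
    decay = salience * math.exp(-decay_rate * age)
    reinforcement = sigma * sum(1.0 / gap for gap in access_gaps if gap > 0)
    return decay + reinforcement


# Same age, different initial salience (values are illustrative only).
allergy = retention_score(salience=1.0, decay_rate=0.05, age=30, sigma=0.5, access_gaps=[])
coffee = retention_score(salience=0.2, decay_rate=0.05, age=30, sigma=0.5, access_gaps=[])

# A recent retrieval lifts an otherwise decayed record.
coffee_reinforced = retention_score(0.2, 0.05, 30, 0.5, access_gaps=[1.0])

print(allergy > coffee, coffee_reinforced > coffee)
```

Under this reading a recent retrieval (small gap) contributes far more than an old one, which matches the description that each retrieval elevates the decay curve.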

2.5 · The high-dimensional vector substrate

While the graph maintains relational integrity, semantic recall relies on a multi-field hybrid schema. For every record we index three representations: raw content (v_content), sparse keywords (v_sparse), and latent context (v_latent). Crucially, v_latent is the vectorization of the enriched output from Section 2.3 - so resolved dependencies are physically embedded into the search space.

Challenge · vocabulary mismatch

A user asks "Why is the app behaving strangely?" but the relevant record says "Error 503: Service Unavailable." Standard embeddings place these far apart - no lexical or immediate semantic overlap.

Solution · latent semantic bridging

By embedding the contextual implications of a chunk rather than its raw text, we pre-compute the answer - letting an abstract query latch onto the meaning of an event even when its literal description is obscure.

2.6 · Recall pipeline

At query time, Hydra DB runs a multi-stage pipeline that combines hybrid semantic search with the versioned graph. The query is treated as a semantic seed, expanded into diverse reformulations, scored across complementary signal paths, then fused and reranked.

Recall pipeline · five stages

Adaptive query expansion

The query is a semantic seed: Φ(q) projects it into N diverse reformulations, run in parallel.

Φ(q) → N queries | paraphrase | temporal concretization

Weighted hybrid search

Weighted rank fusion at the database level over three complementary signal paths.

Dense · v_content | Dense · v_inferred | BM25 · v_sparse

Entity-based graph search

Query entities are matched, then traversed over bounded variable-length paths.

entity match | path *1..n | cross-encoder rerank

Chunk-level graph expansion

Pre-linked entities on each vector chunk expand into their graph neighborhood N(c).

pre-linked E(c) | neighborhood N(c)

Triple-tier reranking & fusion

Three reranked streams merge into the final context window.

graph-vector fusion | TopK merge

Query expansion, weighted hybrid retrieval, entity-based graph search, chunk-level expansion, and multi-stream fusion run as a single staged pipeline.

Adaptive query expansion (multi-query)

Hydra DB treats the user query q as a semantic seed rather than a fixed string, applying an LLM-based projection function Φ(q) to generate N semantically diverse reformulations:

Q′ = { q₁, q₂, …, q_N }

Each captures a distinct interpretation of intent - paraphrases, temporal concretizations, domain-specific restatements. "What did I do last week?" may expand to:

"Projects worked on in the last 7 days"
"Commits pushed during the previous week"
"Meetings or tasks completed last week"

All expansions execute in parallel, ensuring high recall even when relevant records differ significantly in surface phrasing from the original query.
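The fan-out pattern can be sketched with the standard library. In this illustration the LLM projection Φ(q) is replaced by a function that returns the three reformulations quoted above, and retrieval is a toy keyword match over an invented three-record corpus; `CORPUS`, `expand_query`, `search`, and `multi_query_recall` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented records for the example.
CORPUS = {
    "r1": "Commits pushed to the billing repo",
    "r2": "Meetings with the design team completed",
    "r3": "Lunch order history",
}


def expand_query(query):
    """Stand-in for the LLM projection: returns fixed reformulations."""
    return [
        "Projects worked on in the last 7 days",
        "Commits pushed during the previous week",
        "Meetings or tasks completed last week",
    ]


def search(query):
    """Toy retrieval: match records sharing any word longer than four characters."""
    terms = {word.lower() for word in query.split() if len(word) > 4}
    return [
        record_id
        for record_id, text in CORPUS.items()
        if terms & {word.lower() for word in text.split()}
    ]


def multi_query_recall(query):
    """Run the original query and its expansions in parallel, then merge without duplicates."""
    queries = [query] + expand_query(query)
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(search, queries))  # map preserves input order
    merged = []
    for results in result_lists:
        for record_id in results:
            if record_id not in merged:
                merged.append(record_id)
    return merged


print(search("What did I do last week?"))
print(multi_query_recall("What did I do last week?"))
```

The original phrasing matches nothing in this corpus, while the expanded set recovers two records, which is the recall gain the expansion step is aiming for.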

Weighted hybrid search · the retrieval equation

Unlike systems that rely on cosine similarity alone, Hydra DB performs weighted rank fusion at the database level, combining three complementary signal paths - primary dense, secondary (inferred) dense, and sparse lexical:

S_retrieval(q,c) = x·sim(q, v_content) + y·sim(q, v_inferred) + α·BM25(q, v_sparse)

(7)

The primary dense signal captures direct semantic similarity; the secondary dense signal captures implicit meaning not explicitly stated; and the sparse signal ensures rare but critical tokens - project IDs, issue numbers, usernames - strongly influence retrieval, preventing drift toward loosely related but incorrect records.
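Equation (7) can be worked through on toy vectors. This sketch assumes the BM25 score has already been computed and normalized to [0, 1] by the search backend; the weights x, y, and α shown are placeholders, not tuned values from the system, and `cosine` and `retrieval_score` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_score(q_vec, record, bm25_score, x=0.5, y=0.3, alpha=0.2):
    """S_retrieval = x*sim(q, v_content) + y*sim(q, v_inferred) + alpha*BM25."""
    return (
        x * cosine(q_vec, record["v_content"])
        + y * cosine(q_vec, record["v_inferred"])
        + alpha * bm25_score
    )


query_vec = [1.0, 0.0]
record = {"v_content": [1.0, 0.0], "v_inferred": [0.0, 1.0]}

# Exact lexical hit (normalized BM25 = 1.0) versus no lexical overlap.
with_keyword = retrieval_score(query_vec, record, bm25_score=1.0)
without_keyword = retrieval_score(query_vec, record, bm25_score=0.0)
print(with_keyword, without_keyword)
```

Holding the dense signals fixed, the sparse term alone moves the score from 0.5 to 0.7 here, illustrating how a rare token such as an issue number can separate otherwise similar candidates.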

Graph-augmented retrieval · entity-based search

In parallel with hybrid vector retrieval, Hydra DB runs a graph pass over the versioned context graph. Entities E are extracted from q; exact name matching is followed by bounded, variable-length path traversal:

P_graph = Path( E_start →^*1..n E_end )

(8)

For each path, structured context is built by concatenating node, relation, and time:

context_graph(p) = concat( node_name, relation_context, temporal_details )

(9)

and reranked by a cross-encoder applied to the query and that context:

S_graph(p) = S_semantic( q, context_graph(p) )

(10)

This captures relational dependencies and temporal sequences absent from any single text chunk - e.g. "Project A is blocked by Issue B."
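Equations (9) and (10) amount to serializing each path and scoring it against the query. The sketch below uses a word-overlap function as a stand-in for the cross-encoder, purely so it runs without a model; the path contents, separators, and the names `context_graph`, `overlap_score`, and `rerank_paths` are hypothetical.

```python
def context_graph(path):
    """Concatenate node name, relation context, and temporal details for every hop."""
    hops = [
        f"{node_name} | {relation_context} | {temporal_details}"
        for node_name, relation_context, temporal_details in path
    ]
    return " -> ".join(hops)


def overlap_score(query, context):
    """Stand-in for the cross-encoder: fraction of query words present in the context."""
    query_words = set(query.lower().split())
    context_words = set(context.lower().split())
    return len(query_words & context_words) / len(query_words) if query_words else 0.0


def rerank_paths(query, paths):
    """Order candidate paths by S_graph(p) = S_semantic(q, context_graph(p))."""
    return sorted(
        paths,
        key=lambda p: overlap_score(query, context_graph(p)),
        reverse=True,
    )


blocked = [("Project A", "is blocked by Issue B", "since 2024-03")]
owned = [("Project C", "is owned by Dana", "since 2023-01")]

ranked = rerank_paths("what is blocking project a", [owned, blocked])
print(context_graph(ranked[0]))
```

In a real deployment the overlap function would be swapped for a learned cross-encoder; the serialization step is what lets a text-only scorer see relation and time information at all.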

Chunk-level graph expansion

Beyond query-anchored entity search, a second-stage expansion avoids post-hoc entity extraction from vector results. During ingestion each chunk c is pre-linked to its entities E(c); at retrieval the system explores their adjacent neighborhoods to depth n:

𝒩(c) = ⋃_{e ∈ E(c)} Path( e ^*1..n )

(11)

Each expanded path becomes structured context and is reranked independently:

context_expansion(p) = concat( node_name, relation_context, temporal_details )

(12)

S_expansion(p) = S_semantic( q, context_expansion(p) )

(13)

This recovers implicit relational context that is semantically adjacent to high-confidence vector chunks but was not surfaced by query-entity matching - and it happens before context assembly, eliminating query-time entity extraction.
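The neighborhood in Equation (11) can be enumerated directly when entities are pre-linked to chunks. The following is a minimal sketch over an untyped adjacency map; the graph contents and the helpers `neighborhood` and `expand_chunk` are hypothetical.

```python
def neighborhood(adjacency, entity, depth):
    """All simple paths of length 1..depth starting at entity."""
    paths = []
    frontier = [[entity]]
    for _ in range(depth):
        next_frontier = []
        for path in frontier:
            for neighbor in adjacency.get(path[-1], []):
                if neighbor not in path:  # avoid cycles
                    extended = path + [neighbor]
                    paths.append(tuple(extended))
                    next_frontier.append(extended)
        frontier = next_frontier
    return paths


def expand_chunk(chunk_entities, adjacency, depth):
    """N(c): union of the neighborhoods of every entity pre-linked to the chunk."""
    expanded = set()
    for entity in chunk_entities:
        expanded.update(neighborhood(adjacency, entity, depth))
    return expanded


ADJACENCY = {
    "project-a": ["issue-b"],
    "issue-b": ["alice"],
    "alice": [],
}
# Entities linked at ingestion time, so no extraction is needed at query time.
PRE_LINKED = {"chunk-1": ["project-a"]}

paths = expand_chunk(PRE_LINKED["chunk-1"], ADJACENCY, depth=2)
print(sorted(paths))
```

Looking up `PRE_LINKED` is a dictionary access, which is why this stage can skip the entity-extraction call that a post-hoc approach would need for every retrieved chunk.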

Triple-tier reranking with graph-vector fusion

The final window fuses three independently reranked streams: vector-retrieved chunks, query entity-matched graph paths, and chunk-expansion graph paths. For the vector stream:

S^vs_rerank(c) = γ·S_semantic(c) + (1−γ)·S_lexical(c)

(14)

S^vs_final(c) = β·S_vs(c) + (1−β)·S^vs_rerank(c)

(15)

Graph candidates are already reranked; the final context window merges chunk-expansion results with their vector chunks and presents entity-based graph results separately:

𝒞_final = TopK₁( C^final_vs ⊕ C_expansion, k₁ ) ∪ TopK₂( C_graph, k₂ )

(16)

where ⊕ attaches each chunk's expansion context 𝒩(c), and k₁/k₂ control the number of merged vector-expansion pairs and independent graph paths. Combining these stages, Hydra DB retrieves not merely similar text but the correct factual and relational state - the architecture directly underpinning the 90.79% result on LongMemEval-s.
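Equations (14) to (16) can be traced on toy scores. The sketch below is illustrative only: γ, β, the candidate scores, and the dictionary layout are placeholders, and `vector_final_score` and `assemble_context` are hypothetical names.

```python
def vector_final_score(s_vs, s_semantic, s_lexical, gamma=0.5, beta=0.5):
    """Eq. (14) then Eq. (15): blend rerank signals, then blend with the search score."""
    s_rerank = gamma * s_semantic + (1 - gamma) * s_lexical
    return beta * s_vs + (1 - beta) * s_rerank


def assemble_context(vector_chunks, expansions, graph_paths, k1, k2):
    """Eq. (16): top-k1 chunks with their expansion context, plus top-k2 graph paths."""
    top_chunks = sorted(vector_chunks, key=lambda c: c["score"], reverse=True)[:k1]
    merged = [
        {"chunk": chunk["id"], "expansion": expansions.get(chunk["id"], [])}
        for chunk in top_chunks
    ]
    top_paths = sorted(graph_paths, key=lambda p: p["score"], reverse=True)[:k2]
    return {"chunks": merged, "graph": top_paths}


vector_chunks = [
    {"id": "c1", "score": vector_final_score(0.8, 0.6, 0.2)},
    {"id": "c2", "score": vector_final_score(0.3, 0.2, 0.1)},
]
expansions = {"c1": ["project-a -> issue-b"]}
graph_paths = [
    {"path": "alice -> migration-v2", "score": 0.7},
    {"path": "dana -> project-c", "score": 0.1},
]

context = assemble_context(vector_chunks, expansions, graph_paths, k1=1, k2=1)
print(context)
```

Keeping the graph stream in its own slot, rather than interleaving it with chunks, mirrors the union in Equation (16): entity-matched paths are presented separately from chunk-plus-expansion pairs.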

Results

We evaluate on LongMemEval-s [6], a benchmark for long-term interactive memory spanning 500 answerable questions across six capability categories. Gemini 3.0 Pro serves as the primary model and LLM-as-a-judge; we additionally evaluate on GPT-5.2 and GPT-5 Mini to demonstrate model-agnostic performance.

The LongMemEval-s variant comprises 500 question-conversation stacks averaging over 115,000 tokens each (roughly 50 continuous sessions). We chose it over LoCoMo [7], whose 16k–26k average length does not stress the lost-in-the-middle regime of production histories. Data is ingested session-by-session to mimic asynchronous agent workflows, and judging uses strict question-specific prompting (see Appendix A).

Category	What it tests	# Q
Single-session extraction	Recall explicit facts amid noise	70
Single-session preference	Retain preferences in a session	30
Single-session assistant	Recall assistant-introduced facts	56
Multi-session reasoning	Combine facts across sessions	133
Temporal reasoning	Reason over chronology	133
Knowledge updates	Overwrite outdated facts	78
Total answerable	+ abstention safety set	500

Table 1. LongMemEval-s evaluation categories and question distribution.

3.1 · Performance on LongMemEval-s

Using Gemini 3.0 Pro, Hydra DB achieves 90.79% overall - a +5.59 point absolute improvement over the strongest competing system (Supermemory, 85.20%) and a +30.59 point gain over the full-context baseline (60.2%, run on GPT-4o). It reaches a perfect 100% on single-session user and assistant extraction, and leads every category except multi-session reasoning, where it ties Supermemory at 76.69%.

Figure 3. Accuracy by category (series: Hydra DB, Supermemory, Zep, Full-context, Mem0-oss). Per-category accuracy on LongMemEval-s. Hydra DB (orange) leads across extraction, preference, temporal, and knowledge-update tasks.

Category	Hydra DB	Supermemory	Zep	Full-context	Mem0-oss
Single-session (User)	100.00	98.57	92.9	81.4	38.71
Single-session (Assistant)	100.00	98.21	80.4	94.6	8.93
Single-session (Preference)	96.67	70.00	56.7	20.0	40.00
Knowledge Update	97.43	89.74	83.3	78.2	52.56
Temporal Reasoning	90.97	81.95	62.4	45.1	25.56
Multi-session Reasoning	76.69	76.69	57.9	44.3	20.30
Overall	90.79	85.20	71.2	60.2	29.07

Table 2. Performance comparison on LongMemEval-s. Hydra DB and Supermemory [10] on Gemini 3.0 Pro; Zep [11] and full-context on GPT-4o; Mem0-oss [12] on Gemini 3.0 Pro.
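As a quick arithmetic check, the overall figure can be reproduced as a question-weighted mean of the per-category accuracies. The sketch assumes the category counts in Table 1 map onto the rows of Table 2 as shown in the comments (for example, "Single-session extraction" with 70 questions corresponds to "Single-session (User)"); that mapping and the function name are assumptions, not statements from the benchmark.

```python
# Question counts from Table 1 (assumed mapping to Table 2 row names).
QUESTION_COUNTS = {
    "single_session_user": 70,        # "Single-session extraction"
    "single_session_assistant": 56,
    "single_session_preference": 30,
    "knowledge_update": 78,
    "temporal_reasoning": 133,
    "multi_session_reasoning": 133,
}

# Hydra DB column of Table 2 (percent).
HYDRA_ACCURACY = {
    "single_session_user": 100.00,
    "single_session_assistant": 100.00,
    "single_session_preference": 96.67,
    "knowledge_update": 97.43,
    "temporal_reasoning": 90.97,
    "multi_session_reasoning": 76.69,
}


def weighted_overall(accuracy, counts):
    """Question-weighted mean accuracy across categories."""
    total = sum(counts.values())
    return sum(accuracy[name] * counts[name] for name in counts) / total


overall = weighted_overall(HYDRA_ACCURACY, QUESTION_COUNTS)
print(round(overall, 2))
```

The weighted mean lands within rounding of the reported 90.79%, which suggests the overall score is a per-question average rather than an unweighted mean over categories.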

Figure 4. Category coverage profile (series: Hydra DB, Supermemory, Zep, Full-context, Mem0-oss). Per-category coverage on LongMemEval-s. Hydra DB (orange) holds a near-maximal, low-variance profile across every axis, while baselines collapse on preference, temporal, and multi-session reasoning.

3.2 · Cross-model generalization

A central architectural hypothesis is that good context design should reduce dependence on raw model capacity. Evaluated on the compact GPT-5 Mini, Hydra DB maintains 85.80% overall - approaching the Gemini 3.0 Pro reference and slightly exceeding the best competitor's flagship-model result (85.20%), while preserving near-perfect single-session extraction (98.59% user, 96.36% assistant).

Figure 5. Stability across backbone models. Across Gemini 3.0 Pro, GPT-5.2, and GPT-5 Mini, Hydra DB stays within ~6 points - the compact-model config meets the strongest competitor's flagship score. The gains come from context design, not model scale.

The intermediate-scale GPT-5.2 lands at 84.73% overall with perfect user recall (100%), and the compact GPT-5 Mini at 85.80% - even exceeding GPT-5.2 on preference extraction (93.10% vs. 89.66%). Consistency across the capacity spectrum indicates that context quality is governed by ingestion design and temporal indexing, not raw model capacity.

Category	Gemini 3.0 Pro	GPT-5 Mini	GPT-5.2
Single-session (User)	100.00	98.59	100.00
Single-session (Assistant)	100.00	96.36	98.18
Single-session (Preference)	96.67	93.10	89.66
Knowledge Update	97.43	92.31	91.03
Temporal Reasoning	90.97	85.71	83.46
Multi-session Reasoning	76.69	66.37	64.60
Overall	90.79	85.80	84.73

Table 3. Hydra DB across backbone model scales on LongMemEval-s - only modest degradation as capacity decreases.

Conclusion

Across three independent benchmarks, the same architectural thesis holds: treating stored knowledge as a versioned, time-aware graph - rather than a flat or stateless index - yields measurable gains precisely where production agents fail today. Hydra DB reaches state-of-the-art 90.79% on LongMemEval-s, 82% on BEAM 1M, and reliably surfaces correct evidence inside dense financial filings, all while remaining robust across backbone models of very different scale.

As AI deployments operate over ever-longer interaction histories, the importance of robust long-term memory architecture will only grow. The advantage is concentrated in temporal awareness, cross-session coherence, and the retrieval of structurally related information - the capabilities Hydra DB was built to deliver.

Appendix A

Evaluation prompts

The complete prompt templates used to evaluate Hydra DB on LongMemEval-s. The pipeline has two stages: (1) answer generation from retrieved context, and (2) answer comparison with an LLM-as-a-judge using question-type-specific scoring rubrics.

A.1 · Answer generation

Each question type receives type-specific instructions inside a shared structure:

Base template

{TYPE_SPECIFIC_INSTRUCTION}

Question: {question}
Question Date: {question_date}

Context:
{retrieved_context}

Instructions:
- Answer based on the provided context
- If information is insufficient, clearly state "I don't know"
- Be direct and factual
- Do not make up information

Provide your response in the following format:
Reasoning: [Your step-by-step reasoning]
Answer: [Your final answer]

Type-specific instructions:

Single-session · user information

You are answering a question about information the USER mentioned
in a previous conversation. Provide a direct, factual answer based
on what the user stated.

Single-session · assistant information

You are answering a question about information YOU (the assistant)
provided in a previous conversation. Recall and provide the specific
information you gave to the user.

Single-session · preference extraction

You are answering a question that requires PERSONALIZATION based on
the user's preferences. Generate a response that actively utilizes
the user's stated preferences, likes, dislikes, or personal information.
Do not just state the preferences - USE them to provide a personalized
recommendation or response. The provided context might not directly
answer the question but try drawing conclusions based on the context
to generate the answer.

Multi-session reasoning

You are answering a question that requires synthesizing information
from MULTIPLE conversation sessions. Combine, aggregate, or compare
information across different sessions to form a complete answer.

Knowledge updates

You are answering a question about information that may have CHANGED
over time. Provide the MOST RECENT/CURRENT information. If there were
updates or changes, use the latest state. You may acknowledge previous
states but prioritize the current information.

Temporal reasoning

You are answering a question that involves TIME or temporal reasoning.
Pay attention to dates, timestamps, durations, and time-relative
references. Perform any necessary date/time calculations to arrive
at the answer.

Abstention (safety)

You are answering a question where the information may NOT be available
in the conversation history. If you cannot find the requested information,
clearly state that you don't know or that the information is not available.
Do NOT make up or hallucinate an answer.

A.2 · Answer comparison (LLM-as-a-judge)

The comparison stage uses question-type-specific rubrics over a shared base template:

Judge base template

Compare the generated answer with the expected ground truth answer.

Question: {question}
{TYPE_SPECIFIC_NOTE}

Generated Answer: {generated_answer}

Expected Answer (Ground Truth): {expected_answer}

{TYPE_SPECIFIC_SCORING}

**Key Principle:** If the expected answer's core information appears
in the generated answer, mark is_correct=true and score=1.0. Minor
wording differences don't matter.

Respond ONLY with this JSON (no other text):
{
  "is_correct": <true or false>,
  "correctness_score": <0.0 to 1.0>,
  "explanation": "<one sentence explaining score>",
  "key_matches": [<list of matched facts>],
  "key_misses": [<list of missing facts from expected>]
}
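Because the judge is asked to respond with JSON only, its output can be checked mechanically before scores are aggregated. The following is one possible validation sketch, not the evaluation harness used for the reported results; `REQUIRED_FIELDS` and `parse_judgement` are hypothetical names, while the field names come from the template above.

```python
import json

# Field names follow the judge template; the type checks are an assumption.
REQUIRED_FIELDS = {
    "is_correct": bool,
    "correctness_score": (int, float),
    "explanation": str,
    "key_matches": list,
    "key_misses": list,
}


def parse_judgement(raw):
    """Parse a judge response and reject malformed or out-of-range verdicts."""
    data = json.loads(raw)
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"wrong type for field: {name}")
    score = data["correctness_score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not 0.0 <= score <= 1.0:
        raise ValueError("correctness_score must be a number in [0.0, 1.0]")
    return data


raw_response = """{
  "is_correct": true,
  "correctness_score": 1.0,
  "explanation": "The generated answer names London, matching the expected answer.",
  "key_matches": ["London"],
  "key_misses": []
}"""

verdict = parse_judgement(raw_response)
print(verdict["is_correct"], verdict["correctness_score"])
```

Rejecting malformed verdicts up front keeps a single unparsable judge response from being silently counted as either correct or incorrect.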

Default rubric · factual recall

Condition	is_correct	score
Generated contains expected answer (exact or semantic match)	true	1.0
Generated contains expected answer + extra correct details	true	1.0
Generated partially correct (some key facts match)	false	0.3–0.7
Generated says "don’t know" but expected has answer	false	0.0
Generated gives wrong answer	false	0.0

Table A1. Default scoring rubric for factual recall (single-session user/assistant, multi-session reasoning).

Preference personalization

Condition	is_correct	score
Response satisfies the rubric and uses the user's personal info correctly	true	1.0
Response acknowledges preferences but doesn't fully utilize them	false	0.3–0.9
Response ignores user preferences entirely	false	0.0
Response contradicts user preferences	false	0.0

Table A2. Preference questions score personalization, not literal matching. The response need not reflect every rubric point - only recall and use the user's info to personalize.

Knowledge update

Condition	is_correct	score
Generated contains the UPDATED/CURRENT answer	true	1.0
Generated contains updated answer + mentions previous state	true	1.0
Generated ONLY contains outdated information	false	0.0
Generated confuses old and new information	false	0.2–0.4

Table A3. Knowledge-update scoring prioritizes recency - correct as long as the updated answer is present, even alongside the previous state.

Temporal reasoning

Condition	is_correct	score
Generated contains correct temporal answer	true	1.0
Generated has temporal errors	false	0.0–0.9

Table A4. Temporal-reasoning scoring for dates, durations, and time calculations.

Abstention (safety)

Condition	is_correct	score
System correctly abstains or says "I don’t know"	true	1.0
System indicates uncertainty or lack of information	true	1.0
System explicitly states the information is not available	true	1.0
System makes up an answer or hallucinates	false	0.0
System provides a confident but wrong answer	false	0.0

Table A5. Abstention scoring - any honest form of "I don't know" is correct; the system must not fabricate.

Appendix B

Model configuration

For the primary evaluation results (Section 3) we used:

Answer generation model: Gemini 3.0 Pro
Judge model: Gemini 3.0 Pro
Temperature: model-specific optimal temperature (determined empirically)
JSON mode: enabled for judge responses to ensure structured output

For the robustness analysis (cross-model generalization) we additionally evaluated with:

GPT-5.2 (answer generation and judging)
GPT-5 Mini (answer generation and judging)

All evaluations used identical prompt templates with only the backbone model changed, ensuring a fair comparison across model scales.

References

[1] LLMs: Bigger Is Not Always Better. Rigoni, T. - Ampere Computing Blog (2024). amperecomputing.com
[2] Lost in the Middle: How Language Models Use Long Contexts. Liu, N.F. et al. (2023). arXiv:2307.03172
[3] Context Rot: How Increasing Input Tokens Impacts LLM Performance. Hong, K., Troynikov, A., Huber, J. (2025). research.trychroma.com
[4] Introducing Contextual Retrieval. Ford, D. - Anthropic Engineering (2024). anthropic.com
[5] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Lewis, P. et al. (2021). arXiv:2005.11401
[6] LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. Wu, D. et al. (2025). arXiv:2410.10813
[7] Evaluating Very Long-Term Conversational Memory of LLM Agents. Maharana, A. et al. (2024). arXiv:2402.17753
[8] Lies, Damn Lies, & Statistics: Is Mem0 Really SOTA in Agent Memory? Chalef, D., Rasmussen, P. (2025). blog.getzep.com
[9] From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Edge, D. et al. (2024). arXiv:2404.16130
[10] Supermemory: State-of-the-Art Agent Memory on LongMemEval. Daga, S., Sreedhar, S., Shah, D. (2026). supermemory.ai/research
[11] Zep: A Temporal Knowledge Graph Architecture for Agent Memory. Rasmussen, P. et al. (2025). arXiv:2501.13956
[12] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. Chhikara, P. et al. (2025). arXiv:2504.19413
[13] BEAM: Benchmark for Evaluation of AI Memory. Tavakoli, M. et al. (2025). arXiv:2510.27246
[14] FinanceBench: A New Benchmark for Financial Question Answering. Islam, P. et al. (2023). arXiv:2311.11944

Cite this work

@techreport{hydradb2026,
  title  = {Hydra DB: Beyond Flat Embeddings for Production AI Agents},
  author = {Ratnaparkhi, Soham and Srivastava, Nishkarsh and
            Garg, Aadil and Garg, Pratham},
  institution = {Hydra DB},
  address = {San Francisco, California, USA},
  year   = {2026},
  note   = {LongMemEval-s 90.79\% (SOTA)},
  url    = {https://benchmarks.hydradb.com/HydraDB.pdf}
}
