About this report
This report shares the results of running Hydra DB against FinanceBench, an industry benchmark for financial question answering. The goal is a clear view of how Hydra DB performs on real-world financial documents, so you can judge its fit for your use case.
We evaluated two things that matter for production use: how often it finds the correct supporting evidence in the document, and how much context it sends to the downstream language model.
The benchmark
FinanceBench is a public benchmark built from 10-K, 10-Q, 8-K, and earnings documents of publicly traded companies. Each question has a human-verified answer and a pointer to the exact passage in the source document that supports it. The open-source sample covers a mix of direct metric lookups, domain knowledge, and reasoning-based queries.
Hydra DB offers two retrieval modes, and we evaluated both:
- Fast mode: Optimized for low-cost, high-throughput retrieval. Run against the full set of 150 open-source questions.
- Thinking mode: Performs additional reasoning over candidate passages to improve ranking quality. Run against a 120-question subset.
Retrieval accuracy
Accuracy is measured with Recall@K - the share of questions where the correct supporting passage appears somewhere in the top K results Hydra DB returns.
| Top K results | Fast mode | Thinking mode | Improvement |
|---|---|---|---|
| Top 1 | 44.1% | 50.3% | +6.2 pts |
| Top 3 | 68.1% | 74.3% | +6.2 pts |
| Top 5 | 78.9% | 84.3% | +5.4 pts |
| Top 10 | 89.0% | 91.4% | +2.4 pts |
In fast mode, Hydra DB surfaces the correct evidence within its top 10 results for nearly 9 out of 10 questions. At top 5, the correct evidence is present about 79% of the time, a practical working range for passing context into a model. Thinking mode improves recall at every level, with the biggest gains at the top of the ranking: Recall@1 and Recall@3 each rise by 6.2 points, and Recall@5 crosses 84%.
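The Recall@K metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for this report: the function name `recall_at_k`, the ID-based data layout, and the assumption of exactly one gold passage per question are all hypothetical, and the toy data is invented for the example.

```python
from typing import Dict, List, Sequence


def recall_at_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, str],
    ks: Sequence[int],
) -> Dict[int, float]:
    """For each k, return the share of questions whose gold passage is in the top-k results."""
    if not gold:
        raise ValueError("gold must contain at least one question")
    scores = {}
    for k in ks:
        # A question counts as a hit if its gold passage ID is among the first k results.
        hits = sum(
            1
            for qid, passage_id in gold.items()
            if passage_id in ranked.get(qid, [])[:k]
        )
        scores[k] = hits / len(gold)
    return scores


# Toy data (assumed layout): question ID -> gold passage ID, and
# question ID -> passage IDs in the order the retriever returned them.
gold = {"q1": "p7", "q2": "p2", "q3": "p9"}
ranked = {
    "q1": ["p7", "p1", "p4"],  # gold passage at rank 1
    "q2": ["p5", "p3", "p2"],  # gold passage at rank 3
    "q3": ["p1", "p2", "p3"],  # gold passage not retrieved
}

scores = recall_at_k(ranked, gold, ks=(1, 3))
print({k: round(v, 3) for k, v in scores.items()})
```

In this toy case one of three questions is a hit at K=1 and two of three at K=3. Recall@K can only stay flat or grow as K increases, which is the pattern the table shows.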
Context size
When Hydra DB returns results, it assembles them into a context package for the downstream model. Smaller, more predictable context sizes mean lower inference cost and faster responses. The numbers below are from the fast-mode run.
On average, Hydra DB produces about 8,000 tokens of context per query. The largest in the benchmark was around 12,000 tokens - comfortably inside the context window of every major language model on the market. The size is predictable across queries, making cost planning straightforward.
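For cost planning, the figures above can be turned into a quick check of window headroom and per-query input cost. The sketch below is only illustrative: `summarize_context` is a hypothetical helper, the token counts are made-up values rather than benchmark data, and the 128,000-token window and price of 2.50 per million input tokens are placeholder assumptions to replace with your model's actual limits and pricing.

```python
from statistics import mean


def summarize_context(token_counts, model_window, price_per_million_tokens):
    """Return average size, largest size, window headroom, and average input cost per query."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    avg = mean(token_counts)
    largest = max(token_counts)
    return {
        "avg_tokens": avg,
        "max_tokens": largest,
        # True when even the largest context fits in the model's window.
        "fits_window": largest <= model_window,
        # Fraction of the window used by the largest context.
        "window_share": largest / model_window,
        # Input-side cost only; output tokens are not included.
        "avg_cost_per_query": avg / 1_000_000 * price_per_million_tokens,
    }


# Illustrative per-query context sizes in tokens (NOT the benchmark data).
token_counts = [6_000, 8_000, 10_000]

summary = summarize_context(
    token_counts,
    model_window=128_000,           # assumed window size
    price_per_million_tokens=2.50,  # assumed price, in your billing currency
)
print(summary)
```

The same calculation applies to the reported figures: at roughly 8,000 tokens per query, input cost scales linearly with query volume, and the largest context of about 12,000 tokens sets the minimum window a downstream model needs.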
Summary
On FinanceBench, Hydra DB retrieves the correct evidence in the top 10 results 89% of the time in fast mode and 91% in thinking mode, with an average context size of roughly 8,000 tokens per query. Thinking mode adds 5–6 points of recall at the top of the ranking - the range that matters most when you want the first result to be correct.
The benchmark shows Hydra DB can reliably find the right information inside long, complex financial filings and deliver it in a form ready to use with any modern language model. The two modes give a direct choice between throughput and ranking precision based on what your application needs.
References
- Hydra DB: Technical Paper. Hydra DB (2026). benchmarks.hydradb.com/HydraDB.pdf
- FinanceBench: A New Benchmark for Financial Question Answering. Islam, P., Kannappan, A., Kiela, D., Qian, R., Scherrer, N., Vidgen, B. (2023). arXiv:2311.11944
- Hydra DB FinanceBench Evaluation Code. GitHub. github.com/usecortex/hydradb-bench
Cite this work
@techreport{hydradb-financebench2026,
title = {FinanceBench: Retrieval over Dense Financial Filings},
author = {{Hydra DB Research Team}},
institution = {Hydra DB},
address = {San Francisco, California, USA},
year = {2026},
note = {FinanceBench Recall@10 91.4\% (thinking mode)},
url = {https://benchmarks.hydradb.com/financebench.pdf}
}