Hydra DB Research
Benchmark Report · Financial QA

FinanceBench - Retrieval over Dense Filings

How reliably Hydra DB finds the right evidence inside long, complex financial documents, and how compact the resulting context is. Evaluated across two retrieval modes on real 10-K, 10-Q, 8-K, and earnings filings.

91.4% Recall@10 (thinking mode)
89% Recall@10 (fast mode)
~8K average context tokens per query
150 open-source questions

01

About this report

This report shares the results of running Hydra DB against FinanceBench [2], an industry benchmark for financial question answering. The goal is a clear view of how Hydra DB performs on real-world financial documents, so you can judge its fit for your use case.

We evaluated two things that matter for production use: how often Hydra DB finds the correct supporting evidence in the document, and how much context it sends to the downstream language model.

02

The benchmark

FinanceBench is a public benchmark built from 10-K, 10-Q, 8-K, and earnings documents of publicly traded companies. Each question has a human-verified answer and a pointer to the exact passage in the source document that supports it. The open-source sample covers a mix of direct metric lookups, domain knowledge, and reasoning-based queries.

Hydra DB offers two retrieval modes, and we evaluated both:

Fast mode

Optimized for low-cost, high-throughput retrieval. Run against the full set of 150 open-source questions.

Thinking mode

Performs additional reasoning over candidate passages to improve ranking quality. Run against a 120-question subset.

03

Retrieval accuracy

Accuracy is measured with Recall@K - the share of questions where the correct supporting passage appears somewhere in the top K results Hydra DB returns.
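The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for this report: the function name `recall_at_k`, the dictionary layout, and the passage IDs are all assumptions made for the example, and the toy data is not FinanceBench data.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Share of questions whose top-k results contain a gold supporting passage."""
    if not gold:
        raise ValueError("need at least one question")
    hits = 0
    for qid, gold_ids in gold.items():
        top_k = ranked.get(qid, [])[:k]  # a question with no results counts as a miss
        if any(pid in gold_ids for pid in top_k):
            hits += 1
    return hits / len(gold)


# Toy data with hypothetical passage IDs (not benchmark results).
ranked = {
    "q1": ["p3", "p7", "p1"],  # gold passage at rank 1
    "q2": ["p9", "p2", "p4"],  # gold passage at rank 3
    "q3": ["p5", "p6", "p8"],  # gold passage not retrieved
}
gold = {"q1": {"p3"}, "q2": {"p4"}, "q3": {"p0"}}

for k in (1, 3):
    print(f"Recall@{k} = {recall_at_k(ranked, gold, k):.3f}")
```

In the toy data one of three questions is a hit at K=1 and two of three at K=3, which shows why Recall@K can only stay flat or rise as K grows.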

Figure: Recall@K for thinking mode and fast mode. Both converge near Recall@10; thinking mode adds 5–6 points at the top of the ranking.
Top K results   Fast mode   Thinking mode   Improvement
Top 1           44.1%       50.3%           +6.2 pts
Top 3           68.1%       74.3%           +6.2 pts
Top 5           78.9%       84.3%           +5.4 pts
Top 10          89.0%       91.4%           +2.4 pts
Table 1. Recall@K for fast and thinking modes.
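The Improvement column is an absolute difference in percentage points, not a relative change. As a quick check, the sketch below recomputes it from the Recall@K values in Table 1; the helper name `improvement_points` is an assumption made for this example.

```python
# Recall@K values copied from Table 1, in percent: K -> (fast mode, thinking mode).
table_1 = {
    1: (44.1, 50.3),
    3: (68.1, 74.3),
    5: (78.9, 84.3),
    10: (89.0, 91.4),
}


def improvement_points(fast: float, thinking: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(thinking - fast, 1)


deltas = {k: improvement_points(fast, thinking) for k, (fast, thinking) in table_1.items()}

for k, delta in deltas.items():
    print(f"Top {k}: +{delta} pts")
```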

In fast mode, Hydra DB surfaces the correct evidence within its top 10 results for nearly 9 out of 10 questions. At top 5, the correct evidence is present ~79% of the time, a practical working range for passing context into a model. Thinking mode improves recall at every level, with the biggest gains at the top of the ranking: Recall@1 rises by more than 6 points and Recall@5 crosses 84%.

The two modes nearly converge at Recall@10 (89.0% vs. 91.4%): the underlying retrieval already surfaces the correct evidence for roughly 9 in 10 questions. Thinking mode primarily reorders results to put the right one higher up; choose it when the first few results must be correct.
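To see why reordering mostly moves the top of the ranking, consider a toy example in which the same candidate set is re-scored. This is only an illustration of the general reranking idea, not Hydra DB's actual mechanism: the passage IDs, the scores, and the `rerank` helper are all hypothetical.

```python
from typing import Dict, List


def hit_at_k(ranking: List[str], gold_id: str, k: int) -> bool:
    """True if the gold passage is among the first k results."""
    return gold_id in ranking[:k]


def rerank(ranking: List[str], scores: Dict[str, float]) -> List[str]:
    """Reorder the same candidates by a relevance score, highest first."""
    return sorted(ranking, key=lambda pid: scores.get(pid, 0.0), reverse=True)


# Hypothetical first-pass ranking: the gold passage "p4" sits at rank 4.
first_pass = ["p1", "p2", "p3", "p4", "p5"]
gold_id = "p4"

# Hypothetical reranker scores; a real system would compute these per query.
scores = {"p1": 0.2, "p2": 0.1, "p3": 0.3, "p4": 0.9, "p5": 0.05}

reranked = rerank(first_pass, scores)

for name, ranking in (("first pass", first_pass), ("reranked", reranked)):
    print(name, "hit@1:", hit_at_k(ranking, gold_id, 1), "hit@5:", hit_at_k(ranking, gold_id, 5))
```

Because the candidate set is unchanged, the hit at K=5 is the same before and after; only the hit at K=1 flips. That mirrors the pattern in Table 1, where the gap between modes is widest at Top 1 and narrowest at Top 10.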

04

Context size

When Hydra DB returns results, it assembles them into a context package for the downstream model. Smaller, more predictable context sizes mean lower inference cost and faster responses. The numbers below are from the fast-mode run.

Smallest context   5,299 tokens
Average context    7,997 tokens
Largest context    12,175 tokens
Fits every major model window.
Table 2. Context size per query (fast mode).

On average, Hydra DB produces about 8,000 tokens of context per query (7,997 in this run). The largest context in the benchmark was 12,175 tokens, comfortably inside the context window of every major language model on the market. Sizes stayed within a bounded range (5,299 to 12,175 tokens), which makes cost planning straightforward.
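As a rough illustration of cost planning, the sketch below turns the context sizes from Table 2 into an input-token cost estimate. The price per million tokens and the monthly query volume are hypothetical placeholders, not figures from this report; substitute your model provider's actual pricing and your own traffic.

```python
def input_cost_usd(context_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of one query for a given per-million-token price."""
    return context_tokens * usd_per_million_tokens / 1_000_000


# Context sizes from Table 2 (fast mode), in tokens.
context_sizes = {"smallest": 5_299, "average": 7_997, "largest": 12_175}

USD_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical price, for illustration only
QUERIES_PER_MONTH = 100_000          # hypothetical volume, for illustration only

for label, tokens in context_sizes.items():
    per_query = input_cost_usd(tokens, USD_PER_MILLION_INPUT_TOKENS)
    monthly = per_query * QUERIES_PER_MONTH
    print(f"{label:>8}: {tokens:>6} tokens  ${per_query:.4f}/query  ${monthly:,.0f}/month")
```

The estimate covers only the retrieved context; the question itself, any system prompt, and the model's output tokens add to the total.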

05

Summary

On FinanceBench, Hydra DB retrieves the correct evidence in the top 10 results 89% of the time in fast mode and 91% in thinking mode, with an average context size of roughly 8,000 tokens per query. Thinking mode adds 5–6 points of recall at the top of the ranking - the range that matters most when you want the first result to be correct.

The benchmark shows Hydra DB can reliably find the right information inside long, complex financial filings and deliver it in a form ready to use with any modern language model. The two modes give a direct choice between throughput and ranking precision based on what your application needs.

Reliable evidence retrieval inside dense filings, delivered in a compact, predictable ~8K-token package.

References

  1. Hydra DB: Technical Paper. Hydra DB (2026). benchmarks.hydradb.com/HydraDB.pdf
  2. FinanceBench: A New Benchmark for Financial Question Answering. Islam, P., Kannappan, A., Kiela, D., Qian, R., Scherrer, N., Vidgen, B. (2023). arXiv:2311.11944
  3. Hydra DB FinanceBench Evaluation Code. GitHub. github.com/usecortex/hydradb-bench

Cite this work

@techreport{hydradb-financebench2026,
  title       = {FinanceBench: Retrieval over Dense Financial Filings},
  author      = {{Hydra DB Research Team}},
  institution = {Hydra DB},
  address     = {San Francisco, California, USA},
  year        = {2026},
  note        = {FinanceBench Recall@10 91.4\% (thinking mode)},
  url         = {https://benchmarks.hydradb.com/financebench.pdf}
}