arXiv 2605.05482v1 · May 6, 2026

FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking

Denys Katerenchuk et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published: May 6, 2026, 10:04 PM
Current score: 87
Original paper: the executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Large language models (LLMs) are rapidly being adopted across various domains. However, their adoption in the banking industry faces resistance due to demands for high accuracy, regulatory compliance, and verifiable, grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality, outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yields a 12% "I don't know" rate, substantially improving over the base model's unsafe 4.3% rate while avoiding GPT-4.1's over-refusal (20.2%). Third, we present an end-to-end methodology spanning data curation to quantized serving. The system is deployed at 40+ financial institutions, achieving a 7.1 percentage point improvement in query resolution (p < 0.001). Additionally, the model delivers 3–5x faster responses at 20–50x lower cost compared to GPT-4.1.
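The calibrated refusal mechanism is the most transferable idea in the abstract: roughly 22% of training examples carry no supporting evidence, and their target answer is a refusal. Below is a minimal sketch of how such a mix could be assembled. The 22% share and the refusal behavior come from the paper; the field names, helper function, and sampling logic are illustrative assumptions, not the authors' implementation.

```python
import random

# Canonical refusal target; the paper reports a 12% "I don't know" rate at inference.
REFUSAL_TEXT = "I don't know."
# Fraction of training examples that are unanswerable (from the paper).
UNANSWERABLE_SHARE = 0.22

def build_training_mix(answerable, unanswerable, seed=0):
    """Combine grounded QA pairs with refusal examples at a fixed ratio.

    answerable:   dicts with 'context', 'question', 'answer' (answer carries citations)
    unanswerable: dicts with 'context', 'question' where the context cannot
                  support an answer; the training target becomes a refusal.
    """
    rng = random.Random(seed)
    # Solve r / (len(answerable) + r) = UNANSWERABLE_SHARE for r.
    n_refusals = int(len(answerable) * UNANSWERABLE_SHARE / (1 - UNANSWERABLE_SHARE))
    refusals = [
        {"context": ex["context"], "question": ex["question"], "answer": REFUSAL_TEXT}
        for ex in rng.sample(unanswerable, min(n_refusals, len(unanswerable)))
    ]
    mix = answerable + refusals
    rng.shuffle(mix)
    return mix
```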

Score 87 · Full-paper brief · models · training · inference · data

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

FinRAG-12B is less a “better chatbot” paper than a recipe for making regulated AI support cheaper to operate: a 12B domain model, tuned on a relatively small corpus, that answers with citations and is trained to say “I don’t know” when the source material is insufficient. The authors claim this is already running at 40+ financial institutions, improving query resolution by 7.1 percentage points while responding 3–5x faster and at 20–50x lower cost than GPT-4.1. If those production numbers hold up, procurement and operations teams should stop treating frontier API access as the default answer for grounded banking QA; the open question is how much of the result depends on proprietary data, narrow retail-banking workflows, and evaluation choices.

  • The paper’s direct claim is that a smaller, domain-tuned model can deliver cited answers for banking with far less training data than a general-purpose frontier model. The business implication is that high-compliance QA may be more about data curation, retrieval, citation discipline, and refusal tuning than buying the largest model available.
  • For regulated workflows, the key buying question is not just accuracy; it is whether the system refuses when the evidence is missing without becoming uselessly conservative. Ask for measured refusal rates on unanswerable questions, citation-grounding rates, and examples of how those thresholds are calibrated on your own policy and product materials (see the evaluation sketch after this list).
  • The authors report a quantized 12B model that fits into an 8.4GB footprint, runs on a single GPU, and is 3–5x faster at 20–50x lower cost than GPT-4.1. Treat the exact economics as vendor- and workload-dependent, but the direction matters: private or dedicated smaller models may be credible for high-volume, grounded support tasks (see the footprint arithmetic after this list).
  • The strongest signal is not the benchmark table; it is the reported deployment across 40+ financial institutions with a 7.1 percentage-point lift in query resolution. If replicated by other teams, this points to a near-term automation wedge in customer support, internal knowledge desks, compliance help, and operations QA.
  • The evidence is strongest for banking RAG, and the paper notes limits around a 258-example proprietary banking test set, three institutions, retail-banking skew, and unreleased proprietary training data. This is not yet proof that the same recipe works for trading, insurance, investment advice, or rare regulatory edge cases.
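On the due-diligence point above: refusal rate on unanswerable questions and citation grounding are both straightforward to measure once an eval set is labeled. Here is a minimal sketch, assuming a hypothetical results format with `answerable`, `response`, `cited_spans`, and a judge-produced `supported` flag; none of these names come from the paper.

```python
def eval_refusal_and_grounding(results, refusal_marker="I don't know"):
    """Score refusal calibration and citation grounding on an eval set.

    results: dicts with
      'answerable'  -- bool, do the retrieved documents support an answer?
      'response'    -- str, the model output
      'cited_spans' -- list of source spans the answer cites
      'supported'   -- bool, did a judge find every claim backed by its citations?
    """
    unanswerable = [r for r in results if not r["answerable"]]
    answered = [
        r for r in results
        if r["answerable"] and refusal_marker not in r["response"]
    ]

    # High refusal on unanswerable questions is the goal -- but also check the
    # refusal rate on answerable ones to catch over-refusal (GPT-4.1's failure
    # mode in the paper's comparison).
    refusal_rate = (
        sum(refusal_marker in r["response"] for r in unanswerable)
        / max(len(unanswerable), 1)
    )
    # Share of answered questions where every claim is backed by a cited span.
    grounding_rate = (
        sum(1 for r in answered if r["supported"] and r["cited_spans"])
        / max(len(answered), 1)
    )
    return {"refusal_rate": refusal_rate, "grounding_rate": grounding_rate}
```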
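And on the 8.4GB footprint: the brief does not say which quantization scheme is used, but back-of-envelope arithmetic shows the number is plausible for a 12B model at roughly 5 bits per weight. The overhead ratio below is an assumption covering quantization scales and any layers kept at higher precision.

```python
def quantized_weight_gb(n_params, bits_per_weight, overhead_ratio=0.10):
    """Rough weight-only memory estimate for a quantized model.

    overhead_ratio is a guess covering quantization scales/zero-points and
    layers (embeddings, norms) often kept at higher precision.
    """
    return n_params * bits_per_weight / 8 / 1e9 * (1 + overhead_ratio)

for bits in (16, 8, 5, 4):
    print(f"{bits}-bit: ~{quantized_weight_gb(12e9, bits):.1f} GB")
# 16-bit: ~26.4 GB   8-bit: ~13.2 GB   5-bit: ~8.2 GB   4-bit: ~6.6 GB
# Roughly 5 bits/weight lands near the reported 8.4 GB; KV cache and
# activations add to this at serve time, so single-GPU headroom still matters.
```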

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

capability · high confidence · p.1, p.3

FinRAG-12B is a 12B-parameter banking RAG model trained with a relatively small corpus of 98,648 samples / 143M tokens.

capability · high confidence · p.1

The model is explicitly tuned to refuse when evidence is missing, with reported refusal behavior between the unsafe base model and GPT-4.1 over-refusal.

strategic · high confidence · p.1

The authors report production deployment at 40+ financial institutions and a statistically significant 7.1 percentage-point query-resolution improvement.

inference · medium confidence · p.5, p.7

Quantization and serving choices make the model cheaper and faster to operate than a large proprietary API in the authors’ reported setup.

caveat · medium confidence · p.8

Generalization is uncertain because the evaluation is concentrated in banking RAG, includes proprietary data, and underexplores rare edge cases.

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.LG

Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving

Hung Cuong Pham, Fatih Gedikli

cs.CL

When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

Cosimo Galeone et al.

cs.LG

CHASM: Unveiling Covert Advertisements on Chinese Social Media

Jingyi Zheng et al.

cs.LG

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

Venus Team et al.

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.