Engineering · 8 min read

How Our 3-Model AI Consensus Engine Works

Most AI apps use a single model. Vector uses three — and makes them vote. Here's the engineering behind our consensus system.

By Vector Team · March 26, 2026

When we set out to build a credit card statement parser, we faced a fundamental challenge: LLMs are powerful but imperfect. A single model might misread a date format, confuse a credit for a debit, or hallucinate a transaction that does not exist. For financial data, "mostly accurate" is not good enough. So we built something better.


The Single-Model Problem


Most AI-powered apps send your data to one model and trust the output. This works fine for casual use cases, but financial parsing demands precision. A single misread transaction amount can throw off your entire monthly analysis. We tested GPT-4o-mini, Claude 3.5 Haiku, and Gemini 2.0 Flash individually and found that each model had different strengths and blind spots:


  • GPT-4o-mini excels at structured output and date parsing but occasionally struggles with non-standard statement layouts.
  • Claude 3.5 Haiku is strong at understanding context and categorization but can sometimes merge adjacent transactions.
  • Gemini 2.0 Flash handles multi-currency and international formats well but may misinterpret ambiguous line items.

No single model was reliable enough on its own. But together, they are remarkably accurate.


The Consensus Architecture


Vector's parsing pipeline works in three stages:


Stage 1: Document Intelligence


Before any parsing begins, we send a small probe request to understand the document. This detects the bank, country, currency, date format, and statement type. This metadata configures the main parsing stage so each model knows exactly what to expect.
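In rough terms, the probe can be thought of as a single cheap model call that returns parsing hints. The sketch below is illustrative, not our production schema — the prompt, field names, and `call_model` hook are stand-ins:

```python
import json

# Hypothetical probe prompt; the real one is more detailed.
PROBE_PROMPT = """Identify the bank, country, currency (ISO 4217 code),
date format, and statement type for this document. Reply as JSON."""


def probe_document(text: str, call_model) -> dict:
    """Send the start of the statement to one fast model and turn its
    JSON reply into hints that configure the main parsing stage."""
    reply = call_model(PROBE_PROMPT + "\n\n" + text[:4000])
    hints = json.loads(reply)
    # Fall back to safe defaults for anything the model omitted.
    return {
        "bank": hints.get("bank", "unknown"),
        "country": hints.get("country", "unknown"),
        "currency": hints.get("currency", "USD"),
        "date_format": hints.get("date_format", "%d/%m/%Y"),
        "statement_type": hints.get("statement_type", "credit_card"),
    }
```

The point of the defaults is that a partially failed probe should degrade the parse, not abort it.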


Stage 2: Parallel Parsing


We send the statement text to all three models simultaneously. Each model independently extracts every transaction with its date, description, amount, and type (debit or credit). Each model also identifies card metadata, statement periods, and summary totals. The three responses come back in parallel, typically within 3-5 seconds.
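The fan-out looks roughly like this (the per-model parser functions are placeholders for the real API clients; the key detail is that one slow or failed model never blocks the others' results):

```python
import asyncio


async def parse_with_all_models(statement_text: str, parsers: dict) -> dict:
    """parsers maps a model name to an async fn(text) -> list of
    transaction dicts. Returns only the successful responses."""
    names = list(parsers)
    results = await asyncio.gather(
        *(parsers[name](statement_text) for name in names),
        # Capture failures as values so one bad model can't sink the batch.
        return_exceptions=True,
    )
    return {
        name: result
        for name, result in zip(names, results)
        if not isinstance(result, Exception)
    }
```

Whatever survives this stage — ideally all three responses, sometimes two — is handed to the voting stage.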


Stage 3: Majority Voting


This is where the magic happens. Our consensus engine compares the three sets of extracted transactions and applies a voting algorithm:


  • Transaction matching: We match transactions across models using fuzzy matching on dates, amounts, and descriptions. A transaction that appears in at least 2 of 3 models is accepted.
  • Amount reconciliation: When models disagree on an amount, we take the majority value. If all three disagree, we flag the transaction for review.
  • Confidence scoring: Each transaction gets a confidence score based on how many models agreed. Three-way agreement scores highest. Two-way agreement is still accepted. Single-model-only transactions are included but flagged.
  • Summary validation: We cross-check extracted totals against the sum of individual transactions and against any summary section found in the statement.
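The voting steps above can be sketched as follows. This is a simplified model of the algorithm, assuming exact-date matching plus fuzzy description matching via `difflib` — the production matcher also fuzzes dates and amounts, and the thresholds here are illustrative:

```python
from collections import Counter
from difflib import SequenceMatcher


def _similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy description match (illustrative threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def consensus(model_outputs: dict) -> list:
    """model_outputs: model name -> list of {date, description, amount, type}.
    Clusters matching transactions across models, takes the majority amount,
    and attaches a confidence score (votes / number of models)."""
    n_models = len(model_outputs)
    clusters = []  # each: {"txns": [...], "models": {...}}
    for model, txns in model_outputs.items():
        for txn in txns:
            for cluster in clusters:
                ref = cluster["txns"][0]
                if (txn["date"] == ref["date"]
                        and _similar(txn["description"], ref["description"])):
                    cluster["txns"].append(txn)
                    cluster["models"].add(model)
                    break
            else:
                clusters.append({"txns": [txn], "models": {model}})

    accepted = []
    for cluster in clusters:
        votes = len(cluster["models"])
        amounts = Counter(t["amount"] for t in cluster["txns"])
        amount, agreement = amounts.most_common(1)[0]
        accepted.append({
            **cluster["txns"][0],
            "amount": amount,  # majority value when models disagree
            "confidence": votes / n_models,
            # Flag single-source transactions and unresolved amount disputes.
            "needs_review": votes < 2 or agreement < 2,
        })
    return accepted
```

Note that nothing is silently dropped: a transaction seen by only one model is still returned, just with a low confidence score and a review flag, which is what powers the flagging described in the Results section.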

The Fallback Chain


If OpenRouter (which provides Claude and Gemini access) is unavailable, we gracefully degrade to single-model parsing with GPT-4o-mini. If all LLM providers fail, we fall back to our bank-specific regex parsers — hand-tuned pattern matchers for the seven most common bank statement formats. You always get results, even if the AI is having a bad day.
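Structurally, the fallback chain is just an ordered list of strategies tried until one succeeds. A minimal sketch (the three parser functions are placeholders for the stages described above):

```python
def parse_statement(text: str, consensus_parse, single_model_parse, regex_parse) -> dict:
    """Try each parsing strategy in order of accuracy; return the first
    one that succeeds, tagged with its source so the UI can surface it."""
    strategies = (
        (consensus_parse, "consensus"),       # all three models voting
        (single_model_parse, "gpt-4o-mini"),  # OpenRouter unavailable
        (regex_parse, "regex"),               # all LLM providers down
    )
    for strategy, name in strategies:
        try:
            return {"source": name, "transactions": strategy(text)}
        except Exception:
            continue  # degrade to the next strategy
    raise RuntimeError("all parsing strategies failed")
```

Tagging each result with its source matters: a regex-parsed statement should be presented with less certainty than a three-way consensus.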


Results


In our testing across hundreds of real statements from banks worldwide, the consensus approach achieves over 97% transaction-level accuracy, compared to roughly 89-93% for any single model alone. The flagging system catches most remaining edge cases, giving users clear visibility into which transactions might need manual review.


The tradeoff is cost and latency — three models cost three times as much, and we wait for the slowest model. But for financial data, accuracy is worth it. Your money deserves more than a best guess.


Three models. One truth. Zero guesswork.