Bank statement OCR and 92% auto-match across four Hungarian banks

Bank-line matching is the most tedious manual step in bookkeeping, and the Netorigo Finance module automates it. Across a tested portfolio at four Hungarian banks (OTP, K&H, Raiffeisen, Erste) we measure a 92% auto-match rate. The remaining 8% needs human review, and for those lines the module surfaces one-click candidate matches to keep the queue moving.

Inputs the module accepts

We ingest three formats: PDF (the default for Hungarian banks' customer statements), CAMT.053 (the ISO 20022 XML statement format used in SEPA), and MT940 (the legacy SWIFT format, still in use). PDF import runs in two tiers. Tier one uses pdf-parse to extract text. For 80% of bank PDFs this yields machine-readable text with a clean table structure. Tier two activates when the PDF is a scanned image: Tesseract OCR extracts the table, then a regex pipeline cleans up the line fields.
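The two-tier fallback can be sketched as below. This is a minimal illustration, not the module's actual code: the extractors are injected as plain functions so the sketch does not depend on the exact pdf-parse or Tesseract APIs, and the looksLikeStatementText heuristic (at least three lines containing a date) is an assumption made for the example.

```typescript
// Hypothetical two-tier PDF text extraction. Both extractors are injected,
// so in practice tier one would wrap pdf-parse and tier two would wrap OCR.
type Extractor = (pdf: Uint8Array) => Promise<string>;

interface PdfImportResult {
  tier: "text" | "ocr";
  text: string;
}

// Assumed heuristic: the text layer is usable when at least `minRows`
// lines contain a date such as 2026.05.18 or 2026-05-18.
function looksLikeStatementText(text: string, minRows = 3): boolean {
  const dateLike = /\b\d{4}[.\/-]\d{2}[.\/-]\d{2}\b/;
  const rows = text.split(/\r?\n/).filter((line) => dateLike.test(line));
  return rows.length >= minRows;
}

async function extractStatementText(
  pdf: Uint8Array,
  extractTextLayer: Extractor,
  runOcr: Extractor,
): Promise<PdfImportResult> {
  // Tier one: try the embedded text layer first, since it is cheap and exact.
  const layer = await extractTextLayer(pdf);
  if (looksLikeStatementText(layer)) {
    return { tier: "text", text: layer };
  }
  // Tier two: the PDF is probably a scanned image, so fall back to OCR.
  return { tier: "ocr", text: await runOcr(pdf) };
}
```

Injecting the extractors also makes the tier decision easy to unit-test with canned strings instead of real PDFs.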

CAMT.053 and MT940 imports are straightforward: both formats are standardised, with well-defined fields. The one thing to watch is idempotency, keyed on the wire-level transaction ID (bank-tx-id), so that a retried import does not duplicate lines.
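A minimal sketch of that idempotency rule, assuming an in-memory store as a stand-in for the database. The ImportedLine shape and the BankTransactionStore class are hypothetical names introduced for the example; in a real table the same guarantee would more likely come from a unique constraint on the bank-tx-id column.

```typescript
// Hypothetical shape of one parsed statement line.
interface ImportedLine {
  bankTxId: string; // the wire-level bank-tx-id
  amountHuf: number;
  valueDate: string; // ISO date, e.g. "2026-05-18"
}

// In-memory stand-in for the bank_transaction table.
class BankTransactionStore {
  private readonly byTxId = new Map<string, ImportedLine>();

  // Inserts only lines whose bankTxId has not been seen before and
  // returns how many lines were actually inserted.
  importLines(lines: ImportedLine[]): number {
    let inserted = 0;
    for (const line of lines) {
      if (this.byTxId.has(line.bankTxId)) {
        continue; // already imported, so a retry is a no-op for this line
      }
      this.byTxId.set(line.bankTxId, line);
      inserted += 1;
    }
    return inserted;
  }

  get size(): number {
    return this.byTxId.size;
  }
}
```

Running the same batch through importLines twice inserts the lines once; the second call returns 0.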

The matching logic

Every imported bank line is stored in a bank_transaction table, and the matcher lives in match-engine.service.ts. It checks three criteria:

Amount: must match exactly. A 100,000 HUF inbound can only match a 100,000 HUF invoice or obligation.
Date window: the bank value date must fall within ±3 days of the invoice date. Bank settlement is rarely instant, and B2B invoices are often back-dated.
Fuzzy partner-name match: the bank reference (or payer-name field) is compared against the partner's company name via string-similarity (Dice coefficient). We only accept matches with a score above 0.82, the threshold that gave the best empirical trade-off between false positives and missed real matches across the four-bank portfolio.

When all three criteria pass, the match is automatic: matched_invoice_id is stamped on the bank line, the invoice flips to paid_status = PAID, and the audit trail records match_method = AUTO so the bookkeeper can see machine-driven matches at a glance.
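The three checks and the stamping step can be sketched as follows. This is an illustration under stated assumptions, not the contents of match-engine.service.ts: the interfaces are hypothetical, the Dice coefficient is a plain bigram implementation written out here rather than a call into the string-similarity package, and refusing to auto-match when more than one invoice passes is an assumption the text above does not state.

```typescript
interface BankLine {
  amountHuf: number;
  valueDate: string; // ISO date
  payerName: string;
  matchedInvoiceId?: string;
  matchMethod?: "AUTO" | "MANUAL";
}

interface Invoice {
  id: string;
  amountHuf: number;
  invoiceDate: string; // ISO date
  partnerName: string;
  paidStatus: "OPEN" | "PAID";
}

const DAY_MS = 24 * 60 * 60 * 1000;
const NAME_THRESHOLD = 0.82;
const DATE_WINDOW_DAYS = 3;

// Counts character bigrams after upper-casing and removing whitespace.
function bigramCounts(text: string): Map<string, number> {
  const s = text.toUpperCase().replace(/\s+/g, "");
  const counts = new Map<string, number>();
  for (let i = 0; i + 1 < s.length; i++) {
    const gram = s.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * shared bigrams / total bigrams in both strings.
function diceCoefficient(a: string, b: string): number {
  const countsA = bigramCounts(a);
  const countsB = bigramCounts(b);
  let total = 0;
  let overlap = 0;
  countsA.forEach((n) => { total += n; });
  countsB.forEach((n, gram) => {
    total += n;
    overlap += Math.min(n, countsA.get(gram) ?? 0);
  });
  return total === 0 ? 0 : (2 * overlap) / total;
}

function daysApart(isoA: string, isoB: string): number {
  return Math.abs(Date.parse(isoA) - Date.parse(isoB)) / DAY_MS;
}

// Stamps the line and the invoice when exactly one open invoice passes
// all three criteria; returns whether an automatic match was made.
function autoMatch(line: BankLine, invoices: Invoice[]): boolean {
  const hits = invoices.filter(
    (inv) =>
      inv.paidStatus === "OPEN" &&
      inv.amountHuf === line.amountHuf &&
      daysApart(inv.invoiceDate, line.valueDate) <= DATE_WINDOW_DAYS &&
      diceCoefficient(inv.partnerName, line.payerName) > NAME_THRESHOLD,
  );
  const [invoice, ...rest] = hits;
  // Assumption: zero or several passing invoices go to the manual queue.
  if (invoice === undefined || rest.length > 0) return false;
  line.matchedInvoiceId = invoice.id;
  line.matchMethod = "AUTO";
  invoice.paidStatus = "PAID";
  return true;
}
```

With this implementation, 'Pelda Kft.' against 'PELDA KFT' shares 7 of 15 bigrams on each side of the comparison (7 common bigrams out of 8 + 7 in total), giving 14/15, roughly 0.93, which clears the 0.82 threshold.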

The 8% unmatched queue

The 8% of lines that do not auto-match land in an 'unmatched queue'. For each row the bookkeeper sees the top three candidate matches, found by relaxing the criteria (lower name score, wider date window, partial amount match). One click accepts a candidate and resolves the line. Typical handling time is 30 to 90 seconds per row.
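One way to pick the top three candidates is a weighted score over the same three dimensions. The sketch below is purely illustrative: the Candidate shape, the 0.5 / 0.3 / 0.2 weights, and the 14-day decay window are all assumptions made for the example, not values taken from the module.

```typescript
// Hypothetical pre-computed features for one open invoice versus one bank line.
interface Candidate {
  invoiceId: string;
  nameScore: number;   // fuzzy name similarity, 0..1
  daysOff: number;     // absolute gap between value date and invoice date
  amountRatio: number; // min(amounts) / max(amounts); 1 means an exact match
}

// Assumed weights and decay window, chosen only for illustration.
const WEIGHT_NAME = 0.5;
const WEIGHT_AMOUNT = 0.3;
const WEIGHT_DATE = 0.2;
const DATE_DECAY_DAYS = 14;

function candidateScore(c: Candidate): number {
  // Date closeness falls linearly from 1 (same day) to 0 (14+ days apart).
  const dateCloseness = Math.max(0, 1 - c.daysOff / DATE_DECAY_DAYS);
  return (
    WEIGHT_NAME * c.nameScore +
    WEIGHT_AMOUNT * c.amountRatio +
    WEIGHT_DATE * dateCloseness
  );
}

// Returns the best `limit` candidates, highest score first,
// without mutating the input array.
function rankCandidates(candidates: Candidate[], limit = 3): Candidate[] {
  return [...candidates]
    .sort((a, b) => candidateScore(b) - candidateScore(a))
    .slice(0, limit);
}
```

Keeping the score a single number makes the queue easy to sort and lets the weights be tuned per tenant if the defaults surface poor candidates.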

The bank that breaks our auto-match

Unicredit. The other four banks put the partner company name in the bank statement's 'reference' or 'comment' field. If the customer is registered as 'Pelda Kft.', the reference will read 'PELDA KFT'. Unambiguous.

Unicredit instead puts an internal transaction identifier such as 'TRX-2026-05-18-984371' in that slot. The partner name appears in a separate partner_name field but is often truncated to 22 characters, upper-cased, stripped of accents, and rendered as something like 'PELDA KORLATOLT FELELO'. The fuzzy match score routinely falls below 0.65, well short of the 0.82 acceptance threshold.

The fix: a Unicredit-specific extractor that first tries the additional_info field for a longer partner-name variant, and if that is empty, looks up which partner historically pays from this counterparty account via a partner_account_history table. With this extractor in place, the Unicredit match rate climbed from 71% to 88%.
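The fallback chain described above can be sketched as follows. The field names mirror the ones mentioned in the text, but the UnicreditLine shape is hypothetical, the Map is an in-memory stand-in for the partner_account_history table, and the final fallback to the truncated partner_name is an assumption added so the function always returns something to score.

```typescript
// Hypothetical shape of a parsed Unicredit statement line.
interface UnicreditLine {
  partnerName: string;        // truncated, upper-cased, accents stripped
  additionalInfo: string;     // may carry a longer partner-name variant
  counterpartyAccount: string;
}

// Stand-in for partner_account_history: counterparty account -> partner name.
type PartnerAccountHistory = Map<string, string>;

// Picks the best available name to feed into the fuzzy matcher.
function resolvePartnerName(
  line: UnicreditLine,
  history: PartnerAccountHistory,
): string {
  // Step 1: prefer the longer variant from additional_info when present.
  const longer = line.additionalInfo.trim();
  if (longer.length > 0) {
    return longer;
  }
  // Step 2: otherwise use the partner who historically paid from this account.
  const known = history.get(line.counterpartyAccount);
  if (known !== undefined) {
    return known;
  }
  // Step 3 (assumption): fall back to the truncated partner_name field.
  return line.partnerName;
}
```

The resolved name then goes through the same Dice comparison as for any other bank, so the extractor changes only the input to the matcher, not the matching rules.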

The other four banks

Auto-match rates for the other four banks: OTP 94%, K&H 93%, Raiffeisen 92%, Erste 91%. The four-bank average is 92.5%, which we round down to 92% in public docs.

Bank-statement matching is not glamorous, but done by hand at a tenant with 200 bank lines per month it absorbs 20 to 30 hours of bookkeeper time per month. At Hungarian rates that is 240,000 to 360,000 HUF per year of recovered capacity.
