What question did this study set out to answer?

This research aims to improve the accuracy of matching invoice line items to product catalog entries in accounting by using a dual-augmentation retrieval-augmented generation (RAG) system.

June 3, 2026 · Open Access

Beyond Fuzzy Matching: A Dual-Augmentation RAG System for Robust Product Reconciliation in Accounting

Key points

Designed a retrieval-augmented generation architecture for product matching under noisy conditions.
Evaluated the architecture on three public entity resolution benchmarks (Abt-Buy, Amazon-Google, and Walmart-Amazon) against established lexical and dense retrieval baselines.
Assessed the system's performance in real accounts payable workflows with approximately 200 verified invoice lines.
Achieved Top-3 Recall of 91.60% to 97.96% across the three public benchmarks.
Produced a Top-3 hit rate of approximately 97% in practical deployment on Greek invoice lines.
Matched or exceeded the strongest non-LLM baseline on every benchmark evaluated.

Abstract

Accurate product-to-catalog invoice matching is a foundational internal control for financial oversight and audit quality, yet it is bottlenecked by inconsistent vendor descriptions and the resulting ‘long tail’ of supplier heterogeneity, driving costly manual reconciliation in Enterprise Resource Planning (ERP) environments. This study pursues three objectives: (i) to design a Retrieval-Augmented Generation (RAG) architecture that matches invoice line items to a product catalog under conditions of optical character recognition noise, vendor-specific abbreviations, and multilingual heterogeneity; (ii) to evaluate this architecture on three public entity resolution benchmarks against established lexical and dense retrieval baselines; and (iii) to assess its viability as a decision support system in a real accounts payable workflow with audit-trail requirements. To address (i), we introduce a novel ‘augment-both-sides’ strategy: large language models (LLMs) proactively enrich each catalog Stock Keeping Unit (SKU) with synonyms and alternative descriptions before vectorization, while invoice lines undergo runtime query expansion, and an LLM-based reranker produces the final Top-3 candidates. For (ii), evaluation on the Abt-Buy, Amazon-Google, and Walmart-Amazon datasets yields Top-3 Recall of 91.60% to 97.96%, matching or exceeding the strongest non-LLM baseline on every benchmark. For (iii), a production deployment on approximately 200 manually verified Greek invoice lines (proprietary dataset, anecdotal observation) yields a Top-3 hit rate of approximately 97%, consistent with the public-benchmark results. The architecture functions as a reliable intelligent decision aid, narrowing the search space from thousands of SKUs to a precise candidate set for structured human verification.
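
As a rough illustration of the ‘augment-both-sides’ idea and the Top-3 metric, the following minimal Python sketch may help. It is not the authors' implementation: the synonym table and abbreviation map are hand-written, hypothetical stand-ins for the LLM enrichment and query-expansion steps, bag-of-words cosine similarity stands in for dense embeddings, and the LLM-based reranker is not modeled (candidates are simply sorted by similarity). All SKU identifiers, function names, and example lines are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for LLM catalog enrichment: each SKU carries
# hand-written synonyms and alternative descriptions.
CATALOG = {
    "SKU-001": ["a4 copy paper 500 sheets", "printer paper ream", "copier paper a4"],
    "SKU-002": ["black ballpoint pen", "biro black", "ball pen blk"],
    "SKU-003": ["stapler desktop", "office stapler"],
    "SKU-004": ["toner cartridge black", "laser printer toner blk"],
}

# Hypothetical stand-in for LLM query expansion: known vendor abbreviations.
ABBREVIATIONS = {"ppr": "paper", "blk": "black", "prntr": "printer", "crtg": "cartridge"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text):
    """Bag-of-words term counts (a real system would use dense embeddings)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_query(line):
    """Query side: append expansions of any known abbreviations."""
    tokens = tokenize(line)
    extra = [ABBREVIATIONS[t] for t in tokens if t in ABBREVIATIONS]
    return " ".join(tokens + extra)


def build_index(catalog):
    """Catalog side: vectorize each SKU together with its enriched descriptions."""
    return {sku: vectorize(" ".join(descs)) for sku, descs in catalog.items()}


def top_k(line, index, k=3):
    """Return the k SKUs most similar to the expanded invoice line."""
    query = vectorize(expand_query(line))
    ranked = sorted(index, key=lambda sku: cosine(query, index[sku]), reverse=True)
    return ranked[:k]


def top_k_recall(examples, index, k=3):
    """Fraction of (line, gold_sku) pairs whose gold SKU is among the top k."""
    hits = sum(1 for line, gold in examples if gold in top_k(line, index, k))
    return hits / len(examples)


INDEX = build_index(CATALOG)
EXAMPLES = [
    ("A4 ppr 500", "SKU-001"),
    ("pen blk", "SKU-002"),
    ("prntr toner crtg", "SKU-004"),
]

if __name__ == "__main__":
    for line, gold in EXAMPLES:
        print(line, "->", top_k(line, INDEX), "(gold:", gold + ")")
    print("Top-3 recall:", top_k_recall(EXAMPLES, INDEX))
```

With only four SKUs a Top-3 hit is nearly guaranteed, so the numbers this toy prints say nothing about the reported results. It only shows where each side of the augmentation enters the pipeline and how a Top-3 hit rate is counted.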
