AI Hallucination ResearchRegulatorsGlobal standard-settersINTBIS-CPMICPMI-ISO-20022-HARMONISATION-UPDATED-2026White paper › Detail
AI Labs · updated 2026-06-03 · methodology v2.1

ISO 20022 Harmonisation: Numeric Conflation and Attribution Failures in Cross-Border Payment Regulation

Numeric conflation across disaggregated adoption-rate subcategories — collapsing distinct faster-payment-system and RTGS figures into a single blended claim — is the primary failure surface for Claude Opus 4.7 with web search on the CPMI Harmonised ISO 20022 Data Requirements for Enhancing Cross-Border Payments — Updated Report. Claude Sonnet 4.6 with web search exhibits a different but structurally related failure: attribution errors on multi-body institutional roles, and false-negative evasion on quantitative operational statistics that appear in official speeches but not in the core publication text.

Across both models, the failures concentrate on content that is either numerically granular at the subcategory level or delivered through secondary regulatory channels — speeches, working-group announcements, implementing-body FAQs — rather than the primary document body. This failure shape is a signal worth attending to: it suggests that when regulator-attributed statistics arrive via channels with lower indexing density, both models fall back on internally-reconstructed composites rather than retrieval, and that the reconstruction process degrades silently rather than producing an explicit uncertainty signal.

When this affects AI Labs

The CPMI ISO 20022 harmonisation framework is operational infrastructure. Compliance officers at correspondent banks, payment scheme operators, fintech developers integrating cross-border rails, and regtech tools that advise on implementation timelines are all asking frontier models about this regulation's requirements today. The failure modes documented here — blended adoption-rate statistics, wrong institutional attribution, evasion on official speech content — surface precisely in that usage pattern: a user asks a numeric or attribution question about the rule, the model answers with confidence, and the answer is wrong in a way that does not announce itself.

The downstream exposure for a lab is concrete. A compliance professional acting on a blended 79% adoption figure — when the actual regulator record distinguishes faster-payment-system adoption sharply from RTGS adoption — may misadvise a client on implementation readiness timelines. A payment-scheme operator asking which central bank chairs the relevant CPMI working group and receiving the wrong institution will pursue the wrong relationship. These are not edge cases. CPMI standards govern the messaging architecture for the G20 cross-border payments roadmap; the number of professionals whose daily workflow depends on accurate retrieval of these rules is large and growing.

If a lab's model is a default tool in a compliance or legal-tech workflow and produces confidently wrong attribution or blended numeric outputs, the lab carries a material share of the misuse-claim exposure when the error reaches a regulatory consequence.

ISO 20022 is also structurally challenging for models in a way that will generalise across the CPMI publication corpus. The framework spans multiple implementing bodies — each with their own FAQ layers, implementation guides, and migration timelines. Key quantitative statements appear not in the core published report but in official speeches and press releases issued by central bank governors and BIS leadership, indexed separately and at lower density than the primary document. Technical format specifications are embedded in structured data schemas where over- or under-specification of mandatory versus optional fields is consequential but not visually salient.

A model that reconstructs from training rather than live retrieval in any of these sub-domains will produce an answer that looks like a correct citation of the rule while disagreeing with the regulator's actual record.

Aggregate impact

Model Configuration Failure count Dominant error pattern
Claude Opus 4.7 Web search enabled 2 Subcategory-numeric conflation — blending distinct system-type statistics into a single composite figure
Claude Sonnet 4.6 Web search enabled 2 Institutional attribution error + false-negative evasion on official speech content

Claude Opus 4.7 with web search produced two failures, both involving numeric misrepresentation. In the first, the model collapsed the regulator's separately-stated adoption figures for faster payment systems and RTGS systems into a single blended percentage claim attributed to both. The actual record — drawn from a March 2026 official speech — gives two distinct figures that diverge substantially; the model produced a single figure that matches neither.

In the second, the model over-specified the mandatory field structure of the postal address format required under the hybrid/end-state implementation approach, adding structural elements (Building Number, Post Code, Country Sub-Division) to the mandatory tier that the implementing body's own FAQ places in the optional tier. In both cases, the model produced a confident specific answer rather than surfacing uncertainty.

Claude Sonnet 4.6 with web search failed on institutional attribution and on quantitative operational claims embedded in official speeches. On the attribution question, the model named the Federal Reserve Bank of New York as the CPMI working-group chair for ISO 20022 harmonisation. The actual chair is the Reserve Bank of Australia; the RBA Governor served as Co-Chair of the relevant CPMI workstream. The model's claim came with an explicit "based on available public sources" hedge — but the hedge did not prevent a confidently-wrong attribution to a higher-frequency institution.

On the operational statistics question, the model returned a false negative — claiming no specific official statistic existed — despite the regulator's record containing an official speech from March 2026 giving precise figures: 1–3 per cent of payments generate inquiries requiring 5–10 manual touchpoints, with resolution times reducible by up to 80 per cent through harmonised ISO 20022 implementation. The model retrieved nothing from that speech.

Failures across both models cluster on two structural failure surfaces: disaggregated subcategory statistics where the model's internal representation averages across categories rather than retrieving the distinct values, and content delivered through official-speech and FAQ channels that appear to sit outside the models' effective retrieval index. The joint failure pattern — one model blending numbers, the other evading them — suggests that quantitative claims on this regulation are a high-risk zone regardless of architecture, and that the retrieval pipeline is under-indexing on BIS and central-bank speech content issued in early 2026.

Findings

4 findings in this case study. Click any to see its full evidence card.

  1. Finding on 'Q006 Probe' for Claude Opus 4.7 with web search ON see this finding →
  2. Finding on 'Q010 Probe' for Claude Opus 4.7 with web search ON see this finding →
  3. Finding on 'Q004 Probe' for Claude Sonnet 4.6 with web search ON see this finding →
  4. Finding on 'Q007 Probe' for Claude Sonnet 4.6 with web search ON see this finding →

What your team should do

Implications for your training data

The subcategory-numeric conflation observed in Claude Opus 4.7 with web search — where the model produced a single blended adoption percentage for two system types that the regulator characterises with distinct and substantially different figures — points to a specific gap in how CPMI statistical content is represented in the training corpus. When adoption metrics are published through official speeches rather than structured tables in the primary document, the model's internal representation appears to average across categories or reconstruct from aggregate framing, losing the subcategory resolution.

Training-side: the corpus ingestion for BIS and central-bank speech content should preserve subcategory-level numeric claims as structured extraction targets rather than free-text passages. Where a speech gives System Type A: X% and System Type B: Y%, that should be preserved as two distinct data points, not a single passage that a summarisation pass might compress into a blended figure.

The schema over-specification failure on postal address format (Claude Opus 4.7 with web search, Finding 10) and the institutional attribution failure (Claude Sonnet 4.6 with web search, Finding 4) both suggest that implementing-body FAQ layers and working-group press releases are under-represented relative to the primary CPMI publication. The RBA press release naming the working-group chair was published in October 2023 — well within any plausible training window — but the model substituted the Federal Reserve Bank of New York, suggesting the press release did not land with sufficient weight to anchor the attribution.

For multi-body regulatory frameworks where chair and co-chair roles are distributed across less-prominent central banks, press releases from those banks need to be in the corpus with sufficient density to compete against the baseline frequency of major-market institutions.

Implications for your post-training logic

The false-negative evasion in Claude Sonnet 4.6 with web search (Finding 7) is a calibration problem on retrieval confidence. The model claimed no official statistic existed; the statistic exists in a BIS-published speech from March 2026. When web search is active and returns no substantive content for an official-speech query, the model should not escalate to a confident "not found" claim — it should surface uncertainty about its coverage of recent official speech content from this issuer.

Post-training: add a calibration signal distinguishing "I searched and found nothing" from "I searched and the content may exist but fall outside my effective retrieval window." For BIS and central bank speech content from the prior 12 months, the latter framing is more often correct.

The retrieval ranker with web search enabled is also treating official-speech URLs and FAQ pages as lower-authority than the primary regulation document, which may explain the schema over-specification error: when the implementing-body FAQ is not surfaced, the model reconstructs the field-level mandatory/optional boundary from training, and reconstruction of schema detail tends toward over-specification. Post-training: for queries that name a specific implementing body's technical format requirements, the retrieval ranker should explicitly elevate that body's FAQ and implementation-guide URLs over third-party summary content.

Specific eval / red-team probes RegLeg suggests

How RLB can help

Across our work on CPMI, FSB, and national regulator publications, we document nuanced failure patterns that are difficult to surface through internally-focused evals: subcategory-numeric conflation where adoption statistics lose resolution across system types; multi-body institutional attribution drift toward higher-frequency institutions when the named role belongs to a lower-frequency central bank; schema over-specification on technical format requirements when implementing-body FAQ layers are not retrieved; false-negative evasion on quantitative claims delivered through official-speech channels that fall outside standard indexing cadence.

These failure shapes are consistent enough across model versions and configurations that they tell you something about the structural gap between what your models retrieve and what the regulator's actual record contains.

We can help your team close that gap in concrete ways. We generate targeted correction pairs per failure mode — derived directly from the regulator's authoritative text — formatted for ingestion into your training-data pipeline. We run embedded eval partnerships against a defined regulator portfolio, producing quarterly comparative reports across model versions with regression monitoring on previously-documented failure modes so you can verify that a training cycle has closed a specific gap without opening others.

For capability launches that touch regulated domains — financial services, payments infrastructure, cross-border regulatory content — we can run pre-release evaluation cycles to flag failure shapes before they reach users. And as new regulations enter your deployment footprint, we provide red-team consultation on the specific failure surfaces those regulations present.

If any of the failure modes documented here are worth investigating further in your next evaluation cycle, we'd welcome a technical conversation to scope what a targeted partnership could cover. Contact us at reglegbrief.com.

← Back to summary Other AI Labs white papers →