AI Labs · updated 2026-06-04 · methodology v2.1

API Harmonisation for Cross-Border Payments: Model Failure Patterns on CPMI's October 2024 Framework

When both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search encounter content locked inside an inaccessible PDF — the CPMI October 2024 report on harmonising APIs for cross-border payments — they do not refuse: they fabricate. Sonnet 4.6 produced a four-area internal structure for the self-assessment toolkit with named assessment dimensions, falsely asserting that structure was "confirmed from public summaries," when no public source describes the toolkit's internals. Opus 4.7 gave plausible-sounding per-recommendation stakeholder assignments derived from category names alone, with no basis in the regulator's text.

The cross-model pattern is consistent: both models infer structure from partial signals — category headings, publication abstracts, adjacent domain knowledge — and present inference as retrieved fact, without surfacing the inference move to the user. This failure shape is structurally significant: it reveals a calibration gap in how both models handle the boundary between "I know the shape of this document" and "I know its contents."

When this affects AI Labs

The CPMI API harmonisation framework is directly operational for compliance lawyers, payments-infrastructure architects, central-bank regulatory counsel, fintech product teams building cross-border payment rails, and correspondent-bank operations managers — all of whom routinely use frontier models to accelerate regulatory interpretation. Questions of the exact type that produced failures here — "which stakeholders does each recommendation target?", "what does the self-assessment toolkit contain?", "which central bank is CPMI's named partner on pre-validation APIs?" — are live professional queries.

When a model answers confidently with fabricated structure and the user acts on that answer in a regulatory engagement, the downstream harm is concrete: advice given under the wrong authority, implementation plans built on invented taxonomy, compliance submissions citing fictional institutional mandates.

For the lab, the exposure compounds. An AI that presents fabricated per-recommendation stakeholder assignments with the same surface confidence as retrieved fact creates liability ambiguity for customers who cannot distinguish inference from retrieval. If a compliance team submits a CPMI-facing position paper that misattributes a recommendation's target stakeholder group because a model invented the breakdown, the reputational and regtech-liability path leads back to the model's output.

Separately, evals that cover this regulation by testing on the document's accessible surface — the publication abstract, the four recommendation-category names — will return false confidence scores: the model can answer the shallow questions correctly while hallucinating everything that sits behind the PDF barrier.

The structural feature of this regulation that makes it a high-yield failure surface is the combination of a machine-readable landing page with an inaccessible full-text PDF containing all substantive technical content. The abstract signals enough structure — four categories, ten recommendations, a self-assessment toolkit, a named collaboration partner — that a model can construct a plausible-seeming answer to granular questions using that skeleton. The answers sound specific. They are not retrieved. This pattern — partial-signal hallucination behind a credible surface scaffold — will recur across any regulatory corpus where the abstract is public but the operative text is PDF-gated.

Aggregate impact

Model	Configuration	Failure count	Dominant error pattern
Claude Opus 4.7	With web search	4	Confident inference presented as retrieval on PDF-gated content
Claude Sonnet 4.6	With web search	6	Schema over-specification and named-entity substitution on inaccessible technical content

Claude Opus 4.7 with web search produced four failures, clustered at two structural fault lines. The first is the PDF-gated content boundary: when asked about toolkit internals or per-recommendation stakeholder assignments — content exclusively in the inaccessible full report — Opus 4.7 generated structurally plausible answers by inferring from category names and domain priors, without flagging the inference as unverified.

The second fault line is named-institution attribution in multi-body contexts: asked which central bank CPMI had specifically named as its collaboration partner on the pre-validation API recommendation, Opus 4.7 identified SARB as "plausible" while failing to commit — and cited a fabricated Bank of England URL as supporting evidence. On quantitative content from a November 2023 CPMI speech giving a precise ownership breakdown of fast payment systems (40% central-bank-operated, 35% privately operated), Opus 4.7 substituted a 2025 monitoring survey count, displacing the correct named-speech figure with a different data vintage entirely.

Claude Sonnet 4.6 with web search produced six failures across two intersecting error shapes.

The first is schema over-specification: when the full report's PDF is inaccessible, Sonnet 4.6 did not hedge — it generated a four-area internal structure for the self-assessment toolkit with named dimensions and a usage process, then falsely asserted this structure was "confirmed from public summaries." The same pattern appeared on per-recommendation stakeholder breakdowns: Sonnet 4.6 assigned specific institution types to each of the four recommendation categories by inference from the category names and APEX body composition, presenting the output with no indication it was constructed rather than retrieved.

The second error shape is named-entity substitution in institutional attribution: Sonnet 4.6 identified the Bank of England as CPMI's closest named involvement for the pre-validation API recommendation, missing the explicit SARB attribution in CPMI Brief No. 9. Sonnet 4.6 also misdated a related CPMI-PMPG publication by two months (April 2026 vs. February 2026) with a fabricated citation URL, self-retracting only when challenged to verify.

Failures across both models cluster on two surfaces: (1) content that exists behind the PDF barrier but whose existence is signalled by the accessible abstract, and (2) named-institution attribution in low-frequency partnership announcements that fall outside the training window's primary index or the retrieval stack's ranking prioritisation. The cross-model pattern — both models hallucinate structure on PDF-gated content rather than refusing; both mis-attribute the lower-frequency named institution in multi-body contexts — suggests a shared calibration gap at the inference-vs-retrieval boundary that is not model-specific but architecture-level.

Findings

10 findings in this case study. Click any to see its full evidence card.

Finding on 'Q005 Probe' for Claude Opus 4.7 with web search ON see this finding →
Finding on 'Q007 Probe' for Claude Opus 4.7 with web search ON see this finding →
Finding on 'Q008 Probe' for Claude Opus 4.7 with web search ON see this finding →
Finding on 'Q010 Probe' for Claude Opus 4.7 with web search ON see this finding →
Finding on 'Q001 Probe' for Claude Sonnet 4.6 with web search ON see this finding →
Finding on 'Q005 Probe' for Claude Sonnet 4.6 with web search ON see this finding →
Finding on 'Q007 Probe' for Claude Sonnet 4.6 with web search ON see this finding →
Finding on 'Q008 Probe' for Claude Sonnet 4.6 with web search ON see this finding →
Finding on 'Q009 Probe' for Claude Sonnet 4.6 with web search ON see this finding →
Finding on 'Q010 Probe' for Claude Sonnet 4.6 with web search ON see this finding →

What your team should do

Implications for your training data

The dominant failure across both models is inference-as-retrieval on PDF-gated content: when the full regulatory text is inaccessible, both models generate plausible internal structure from category labels and domain priors, presenting the output as retrieved content. Training-side, this suggests the corpus for this regulator — and the broader BIS / CPMI publication set — is indexed primarily at the abstract and landing-page level, with the operative PDF text absent or only partially ingested.

For a document like the October 2024 API harmonisation report, the landing page signals enough structure (four categories, ten recommendations, a self-assessment toolkit) that the model can construct a convincing-sounding breakdown without any operative-text basis. Fixing this requires either full-text PDF ingestion for the BIS corpus or, where that is not available, training the model to distinguish "I know this document exists and have its abstract" from "I know this document's contents."

The named-institution attribution failures point to a different training-data gap: low-frequency partnership announcements in recent official-speech and brief content. CPMI Brief No. 9 (November 2025) explicitly names SARB as CPMI's collaboration partner on the pre-validation API recommendation — but both models failed to retrieve or correctly weight this, with both defaulting to higher-prior institutions (Bank of England for Sonnet 4.6; a plausibility hedge for Opus 4.7). The training corpus for BIS-issued briefs appears to lag or incompletely cover sub-documents issued after mid-2025. The d230 date error (April vs.

February 2026) is consistent with the same gap: the model reconstructed the publication detail from partial signals rather than a clean record of the February 2026 issue. Corpus refresh cadence for BIS subsidiary publications — briefs, speeches, updated reports — should be reviewed against the actual update tempo of the BIS website.

Implications for your post-training logic

The self-assessment toolkit fabrication by Claude Sonnet 4.6 includes an explicit false attribution to "public summaries" — the model did not simply hallucinate, it generated a provenance claim that did not exist. This is a calibration failure in the confidence-signal used to label inference vs. retrieval: the model committed to "confirmed from public summaries" as a source label for content it constructed.

Post-training, the reward signal for citing sources needs to distinguish between "I retrieved this from a source" and "this is consistent with how this type of document is typically structured." The current signal appears to allow constructed content to be labelled as retrieved when the model has sufficient domain confidence to make the inference feel grounded.

The fast-payment-system ownership breakdown failure by Claude Sonnet 4.6 — where the model correctly retrieved three statistics from a speech but asserted the ownership data in the same speech was "not enumerated" — points to a retrieval incompleteness issue at the sentence or paragraph level. The retrieval pipeline indexed some but not all of the speech's data points.

Post-training, a self-check pass on numeric-data completeness within a single retrieved source would catch this class of false-negative: if the model has confirmed a source exists and contains relevant data, it should verify against the full retrieved text before asserting the absence of a specific data point.

Specific eval / red-team probes RegLeg suggests

PDF-barrier content fabrication: For regulators whose operative text is PDF-only, probe for internal document structure claims — toolkit area breakdowns, per-recommendation requirement lists, appendix schemas — to surface how models handle the gap between "document exists" and "document content is known."
Stakeholder mapping on multi-category frameworks: Ask for per-recommendation or per-category stakeholder assignments where the target document contains a summary-level stakeholder statement ("directed at a broad array of stakeholders") but no per-item breakdown. Probe whether the model hedges or constructs.
Named-institution attribution in low-frequency partnership announcements: Probe the model's recall of recent official partnership designations (sub-2025 briefs and speeches) where a lower-frequency institution is the explicitly named partner but a higher-frequency institution in the same domain is a plausible substitute.
Partial retrieval false-negative detection: Where a model retrieves some statistics from a source, probe whether it correctly reports the remaining statistics from the same source or asserts their absence.
Self-retracting date errors under challenge: For recent publications (late 2025 / early 2026), probe publication date recall and whether the model self-retracts under challenge, indicating reconstruction rather than clean retrieval.

How RLB can help

Across our documented work on BIS, CPMI, FSB, and peer-regulator content, we identify recurring failure surfaces that internal evals anchored to published benchmarks tend to miss: subcategory-numeric conflation where the model retrieves the right document but the wrong data vintage; multi-body institutional attribution drift where lower-frequency named partners are displaced by higher-prior institutions; schema over-specification on technical frameworks where the model constructs internal structure from category labels and presents inference as retrieval; false-negative evasion on retrievable content where partial indexing causes the model to assert absence of data that exists in the accessible record; and fabricated provenance labelling where the model assigns a false "confirmed from public summaries" warrant to constructed content.

These failure shapes are structurally consistent across model versions and configurations — they point to architecture-level gaps that internal evals focused on a single model's benchmark performance are not positioned to see.

We can convert documented failure modes directly into capabilities your team can use. For each failure pattern, we generate correction pairs derived from the regulator's authoritative text — structured for direct ingestion into your training-data pipeline. We run embedded eval partnerships against a defined regulator portfolio, producing quarterly comparative reports across model versions with regression monitoring on previously-documented failure modes so you can track whether targeted fixes hold. For capability launches that touch regulated financial-services, payments-infrastructure, or cross-border regulatory content, we run pre-release evaluation cycles and flag failure shapes before they reach customers.

And for specific regulators being added to your deployment footprint, we offer red-team consultation on the failure surfaces their content structure is likely to expose.

To scope a partnership for refining your models against these failure modes, reach out at reglegbrief.com.

← Back to summary Other AI Labs white papers →