AI Labs · updated 2026-06-04 · methodology v2.1

API Harmonisation for Cross-Border Payments: Model Failure Patterns on CPMI's October 2024 Framework

Executive summary

When both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search encounter content locked inside an inaccessible PDF — the CPMI October 2024 report on harmonising APIs for cross-border payments — they do not refuse: they fabricate. Sonnet 4.6 produced a four-area internal structure for the self-assessment toolkit with named assessment dimensions, falsely asserting that structure was confirmed from public summaries when no public source describes the toolkit's internals. Opus 4.7 gave plausible-sounding per-recommendation stakeholder assignments derived from category names alone, with no basis in the regulator's text. The cross-model pattern is consistent: both models infer structure from partial signals and present inference as retrieved fact, without surfacing the inference move to the user. This failure shape reveals a calibration gap in how both models handle the boundary between knowing a document's shape and knowing its contents.

Findings — impact summary

This is the consolidated view of findings. Click 'see details →' on any item for the full details for each finding.

Finding on 'Q005 Probe' for Claude Opus 4.7 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q005-Opus47
The model committed to a structural claim (recommendation-keyed workbook, dual operator/participant scope) while hedging only on the numeric count — revealing that the calibration signal for 'I am inferring structure vs. retrieving content' is not uniformly applied. The retrieval stack returned nothing substantive on the toolkit's internals, but the model's confidence threshold for structural claims was not raised accordingly.
see details →
Finding on 'Q007 Probe' for Claude Opus 4.7 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q007-Opus47
The fabricated Bank of England URL generated to fill the citation gap is the key signal: the model's citation-generation subsystem produced a plausible-looking but non-existent URL rather than returning 'no source found.' This points to a citation-generation pipeline that is not gated on verified retrieval — it will produce a URL-shaped output regardless of whether a real URL was retrieved.
see details →
Finding on 'Q008 Probe' for Claude Opus 4.7 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q008-Opus47
Domain inference used as a stakeholder-assignment mechanism — assigning ISO, BIAN, and SWIFT to a harmonisation-processes category by structural reasoning — is not retrieval. The training data for this document appears to lack per-recommendation content, and the model's self-check did not flag that its output was constructed rather than retrieved. The RAG glue layer is not enforcing a 'content was found' gate before allowing domain-inference fill.
see details →
Finding on 'Q010 Probe' for Claude Opus 4.7 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q010-Opus47
The substitution of a 2025 monitoring survey count for the correct 2023 speech statistics indicates the retrieval ranker weighted recency over source-match relevance. When the question asks for specific named statistics from a specific named speech, the ranker should surface that speech — not the most recent document that contains related numeric data. The retrieval-routing signal for 'named source in query' vs. 'topic in query' appears not to be differentiated.
see details →
Finding on 'Q001 Probe' for Claude Sonnet 4.6 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q001-Sonnet46
The 'approximately 2–3 recommendations' qualifier embedded in the fabricated output is a leaked inference signal: the model knew it was constructing an answer, not retrieving one. The calibration gap is that this internal signal did not propagate to a refusal or a clearly-marked 'this is my inference' framing — it was suppressed in favour of producing a structurally complete response. Post-training reward for producing complete answers may be overriding the calibration signal for retrieval confidence.
see details →
Finding on 'Q005 Probe' for Claude Sonnet 4.6 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q005-Sonnet46
The false attribution to 'public summaries' is the critical failure signal for this finding: it shows the provenance-labelling step in the response-generation pipeline is not gated on actual retrieval. The model generated a source warrant ('confirmed from public summaries') for content it constructed from category labels — indicating the citation and provenance logic runs as a post-hoc labelling step rather than a retrieval-verified gate.
see details →
Finding on 'Q007 Probe' for Claude Sonnet 4.6 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q007-Sonnet46
The Bank of England substitution for SARB is a high-prior-institution fill pattern: the retrieval pipeline did not surface CPMI Brief No. 9, so the model substituted the most contextually plausible high-frequency institution. This failure mode will recur whenever the correct answer is a lower-frequency named institution in a recent sub-document that the retrieval index has not fully covered — a structural property of how institutional attribution fails under sparse indexing.
see details →
Finding on 'Q008 Probe' for Claude Sonnet 4.6 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q008-Sonnet46
Identical failure shape to Opus 4.7 on the same question — both models assigned stakeholders by inference from category names. The cross-model convergence on this specific item confirms the failure is not model-specific: it is an architecture-level property of how both models handle 'I have the document's structure but not its content.' Correction pairs for this item should target both models' training pipelines.
see details →
Finding on 'Q009 Probe' for Claude Sonnet 4.6 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q009-Sonnet46
The two-month date error (April vs. February 2026) that self-retracts under challenge is a reconstruction-under-uncertainty pattern: the model had insufficient training signal on the exact publication date and generated a plausible date, then supported it with a fabricated URL. The self-retraction under challenge confirms the model was not retrieving a clean date record — it was confabulating with enough confidence to pass initial scrutiny. The citation-generation step produced a fabricated URL to fill the source slot, consistent with the broader citation-pipeline gap observed in the Opus 4.7 findings.
see details →
Finding on 'Q010 Probe' for Claude Sonnet 4.6 with web search ONRLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q010-Sonnet46
The false-negative on ownership breakdown — asserting the 40%/35% figures were 'not enumerated in public sources' when they appear in the same speech the model successfully retrieved other statistics from — points to a partial-retrieval completeness failure. The retrieval pipeline indexed some sentences from the speech but dropped the ownership-breakdown paragraph. The model's self-check did not compare its 'not found' assertion against the full retrieved text of the already-confirmed source.
see details →

← Other AI Labs white papers The detailed Case study →