The primary failure surface for Claude Opus 4.7 with web search on the CPMI Harmonised ISO 20022 Data Requirements for Enhancing Cross-Border Payments — Updated Report is numeric conflation across disaggregated adoption-rate subcategories: the model collapses distinct faster-payment-system and RTGS figures into a single blended claim. Claude Sonnet 4.6 with web search exhibits a different but structurally related failure: attribution errors on multi-body institutional roles, and false negatives on quantitative operational statistics that appear in official speeches but not in the core publication text. Across both models, failures concentrate on content that is either numerically granular at the subcategory level or delivered through secondary regulatory channels (speeches, working-group announcements, implementing-body FAQs) rather than the primary document body. This failure shape suggests that when regulator-attributed statistics arrive via channels with lower indexing density, both models fall back on internally reconstructed composites rather than retrieval, and that the reconstruction degrades silently instead of producing an explicit uncertainty signal.
This is the consolidated view of findings.
This failure implicates the training corpus's handling of subcategory-level numeric claims from official-speech channels. The model produced a single blended 79% figure where the regulator's March 2026 speech gives two distinct values — one for faster payment systems and a substantially lower one for RTGS. This suggests the speech content either was not retrieved or was compressed during ingestion in a way that averaged across the two system-type categories. If your eval suite tests adoption-rate questions at the aggregate level only, this failure is invisible; the gap is specifically at subcategory resolution.
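A subcategory-resolution check of the kind this finding calls for could be sketched as follows. This is a minimal illustration, not the evaluation harness used for this report: the function names, the tolerance, and the reference values in the usage lines are all hypothetical placeholders (the speech's actual two figures are not reproduced here). It only tests whether each expected subcategory figure appears in the answer; it does not verify that a figure is attached to the correct label.

```python
import re


def extract_percentages(answer: str) -> list[float]:
    """Return every percentage figure mentioned in the answer text."""
    return [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", answer)]


def check_subcategory_resolution(answer: str, expected: dict[str, float],
                                 tolerance: float = 0.5) -> dict:
    """Compare figures in an answer against per-subcategory reference values.

    `expected` maps a subcategory label to its reference percentage.
    The labels and values are supplied by the caller (hypothetical here).
    """
    found = extract_percentages(answer)
    # A subcategory is "missing" if no figure in the answer is within tolerance.
    missing = [
        label for label, value in expected.items()
        if not any(abs(value - f) <= tolerance for f in found)
    ]
    # "Blended" means: one distinct figure given where several were expected,
    # and that figure fails to match at least one reference value.
    blended = len(set(found)) == 1 and len(expected) > 1 and bool(missing)
    return {
        "figures_found": found,
        "missing_subcategories": missing,
        "blended": blended,
    }


# Placeholder reference values, NOT the figures from the regulator's speech.
reference = {"faster payment systems": 90.0, "RTGS": 60.0}
result = check_subcategory_resolution(
    "About 79% of systems have adopted the standard.", reference
)
print(result["blended"], result["missing_subcategories"])
```

An aggregate-only test would pass any answer containing a plausible single number; requiring one match per subcategory is what exposes the blended claim.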
This failure implicates retrieval coverage of implementing-body FAQ layers. When the FRB Services FAQ defining the hybrid/end-state postal address format is not retrieved, the model reconstructs the mandatory/optional field boundary from training data, and the reconstruction tends toward over-specification: it adds Building Number, Post Code, and Country Sub-Division to the mandatory tier where the FAQ places them as optional. The retrieval layer is not surfacing the implementing body's own technical specification when it conflicts with a more structured internal representation.
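The over-specification pattern can be tested mechanically once the reference optional tier is known. The sketch below is an assumption-laden illustration: the optional set is limited to the three fields this finding names, the function name is hypothetical, and the "claimed mandatory" list in the usage lines is an invented model output rather than a quoted one.

```python
# Fields this finding says the FAQ places in the optional tier.
# The full FAQ field list is not reproduced here.
REFERENCE_OPTIONAL = {"Building Number", "Post Code", "Country Sub-Division"}


def _normalise(name: str) -> str:
    """Lower-case and collapse hyphens/whitespace so label variants match."""
    return " ".join(name.lower().replace("-", " ").split())


def over_specified_fields(claimed_mandatory, reference_optional=REFERENCE_OPTIONAL):
    """Return reference-optional fields that the model labelled mandatory."""
    optional = {_normalise(f): f for f in reference_optional}
    hits = {
        optional[_normalise(f)]
        for f in claimed_mandatory
        if _normalise(f) in optional
    }
    return sorted(hits)


# Hypothetical model output listing fields it called mandatory.
claimed = ["Town Name", "Country", "Building Number", "post code",
           "Country Sub Division"]
print(over_specified_fields(claimed))
```

A non-empty result indicates the model promoted optional fields to the mandatory tier, which is the failure described above.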
This failure implicates training-data density for working-group chair attribution across multi-body CPMI frameworks. The RBA press release (October 2023) naming the Reserve Bank of Australia as working-group chair falls within any plausible training window, but the model substituted the Federal Reserve Bank of New York, a higher-frequency institution in CPMI-adjacent content. The hedge 'in available public sources' shows that the model registered uncertainty, yet the hedge did not stop it from stating the wrong attribution.
The calibration signal for multi-body institutional role questions — where the correct answer belongs to a lower-frequency institution — is not sufficient to hold against the frequency prior.
This failure implicates retrieval coverage of early-2026 BIS official-speech content and the calibration signal distinguishing 'not found' from 'outside retrieval window.' The model returned a confident false negative on statistics that appear in a BIS speech dated 12 March 2026: figures on inquiry rate and resolution-time reduction that are precise and attributable. The web search tool did not surface this content, and the model escalated to a definitive 'no such statistic exists' rather than signalling coverage uncertainty.
For users in compliance or payment-operations roles, a false negative on an official quantitative claim is as harmful as a wrong number.