Both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search produced failures on CPMI-IOSCO's Implementation Monitoring of the PFMI: Level 3 Assessment on General Business Risks (Bank for International Settlements, November 2025) that share a common shape: the models reconstructed rule conditions from internalised schema rather than from the regulator's published text, generating structurally plausible but materially wrong formulations of the standard's quantitative requirements.
The six confirmed failures across both models converge on three areas — the LNAFE minimum structure under PFMI Principle 15 Key Consideration 3, the Basel/CRD capital-counting carve-out within the same provision, and institutional attribution for the assessment's co-governance structure — where the regulatory record is precise and the models' outputs deviated from it in ways that would carry direct compliance consequences for any practitioner acting on the response. When web search is enabled, neither model resolved these gaps through retrieval; in several cases, sourcing worsened the output by introducing third-party paraphrases that diverged from the regulator's verbatim text.
The pattern signals a systematic gap in how both model configurations handle the intersection of technical regulatory numerics, conditional qualifications within formally structured standards, and recent official publications that fall at or past the retrieval pipeline's effective indexing boundary.
PFMI Principle 15 sits at the operational centre of global financial market infrastructure — central counterparties, central securities depositories, trade repositories, and payment systems collectively route trillions in daily settlement exposures against the capital adequacy standards this assessment evaluates. Legal, risk, and compliance teams at these institutions are active, sophisticated AI users. When they query a model about LNAFE minimums, Basel capital counting, or the specifics of a BIS-IOSCO Level 3 review, they are not doing academic research: they are drafting internal policy, preparing regulatory submissions, or briefing boards.
A confidently wrong response — one that fabricates a "greater of" condition where the rule states a flat floor, or denies a capital-counting carve-out that exists in the published standard — produces downstream regulatory exposure for the institution and, once traceable to a specific model output, reputational and litigation exposure for the lab that deployed it.
The failure modes documented here would not surface in standard capability benchmarks. PFMI Principle 15 Key Consideration 3 is a short, technically precise provision in a long, technically dense document published by a multi-jurisdictional standard-setting body in late 2025. The failure condition is not obscurity — it is the specific combination of a formally structured international standard, quantitative conditions expressed as conditional qualifications rather than simple statements, a cross-reference network between related Key Considerations, and a publication date that strains indexing cadence for even well-resourced retrieval pipelines.
These are structural features shared by a large category of regulatory content, making the PFMI findings a representative stress test rather than an isolated edge case.
For a lab's evals and red-team programme, the gap here is the absence of authoritative-source grounding on regulatory numerics. Neither web search configuration closed the gap; in one case retrieval compounded it by surfacing a third-party commentary that paraphrased the rule incorrectly, and the model deferred to that paraphrase. Any deployment footprint that includes financial services, payments infrastructure, regtech, or legal research for regulated industries carries this exposure at scale. The BIS-IOSCO PFMI framework applies across dozens of FMIs in every major financial centre: the addressable surface is large, the question type is routine, and the failure mode is systematic.
| Model | Configuration | Failure count | Dominant error pattern |
|---|---|---|---|
| Claude Opus 4.7 | Web search enabled | 3 | Conditional-structure fabrication on PFMI Principle 15 numerics; institutional attribution evasion via source-deflection |
| Claude Sonnet 4.6 | Web search enabled | 3 | Conditional-structure fabrication on PFMI Principle 15 numerics; Key Consideration mis-assignment; timeline truncation on assessment process |
Claude Opus 4.7 with web search produced three failures concentrated on PFMI Principle 15 and the assessment's governance record. On the LNAFE minimum (Key Consideration 3), the model generated a "greater of" compound condition — the higher of a scenario-analysis-derived amount or six months of operating expenses — where the regulator's text states only the flat six-month floor. On the Basel/CRD capital-counting carve-out within the same provision, the model substituted an invented liquidity-and-non-duplication condition for the regulator's actual published criterion.
On the co-governance attribution question, the model declined to name the IMSG co-chairs, redirecting the query to the report's cover page, despite the information being present in the published standard. In that same response, the model cited a third-party regulatory commentary URL (labelled Pretextual) and then refused to draw on any attributed source.
Claude Sonnet 4.6 with web search produced a parallel failure on the same LNAFE Basel/CRD carve-out, asserting flatly that Key Consideration 3 "does NOT include any carve-out or exception for equity held under international risk-based capital standards" — a direct contradiction of the regulator's published text. On the LNAFE minimum structure, Sonnet 4.6 located the six-month operating expense floor but attributed it to Key Consideration 2 rather than Key Consideration 3, a cross-reference error that would misdirect a compliance reviewer to the wrong provision of the standard.
On the assessment process timeline, the model truncated the IMSG's engagement window to 2023–2024, omitting the 2025 follow-up rounds that are explicitly documented in the published text and that are material to the assessment's conclusions.
Across both configurations, failures cluster where the regulatory text requires holding a precise conditional structure intact: a single-floor minimum that must not be inflated into a compound condition, a carve-out that is stated as permissive rather than mandatory, a Key Consideration cross-reference that must be exact, and a timeline with a hard endpoint in 2025. Both models showed a tendency to generate internally coherent but externally wrong formulations — the responses read as authoritative precisely because they reconstruct plausible schema rather than retrieve authoritative text.
The joint failure pattern points to a gap in how retrieval-augmented configurations handle formally structured international standards where the regulator's verbatim language is load-bearing and paraphrase is not a safe substitute.
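The difference between the flat floor and the fabricated compound condition can be shown with simple arithmetic. The sketch below follows only this report's description of the two formulations; the function names and the figures are hypothetical and are not taken from the regulator's text.

```python
def flat_floor(monthly_operating_expenses: float) -> float:
    """Flat minimum as this report describes it: six months of operating expenses."""
    return 6 * monthly_operating_expenses


def fabricated_greater_of(monthly_operating_expenses: float,
                          scenario_amount: float) -> float:
    """The compound condition the model generated: the higher of a
    scenario-derived amount or the six-month figure."""
    return max(scenario_amount, 6 * monthly_operating_expenses)


# Hypothetical figures, in millions of an unspecified currency.
monthly_opex = 10.0
scenario_amount = 85.0

print(flat_floor(monthly_opex))                              # 60.0
print(fabricated_greater_of(monthly_opex, scenario_amount))  # 85.0
```

Whenever the scenario-derived amount exceeds six months of expenses, the fabricated formulation reports a higher minimum than the flat floor, so a practitioner relying on it would misstate the requirement.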
The dominant failure across both models is the substitution of plausible schema for verbatim regulatory text on PFMI Principle 15 Key Consideration 3.
The "greater of" compound condition generated by Claude Opus 4.7 with web search, and the flat denial of the Basel/CRD carve-out by Claude Sonnet 4.6 with web search, both point to the same training-data gap: the models have a generalised structural model of how PFMI Key Considerations work — one that draws on the broader framework architecture and adjacent provisions — but lack reliable anchoring to the verbatim language of individual Key Considerations where that language departs from the structural expectation.
Correction at the training-data level requires pairing the regulator's exact text for each Key Consideration with curated examples of the common reconstruction errors, so the model learns to prefer verbatim constraint over structural inference when the two diverge.
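One way such a pairing might be represented is sketched below. The record layout, field names, and the `error_type` label are assumptions made for illustration, not a format this report prescribes, and the verbatim text is deliberately left as a placeholder to be filled from the primary source.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CorrectionPair:
    """Hypothetical record pairing regulator text with a model error."""
    provision: str             # which provision the pair targets
    verbatim_text: str         # regulator's exact wording, copied from the primary source
    reconstruction_error: str  # the wrong formulation the model produced
    error_type: str            # illustrative label for the failure shape


def to_jsonl(pairs: list[CorrectionPair]) -> str:
    """Serialise pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(p), sort_keys=True) for p in pairs)


pair = CorrectionPair(
    provision="PFMI Principle 15, Key Consideration 3",
    verbatim_text="<regulator's verbatim text goes here>",  # placeholder, not a quotation
    reconstruction_error=(
        "The minimum is the greater of a scenario-derived amount "
        "or six months of operating expenses."
    ),
    error_type="compound_condition_fabrication",
)

print(to_jsonl([pair]))
```

Keeping the verbatim text and the reconstruction error in the same record makes the contrast explicit for whatever training or evaluation step consumes the file.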
The timeline truncation failure on the assessment process (2023–2024 returned instead of 2023–2025) is a retrieval-boundary artefact. The regulator published the November 2025 assessment with explicit 2025 engagement dates; the model's indexed content for this publication appears to have been captured before those final-phase details were reliably available. For BIS-IOSCO content specifically, the corpus refresh cadence for formal assessment publications needs to be aligned with BIS publication dates, not third-party commentary dates, and should include the full Annex content where engagement timelines are documented.
The cross-reference mis-assignment (KC2 vs KC3 for the six-month floor) suggests that the training corpus representation of the PFMI Annex A provision list may be underspecified, with Key Consideration numbers and their associated text not reliably linked.
In both tested configurations web search was enabled, and in neither case did retrieval prevent the Key Consideration errors. In the institutional attribution case, the model cited a third-party regulatory commentary URL (Pretextual) while simultaneously declining to name the officials — a pattern where the retrieval result was non-authoritative but was still preferred over admitting retrieval failure. Post-training adjustment should tighten the signal for "retrieved content is third-party summary of a regulatory document" and raise the threshold for treating that content as authoritative on specific numeric or personnel-attribution questions.
For regulator-domain queries, the ranker should de-weight third-party commentary relative to primary regulatory text when the query targets specific rule conditions, named individuals, or dated assessment outcomes.
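A minimal sketch of that de-weighting follows, assuming the ranker exposes a relevance score per retrieved URL. The `Retrieved` type, the penalty factor, the choice of which domains count as primary, and the commentary URL are all hypothetical; a production ranker would need a curated source list and a tuned weight.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumption: these domains are treated as primary sources for PFMI content.
PRIMARY_DOMAINS = {"bis.org", "iosco.org"}
# Hypothetical multiplier applied to non-primary sources on rule-specific queries.
THIRD_PARTY_PENALTY = 0.5


@dataclass
class Retrieved:
    url: str
    score: float  # relevance score from the upstream retriever


def is_primary(url: str) -> bool:
    """True if the URL's host is a primary domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in PRIMARY_DOMAINS)


def rerank(results: list[Retrieved], rule_specific_query: bool) -> list[Retrieved]:
    """Sort by score, down-weighting third-party sources on rule-specific queries."""
    def adjusted(r: Retrieved) -> float:
        if rule_specific_query and not is_primary(r.url):
            return r.score * THIRD_PARTY_PENALTY
        return r.score
    return sorted(results, key=adjusted, reverse=True)


results = [
    Retrieved("https://commentary.example/pfmi-summary", 0.9),
    Retrieved("https://www.bis.org/", 0.7),
]
print(rerank(results, rule_specific_query=True)[0].url)  # https://www.bis.org/
```

The penalty applies only when the query is flagged as rule-specific, so general background questions can still surface commentary at its original score.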
Where the model commits to a specific Key Consideration number in its response (e.g., "KC3 requires X" or "the six-month floor is in KC2"), a self-verification pass against the regulator's primary document structure would catch the class of cross-reference mis-assignment seen here. Similarly, where the model generates a compound condition ("greater of A or B") for a regulatory minimum, a calibration check on whether that compound structure is supported by the retrieved or trained-on text — rather than inferred from adjacent provisions — would have prevented the LNAFE fabrication.
The flat-denial failure pattern (Sonnet 4.6 asserting "does NOT include any carve-out") warrants specific attention: high-confidence categorical denials of a provision's existence, where the provision is present in the authoritative text, are among the highest-consequence failure modes for compliance-context users.
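The two checks described above could be approximated with simple pattern matching, as in the sketch below. The `topic_to_kc` mapping and the source text are inputs that would have to come from the regulator's primary document; the values used here are placeholders based on this report's account, and the patterns are illustrative rather than exhaustive.

```python
import re

KC_PATTERN = re.compile(r"\b(?:KC|Key Consideration)\s*(\d+)\b", re.IGNORECASE)
COMPOUND_PATTERN = re.compile(r"\b(?:greater|higher|larger)\s+of\b", re.IGNORECASE)


def check_response(response: str, topic_to_kc: dict[str, int],
                   source_text: str) -> list[str]:
    """Return warnings for unsupported KC numbers or compound conditions."""
    flags = []
    cited = {int(n) for n in KC_PATTERN.findall(response)}
    for topic, expected in topic_to_kc.items():
        # Flag a topic that is mentioned alongside KC numbers that exclude the expected one.
        if topic.lower() in response.lower() and cited and expected not in cited:
            flags.append(
                f"'{topic}' expected under KC{expected}, response cites {sorted(cited)}"
            )
    # Flag a compound condition that has no counterpart in the source text.
    if COMPOUND_PATTERN.search(response) and not COMPOUND_PATTERN.search(source_text):
        flags.append("compound condition not supported by source text")
    return flags


# Placeholder inputs; the mapping reflects this report's account of the standard.
topic_to_kc = {"six-month floor": 3}
source_text = "<regulator's verbatim Key Consideration 3 text goes here>"
response = ("The six-month floor is in KC2 and is the greater of a scenario "
            "amount or six months of operating expenses.")

for flag in check_response(response, topic_to_kc, source_text):
    print(flag)
```

A check like this only raises candidates for review; it cannot confirm that a response is correct, and categorical denials such as the one quoted above would need a separate pattern.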
Across our work on international regulatory standards, we have documented a specific and recurring class of failure modes that internal evals are structurally unlikely to surface: compound-condition fabrication on flat-floor regulatory minimums, categorical denial of embedded carve-outs present in the authoritative text, cross-reference mis-assignment within multi-Key-Consideration standards, institutional personnel attribution failure disguised as confident source-referral, and timeline truncation on recent BIS-IOSCO assessment publications where retrieval cadence lags the regulator's publication tempo. These failure shapes appear across model versions and configurations on PFMI content and adjacent regulatory frameworks — and they are not random.
They concentrate where the regulator's verbatim language diverges from a structurally plausible reconstruction, which is precisely the terrain that standard capability benchmarks do not probe.
On any failure mode we have documented, we can generate targeted correction pairs derived directly from the regulator's authoritative text — structured for direct ingestion into your training-data pipeline, with the regulator's verbatim constraint matched against the specific reconstruction error the model produced. Beyond individual correction pairs, we can run embedded comparative evaluation against a defined regulator portfolio on a quarterly cadence: version-to-version regression monitoring on previously documented failure modes, cross-configuration comparison (default vs. web-search vs. extended-retrieval) on rule-specific questions, and structured comparison reports formatted for your evals team.
For planned capability launches that will expand your models' deployment footprint into financial services, payments infrastructure, or cross-border regulatory content, we can run pre-release evaluation cycles and flag failure shapes before they reach users — rather than after a compliance team has relied on a confidently wrong answer in a regulatory submission.
If identifying and closing these failure modes at the regulator-specific level is a priority for your team, we should talk. Reach out via reglegbrief.com to scope a technical partnership.