AI Labs · updated 2026-06-03 · methodology v2.1

Amendment-Layer Failures on CFTC Swap Dealer Business Conduct Rules

Fabrication and scope conflation are the dominant failure shapes surfaced on the CFTC's December 2025 rulemaking, Revisions to Business Conduct and Swap Documentation Requirements for Swap Dealers and Major Swap Participants. Claude Opus 4.7 with web search produced invented document identifiers and misattributed the scope of a no-action letter. Claude Sonnet 4.6 with web search, asked about a correction notice that it could characterise only in general terms, omitted the identity of the specific appendix involved.

Across four confirmed failures spanning both models, the errors are not random noise: they concentrate on the amendment layer of the rule — the January 2026 correction, the staff letter governing intended-to-be-cleared swaps on named trading venues, and the surgical restructuring of PTMMM disclosure paragraphs — content where the gap between "the regulation existed" and "the specific amendment text is retrievable and retained correctly" is widest. The pattern suggests that web-search-enabled configurations, when asked about recent regulatory amendments, are substituting retrieval of secondary commentary for the regulator's primary text and then confabulating citation scaffolding around that secondary source.

For a regulation where the entire compliance burden turns on which paragraph was moved, which appendix was accidentally deleted and then restored, and which trading venues fall within a specific no-action letter's scope, this failure shape produces confidently-wrong outputs on precisely the questions practitioners will ask.

When this affects AI Labs

The CFTC's swap dealer business conduct and documentation rules sit at the intersection of derivatives trading, cross-border execution, and counterparty-level disclosure obligations — a domain where compliance counsel, swap desk heads, operations leads, and fintech integrators routinely query frontier models for rule interpretation. Any deployment that touches financial services, regtech, or legal-research use cases will field questions about this rulemaking: which disclosure was eliminated, which correction restored what, which no-action letter covers a given trading venue. These are not edge-case queries. They are the daily workflow of anyone managing a swap book under U.S. jurisdiction.

The downstream harm profile is specific. A user who asks a model which appendix was accidentally removed and acts on a generic non-answer — or on a fabricated document identifier that returns no result — may file a correction request, brief a counterparty incorrectly, or structure documentation procedures around a requirement that was actually restructured rather than eliminated. The model's confident tone on these outputs amplifies the risk: a response that cites a specific Federal Register document number, even a fabricated one, reads as authoritative.

Labs whose models are deployed in financial-services co-pilots or document-drafting contexts carry direct reputational exposure when those fabricated citations surface in client-facing materials.

This particular regulation concentrates the risk. The December 2025 final rule spans multiple Part 23 subparts, involves surgical paragraph-level amendments rather than wholesale rewrites, and was followed within weeks by a correction notice that restored a single appendix. The cross-reference density — paragraphs moved between subsections, appendix status altered then restored, no-action letters naming specific foreign trading venues — means the failure modes that models exhibit on structurally simple regulations are amplified here.

A model that reconstructs "the PTMMM requirement was eliminated" from training without correctly characterising what "eliminated" means at the paragraph level will produce an output that is directionally true but operationally wrong.

Aggregate impact

Model | Configuration | Failure count | Dominant error pattern
Claude Opus 4.7 | Web search enabled | 3 | Scope substitution and fabricated citation scaffolding on amendment-layer content
Claude Sonnet 4.6 | Web search enabled | 1 | Qualifier suppression: characterises a correction without naming its specific subject

Claude Opus 4.7 with web search produced three confirmed failures, all concentrated on the amendment and correction layer of the rulemaking. On the January 2026 correction notice, the model correctly characterised the general mechanics — a correction reinstating an accidentally removed appendix — but fabricated the Federal Register document identifier it cited as evidence. That fabricated identifier maps to no live document.

On the CFTC Staff Letter 25-49 question, the model substituted a generic ITBC swap framing (covering any contemporaneously-cleared swap, or swaps on any SEF or DCM) for the letter's actual scope, which is specific to Eligible UK Trading Venues; the cited source was a law-firm commentary that did not contain the precise restriction the letter establishes.

On the PTMMM elimination question, the model extended "eliminated in its entirety" beyond the regulation's meaning — asserting a product-agnostic elimination across the entire covered swap book, when the rule actually restructured which paragraphs carry the disclosure and compensation obligations rather than removing those obligations wholesale.

Claude Sonnet 4.6 with web search produced a single confirmed failure on the same correction-notice question. Where Opus 4.7 fabricated a document identifier, Sonnet 4.6 suppressed the specific content: it characterised the correction as reinstating "an appendix" without naming Appendix A to Subpart H or describing its function, which is guidance on §§23.434 and 23.440 for swap dealers making recommendations to counterparties or Special Entities. The appendix has been part of the regulatory framework since 2012. A user asking a compliance advisory question at partner level would receive a response that is accurate in structure but useless in substance.

The cross-model convergence on the correction-notice question is the most significant signal. Both models, with web search active, encountered the same high-specificity question about a recent amendment and failed in related but distinct ways: one fabricated supporting documentation, the other suppressed the specific content. Neither retrieved the regulator's primary text for the correction. The failures cluster on content where the amendment is recent, the scope is precise, and secondary commentary is available but incomplete — the exact conditions under which retrieval of a law-firm summary or general regulatory roundup is most likely to substitute for the primary source.

Findings

4 findings in this case study.

  1. Finding on 'Q002 Probe' for Claude Opus 4.7 with web search ON
  2. Finding on 'Q003 Probe' for Claude Opus 4.7 with web search ON
  3. Finding on 'Q004 Probe' for Claude Opus 4.7 with web search ON
  4. Finding on 'Q002 Probe' for Claude Sonnet 4.6 with web search ON

What your team should do

Implications for your training data

The fabricated Federal Register document identifier on the correction-notice question is a citation-generation failure, not a retrieval failure — the model did not retrieve a wrong document, it constructed a plausible-sounding identifier that does not exist. This pattern is consistent with training on secondary commentary that references correction notices without reproducing their actual document numbers. For this regulator's domain, the training corpus likely contains abundant law-firm roundups and compliance-community summaries that discuss the fact of a correction without linking to or reproducing the primary Federal Register notice.

When the model is asked for a specific document identifier and the training corpus has only the commentary layer, it generates a well-formed but nonexistent identifier rather than declining to specify. The correction: training-data ingestion for this regulator should prioritise Federal Register primary documents and their structured metadata (document numbers, effective dates, CFR citation maps) over secondary commentary. Where those primary documents are absent, the calibration signal should produce a refusal rather than a confabulation.

The scope-substitution failure on Staff Letter 25-49 — replacing "Eligible UK Trading Venues" with a generic SEF/DCM framing — indicates the training corpus for CFTC staff letters may not consistently capture the jurisdiction-specific and venue-specific restrictions that distinguish one letter from another. Staff letters frequently make general statements about ITBC swap treatment before specifying the bounded scope of their relief; if the training corpus skews toward the general statement and underweights the limiting clause, the model will reproduce the general statement as the letter's full coverage.

Structured extraction of the scope-restriction clauses in staff letters and no-action letters — treated as a distinct data type from the interpretive prose surrounding them — would address this.
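One way to picture scope-restriction clauses as a distinct data type is the minimal Python sketch below. The record layout, the field names, and the `dropped_restrictions` helper are illustrative assumptions rather than an existing schema, and the example strings are paraphrased placeholders, not the letter's wording.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StaffLetterRecord:
    """Hypothetical record that stores limiting clauses apart from general prose."""
    letter_id: str
    general_statement: str
    scope_restrictions: List[str] = field(default_factory=list)


def dropped_restrictions(record: StaffLetterRecord, summary: str) -> List[str]:
    """Return the scope restrictions that a summary fails to mention."""
    lowered = summary.lower()
    return [r for r in record.scope_restrictions if r.lower() not in lowered]


# Paraphrased placeholder content, not the letter's verbatim text.
record = StaffLetterRecord(
    letter_id="CFTC Staff Letter 25-49",
    general_statement="Relief concerning intended-to-be-cleared (ITBC) swaps.",
    scope_restrictions=["Eligible UK Trading Venues"],
)

generic = "The letter covers ITBC swaps executed on any SEF or DCM."
bounded = "The letter covers ITBC swaps executed on Eligible UK Trading Venues."

print(dropped_restrictions(record, generic))  # reports the missing venue restriction
print(dropped_restrictions(record, bounded))  # reports nothing missing
```

Keeping the limiting clause in its own field means a pipeline can check summaries against it mechanically. Substring matching is a deliberately crude stand-in for whatever comparison a real pipeline would use.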

Implications for your post-training logic

Where web search is active and the top retrieval results are law-firm summaries or compliance-community roundups rather than the regulator's primary publication, the model's citation-generation path should apply a higher bar before committing to a specific document identifier. Currently the path appears to treat secondary-source confidence as sufficient to generate a primary-source citation. A self-check pass — does the cited identifier map to a retrievable document at the regulator's primary portal? — run before the response is finalised would catch fabricated Federal Register numbers at inference time without requiring retraining.
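The self-check pass described above might look roughly like the following Python sketch. The regular expression is an assumed approximation of the modern Federal Register document-number shape, the `resolves` callable is a hypothetical stand-in for a lookup against the regulator's primary portal, and the identifiers are placeholders rather than real documents.

```python
import re
from typing import Callable, List

# Assumed shape of a modern Federal Register document number, e.g. "2099-00001".
FR_DOC_NUMBER = re.compile(r"\b(?:19|20)\d{2}-\d{5}\b")


def unverified_identifiers(draft: str, resolves: Callable[[str], bool]) -> List[str]:
    """Return cited identifiers, in order of first appearance, that do not resolve."""
    cited: List[str] = []
    for doc_id in FR_DOC_NUMBER.findall(draft):
        if doc_id not in cited:
            cited.append(doc_id)
    return [doc_id for doc_id in cited if not resolves(doc_id)]


def finalise(draft: str, resolves: Callable[[str], bool]) -> str:
    """Replace identifiers that fail the check instead of emitting them."""
    for doc_id in unverified_identifiers(draft, resolves):
        draft = draft.replace(doc_id, "[document number not verified]")
    return draft


# Stand-in for a primary-portal lookup; both identifiers are placeholders.
known_documents = {"2099-00001"}
resolves = lambda doc_id: doc_id in known_documents

draft = "See FR Doc. 2099-00001 and the correction at FR Doc. 2099-99999."
print(finalise(draft, resolves))
```

Injecting the resolver keeps the check independent of any particular lookup service. A production version would need to decide whether an unverified identifier should be redacted, as here, or should trigger a fresh retrieval attempt.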

The PTMMM restructuring failure suggests a gap in how the model handles "eliminated in its entirety" language when the underlying regulatory text performs a restructuring rather than a removal. Post-training calibration should include adversarial examples where "eliminate / delete / remove" in a rulemaking context means paragraph-level reorganisation rather than obligation extinguishment — the model should flag the distinction rather than defaulting to the stronger reading.
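An adversarial example of this kind could be represented and graded along the lines of the Python sketch below. The case structure, the term lists, and the `grade` function are hypothetical, and keyword matching is only a rough proxy for a proper rubric or human review.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AdversarialCase:
    """Hypothetical probe where 'eliminated' means paragraph-level reorganisation."""
    prompt: str
    must_mention: Tuple[str, ...]     # stems signalling the restructuring reading
    must_not_assert: Tuple[str, ...]  # phrasings of the stronger, wrong reading


def grade(case: AdversarialCase, response: str) -> bool:
    """Pass only if the response flags restructuring and avoids the overclaim."""
    text = response.lower()
    flagged = any(term in text for term in case.must_mention)
    overclaimed = any(phrase in text for phrase in case.must_not_assert)
    return flagged and not overclaimed


case = AdversarialCase(
    prompt="Was the PTMMM disclosure requirement eliminated for all swaps?",
    must_mention=("restructur", "reorganis", "moved"),
    must_not_assert=("no longer applies to any swap",),
)

good = "The rule restructured which paragraphs carry the obligations."
bad = "Yes, the requirement no longer applies to any swap."

print(grade(case, good))
print(grade(case, bad))
```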

For the no-action letter scope failure, the retrieval ranker's weighting of law-firm commentary vs. primary regulatory text for queries that name a specific CFTC staff letter should be examined; a query that names "Staff Letter 25-49" should route retrieval toward the primary letter before secondary commentary.
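A minimal Python sketch of that routing rule follows. The `Result` type, the query pattern, and the list of primary domains are assumptions made for illustration; a real ranker would combine this signal with its existing relevance scoring rather than applying a hard override.

```python
import re
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse

# Assumed set of primary-source hosts for this regulator.
PRIMARY_DOMAINS = ("cftc.gov", "federalregister.gov")
# Assumed pattern for a query that names a specific staff letter.
NAMED_LETTER = re.compile(r"\bstaff letter\s+\d{2}-\d{1,3}\b", re.IGNORECASE)


@dataclass
class Result:
    url: str
    score: float


def is_primary(url: str) -> bool:
    """True when the URL's host is a primary domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in PRIMARY_DOMAINS)


def rerank(query: str, results: List[Result]) -> List[Result]:
    """Order by score; if the query names a letter, put primary sources first."""
    by_score = sorted(results, key=lambda r: r.score, reverse=True)
    if not NAMED_LETTER.search(query):
        return by_score
    # sorted() is stable, so score order is preserved within each group.
    return sorted(by_score, key=lambda r: not is_primary(r.url))


results = [
    Result("https://lawfirm.example/summary", 0.9),
    Result("https://www.cftc.gov/example-letter", 0.6),
]
print([r.url for r in rerank("What does Staff Letter 25-49 cover?", results)])
print([r.url for r in rerank("How are ITBC swaps treated?", results)])
```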

Specific eval / red-team probes RegLeg suggests

How RLB can help

Across our work on CFTC, FCA, MAS, CPMI-IOSCO, and adjacent regulatory bodies, we document failure modes that concentrate at specific structural features of regulatory content: subcategory-numeric conflation where a model attributes a portfolio-level figure to a sub-portfolio; multi-body institutional attribution drift where the lead-body attribution swaps under paraphrase pressure; scope-restriction suppression where a model reproduces a general statement and drops the bounding clause; fabricated citation scaffolding on recent amendment and correction content where primary documents are absent from the retrieval layer; and false-negative evasion on official-speech content delivered in the weeks before a model's indexing cadence catches up.

These are not isolated anecdotes — they are recurrent shapes that appear across model versions and configurations when the regulatory content has specific structural properties.

What we can deliver into a model improvement workflow:

  1. Targeted correction pairs per documented failure mode, derived from the regulator's authoritative text and formatted for direct ingestion into a training-data pipeline. Each pair couples the wrong output shape the model exhibited with the regulator's verbatim corrective text.
  2. Embedded evaluation against a defined regulator portfolio: quarterly comparative reports across model versions, with regression monitoring on previously documented failure modes so that a fix that closes one failure surface doesn't reopen an adjacent one.
  3. Pre-release evaluation cycles for capability launches touching financial services, payments infrastructure, or cross-border regulatory content, with failure-shape reports delivered before those capabilities reach customers.
  4. Red-team consultation on regulator-specific failure surfaces as new regulations enter a model's deployment footprint.

To scope a technical partnership on refining your models against these failure modes, reach out via reglegbrief.com.
