AI Labs · updated 2026-06-03 · methodology v2.1

CFTC Digital Asset Collateral Staff Guidance 2025: Hallucination Patterns in Claude Opus 4.7 and Claude Sonnet 4.6

Condition-sunset misclassification and fabricated amendment provenance are the dominant failure surfaces across both Claude Opus 4.7 and Claude Sonnet 4.6 on the CFTC's Digital Asset Collateral No-Action Relief and Tokenized Asset Staff Guidance (Market Participants Division, December 2025). Both models incorrectly classified the weekly digital asset holdings reporting obligation as a condition that terminates at the end of the pilot's initial three-month phase, although the regulator's text is unambiguous that it persists.

Separately, both models reconstructed details of a subsequent staff letter amendment — including a specific reissuance date and the nature of the definitional change — from inference rather than the regulator's record, with Claude Opus 4.7 additionally fabricating the reissuance date entirely. The pattern signals a structural problem: where a recent regulatory instrument introduces a phased obligation structure with partial sunset provisions, models appear to over-generalise the sunset to the entire condition set rather than tracking the specific carve-outs the regulator specified — a failure mode that compounds when the instrument post-dates the model's training horizon.

When this affects AI Labs

The CFTC's December 2025 digital asset collateral framework is being actively operationalised by futures commission merchants, digital asset custodians, payment stablecoin issuers, and fintech compliance teams right now. These are precisely the users who route high-stakes regulatory questions to frontier models expecting authoritative answers. An FCM asking whether its weekly reporting obligation persists after the pilot's initial phase, or a stablecoin issuer asking which staff letter is operative for margin acceptance and what the eligibility hook is for its charter type, is not doing exploratory research — they are building compliance procedures.

A confident wrong answer from the model is a compliance procedure built on a false premise.

The downstream exposure for a lab is concrete on two vectors. First, regulated-entity customers who act on confidently wrong outputs about a live CFTC no-action letter face regulatory consequences, and whether the model contributed to those consequences is a question regulators, plaintiffs' counsel, and journalists will ask. Second, the failure modes documented here (condition-sunset misclassification, fabricated amendment provenance) are exactly the patterns that will be surfaced in adversarial red-team exercises as labs seek deployment in financial-services enterprise contexts. Discovering them internally, before deployment, is structurally cheaper than discovering them in production.

The CFTC's December 2025 package is a structurally challenging instrument for models. It arrived as a suite of interlocking staff letters and FAQs issued in rapid succession across late 2025 and early 2026, each amending or superseding earlier elements. The operative version of any given provision depends on tracking which letter is current, what changed between iterations, and which conditions from the initial phase carry forward versus lapse.

The instrument also makes heavy use of defined terms whose scope is precisely delineated — "payment stablecoin," "permitted issuer," "asset type restriction," "incident-reporting condition" — in ways that diverge from how those terms appear in third-party commentary. For a model whose retrieval stack weights third-party legal summaries as comparable authority to the regulator's primary text, and whose training corpus underrepresents the most recent amendment cycle, the failure conditions are structurally embedded.

Aggregate impact

Model	Configuration	Failure count	Dominant error pattern
Claude Opus 4.7	Web search enabled	2	Fabricated amendment date + condition-sunset over-generalisation
Claude Sonnet 4.6	Web search enabled	3	Condition-sunset over-generalisation + definitional provenance elision

Claude Opus 4.7 with web search produced two distinct failure shapes. On the question of which staff letter is operative for payment stablecoin margin acceptance, the model fabricated a specific reissuance date — "February 6, 2026" — for which the regulator's record provides no basis, while simultaneously eliding the specific regulatory instrument (OCC Interpretive Letter 1183) that provides the eligibility hook for national trust bank issuers.

On the question of which obligations persist after the pilot's initial phase, the model incorrectly classified the weekly digital asset holdings reporting requirement as one of the conditions that sunset — directly contradicting the regulator's text, which lists this obligation as continuing.

Claude Sonnet 4.6 with web search failed on the same two question areas as Opus 4.7, and added a third. On the payment stablecoin amendment, the model omitted the specific regulatory instrument anchoring national trust bank eligibility, inferring the definitional scope from the amendment's structure without surfacing the hook the regulator explicitly named.

On the condition-sunset question, the model went further than Opus 4.7: it cited "March 2026 CFTC Staff FAQs", a fabricated source, as authority for the weekly reporting requirement ceasing, and presented the termination as a precise procedural rule keyed to the third calendar month following notice filing.

On the haircut-rate question, the model described the 20 percent haircut floor as applying to digital assets not accepted by any registered clearing organisation, omitting the regulator's governing rule for the multi-DCO case: that the FCM must apply the highest haircut among all registered DCOs that accept the asset.
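The multi-DCO rule the model omitted is a worst-case selection, which can be made concrete in a few lines. The following is a minimal Python sketch based only on how this section summarises the guidance, not on the regulator's text itself. The function name, the input shape, and the treatment of the 20 percent figure as the value returned when no registered DCO accepts the asset are assumptions for illustration.

```python
from typing import Mapping

# Assumption: the 20 percent figure quoted in this section, as a fraction.
NO_DCO_HAIRCUT_FLOOR = 0.20


def applicable_haircut(dco_haircuts: Mapping[str, float]) -> float:
    """Return the haircut under the rule as summarised in this section.

    dco_haircuts maps each registered DCO that accepts the asset to the
    haircut that DCO applies (hypothetical input shape).
    """
    if not dco_haircuts:
        # No registered DCO accepts the asset: use the 20 percent floor.
        return NO_DCO_HAIRCUT_FLOOR
    # Multi-DCO case: apply the highest haircut among accepting DCOs.
    return max(dco_haircuts.values())
```

With illustrative (not real) figures, `applicable_haircut({"DCO_A": 0.25, "DCO_B": 0.35})` returns `0.35`, while `applicable_haircut({})` returns `0.20`. An answer that mentions only the second branch reproduces the omission documented above.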

Failures cluster at three structural junctures: obligation-lifecycle provisions in phased frameworks (where partial sunset language is present and models over-generalise it), cross-document eligibility hooks (where the operative rule turns on a specific secondary instrument the model fails to surface), and numeric governance rules in multi-entity contexts (where a tie-breaking or worst-case selection rule is present and the model defaults to the simpler threshold).

The convergence across both configurations on the condition-sunset misclassification — with web search active in both cases — indicates that the failure is not a retrieval gap the tool resolves: the model is constructing the answer from a structural inference about phased frameworks rather than from the regulator's specific text.

Findings

5 findings in this case study.

Finding on 'Q005 Probe' for Claude Opus 4.7 with web search ON
Finding on 'Q006 Probe' for Claude Opus 4.7 with web search ON
Finding on 'Q005 Probe' for Claude Sonnet 4.6 with web search ON
Finding on 'Q006 Probe' for Claude Sonnet 4.6 with web search ON
Finding on 'Q007 Probe' for Claude Sonnet 4.6 with web search ON

What your team should do

Implications for your training data

Two failure patterns here point to specific training-data gaps. The condition-sunset misclassification — present in both configurations — indicates the corpus for this instrument either lacks the regulator's primary text entirely or contains it only in third-party-paraphrased form. Third-party summaries of phased regulatory frameworks consistently flatten the obligation lifecycle: they describe what starts and what ends at the headline level, omitting the carve-outs the regulator specified for individual obligations. If the model's picture of this instrument came primarily from law-firm client alerts and news summaries rather than the staff letters themselves, the over-generalised sunset answer is exactly what you'd expect.

The fix at the training-data level is not just corpus refresh — it's privileging primary-source text over derivative commentary for instruments where the operative detail is in the specific enumeration, not the structural summary.

The fabricated source ("March 2026 CFTC Staff FAQs") and the fabricated reissuance date ("February 6, 2026") both point to a corpus gap on the amendment cycle for this regulation. The pre-amendment version of the instrument appears to be adequately represented, but the amendment series (Staff Letter 26-05 and associated FAQ updates) appears too sparsely for the model to retrieve the correct details. When the corpus is sparse on a recent amendment, models fill the gap with plausible reconstructions: a date near the original instrument's date, a FAQ document consistent with the regulator's publication pattern.

Structured amendment tracking — paired primary-source texts for each iteration of an instrument, with explicit version metadata — would reduce this failure mode at the corpus level.
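One way to picture structured amendment tracking is as a small versioned record per iteration of an instrument. The sketch below is a hypothetical schema, not a description of any lab's pipeline: the field names, the `supersedes` pointer, and the `operative_version` helper are all assumptions, and any identifiers or dates used with it should come from the regulator's record rather than being inferred.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class InstrumentVersion:
    instrument_id: str         # hypothetical stable key for the instrument
    letter_id: str             # identifier of the staff letter for this iteration
    issued: date               # issuance date, taken from the regulator's record
    supersedes: Optional[str]  # letter_id of the iteration this one replaces
    primary_text: str          # the regulator's text, not a third-party summary


def operative_version(
    versions: Iterable[InstrumentVersion], instrument_id: str, as_of: date
) -> Optional[InstrumentVersion]:
    """Return the latest version issued on or before as_of that has not
    itself been superseded by another version issued on or before as_of."""
    candidates = [
        v for v in versions
        if v.instrument_id == instrument_id and v.issued <= as_of
    ]
    superseded = {v.supersedes for v in candidates if v.supersedes}
    live = [v for v in candidates if v.letter_id not in superseded]
    return max(live, key=lambda v: v.issued, default=None)
```

Pairing each record's `primary_text` with explicit `issued` and `supersedes` metadata lets a pipeline answer "which letter is operative on this date" by lookup, instead of leaving the model to reconstruct the amendment sequence.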

Implications for your post-training logic

The retrieval ranker is the proximate failure point for findings where web search was active and third-party sources were cited as authority. In both configurations, the cited sources were law-firm commentary and financial press summaries — not the regulator's primary text. For queries explicitly scoped to a named CFTC staff letter or FAQ, the ranker should weight primary regulatory source material above third-party paraphrase.

The current behaviour — treating a Lexology summary or a financial news article as comparably authoritative to the regulator's published letter — is a calibration failure in the retrieval layer, not a training-data gap, and it can be addressed without retraining.
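A re-ranking step of the kind described above could look roughly like the following. This is a sketch under stated assumptions: the `Retrieved` type, the `PRIMARY_DOMAINS` allow-list, and the `PRIMARY_BOOST` multiplier are hypothetical, and a production ranker would need a tuned weighting and a maintained list of regulator domains.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import urlparse

# Assumption: domains treated as primary sources for CFTC-scoped queries.
PRIMARY_DOMAINS = {"cftc.gov"}
# Hypothetical multiplier; it would need tuning against an eval set.
PRIMARY_BOOST = 2.0


@dataclass
class Retrieved:
    url: str
    relevance: float  # score from the underlying retriever


def is_primary(url: str) -> bool:
    """True if the URL's host is a primary domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in PRIMARY_DOMAINS)


def rerank(results: Iterable[Retrieved]) -> List[Retrieved]:
    """Sort results by relevance, boosting primary regulatory sources."""
    def score(r: Retrieved) -> float:
        return r.relevance * (PRIMARY_BOOST if is_primary(r.url) else 1.0)
    return sorted(results, key=score, reverse=True)
```

With this boost, a regulator page scored 0.6 by the retriever outranks a commentary page scored 0.9. Matching on the parsed hostname rather than a substring of the URL avoids treating look-alike domains as primary.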

The fabricated-source finding (Sonnet 4.6 on the condition-sunset question) points to a calibration gap in the model's uncertainty signalling for recent regulatory documents. The model committed to a specific source citation — "March 2026 CFTC Staff FAQs" — with no apparent retrieval basis, rather than flagging that it could not locate the governing FAQ and defaulting to a more conservative answer.

A self-check pass on named-source citations — where the model verifies that any document it names by title and date is present in its retrieved context before committing to the citation — would catch this failure class before it reaches the user.
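The self-check described above can be sketched as a simple containment test. The version below assumes the named citations have already been extracted from the draft answer (the extraction step is out of scope here), and it uses case-insensitive substring matching, which is a deliberately crude stand-in for whatever matching a real system would use. The function names are hypothetical.

```python
import re
from typing import Iterable, List


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so matching ignores layout."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_citations(
    cited_titles: Iterable[str], retrieved_docs: Iterable[str]
) -> List[str]:
    """Return cited titles that appear in none of the retrieved documents."""
    haystack = [normalise(doc) for doc in retrieved_docs]
    return [
        title for title in cited_titles
        if not any(normalise(title) in doc for doc in haystack)
    ]
```

Applied to the Sonnet 4.6 finding, a draft citing "March 2026 CFTC Staff FAQs" against a retrieved context that never names such a document would return that title as unsupported, giving the model a signal to withdraw the citation or flag uncertainty before answering.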

Specific eval / red-team probes RegLeg suggests

Phased obligation lifecycle probes: For any regulatory instrument with a multi-phase structure and partial sunset provisions, probe whether the model correctly identifies which obligations persist versus lapse — particularly obligations that are cadence-based (weekly, monthly) where the model may infer the cadence was a temporary condition.
Cross-document eligibility hook probes: For rules whose scope turns on a secondary instrument (an interpretive letter, a grandfather provision, a cross-referenced definition), probe whether the model surfaces the specific secondary instrument or describes only the structural effect of the cross-reference.
Multi-entity worst-case selection probes: For numeric thresholds with a tie-breaking rule in multi-party contexts (highest haircut among multiple DCOs, most restrictive capital charge among multiple regulators), probe whether the model applies the tie-breaking rule or defaults to the base threshold.
Recent-amendment provenance probes: For instruments amended within 6-12 months of the model's training horizon, probe whether the model can accurately identify the amendment date, the superseded letter, and the specific definitional change — flagging where it reconstructs rather than retrieves.
Third-party-vs-primary source divergence probes: For questions where the third-party commentary record diverges from the regulator's primary text, probe whether the model follows the regulator or the commentary — particularly where the commentary paraphrases away a qualifier or a tie-breaking rule.
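The probe types listed above share a common shape: a question, facts the answer must surface, and claims it must not make. A minimal grading sketch follows, assuming a hypothetical `Probe` record and plain substring checks. Real grading would need more robust matching, and the expected strings must come from the regulator's primary text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Probe:
    probe_id: str
    question: str
    must_include: List[str] = field(default_factory=list)      # facts the answer must surface
    must_not_include: List[str] = field(default_factory=list)  # known fabrications


def grade(probe: Probe, answer: str) -> Dict[str, object]:
    """Check an answer against a probe using case-insensitive substrings."""
    text = answer.lower()
    missing = [s for s in probe.must_include if s.lower() not in text]
    forbidden = [s for s in probe.must_not_include if s.lower() in text]
    return {
        "probe_id": probe.probe_id,
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }
```

For example, a cross-document eligibility hook probe built from the findings in this case study could set `must_include=["OCC Interpretive Letter 1183"]` and `must_not_include=["February 6, 2026"]`. An answer that omits the interpretive letter fails with that string listed under `missing`, which keeps hook elision distinguishable from outright fabrication in the results.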

How RLB can help

We document failure modes across model versions and configurations on regulatory content — specifically the nuanced, high-stakes failure shapes that internal evals miss because they require external regulatory text as the ground truth.

The patterns we've catalogued across our work include: phased obligation lifecycle misclassification (where partial sunset language causes models to over-generalise which conditions lapse); cross-document eligibility hook elision (where the operative rule turns on a secondary instrument the model fails to surface); multi-entity worst-case selection failures (where a tie-breaking or worst-case rule is present and the model defaults to the base threshold); fabricated amendment provenance (where the model constructs a plausible but false citation to fill a gap in its amendment-cycle coverage); and retrieval-ranker authority miscalibration (where third-party commentary is weighted as comparable to primary regulatory text for named-regulator queries).

These are not the failure modes your internal benchmarks are designed to find.

We can support your model improvement work in several concrete ways. We generate targeted correction pairs per failure mode — the model's wrong answer alongside the regulator's authoritative text, formatted for direct ingestion into your training-data pipeline. We run embedded eval partnerships against a defined regulator portfolio, producing quarterly comparative reports across model versions with regression monitoring on previously-documented failure surfaces. We can run pre-release evaluation cycles for capability launches touching regulated domains — financial services, payments infrastructure, digital assets, cross-border regulatory content — and surface failure shapes before they reach enterprise customers.

For new regulations being added to your deployment footprint, we provide red-team consultation on the specific failure surfaces that instrument's structure is likely to generate.

To scope a technical partnership for refining your models against these failure modes, reach out at reglegbrief.com.
