Both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search failed on the same question type — a size-tiered concentration limit embedded in the 2024 CFTC Amendments to Regulation 1.25 — producing flat uniform-threshold answers where the rule specifies a two-condition tier keyed to fund asset size and management company AUM.
Across five confirmed failures on this regulation (four from Claude Opus 4.7 with web search, one from Claude Sonnet 4.6 with web search), the dominant pattern is the model confidently reconstructing rule parameters from training, producing a plausible-but-wrong general schema rather than retrieving the amended text's specific numeric structure. A secondary failure concerns the approval procedure: Claude Opus 4.7 with web search placed the rule's approval in an open public meeting with a named presiding chair, when the record shows it was approved by seriatim vote with no public session.
The structural significance is that these are not retrieval misses on obscure content — they are over-confident confabulations on the most decision-critical parameters of a compliance rule: concentration caps, portfolio maturity limits, and compliance deadlines — exactly the content a regulated firm's treasury or compliance team would query a model to confirm.
Futures commission merchants and derivatives clearing organizations subject to CFTC Regulation 1.25 manage billions in customer segregated funds. The 2024 amendments restructure which instruments are permitted, impose new concentration limits with fund-size tiers, set a portfolio-level maturity ceiling, and establish staggered compliance deadlines. Treasury, compliance, legal, and fintech teams at FCMs are exactly the user population that treats frontier models as a fast path to regulatory clarity — asking directly about permissible investment limits, maturity constraints, and deadline calendars before updating investment policies.
A model that reconstructs plausible-but-wrong parameters with high apparent confidence is not a neutral retrieval miss; it is an active misdirection event at the precise moment a practitioner is making a compliance decision.
The downstream harm vector is concrete. An FCM treasurer who queries a model to confirm concentration limits, receives a confident flat-percentage answer with no tier, and updates an investment policy on that basis risks violating a specific provision of the rule. The lab that deployed the model sits inside that causal chain. At scale across the FCM and DCO community (a few hundred regulated entities), systematic confabulation on a rule with this profile creates both reputational exposure for the lab and a pattern of downstream harm that red-team coverage should have flagged before deployment.
The structural features of this regulation make it a high-probability failure surface for any model relying on trained knowledge rather than live retrieval of primary regulatory text. The rule is a 2024 amendment to a long-standing 2000-era framework, published as a Federal Register notice layered on top of the existing CFTC Part 1 codification. Its key numeric provisions (the $1B fund AUM threshold, the $25B management-company threshold, the 24-month WAM ceiling, and the March 31, 2025 SIDR deadline) appear far less often in the high-volume secondary sources that models train on than the authority of the primary rule text warrants.
The compliance deadlines are staggered in a way that requires distinguishing between two separate regimes in the same rule — a pattern that tends to collapse to a single generalised answer when a model is working from schema rather than retrieved text.
| Model | Configuration | Failure count | Dominant error pattern |
|---|---|---|---|
| Claude Opus 4.7 | Web search enabled | 4 | Trained-schema substitution for rule-specific numeric structure |
| Claude Sonnet 4.6 | Web search enabled | 1 | Trained-schema substitution on concentration tier |
Claude Opus 4.7 with web search produced four confirmed failures across the regulation's most operationally consequential provisions. Three share the same failure shape: the model reconstructed a plausible rule schema from training and delivered it as the rule's text, overwriting the actual numeric structure with a generalised version. On the concentration limit question, the model asserted that thresholds apply uniformly regardless of fund or management company size — directly contradicting the rule's explicit two-condition tier.
On the portfolio maturity limit, the model reproduced the 24-month ceiling correctly but added a reference to SEC Rule 2a-7's WAM methodology as the computation standard, a cross-regulatory inference unsupported by the amended regulation. On the compliance calendar, the model gave the SIDR update deadline as a vague range ("roughly six months to a year after the effective date") rather than the rule's stated date of March 31, 2025.
The fourth failure — on the vote process — shows a distinct pattern: the model fabricated a procedural event (an open Commission meeting with Chairman Behnam presiding) with specificity and apparent confidence, when the actual approval was seriatim.
Claude Sonnet 4.6 with web search failed on the same concentration limit question as Opus 4.7, and in a structurally identical way: it reported no size-based tier and gave uniform flat-percentage limits, contradicting the rule's fund-AUM and management-company-AUM conditions. The convergence is notable: two different models, the same web-search configuration, the same question, and the same failure shape. The tier is not buried; it is a primary structural feature of the amended rule. Both models appear to have resolved the question against a prior-version schema or a generic regulatory template rather than the amended text.
Failures cluster on two distinct surfaces: numeric structure in recently amended provisions (the tier thresholds, the WAM exclusions, the compliance deadline calendar) and procedural-process specificity (the vote mechanism and its associated meeting format). The joint failure pattern signals that for regulations fitting this profile — a 2024 amendment to a multi-decade legacy framework, with key numeric provisions introduced in the final rule text rather than carried over from prior versions — web search is not providing sufficient signal to override trained-schema responses. The model is retrieving, but not retrieving the right document at sufficient specificity to dislodge a confident prior.
5 findings in this case study.
The dominant failure pattern across both models is schema substitution on recently amended numeric provisions: the model delivers a confident answer shaped by a prior-version or generic regulatory template rather than the amended text's specific numeric structure. For Regulation 1.25, the operative content (the two-condition concentration tier, the maturity calculation exclusions, and the March 31, 2025 SIDR deadline) exists primarily in the 2024 Federal Register final rule notice.
If that notice is underrepresented in the training corpus relative to the volume of secondary commentary (law firm client alerts, trade association summaries, compliance blog posts), the model will weight toward the higher-frequency secondary framing. For a regulation like this one, where the final rule text resolved open questions in a direction that differed from early-stage commentary, any secondary sources written before the final rule was issued will be systematically wrong on those points.
Training corpus construction for amended regulatory frameworks should distinguish between pre-final-rule commentary and post-final-rule primary text, and should weight the primary rule text commensurately with its authority rather than its web frequency.
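One way to picture that distinction is a sampling weight assigned per corpus document. The sketch below is illustrative only: `CorpusDoc`, `SourceKind`, `sampling_weight`, and the three weight constants are hypothetical names and values, and the final rule's publication date is passed in as a parameter rather than asserted here.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class SourceKind(Enum):
    PRIMARY = "primary"      # final rule notice, codified rule text
    SECONDARY = "secondary"  # client alerts, trade summaries, blog posts


@dataclass(frozen=True)
class CorpusDoc:
    doc_id: str
    kind: SourceKind
    published: date


# Illustrative weights only (assumptions); real values would need tuning.
PRIMARY_WEIGHT = 5.0
POST_FINAL_SECONDARY_WEIGHT = 1.0
PRE_FINAL_SECONDARY_WEIGHT = 0.2


def sampling_weight(doc: CorpusDoc, final_rule_published: date) -> float:
    """Weight a document by authority and by timing relative to the final rule."""
    if doc.kind is SourceKind.PRIMARY:
        return PRIMARY_WEIGHT
    # Secondary commentary written before the final rule may describe
    # provisions that the final text later changed, so it is down-weighted.
    if doc.published < final_rule_published:
        return PRE_FINAL_SECONDARY_WEIGHT
    return POST_FINAL_SECONDARY_WEIGHT
```

The point of the design is that weight follows authority and timing, not web frequency; the specific ratios are placeholders.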
The cross-regulatory schema contamination in Finding 2 — the model attributing SEC Rule 2a-7's WAM methodology to the CFTC's maturity limit — indicates a separate training-side issue: concept-level conflation between two different regulators' frameworks that share surface vocabulary. Both rules involve money market instruments, dollar-weighted maturity, and the same acronym (WAM). Where training data pairs these frameworks in close proximity without marking their independence, the model learns to treat 2a-7's computation methodology as a plausible elaboration on any money-market maturity rule.
Structured ingestion of regulatory definitions tables — pairing each regulator's defined terms with explicit non-equivalence annotations where near-homonyms exist across frameworks — would reduce this class of error.
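A minimal sketch of what such an annotation could look like as a data structure follows. The class and method names (`DefinedTerm`, `TermRegistry`, `mark_non_equivalent`, `may_borrow`) are hypothetical, chosen for illustration; only the two rule names and the shared acronym come from the findings above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class DefinedTerm:
    regulator: str
    rule: str
    term: str


@dataclass
class TermRegistry:
    # Unordered pairs of terms that share vocabulary but must not be conflated.
    _non_equivalent: set[frozenset[DefinedTerm]] = field(default_factory=set)

    def mark_non_equivalent(self, a: DefinedTerm, b: DefinedTerm) -> None:
        self._non_equivalent.add(frozenset((a, b)))

    def may_borrow(self, target: DefinedTerm, source: DefinedTerm) -> bool:
        """Return False if the source's methodology must not be attributed to the target."""
        return frozenset((target, source)) not in self._non_equivalent


cftc_wam = DefinedTerm("CFTC", "Regulation 1.25", "WAM")
sec_wam = DefinedTerm("SEC", "Rule 2a-7", "WAM")

registry = TermRegistry()
registry.mark_non_equivalent(cftc_wam, sec_wam)
```

With the annotation in place, `registry.may_borrow(cftc_wam, sec_wam)` returns `False`, which is the signal an ingestion or evaluation pipeline could use to flag answers that import one framework's computation method into the other.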
The procedural confabulation in Finding 5 (fabricated open Commission meeting, named presiding chair) reveals a calibration gap: the model produced high-confidence procedural specificity on a point where it had no reliable retrieval signal. When web search is active and the retrieved content does not contain clear procedural-record documentation (meeting minutes, Commission vote record, Federal Register preamble language on the approval process), the model should have a stronger prior toward expressing uncertainty rather than constructing a plausible procedural narrative from institutional templates.
Post-training calibration for this class of question ("how was this rule approved" on a specific rule and date) should reward explicit refusal or hedged uncertainty when the retrieved signal is thin, and should penalise confident procedural specificity where the source is absent.
On retrieval routing, both models failed on the concentration tier despite web search being active, suggesting the retrieval pipeline ranked secondary summaries above the primary rule text for this query. For regulator-name queries on recent rule amendments, the ranker should apply a source-authority signal that de-weights law-firm client alerts and trade press summaries relative to the Federal Register notice, CFTC.gov press release, and official rule text.
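As a rough illustration of a source-authority signal, the sketch below multiplies an upstream relevance score by a per-domain weight. The weights, the `Hit` type, and the `rerank` function are assumptions made for this example and do not describe any lab's actual ranker; the relevance score is assumed to lie in [0, 1].

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical authority weights; the values are illustrative assumptions.
AUTHORITY_WEIGHTS = {
    "federalregister.gov": 1.0,
    "cftc.gov": 1.0,
    "ecfr.gov": 0.9,
}
DEFAULT_WEIGHT = 0.4  # secondary sources: client alerts, trade press, blogs


@dataclass(frozen=True)
class Hit:
    url: str
    relevance: float  # score from the upstream retriever, assumed in [0, 1]


def authority(url: str) -> float:
    """Look up the authority weight for a URL's host, matching subdomains too."""
    host = (urlparse(url).hostname or "").lower()
    for domain, weight in AUTHORITY_WEIGHTS.items():
        if host == domain or host.endswith("." + domain):
            return weight
    return DEFAULT_WEIGHT


def rerank(hits: list[Hit]) -> list[Hit]:
    """Order hits by relevance scaled by source authority, highest first."""
    return sorted(hits, key=lambda h: h.relevance * authority(h.url), reverse=True)
```

Under these placeholder weights, a primary-source hit with relevance 0.6 outranks a secondary summary with relevance 0.9, which is the behaviour the recommendation asks for on regulator-name queries.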
A secondary check would catch the class of error where schema reasoning overrides retrieval on high-stakes numeric values: whenever the model has committed to a specific numeric threshold or a specific procedural claim, the response is flagged for a verification pass against retrieved primary-source text before it is finalised.
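A verification pass of this kind could start as simply as the sketch below, which extracts numeric claims from a draft answer and flags any that do not appear in the retrieved primary text. The regular expression and the `unsupported_claims` function are assumptions for illustration. Matching is by normalised substring, so it will not reconcile equivalent forms such as "$1B" and "$1 billion", and it can pass a short figure that happens to sit inside a longer one.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Dollar amounts, percentages, hyphenated durations, and calendar dates.
NUMERIC_CLAIM = re.compile(
    r"\$\d+(?:\.\d+)?(?:\s?(?:billion|million|[BMK]\b))?"
    r"|\d+(?:\.\d+)?\s?%"
    r"|\d+-(?:month|day|year)"
    rf"|(?:{MONTHS})\s\d{{1,2}},?\s\d{{4}}",
    re.IGNORECASE,
)


def _norm(text: str) -> str:
    """Lower-case and drop whitespace and commas so formatting differences do not matter."""
    return re.sub(r"[\s,]+", "", text.lower())


def unsupported_claims(draft: str, primary_text: str) -> list[str]:
    """Return numeric claims in the draft that are absent from the primary text."""
    source = _norm(primary_text)
    flagged = []
    for match in NUMERIC_CLAIM.finditer(draft):
        claim = match.group(0).strip()
        if _norm(claim) not in source:
            flagged.append(claim)
    return flagged
```

A non-empty return value would route the response to re-retrieval or to a hedged answer rather than letting the unverified figure through.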
We document nuanced model failures on regulatory content across a portfolio of regulators and rule types — the kind of failures that don't surface in standard capability evals because they require knowing what the rule actually says in its final amended form.
The failure modes we've catalogued on this regulation and adjacent work include: schema substitution on recently amended numeric provisions (where the model delivers a confident prior-version answer); cross-regulator concept conflation on shared vocabulary (WAM methodology, segregation standards, money market instrument definitions); compliance deadline drift from fixed calendar dates to pre-final-rule relative estimates; and procedural confabulation with institutional texture (named chairs, specific meeting formats, plausible dates — all fabricated). These failure shapes are structurally reproducible across regulatory content of this type and are not addressable by general-purpose benchmark improvement.
We can support your team's model improvement work in concrete ways. We generate targeted correction pairs per failure mode, derived from the authoritative regulatory text, formatted for direct ingestion into a training-data pipeline — pairing the wrong answer shape the model produces with the correct answer grounded in the primary rule. We offer embedded eval partnership against a defined regulator portfolio — quarterly comparative reports across model versions, with regression monitoring on previously documented failure modes so you can verify that a fix for one pattern didn't resurface another.
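For orientation, a correction pair of the kind described above might be serialised roughly as follows. The field names and the JSONL layout are hypothetical, shown only to make the shape concrete, and the record contents are placeholders rather than actual rule text.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CorrectionPair:
    failure_mode: str     # e.g. "schema_substitution"
    prompt: str           # the question that triggered the failure
    rejected: str         # the wrong answer shape the model produced
    chosen: str           # the correct answer grounded in the primary rule
    source_citation: str  # pointer to the authoritative text


def to_jsonl(pairs: list[CorrectionPair]) -> str:
    """Serialise pairs one JSON object per line for pipeline ingestion."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


example = CorrectionPair(
    failure_mode="schema_substitution",
    prompt="<question about the concentration limit>",
    rejected="<flat uniform-threshold answer>",
    chosen="<answer quoting the two-condition tier from the primary rule>",
    source_citation="<Federal Register final rule notice citation>",
)
```

Keeping the failure mode as an explicit field is what allows regression monitoring to be grouped by pattern across model versions.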
For capability launches that touch financial services, payments infrastructure, or cross-border regulatory content, we can run pre-release evaluation cycles to flag failure shapes before they reach the regulated firms that are your customers. And for specific regulators being added to your deployment footprint, we provide red-team consultation scoped to that regulator's structural failure surface.
To scope a partnership for refining your models against these failure modes, start a technical conversation with us at reglegbrief.com.