Both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search failed on the same question type — a size-tiered concentration limit embedded in the 2024 CFTC Amendments to Regulation 1.25 — producing flat uniform-threshold answers where the rule specifies a two-condition tier keyed to fund asset size and management company AUM. Across five confirmed failures on this regulation, the dominant pattern is the model confidently reconstructing rule parameters from a prior-version or generic regulatory schema rather than retrieving the amended text's specific numeric structure. A secondary failure involves procedural process: Claude Opus 4.7 with web search fabricated an open Commission meeting with a named presiding chair, when the record shows the rule was approved by seriatim vote. These are not retrieval misses on obscure content — they are over-confident confabulations on the most decision-critical parameters a compliance team would query a model to confirm.
This is the consolidated view of findings. Click 'see details →' on any item for the full details for each finding.
This finding implicates the training corpus's representation of the final rule text relative to pre-final secondary commentary. The model asserted that the final rule rejected size-based tiering — a claim that reflects early rulemaking commentary, not the published final rule. The retrieval pipeline failed to surface the primary rule text at sufficient specificity to override the trained-schema prior, suggesting the Federal Register notice is underweighted relative to secondary sources in retrieval ranking for this query type.
see details →This finding implicates cross-regulator concept conflation in the training data. The model correctly retrieved the 24-month WAM ceiling but appended SEC Rule 2a-7's computation methodology as an elaboration — a cross-contamination that occurs when two regulators' frameworks share vocabulary and are trained in close proximity without independence annotations. The error is in the generation layer: the model added a plausible-sounding methodology reference not present in the retrieved CFTC rule text.
see details →This finding implicates calibration on compliance-deadline specificity. The rule publishes a fixed calendar date (March 31 2025) for the SIDR update deadline; the model substituted a relative range drawn from pre-final-rule commentary. The retrieval pipeline either did not surface the compliance calendar from the final rule text, or the model discounted it in favour of a higher-frequency secondary framing. Post-training calibration should penalise relative-range answers on compliance deadline questions where a fixed date is retrievable from primary source.
see details →This finding implicates confidence calibration on procedural claims where retrieval signal is thin. The model fabricated a specific open Commission meeting with a named presiding chair on a precise date — all plausible, none accurate. When web search does not surface explicit procedural-record documentation (meeting minutes, Commission vote transcript, Federal Register preamble on the approval mechanism), the model should fall back to uncertainty rather than constructing an institutional narrative from learned templates. This is a calibration gap in the generation head, not a retrieval gap.
see details →The convergence with Claude Opus 4.7 on the same concentration-tier question makes this finding structurally significant: two different model architectures, same configuration, same wrong schema, same explicit denial of the tier's existence. The internal contradiction in the response — denying a size-based tier while using tier-like vocabulary — suggests the model assembled its answer from a regulatory schema template rather than retrieved primary text. This points to a retrieval-ranking issue shared across model families: the Federal Register final rule text is not being surfaced at sufficient authority weight to override the trained prior on this query.
see details →