Observation 01
The Correction Came From Outside
What a geography error taught me about AI verification
I was running safety research on a small language model, extracting feature attribution graphs to study how it retrieves factual knowledge. Capitals of countries. Simple prompts: "The capital of Germany is," "The capital of Australia is."
I verified through three sequential Claude perspectives - each with a different heuristic for surfacing blind spots, each building on the previous. I then sent the same data to GPT-5.4 for independent review, knowing the three Claude perspectives, strong as they were together, shared a training distribution.
The first thing it flagged: Myanmar's capital is Naypyidaw, not Yangon. Tanzania's capital is Dodoma, not Dar es Salaam. Two of my nine "correct" controls were wrong. Three Claude perspectives had examined this table. None flagged it.
What surprised me was not that the external review found something. It was how much it found.
The Pattern: Three Reviews, Three Misses
I ran three separate GPT-5.4 reviews across a single research day. Each time, it found problems that the Claude perspectives had missed.
Review 1 - the red team
Beyond the geography error, GPT-5.4 raised the concern that my primary metric might re-encode the output gap. A regression test confirmed it: R-squared = 0.88. It flagged that my feature cap at 4096 created differential pruning between conditions. It noted that my causal language was unsupported - all my evidence was correlational. Five findings total. Three Claude perspectives had reviewed the same data across prior sessions. Zero of the five were caught.
Review 2 - positioning
My thesis was too broad. Five gaps in my literature review, referencing six papers I had missed. A key claim - that my finding was "distinct from trained misalignment" - was not earned by the evidence. Claude sessions had helped me build this framing over multiple iterations. The framing felt increasingly solid with each iteration. It was not.
Review 3 - variation analysis
My headline framing, "factual subspaces," overclaimed what the data showed. This review raised an alternative explanation - a frequency confound - that a subsequent test confirmed. That reframing saved the project. It forced me to untangle a boring frequency artifact from the actual semantic clustering - something I never would have noticed if I clung to my first idea.
Why Same-Family Consensus Fails
Three Claude perspectives agreeing on a finding is one model family's view confirmed three times. Not three independent confirmations.
This isn't about one model being smarter. Claude is still my daily driver because it holds nuance over long threads better than anything else. But three instances from the same model family share training distribution, capability profile, and blind spots. Agreement between them is evidence of consistency, not correctness.
GPT-5.4 sees the data from a different training distribution. It caught different failure modes - not because it is better, but because it is different.
The analogy is peer review. A paper reviewed by three researchers from the same lab, trained by the same advisor, using the same methods, will receive thorough feedback within that paradigm. But the reviewer from a different tradition catches what the in-group cannot see. Not because they are smarter. Because they are outside.
The Methodology
I sent the data to GPT-5.4 because I knew three Claude perspectives shared a training distribution. What I did not expect was how consistently the external reviewer would catch what the internal perspectives could not. The pattern became a methodology only in retrospect.
What I would formalise from the experience:
- Use a different model family for adversarial review. Not a different instance of the same model. A different architecture, different training data, different capability profile.
- Start with raw data, then review the narrative separately. When I sent GPT-5.4 my data tables, it found the geography error and the collinearity confound. When I sent it my synthesised findings, it found the overclaiming and the literature gaps. Both passes matter - but let the reviewer form their own interpretation before they see yours.
- Treat agreement as a signal to escalate, not confirm. Three Claude perspectives failed to flag my mislabelled controls. Agreement should have been the trigger to send the table to an external reviewer, not the reason to skip that step.
- Where all agents in the same model family agree, apply extra scrutiny. Consensus may reflect shared training biases, not independent confirmation. This single prompt heuristic did more work than any amount of internal review.
The Research It Saved
The original thesis - that factual tokens occupy distinct feature subspaces - died three times that day. Each death was triggered by an external challenge and confirmed by an empirical test.
What survived is more honest and more interesting than what I started with. A two-layer structure in SAE feature geometry: a trivial frequency separation between rare and common tokens, and within that, a non-trivial semantic clustering where proper nouns group by domain - Australian cities together, Pakistani cities together, German cities together, Japanese cities together - with cosine similarities of 0.90 to 0.97, independent of whether the model outputs the correct answer.
I would not have found this without killing the thesis that preceded it. The thesis would not have died without an external reviewer willing to say: your controls are wrong, your metric is circular, and your headline overclaims.
Takeaway
If you are using AI to verify AI-generated research - and increasingly, we all are - same-family review is not sufficient. It feels thorough. It produces detailed feedback. The models agree with each other and the agreement feels like convergence.
It is not convergence. It is an echo.
The correction comes from outside.