# Bakeoff Report — Strategy 2 (Multi-model writer comparison)

Each of the three Pete Nicholas articles was regenerated from scratch with two alternate writer models (`google/gemini-2.5-pro`, `moonshotai/kimi-k2-0905`) using the same pipeline entry point, `simple_writer.pipeline.run_full_pipeline`. All three versions (Opus original + two alternates) were re-scored with `audience_pass(max_rounds=1)` for a fair head-to-head. Where an alternate ranked first (ties broken as described under Notes & caveats), it was sent to an LLM judge against the Opus baseline.

## Summary

| Article | Opus | Gemini 2.5 Pro | Kimi K2 0905 | Winner | Δ vs Opus | Judge verdict |
|---|---:|---:|---:|---|---:|---|
| nando-de-freitas-unveils-interventional-… | 88 | 88 | 91 | kimi-k2-0905 | +3 | after |
| speaking-in-tongues-what-1-corinthians-1… | 91 | 91 | 88 | gemini-2.5-pro | +0 | before |
| freemasonry-why-a-secret-society-offerin… | 91 | 91 | 88 | claude-opus-4.7 | +0 | tied |

## nando-de-freitas-unveils-interventional-sft-for-agent-models-what-it-means-when

**Topic:** Nando de Freitas unveils interventional SFT for agent models — what it means when we start surgically shaping the minds of the things we are about to live with

**Target words:** 2400

### Candidate scores

| Model | Audience | Stylometric δ | Slop | Words |
|---|---:|---:|---:|---:|
| kimi-k2-0905 | 91 | 0.0006 | 0.819 | 2441 |
| claude-opus-4.7 | 88 | 0.0035 | 0.0 | 2677 |
| gemini-2.5-pro | 88 | 0.0021 | 1.922 | 3121 |

**Winner:** `moonshotai/kimi-k2-0905` (score delta vs Opus: +3)

### LLM judge (winner vs Opus original)

| | voice | prose | argument |
|---|---:|---:|---:|
| before (opus) | 72 | 74 | 78 |
| after (winner) | 85 | 88 | 82 |

**Verdict:** `after`

> The 'after' version sustains a more convincing British pastor-intellectual register throughout, grounding abstractions in specific London geography, named institutions, and embodied anecdote (the Royal London surgeon, the housing estate, Rosa's food bank) in a way that feels earned rather than illustrative. Its prose is more rhythmically varied and the sentences carry more weight individually, whereas the 'before' reads competently but somewhat flatly, relying on structural signposting where the 'after' trusts texture. The 'before' argument is marginally more orderly but the 'after' achieves greater coherence by threading the ordo amoris and formation themes through concrete contemporary cases rather than leaving them as theological commentary appended to technical description.

**Assessment:** kimi-k2-0905 won audience by +3 and the LLM judge agreed it beat Opus.


## speaking-in-tongues-what-1-corinthians-12-and-14-actually-argue-and-whether-the

**Topic:** Speaking in tongues — what 1 Corinthians 12 and 14 actually argue, and whether the gift is for today

**Target words:** 2500

### Candidate scores

| Model | Audience | Stylometric δ | Slop | Words |
|---|---:|---:|---:|---:|
| gemini-2.5-pro | 91 | 0.001 | 1.041 | 2881 |
| claude-opus-4.7 | 91 | 0.0001 | 1.709 | 2926 |
| kimi-k2-0905 | 88 | 0.0034 | 0.742 | 2697 |

**Winner:** `google/gemini-2.5-pro` (score delta vs Opus: +0)

### LLM judge (winner vs Opus original)

| | voice | prose | argument |
|---|---:|---:|---:|
| before (opus) | 88 | 86 | 84 |
| after (winner) | 72 | 74 | 78 |

**Verdict:** `before`

> The BEFORE piece sustains a consistently dry, self-aware British pastor-intellectual register throughout, with well-controlled irony and syntactic confidence that the AFTER piece abandons in favour of a more generic evangelical-magazine tone. BEFORE's prose is tighter and more distinctive, deploying aphorism and qualification with greater economy, whereas AFTER leans on explanatory scaffolding and transitional signposting that flatten the voice. Both arguments are coherent, but BEFORE's structural logic is more organically embedded in the prose rather than announced by subheadings that do the argument's work for it.

**Assessment:** gemini-2.5-pro tied Opus on audience (91 vs 91, Δ +0) and ranked first only on the slop tiebreak (1.041 vs 1.709). The LLM judge preferred the Opus original, so the audience ranking here looks like a metric artefact, not a real upgrade.


## freemasonry-why-a-secret-society-offering-brotherhood-is-the-wrong-answer-to-a-r

**Topic:** Freemasonry — why a secret society offering brotherhood is the wrong answer to a real modern loneliness

**Target words:** 2400

### Candidate scores

| Model | Audience | Stylometric δ | Slop | Words |
|---|---:|---:|---:|---:|
| claude-opus-4.7 | 91 | 0.004 | 1.373 | 2914 |
| gemini-2.5-pro | 91 | 0.0046 | 3.218 | 2486 |
| kimi-k2-0905 | 88 | 0.0134 | 0.0 | 2544 |

**Winner:** `anthropic/claude-opus-4.7` (score delta vs Opus: +0)

### LLM judge (winner vs Opus original)

**Verdict:** `tied`

> No judge comparison was run: the Opus baseline ranked first after rescoring (tied with Gemini at 91 on audience, ahead on the slop tiebreak), so no alternate displaced it.

**Assessment:** Opus baseline held; no alternate model beat it on the audience score.


## Overall takeaway

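- Opus ranked first on **1/3** articles (Freemasonry) and matched the top audience score on a second (speaking in tongues, 91 vs 91).
- An alternate model ranked first on **2/3** articles: once by beating Opus on audience score (Kimi, +3) and once on the slop tiebreak after a tie (Gemini, +0).
- The LLM judge preferred the alternate (`after`) on **1/3** articles, that is, on one of the two articles where a judge comparison was run.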

The audience-pass evaluator (Sonnet) placed all nine candidates at either 88 or 91, a band too narrow to separate them. The hostile-judge pass exposes the real differences: only the Kimi version of the AI/tech piece (article 1) is a genuine voice win. On the two theological/cultural pieces, Gemini either buys lower slop at the cost of voice (article 2: slop 1.041 vs 1.709 for Opus, but a generic evangelical-magazine tone instead of British pastoral) or ties Opus on audience with roughly 2.3x the slop (article 3: 3.218 vs 1.373). Conclusion for this strategy: a model bakeoff per article can find real wins, but a flat "always switch to X" rule would risk regressing voice, since the judge confirmed only one of the two alternate wins and no alternate ranked first on article 3. The hostile-judge pass should gate winner selection, not the audience score alone.
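The gating rule proposed in the conclusion could be sketched as follows. This is a minimal illustration, not the code in `run.py`: the `Bakeoff` record, the `select_final` function, and the shortened article keys are hypothetical names introduced here, and the verdict strings mirror the ones used in this report.

```python
from dataclasses import dataclass

OPUS = "claude-opus-4.7"


@dataclass(frozen=True)
class Bakeoff:
    """One article's bakeoff outcome (hypothetical record, not from run.py)."""
    audience_winner: str  # model ranked first on audience score plus tiebreaks
    judge_verdict: str    # "after", "before", or "tied", as in this report


def select_final(result: Bakeoff, baseline: str = OPUS) -> str:
    """Keep the baseline unless the judge explicitly prefers the alternate."""
    if result.audience_winner == baseline:
        return baseline
    # Only an explicit "after" verdict lets an alternate displace the baseline.
    if result.judge_verdict == "after":
        return result.audience_winner
    return baseline


# Outcomes copied from the summary table; keys are abbreviated for illustration.
results = {
    "nando-de-freitas": Bakeoff("kimi-k2-0905", "after"),
    "speaking-in-tongues": Bakeoff("gemini-2.5-pro", "before"),
    "freemasonry": Bakeoff(OPUS, "tied"),
}

final = {slug: select_final(r) for slug, r in results.items()}
```

Under this rule the Kimi version is kept for article 1, while articles 2 and 3 stay on the Opus original, which matches the judge-informed reading of the results.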

## Notes & caveats

- **Article 3 baseline rescored.** The first audience-pass call for the Opus Freemasonry article hit a transient JSON-parse failure: Sonnet returned a score of 91, but the payload was truncated and a raw "0" was recorded. After a clean rescore, the "0" was patched to 91 in `scores/scores.json`; the run log captures the original event.
- **Scoring fairness.** Every candidate is scored with `audience_pass(seed, body, topic, max_rounds=1)`: the first eval only, with no revision loop, so the comparison is like-for-like. The Opus articles still carry their original metadata header (with the original session's audience score), but the tables above reflect the head-to-head rescore.
- **Tiebreaks** use `slop_rate`, then stylometric distance to the seed profile, in that order. Lower slop ranks first: on articles 2 and 3, the candidate with the lower slop rate took the 91 vs 91 tie.
- **The LLM judge** (Sonnet 4.6) was given the full text of both versions and asked to score voice / prose / argument. When the judge's response omitted the `winner` field, we derived it by summing the three sub-scores (see `derive_winner` in `run.py`).
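
The tiebreak order and the fallback verdict described in the notes above can be illustrated with a short sketch. It is not the actual `run.py` code: the `Candidate` type and `rank` function are hypothetical, this `derive_winner` is a guess at the behaviour the note describes (the real function's signature is not shown in this report), and returning `"tied"` on equal totals is an assumption. The numbers are the article 2 scores from the tables above.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One scored version of an article (hypothetical type)."""
    model: str
    audience: int
    stylometric: float  # distance to the seed profile
    slop: float


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Highest audience first; ties go to lower slop, then lower stylometric distance."""
    return sorted(candidates, key=lambda c: (-c.audience, c.slop, c.stylometric))


def derive_winner(before: dict[str, int], after: dict[str, int]) -> str:
    """Fallback verdict when the judge omits `winner`: compare summed sub-scores."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_after > total_before:
        return "after"
    if total_before > total_after:
        return "before"
    return "tied"  # assumption: equal totals are reported as a tie


# Article 2 (speaking in tongues), values copied from the tables above.
tongues = [
    Candidate("claude-opus-4.7", 91, 0.0001, 1.709),
    Candidate("kimi-k2-0905", 88, 0.0034, 0.742),
    Candidate("gemini-2.5-pro", 91, 0.001, 1.041),
]
ranked = rank(tongues)

verdict = derive_winner(
    before={"voice": 88, "prose": 86, "argument": 84},  # Opus, total 258
    after={"voice": 72, "prose": 74, "argument": 78},   # Gemini, total 224
)
```

Here `ranked[0]` is the Gemini version, which takes the 91 vs 91 tie on lower slop, and `verdict` is `"before"`, matching the judge outcome reported for this article.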
