# Improvement experiments — aggregate findings

We ran 5 parallel experiments, each applying a different `voice_pipeline/`
element to the same 3 sample articles produced by the lean `simple_writer`
pipeline. The judge is Claude Sonnet 4.6, prompted as a hostile copy editor
for The Gospel Coalition, scoring each before/after pair on voice, prose, and
argument.

## Sample articles

| Slug | Topic | Baseline audience score |
|---|---|---|
| nando-de-freitas-unveils-interventional-sft… | AI/tech | 88/100 |
| speaking-in-tongues-what-1-corinthians… | theology | 91/100 |
| freemasonry-why-a-secret-society… | cultural-pastoral | 91/100 |

## Strategy results

| # | Strategy | Cost | Wins | Losses | Ties | Net | Verdict |
|---|---|---|---|---|---|---|---|
| 1 | Diagnostic scalpel (per-para LLM fix) | $ | 2 | 0 | 1 | **+2** | **adopt as optional** |
| 2 | Multi-model bakeoff (Opus / Gemini / Kimi) | $$$ | 1 | 1 | 1 | 0 | adopt as per-kind hint |
| 3 | Drift-aware surgical edit | $ | 0 | 0 | 3 | 0 | drop |
| 4 | Moves-augmented rewrite | $$ | 0 | 3 | 0 | **−3** | drop |
| 5 | Pure-code polish (no LLM, $0 baseline) | $0 | 0 | 3 | 0 | **−3** | drop at this aggression |

## Detailed findings per strategy

### #1 — Diagnostic scalpel (winner, modest)

PR: https://github.com/apollostreetcompany/blog-multi-writer-refactor/pull/23

`voice_pipeline.diagnostics.diagnose_prose` flags paragraphs by category
(cliche, dead_exposition, adverb_drag, rhythm_collapse, hedging, …). For each
of the top 5 issues, send ONLY that paragraph plus the diagnostic question to
Sonnet 4.6 with the instruction "edit only this paragraph, preserve meaning".
Splice the result back into the article, then re-scrub.

- 2/3 articles: the judge prefers AFTER by an average of +1.3 points. Diagnostic counts dropped 8→5 and 8→4 respectively
- 1/3 (speaking-in-tongues): only 1 diagnostic existed, fix changed 2 words, judge declined to score (trivial diff)
- 0/3 hurt
- Categories that actually produced productive rewrites: `adverb_drag` and `rhythm_collapse`
- `dead_exposition` prompts mostly came back as no-change (Sonnet judged the paragraphs fine)
- Cost: ~$0.30 per article. Runtime ~17s/article.

**Adopt as optional** — fold into `simple_writer.pipeline` behind a flag
(`diagnose_and_fix=True`). Set the gate at severity ≥0.5 to skip trivial cases.
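
The flag-gate-fix-splice loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `Diagnostic` shape, the `scalpel` function, and the `fix_paragraph` callable (standing in for the Sonnet call) are all hypothetical names, and the real return type of `diagnose_prose` may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Diagnostic:
    # Hypothetical shape; the real diagnose_prose output may look different.
    paragraph_index: int  # which paragraph the issue was found in
    category: str         # e.g. "adverb_drag", "rhythm_collapse"
    severity: float       # assumed to lie in [0, 1]
    question: str         # diagnostic question sent along with the paragraph


def scalpel(
    paragraphs: List[str],
    diagnostics: List[Diagnostic],
    fix_paragraph: Callable[[str, str], str],
    min_severity: float = 0.5,
    top_n: int = 5,
) -> List[str]:
    """Rewrite only the worst flagged paragraphs and splice them back."""
    # Gate out trivial cases, then keep the top_n most severe issues.
    eligible = [d for d in diagnostics if d.severity >= min_severity]
    worst = sorted(eligible, key=lambda d: d.severity, reverse=True)[:top_n]

    out = list(paragraphs)  # copy, so the input list is left untouched
    for d in worst:
        # The model sees ONLY this paragraph plus its diagnostic question.
        out[d.paragraph_index] = fix_paragraph(out[d.paragraph_index], d.question)
    return out
```

Passing the LLM call in as a plain callable keeps the sketch testable without network access; unflagged paragraphs never reach the model, which is what keeps the diff small.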

### #2 — Multi-model bakeoff (kind-dependent)

PR: https://github.com/apollostreetcompany/blog-multi-writer-refactor/pull/27

Regenerate with Opus + Gemini 2.5 Pro + Kimi K2 0905, audience-score, send
winner to judge vs Opus original.

- **nando-de-freitas (tech)**: Kimi K2 won audience (91 vs 88). Judge AFTER 85/88/82 vs BEFORE 72/74/78. Kimi threads concrete contemporary cases through theological commentary skillfully.
- **speaking-in-tongues (theology)**: Gemini tied audience (91 vs 91) but judge picked BEFORE handily (88/86/84 vs 72/74/78). Gemini drifts into generic evangelical-magazine tone.
- **freemasonry (cultural)**: After the Opus baseline was rescored (a transient JSON parse failure had dropped the first attempt to 0), Opus held at 91, tied with Gemini, and won the slop tiebreak (1.37 vs Gemini's 3.22). Net: Opus held.
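
The selection rule implied by these results (highest audience score wins, ties broken by lowest slop) might look like the sketch below. `Candidate` and `pick_winner` are illustrative names, not part of the repo, and the example includes only the two freemasonry scores reported here.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    model: str
    audience: float  # audience score, higher is better
    slop: float      # slop score, lower is better


def pick_winner(candidates: List[Candidate]) -> Candidate:
    # Sort key: audience descending (negated), then slop ascending as tiebreak.
    return min(candidates, key=lambda c: (-c.audience, c.slop))


freemasonry = [
    Candidate("opus", 91, 1.37),
    Candidate("gemini", 91, 3.22),
]
print(pick_winner(freemasonry).model)  # opus
```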

**Adopt as per-kind hint** — extend the `Persona` schema with `preferred_models_by_kind: {"tech": "kimi-k2-0905", "theology": "opus-4.7"}`. Skip the full 3-model bakeoff (too expensive at 2-3× cost).

### #3 — Drift-aware surgical edit (drop)

PR: https://github.com/apollostreetcompany/blog-multi-writer-refactor/pull/25

Compute the top-5 drift offenders, then prompt Sonnet 4.6 with an explicit instruction: "minimum surgical edits to bring these toward target". Accept the edit only if the OOB count strictly drops, length stays ≥95% of the original, and slop does not rise.
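
As a rough illustration, the strict acceptance gate could be expressed as a single predicate. The `Snapshot` container and `accept_edit` name are assumptions made for this sketch, as is measuring length in words and reading "length ≥95%" as relative to the original.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Hypothetical container for the three numbers the gate compares.
    oob_count: int  # OOB count from the drift report
    length: int     # article length (word count assumed here)
    slop: float     # slop score, lower is better


def accept_edit(before: Snapshot, after: Snapshot, min_length_ratio: float = 0.95) -> bool:
    """Return True only if all three gate conditions hold."""
    return (
        after.oob_count < before.oob_count                    # must strictly drop
        and after.length >= min_length_ratio * before.length  # no heavy cutting
        and after.slop <= before.slop                         # slop must not rise
    )
```

Because the first condition is strict, an edit that leaves the OOB count unchanged is rejected even when it is otherwise harmless.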

- 0/3 accepted by the strict gate
- 0/3 judge wins (all ties)
- Root cause: dominant offenders are zero-valued floor metrics (`exclamation_rate`, `sentence_start_lowercase_ratio`, `list_item_ratio`) that a faithful copy editor will not manufacture out of thin air
- In a permissive earlier run where the gate did pass, the judge still preferred the original (the typography felt editorially imposed)

**Drop.** Would only work with (a) weighted-delta gate ignoring floor metrics, and (b) explicit license to manufacture floor tokens — both of which defeat the point of "minimum surgical edits".

### #4 — Moves-augmented rewrite (clear hurt)

PR: https://github.com/apollostreetcompany/blog-multi-writer-refactor/pull/24

Mine 10 named rhetorical moves from the Pete Nicholas seed corpus. Send article + moves catalogue to Opus 4.7 with "rewrite preserving content but applying these moves more deliberately".

- Judge picked BEFORE 3/3
- Mean composite voice/prose/argument: 84.9 → 82.6 (−2.3)
- Failure mode: explicit moves induce listicle scaffolding ("First / Second", "How can we respond?") and over-tidied punctuation that flatten the conversational pastor-intellectual register the few-shot-only baseline already had
- Moves applied *too* visibly — converted essayistic prose into sermon-outline prose

**Drop as a positive constraint.** The moves catalogue might be useful as an *anti-pattern* list (forbid listicle scaffolding when generating, rather than enforce moves when rewriting).

### #5 — Pure-code polish (clear hurt at this aggression)

PR: https://github.com/apollostreetcompany/blog-multi-writer-refactor/pull/26

Stricter rules on top of `simple_writer.editor_pass`: em-dash tightening to Pete's exact 0.0027/word rate, adverb-drag reduction on flagged paragraphs, hedge trim, triadic-list breaker.
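
To make the 0.0027/word target concrete, here is a small sketch of how an em-dash rate and the budget it implies could be computed. The function names and the tokenisation rule (whitespace split, ignoring tokens with no letters or digits) are assumptions for illustration, not what `editor_pass` does.

```python
EM_DASH = "\u2014"
TARGET_RATE = 0.0027  # em-dashes per word, the target quoted above


def em_dash_rate(text: str) -> float:
    """Em-dashes per word; tokens without any letter or digit are not words."""
    words = [t for t in text.split() if any(ch.isalnum() for ch in t)]
    return text.count(EM_DASH) / len(words) if words else 0.0


def em_dash_budget(word_count: int, rate: float = TARGET_RATE) -> int:
    """Approximate number of em-dashes an article of this length may keep."""
    return round(word_count * rate)


# For a hypothetical 1,500-word article the target allows about 4 em-dashes.
print(em_dash_budget(1500))  # 4
```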

- Hits its mechanical targets: em-dash rate down 42% to 47%, low-information adverbs down 4 per article
- LLM judge picked BEFORE 3/3, by 1-2 points
- Failure mode: **diagnostic flags fire on legitimate voice moves**
  - Pete's anaphoric "It takes seriously the X… It takes seriously the Y…" → polish stripped → grammatically incomplete sentences
  - "routinely go three weeks" → "go three weeks" (loses British essayist tentativeness)
  - "strikingly homogenous" → "homogenous" (flattens)

**Drop at this aggression.** Em-dash tightening alone is mechanically correct (the judge only docked 1 point for prose). Adverb stripping needs a smarter signal than "low-information `-ly` adverb".

## Recommended changes to `simple_writer.pipeline`

Two narrow additions, both opt-in:

1. **`run_full_pipeline(..., diagnose_and_fix: bool = False)`** — runs Strategy 1 between the existing audience pass (step 6) and character check (step 7). Cost: +~$0.30/article. Benefit: 2/3 articles modestly improved, 0/3 hurt.

2. **`Persona.preferred_models_by_kind: Dict[str, str]`** — when set, the writer model picker uses this map instead of `preferred_models[0]`. Concretely: `{"deep_dive": "moonshotai/kimi-k2-0905", "tech": "moonshotai/kimi-k2-0905"}` for Pete-style tech topics; keep Opus 4.7 for theology and personal essay.
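
A minimal sketch of the picker behaviour described in item 2, assuming a simplified stand-in for `Persona` (the real schema has more fields) and a hypothetical `pick_writer_model` helper. The `"opus-4.7"` string is a placeholder for whatever model ID the pipeline actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Persona:
    # Simplified stand-in: only the two fields relevant to model selection.
    preferred_models: List[str]
    preferred_models_by_kind: Dict[str, str] = field(default_factory=dict)


def pick_writer_model(persona: Persona, kind: str) -> str:
    # Use the per-kind hint when one exists, else fall back to the first default.
    return persona.preferred_models_by_kind.get(kind, persona.preferred_models[0])


pete = Persona(
    preferred_models=["opus-4.7"],  # placeholder ID
    preferred_models_by_kind={
        "deep_dive": "moonshotai/kimi-k2-0905",
        "tech": "moonshotai/kimi-k2-0905",
    },
)
print(pick_writer_model(pete, "tech"))      # moonshotai/kimi-k2-0905
print(pick_writer_model(pete, "theology"))  # opus-4.7
```

Leaving the map empty by default keeps the change opt-in: personas that never set it behave exactly as before.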

Everything else stays in `voice_pipeline/` as the museum it is. The lesson is consistent across all 5 experiments: **adding constraints to a working few-shot pipeline mostly hurts.** The seed corpus already carries the voice. Mechanical post-processing strips voice-load-bearing quirks. Explicit rules induce template behaviour.

## What this confirms

The honest answer to "can you improve them at all by using elements of the voice pipeline": **a small yes, mostly no.**

- Yes: targeted per-paragraph diagnostic fixes help when a specific issue is real
- Yes: Kimi K2 beat Opus 4.7 on the one tech article tested, which suggests it is the better tech-prose model for Pete-style essays
- No: drift-targeted editing, moves-augmented rewriting, and aggressive rule polish all either do nothing or actively hurt
- The lean `simple_writer` pipeline is doing most of the work already. The seed corpus + few-shot beats engineered constraint stacks 4 times out of 5.
