Source: /Users/borker/dev/hybrid-blog-writer-26/.claude/worktrees/agent-a22487196a1d1acd3/experiments/diagnostic-scalpel/REPORT.md

# Strategy 1 — Literary diagnostics -> targeted LLM scalpel

## What this strategy does

1. Run `voice_pipeline.diagnostics.diagnose_prose` on the source article.
2. Take the top 5 issues by severity, keeping at most one issue per paragraph index.
3. For each issue, send only that paragraph and its diagnostic question to `anthropic/claude-sonnet-4.6`, asking for a paragraph-only rewrite that preserves meaning and voice and keeps length within ±20%.
4. Splice rewrites back into the article.
5. Run `simple_writer.pipeline.editor_pass` once on the result.
6. Score with a hostile-copy-editor LLM judge.
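
The selection and splice steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Issue` dataclass, `select_top_issues`, `within_length_budget`, and `splice` are hypothetical names, the severity values in the usage lines are made up, and the ±20% check assumes the limit is measured in whitespace-separated words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical stand-in for one diagnostic finding."""
    para_index: int
    kind: str
    severity: float


def select_top_issues(issues, limit=5):
    """Keep the most severe issue per paragraph, then the `limit` most severe overall."""
    best = {}
    for issue in issues:
        current = best.get(issue.para_index)
        if current is None or issue.severity > current.severity:
            best[issue.para_index] = issue
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(best.values(), key=lambda i: i.severity, reverse=True)
    return ranked[:limit]


def within_length_budget(before, after, tolerance=0.20):
    """True if the rewrite's word count is within +/- tolerance of the original's."""
    before_words = len(before.split())
    after_words = len(after.split())
    if before_words == 0:
        return after_words == 0
    return abs(after_words - before_words) <= tolerance * before_words


def splice(paragraphs, rewrites):
    """Return a new paragraph list with rewrites (index -> text) applied."""
    return [rewrites.get(i, p) for i, p in enumerate(paragraphs)]


# Illustrative values only; these severities are not taken from a real run.
issues = [
    Issue(4, "rhythm_collapse", 0.4),
    Issue(4, "adverb_drag", 0.3),
    Issue(18, "adverb_drag", 0.3),
]
picked = select_top_issues(issues)
print([(i.para_index, i.kind) for i in picked])
```

Under this word-count reading, a 68-word paragraph could grow to at most 81 words. Some logged rewrites (for example 68w -> 83w) land slightly above that, so the limit is presumably a prompt instruction rather than a hard gate.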

## Per-article results

### nando-de-freitas-unveils-interventional-sft-for-agent-models-what-it-means-when.md

- header audience score (before, unchanged): **88/100**
- word count: **2677 -> 2693** (delta +16)
- character diff vs before: **1.30%**
- diagnostics before: 8 (adverb_drag=4, dead_exposition=2, rhythm_collapse=2)
- diagnostics after: 5 (adverb_drag=3, dead_exposition=2)
- issues attempted: 5, applied: 3

**Per-edit log:**

- para 10 [dead_exposition] -> no-change
- para 30 [dead_exposition] -> no-change
- para 4 [rhythm_collapse] sev 0.4 -> rewrote (68w -> 83w)
- para 11 [rhythm_collapse] sev 0.4 -> rewrote (57w -> 72w)
- para 18 [adverb_drag] sev 0.3 -> rewrote (90w -> 92w)

**LLM judge verdict:**

- before scores: voice 84, prose 83, argument 85 (avg 84.0)
- after scores: voice 85, prose 85, argument 86 (avg 85.3)
- winner: **after**
- reasoning: The after version makes modest but real improvements: the Augustine section is tightened and more precise ('not the catalogue of their deeds but the structure of their loves' is sharper than the original's comma-heavy construction), and the opening technical explanation flows more cleanly. The gains are marginal — this is a light copyedit, not a rewrite — but consistently in the right direction across all three dimensions.

**Honest assessment:** Helped: the judge prefers AFTER by +1.3 points on average (84.0 -> 85.3); diagnostic count fell 8 -> 5.

---

### speaking-in-tongues-what-1-corinthians-12-and-14-actually-argue-and-whether-the.md

- header audience score (before, unchanged): **91/100**
- word count: **2926 -> 2924** (delta -2)
- character diff vs before: **0.10%**
- diagnostics before: 1 (adverb_drag=1)
- diagnostics after: 0 (_none_)
- issues attempted: 1, applied: 1

**Per-edit log:**

- para 35 [adverb_drag] sev 0.3 -> rewrote (80w -> 78w)

**LLM judge verdict:**

_Skipped: character diff of 0.10% is below the 1% threshold for judging._

**Honest assessment:** Did nothing meaningful: change too small to judge.
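
The skip rule for tiny diffs might look like the sketch below. The report does not say how "character diff" is computed, so using `difflib.SequenceMatcher` is an assumption, and `char_diff_percent` and `should_judge` are hypothetical names.

```python
import difflib


def char_diff_percent(before, after):
    """Approximate share of characters that changed, as a percentage.

    Assumption: 100 * (1 - similarity ratio); the report's metric may differ.
    """
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    return (1.0 - matcher.ratio()) * 100.0


def should_judge(before, after, threshold_percent=1.0):
    """Only call the LLM judge when the edit is large enough to evaluate."""
    return char_diff_percent(before, after) >= threshold_percent
```

Gating on diff size avoids spending a judge call on changes too small to score reliably.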

---

### freemasonry-why-a-secret-society-offering-brotherhood-is-the-wrong-answer-to-a-r.md

- header audience score (before, unchanged): **91/100**
- word count: **2914 -> 2967** (delta +53)
- character diff vs before: **1.98%**
- diagnostics before: 8 (adverb_drag=4, rhythm_collapse=4)
- diagnostics after: 4 (adverb_drag=3, rhythm_collapse=1)
- issues attempted: 5, applied: 4

**Per-edit log:**

- para 11 [rhythm_collapse] sev 0.4 -> rewrote (89w -> 107w)
- para 24 [rhythm_collapse] sev 0.4 -> rewrote (90w -> 110w)
- para 31 [rhythm_collapse] -> no-change
- para 39 [rhythm_collapse] sev 0.4 -> rewrote (74w -> 83w)
- para 1 [adverb_drag] sev 0.3 -> rewrote (70w -> 76w)

**LLM judge verdict:**

- before scores: voice 88, prose 87, argument 86 (avg 87.0)
- after scores: voice 89, prose 89, argument 87 (avg 88.3)
- winner: **after**
- reasoning: The after version makes several targeted improvements that lift the prose without overwriting it: 'disintegrating behind a composed exterior' is sharper than 'quietly falling apart,' the Hiram Abiff passage gains a devastating closing clause about men lowering themselves into a grave from which nothing living rises, and the church-failure section tightens its list into a single propulsive sentence with em-dash emphasis on cost. The gains are real but modest—this is revision at the margin, not transformation.

**Honest assessment:** Helped: the judge prefers AFTER by +1.3 points on average (87.0 -> 88.3); diagnostic count fell 8 -> 4.

---

## Aggregate

- helped: **2/3**
- hurt: **0/3**
- did nothing meaningful: **1/3**
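
The tally above can be reproduced from the per-article judge scores. The sketch below assumes that "helped" and "hurt" are decided by the sign of the change in average score, and that a skipped judgment counts as "did nothing meaningful". The `classify` helper and the shortened article keys are illustrative.

```python
from collections import Counter


def average(scores):
    return sum(scores) / len(scores)


def classify(before, after):
    """before/after are (voice, prose, argument) tuples, or None if judging was skipped."""
    if before is None or after is None:
        return "did nothing meaningful"
    delta = average(after) - average(before)
    if delta > 0:
        return "helped"
    if delta < 0:
        return "hurt"
    return "did nothing meaningful"


# Judge scores copied from the per-article results in this report.
results = {
    "nando-de-freitas": ((84, 83, 85), (85, 85, 86)),
    "speaking-in-tongues": (None, None),  # judge skipped: diff below 1%
    "freemasonry": ((88, 87, 86), (89, 89, 87)),
}

tally = Counter(classify(b, a) for b, a in results.values())
for label in ("helped", "hurt", "did nothing meaningful"):
    print(f"{label}: {tally[label]}/{len(results)}")
```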