# Strategy 5 — Pure-code polish (no LLM)

## Hypothesis

Stricter rule-based passes (em-dash tightening to Pete's exact rate, adverb-drag
reduction, hedge trim, triadic-list breaking) can add measurable polish on top
of `simple_writer.pipeline.editor_pass` without LLM cost.

If a $0 strategy moves the needle as much as the LLM strategies, that's the
lesson.

## Method

Implemented `polish_pass(text, seed)` in `polish.py`:

1. Mask code blocks, inline code, headings, dialogue
2. Em-dash tightening to Pete's measured rate (0.0027/word; `editor_pass`
   only triggers above `target_rate * 1.5` = 0.0045, which corresponds to a
   `target_rate` of 0.003)
3. Adverb-drag reduction on `diagnose_prose`-flagged paragraphs: strip
   low-information adverbs and intensifiers (really, very, fairly, quite,
   actually, simply, etc.), but only if the change keeps word count within 5%
4. Passive simplification — conservative; spec requires 100% word preservation,
   so it's effectively a no-op (active rewrites lose `was`/`by`)
5. Hedge trim on flagged paragraphs (perhaps, maybe, I think)
6. Triadic-list breaker for 3+ consecutive "X, Y, and Z" sentences
7. Standard `simple_writer.editor_pass` final scrub
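
Step 1 is what keeps the later passes from touching protected text. A minimal sketch of one way to mask and restore spans, assuming hypothetical `mask` / `unmask` helpers and regex patterns that are not taken from `polish.py`; dialogue masking is left out because the report does not define it:

```python
import re

# Hypothetical patterns, applied in order: fenced code, inline code, headings.
_MASK_PATTERNS = [
    re.compile(r"```.*?```", re.DOTALL),        # fenced code blocks
    re.compile(r"`[^`\n]+`"),                   # inline code
    re.compile(r"^#{1,6} .*$", re.MULTILINE),   # ATX headings
]


def mask(text):
    """Swap protected spans for NUL-delimited placeholders."""
    spans = []

    def stash(match):
        spans.append(match.group(0))
        return f"\x00{len(spans) - 1}\x00"

    for pattern in _MASK_PATTERNS:
        text = pattern.sub(stash, text)
    return text, spans


def unmask(text, spans):
    """Restore spans, newest first, so nested placeholders resolve too."""
    for index in reversed(range(len(spans))):
        text = text.replace(f"\x00{index}\x00", spans[index])
    return text
```

The rule passes would then run on the masked string only, with `unmask` called once at the end. Restoring in reverse order matters because a heading stashed late may itself contain an inline-code placeholder stashed earlier.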

Em-dash pair-awareness: parenthetical pairs (`X — aside — Y`) are converted
together to avoid orphaning a lone em-dash.
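
A minimal sketch of pair-aware tightening, under stated assumptions: the function name, the comma replacement, and the sentence splitter are all hypothetical, and the real `polish.py` may select and rewrite dashes differently. It converts every dash in a chosen sentence at once, which is one simple way to guarantee a parenthetical pair is never split:

```python
import math
import random
import re

EM = "\u2014"          # the em-dash character
TARGET_RATE = 0.0027   # Pete's measured em-dashes per word, from this report


def tighten_em_dashes(text, seed=0, target=TARGET_RATE):
    """Replace em-dashes with commas until the count fits the target rate."""
    # Count words, ignoring free-standing dashes such as in "X <dash> Y".
    words = sum(1 for tok in text.split() if tok != EM)
    excess = text.count(EM) - math.floor(target * words)
    if excess <= 0:
        return text

    # Even indices hold sentences, odd indices hold the whitespace between.
    parts = re.split(r"((?<=[.!?])\s+)", text)
    candidates = [i for i in range(0, len(parts), 2) if EM in parts[i]]
    random.Random(seed).shuffle(candidates)  # seed keeps the choice repeatable

    for i in candidates:
        if excess <= 0:
            break
        excess -= parts[i].count(EM)
        # All dashes in the sentence go together, so pairs stay intact.
        parts[i] = re.sub(rf"\s*{EM}\s*", ", ", parts[i])
    return "".join(parts)
```

Because whole sentences are converted, this sketch can land slightly below the target rather than exactly on it.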

## Per-article results

### 1. nando-de-freitas (original audience score 88/100)

| Metric                       | Before  | After   | Δ       |
| ---------------------------- | ------- | ------- | ------- |
| Word count                   | 2674    | 2670    | ×0.999  |
| Em-dash rate                 | 0.00524 | 0.00300 | -43%    |
| All `-ly` adverb tokens      | 43      | 39      | -4      |
| Low-info adverb count        | 20      | 16      | -4      |
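
The word-count and em-dash rows in these per-article tables can be computed along the following lines. This is a sketch only: `word_count`, `em_dash_rate`, and `compare` are hypothetical names, and the tokenization used by `run.py` is not shown in this report.

```python
EM = "\u2014"  # the em-dash character


def word_count(text):
    # Whitespace tokens, ignoring free-standing em-dashes.
    return sum(1 for tok in text.split() if tok != EM)


def em_dash_rate(text):
    words = word_count(text)
    return text.count(EM) / words if words else 0.0


def compare(before, after):
    """Return the word-count ratio and the relative em-dash rate change."""
    words_before = word_count(before)
    rate_before, rate_after = em_dash_rate(before), em_dash_rate(after)
    return {
        "word_ratio": word_count(after) / words_before if words_before else 0.0,
        "em_dash_rate_before": rate_before,
        "em_dash_rate_after": rate_after,
        "em_dash_rate_change": (
            (rate_after - rate_before) / rate_before if rate_before else 0.0
        ),
    }
```

Calling `compare(before_text, after_text)` on an article pair gives a ratio for the word-count row and a fractional change for the em-dash row.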

Diagnostic counts:

| Category          | Before | After |
| ----------------- | ------ | ----- |
| dead_exposition   | 2      | 2     |
| rhythm_collapse   | 2      | 2     |
| adverb_drag       | 4      | 2     |

**LLM judge:** winner = **before** (before 82/84/86 vs after 81/82/85,
voice/prose/argument). Judge noted: "before uses em-dashes with surrounding
spaces as stylistic punctuation that reads more naturally as a British
essayist's rhetorical pause… also preserves the fuller Augustine passage
('loves God supremely') which strengthens the theological argument."

Assessment: rule polish hits the right *targets* but loses voice signature
that the seed corpus deliberately carries.

### 2. speaking-in-tongues (original 91/100)

| Metric                       | Before  | After   | Δ       |
| ---------------------------- | ------- | ------- | ------- |
| Word count                   | 2900    | 2897    | ×0.999  |
| Em-dash rate                 | 0.00517 | 0.00311 | -40%    |
| All `-ly` adverb tokens      | 59      | 56      | -3      |
| Low-info adverb count        | 27      | 24      | -3      |

Diagnostic counts:

| Category          | Before | After |
| ----------------- | ------ | ----- |
| adverb_drag       | 1      | 0     |

**LLM judge:** winner = **before** (78/80/82 vs 77/78/80). Judge caught a
critical regression: stripping repeated "seriously" in "It takes seriously
the unique apostolic moment… It takes seriously the historical fact…" — a
deliberate anaphoric drumbeat in Pete's voice — produced "grammatically
incomplete sentences ('It takes the unique apostolic moment')."

Assessment: rule polish broke a rhetorical pattern. The diagnostic flag was
correct that this paragraph had 3 "seriously" instances, but they were
intentional voice.
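
The regression is easy to reproduce with a naive stripper. The sketch below assumes a hypothetical `strip_adverbs` helper with the 5% gate from the Method section applied per paragraph; the adverb list is the one named there plus "seriously", and the `keep_repeated` guard is an illustration, not something `polish.py` implements.

```python
import re
from collections import Counter

# Assumed list: the adverbs named under Method, plus "seriously".
LOW_INFO = {"really", "very", "fairly", "quite", "actually", "simply", "seriously"}
_WORD = re.compile(r"[A-Za-z']+")


def strip_adverbs(paragraph, max_delta=0.05, keep_repeated=False):
    """Drop low-information adverbs unless the word-count gate rejects it."""
    words = _WORD.findall(paragraph)
    counts = Counter(w.lower() for w in words if w.lower() in LOW_INFO)
    # With keep_repeated, an adverb used twice or more is treated as
    # deliberate repetition (anaphora) and left alone.
    targets = {w for w, n in counts.items() if not (keep_repeated and n >= 2)}
    if not targets:
        return paragraph

    pattern = re.compile(
        r"\b(?:%s)\b\s*" % "|".join(sorted(targets)), re.IGNORECASE
    )
    stripped = pattern.sub("", paragraph)
    removed = len(words) - len(_WORD.findall(stripped))
    if removed > max_delta * len(words):
        return paragraph  # gate: too large a change, keep the original
    return stripped


fragment = (
    "It takes seriously the unique apostolic moment. "
    "It takes seriously the historical fact."
)
# The gate is loosened here only because the quoted fragment is far
# shorter than the real paragraph.
print(strip_adverbs(fragment, max_delta=0.2))
print(strip_adverbs(fragment, max_delta=0.2, keep_repeated=True))
```

The first call yields the sentences the judge objected to; the second leaves the drumbeat intact. A repetition count is a crude proxy for intent, so a guard like this would need checking against the seed corpus before being relied on.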

### 3. freemasonry (original 91/100)

| Metric                       | Before  | After   | Δ       |
| ---------------------------- | ------- | ------- | ------- |
| Word count                   | 2972    | 2968    | ×0.999  |
| Em-dash rate                 | 0.00505 | 0.00270 | -47%    |
| All `-ly` adverb tokens      | 59      | 55      | -4      |
| Low-info adverb count        | 23      | 19      | -4      |

Diagnostic counts:

| Category          | Before | After |
| ----------------- | ------ | ----- |
| rhythm_collapse   | 4      | 4     |
| adverb_drag       | 4      | 1     |

**LLM judge:** winner = **before** (84/86/88 vs 83/84/87). Judge noted:
"'routinely go three weeks' reads more naturally than the clipped 'go three
weeks'… 'strikingly homogenous' becomes 'homogenous,' 'Nietzsche is not
usually summoned' loses 'usually' that actually served the voice's
characteristic self-aware tentativeness."

Assessment: same pattern. Rule polish dropped voice-load-bearing adverbs.

## Aggregate findings

| Dimension                  | Effect of polish_pass     |
| -------------------------- | ------------------------- |
| Word count preservation    | ~99.9% (ok)               |
| Em-dash rate normalization | Near Pete's target (ok)   |
| Adverb-drag diagnostics    | Drops 4→2, 1→0, 4→1 (ok)  |
| LLM-judge voice score      | -1 across the board (bad) |
| LLM-judge prose score      | -2 across the board (bad) |
| LLM-judge argument score   | -1 / -2 / -1 (bad)        |
| LLM-judge winner           | before / before / before  |
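
The judge rows can be rechecked from the raw scores. A small sketch that recomputes the per-dimension deltas from the (voice, prose, argument) triples quoted in the per-article sections; the dictionary layout is an assumption and does not reflect the structure of `_stats.json`:

```python
# (before, after) judge scores, each as (voice, prose, argument),
# copied from the per-article sections of this report.
SCORES = {
    "nando-de-freitas": ((82, 84, 86), (81, 82, 85)),
    "speaking-in-tongues": ((78, 80, 82), (77, 78, 80)),
    "freemasonry": ((84, 86, 88), (83, 84, 87)),
}
DIMENSIONS = ("voice", "prose", "argument")


def deltas_by_dimension(scores):
    """Return {dimension: [after - before for each article]}."""
    deltas = {dim: [] for dim in DIMENSIONS}
    for before, after in scores.values():
        for dim, b, a in zip(DIMENSIONS, before, after):
            deltas[dim].append(a - b)
    return deltas


for dim, values in deltas_by_dimension(SCORES).items():
    print(f"{dim}: {values}")
```

Running it is a quick way to confirm each row of the table against the scores it summarizes.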

## Verdict — does Strategy 5 work?

**No, not as a net improvement.** The polish *hits its mechanical targets*
(em-dash rate falls to 0.0027-0.0031 against Pete's 0.0027, adverb-drag flags
drop), but the LLM judge prefers the before version on every article, by 1-2
points on each of the three dimensions.

The failure mode is consistent: **the diagnostic flags fire on legitimate voice
moves**. Pete uses repeated adverbs ("seriously, seriously, seriously")
as anaphora. He uses "usually", "routinely", and "strikingly" as qualifiers
that carry his voice and read as British-essayist tentativeness. He uses
em-dashes above his measured rate when arguing.

The lesson for the bake-off: **pure-code polish at this aggression level
*subtracts* voice fidelity rather than adding prose quality**. The right
approach is either (a) gentler thresholds that only fire on outright AI tics,
or (b) an LLM-mediated step that can distinguish "AI watermark" from
"deliberate voice move".

The strategy isn't worthless. Em-dash tightening to ~0.003 is mechanically
correct, and the judge docked only 1-2 points per dimension. Without the
adverb stripping, this polish would probably tie or marginally improve on the
original. The 5% word-delta gate kept damage low (only 3-4 words removed per
article), but even those small changes were enough to flip the verdict.

## Files

- `polish.py` — the `polish_pass(text, seed)` implementation
- `run.py` — applies polish to the 3 articles, computes metrics, runs LLM judge
- `after/*.md` — polished versions
- `_stats.json` — raw metrics + judge verdicts