Source: /Users/borker/dev/hybrid-blog-writer-26-voice-pipeline/experiments/same_author_lift/ITERATION_NOTES.md
# Same-author lift iteration notes

Date: 2026-05-18

Goal: raise Pete outputs to at least 52/61 `same_author_llm` passes while
preserving voice and avoiding new repetitive tells.

## Tooling added

- `experiments/same_author_lift/run.py`
  - cached same-author evaluation
  - optional `single` judge mode matching `simple_writer.pipeline.character_check`
  - optional `multi` judge mode using all three seed excerpts
  - opener-prefix repair that writes candidates under `experiments/same_author_lift/after/`
  - rejection gates for same-author failure, word-count loss, drift rise, slop rise,
    scaffold increase, and quality-guard failures
  - rejected candidates preserved under `experiments/same_author_lift/rejected/`
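
The rejection gates listed above can be sketched as a single pure function. This is a minimal illustration, not the code in `run.py`: the `Metrics` fields, the `rejection_reasons` name, and the `min_word_ratio` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Hypothetical per-article measurements; field names are assumptions."""
    same_author: bool
    words: int
    drift: float
    slop: float
    scaffold_hits: int
    quality_ok: bool


def rejection_reasons(before: Metrics, after: Metrics,
                      min_word_ratio: float = 0.75) -> list[str]:
    """Return every gate the candidate fails; an empty list means accept.

    min_word_ratio is an illustrative threshold, not the harness's value.
    """
    reasons = []
    if not after.same_author:
        reasons.append("same_author_failure")
    if after.words < min_word_ratio * before.words:
        reasons.append("word_count_loss")
    if after.drift > before.drift:
        reasons.append("drift_rise")
    if after.slop > before.slop:
        reasons.append("slop_rise")
    if after.scaffold_hits > before.scaffold_hits:
        reasons.append("scaffold_increase")
    if not after.quality_ok:
        reasons.append("quality_guard_failure")
    return reasons


# Illustrative values only, not measurements from the experiment.
before = Metrics(False, 2500, 0.40, 0.80, 8, True)
after = Metrics(True, 2400, 0.38, 0.40, 9, True)
print(rejection_reasons(before, after))  # ['scaffold_increase']
```

Returning the full list of failed gates, rather than stopping at the first, makes it easier to see why a candidate ended up under `rejected/`.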

## Metric finding

The production `same_author_llm` check is weak as an optimization target. It
compares every article against a single excerpt, the second representative seed
excerpt:

> When we feel hard-pressed, God's grace fills us up...

That excerpt is a 68-word devotional paragraph. The generated articles are
longer discursive essays. Optimizing directly against this single excerpt would
push the articles toward devotional paragraph cadence and away from the broader
seed range.

The multi-excerpt judge is fairer, but still rejects many outputs because they
are too rhetorically polished, sardonic, or essayistic compared with the actual
seed corpus.
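
One way to judge against several excerpts is to collect one vote per excerpt and require a strict majority. The sketch below shows that aggregation only. It is not necessarily how the `multi` mode in `run.py` works, and the `Judge` signature and function name are assumptions; the judge is passed in so that the real LLM call stays outside the example.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (seed_excerpt, article_text) -> True if the
# judge believes both texts were written by the same author.
Judge = Callable[[str, str], bool]


def multi_excerpt_same_author(article: str, excerpts: Sequence[str],
                              judge: Judge) -> tuple[bool, float]:
    """Judge the article against every excerpt and take a strict majority.

    Returns (passed, share_of_yes_votes).
    """
    if not excerpts:
        raise ValueError("need at least one seed excerpt")
    votes = [judge(excerpt, article) for excerpt in excerpts]
    share = sum(votes) / len(votes)
    return share > 0.5, share
```

Returning the vote share alongside the boolean gives a rough confidence signal, so a 2/3 pass can be treated differently from a 3/3 pass.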

## Iterations

### v2: conservative opener copy edit

- Sample: first 5 articles
- Result: 1/5 after/carry-forward
- Failure: model made punctuation-level edits only.

### v3: include same-author rejection reason

- Sample: first 5 articles
- Result: 1/5 after/carry-forward
- Failure: still too conservative; most edits did not address register mismatch.

### v4/v5: stronger rewrite with Opus

- Sample: first 5 articles
- Result: 1/5 after/carry-forward
- Failure: Opus preserved the existing synthetic sentence architecture.

### one-off generative prefix rewrite

- Article: `bc-and-ad-how-the-christian-calendar-became-a-quiet-daily-confession-of-a-contro.md`
- Model: Gemini attempted, fell back to Sonnet 4.6
- Result: flipped same-author false -> true
- Rejected by current gate because it introduced an extra scaffold hit (`it is worth`)
- Quality guard judged it meaning-preserving but mildly flatter (`voice_delta = -1`)

### v6: all seed excerpts + generative prefix rewrite

- Sample: first 5 articles
- Result: 0/5 accepted with strict gates
- Failure: generated candidates still tended to be mild smoothing; quality guard
  repeatedly identified loss of the synthetic article's sharper phrasing.

### single-excerpt check on v6

- Sample: first 5 articles
- Result: 1/5 after/carry-forward
- Interpretation: the original `single` judge remains a bottleneck and should not
  be used alone for acceptance.

### v8: neutral brief -> fresh full-article rewrite

- Sample: first 5 articles
- Result: 2/5 after/carry-forward
- Attempted repairs: 4
- Accepted repairs: 1
- Only `bc-and-ad-how-the-christian-calendar-became-a-quiet-daily-confession-of-a-contro.md`
  flipped false -> true and passed gates.
- Accepted candidate improved deterministic tells: slop 0.7822 -> 0.0,
  drift 0.4286 -> 0.3929, scaffold hits 8 -> 3.
- Guard caveat: meaning/facts preserved, but voice_delta = -1 and judge still
  preferred the original prose for warmth and rhythm.
- Failure mode on the other attempts: either same-author stayed false, or the
  model produced rewrites so short that the harness fell back to the original.

### v9: seed-intro opener rewrite

- Sample: first 5 articles
- Result: 1/5 after/carry-forward
- Accepted repairs: 0
- Sonnet and Opus variants both stayed false on the same-author judge.
- Failure mode: preserving the article's anecdotal/essay opening keeps the first
  300 words too journalistic or discursive for the single devotional seed
  excerpt. Pushing the opener toward pastoral application often loses the
  sharper first-person voice.

### v10: intro-brief opener recomposition

- Sample: first 5 articles
- Result: 1/5 after/carry-forward
- Accepted repairs: 0
- Rebuilding only the first movement from a neutral brief avoided some line-edit
  anchoring, but it still did not pass the single-excerpt judge.
- Quality guard repeatedly preferred the original because the new openings
  softened concrete anecdotes into generic pastoral framing.

### diagnostic devotional-gate stress test

- Article: `bc-and-ad-how-the-christian-calendar-became-a-quiet-daily-confession-of-a-contro.md`
- Method: deliberately added seed-like anaphoric devotional cadence to the
  opening.
- Result: still false on same-author.
- Guard result: meaning_preserved = false, voice_delta = -2,
  new_repetitive_tell = true.
- Interpretation: directly imitating the 68-word seed excerpt is both ineffective
  and unsafe; it creates exactly the repeated scaffold fingerprint we are trying
  to avoid.

### v11: Kimi neutral-brief full rewrite

- Sample: first 5 articles
- Result: 1/5 after/carry-forward
- Accepted repairs: 0
- Failure mode: Kimi often produced substantially shorter drafts
  (e.g. 1076/2527, 1682/2830, 1385/2683 words), so the length/heading guard
  correctly fell back to the original. This is not worth scaling with the
  current full-brief prompt.

### v12: section-brief rewrite

- Sample: `bc-and-ad...` smoke test
- Result: 0/1 same-author lift
- Useful effect: fixed the full-brief shortening problem. The rewritten article
  stayed near source length (2504 vs 2527 words), slop dropped 0.7822 -> 0.3982,
  and scaffold hits dropped 8 -> 3.
- Failure mode: the opening remained essayistic/journalistic, so the
  single-excerpt same-author judge still rejected it.

### v13: hybrid full-brief opening + section-brief body

- Sample: first 5 articles
- Stable result after cached rerun: 2/5 after/carry-forward, 1 accepted repair.
- Smoke result: `bc-and-ad...` flipped false -> true and passed gates:
  word count 2527 -> 2361, slop 0.7822 -> 0.4235, scaffold hits 8 -> 2,
  quality guard voice_delta = -1.
- A transient run had `christian-tithing...` same-author true, but a repeat
  evaluation returned false. Treat that as unstable, not a real win.
- The other two attempted repairs failed: both still sounded too essayistic to
  the single-excerpt judge. `catholics-and-protestants...` also degraded voice
  (voice_delta = -2) and introduced a summarising "the author" tell.
- Takeaway: hybrid is the strongest post-hoc transform so far, but still nowhere
  near the 52/61 target. It may be useful as a candidate generator inside a
  multi-candidate regeneration pipeline, not as the default answer.

### v14: evaluator and de-amplification probes

- Full-corpus authorship probe on the first five originals: 0/5 passed. The
  mismatch is real, not only an artifact of the 68-word single seed excerpt.
- Plain-brief de-amplification on the first five: 1/5 after/carry-forward,
  0 accepted repairs. It improved some deterministic metrics but still failed
  the single-excerpt gate.
- Full-corpus probe on plain-brief candidates found one useful signal:
  `christian-tithing...` was plausible against the broader seed corpus despite
  failing the single-excerpt judge. The current quality guard is therefore
  partly misaligned: it protects the synthetic original voice, not necessarily
  real seed fidelity.
- True-seed one-article variants for `bc-and-ad...`: 2/3 passed full-corpus
  authorship, 0/3 passed the original single-excerpt gate. They preserved facts
  and meaning but shortened the article and still lost some warmth.

### v15: hybrid scaled to ten articles

- Sample: first 10 articles
- Result: 3/10 after/carry-forward, 1 accepted repair.
- Before baseline on this sample: 2/10.
- The only accepted repair remained `bc-and-ad...`.
- Common failure: hybrid reduces scaffolds/slop on many articles, but
  same-author usually remains false. Where it rewrites substantially, it often
  strips personal narrative texture and replaces it with smoother exposition;
  the quality guard flags this as voice loss.

### v16: portfolio selection over existing raw candidates

- Sample: first 10 articles
- Method: scan all existing raw rewrites, reject by deterministic gates first
  (word loss, slop rise, scaffold rise), run a one-vote same-author probe, then
  run three-vote confirmation and full quality guard only for plausible
  candidates.
- Result: 3/10 selected passes, matching v15 and not improving the aggregate.
- Source baseline on this sample under repeated vote: 2/10.
- Accepted candidates:
  - `bc-and-ad...`: hybrid-brief-section, 3/3 same-author, word ratio 0.934,
    slop 0.4225, scaffold hits 2, voice_delta -1.
  - `christian-tithing...`: brief rewrite, 3/3 same-author, word ratio 0.776,
    slop 0.4706, scaffold hits 5, voice_delta -1. This is acceptable by the
    loose portfolio gate but close to the lower length bound.
- Rejection pattern:
  - most failed candidates never passed even the one-vote same-author probe
  - several plausible rewrites were too short
  - some section rewrites improved voice enough to pass same-author but raised
    slop or scaffold risk
- Tooling note: `portfolio.py` now caches repeated same-author votes and runs
  expensive full quality checks only after same-author and cheap deterministic
  gates pass.
- Takeaway: candidate selection can prevent bad repairs and recover isolated
  wins, but it cannot manufacture enough same-author candidates from the
  existing post-hoc pool.
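
The staged ordering described for v16 (cheap deterministic gates, then a one-vote probe, then three-vote confirmation, then the full quality guard) can be sketched as below. This is an illustration under assumptions, not the contents of `portfolio.py`: every function name and signature is hypothetical, and requiring all confirmation votes to agree (`min_yes=3`) is a guess based on the 3/3 results reported above.

```python
from typing import Callable, Iterable, Optional

Vote = Callable[[str, int], bool]  # hypothetical: (candidate_text, vote_index)


def cached(vote: Vote) -> Vote:
    """Memoize votes so a repeated (text, index) pair is never re-judged."""
    memo: dict[tuple[str, int], bool] = {}

    def wrapper(text: str, index: int) -> bool:
        key = (text, index)
        if key not in memo:
            memo[key] = vote(text, index)
        return memo[key]

    return wrapper


def select_candidate(candidates: Iterable[str],
                     cheap_gate: Callable[[str], bool],
                     vote: Vote,
                     quality_guard: Callable[[str], bool],
                     confirm_votes: int = 3,
                     min_yes: int = 3) -> Optional[str]:
    """Return the first candidate that survives every stage, else None."""
    vote = cached(vote)
    for text in candidates:
        if not cheap_gate(text):       # word loss, slop rise, scaffold rise
            continue
        if not vote(text, 0):          # one-vote same-author probe
            continue
        yes = sum(vote(text, i) for i in range(confirm_votes))
        if yes < min_yes:              # confirmation reuses vote 0 from cache
            continue
        if quality_guard(text):        # expensive check runs last
            return text
    return None
```

Ordering the stages from cheapest to most expensive means most candidates are discarded before any quality-guard call is made.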

## Other-agent strategy review

Checked the five worktree reports under
`/Users/borker/dev/hybrid-blog-writer-26/.claude/worktrees/agent-*`.

- Diagnostic scalpel: modest quality win (2/3 helped, 0/3 hurt), but it is a
  copy-edit tool, not a same-author lift tool.
- Multi-model bakeoff: useful per topic; Kimi improved the tech article, but
  Gemini hurt theology/cultural voice. Needs hostile-judge gating.
- Drift-aware surgical edit: no accepted improvements; the dominant metric
  offenders are floor metrics that faithful editing should not manufacture.
- Moves-augmented rewrite: hurt 3/3 by turning organic prose into visible
  checklist/listicle prose.
- Rule-only polish: hurt 3/3 at current aggression because it stripped
  voice-bearing adverbs, dashes, and repetitions.

## Current conclusion

Post-hoc transformations are not enough to reach 52/61 safely. The best
post-hoc transform so far, hybrid full-brief opening + section-brief body,
reached only 3/10 on the first ten articles with one accepted repair.
Brief-based full rewrites can occasionally flip the same-author judge, but they
tend to shorten and flatten the article. Opener-only repairs preserve the
article but do not move the single-excerpt judge. Directly imitating the
devotional seed excerpt creates a repetitive tell and still fails.

The current generated corpus is systematically over-amplified: more literary,
polemical, and rhetorically polished than the seed corpus. A full-corpus
authorship probe also rejected the original first-five sample, so the mismatch
is real, not merely an evaluator artifact. Fixing it completely requires
generation-time changes, not only after-the-fact transformation.

The next viable strategy is seed-grounded regeneration plus a better evaluator:

1. Fix the evaluator first:
   - replace single-excerpt `same_author_llm` with a multi-excerpt or
     majority-vote judge
   - report confidence and reason categories
2. Build a batch fingerprint gate:
   - repeated scaffold phrase document frequency
   - repeated 5/6-grams excluding Scripture quotations
   - systematic drift offenders
3. Regenerate or full-rewrite failed articles with a stricter seed-grounded
   prompt:
   - suppress "soul amplification": clever hook, invented anecdote, dramatic
     scene, punchline title, and synthetic "useless residue"
   - prefer the actual corpus's plainer pastoral-practical cadence
   - preserve SEO facts/headings
   - gate with same-author, quality guard, drift, slop, and repetition checks
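
The repeated 5/6-gram part of the batch fingerprint gate in step 2 could be computed as a document frequency: count how many articles contain each n-gram, then report those shared by several. The sketch below is only an illustration. It assumes Scripture quotations appear as Markdown blockquote lines starting with `>`, which may not hold for the real corpus, and the function names and the `min_docs` threshold are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable


def strip_blockquotes(text: str) -> str:
    """Drop lines starting with '>' (assumed here to hold Scripture quotes)."""
    return "\n".join(line for line in text.splitlines()
                     if not line.lstrip().startswith(">"))


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Distinct word n-grams of a text, lowercased, punctuation ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def repeated_ngrams(docs: Iterable[str], sizes: tuple[int, ...] = (5, 6),
                    min_docs: int = 3) -> dict[str, int]:
    """Map each n-gram found in at least min_docs documents to that count."""
    doc_freq: Counter = Counter()
    for doc in docs:
        body = strip_blockquotes(doc)
        seen: set[tuple[str, ...]] = set()
        for n in sizes:
            seen |= ngrams(body, n)
        doc_freq.update(seen)  # each document counts an n-gram at most once
    return {" ".join(gram): count
            for gram, count in doc_freq.items() if count >= min_docs}
```

Counting per document rather than per occurrence targets batch-level tells: a phrase used five times in one article matters less here than a phrase that appears once in every article.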

Post-edit tools can clean up a good generation. They cannot reliably convert the
current synthetic house style into the actual seed voice without flattening or
gaming the judge.