Source: /Users/borker/dev/hybrid-blog-writer-26-voice-pipeline/docs/improvements/VOICE_SIGNATURE_AUDIT.md
# Voice signature audit — Pete Nicholas simple_writer outputs

Date: 2026-05-18

## Question

Do the generated Pete Nicholas articles have a statistically visible LLM/pipeline
signature, or do they read as a quantitatively distinct voice suitable for SEO
author use?

## Sources checked

- 61 final articles in `outputs/simple/pete-nicholas/*.md`
- Pete seed corpus: `/Users/borker/Downloads/seed-pete-n.md` (5,256 words)
- Seed profile: `seeds/pete-nicholas.json`
- Other-agent worktrees under `/Users/borker/dev/hybrid-blog-writer-26/.claude/worktrees/agent-*`
- Existing aggregate: `docs/improvements/SUMMARY.md`

## Assumptions and limits

- This is a deterministic stylometric audit, not a commercial AI detector.
- The Pete seed is small: it yields only eight paragraph-preserving chunks.
  That is enough to reveal large drift, but not enough to prove authorship.
- The 61 generated pieces are mostly theological/SEO topics, while the seed has
  more city/politics texture, so topic mix is a confound.
- The current slop hard-fail count is not reliable because the remnant regex
  flags ordinary "I cannot..." sentences as AI self-reference.
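The false positive described in the last bullet can be illustrated with a minimal sketch. Both patterns below are hypothetical stand-ins, since the actual remnant regex is not shown in this report; the narrower one assumes that a refusal should only count as an AI remnant when the same sentence also contains an explicit AI self-description.

```python
import re

# Hypothetical over-broad pattern: any first-person refusal is flagged.
BROAD = re.compile(r"\bI (?:cannot|can't|am unable to)\b", re.IGNORECASE)

# Hypothetical narrower pattern: the refusal must follow an explicit
# AI self-description inside the same sentence.
NARROW = re.compile(
    r"\b(?:as an? (?:AI|language model|assistant)|I am an? (?:AI|language model))\b"
    r"[^.!?]*\bI (?:cannot|can't|am unable to)\b",
    re.IGNORECASE,
)


def is_ai_remnant(sentence: str, pattern: re.Pattern = NARROW) -> bool:
    """Return True when the sentence matches the given remnant pattern."""
    return pattern.search(sentence) is not None


pastoral = "I cannot pretend that grief resolves on a schedule."
remnant = "As an AI language model, I cannot offer pastoral advice."

print(is_ai_remnant(pastoral, BROAD))   # flagged by the broad pattern
print(is_ai_remnant(pastoral, NARROW))  # not flagged by the narrow pattern
print(is_ai_remnant(remnant, NARROW))   # still flagged
```

The narrow pattern trades recall for precision, so it would likely need its own regression set of known remnants before replacing the current check.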

## Headline finding

The outputs are not generic LLM slop. They do form a recognizable synthetic
Pete-ish house voice. But they are still statistically separable from the real
seed and carry a repeatable pipeline fingerprint. I would not call this
"no LLM fingerprint" yet.

## Corpus summary

| Measure | Result |
|---|---:|
| Final generated articles | 61 |
| Generated word count | 166,375 |
| Seed word count | 5,256 |
| Audience score mean / median / range | 91.85 / 91 / 88-94 |
| `same_author_llm: true` in metadata | 16/61 |
| Metadata stylometric distance mean / median / max | 0.0019 / 0.0013 / 0.0093 |
| Metadata foibles overlap mean / median / min | 0.628 / 0.633 / 0.433 |
| Slop rate mean / median / max | 0.674 / 0.711 / 2.244 per 1k words |

The metadata is internally contradictory: the code-based stylometric distance
and foibles overlap both indicate a close match, but the LLM same-author check
says "no" on 45/61 articles.

## Drift against seed profile

| Measure | Result |
|---|---:|
| Drift score mean / median / range | 0.395 / 0.393 / 0.250-0.607 |
| Out-of-band metrics mean / median / range | 11.07 / 11 / 7-17 |

Systematic offenders:

| Metric | Articles OOB | Seed | Generated mean | Interpretation |
|---|---:|---:|---:|---|
| `dash_rate` | 61/61 | 0.0027 | 0.0045 | overuses em-dash pause |
| `sentence_start_lowercase_ratio` | 61/61 | 0.0119 | 0.0000 | too clean/normalized |
| `short_sentence_ratio` | 61/61 | 0.1344 | 0.2845 | too many short punchlines |
| `list_item_ratio` | 61/61 | 0.0396 | 0.0051 | fewer list/fragment structures |
| `exclamation_rate` | 60/61 | 0.0006 | 0.0001 | too restrained |
| `parenthesis_rate` | 57/61 | 0.0156 | 0.0045 | underuses parenthetical asides |
| `sentence_length_p10` | 50/61 | 8.0 | 4.86 | bottom tail too clipped |
| `quote_rate` | 42/61 | 0.0190 | 0.0129 | fewer direct quotations |
| `semicolon_rate` | 36/61 | 0.0027 | 0.0020 | slightly under seed |
| `sentence_length_p50` | 33/61 | 19.0 | 14.52 | median sentence too short |

This is the main quantitative signature: the generator has learned the broad
register but regularizes the surface too much, then compensates with short
sentences and em-dashes.
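A rough sketch of how surface metrics like these could be computed and compared against a seed profile follows. The definitions are assumptions, not the ones in `voice_pipeline.metrics`: rates are taken per word, "short" means eight words or fewer, and a metric counts as out of band when it deviates from the seed value by more than a relative tolerance. The function names and the 25% tolerance are hypothetical.

```python
from __future__ import annotations

import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def percentile(sorted_vals: list[int], q: float) -> float:
    # Linearly interpolated percentile on a pre-sorted list, q in [0, 1].
    if not sorted_vals:
        return 0.0
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def surface_metrics(text: str, short_max_words: int = 8) -> dict[str, float]:
    # Assumed definitions: per-word rates, word counts from whitespace split.
    sentences = split_sentences(text)
    lengths = sorted(len(s.split()) for s in sentences)
    n_words = max(len(text.split()), 1)
    n_sents = max(len(sentences), 1)
    return {
        "dash_rate": text.count("\u2014") / n_words,
        "parenthesis_rate": text.count("(") / n_words,
        "short_sentence_ratio": sum(1 for n in lengths if n <= short_max_words) / n_sents,
        "sentence_length_p10": percentile(lengths, 0.10),
        "sentence_length_p50": percentile(lengths, 0.50),
    }


def out_of_band(generated: dict[str, float], seed: dict[str, float],
                tolerance: float = 0.25) -> list[str]:
    # A metric is out of band when |generated - seed| exceeds tolerance * |seed|.
    return [
        name for name, seed_val in seed.items()
        if abs(generated[name] - seed_val) > tolerance * abs(seed_val)
    ]


# Seed and generated-mean values copied from the table above.
seed = {"dash_rate": 0.0027, "short_sentence_ratio": 0.1344, "sentence_length_p50": 19.0}
generated = {"dash_rate": 0.0045, "short_sentence_ratio": 0.2845, "sentence_length_p50": 14.52}
print(out_of_band(generated, seed))
```

Under this assumed 25% band the generated-mean median sentence length sits just inside the band, which is compatible with only 33 of 61 individual articles being flagged on that metric.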

## Repetition fingerprints

Phrase recurrence across 61 generated articles compared with seed:

| Pattern | Generated docs | Generated hits | Seed hits |
|---|---:|---:|---:|
| `this is not` | 48 | 94 | 0 |
| `not ... but ...` within 90 chars | 60 | 236 | 5 |
| `the question is` | 22 | 51 | 0 |
| `First/Second/Third,` | 25 | 79 | 8 |
| `I want to be careful` | 12 | 14 | 0 |
| `in the end` | 14 | 16 | 0 |
| `it is worth` | 11 | 15 | 0 |

Top repeated n-grams not present in the seed include:

- `i want to be careful` (12 docs)
- `the question is not whether` (8 docs)
- long Micah 6:8 fragments in 10-16 docs

The Micah repetition is partly domain/theology, but the "careful/question/not
but" family is a reusable rhetorical scaffold. It is voice-like in one article
and watermark-like across 61.
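For illustration, scaffold recurrence of this kind can be counted with a few regular expressions. This is a sketch under assumptions: the pattern set is a hypothetical subset, and the `not ... but ...` window is assumed to stay inside one sentence, which the audit does not specify.

```python
from __future__ import annotations

import re

# Hypothetical scaffold patterns; the 90-character window mirrors the table,
# but the restriction to a single sentence is an assumption.
SCAFFOLDS = {
    "this is not": re.compile(r"\bthis is not\b", re.IGNORECASE),
    "not ... but ...": re.compile(r"\bnot\b[^.!?]{0,90}?\bbut\b", re.IGNORECASE),
    "the question is": re.compile(r"\bthe question is\b", re.IGNORECASE),
    "i want to be careful": re.compile(r"\bi want to be careful\b", re.IGNORECASE),
}


def scaffold_counts(docs: list[str]) -> dict[str, tuple[int, int]]:
    """Return {pattern name: (documents containing it, total hits)}."""
    result = {}
    for name, pattern in SCAFFOLDS.items():
        hits_per_doc = [len(pattern.findall(doc)) for doc in docs]
        docs_with_hit = sum(1 for h in hits_per_doc if h > 0)
        result[name] = (docs_with_hit, sum(hits_per_doc))
    return result


docs = [
    "This is not a program. It is not a slogan but a practice.",
    "The question is not whether we act, but how.",
    "Nothing here.",
]
for name, (doc_count, hits) in scaffold_counts(docs).items():
    print(f"{name}: {doc_count} docs, {hits} hits")
```

Reporting both the document count and the hit count matters here: a phrase that appears once in most documents is a batch-level signature even when no single article overuses it.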

## Classifier separability

Using only deterministic stylometric metrics from `voice_pipeline.metrics`
(no lexical topic vectors), a logistic classifier can separate:

| Comparison | Samples | Accuracy | AUC | Caveat |
|---|---:|---:|---:|---|
| Pete seed paragraph chunks vs generated paragraph chunks | 8 vs 246 | 0.976 | 0.994 | seed is very small and unbalanced |
| Balanced repeated seed-vs-generated chunk samples | 8 vs 8 per run | 0.902 mean | 0.962 | high variance from tiny seed |
| Pete generated articles vs other generated LLM drafts | 61 vs 21 | 1.000 | 1.000 | topic/persona confounded |

The classifier result should not be overclaimed as an authorship proof. It does
mean the generated corpus has a stable statistical profile distinct from the
seed, even before using content words.
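The balanced comparison in the table can be sketched without any dependencies. This is not the logistic classifier used in the audit: it only shows how AUC and repeated balanced subsampling work for any per-chunk score, such as a single metric or a classifier probability. The function names, run count, and random seed are hypothetical.

```python
from __future__ import annotations

import random


def auc(pos_scores: list[float], neg_scores: list[float]) -> float:
    # Probability that a random positive outscores a random negative;
    # ties count as half a win (the Mann-Whitney formulation of AUC).
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def balanced_auc(seed_scores: list[float], generated_scores: list[float],
                 runs: int = 200, rng_seed: int = 0) -> float:
    # Draw equal-sized samples from both classes on every run so the larger
    # class cannot dominate, then average the per-run AUC.
    rng = random.Random(rng_seed)
    k = min(len(seed_scores), len(generated_scores))
    values = []
    for _ in range(runs):
        seed_sample = rng.sample(seed_scores, k)
        gen_sample = rng.sample(generated_scores, k)
        values.append(auc(gen_sample, seed_sample))
    return sum(values) / len(values)


# Toy scores only: 8 seed chunks versus 246 generated chunks.
seed_scores = [0.1 * i for i in range(8)]
generated_scores = [1.0 + 0.01 * i for i in range(246)]
print(balanced_auc(seed_scores, generated_scores))
```

With only eight seed chunks, every balanced run reuses the same eight positives, so the spread across runs reflects the choice of generated chunks rather than seed variability. That is one reason the table's caveat about high variance from a tiny seed should be taken seriously.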

Top separating features for seed-vs-generated chunks were:

- higher generated `stopword_ratio`
- higher generated `dash_rate`
- lower generated `sentence_length_p10`
- zero generated `sentence_start_lowercase_ratio`
- lower generated `parenthesis_rate`
- lower generated `quote_rate`
- lower generated `rare_word_ratio`

## Worktree experiment check

The five other-agent experiments all produced reports and after-artifacts:

| Strategy | Worktree | Result |
|---|---|---|
| Diagnostic scalpel | `agent-a22487196a1d1acd3` | modest help, 2/3 judge wins |
| Multi-model bakeoff | `agent-a7acdcf109f784a93` | Kimi helped tech; not universal |
| Drift surgical | `agent-acfcd407f3a8a6b3d` | no accepted changes |
| Moves augmented | `agent-acd288d8ce72140b1` | hurt 3/3 |
| Rule polish | `agent-a6f8c455ab610f6d4` | hurt 3/3 |

Deterministic metrics on the three shared samples match the reports:

| Strategy | Drift | OOB metrics | Slop | Foibles |
|---|---:|---:|---:|---:|
| Diagnostic scalpel | 0.440 -> 0.405 | 12.3 -> 11.3 | 1.03 -> 1.02 | 0.646 -> 0.646 |
| Rule polish | 0.440 -> 0.405 | 12.3 -> 11.3 | 1.03 -> 1.03 | 0.646 -> 0.646 |
| Bakeoff | 0.440 -> 0.429 | 12.3 -> 12.0 | 1.03 -> 1.08 | 0.646 -> 0.629 |
| Moves augmented | 0.440 -> 0.452 | 12.3 -> 12.7 | 1.03 -> 1.02 | 0.646 -> 0.646 |
| Drift surgical | 0.440 -> 0.440 | 12.3 -> 12.3 | 1.03 -> 1.03 | 0.646 -> 0.646 |

The "moves" and rule-polish failures are important: forcing explicit voice
rules removes precisely the roughness that made the baseline better.

## Answer

The current articles have a unique generated voice, but not yet a uniquely
human author voice. They are Pete-ish, coherent, and much better than generic
LLM output. They also carry measurable generator habits:

1. Too many short sentence punchlines.
2. Too many em-dashes.
3. Too few parenthetical asides, direct quotes, lowercase starts, and list-like
   structures compared with seed.
4. Repeated rhetorical scaffolds across articles.
5. LLM same-author checks fail most outputs even when code metrics pass.

## Recommendation

Do not adopt the heavy voice-pipeline constraints. The best path is:

1. Keep `simple_writer` as the core.
2. Add diagnostic scalpel only as an opt-in, gated by before/after judge and
   minimal diff.
3. Add an anti-fingerprint gate before publication:
   - max document frequency for repeated 5-grams and 6-grams, excluding
     Scripture quotes
   - max repeated scaffold phrase frequency across a batch
   - drift gate on the systematic OOB metrics above
   - same-author judge must pass on a blind mid-article excerpt, not only code
     stylometry
4. Increase the Pete seed corpus materially. The current 5,256 words are not
   enough to separate genuine voice from prompt-induced caricature; target
   20k-50k words.
5. Fix the slop remnant regex so ordinary "I cannot..." pastoral sentences do
   not hard-fail.
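The repeated n-gram check in step 3 could look roughly like the sketch below. The tokenizer, the threshold, and the allowlist mechanism for Scripture quotes are all assumptions made for illustration; none of them come from the existing pipeline.

```python
from __future__ import annotations

import re
from collections import Counter


def ngrams(text: str, n: int) -> set[str]:
    # Lowercased word n-grams; punctuation is dropped by the tokenizer.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def repeated_ngrams(docs: list[str], n: int = 5, max_doc_fraction: float = 0.1,
                    allowlist: set[str] | None = None) -> dict[str, int]:
    """Return n-grams whose document frequency exceeds the batch threshold.

    allowlist holds n-grams to ignore, for example ones built from quoted
    Scripture passages (a hypothetical way to implement the exclusion).
    """
    allowlist = allowlist or set()
    doc_freq: Counter[str] = Counter()
    for doc in docs:
        # ngrams() returns a set, so each document counts at most once.
        doc_freq.update(ngrams(doc, n) - allowlist)
    limit = max_doc_fraction * len(docs)
    return {g: c for g, c in doc_freq.items() if c > limit and c > 1}


batch = [
    "I want to be careful here about claims.",
    "Again I want to be careful with words.",
    "Nothing repeated in this one at all.",
]
print(repeated_ngrams(batch, n=5, max_doc_fraction=0.5))
```

A gate built on this would fail the batch when the returned dictionary is non-empty. Because the check is batch-level, it has to run over the whole publication set, not article by article.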

Adopt: diagnostic scalpel as optional.

Reject: moves-augmented rewrite, drift forcing, aggressive rule polish.

Open question: whether a larger seed corpus plus batch-level repetition gates
can reduce the synthetic profile without losing the strong current prose.