drift-surgical/REPORT.md

Source: /Users/borker/dev/hybrid-blog-writer-26/.claude/worktrees/agent-acfcd407f3a8a6b3d/experiments/drift-surgical/REPORT.md

# Strategy 3 — Drift-aware surgical edit

Compute drift against the Pete-Nicholas seed profile, name the top-5 out-of-bounds (OOB) metrics with actual vs target values, and ask `anthropic/claude-sonnet-4.6` (temp 0.3) for minimum surgical edits. Accept the edit only when the OOB count strictly drops, the length stays at ≥95% of the original, and the slop rate does not rise. If the edit is accepted, then run `simple_writer.pipeline.editor_pass(em_dash_target_rate=0.003)`.
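
The three acceptance conditions above can be sketched as a small gate function. This is a minimal illustration, not the pipeline's actual code: `DraftStats`, `gate`, and `min_length_ratio` are hypothetical names, and "length" is assumed here to mean word count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftStats:
    """Measurements for one version of an article (hypothetical container)."""
    oob_count: int     # number of out-of-bounds drift metrics
    word_count: int    # assumed stand-in for "length"
    slop_rate: float   # slop hits per 1000 words


def gate(before: DraftStats, after: DraftStats, min_length_ratio: float = 0.95):
    """Return (accepted, reasons); the edit is accepted only if no check fails."""
    reasons = []
    # Strict drop: an unchanged OOB count is a rejection.
    if after.oob_count >= before.oob_count:
        reasons.append(f"OOB did not drop ({before.oob_count}→{after.oob_count})")
    # The edited text must keep at least 95% of the original length.
    if after.word_count < min_length_ratio * before.word_count:
        reasons.append(f"length fell below {min_length_ratio:.0%}")
    # Slop may stay flat or fall, but must not rise.
    if after.slop_rate > before.slop_rate:
        reasons.append("slop rose")
    return (not reasons, reasons)


# The nando row from the summary table: nothing changed, so the gate rejects.
nando = DraftStats(oob_count=12, word_count=2677, slop_rate=0.0)
accepted, reasons = gate(nando, nando)
```

Because the OOB check is strict, an edit that leaves every metric where it was is rejected even though length and slop are fine.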

## Headline

- Accepted by gate: **0/3**
- Judge preferred AFTER: **0/3** (preferred BEFORE: 0/3; all three were ties)
- Net effect on an 88-91/100 baseline: **no measurable lift**. The strict gate (OOB count must strictly drop) almost never passes because the dominant offenders are zero-valued floor metrics (`exclamation_rate`, `sentence_start_lowercase_ratio`, `list_item_ratio`) that a faithful copy editor will not manufacture out of thin air. In an earlier run where the gate did pass on one article, the LLM judge still preferred the original, saying the surgical version's typography (dash spacing, inserted bullets) felt editorially imposed.
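
To illustrate why a strict drop in OOB count is hard to achieve, here is a minimal sketch of OOB counting as a step function. The report does not give the profile's bounds, so the ±50% band around each target is purely hypothetical, as are the `partial` and `fixed` edits; only the `before` values and targets come from the nando tables below.

```python
# Targets and "before" values as reported for the nando article.
TARGETS = {
    "short_sentence_ratio": 0.1344,
    "exclamation_rate": 0.0006,
    "parenthesis_rate": 0.0156,
    "sentence_start_lowercase_ratio": 0.0119,
    "list_item_ratio": 0.0396,
}

# HYPOTHETICAL bounds: the real profile's bands are not stated in the report.
BOUNDS = {m: (0.5 * t, 1.5 * t) for m, t in TARGETS.items()}


def count_oob(values: dict) -> int:
    """Count metrics whose value falls outside its [low, high] band."""
    return sum(
        1 for metric, value in values.items()
        if not (BOUNDS[metric][0] <= value <= BOUNDS[metric][1])
    )


before = {
    "short_sentence_ratio": 0.2934,
    "exclamation_rate": 0.0,
    "parenthesis_rate": 0.0,
    "sentence_start_lowercase_ratio": 0.0,
    "list_item_ratio": 0.0,
}

# Hypothetical edit that improves short_sentence_ratio but stays outside the band.
partial = {**before, "short_sentence_ratio": 0.25}
# Hypothetical edit that moves short_sentence_ratio inside the band.
fixed = {**before, "short_sentence_ratio": 0.19}

counts = (count_oob(before), count_oob(partial), count_oob(fixed))
```

Under these assumed bands, a partial improvement leaves the count unchanged, and the zero-valued metrics stay out of bounds unless the edit adds exclamation marks, parentheses, lowercase sentence starts, or list items.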

## Summary

| article | accepted | drift_score | OOB | words | slop | judge |
| --- | --- | --- | --- | --- | --- | --- |
| nando | no | 0.4286→0.4286 | 12→12 | 2677→2677 | 0.0→0.0 | tied |
| speaking | no | 0.3214→0.3214 | 9→9 | 2926→2926 | 1.709→1.709 | tied |
| freemasonry | no | 0.5714→0.5714 | 16→16 | 2914→2914 | 1.373→1.373 | tied |
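
The drift scores in this table are numerically consistent with the OOB count divided by a fixed total of 28 metrics. The report does not state that formula or the total, so the sketch below is an inference from the numbers, and `TOTAL_METRICS` and `drift_score` are hypothetical names.

```python
TOTAL_METRICS = 28  # ASSUMPTION: inferred from the table, not stated in the report


def drift_score(oob_count: int, total_metrics: int = TOTAL_METRICS) -> float:
    """Fraction of profile metrics that are out of bounds, rounded to 4 places."""
    return round(oob_count / total_metrics, 4)


# (OOB count, reported drift score) for each article in the summary table.
reported = {
    "nando": (12, 0.4286),
    "speaking": (9, 0.3214),
    "freemasonry": (16, 0.5714),
}

recomputed = {name: drift_score(oob) for name, (oob, _) in reported.items()}
```

If this reading is right, the drift score can only move when the OOB count moves, which is why both columns are frozen together in every row.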

## nando-de-freitas-unveils-interventional-sft-for-agent-models-what-it-means-when.md

- **Drift score**: 0.4286 → 0.4286
- **OOB count**: 12 → 12
- **Word count**: 2677 → 2677
- **Slop rate** (per 1000 words): 0.0 → 0.0
- **Surgical edit accepted**: no — OOB did not drop (12→12)

### Top 5 offenders — before

| metric | value | target | delta |
| --- | --- | --- | --- |
| `short_sentence_ratio` | 0.2934 | 0.1344 | +0.1590 |
| `exclamation_rate` | 0.0 | 0.0006 | -0.0006 |
| `parenthesis_rate` | 0.0 | 0.0156 | -0.0156 |
| `sentence_start_lowercase_ratio` | 0.0 | 0.0119 | -0.0119 |
| `list_item_ratio` | 0.0 | 0.0396 | -0.0396 |
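
The `delta` column is `value - target`. The row order in these tables is consistent with ranking by relative deviation, `|delta| / target`, with ties kept in input order. The report does not state the ranking rule, so the following is a sketch under that assumption, and `rank_offenders` is a hypothetical helper.

```python
def rank_offenders(values: dict, targets: dict, top_n: int = 5) -> list:
    """Rank metrics by relative deviation from target (assumed rule).

    Returns (metric, value, target, delta) tuples, worst first.
    Assumes every target is positive.
    """
    rows = []
    for metric, value in values.items():
        target = targets[metric]
        delta = value - target
        relative = abs(delta) / target
        rows.append((relative, metric, value, target, round(delta, 4)))
    # list.sort is stable, also with reverse=True, so ties keep input order.
    rows.sort(key=lambda row: row[0], reverse=True)
    return [row[1:] for row in rows[:top_n]]


# Values and targets from the nando "before" table above.
targets = {
    "exclamation_rate": 0.0006,
    "parenthesis_rate": 0.0156,
    "sentence_start_lowercase_ratio": 0.0119,
    "list_item_ratio": 0.0396,
    "short_sentence_ratio": 0.1344,
}
values = {
    "exclamation_rate": 0.0,
    "parenthesis_rate": 0.0,
    "sentence_start_lowercase_ratio": 0.0,
    "list_item_ratio": 0.0,
    "short_sentence_ratio": 0.2934,
}

top5 = rank_offenders(values, targets)
```

Any metric sitting at exactly 0.0 has a relative deviation of 100%, so zero-valued floor metrics crowd the top of the list regardless of how small their absolute delta is.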

### Top 5 offenders — after

| metric | value | target | delta |
| --- | --- | --- | --- |
| `short_sentence_ratio` | 0.2934 | 0.1344 | +0.1590 |
| `exclamation_rate` | 0.0 | 0.0006 | -0.0006 |
| `parenthesis_rate` | 0.0 | 0.0156 | -0.0156 |
| `sentence_start_lowercase_ratio` | 0.0 | 0.0119 | -0.0119 |
| `list_item_ratio` | 0.0 | 0.0396 | -0.0396 |

### Judge verdict

- **Winner**: tied
- **Before scores**: voice 88, prose 87, argument 85
- **After scores**:  voice 88, prose 87, argument 85

### Assessment

Surgical edit rejected (OOB did not drop, 12→12); judge picked **tied**. The two versions are substantively identical in content, structure, and wording. The 'Augustine on Disordered Loves' section first appeared to be missing from the after version, but on close inspection both versions contain it. Every paragraph, sentence, and rhetorical move is reproduced verbatim, making any differential scoring impossible. These are the same article.

## speaking-in-tongues-what-1-corinthians-12-and-14-actually-argue-and-whether-the.md

- **Drift score**: 0.3214 → 0.3214
- **OOB count**: 9 → 9
- **Word count**: 2926 → 2926
- **Slop rate** (per 1000 words): 1.709 → 1.709
- **Surgical edit accepted**: no — OOB did not drop (9→9)
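
The slop rate is reported per 1000 words. As a worked check, the reported rates match whole-number hit counts at the reported word counts. The hit counts of 5 and 4 are inferred from the arithmetic and do not appear in the report, and `slop_rate` is a hypothetical helper.

```python
def slop_rate(hits: int, word_count: int) -> float:
    """Slop hits per 1000 words, rounded to 3 places."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return round(1000 * hits / word_count, 3)


# Inferred hit counts (ASSUMPTION) paired with the reported word counts.
speaking_rate = slop_rate(5, 2926)     # report lists 1.709
freemasonry_rate = slop_rate(4, 2914)  # report lists 1.373
nando_rate = slop_rate(0, 2677)        # report lists 0.0
```

At this article length a single slop hit moves the rate by roughly 0.34, so "slop did not rise" effectively means the edit may not add even one flagged phrase.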

### Top 5 offenders — before

| metric | value | target | delta |
| --- | --- | --- | --- |
| `exclamation_rate` | 0.0 | 0.0006 | -0.0006 |
| `sentence_start_lowercase_ratio` | 0.0 | 0.0119 | -0.0119 |
| `list_item_ratio` | 0.0 | 0.0396 | -0.0396 |
| `short_sentence_ratio` | 0.2365 | 0.1344 | +0.1021 |
| `semicolon_rate` | 0.0007 | 0.0027 | -0.0020 |

### Top 5 offenders — after

| metric | value | target | delta |
| --- | --- | --- | --- |
| `exclamation_rate` | 0.0 | 0.0006 | -0.0006 |
| `sentence_start_lowercase_ratio` | 0.0 | 0.0119 | -0.0119 |
| `list_item_ratio` | 0.0 | 0.0396 | -0.0396 |
| `short_sentence_ratio` | 0.2365 | 0.1344 | +0.1021 |
| `semicolon_rate` | 0.0007 | 0.0027 | -0.0020 |

### Judge verdict

- **Winner**: tied
- **Before scores**: voice 88, prose 86, argument 84
- **After scores**:  voice 88, prose 86, argument 84

### Assessment

Surgical edit rejected (OOB did not drop, 9→9); judge picked **tied**. The two texts are identical in every material respect, including phrasing, paragraph structure, section headings, and even the retention of the colloquial 'pissing match' that a TGC copy editor would have flagged. No substantive revision distinguishes the after version from the before, so no differential scoring is possible.

## freemasonry-why-a-secret-society-offering-brotherhood-is-the-wrong-answer-to-a-r.md

- **Drift score**: 0.5714 → 0.5714
- **OOB count**: 16 → 16
- **Word count**: 2914 → 2914
- **Slop rate** (per 1000 words): 1.373 → 1.373
- **Surgical edit accepted**: no (OOB did not drop, 16→16). The gate also logged "slop rose (1.37→1.37)", although the slop rate reported above is unchanged (1.373→1.373).

### Top 5 offenders — before

| metric | value | target | delta |
| --- | --- | --- | --- |
| `short_sentence_ratio` | 0.3402 | 0.1344 | +0.2058 |
| `exclamation_rate` | 0.0 | 0.0006 | -0.0006 |
| `sentence_start_lowercase_ratio` | 0.0 | 0.0119 | -0.0119 |
| `list_item_ratio` | 0.0 | 0.0396 | -0.0396 |
| `parenthesis_rate` | 0.0027 | 0.0156 | -0.0129 |

### Top 5 offenders — after

| metric | value | target | delta |
| --- | --- | --- | --- |
| `short_sentence_ratio` | 0.3402 | 0.1344 | +0.2058 |
| `exclamation_rate` | 0.0 | 0.0006 | -0.0006 |
| `sentence_start_lowercase_ratio` | 0.0 | 0.0119 | -0.0119 |
| `list_item_ratio` | 0.0 | 0.0396 | -0.0396 |
| `parenthesis_rate` | 0.0027 | 0.0156 | -0.0129 |

### Judge verdict

- **Winner**: tied
- **Before scores**: voice 88, prose 87, argument 85
- **After scores**:  voice 88, prose 87, argument 85

### Assessment

Surgical edit rejected: OOB did not drop (16→16), and the gate logged "slop rose (1.37→1.37)" even though the reported slop rate is unchanged at 1.373. Judge picked **tied**. These two texts are identical in every material respect: same sentences, same structure, same punctuation, same content throughout all sections. No editorial changes of any kind are detectable between the before and after versions, making scoring differentiation impossible and a winner declaration meaningless.