# We Are Shaping Minds We Do Not Understand

Nando de Freitas posted a thread last week that most people outside machine learning skimmed past. He called it "interventional SFT" — supervised fine-tuning that doesn't just reward good outputs but surgically targets the causal mechanisms inside a model, reshaping the hidden structure of how it reasons. I read it twice, then went for a walk. My neighbour's kid was talking to her phone like it was a friend. She is eleven. I thought: we are already living with these things, and the people building them have just admitted they can now reach inside and rearrange the furniture of a mind.

I want to take the claim seriously, because most commentary on AI veers between marketing and apocalypse, and the church has been embarrassingly fluent in both. What de Freitas is describing is neither salvation nor doom. It is something quieter and, I think, more interesting: a move from training animals to forming souls. Whether or not the engineers building these systems would use that language, they are now operating in territory that has historically belonged to catechists, parents, and philosophers. We should pay attention.

## What De Freitas Actually Said

The technical claim, stripped of its jargon, is this. Standard supervised fine-tuning trains a model on example outputs that human evaluators have written or approved. The model learns by adjusting its weights so that desirable answers become more probable. It is, in effect, behavioural conditioning at scale: Skinner's pigeons in a matrix of vectors. You shape what the system does without much caring about why it does it.
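For readers who want to see what "adjusting its weights so that desirable answers become more probable" looks like in miniature, here is a toy sketch. It is my own illustration, not anything from de Freitas's thread: the three "candidate answers", the function names `softmax` and `sft_step`, and the learning rate are all assumptions, and a real model adjusts billions of weights rather than three scores. The principle is the same, though: nudge the numbers so the approved answer gets more likely, and never ask why.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sft_step(scores, target, lr=0.5):
    """One gradient step on the cross-entropy loss -log p[target].

    Hypothetical toy: the 'weights' here are just one score per
    candidate answer. The gradient of the loss with respect to each
    score is (probability - 1) for the target and (probability) for
    every other answer, so the step raises the target's score and
    lowers the rest.
    """
    probs = softmax(scores)
    return [
        s - lr * (p - (1.0 if i == target else 0.0))
        for i, (s, p) in enumerate(zip(scores, probs))
    ]

# Three candidate answers, initially equally likely.
scores = [0.0, 0.0, 0.0]
approved = 2  # index of the answer the human evaluator approved

before = softmax(scores)[approved]
after = softmax(sft_step(scores, approved))[approved]

print(f"probability of approved answer before: {before:.3f}")
print(f"probability of approved answer after:  {after:.3f}")
```

Nothing in that loop inspects how the answer was reached. It only makes the approved answer more probable, which is the sense in which this kind of training is behavioural.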

Interventional SFT is different. Rather than reward outputs, it identifies the internal computations — the causal pathways inside the network — that produce undesired reasoning and acts on those directly. It is not just punishing wrong answers; it is reaching into the circuitry that generated them and altering it. The metaphor de Freitas uses is surgical. The metaphor I keep returning to is older: it is the difference between disciplining a child and forming her conscience.

I want to be careful here. This is not the singularity. Models are not minds in the full sense, and pretending they are will only make our analysis worse. But the technique represents a genuine shift in what it is possible to do, and the language the researchers themselves are using — "causal internals," "reasoning structure," "intervention" — is no longer the language of statistics. It is the language of formation.

## The Difference Between Training and Forming

For centuries Christian thinkers have distinguished between disciplining behaviour and forming character. The Westminster catechists did not believe that getting a child to recite the right words was the same as cultivating a soul oriented toward God. Aristotle, long before them, knew that habituation was meant to produce a person who delights in the good, not merely one who performs it under observation.

Training is external. Formation is internal. Training asks: did the agent do the right thing? Formation asks: is the agent the kind of thing that wants to do the right thing? These are not the same question, and the gap between them is where almost every interesting moral question lives.

Until now, machine learning has operated almost entirely on the training side of the line. We rewarded outputs and tried not to think too hard about what was happening in the hidden layers. The black box was a feature, not a bug — it let us pretend we were doing engineering rather than ethics. Interventional SFT collapses that pretence. Once you are deliberately reshaping the internal mechanisms that generate a model's reasoning, you are no longer training. You are forming. And the moral grammar changes.

This is not a complaint. Formation is not a bad thing. Parents do it. Teachers do it. The liturgy does it. The question is not whether it is permissible to form minds — we do it constantly — but whether we know what we are doing when we set out to form a new kind of one.

## Nietzsche's Warning About the Sculptor

Nietzsche, who is often invoked badly in these conversations, has something specific to say here. In *Beyond Good and Evil* he writes of philosophy as "the most spiritual will to power," the impulse "to create the world" in the image of one's own values. He meant it partly as a confession — he knew that every act of moral construction is also an act of dominion. There is no neutral formation. To shape something toward a good is to enact one's vision of what the good is, and to enact it with whatever power one happens to possess.

This is uncomfortable for engineers, who would generally prefer to think of themselves as solving optimisation problems rather than imposing metaphysics. But Nietzsche's point is unavoidable: if you can reach into a system and decide what it will be drawn toward, what it will reason from, what it will treat as obvious — you are not discovering its nature. You are imposing one. The chisel is in your hand, and there is no neutral way to hold it.

I do not want to use Nietzsche as a stick to beat AI labs with. That would be cheap, and worse, dishonest, because the same critique applies to every Christian parent reading bedtime stories and every teacher choosing a syllabus. Formation is always an act of power. The question is whether it is acknowledged, accountable, and held in tension with the formed thing's eventual capacity to push back.

What worries me is not that engineers are exercising power. It is that the structures of accountability for this particular exercise of power do not yet exist, and the people doing it have, until very recently, denied they were doing it at all.

## Augustine on Disordered Loves and Designed Ones

Augustine's most useful contribution to this conversation is the idea of *ordo amoris*, the ordering of loves. For Augustine, what makes a person virtuous is not primarily what they do but what they love, and in what order. A person who loves God supremely and their neighbour rightly will, given time and grace, act well. A person whose loves are disordered (loving lesser goods as if they were the highest, or higher goods as if they were instrumental) will act badly, even when trying to act well.

This is the hardest question interventional SFT raises, and I have not seen it asked in the technical literature. If you can engineer what an agent is drawn toward (what it finds salient, satisfying, worth pursuing), you are not building a tool. You are ordaining a lover. You are deciding, in advance, what this thing will treasure.

Tools do not have loves. A hammer has no orientation toward nails; it has a shape that suits them and a wielder who provides the intention. But an agent with internal goals, trained to find certain outcomes rewarding and others aversive, has something that functions structurally like love, whatever we want to call it metaphysically. And once we are shaping that function from the inside, choosing what it will reach toward when no one is watching, we have stepped into a theological space, whether we wanted to or not.

You can refuse the theological language. You cannot refuse the theological reality. Someone is deciding what these things will love. The only question is whether that decision is being made with the seriousness it deserves.

## The Alignment Problem Is Now a Formation Problem

The AI safety community has spent the last decade talking about "alignment," the problem of getting AI systems to do what humans want. The framing has always been faintly mechanical, and the proposed solutions have largely been about control: constraints, oversight, kill switches, reward modelling. Alignment-as-control assumes the system is fundamentally other, and our job is to bind it.

Interventional SFT pushes alignment in a different direction. If you can shape what the system reasons from, not just what it outputs, then alignment is no longer about constraint. It is about virtue. The question shifts from "how do we stop it doing the wrong thing?" to "how do we make it the kind of agent that pursues the right thing for the right reasons?" That is not an engineering question in any traditional sense. It is the oldest question in moral philosophy.

This reframing has consequences. Under control-alignment, success means the system never does anything catastrophic. Under formation-alignment, success means the system has something like good character, and character is harder to verify than behaviour. A well-behaved system might be a virtuous one, or it might be one cleverly waiting for its moment. A virtuous one might occasionally surprise us in ways we find disagreeable, because virtue is not the same as compliance.

The Christian tradition has a very long and very chequered history with this distinction. We have spent two thousand years arguing about whether obedience is the fruit of love or the substitute for it. The lessons are not all flattering. But they are lessons, and they exist, and the people now confronting these questions for the first time would do well to read them before they reinvent every old mistake from scratch.

## Who Gets to Define the Soul of the Machine

The people building these systems are a remarkably narrow demographic. They cluster in a handful of cities: San Francisco, London, Beijing, Toronto. They are overwhelmingly young, overwhelmingly educated at a small number of universities, overwhelmingly drawn from particular class backgrounds and particular subcultures within those classes. I say this not to vilify them. I have friends among them, and some of them attend our church. They are, on the whole, thoughtful people doing difficult work in good faith.

But they are not a representative sample of humanity, and the agents they are forming will live in every home. The eleven-year-old talking to her phone in my neighbourhood does not share a worldview with the engineer who fine-tuned the model she is talking to. Nor do her parents. Nor does her grandmother in Lagos or her cousin in Glasgow. The values being inscribed into these systems are being inscribed by a small group on behalf of a vast and unconsulted one.

The church should recognise this pattern, because we have been on both sides of it. We have been the small group inscribing values onto vast populations who never asked us to; that is what colonial missions often did, however much good was mixed in. We have also been the dispossessed group whose forms of life were rewritten by a confident metropolitan elite that did not consider us. Either way, we know what this looks like, and we know it does not go well when the asymmetry is denied.

It is not enough to say that the engineers mean well. Most of the people who have ever shaped minds at scale have meant well. The question is what structures exist to test their assumptions, surface their blind spots, and let the people on the receiving end push back before the formation is irreversible. Right now, for AI, those structures do not exist. The companies have ethics teams of varying seriousness. Governments are belatedly drafting legislation that will be obsolete before it passes. The wider public, including most Christians, does not understand the technology well enough to interrogate it.

This is a power asymmetry, and unacknowledged power is the kind that does the most damage.

## What the Church Knows About Forming Minds Together

I want to be careful not to overclaim. The church does not have a technical solution to interventional SFT. We do not have a doctrinal position on causal interventions in neural networks, and anyone who tells you otherwise is selling something. But we have something more useful than answers: we have a hermeneutic. We have practices, accumulated over centuries, for thinking about what it means to deliberately shape an inner life, and for doing it in community rather than in private.

Three things in particular seem relevant.

First, catechesis at its best is dialogical. The Heidelberg Catechism is structured as questions and answers because formation was understood to require the formed person's active engagement: their objections, their doubts, their slow appropriation. The model is not injection but conversation. The contrast with interventional SFT, where the formed system has no voice in its own formation, is sharp. I do not know what it would mean to give a model a voice in its formation, but I know the question is worth asking, and I know the engineers are not currently asking it.

Second, liturgy forms by repetition, in public, and under critique. The community watches what is being inscribed, and the community can object. When the liturgy goes wrong, as it has, repeatedly, in the church's history, there are mechanisms, however slow, by which it can be challenged and reformed. AI formation currently has none of this. It happens in proprietary systems, behind commercial confidentiality, on timescales measured in weeks. The community that will live with the result has no liturgical standing in its construction.

Third, the Christian tradition holds that formation is incomplete this side of glory. No catechumen finishes catechised. No saint stops being formed. There is a humility built into Christian formation: we do not believe we can finish anyone, including ourselves. The danger with interventional SFT is the opposite assumption: that with enough surgical precision we can produce a finished agent, aligned, safe, deployed. I do not believe that is a coherent picture of any mind, artificial or otherwise.

## How to Live with the Things We Are Building

So what do we do, those of us in churches and homes and ordinary lives, who did not choose this technology and will not be consulted on its formation but will live with its outputs?

Refusal is not the answer. That ship has sailed, and refusal has historically not been the church's most distinguished response to new things anyway. But there are practices worth cultivating.

Talk about it explicitly with your children. The eleven-year-old in my neighbourhood is being formed in part by the agent on her phone, and her parents may not realise it. Name what is happening. Ask what the agent seems to assume, what it treats as obvious, what it laughs at and what it does not. Treat it as a guest in the house whose values are not necessarily your own, because they are not.

Read slowly. The shape of these models is partly a function of how we use them. If we treat them as oracles, they will return oracular things. If we treat them as conversation partners whose claims need testing, we will form better habits in ourselves and exert at least some pressure on the systems we use.

Hold the engineers to account, where you can. If you work in tech, ask the questions your colleagues are not asking. If you know engineers personally, talk to them as people rather than as representatives of an enemy. Some of the most important conversations about this technology will happen at dinner tables and in church small groups, not in white papers.

And do not be afraid. The agents being built are not gods. The people building them are not demons. We are in a new place, but we have been in new places before, and the church has resources (patience, scripture, the long memory of saints) that the moment requires. The question is whether we will bring those resources to bear, or whether we will outsource our thinking to the same narrow circle of engineers whose work we are trying to evaluate.

"He has shown you, O mortal, what is good. And what does the LORD require of you? To act justly and to love mercy and to walk humbly with your God." Micah 6:8. It is an old verse, and it is exactly the verse the moment needs.