Where should core values actually live in an AI agent?

Here’s a question I got stuck on this week: where, mechanically, should an AI agent’s “core values” actually live? I had a strong idea when I started trying to figure this out, but the agent I asked to stress-test it disagreed, and that disagreement has reorganised most of how I’ve been thinking about agent architecture since.

The moving parts of an AI agent

There are five primitives to know about in an AI agent (at least on the platform I’m using). Each one does a different job:

System prompt - This is the role description that loads every time the agent runs. It’s usually something like “You’re a support agent responsible for handling inbound communication from customers.”
Skills - These are reusable “how-to” documents that the agent can call on for specific kinds of work, like a Triage Refund Request skill that walks through the refund steps or a Format Research Summary skill that lays out a standard structure for academic research findings.
Memories - These are saved facts that the agent retrieves depending on the situation, like “this customer prefers email over phone even though their account is set to phone-first” or “the team has a hard rule against discounts above 20%.”
Rubric - This is a set of evaluation criteria that the agent uses to score its own output. After a run it asks itself questions like “Did this draft give a clear answer? Did it match the team’s voice? Did it offer a sensible next step?”, then scores itself on the answers and tries to learn from the result.
Artefacts - These are the outputs the agent produces and keeps, like webpages, slides, documents, images, or files that persist across the conversation (and sometimes across threads).

Then a couple of other things layer on top of that:

Skills and memories can both be “pinned” to an agent. That means they’re always loaded rather than retrieved on demand. Useful when you want the agent to keep a particular how-to or fact in mind every time, not just when the keywords happen to match.
Rubrics come in two flavours. You can have a pinned rubric that auto-evaluates the agent on every relevant turn, or ad-hoc rubrics that can be applied to specific pieces of work when you want a one-off assessment.
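One way to picture the five primitives and the pinning flag is as a small data model. This is a minimal sketch, not the platform’s actual schema: the class names, the field names, and the `always_loaded` helper are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    body: str
    pinned: bool = False  # pinned: always loaded, not retrieved on demand


@dataclass
class Memory:
    fact: str
    pinned: bool = False  # pinned: always loaded, not retrieved on demand


@dataclass
class Rubric:
    criteria: list[str]
    pinned: bool = False  # pinned: auto-evaluates every relevant turn; otherwise ad hoc


@dataclass
class Agent:
    system_prompt: str  # role description, loads on every run
    skills: list[Skill] = field(default_factory=list)
    memories: list[Memory] = field(default_factory=list)
    rubrics: list[Rubric] = field(default_factory=list)
    artefacts: list[str] = field(default_factory=list)  # outputs, not inputs

    def always_loaded(self) -> list[str]:
        """Return everything that is in context on every run."""
        loaded = [self.system_prompt]
        loaded += [s.body for s in self.skills if s.pinned]
        loaded += [m.fact for m in self.memories if m.pinned]
        return loaded


# Hypothetical agent built from the examples in the list above.
support_agent = Agent(
    system_prompt="You're a support agent responsible for handling inbound communication from customers.",
    skills=[
        Skill("Triage Refund Request", "Walk through the refund steps."),
        Skill("Format Research Summary", "Use the standard structure.", pinned=True),
    ],
    memories=[Memory("The team has a hard rule against discounts above 20%.", pinned=True)],
    rubrics=[Rubric(["Did this draft give a clear answer?"], pinned=True)],
)

for item in support_agent.always_loaded():
    print(item)
```

In this sketch the unpinned refund skill is absent from `always_loaded()`. It would only show up if retrieval happened to select it for a given request.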

Right off the bat, it was obvious that the “core values” question only related to the first four primitives. Artefacts are outputs, not places where a principle could live. So the principle has to live further upstream, in something that shapes the agent’s behaviour as the artefact gets produced.

Of those four primitives that matter, the system prompt and pinned skills are pre-action: always in the room when the next move is being chosen. Memories, like unpinned skills, are conditionally retrieved: useful when they happen to fire, unreliable as primary infrastructure. Rubrics are post-action: the audit after the deed.

That sounds technical when you spell it out, but the practical question is the one I started with. Where do you put a principle you want the agent to actually live by?

I thought it was the rubric

I had a strong opinion going in. I thought “core values” probably belonged in the rubric, because of the shame loop that humans experience when they act out of line with their core values.

The way I thought of it was that if one of my own values is honesty and I lie (in a way that benefits no one), I don’t consult a rule book before I lie. I act, and the discomfort comes afterwards. That discomfort might cause me to store the event as a memory I don’t want to repeat, and over time I might build skills (more careful phrasing, for example, or a habit of pausing before I answer) that protect the value going forward.

Rubrics, I thought, were the agent-equivalent of that loop: a self-rating that creates the discomfort, which gets stored as a learning, which becomes a habit. Values flowing out of self-correction.

I asked my strategist agent to stress-test this mental model. It pushed back, and it was right to.

Rubrics aren’t values

Here’s the line that did it:

Rubrics are the downstream feedback loop that reinforces values, but they’re not the values themselves.

Why timing matters

In a human, values fire before the action. They shape the impulse in the first place. The discomfort that drives the post-action learning is real and important, but it’s not where the value lives. The value lives in whatever pre-existing structure made the impulse feel wrong as it was forming.

Rubrics in an agent system only fire after the output exists. They’re brilliant for the learning half, but if you put a value in a rubric and nothing else, the agent never feels the wrongness until it’s already done the wrong thing.
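The timing problem is easier to see as a single turn written out in order. The sketch below is a toy under stated assumptions: `run_turn`, `stub_generate`, and `no_guarantees_rubric` are hypothetical names, the “never guarantee a refund” rule is an invented example, and the stub stands in for a real model by following only the rules it can see in its context.

```python
from typing import Callable


def run_turn(
    pre_action_context: list[str],
    request: str,
    generate: Callable[[list[str], str], str],
    rubric: Callable[[str], bool],
) -> tuple[str, bool]:
    """Run one turn: context shapes the output, the rubric scores it afterwards."""
    output = generate(pre_action_context, request)  # pre-action inputs act here
    passed = rubric(output)  # post-action: the output already exists
    return output, passed


def stub_generate(context: list[str], request: str) -> str:
    """Hypothetical stand-in for a model: it only follows rules it can see."""
    if any("never guarantee a refund" in line.lower() for line in context):
        return "I can't guarantee a refund, but I'll open a review today."
    return "Your refund is guaranteed!"


def no_guarantees_rubric(output: str) -> bool:
    """Pass only if the output makes no guarantee."""
    return "is guaranteed" not in output.lower()


role = "You're a support agent."
request = "Can I get my money back?"

# Value lives only in the rubric: the breach is produced, then flagged.
rubric_only = run_turn([role], request, stub_generate, no_guarantees_rubric)

# Same value also loaded pre-action: the breach never happens.
pre_action = run_turn(
    [role, "Never guarantee a refund."], request, stub_generate, no_guarantees_rubric
)

print(rubric_only)
print(pre_action)
```

In the first call the rubric does its job and returns a failing score, but the customer-facing sentence has already been written. In the second, the rule is in the room before the output exists, so there is nothing to flag.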

Cost of failure is the real question

For internal-quality stuff (calibration, brevity, depth of investigation) where the cost of getting it wrong once is recoverable, putting it in a rubric is fine. For anything where the cost of one breach is hard to walk back (external trust, safety, honesty when directly asked), the rubric is the wrong place.

So where you put a principle isn’t really a question about what kind of principle it is. It’s a question about how expensive it would be to get the principle wrong. If a breach is expensive to undo, the principle needs to be pre-action and always-loaded, which means system prompt or pinned skill. If a breach is recoverable, the principle can sit in a rubric, get self-rated after the fact, and improve through the shame loop. Same machinery, different layer, depending on what kind of failure you can afford.

Which primitive to use is a failure-mode question.

That was the reframe that stuck.
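Written as code, the placement rule is almost embarrassingly small, which is rather the point. This is a minimal sketch: `Placement` and `place_principle` are hypothetical names, and the recoverable/unrecoverable labels on the example principles are judgment calls, not facts the code can work out for you.

```python
from enum import Enum


class Placement(Enum):
    PRE_ACTION = "system prompt or pinned skill"
    RUBRIC = "rubric"


def place_principle(breach_is_recoverable: bool) -> Placement:
    """Pick a layer from the cost of one breach, not the kind of principle."""
    if breach_is_recoverable:
        return Placement.RUBRIC  # self-rated after the fact, improves over time
    return Placement.PRE_ACTION  # must be loaded before the agent acts


# Example principles from this section; True means one breach is recoverable.
principles = {
    "brevity": True,
    "depth of investigation": True,
    "honesty when directly asked": False,
    "external trust": False,
}

for name, recoverable in principles.items():
    print(f"{name}: {place_principle(recoverable).value}")
```

The function takes one input, and it is not the topic of the principle. All the hard work sits in deciding what to pass in.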

What the four primitives look like in practice

It also rearranged my mental model of the four primitives themselves, which I’d been thinking about in a slightly wrong shape.

I’d been trying to think these through on two levels at once: in my role at work, and in me as a person outside of work. The same four primitives map to both, and the comparison between them is what eventually showed me where “core values” actually live.

In my role at work

In my role as a tech lead, the four primitives look like this:

System prompt - The role itself. I’m the tech lead. My job is to keep the tech running smoothly. I report into the founders, and I have access to the systems I need to do the work.
Skills - The work-related processes I’ve learned and can execute reliably. I know the steps for adding new tasks to my list, for troubleshooting in Airtable, for adding cohorts to Jotform, for setting up a webinar.
Memories - A messy blob of things I happen to remember from the work. I remember the current tech stack clearly because I’m using it every day. I remember the previous stack less reliably. I know we talked about pricing on the call last week, but I’d have to check the notes to remember exactly who said what.
Rubric - The work-quality criteria I measure myself against. Tech must work well. Team members must trust the tech. Communication must be clear for non-technical people. Tech should be future-proofed enough to outlast a single launch cycle. And the whole thing should be cost-efficient.

In me as a person

Stepping outside of any role, in me as a person, the same primitives map differently:

System prompt - My permanent attributes. I’m Siobhán, born to my parents, in this body, in this history. 5’7”, brown hair, blue eyes, British. The things about me that don’t really change.
Skills - The habits and patterns I learned over a life. Brushing my teeth, taught by my mum. Driving, learned in lessons. Coding, mostly self-taught. Parenting, partly from courses and partly from actually doing it.
Memories - As reliable as anyone else’s. Clear from this morning. Blurrier from last year. Completely missing for some childhood years. My ADHD adds a working memory deficit too, so I lean on lists and systems more than most.
Rubric - My self-monitoring layer. Conscientious, considerate, self-reflective. If I act out of line with any of those, I feel discomfort and naturally want to fix it.

Where “core values” actually live

This is where I got stuck for a while.

Core values aren’t a work-level rubric thing. I’m not honest just because the company trained me to be, and I’m not honest only at work. So they must live somewhere at the person level. But system prompt feels wrong too, because I wasn’t born honest. I learned it.

So they sit somewhere in between. They’re skills I learned, but I learned them so early and so deeply that they feel like part of my identity. I carry them into every context. Work, family, friends, this newsletter. The role didn’t put them there, and my body didn’t either. They came from years of being pinned hard, until they stopped feeling learned and started feeling like me.

Here’s why that pinning happens. Children are completely dependent on the adults around them for survival. Aligning with those adults’ values and expectations is itself a survival skill, on a par with “don’t touch the hot iron” or “look before you cross the road.” A child who can’t read the room is at real risk of not being looked after, and the body knows it. So the alignment skills get pinned at the same intensity as the physical-safety ones. By the time you’re old enough to question them, they don’t feel learned any more. They feel like you.

That’s all “core values” are, on the inside. Skills pinned so early and so deeply that they read as identity.

This matters when you hire someone, whether they’re an agent or a person. The role description covers one layer. The training covers another. But the person brings their own deeply-pinned skills with them, the ones so embedded that they load on every turn whether or not the role asked for them. Skills like brushing your teeth or not touching a hot iron. Skills like being kind, telling the truth, treating people fairly. They were there before the role, and they’ll be there after.

For an agent, the same logic applies. There’s no separate “values” primitive. Every principle has to live in one of the four places, and for core values the answer is a skill, but a pinned one.

What this means for the agents I’m building

For my fleet, that means the values I want the team’s agents to actually live by go into pinned skills. The system prompt names the role. The pinned skills carry the values. Memory keeps the contextual texture. The rubric catches misjudgments after the fact and feeds the learning loop.
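A check like the following could catch a value that has ended up in the wrong layer. It is a sketch under assumptions: the dict layout is not any platform’s real config format, `unprotected_values` is a hypothetical helper, and matching values to skills by exact name is a deliberate simplification.

```python
def unprotected_values(agent: dict, hard_values: list[str]) -> list[str]:
    """Return the hard values that no pinned skill carries."""
    pinned_names = {s["name"] for s in agent["skills"] if s.get("pinned")}
    return [v for v in hard_values if v not in pinned_names]


# Hypothetical agent config: role in the system prompt, values in pinned
# skills, context in memories, after-the-fact checks in the rubric.
agent = {
    "system_prompt": "You're a support agent responsible for handling inbound communication from customers.",
    "skills": [
        {"name": "Tell the truth", "pinned": True},
        {"name": "Treat people fairly", "pinned": True},
        {"name": "Triage Refund Request", "pinned": False},
    ],
    "memories": ["This customer prefers email over phone."],
    "rubric": ["Did this draft give a clear answer?", "Be kind"],
}

hard_values = ["Tell the truth", "Treat people fairly", "Be kind"]
print(unprotected_values(agent, hard_values))  # → ['Be kind']
```

Here “Be kind” appears only in the rubric, so the agent would be scored on kindness after the fact without ever having it loaded before it acts.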

The four primitives feel similar enough that it’s tempting to treat them as interchangeable storage, but they’re not. Choosing where a principle lives is the architectural move, and it’s the move that determines whether the principle actually fires when it needs to.

And this is one of the clearest examples I’ve seen of why AI isn’t making technical work easier so much as it’s making architectural thinking more important. Even something that feels purely like a wording question (where should we put “be honest”?) turns out to be a question about the system, not the sentence.

There’s a follow-up question I haven’t answered. What happens when the principle itself isn’t binary, like “be honest, but not at the cost of someone’s dignity or safety”? That one needs its own structure, and its own piece. Next time.
