Taste as Infrastructure

You can’t encode taste. The moment you write the rule, taste moves somewhere the rule can’t reach. So stop trying to build a system that has taste, and build the orchestration that makes a single pers

Jun 30, 2026

Taste as Infrastructure: Building Judgement into AI Systems

The piece had passed everything. Brand voice, on message, claims accurate, tone right, structure clean. Every check we had built came back green. Then someone on my team read it and said the thing no rubric can say: this is worse than what we used to ship before we had any of this. He was right, and that was the problem. We had taught a system to recognize good work, and the first thing it reliably produced was competent, on-spec, forgettable work that cleared the bar we had written and sat well below the bar we used to hold in our heads. The dashboard said we had succeeded. We had built a machine for scaling the absence of taste, and it was working.

I build content quality infrastructure at Typeface, so I should be transparent that this is the problem space I work in. It is also where I learned that almost everyone, including me at the start, walks into this with the wrong goal and a green dashboard to prove it.

The goal almost everyone gets wrong

The instinct, once production is free and judgment is the bottleneck, is to teach the machine taste. Build the model or the system that can tell good from bad, and the scarcity problem solves itself. That goal is impossible, and chasing it is how you end up where we were, shipping fluent mediocrity at volume with the meters all in the green.

It’s impossible for a reason that took me too long to see. You can’t encode taste, because the act of encoding it is what moves taste out of reach. Paul Bakaus says taste can’t be lab-grown because the target keeps moving, and the usual reading of that is gentle: fashion drifts, purple gradients in 2022, beige serif in 2026, so the rules go stale. That is true and it is not the sharp part. The target does not drift on its own. We move it, and we move it precisely by pinning it down. Encode “strong visual hierarchy” as a check and within a quarter every piece has strong visual hierarchy and none of it is interesting, because the instant a quality becomes a rule everyone can clear, clearing it stops being a mark of taste and becomes the price of entry. Taste is, by definition, the part of judgment nobody has written down yet. Write a piece of it down and that piece stops being taste. Sarah Guo’s line, a thing you can measure is a thing you can train against, is the same law seen from the model’s side: the measurable surface commoditizes, and taste is whatever still sits above it.

I know this because I built the cage first. The earliest version of our system did exactly what the instinct says to do. It read a brand’s unstructured material, the decks and guidelines and back catalog, and codified all of it into a structured set of rules. Then, to make something new, it pulled the applicable subset of those rules and generated from them. It was elegant on paper. In practice everything it produced was on-brand and flat. “Meh” was the word my team kept reaching for. We’d turned the brand into a box and asked for creativity inside it, and the box was the problem.

The single biggest jump in quality we ever got came from stopping that. Now the system writes something interesting first, with room to be wrong, and only then runs the brand rules over the draft, catching what broke and fixing those spots surgically while leaving the rest alone. Same rules, opposite end of the pipeline. Generated from, they produce compliance. Applied afterward, they produce work that’s compliant and alive. That one inversion is the whole thesis in a single design decision: the encoded standard is a floor you clear at the end. It’s the last thing the work touches, and it should never be the seed it grows from.
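A minimal sketch of that inversion, under loud assumptions: the `Rule` type, the `apply_floor` function, and the "no cheap" rule are all hypothetical names invented for illustration, not how our system is implemented. The point it shows is only the ordering: the draft already exists, and the rules touch the sentences that broke and nothing else.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """Hypothetical brand rule: detects a violation in a sentence and repairs it."""
    name: str
    violates: Callable[[str], bool]
    repair: Callable[[str], str]


def apply_floor(draft: List[str], rules: List[Rule]) -> List[str]:
    """Run the rules over a finished draft, changing only sentences that break one."""
    fixed = []
    for sentence in draft:
        for rule in rules:
            if rule.violates(sentence):
                sentence = rule.repair(sentence)
        fixed.append(sentence)
    return fixed


# Illustrative rule only: this imaginary brand never says "cheap".
no_cheap = Rule(
    name="no-cheap",
    violates=lambda s: "cheap" in s,
    repair=lambda s: s.replace("cheap", "affordable"),
)

# The draft was written first, with room to be wrong.
draft = [
    "Nobody dreams about invoices.",
    "Our plan is cheap and fast.",
]
result = apply_floor(draft, [no_cheap])
```

The first sentence comes back untouched and only the second is repaired. Generating from the rules would have constrained both before either existed.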

What the system is actually for

So if the system can’t hold taste and shouldn’t generate from it, what is it for? It takes the judgment of the few people who do have taste and makes it reach a thousand decisions those people will never personally see. It’s leverage for human judgment. In practice that turns out to be a loop, and most of what gets said about loops right now misses what makes one work.

We got the loop wrong before we got it right. We tried to perfect the first draft. Most first drafts, human or machine, land around 65 to 70% of the way there, and the energy spent forcing that first pass to 100% is mostly wasted, because the person who knows what good looks like has not entered yet. The leverage is to bring them in early and make it cheap. Let someone explore a few angles before anything is committed, see a draft, change course, try another, with every one of those moves nearly frictionless. Judgment enters as steering, and that is how one person’s taste ends up shaping work they would never have had the hours to write themselves.

Then you give the loop a target, which is where a rubric earns its keep. Its job is to define where the iteration is headed and when it is done, a different job from being the thing you generate from. There is a lot of noise about loops at the moment, most of it picturing an agent grinding away on its own until some private sense of finished kicks in. The loops that work look different: a person steering early, a rubric naming the goal, and each pass ending by clearing the floor. Explore, draft, steer, measure against the rubric, clear the floor, again.
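One way to picture that loop in code, as a sketch and not a description of any real product: every function name here (`run_loop`, `draft_fn`, `steer_fn`, `score_fn`, `floor_fn`) is an assumption, and the stubs stand in for a generator, a person, a rubric, and the brand rules. The loop stops when the person has nothing left to steer and the rubric target is met, or when it runs out of passes.

```python
from typing import Callable, List, Optional, Tuple


def run_loop(
    angle: str,
    draft_fn: Callable[[str, List[str]], str],
    steer_fn: Callable[[str], Optional[str]],
    score_fn: Callable[[str], float],
    floor_fn: Callable[[str], str],
    target: float,
    max_passes: int = 5,
) -> Tuple[str, int]:
    """Draft, steer, measure against the rubric, clear the floor, repeat."""
    notes: List[str] = []
    piece = ""
    for pass_no in range(1, max_passes + 1):
        piece = draft_fn(angle, notes)   # draft, using all steering so far
        note = steer_fn(piece)           # a person steers; None means "no change"
        score = score_fn(piece)          # the rubric names the goal
        piece = floor_fn(piece)          # each pass ends by clearing the floor
        if note is None and score >= target:
            return piece, pass_no
        if note is not None:
            notes.append(note)
    return piece, max_passes


# Stubs for illustration only.
def draft_fn(angle: str, notes: List[str]) -> str:
    return angle + "".join(f" [{n}]" for n in notes)

def steer_fn(piece: str) -> Optional[str]:
    return None if "sharper opening" in piece else "sharper opening"

def score_fn(piece: str) -> float:
    return 0.9 if "sharper opening" in piece else 0.6

def floor_fn(piece: str) -> str:
    return piece.replace("cheap", "affordable")


piece, passes = run_loop("cheap plans", draft_fn, steer_fn, score_fn, floor_fn, target=0.8)
```

The rubric appears only as a stopping condition. It never feeds the draft function, which is the distinction the paragraph above draws.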

There is no one workflow, so build for that

Here is the part that took me longest to accept. There is no single version of that loop that is right for everyone. Who gets involved and when, what you automate and what you leave to a person, what you measure, what the interim artifacts even look like, all of it has to flex to how a particular company works. The path that fits a ten-person brand team is wrong for a regulated enterprise with a legal gate in the middle, and pretending one pipeline serves both is how you end up with software nobody adopts.

So the real thing you’re building is an orchestration layer rather than a fixed assembly line. It has to let you compose a workflow with intention: this step runs automatically, here the work pauses for a human’s call, this is the interim artifact a stakeholder signs off on, the context graph feeds in what the company knows at this point, and the floor gets cleared at the end, all inside one path from blank page to shipped. Orchestration is where this whole argument lives, because it’s the only layer flexible enough to carry a different shape of judgment for every company that needs one. It’s also why you can’t buy taste infrastructure off a shelf as a finished product. The judgment it scales is yours, and so is the workflow it has to live inside.
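To make "compose a workflow with intention" concrete, here is a small sketch under stated assumptions: `Step`, `run_workflow`, and the step names are hypothetical, and a step with no `run` function is my shorthand for a point where the work pauses for a human's call. The same runner carries two differently shaped paths.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]


@dataclass
class Step:
    name: str
    run: Optional[Callable[[State], State]] = None  # None marks a human gate


def run_workflow(
    steps: List[Step], state: State, decisions: Dict[str, str]
) -> Tuple[State, Optional[str]]:
    """Run steps in order; pause at the first human gate with no recorded decision."""
    for step in steps:
        if step.run is None:
            if step.name not in decisions:
                return state, step.name          # paused, waiting on a person
            state = {**state, step.name: decisions[step.name]}
        else:
            state = step.run(state)              # automatic step
    return state, None


draft = Step("draft", lambda s: {**s, "text": "first draft"})
clear_floor = Step("clear_floor", lambda s: {**s, "floor_cleared": True})

# Two compositions of the same parts (both invented for illustration).
small_team = [draft, Step("editor_call"), clear_floor]
regulated = [draft, Step("editor_call"), Step("legal_gate"), clear_floor]

decisions = {"editor_call": "approve"}
small_state, small_paused = run_workflow(small_team, {}, decisions)
reg_state, reg_paused = run_workflow(regulated, {}, decisions)
```

With the same editor decision, the small-team path runs through to the floor while the regulated path stops at the legal gate. Nothing in the runner changed, only the composition.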

Two of those pieces are load-bearing enough to deserve their own essay, and they already have one: the context graph the system reasons over, which is the brain that holds what a company knows, and the evals that tell you whether an output cleared the bar. I wrote about both as the foundation of a content operating system in From Content Generation to Content Operating System, so I won’t relitigate them here. The point that matters for taste is narrower: the context graph is what keeps the work grounded in something real instead of fluent and hollow, and the evals are the rubric the loop steers toward. Neither one holds taste. They make the place where taste enters legible enough to build around.

One more design choice keeps that orchestrated loop from quietly freezing. If a person only sees the work when it trips a check, the rubric can never get smarter than it already is, so it hardens into last year’s bar. We inverted it. The senior reviewer now spends most of their time on the pieces that passed, not the ones that failed. You sample above the line, not at it, and every time someone reads a clean piece and says this could be sharper, here is how, that judgment becomes the next constraint and the floor rises under everyone.
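Sampling above the line can be sketched as a review-queue allocation. Everything specific here is an assumption: the `pick_for_review` name, the `passed` field, and especially the 80% share, which is a placeholder for "most of their time" and not a number we have validated.

```python
import random
from typing import Dict, List, Optional


def pick_for_review(
    pieces: List[Dict],
    budget: int,
    passed_share: float = 0.8,   # placeholder share, not a recommended value
    seed: Optional[int] = None,
) -> List[Dict]:
    """Spend most of a fixed review budget on pieces that already passed the checks."""
    rng = random.Random(seed)
    passed = [p for p in pieces if p["passed"]]
    failed = [p for p in pieces if not p["passed"]]
    n_passed = min(len(passed), round(budget * passed_share))
    n_failed = min(len(failed), budget - n_passed)
    return rng.sample(passed, n_passed) + rng.sample(failed, n_failed)


# 50 pieces that cleared every check, 10 that tripped one.
pieces = [{"id": i, "passed": i < 50} for i in range(60)]
queue = pick_for_review(pieces, budget=10, seed=7)
```

A conventional queue would hand the reviewer only the ten failures. Here eight of the ten slots go to clean pieces, which is where a note like "this could be sharper" can turn into the next constraint.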

Figure: The goal, the system, and the workflow of taste as infrastructure

The number worth watching

All of which gives you a metric, and it is not the one the industry is racing toward. Everyone is driving AI cost per output to zero. The number that tells you whether you built taste infrastructure or a fast mediocrity engine is something else, what I’ve started calling judgment density: how many human judgment calls you spend per thousand outputs.

It fails in both directions, which is what makes it useful. Drive judgment density too high and you haven’t built infrastructure, you’ve hired a review queue and called it a system. Drive it to zero and you’ve automated your standard away, and nothing will warn you, because every piece is dutifully passing the checks you wrote a year ago while the work gets safer and emptier each quarter. The target is the smallest amount of scarce human judgment that still keeps the bar rising, and that ratio tells you more about whether your taste survived contact with scale than cost per token ever will.
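The arithmetic itself is simple, and a naive version looks like the sketch below. The function name and the sample numbers are made up for illustration, and the sketch deliberately sets no thresholds for "too high" or "too low". What counts as one judgment call is the open question, and this code does not answer it.

```python
def judgment_density(judgment_calls: int, outputs: int) -> float:
    """Human judgment calls spent per thousand outputs."""
    if outputs <= 0:
        raise ValueError("outputs must be positive")
    if judgment_calls < 0:
        raise ValueError("judgment_calls cannot be negative")
    return 1000 * judgment_calls / outputs


# Invented numbers: 45 human calls across 3,000 shipped pieces.
density = judgment_density(45, 3000)   # 15.0 calls per thousand outputs
```

Tracked per quarter next to whether the floor actually rose, the same number can distinguish a review queue from an abandoned standard, which a cost-per-output chart cannot.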

I’ll be honest that I haven’t fully defined this yet. I know the failure modes at both ends, but not the cleanest way to count it, so I’d rather call it a working instrument than pretend the metric is finished. If you’re already tracking something like it inside your own team, I’d like to know how you define it.

There is a market signal under all of this worth naming once. When Meta paid roughly $900 million for Cred in a deal Gokul Rajaram read as the first acquisition aimed at product taste rather than model-building, that was the price tag on the scarce resource. But a price tag on judgment tells you to go acquire judgment. It doesn’t tell you how to make the judgment you already have reach further than the person who holds it. That second problem is the one no check on a dashboard solves, and it’s the one worth building for.

The question for anyone building this

We spent a decade treating taste as a finishing layer, the thing a few gifted people applied at the end, because production was the bottleneck and judgment was the cheap part. That arrangement is now inverted, and the reflex it leaves behind is to ask whether your AI can tell good from bad. It cannot, and every quarter spent building toward that question is a quarter spent manufacturing competent work nobody remembers, under meters that say you are winning.

The question that matters is the other one. Not whether your system has taste, but whether it spends the little judgment you have exactly where the outcome turns, inside a workflow shaped to how your company really works, and whether the bar it holds is higher this quarter than last. If it is, you built infrastructure, and one person’s standard is now reaching work they will never see. If the dashboard is green and the work is quietly getting safer, you built the other thing, the machine for scaling the absence of taste. It will not tell you. It will just keep shipping, on spec, on brand, and forgettable, until someone on your team reads a piece that passed everything and says the one thing the system was never built to hear.

This is the second essay in The Human Premium, a series on what stays valuable when AI handles everything else. The first, Human Breakthroughs Don’t Show Up on Dashboards, argued that no dashboard can see whether the work got better. This one is about what you build once you accept that, and why most of what gets built is the wrong thing.

Building Through the Shift
