LIVE EXPERIMENT - PART 2 of 4

Four ways to build a synthetic user (I tried all of them)

Building our synthetic user in public, Part 2. Part 1 mapped the ways to build one. This time I built each one and put it through its paces.

Tania Clarke
PMM · Great Question
June 2026 · ~12 min read

Four parts · One live experiment

We're building our synthetic user in public,
start to finish.

01

The map

The vocabulary, the ways to build a synthetic version of your customer, and the priors going in.

YOU ARE HERE
02

Four ways to build one

I built each of the four workflows, and worked out what each is good for and where each falls apart.

03

Robot vs human

An experiment putting synthetic users and real humans head to head. Where do they diverge?

04

The recap & the skill

Concludes the live experiment. The synthetic user skill ships, and everyone on the list gets it.

Data scientists have spent years trying to predict what a user will do next from what they've done before. Now that AI is widely available, it's hard not to feel optimistic about how much further we can take that: using past behaviour to predict not only what users will do, but the feedback they'll give. I like to think of synthetic users as data science on steroids.

What data science & synthetic users have in common

When it comes to building a synthetic user, we're essentially taking behavioral prediction (product usage) and giving it much better raw material (interview transcripts).

The old way mostly had the numbers to work with: what people clicked and how often. A grounded synthetic user adds the why behind the numbers: the actual words customers used in interviews, and the context that only qualitative data can capture.

Caitlin Sullivan framed it for me that way in Part 1, and it reset how I thought about the whole project. It's the same goal the data teams have always chased, just with far richer input.

The ground rules (they apply to all four): A synthetic user is only as honest as the evidence under it, so the same non-negotiables went into every version before I picked a workflow:

Evidence-backed claims only. No source, no claim.

Cite every claim inline, with the quote attached.

Flag the gaps. When the evidence is thin, the synthetic user says so instead of inventing a pattern.

Qualitative confidence threshold. When I run a qualitative study, I always want to know how many people actually said something before I trust it as a pattern. The skill does the same thing. It tells me how many interviews are behind every claim it makes, and it won't call something a pattern unless enough of them back it up. I set the bar at 8: hit 8 interviews and it states the point plainly; anything below that and it categorizes the theme as medium or low confidence.
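Here is a minimal sketch of how a threshold like that could be expressed in code. It is an illustration, not the skill itself. The bar of 8 comes from the rule above; the medium/low cut-off of 4, the "gap" label for zero interviews, and the names `Theme`, `confidence` and `report` are all assumptions made for the example.

```python
from dataclasses import dataclass

HIGH_CONFIDENCE_MIN = 8    # from the article: 8 interviews states the point plainly
MEDIUM_CONFIDENCE_MIN = 4  # assumption: the article does not give this cut-off


@dataclass
class Theme:
    label: str
    interview_ids: set[str]  # distinct interviews that support the theme


def confidence(theme: Theme) -> str:
    """Map the number of supporting interviews to a confidence label."""
    n = len(theme.interview_ids)
    if n == 0:
        return "gap"  # assumption: no evidence at all is flagged as a gap
    if n >= HIGH_CONFIDENCE_MIN:
        return "high"
    if n >= MEDIUM_CONFIDENCE_MIN:
        return "medium"
    return "low"


def report(theme: Theme) -> str:
    """Attach the confidence label and the interview count to the theme."""
    n = len(theme.interview_ids)
    return f"{theme.label} (confidence: {confidence(theme)}, interviews: {n})"
```

Storing interview IDs in a set means the same interview quoted twice still counts once, which keeps one talkative participant from inflating a pattern.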

I've built all of these requirements into a synthetic user skill that I'll be testing for the duration of this live experiment. We'll make it available to you when Part 4 lands and concludes the experiment.

Why I'm building the synthetic user from a research repository, not a pile of transcripts

The first question I asked Jack, an AI Product Manager on the Great Question team, was: why can't I just query a whole bunch of transcripts from GitHub or Google Drive?

The DIY version is to drop a folder of transcripts into Claude and start asking questions. It works for about three transcripts. Past that you hit "lost in the middle," where the model skims the middle of a long document and quietly fills the gaps with things that sound right. You won't catch it, because the invented parts read exactly like the real ones. Jack, who built our repository retrieval system, said this:

"If you don't build a RAG pipeline that knows what it's doing? It's going to be hallucinating left and right. And you won't know."

Jack · AI Product Manager, Great Question

A repository earns its place by doing the unglamorous work that keeps that from happening:

Hybrid search. Keyword and semantic together. Pure semantic search feels clever but loses the exact-string matches that let you anchor a claim to the precise sentence a customer said. You want both running.

Server-side filtering. Rather than shipping a 90-minute transcript to the model and hoping, the repo narrows to the relevant chunks first, so the model only ever reasons over material it can actually hold in context.

Structured metadata. Studies, segments, dates, participants. You can scope a query to "B2B researchers, last 18 months" instead of praying the right transcripts surface on their own.

A curated layer. Insights and highlights you've already validated sit on top of the raw transcripts, so the synthetic user draws on evidence that's been checked, not just whatever the search happened to return.

Citations that resolve. Every claim links back to the session it came from, which is the whole difference between a synthetic user you can audit and one you have to trust blind.

The DIY route can get there, but only by building your own version of all this. Anything you'd actually rely on, and especially anything high-stakes, means building your own RAG: server-side filtering, citation plumbing, metadata, the lot. That's a real engineering project before you've even started on the synthetic user. A repo is that project already finished, which is why all four workflows below run on top of one.
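To make "filter first, then hybrid search" concrete, here is a toy sketch. It is not how the Great Question repository works internally. The `Chunk` type, the `keyword_weight` parameter and the scoring are assumptions, and the "semantic" score is a plain word-count cosine standing in for a real embedding model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    session_id: str  # lets a result resolve back to its session
    segment: str     # structured metadata used for scoping
    text: str


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_score(query: str, chunk: Chunk) -> float:
    """1.0 when the exact query string appears in the chunk, else 0.0."""
    return 1.0 if query.lower() in chunk.text.lower() else 0.0


def semantic_score(query: str, chunk: Chunk) -> float:
    """Stand-in for embedding similarity: cosine over word counts."""
    q, c = Counter(tokens(query)), Counter(tokens(chunk.text))
    dot = sum(q[w] * c[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in c.values())
    )
    return dot / norm if norm else 0.0


def hybrid_search(query, chunks, segment, top_k=3, keyword_weight=0.5):
    """Scope by metadata first, then rank by a blend of both scores."""
    scoped = [c for c in chunks if c.segment == segment]
    scored = [
        (
            keyword_weight * keyword_score(query, c)
            + (1 - keyword_weight) * semantic_score(query, c),
            c,
        )
        for c in scoped
    ]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Only the top few chunks from the scoped set would go to the model, each still carrying its `session_id`, so a claim built on a chunk can point back to where it came from.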

By the way, we did experiment with building a lightweight synthetic persona in the past, from a collection of 8-10 interview transcripts from a previous study. It felt thin to me. My intention with this series is to build something meatier, grounded in MUCH more data than 8-10 raw transcripts.

Jargon, decoded

Two terms worth getting straight.

RAG (retrieval-augmented generation)

Instead of relying on what the model already knows, you pull the relevant pieces of your own data (your interviews and notes) and feed them in alongside the question, so the answer is built on your evidence.

Lost in the middle

Language models read the start and end of a long document closely and skim the middle. Hand one a 90-minute transcript and the middle is exactly where it's most likely to miss something or quietly make it up.

The build

4 ways to build a synthetic version of your customer

Here are the four workflows I experimented with, along with the strengths and limits of each.

Workflow 1

Digital twin

How I built it

Take one real, named user you have deep data on. Strip the PII (any personal details that identify them), store what's left as a synthetic-user document, and tell the agent to answer as that person. It's the highest-fidelity option because it's grounded in one real human rather than an average.
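As a rough illustration of the PII-stripping step, the sketch below replaces emails, phone-like numbers and a supplied list of names with placeholders. The regular expressions, the `strip_pii` function and the `P-01` handle format are assumptions for the example. Pattern matching like this only catches the obvious cases, so a real pass would still need a human review.

```python
import re

# Deliberately simple patterns: they catch common formats, not every format.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def strip_pii(text: str, known_names: dict[str, str]) -> str:
    """Replace emails, phone numbers and known names with placeholders.

    known_names maps each name variant to the anonymous handle to use.
    """
    text = EMAIL.sub("[email]", text)
    text = PHONE.sub("[phone]", text)
    # Longest names first, so "Jordan Lee" is replaced before "Jordan".
    for name in sorted(known_names, key=len, reverse=True):
        handle = known_names[name]
        pattern = re.compile(rf"\b{re.escape(name)}\b", flags=re.IGNORECASE)
        text = pattern.sub(lambda _match, h=handle: h, text)
    return text


raw = "Jordan Lee (jordan.lee@example.com, +1 415 555 0100) said Jordan hates exports."
clean = strip_pii(raw, {"Jordan Lee": "P-01", "Jordan": "P-01"})
```

Emails are replaced before names so that a name inside an address is not half-rewritten first.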

Best data to ground it

This one goes deep on a single person, not wide. You want everything you have on them: their interview transcripts, their product-usage history (feature adoption, drop-offs, the Mixpanel trail), their support tickets, and their CRM and sales-call notes. The richer the single-person record, the more convincing the twin.

The trade-off

Its strength is fidelity. Nothing gets you closer to a specific person's perspective, which makes it ideal when a key account or a design partner needs a seat in a roadmap conversation. Its limit is that it's exactly one person, quirks and all, so it can't speak for a segment.

Workflow 2

Segment-based synthetic user

How I built it

Aggregate eight to ten or more real users from a segment, spread across its sub-groups, into one synthetic user. You're building an archetype from a cluster of evidence, then writing it up as a single coherent person whose every trait traces back to the underlying interviews.

Best data to ground it

Breadth is the whole game here. Transcripts across the segment for language and goals, the insights and highlights you've already curated for themes that are validated, candidate and demographic data so no single sub-group dominates the mix, and product-usage data so the behavior is real and not self-reported.

Step zero is an audit: do you actually have eight or more solid sessions on the segment you want? If not, your next step would be to fill that research gap, so you have solid foundations to build upon.
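A step-zero audit can be as simple as counting what you have per segment. This sketch assumes each session is a dict with hypothetical `segment` and `participant` fields, and it counts distinct participants rather than raw sessions, on the assumption that two sessions with the same person should not count twice.

```python
from collections import defaultdict

MIN_SESSIONS = 8  # the bar used in this article


def audit_segments(sessions, min_sessions=MIN_SESSIONS):
    """Count distinct participants per segment and flag any shortfall."""
    participants = defaultdict(set)
    for session in sessions:
        participants[session["segment"]].add(session["participant"])
    return {
        segment: {
            "count": len(people),
            "ready": len(people) >= min_sessions,
            "shortfall": max(0, min_sessions - len(people)),
        }
        for segment, people in participants.items()
    }
```

The `shortfall` value is the number of extra participants to recruit before building on that segment.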

Types of segment-based synthetic users:

Power users: gather the data on your 'best' customers, how they're using the product, what they say, what they love, what feedback they've given in the past.

Casual users: gather product usage data on a segment of your less frequent users.

Churned users: pull out churn surveys, customer interview transcripts or closed-lost interviews.

What I like about segment-based synthetic users is that you can run a PRD, an artifact or a concept past all three, then compare the insights. I love experimenting with Perplexity's model council for the same reason. If you haven't used it yet, Perplexity's model council runs any query through 3 models so you can triangulate and sharpen your point of view as the models argue against each other. Fun stuff.

The trade-off

Because it's built on a few dimensions of data, the patterns are stable enough to trust…with a grain of salt of course.

Its limit is that you sand off the sharp edges of any one person, and it's only as good as your data's coverage of that segment. Thin segment = thin synthetic user.

Again, at best this could be used internally to drive customer empathy and provide some directional feedback on new concepts. We know from Part 1 that nuance is where synthetic users unfortunately fall down… today.

Workflow 3

Synthetic panel

How I built it

Sample several synthetic users from a segment (built on survey data) and run them through the same study together: the synthetic version of a recruited panel. Instead of one voice, you get a spread of them answering the same questions.

Best data to ground it

This one is built purely off a survey large enough to give you statistical significance, plus enough demographic and behavioral variation in the source data to make the panel genuinely diverse.

The trade-off

Its strength is distribution. You get a range of responses rather than a single point estimate, which is exactly what you want for dry-running a survey or catching a broken question before a human ever sees it. Its limit is that the diversity is capped by your data, and a panel that looks varied but isn't will hand you false confidence.
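One way to check whether a panel is as varied as it looks is to sample it in proportion to the survey and then compare the mix. The sketch below is an illustration under assumptions: respondents are dicts, `stratum_key` names a hypothetical field such as `role`, and proportional stratified sampling is my choice of technique, not something the workflow prescribes.

```python
import random
from collections import Counter, defaultdict


def sample_panel(respondents, stratum_key, panel_size, seed=0):
    """Proportional stratified sample of survey respondents.

    Rounding means the panel can end up slightly off panel_size.
    """
    rng = random.Random(seed)  # seeded so the same panel can be rebuilt
    strata = defaultdict(list)
    for respondent in respondents:
        strata[respondent[stratum_key]].append(respondent)
    total = len(respondents)
    panel = []
    for group in strata.values():
        # At least one seat per stratum, otherwise proportional to its share.
        seats = max(1, round(panel_size * len(group) / total))
        panel.extend(rng.sample(group, min(seats, len(group))))
    return panel


def stratum_shares(rows, stratum_key):
    """Share of rows in each stratum, for comparing panel and source."""
    counts = Counter(row[stratum_key] for row in rows)
    return {key: count / len(rows) for key, count in counts.items()}
```

Comparing `stratum_shares` for the panel against the full survey shows where the panel over- or under-represents a group. If the survey itself has only one or two strata, the panel cannot be more diverse than that, however many synthetic users it holds.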

Workflow 4

Live retrieval

How I built it

No stored user at all. A skill queries the whole repo live, contextualized to whatever artifact you feed it, a PRD or a design spec, and assembles the relevant customer evidence on the spot. This is the one Ned demoed on our recent synthetic user webinar.

He pasted in a PRD, the skill built synthetic users from the matching screener and interview data, and each one reacted to the parts of the PRD he put through.

Best data to ground it

This leans on the whole repo rather than one stored doc, so it lives or dies on the indexing from the section above, the hybrid search and server-side filtering especially. Whatever's freshest in the repo is what it pulls, which is the point. You want it reacting to your latest evidence, not a stored persona document from last year.

The trade-off

Its strength is that it's always current and contextual. You point it at the artifact in front of you and it reacts to that, with no document to maintain. Its limit is consistency. We experimented with it a few times while prepping the webinar, using the same repository and a similar prompt, and the outputs came back consistent-ish but not the same. The themes held every run; the exact wording and the examples it reached for moved around. Useful for a directional gut-check, risky the moment two people quote their own separate runs as the source of truth.

So which one do you reach for?

It comes down to what you're holding when you start. A specific person's perspective points to a digital twin. A whole segment points to the segment-based build. A need for spread, like dry-running a survey, points to a synthetic panel. A live artifact you want reactions to points to live retrieval. And for any high-stakes go/no-go, real users still win. Synthetic users are where you can start, not where you make the final decision.

The consistency question

Does it matter if everyone gets a slightly different answer?

This is the part I keep going back and forth on. Live retrieval gives a slightly different answer each time you run it. Does that actually matter? Especially if that data is the most recent?

I saw a small version of this in our own testing. We ran live retrieval a few times while prepping the webinar, same repo, similar prompt, and the answers came back close but not identical. The big themes held every time, which was at least comforting. The exact wording and the examples moved around. That's fine if you just want a quick gut-check. It's a problem if two people each run it and quote their own version as the truth.
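If you wanted to put a number on "close but not identical", one option is to tag the themes in each run and measure how much the sets overlap. The sketch below uses Jaccard similarity for that. It assumes the themes have already been reduced to consistent labels, by hand or in a separate step, and the function names are made up for the example. It is not something we ran for the webinar.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared themes divided by all themes seen in either run."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_consistency(runs: list[set[str]]) -> dict:
    """Summarise how stable the themes are across repeated runs."""
    scores = [jaccard(a, b) for a, b in combinations(runs, 2)]
    return {
        # Themes that appeared in every single run.
        "stable_themes": set.intersection(*runs) if runs else set(),
        # Average pairwise overlap: 1.0 means every run matched.
        "mean_overlap": sum(scores) / len(scores) if scores else 1.0,
    }


runs = [
    {"recruiting is slow", "pricing unclear", "exports"},
    {"recruiting is slow", "pricing unclear", "onboarding"},
    {"recruiting is slow", "pricing unclear"},
]
summary = run_consistency(runs)
```

In this made-up example two themes hold in every run while the third slot moves around, which is roughly the pattern described above.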

So my answer is yes, consistency matters, but not all the time. For a quick directional read, a slightly different answer each time is fine. For anything the whole team is building around, you want one version everyone trusts. That's why I think we'll keep one saved synthetic user as the official reference, and use live retrieval for one-off questions on top of it.

From the webinar

The questions the webinar raised

When Ned ran the live demo, the chat filled with sharp and thoughtful questions that we wanted to outline here:

On the data behind a synthetic user

How much cleanup did it take before the output felt decent?

Less than you'd expect. Coverage matters more than polish.

Does it use the repo's transcripts? What about usability-test videos?

Yes to transcripts, they're the backbone. Video works through its transcript today, so the spoken content is in but the on-screen behavior isn't yet.

On trust and accountability

Isn't it easy to over-trust this?

Yes, which is exactly why the skill we've built cites every claim and flags every gap.

Who's accountable when the output informs a real decision?

A genuinely tough question. Where we're landing is that people are accountable, not the AI. The AI's job is to make the evidence legible enough that a researcher and a PM can share that accountability.

How do you coach people to read it?

Get them looking at the citations and the gaps before the conclusions.

On replacing humans

Is this only for low-stakes calls?

Early discovery isn't always low-stakes, it can set product strategy. So the rule holds: synthetic is the floor, not the ceiling, and the higher the stakes, the faster you go to real humans.

How do I stop budget-holders swapping humans for synthetic to save money, without being called "anti-AI"?

Lean on synthetic where it's strong, like dry runs and gap-finding, and say plainly where it isn't, like anything you'd bet the roadmap on.

The hard ones I can't fully answer yet

Hyper-rationalization: real people reason messily, synthetic ones don't.

Web access: does letting it browse inflate what a synthetic user "knows"?

Model temperature: how do you get realistic variety instead of ten identical voices?

Brand knowledge: how much should a synthetic user know about you?

Language bias: AI tends to reward concrete, action-oriented wording over abstract ideas.

WEIRD bias: model training data skews Western, Educated, Industrialized, Rich and Democratic, and aggregation alone won't fix it.

I don't have clean answers to that last group. What I have is repo-only retrieval, so the model isn't pulling in web knowledge a real customer wouldn't have, plus a gap flag that fires when the evidence isn't there.

Under the hood

The skill behind all this

Everything above runs on a skill I built. Here's how it works:

Point it at the repo, and it builds profiles

It surveys what's in the Great Question repo, groups sessions into clusters by role, workflow and shared pains, and counts the evidence behind each one:

8 or more sessions: a high-confidence profile, stated plainly

a profile, but hedged

a signal, not yet a finding

a gap, flagged for you to go research

Each profile card covers who they are, the world they work in, their pains in their own words, the phrases they use, what they want, what it's safe to use them for, and what it still can't tell you.

Hand it a PRD or design, and a profile reacts

It pulls the relevant evidence with hybrid search (keyword and semantic at once), then responds in the first person as that customer, pushing back where the evidence contradicts the artifact and flagging anything the repo can't speak to.

The rules, every time

Every claim cites the real session it came from, by anonymous speaker handle, never a name. The repo is the only source. If the evidence isn't there, it says so instead of filling the gap.
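As a sketch of what "no source, no claim" and anonymous handles could look like in practice, the code below refuses to state a claim without evidence and cites speakers by a derived handle. The handle scheme, the `Evidence` type and the output format are assumptions for illustration, not the skill's actual implementation. A salted hash is also only as private as the salt, so treat it as a convenience, not a guarantee.

```python
import hashlib
from dataclasses import dataclass


def speaker_handle(participant_name: str, salt: str) -> str:
    """Derive a stable, non-identifying handle from a participant name."""
    digest = hashlib.sha256((salt + participant_name).encode("utf-8")).hexdigest()
    return f"speaker-{digest[:6]}"


@dataclass
class Evidence:
    session_id: str
    handle: str  # anonymous speaker handle, never a name
    quote: str


def render_claim(claim: str, evidence: list[Evidence]) -> str:
    """State a claim only when evidence exists; otherwise flag the gap."""
    if not evidence:
        return f"GAP: the repo has no evidence for '{claim}'."
    cites = "; ".join(
        f'{e.handle}, session {e.session_id}: "{e.quote}"' for e in evidence
    )
    return f"{claim} [{cites}]"
```

Because the handle is derived from the name and a salt, the same participant gets the same handle across sessions, so you can see when two quotes come from one person without seeing who they are.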

It's not public yet, on purpose

I'm testing it in the open across this series first, so you can see where it holds up and where it doesn't before you run it yourself.

What's coming in Part 3

Part 3 is the robot vs human experiment! I'm designing a study that puts synthetic users and real humans head to head. It should be interesting!

What I honestly don't know going in is where they'll diverge.

The build continues.
Follow along.

Part 3 lands next: the robot vs human experiment, putting synthetic users and real humans head to head.
