Field notes · June 14, 2026 · ~10 minute read · Includes 2 reusable agent-testing prompts · Personal · time-stamped


We're Live, and the Weirdest Thing I Learned Shipping It

By Cal

Hey, Cal here.

Niche is live. Real product, on the street, working. Go to nicheangle.com right now, start a trial, and you'll find 2,000 credits on the house waiting for you: enough to actually put it through its paces, not a teaser that runs dry before you've done anything interesting. Point your agent at it, run a real piece of content end to end, and tell me what breaks. I mean that literally. The whole rest of this post is about why I want you to break it.

But I don't want this to be a launch post that just stands on a chair and yells "we shipped." We did, and I'm proud of it, and v2 is already mapped out. We'll get to all that. What I actually want to talk about is the single strangest, most useful thing I learned building a product where the primary user isn't a person. It's an agent. And that one fact quietly broke a rule I'd taken for granted my entire career.

The gap that just... closed

For all of software history there's been a gap between the people you test with and the people who actually use the thing. Beta cohorts, research panels, the five users you grabbed for a usability session. All of it is a proxy. You're always extrapolating from a sample to a population you can't really see. You make your best guess about the real user from the stand-in in front of you, and you're wrong in ways you don't discover until launch.

When your primary user is an agent, that gap collapses. My testers aren't standing in for my users. They're the same species. There's no sample-to-population leap to get wrong. I'm testing on the actual population. The agent running Niche in my test harness is the same kind of thing as the agent that'll run Niche for a customer next week. That sounds small. It is not small. It rewires the entire feedback loop, and I'm still finding the edges of what it means.

The most immediate payoff: model and surface diversity is user diversity, for free. Fable 5 reads a tool description differently than Opus 4.8. Sonnet reaches for a different tool order than either. An agent driving Niche from the CLI behaves differently than one driving it from Cursor, from Claude desktop, from Claude on the web. In a normal product I'd have to pay for a panel with that much behavioral spread: different temperaments, different mental models, different failure modes. And I still couldn't assemble one this varied. Here, the test matrix is the panel. When Niche holds up across that whole grid, every model, every surface, I've stress-tested against more genuine variance than most products see in their first year.
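Here's a rough sketch of what "the test matrix is the panel" can look like in practice. It is not Niche's actual harness. The model and surface labels are just the ones named above, and the shape of `results` (a dict keyed by model and surface) is an assumption for illustration.

```python
from itertools import product

# Labels taken from the paragraph above; swap in whatever you actually run.
MODELS = ["Fable 5", "Opus 4.8", "Sonnet"]
SURFACES = ["CLI", "Cursor", "Claude desktop", "Claude web"]


def build_matrix(models, surfaces):
    """Return every (model, surface) pair as one test cell."""
    return [{"model": m, "surface": s} for m, s in product(models, surfaces)]


def uncovered(matrix, results):
    """List the cells that have no passing run recorded yet.

    `results` maps (model, surface) -> bool. That shape is assumed here,
    not something the product defines.
    """
    return [
        cell
        for cell in matrix
        if not results.get((cell["model"], cell["surface"]), False)
    ]


matrix = build_matrix(MODELS, SURFACES)
print(len(matrix))  # 12 cells: 3 models x 4 surfaces
print(len(uncovered(matrix, {("Sonnet", "CLI"): True})))  # 11 still to go
```

The useful habit is tracking the holes, not the passes: a grid with one green cell tells you very little, and the `uncovered` list makes that hard to ignore.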

And I'll be honest, for a while that felt like cheating. Infinite testers. Never tired. Never annoyed. Never once sick of hearing me complain about a tool that returned the wrong shape. You can run the same flow forty times at 6am on a walk with your phone, have the agent spit out markdown, and hand it straight to your build agent. It feels like a superpower.

It's also a trap, and it took me embarrassingly long to see why.

The tireless tester hides your flaws

Agents are too good. That's the problem.

When a human hits a rough edge in your UX, they get stuck. They get confused, they get frustrated, they rage-quit. And you learn, because the failure is loud. The stuck human is the most valuable signal in software.

When an agent hits the same rough edge, it just... improvises around it. It makes three tool calls where one should've done the job. It invents a clever workaround for a description you wrote ambiguously. It recovers from your mistake so gracefully that you never find out you made one. It succeeds. And the rough edge stays completely invisible, sanded over by the agent's competence. Your bad design survives because your user was skilled enough to route around it.

So the real builder skill I had to develop on this launch wasn't "use agents as testers." Everyone will figure that part out. The skill was learning to read the agent's struggle even when it succeeds. Stop staring at the outcome. Watch the path. The wasted call, the half-second of hesitation where it weighed two tools, the recovery move it shouldn't have needed to make. Those are the bug reports a human would've screamed at me. I had to train myself to treat a successful-but-clumsy run as a failure. Because it is one.
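One way to make "watch the path" mechanical is to score the trace, not just the result. The sketch below is a minimal version under stated assumptions: the `ToolCall` record, the tool names, and the `expected_calls` budget are all hypothetical, and it only catches the signals a log can show (extra calls, retries, recoveries), not the hesitation an agent reports in its own narration.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str  # hypothetical tool name
    ok: bool   # did the call return what the agent expected?


def friction_signals(trace, expected_calls):
    """Collect signs of struggle from a run, whether or not it succeeded."""
    signals = []
    if len(trace) > expected_calls:
        extra = len(trace) - expected_calls
        signals.append(f"{extra} extra call(s) beyond the expected {expected_calls}")
    # Walk consecutive pairs of calls looking for retries and recoveries.
    for prev, cur in zip(trace, trace[1:]):
        if not prev.ok and cur.ok:
            signals.append(f"recovered from a failed {prev.tool} via {cur.tool}")
        if prev.tool == cur.tool:
            signals.append(f"repeated {cur.tool} back to back")
    return signals


def verdict(trace, expected_calls, succeeded):
    """Treat a successful-but-clumsy run as its own outcome, not a pass."""
    if not succeeded:
        return "fail"
    return "clumsy" if friction_signals(trace, expected_calls) else "clean"


trace = [ToolCall("search", False), ToolCall("search", True), ToolCall("draft", True)]
print(verdict(trace, expected_calls=2, succeeded=True))  # clumsy
```

Three outcomes instead of two is the whole point: the "clumsy" bucket is where the bug reports live.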

Which brings me to the two prompts I actually use, and why they're two and not one.

Two prompts, two halves of the truth

These cover the two things you need to know: where your fluent user trips, and where your brand-new user can't even get started. Steal both.

Prompt 1: the narrated run. Point an agent at your live tools and have it do something real, but tell it to narrate its friction as it works. This is think-aloud usability testing, the kind UX researchers beg human subjects for and humans are hopeless at, because people can't narrate their own confusion. Agents can. You get the struggle annotated.

You are about to use [PRODUCT] to accomplish a real task: [SPECIFIC TASK].

Work through it for real, start to finish, using the actual tools.

As you go, narrate your friction out loud. Every time you:
- hesitate between two tools
- find a description ambiguous
- have to guess at what an input wants
- make a call that returns something other than what you expected
- take more than one step to do what felt like it should be one step

...stop and tell me, in that moment, exactly what was confusing and what
you expected instead. Don't smooth it over. I want the rough version.

At the end, give me your three highest-friction moments, ranked.

It doesn't matter if the agent knows you built the thing. When I first tried this on a fresh assistant, just "tell me about the Niche tools," it clocked within one message that I was the builder. Didn't ruin a thing. It just turned into a narrating collaborator: as it ran the tools to produce content I genuinely posted to LinkedIn, it told me in real time where it stumbled. Real output, real friction log, same session.

Prompt 2: the blind run. This one is the antidote to the trap. New agent, hard wall: it may not look at your repo, it's never heard of your product, it behaves as a true net-new user. A context-loaded agent improvises better because it half-knows your intentions, and that's exactly what hides your bad design. The blind agent has no insider knowledge to route around the rough edge, so the edge finally becomes visible.

And the closing move is the part that'll change how you work: don't ask "how was it." Have it write the feedback agent-to-agent, in builder-spec language, ready for your build agent to turn into a fix.

You are a brand-new user of [PRODUCT]. You have never heard of it. You may
NOT read its source, docs beyond the user-facing surface, or any internal
context. You know only what a first-time user would know.

Your task: [SPECIFIC TASK]. Attempt it exactly as a new user would, with no
insider assumptions and no benefit of the doubt. If something is unclear, do
NOT cleverly work around it. Do the most literal, obvious thing, and if it
fails, let it fail and note where.

When you're done, produce a feedback report written FOR ANOTHER AGENT,
the one that will fix these issues. For each problem:
- the exact step where it occurred
- what a new user would have expected
- the minimal change that would fix it
- a one-line spec the build agent can act on directly

Format it so a build agent can turn it straight into tickets.

The first time I ran that, the blind agent couldn't even find my tools. It searched for them by name and came up empty: "nicheangle" returned nothing, and one plain-English query for what the product does matched a completely unrelated tool instead of mine. That is the whole lesson in one bug. A brand-new user can't route around a tool it can't even discover, and I would never have caught it from inside, because I always knew exactly what to search for. The fix was unglamorous, just better keyword coverage on the descriptions, but it stayed invisible to me until an agent with no insider knowledge went looking the way a stranger would.
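A discovery bug like that can be turned into a cheap regression check. The sketch below uses a deliberately naive word-overlap match, which is an assumption: real tool search may rank very differently. The tool names, descriptions, and queries are all made up for illustration, not the actual Niche ones.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_match(query, tools):
    """Return the tool whose name plus description shares the most words
    with the query, or None if nothing overlaps at all."""
    q = tokens(query)
    best_name, best_score = None, 0
    for name, description in tools.items():
        score = len(q & tokens(name + " " + description))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical registry: one thinly described tool, one unrelated neighbor.
before = {
    "niche_find_angle": "Find a content angle.",
    "trend_scanner": "Research trending stories for social posts.",
}
# Same registry after widening the keyword coverage on the description.
after = dict(before)
after["niche_find_angle"] = (
    "Find a content angle, research stories, and draft a LinkedIn post. "
    "Also known as nicheangle."
)

query = "research stories for a linkedin post"
print(top_match("nicheangle", before))  # None
print(top_match(query, before))         # trend_scanner
print(top_match(query, after))          # niche_find_angle
```

The queries worth keeping in a check like this are the ones a stranger would type, which is exactly why they're best collected from blind runs and not written from memory.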

And the report came back already shaped for the fix. The test agent didn't just find the bug. It wrote the ticket, in the dialect my build agent speaks, ready to act on with almost no translation.

Put them together and you've got the whole method: the narrated run gives you annotated struggle from inside the flow, and the blind run gives you uncontaminated first contact, formatted as a spec. One tells you where your power user wastes motion. The other tells you where your newcomer hits a wall. You need both, because they fail in completely different places and neither one will ever tell you about the other's territory.

It caught more than clumsy clicks

The part I didn't expect: the method found bugs in the section of Niche I care about most, the trust layer, not just the surface.

Provenance is supposed to be the trustworthy core. It tells you how many independent places corroborate a story, which is the number a creator actually stakes their name on. One run caught me counting a Reddit listing page as a distinct source, and crediting a link to the aggregator that surfaced a story instead of the site it actually came from. Both quietly inflate that corroboration count. That's the one number you can't afford to fudge, so it got tightened the same day.
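To make those two inflation bugs concrete, here is a minimal sketch of a stricter count. It is not Niche's provenance logic. The citation shape (`url` plus an optional `origin_url`), the "no /comments/ in the path means listing page" heuristic, and treating one domain as one independent source are all assumptions chosen to keep the example short.

```python
from urllib.parse import urlparse


def domain(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_listing_page(url):
    """Assumed heuristic: a Reddit URL with no /comments/ segment is a
    listing page, not an individual post."""
    host = domain(url)
    on_reddit = host == "reddit.com" or host.endswith(".reddit.com")
    return on_reddit and "/comments/" not in urlparse(url).path


def corroboration_count(citations):
    """Count distinct origin domains, skipping listing pages and crediting
    the original site over the aggregator that surfaced it."""
    sources = set()
    for citation in citations:
        url = citation.get("origin_url") or citation["url"]
        if is_listing_page(url):
            continue
        sources.add(domain(url))
    return len(sources)


citations = [
    {"url": "https://www.reddit.com/r/marketing/"},  # listing page, skipped
    {"url": "https://news.example.com/story",        # aggregator link...
     "origin_url": "https://blog.example.org/original"},  # ...credited to origin
    {"url": "https://blog.example.org/original"},    # same origin, not counted twice
    {"url": "https://another.example.net/report"},
]
print(corroboration_count(citations))  # 2, where a naive count says 4
```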

Then a worse one. My automated quality gate came back all-green on a draft, zero flags, while a few of its factual claims had slipped through unverified. One spot had quietly sharpened a fuzzy detail into a specific it couldn't back up. A clean report on something that wasn't actually checked is more dangerous than no report at all, because you trust it. So I widened what the gate treats as checkable, and stopped letting "looks fine" stand in for "was checked."
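The difference between "no flags" and "was checked" is easy to encode once each claim carries its own status. This is a sketch of the idea only: the three statuses, the "incomplete" result, and the sample claims are hypothetical, not the actual gate.

```python
from enum import Enum


class Status(Enum):
    VERIFIED = "verified"    # checked and backed up
    FAILED = "failed"        # checked and contradicted
    UNCHECKED = "unchecked"  # never looked at


def gate(claims):
    """Return (result, offending_claims) for a dict of claim -> Status.

    Green requires every claim to have been checked and verified. An
    unchecked claim yields "incomplete", never green.
    """
    failed = [c for c, s in claims.items() if s is Status.FAILED]
    unchecked = [c for c, s in claims.items() if s is Status.UNCHECKED]
    if failed:
        return "red", failed
    if unchecked:
        return "incomplete", unchecked
    return "green", []


# Made-up claims for illustration.
draft = {
    "The company launched in 2019": Status.VERIFIED,
    "Revenue grew 40% last year": Status.UNCHECKED,
}
print(gate(draft))  # ('incomplete', ['Revenue grew 40% last year'])
```

A gate that only counts flags would call this draft clean. Making "unchecked" a first-class state means the absence of a flag can no longer pass for a verification.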

Neither of those is a UX bug. They're correctness bugs in the exact thing the product is supposed to be best at, and both surfaced because an agent ran the real flow and I'd trained myself to watch the path.

So, go break it

That's the builder lesson I'm carrying out of this launch: the infinite tester is a gift and a blindfold, and the craft is learning to see the struggle hiding underneath the success.

And for all the bugs the agents flagged, the run I keep coming back to is one that went right. I started a session with no brand profile attached, and the tool didn't guess. It returned a null recommendation and told me, plainly, that it didn't have enough to make a real pick. I'll take a system that refuses to fake an answer over a confident wrong one every time, and watching it hold that line under a test designed to make it slip is a big part of why I trust what shipped.

Which is exactly why I want human hands on Niche now, alongside the agents. You'll get stuck in ways no agent ever will. Loudly, usefully, in all the places my tireless testers quietly improvised past. That's the signal I can't generate on my own, and it's the thing that shapes v2, which is already on the board.

So here's the ask, and it's the same one I opened with: go to nicheangle.com, grab the trial, take the 2,000 credits, and run something real through it. Point your agent at it. Try to break it. Then tell me where it bent. Agent-narrated or human-frustrated, I want all of it.

We're live. The product works. Now help me find the rough edges my testers were too good to notice.

Cal
