Applied AI · RN-011

The difference between a demo and a deployable prototype

Demos optimize for the path of least resistance. Deployable prototypes optimize for the path of most resilience. This note covers the cost of conflating them, especially in AI work, and how to scope around it.

Published: 2025 · 01

Read: 5 min

Author: GET Team

Why AI demos mislead enterprise buyers

An AI demo is engineered to remove variance. The prompt is hand-tuned against the example inputs in the room. The retrieval corpus is small enough to fit the question. The latency is whatever the latency happens to be on a quiet API endpoint at 2 p.m. PT. The model version is whichever one produced the screenshot last week. Nothing in that environment resembles a production load profile, and nothing in that environment generates the signals an operator needs to trust the system at 3 a.m.

The misdirection is not malicious. It is structural. Demos are built to answer the question buyers ask first (does the model do the thing?) and they answer it well. They do not answer the questions buyers ask second, third, and fourth: what happens at p99 latency under representative load, what happens when the model provider deprecates the endpoint, what happens when an input drifts outside the golden set, what happens when a downstream system returns a 503. Those questions require a different artifact.

The cost of the conflation is paid late. A program that bought a demo and budgeted for a prototype discovers the gap during integration, when the missing pieces — eval harness, observability, prompt regression coverage, idempotent retries, a model registry, a rollback path — turn out to be most of the work. The model itself was rarely the hard part.

What a deployable AI prototype actually contains

A deployable prototype is not a production system. It is a production-shaped system, scoped to a narrow surface area, that exposes every load-bearing concern a real deployment will eventually require. The point is not to ship everything. The point is to make every later decision a known unknown rather than an unknown unknown.

At a minimum, the artifact should include the following, even in reduced form:

A golden set of inputs and expected behaviors, versioned, with a written rubric for what counts as a pass.
An eval harness that runs the golden set on every model or prompt change and produces a diff a reviewer can read in under five minutes.
A latency budget broken down by hop — retrieval, model call, post-processing — with measured p50, p95, and p99 under representative load.
Observability that captures prompts, completions, token counts, model version, retrieval IDs, and downstream errors against a single request ID.
An idempotency strategy for any action the model can trigger, so retries are safe.
A model registry entry, even if the registry is a single YAML file, naming the model version, the prompt version, the eval results, and the rollback target.
A failure mode matrix listing what the system does when the model returns nothing, returns malformed output, returns a confident wrong answer, or times out.

None of those items is exotic. All of them are absent from the average demo, and all of them take real engineering time to add. The trick is to scope the surface area small enough that adding them is tractable inside the prototype phase rather than deferred to a hypothetical hardening sprint that never gets staffed.
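The latency budget item can be made concrete with very little code. The following is a minimal sketch, assuming per-hop timings have already been collected from a load test; the function names, the sample timings, and the budget values are all hypothetical, and the nearest-rank percentile is one reasonable definition among several.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hop_report(timings_ms: dict[str, list[float]],
               budget_ms: dict[str, float]) -> dict[str, dict]:
    """Summarize each hop and flag any hop whose p99 exceeds its budget."""
    report = {}
    for hop, samples in timings_ms.items():
        p99 = percentile(samples, 99)
        report[hop] = {
            "p50": percentile(samples, 50),
            "p95": percentile(samples, 95),
            "p99": p99,
            "over_budget": p99 > budget_ms[hop],
        }
    return report


def dominant_hop(report: dict[str, dict]) -> str:
    """Return the hop with the largest p99, the one that dominates tail latency."""
    return max(report, key=lambda hop: report[hop]["p99"])


# Hypothetical measurements in milliseconds; real numbers come from a load test.
timings_ms = {
    "retrieval": [40.0] * 95 + [120.0] * 5,
    "model_call": [600.0] * 98 + [2400.0] * 2,
    "post_processing": [15.0] * 100,
}
# Hypothetical per-hop p99 budgets in milliseconds.
budget_ms = {"retrieval": 150.0, "model_call": 2000.0, "post_processing": 50.0}

report = hop_report(timings_ms, budget_ms)
```

In this made-up data the model call looks healthy at p50 and p95 and only breaks its budget at p99, which is the kind of result a quiet-endpoint demo never surfaces.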

How to scope an AI POC that becomes a real pilot

Most AI POC briefs are written as capability statements: the system shall summarize, classify, extract, route. Capability statements are necessary but insufficient. A scope that produces a deployable prototype adds three constraints to every capability: the input distribution it is responsible for, the failure modes it is allowed to exhibit, and the operational surface it must expose.

The input distribution constraint forces the team to define the golden set before writing the prompt. The failure mode constraint forces the team to decide, in advance, whether a hallucination is a P2 or a P0 in this context. The operational surface constraint forces the team to instrument the system before they tune it, which is the opposite of the demo-first instinct and the single highest-leverage change a program can make.
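One way to make the failure mode constraint checkable is to write the matrix down as data and refuse to proceed while any cell is empty. This is a sketch under stated assumptions: the four failure modes come from the list earlier in this note, while the severities, actions, and names such as `FAILURE_MATRIX` and `undefined_modes` are hypothetical placeholders for an imagined extraction workflow.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    EMPTY_OUTPUT = "model returned nothing"
    MALFORMED_OUTPUT = "model returned malformed output"
    CONFIDENT_WRONG = "model returned a confident wrong answer"
    TIMEOUT = "model call timed out"


@dataclass(frozen=True)
class Policy:
    severity: str        # "P0" (most severe) through "P3"
    action: str          # what the system does without human help
    page_on_call: bool   # whether a human is paged


# Hypothetical assignments; each program has to decide its own.
FAILURE_MATRIX = {
    FailureMode.EMPTY_OUTPUT: Policy("P2", "retry once, then route to manual queue", False),
    FailureMode.MALFORMED_OUTPUT: Policy("P2", "reject output, route to manual queue", False),
    FailureMode.CONFIDENT_WRONG: Policy("P0", "halt automated actions for this workflow", True),
    FailureMode.TIMEOUT: Policy("P3", "retry with backoff, then return fallback", False),
}


def undefined_modes(matrix: dict) -> list:
    """Failure modes with no agreed policy; a non-empty result means the scope is not ready."""
    return [mode for mode in FailureMode if mode not in matrix]


def policy_for(mode: FailureMode, matrix: dict = FAILURE_MATRIX) -> Policy:
    """Look up the agreed response, failing loudly if nobody decided one."""
    if mode not in matrix:
        raise KeyError(f"no policy defined for: {mode.value}")
    return matrix[mode]
```

Running `undefined_modes` in a pre-merge check turns "we have not decided yet" into a visible failure rather than a surprise during the pilot.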

A useful scoping exercise: write the on-call runbook before writing the prompt. If the runbook cannot be written because the failure modes are undefined, the scope is not yet ready for a prototype. If the runbook can be written but its first step is "escalate to the vendor," the scope has outsourced its operational surface and will not survive the pilot.

Why MLOps decides whether a prototype is deployable

MLOps is the word the industry uses for the unglamorous machinery that turns a model into a system: registries, evals, feature stores where relevant, prompt versioning, traffic shaping, shadow deployment, rollback. A prototype without that machinery can ship a single version once. A prototype with that machinery can ship a hundred versions over six months without the team losing the ability to reason about what changed.
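To illustrate how small the registry and rollback machinery can start, here is a minimal in-memory sketch. It stands in for the single-file registry described earlier; the class names, release labels, version strings, and pass rates are hypothetical, and a real registry would persist its entries rather than hold them in a dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    release: str
    model_version: str
    prompt_version: str
    golden_set_pass_rate: float
    rollback_target: Optional[str]  # release to fall back to; None for the first release


class Registry:
    def __init__(self) -> None:
        self._entries: dict[str, RegistryEntry] = {}
        self.active: Optional[str] = None

    def register(self, entry: RegistryEntry) -> None:
        # Refuse entries whose rollback target is not itself a registered release.
        if entry.rollback_target is not None and entry.rollback_target not in self._entries:
            raise ValueError(f"unknown rollback target: {entry.rollback_target}")
        self._entries[entry.release] = entry

    def promote(self, release: str) -> None:
        if release not in self._entries:
            raise KeyError(f"unregistered release: {release}")
        self.active = release

    def rollback(self) -> RegistryEntry:
        """Switch the active release to its recorded rollback target."""
        if self.active is None:
            raise RuntimeError("nothing is active")
        target = self._entries[self.active].rollback_target
        if target is None:
            raise RuntimeError(f"{self.active} has no rollback target")
        self.active = target
        return self._entries[target]


# Hypothetical releases; version strings and pass rates are placeholders.
registry = Registry()
registry.register(RegistryEntry("r1", "model-2025-01", "prompt-v3", 0.94, None))
registry.register(RegistryEntry("r2", "model-2025-02", "prompt-v4", 0.91, "r1"))
registry.promote("r2")
restored = registry.rollback()
```

The design choice worth noting is that a release cannot be registered without a valid rollback target, so the rollback path exists before it is needed.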

The reason this matters at the prototype stage, not later, is that AI systems regress in ways traditional systems do not. A model provider patches a model and your extraction accuracy drops three points overnight. A prompt change improves one slice of the golden set and silently breaks another. A retrieval index re-embeds and the same query returns different documents. Without prompt regression coverage and a model registry, none of those events is detectable until a user complains, which in B2B contexts often means an account complains.
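The second regression described above, where a prompt change helps one slice and silently breaks another, is invisible in an aggregate pass rate. A hedged sketch of a per-slice comparison follows; the slice names, the result format of `(slice_name, passed)` pairs, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per slice from (slice_name, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}


def overall(results: list[tuple[str, bool]]) -> float:
    """Aggregate pass rate across all slices."""
    return sum(int(passed) for _, passed in results) / len(results)


def regressed_slices(baseline: list[tuple[str, bool]],
                     candidate: list[tuple[str, bool]],
                     tolerance: float = 0.0) -> dict[str, tuple[float, float]]:
    """Slices whose pass rate dropped by more than `tolerance`.

    A slice missing from the candidate run is treated as a pass rate of 0.0.
    """
    base = pass_rates(baseline)
    cand = pass_rates(candidate)
    return {
        name: (base[name], cand.get(name, 0.0))
        for name in base
        if cand.get(name, 0.0) < base[name] - tolerance
    }


# Hypothetical golden-set results before and after a prompt change.
baseline = [("invoices", True)] * 4 + [("contracts", True)] * 2 + [("contracts", False)] * 4
candidate = [("invoices", True)] * 3 + [("invoices", False)] + [("contracts", True)] * 6

flagged = regressed_slices(baseline, candidate)
```

In this made-up data the aggregate pass rate rises while the `invoices` slice drops, so a gate on the aggregate alone would have approved the change.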

What to ask before approving an enterprise AI deployment

A buyer who has been shown a demo and is being asked to fund a deployment should ask a small number of specific questions before signing. The answers separate teams that have built a deployable prototype from teams that have built a slide with a working backend.

01 What is in the golden set, who owns it, and how often is it run?
02 What is the measured p99 latency under representative load, and which hop dominates it?
03 What is the rollback procedure when a model or prompt change regresses the golden set, and how long does it take?
04 What does the observability stack capture per request, and can an on-call engineer reconstruct a bad output from logs alone?
05 Which failure modes are accepted, which are escalated, and where is that written down?
06 What is the model registry entry for the current version, and where is the prior version stored?

If the answers are crisp, the team has built a prototype. If the answers are some variant of "we will get to that in the pilot," the team has built a demo and is asking the budget to absorb the cost of converting it. That conversion is usually larger than the original build, and it is the line item most often missing from the proposal.
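The question about reconstructing a bad output from logs alone can be tested mechanically. The sketch below assumes structured JSON log lines that each carry a request ID; the field names mirror the list of captured items earlier in this note, but the exact keys, the `log_event` and `reconstruct` helpers, and the sample events are hypothetical.

```python
import json

# Fields this note says should be captured against a single request ID.
REQUIRED_FIELDS = (
    "prompt",
    "completion",
    "token_count",
    "model_version",
    "retrieval_ids",
    "downstream_errors",
)


def log_event(request_id: str, **fields) -> str:
    """Serialize one structured log line tagged with the request ID."""
    return json.dumps({"request_id": request_id, **fields}, sort_keys=True)


def reconstruct(log_lines: list[str], request_id: str) -> dict:
    """Merge every event for one request into a single record, in log order."""
    record = {}
    for line in log_lines:
        event = json.loads(line)
        if event.get("request_id") == request_id:
            record.update(event)
    return record


def missing_fields(record: dict) -> list[str]:
    """Required fields that never appeared in the logs for this request."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Hypothetical interleaved log lines from two requests.
logs = [
    log_event("req-1", prompt="Summarize ticket 88", retrieval_ids=["doc-4", "doc-9"]),
    log_event("req-2", prompt="Classify ticket 91", retrieval_ids=[]),
    log_event("req-1", completion="Customer reports login failure.",
              token_count=212, model_version="model-2025-01"),
    log_event("req-1", downstream_errors=["ticketing API returned 503"]),
]

bad_output = reconstruct(logs, "req-1")
gaps = missing_fields(reconstruct(logs, "req-2"))
```

If `missing_fields` returns anything for a sampled production request, the honest answer to the on-call question is no.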

The honest scoping conversation

Demos have a legitimate role. They de-risk the capability question and they earn the meeting. The mistake is treating them as a milestone on the path to production rather than a separate artifact with a separate purpose. The work between a working demo and a deployable prototype is most of the work, and naming it explicitly in the scope is the cheapest intervention available.

For programs starting an AI pilot in the next quarter, the most useful single change is to write the on-call runbook and the golden set rubric before the first prompt is tuned. Everything downstream gets easier. Everything skipped gets paid for later, with interest, in the form of incidents that the team has no instrumentation to diagnose and no registry to roll back from. The choice is not whether to do the work. It is whether to do it before users depend on the system or after.

Authored by GET Team · GET AI Labs
