The Real Ambient Scribe Question is Trust, Not Note Quality

Written by hutchpd | Published 2026/04/03
Tech Story Tags: artificial-intelligence | ai-medical-scribe | healthtech | open-source | medical-ai-scribe | browser-medical-scribe | local-first-ai | ai-medical-notes

TL;DR: I built a browser-based AI medical scribe as a way to explore whether the real question isn’t just “can AI write a good note?” but “can it be done in a way that feels more private and trustworthy?” The idea was to keep everything local-first in the browser rather than relying on a backend, covering transcription, summarisation, review, and structured output in one place. It’s not a finished or clinical product, but it clarified where the real challenges are. Privacy is only part of it; reviewing and trusting the output still takes effort, and the hardest problem is often getting that output back into real healthcare systems in a useful way.

Why ambient scribes keep coming up

I built a browser-based AI medical scribe recently, and the reason was quite simple: I wanted to see if people (including me) had been asking the wrong question.

Ambient “AI scribe” tools are often sold on note quality. But when you look at why these tools are even on the table, it’s hard not to notice the bigger pressure underneath: clinicians are spending a lot of time documenting, and a lot of that time is mediated by the EHR.

A classic time-and-motion study published in Annals of Internal Medicine found that during the office day, physicians spent 27.0% of their time on direct clinical face time with patients and 49.2% on EHR/desk work. Another widely cited “event log” study in Annals of Family Medicine framed it even more bluntly: primary care physicians spent nearly 2 hours on EHR tasks per hour of direct patient care.

So yes, I understand why scribes are having a moment. And there’s early evidence that (at least in some settings) ambient scribe platforms can reduce administrative burden and burnout. For example, a multi-centre quality improvement study in JAMA Network Open (263 clinicians across 6 health systems) reported that after 30 days of using an ambient AI scribe, the proportion reporting burnout fell from 51.9% to 38.8%, alongside improvements in cognitive task load and self-reported time spent documenting after hours.

But the more I read and the more I spoke to people in the space, the more I felt the core question wasn’t “can you generate a decent summary?”

It was trust.

The question I kept coming back to

Most hesitation around ambient scribes isn’t just about whether the text reads well. It’s about everything around the text:

  • Where does the data go?
  • Who can inspect what happened?
  • How do you review and correct the output without drowning in review burden?
  • How do you audit what the tool did, when, and why?
  • How does the output get back into real systems in a form that isn’t just another attachment?

If you’re in the NHS world, this framing isn’t hypothetical. NHS England’s guidance on AI-enabled ambient scribing products explicitly raises issues like accuracy, contextual misinterpretation, completeness, and automation bias, and it pushes organisations to treat integration, governance, and safety as first-class work (not a footnote).

So instead of asking “can I build a cloud AI scribe?”, I wanted to test a narrower and slightly awkward question:

Could a useful chunk of an ambient scribe workflow live in the browser, with no project backend at all?

Why the browser is suddenly interesting again

A few years ago, “do it in the browser” would have been a nice demo and not much else.

What changed (or is changing) is that browsers are starting to expose on-device model capabilities as primitives. Chrome’s “Built-in AI” initiative explicitly positions client-side models as a way to offer AI features while protecting sensitive data and improving latency. And the Prompt API gives access to Gemini Nano in the browser (behind an evolving availability story), which is exactly the kind of thing that makes local-first prototypes suddenly feasible.

There are very real caveats. The Prompt API and related Built-in AI APIs have platform and hardware requirements (for example, OS constraints and at least 22 GB free on the volume containing the Chrome profile). Even the setup ergonomics (flags, model download status, uneven rollouts) are still “prototype-shaped”.

But Chrome’s own docs make the on-device angle unambiguous: after the initial model download, subsequent use may be offline, and they state that no data is sent to Google or third parties when using the model.
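In practice that means gating scribe features behind a capability check. Here is a minimal sketch of how that gate might look; the `LanguageModel` global and its `availability()`/`create()` methods follow Chrome’s current Prompt API docs, which are still evolving, so treat the exact surface as an assumption, and `getLocalModel` is my own illustrative wrapper, not the project’s real API:

```javascript
// Sketch: gate on-device drafting behind a feature check, falling back to
// manual notes when no local model is available. The LanguageModel global
// and its method names track Chrome's evolving Prompt API and may differ
// by version -- treat the exact surface as an assumption.
async function getLocalModel(scope = globalThis) {
  const LM = scope.LanguageModel;
  if (!LM || typeof LM.availability !== "function") return null;
  const status = await LM.availability();
  // "available" means the on-device model is downloaded and ready;
  // "downloadable"/"downloading" mean it may become usable later;
  // "unavailable" means it never will be on this device.
  if (status !== "available") return null;
  return LM.create(); // resolves to a session exposing prompt()
}
```

Taking a `scope` parameter keeps the wrapper testable outside the browser; in the page itself you would call it with no arguments.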

That combination, a browser platform people already have plus a plausible on-device generation path, was enough to tempt me into building something concrete.

What I built and what it does

The project is called AI Medical Scribe. It’s open source, and it’s intentionally framed as a prototype.

The simplest way to say what it is: a browser-based front end for capturing a consultation, drafting documentation, and producing a structured handoff, without a project backend.

The current version includes live transcription, manual notes, timeline markers, on-device summarisation and document drafting, structured extraction into clinical “buckets”, confidence-aware review tooling, a local append-only audit log, and client-side FHIR export (with optional direct send to a configured endpoint).

To make the intent explicit: this wasn’t an attempt to build a “clinical tool”. It was an attempt to build a conversation starter that lets you stop arguing in abstractions.

That’s also why the feature set grew beyond transcription + summary. Because if you show clinicians or health IT people “here’s a summary”, the immediate response is rarely “wow, amazing”.

It’s usually:

  • How do I check this quickly?
  • What’s the provenance?
  • What changed since the last edit?
  • How do I prove what happened?
  • How do I get this into the EHR in a structured way?

Those are the real questions.

Local-first is not the same thing as safe

One thing I want to be very careful about here: “local-first” can be a better starting point for trust, but it doesn’t magically solve compliance, security, or governance.

Even if everything is “local”, you still need to answer serious questions about security controls, retention, access, and audit. NHS England’s guidance repeatedly pushes organisations toward concrete governance, information governance, and safety practices rather than vibes-based reassurance.

This is also where it helps to look at other open work in this space, because lots of people are exploring different points in the design space.

For example, OpenScribe is an open source medical scribe that records encounters, transcribes audio, and generates structured draft notes. Their default web deployment is explicitly a hybrid: local Whisper transcription plus Anthropic Claude for note generation, with a separate fully local desktop path. Open Medical Scribe takes a “pluggable providers” approach, including fully local operation (local transcription and local note generation via Ollama), but also optional cloud providers when privacy constraints allow.

My browser-only prototype isn’t “better” than these; it’s a different bet. It’s me asking: “what if the distribution mechanism is literally just a URL, and the trust boundary is as small as possible by default?”

The hard part is still review and handoff

Even if you solve capture and generation, the two hardest problems don’t go away:

**Review burden.** If clinicians have to review everything with the same intensity as writing from scratch, you’re not saving time; you’re just moving the work around.

This isn’t theoretical. Work evaluating AI-generated clinical notes continues to find a tradeoff between thoroughness and risk. One study evaluating ambient LLM-generated notes found hallucinations in 31% of ambient notes vs 20% of clinician “gold” notes (while also noting ambient notes could be more thorough and better organised). That’s exactly the kind of result that makes “review tooling” feel more important than “make the paragraph nicer”.
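That result is what pushed the prototype toward triage rather than uniform review: surface the least-confident extracted items first, so attention goes where errors are most likely. A minimal sketch of the idea, where the item shape and the 0.8 threshold are illustrative choices of mine, not the project’s real schema:

```javascript
// Split extracted items into "needs review" and "provisionally accepted"
// buckets by model confidence, with flagged items sorted worst-first so
// review effort concentrates where hallucinations are most likely.
function triageForReview(items, threshold = 0.8) {
  const flagged = items
    .filter((i) => i.confidence < threshold)
    .sort((a, b) => a.confidence - b.confidence); // worst first
  const accepted = items.filter((i) => i.confidence >= threshold);
  return { flagged, accepted };
}
```

Even something this crude changes the conversation from “review everything” to “where should scarce attention go first?”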

**EHR integration.** If output can’t land back in clinical systems in the right shape, it turns into a sidecar tool, and sidecar tools tend to die in procurement, rollout, or daily workflow friction.

NHS England’s guidance doesn’t mince words on this: it flags the importance of integration with EPR/EHR systems and even calls out standards like FHIR/HL7 in the context of successful adoption.

That feedback is why I added FHIR export to the prototype. Not because “FHIR export solves integration” (it doesn’t), but because it forces the question: what does a structured handoff look like, even in a toy system?

FHIR itself is clear about what a clinical “document” shape means: a Composition alone isn’t a document; it must be the first entry in a Bundle of type document, and the resources it references need to be included in that bundle. HL7’s clinical document guidance is similarly explicit about the “Composition first” pattern for document bundles.
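That constraint is mechanical enough to encode directly. Here is a sketch of the document shape; the field values are placeholders, and a real Composition needs status, type, author, and more before it would validate:

```javascript
// Build a FHIR "document" Bundle: the Composition goes first, and every
// resource it references travels inside the same bundle. Field values here
// are illustrative placeholders, not a complete, valid clinical document.
function buildDocumentBundle(composition, referencedResources) {
  if (composition.resourceType !== "Composition") {
    throw new Error("a FHIR document must start with a Composition");
  }
  return {
    resourceType: "Bundle",
    type: "document",
    entry: [
      { resource: composition }, // Composition must be the first entry
      ...referencedResources.map((resource) => ({ resource })),
    ],
  };
}
```

Encoding the rule as a constructor means the prototype can’t accidentally emit “some text with a FHIR-ish wrapper”; it either produces a document-shaped bundle or fails loudly.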

That kind of structure is useful even as a prototype constraint, because it steers you away from “here’s some text” and toward “here’s something that could plausibly travel”.

What I think this experiment actually proved

Not that browser-based scribes are ready for clinical deployment.

Not that local-first means “safe” by default.

Not that the technology is anywhere near finished.

What it did prove (to me) is something narrower and more actionable:

We’re now close enough that you can build a credible local-first scribe workflow prototype in the browser, and once it exists, the conversations get better.

Instead of debating local-first ambient scribes as an abstract idea, you can point at a working thing and ask concrete questions:

  • Where exactly does data flow in the capture layer?
  • What does review look like when hallucinations are a known failure mode?
  • What does auditability mean for a tool that still needs human sign-off?
  • What does “handoff” mean if integration work is the real wall?

That’s the point of the experiment.

If you want to dig into the code, here are the key links (and a couple of related projects worth reading alongside it):

Everything beyond that (clinical safety cases, full threat models, deployment patterns, and actual EHR integration) is real work, and I’m not pretending a prototype replaces it. But it does make the shape of the problem easier to see.


Written by hutchpd | Open-source .NET developer exploring quantum-inspired programming and experimental computation models.
Published by HackerNoon on 2026/04/03