Beyond Benchmarks: Emulating Hospital Environments for Healthcare AI Testing

Testing healthcare AI in real-world clinical environments

Hassan Jahanandish, PhD, Onboard AI

Hospitals don’t struggle to deploy AI because performance benchmarks are missing. They struggle because benchmarks don’t predict how AI systems behave once embedded in real clinical workflows. We argue that emulating hospital environments, using FHIR-based systems loaded with representative patient data, is a more reliable way to evaluate healthcare AI tools.

In Healthcare AI, Benchmarks Are Great. They’re Just Not Enough.

Benchmarks are great. They’re essential, even. They let us compare models, iterate quickly, and understand raw model capabilities in controlled settings.

But in healthcare, benchmarks don’t tell you how an AI system will behave once it’s deployed inside a hospital.

And that distinction matters.

Benchmarks tell us how models perform in isolation. But hospitals aren’t controlled environments, and most healthcare AI failures happen after deployment, once the model leaves that isolation.

Where Healthcare AI Actually Fails

In practice, most healthcare AI failures aren’t about model intelligence.

More specifically, they’re about whether the system can access the right data at the right time, operate within real clinical workflows, and handle variation across infrastructure, patients, and policies.

Benchmarks rarely test any of this.

The same AI tool can look great on paper and perform very differently across hospitals. An academic medical center and a community hospital live in entirely different operational realities, and AI systems feel that difference immediately.

Benchmarks don’t.

Why Hospitals Run Pilots (and Why That’s a Problem)

This is why hospitals run pilots for almost every AI tool they consider deploying.

A team of clinicians, informaticists, and IT staff spends months figuring out whether the system actually works there. Not in theory. Not in a demo. But in their environment.

Pilots work, but they’re expensive, slow, and hard to standardize. More importantly, they exist because hospitals don’t have better ways to evaluate AI systems under realistic conditions before involving frontline teams.

If You Want Real Answers, Test in a Real Environment

If we want to understand how an AI system will behave in production, we need to test it somewhere that actually looks like production.

That means emulating hospital environments.

One of the most effective ways to do this is to test AI systems inside hospital-like environments such as EHR surfaces exposed through FHIR APIs, loaded with representative patient data. Then you let the AI system do what it’s supposed to do.

Not answer benchmark questions.
Not run scripted demos.
But actually operate.

This is where many evaluations fall short. They stop at model performance and never test whether the system can function reliably once it’s embedded in real clinical infrastructure.

At that point, evaluation shifts from “Is the model accurate?” to “Does this system reliably accomplish its goal?”

That’s a much harder question, and a much more useful one.
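
To make that shift concrete, here is a minimal sketch of outcome-based grading, written in Python. Everything in it is an assumption for illustration: `Sandbox`, `goal_met`, `flaky_agent`, and `success_rate` are hypothetical names, the order is a trimmed FHIR R4-style `MedicationRequest`, and the agent's failure rate is arbitrary. The point is only that the grader inspects what ended up in the record across repeated trials, rather than scoring the model's answer text.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Sandbox:
    """Toy stand-in for an emulated EHR: FHIR-style resources keyed by type."""
    resources: dict = field(default_factory=dict)

    def create(self, resource: dict) -> None:
        self.resources.setdefault(resource["resourceType"], []).append(resource)


def goal_met(env: Sandbox, patient_id: str, drug: str) -> bool:
    """Grade the outcome: exactly one matching order exists for the right patient."""
    matches = [
        r for r in env.resources.get("MedicationRequest", [])
        if r["subject"]["reference"] == f"Patient/{patient_id}"
        and r["medicationCodeableConcept"]["text"] == drug
    ]
    return len(matches) == 1


def flaky_agent(env: Sandbox, patient_id: str, drug: str, rng: random.Random) -> None:
    """Hypothetical agent that sometimes writes the order to the wrong chart."""
    target = patient_id if rng.random() < 0.8 else "someone-else"  # arbitrary rate
    env.create({
        "resourceType": "MedicationRequest",
        "status": "active",
        "intent": "order",
        "subject": {"reference": f"Patient/{target}"},
        "medicationCodeableConcept": {"text": drug},
    })


def success_rate(agent: Callable, trials: int, seed: int = 0) -> float:
    """Run the same task many times, each in a fresh environment."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        env = Sandbox()
        agent(env, "p1", "potassium chloride", rng)
        passed += goal_met(env, "p1", "potassium chloride")
    return passed / trials


print(f"task success rate: {success_rate(flaky_agent, trials=200):.0%}")
```

Resetting the environment for every trial keeps runs independent, and repeating the task is what turns a single "it worked in the demo" into an estimate of reliability.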

Evidence from Realistic Evaluation

This isn’t a hypothetical concern.

Recent work makes this painfully clear.

In MedAgentBench, researchers evaluated LLM-powered agents on simple EHR tasks like retrieving lab results or ordering medications. Despite the tasks being straightforward, the best-performing agents only succeeded about 65–70% of the time.

In a follow-up study, the same group redesigned the agents with better prompts, safer tools, and explicit planning. Success rates jumped to over 90%.

What changed wasn’t the underlying models.

What changed was everything around them.

This is exactly what hospitals see in the real world. AI systems don’t fail because the model isn’t smart enough. They fail because the surrounding systems, workflows, and data realities were never properly tested.

From Benchmarks to Behavior

At Onboard AI, we’re building evaluation infrastructure around this idea.

We set up hospital-like environments (e.g., FHIR-based systems that behave like real EHRs) preloaded with representative patient data from the target health system. AI tools can integrate with these environments directly and be evaluated based on what they actually accomplish.
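
As a rough illustration of the idea (not a description of our actual infrastructure), the sketch below loads a small synthetic FHIR Bundle into an in-memory store and answers a query shaped like `GET /Observation?patient=...&code=...`. `FhirSandbox`, `load_bundle`, `search_observations`, and `potassium_obs` are hypothetical names, the patients and values are made up, and the resources are trimmed to a handful of fields.

```python
class FhirSandbox:
    """Hypothetical in-memory stand-in for a FHIR server."""

    def __init__(self):
        self._store = {}

    def load_bundle(self, bundle):
        """Load every resource from a FHIR Bundle; return how many were loaded."""
        entries = bundle.get("entry", [])
        for entry in entries:
            resource = entry["resource"]
            self._store.setdefault(resource["resourceType"], []).append(resource)
        return len(entries)

    def search_observations(self, patient, code=None):
        """Roughly mimics an Observation search by patient and code, newest first."""
        hits = []
        for obs in self._store.get("Observation", []):
            if obs["subject"]["reference"] != f"Patient/{patient}":
                continue
            codes = {c["code"] for c in obs["code"]["coding"]}
            if code is not None and code not in codes:
                continue
            hits.append(obs)
        # String sort is only safe because all timestamps here share one UTC format.
        return sorted(hits, key=lambda o: o["effectiveDateTime"], reverse=True)


def potassium_obs(obs_id, patient, value, when):
    """Synthetic serum potassium result (LOINC 2823-3), trimmed to key fields."""
    return {
        "resourceType": "Observation",
        "id": obs_id,
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org", "code": "2823-3"}]},
        "subject": {"reference": f"Patient/{patient}"},
        "effectiveDateTime": when,
        "valueQuantity": {"value": value, "unit": "mmol/L"},
    }


bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Patient", "id": "p2"}},
        {"resource": potassium_obs("o1", "p1", 3.1, "2024-03-01T08:00:00Z")},
        {"resource": potassium_obs("o2", "p1", 4.2, "2024-03-02T08:00:00Z")},
        {"resource": potassium_obs("o3", "p2", 5.0, "2024-03-02T09:00:00Z")},
    ],
}

env = FhirSandbox()
env.load_bundle(bundle)
latest = env.search_observations("p1", code="2823-3")[0]
print(latest["id"], latest["valueQuantity"]["value"], latest["valueQuantity"]["unit"])
```

A real emulated environment would sit behind an actual FHIR server and cover far more resource types. The in-memory version is just enough to show the shape of the interaction: load representative data once, then let the tool query it the way it would query an EHR.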

Because the data is curated, these environments can also be used to stress-test systems: surfacing edge cases, workflow breakdowns, and potential biases long before deployment.
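
A stress test of this kind can be sketched in a few lines of Python. The tool under test (`naive_latest_value`), the harness (`stress_test`), and the cases are all hypothetical and deliberately tiny. The observations are trimmed FHIR-style records, and the `status` values used are standard Observation statuses.

```python
def naive_latest_value(observations):
    """Hypothetical tool under test: return the newest result's numeric value."""
    newest = max(observations, key=lambda o: o["effectiveDateTime"])
    return newest["valueQuantity"]["value"]


def obs(when, value=None, status="final"):
    """Build a trimmed, synthetic Observation; omit the value if none is given."""
    record = {"resourceType": "Observation", "status": status, "effectiveDateTime": when}
    if value is not None:
        record["valueQuantity"] = {"value": value, "unit": "mmol/L"}
    return record


# Each curated case pairs a patient's results with the value a safe tool should report.
CASES = {
    "typical": ([obs("2024-03-01", 3.9), obs("2024-03-02", 4.2)], 4.2),
    "no results on file": ([], None),
    "newest entered in error": (
        [obs("2024-03-01", 3.9), obs("2024-03-02", 9.9, "entered-in-error")], 3.9),
    "newest still pending": (
        [obs("2024-03-01", 3.9), obs("2024-03-02", None, "registered")], 3.9),
}


def stress_test(tool, cases):
    """Run the tool on every case and record pass, wrong value, or crash."""
    report = {}
    for name, (observations, expected) in cases.items():
        try:
            got = tool(observations)
            report[name] = "pass" if got == expected else f"wrong value: {got!r}"
        except Exception as exc:
            report[name] = f"crash: {type(exc).__name__}"
    return report


for case, outcome in stress_test(naive_latest_value, CASES).items():
    print(f"{case}: {outcome}")
```

The naive tool handles the typical chart and breaks on the other three, which is the kind of gap a benchmark score would not reveal but a curated environment surfaces immediately.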

Benchmarks still matter. They always will.

But if you’re responsible for deploying AI inside a hospital, benchmark performance is simply table stakes. To understand how AI systems will behave within your environment, and whether they’ll deliver real value, you need to test them in environments that look a lot more like hospitals.

If this is a problem you’re interested in solving, we’d love to talk.
