May 2026 · Health AI diligence checklist

Before You Trust A Health AI Product

A practical screen for products that combine labs, wearables, medical records, and chat.

The health AI demo got easier. Connect a wearable. Pull in labs. Summarize a medical record. Ask a chatbot what the numbers mean. The screen can look useful in five minutes because the market is finally getting better input ports.

That is progress. It is not trust.

The hard question moved. It used to be: can the product get enough health context to say anything useful? Now it is: once the product has the context, what is it allowed to say?

Use this checklist before you buy, deploy, fund, or personally trust a health AI product. It is not medical advice or legal advice. It is the product review I wish I had run harder before building Phreable, a pre-ChatGPT personal health record effort that taught me how quickly fluent health software can outrun its evidence.

The Fast Screen

What exact claim is the product making?
What source did each fact come from?
Which values are measured, derived, inferred, or guessed?
How does the product handle stale, missing, contradictory, or low-quality data?
What does it refuse to answer?
What does it send to a clinician?
What would a user do differently because the product spoke?
What local evidence proves it works for this workflow or this person?
What changes when the model, prompt, retrieval layer, lab panel, sensor, or reference range changes?
Can the user or buyer audit what happened after the fact?

A product that cannot answer those questions may still be useful. It may still be worth testing. But it has not earned the posture the demo is asking you to take.

1. Name The Claim

Do not let "AI health platform" stay intact as a category. Break the product into claims:

summarizes medical records
explains lab values
combines labs with wearable data
suggests workouts, sleep changes, supplements, or food changes
drafts patient messages or clinical notes
supports diagnosis or treatment choice
writes back to an EHR
routes a patient to a clinician

Those are not the same claim. They do not carry the same evidence burden. Explaining a lab value is not the same as recommending a medication change. Summarizing a PDF is not the same as deciding what matters in the PDF.

What exactly is the product allowed to do?

If the answer changes between the website, sales demo, disclaimer, and actual product, that is the first finding.

2. Map The Inputs

More health data is not automatically better. A product that connects labs, wearables, medical records, nutrition logs, PDFs, photos, notes, and chat has more surface area for truth. It also has more surface area for error.

Ask for a source map:

Which EHR fields are read?
Which fields are written back?
Which records are summaries rather than source documents?
Which values came from a lab result, wearable algorithm, user entry, uploaded PDF, or image?
Which values are derived by the product?
Which values are inferred by the model?
Which values have timestamps, units, and inspectable provenance?

The product should not collapse those into one bucket called "your data." "Your LDL-C was 118 mg/dL in a Quest panel on March 4" is different from "your cardiovascular risk looks higher." One is a measured value with a source. The other is an interpretation. Both may be useful. Only one is a fact.
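One way to keep those buckets apart is to make the kind of value part of the record itself. This is a minimal sketch, not any vendor's schema: the `Observation` type, the `Kind` labels, and every field name are my own assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Kind(Enum):
    MEASURED = "measured"          # reported by a lab or sensor
    DERIVED = "derived"            # computed by the product from measured values
    INFERRED = "inferred"          # produced by a model
    USER_ENTERED = "user_entered"  # typed in by the user


@dataclass(frozen=True)
class Observation:
    """Hypothetical record; field names are illustrative assumptions."""
    name: str
    value: str                     # kept as text so statements and numbers share one shape
    kind: Kind
    source: Optional[str] = None
    unit: Optional[str] = None
    observed_on: Optional[date] = None

    def is_sourced_fact(self) -> bool:
        # Only a measured value with a source, a unit, and a date counts as a fact.
        return (
            self.kind is Kind.MEASURED
            and self.source is not None
            and self.unit is not None
            and self.observed_on is not None
        )


# The year is a placeholder; the example in the text only says March 4.
ldl = Observation("LDL-C", "118", Kind.MEASURED, "Quest panel", "mg/dL", date(2026, 3, 4))
risk = Observation("cardiovascular risk", "looks higher", Kind.INFERRED)
```

With a shape like this, the interface can refuse to render `risk` in the same visual style as `ldl`, because the record itself says which one is an interpretation.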

This matters more as the input floor improves. Google is moving Fitbit data, medical-record summaries, Health Connect, and Apple Health connections into its health surfaces. WHOOP has announced clinician access, EHR syncing, AI features, and lab offerings. ONC's TEFCA work is also meant to reduce barriers to health-record exchange.¹ Better aggregation raises the diligence bar. It does not remove it.

3. Treat Labs As Structured Data, Not Magic

Lab data is attractive because it is already structured. A result usually has a test name, value, unit, timestamp, and reference interval. Sometimes it has a LOINC code, a flag, or a short comment.² That structure is valuable. It is also easy to abuse.

Ask:

Does the product preserve the lab, unit, and reference interval?
Does it distinguish the lab's range from a guideline threshold?
Does it distinguish "normal" from "optimal"?
Does it show whether the value was directly measured or calculated?
Does it handle unit conversion visibly?
Does it compare against the user's prior values, not just the population range?
Does it say when two labs use different assays or reference intervals?
Does it make clear when no clinician has reviewed the result?

The reference range is the trap. "Normal" is a statistical statement about a reference population and an assay. It is not a statement that this value is ideal for this person. TSH, vitamin D, ferritin, testosterone, LDL-C, and other markers have lived through fights over cutoffs, targets, and interpretation. That does not make lab data useless. It makes the interpretation layer the product.

A trustworthy product should be able to say: this value is inside this lab's range; it has moved from your prior baseline; the clinical meaning is uncertain without context; here is what would change the interpretation.
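A rough sketch of what that sentence looks like as code, under my own assumptions: the `LabResult` shape and the `describe` function are hypothetical, and the numbers in the usage note are invented for illustration, not clinical thresholds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabResult:
    """Hypothetical shape; real lab feeds vary by lab and interface."""
    test: str
    value: float
    unit: str
    ref_low: float
    ref_high: float
    lab: str


def describe(result: LabResult, prior: Optional[LabResult] = None) -> List[str]:
    """Return source-bound statements only: no 'optimal', no advice."""
    in_range = result.ref_low <= result.value <= result.ref_high
    position = "inside" if in_range else "outside"
    lines = [
        f"{result.test} {result.value:g} {result.unit} is {position} "
        f"{result.lab}'s range ({result.ref_low:g}-{result.ref_high:g} {result.unit})."
    ]
    if prior is None:
        lines.append("No prior value to compare against.")
    elif prior.unit != result.unit:
        # Do not compare silently across units; any conversion must be visible.
        lines.append(f"Prior value is in {prior.unit}; not compared without a visible conversion.")
    else:
        delta = result.value - prior.value
        lines.append(f"Change from prior value: {delta:+g} {result.unit}.")
        if (prior.lab, prior.ref_low, prior.ref_high) != (result.lab, result.ref_low, result.ref_high):
            lines.append("Prior result used a different lab or reference interval.")
    lines.append("Clinical meaning is uncertain without context.")
    return lines
```

Calling `describe(LabResult("Ferritin", 40, "ng/mL", 30, 400, "Lab A"), LabResult("Ferritin", 120, "ng/mL", 30, 400, "Lab A"))` reports a value inside the range and a change of -80 ng/mL. The function never says whether that is good or bad. That part belongs to the interpretation layer and its review path.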

That is less exciting than a giant green score. Good.

4. Treat Wearables As Instruments With Limits

Wearable data is not raw biology. It is sensor data plus algorithms.

Resting heart rate, HRV, sleep stages, strain, readiness, recovery, stress, blood pressure estimates, and energy scores do not all have the same evidentiary status. Some are closer to measurement. Some are model outputs. Some are composite scores with undocumented weights.

Ask:

Which wearable values are raw or near-raw?
Which values are derived?
Which values are proprietary composite scores?
What happens when the device is not worn?
What happens when sleep is misdetected?
What happens when exercise type is wrong?
Does the product expose confidence or data completeness?
Does it know when the wearable disagrees with a lab, symptom, or user report?
Does it treat a score as an input or as an authority?

The last question matters most. Composite scores are authority interfaces. A health AI product that imports those scores without source quality, confidence, or local validation is not reasoning from the body. It is outsourcing authority to another black box.
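Treating a score as an input rather than an authority can be made explicit at import time. The sketch below is hypothetical throughout: the `WearableScore` fields, the three labels, and the 0.8 wear threshold are placeholders I chose, not validated cutoffs or any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WearableScore:
    """Hypothetical import record for a vendor composite score."""
    name: str
    value: float
    worn_fraction: float       # share of the scoring window with valid sensor data
    formula_documented: bool   # does the vendor publish how the score is built?


def usable_as(score: WearableScore, min_worn: float = 0.8) -> str:
    """Classify how far downstream logic may lean on the score.

    The default threshold is an arbitrary placeholder, not a validated cutoff.
    """
    if not 0.0 <= score.worn_fraction <= 1.0:
        raise ValueError("worn_fraction must be between 0 and 1")
    if score.worn_fraction < min_worn:
        return "ignore"         # too little data to say anything
    if not score.formula_documented:
        return "context_only"   # may be shown, must not drive a recommendation
    return "input"              # one signal among several, never an authority
```

There is deliberately no "authority" label. The strongest thing a composite score can be in this sketch is one input among several.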

5. Show The Review Burden

Every health AI answer creates review work. Someone has to know whether the answer is source-bound, current, appropriate, and safe to act on. That reviewer may be the user, a clinician, a coach, a buyer's governance team, or nobody.

"Nobody" is still a review model. It is just a bad one.

Ask:

Who reviews the output before action?
What outputs can reach a patient directly?
What outputs can reach a clinician directly?
What outputs can be written into the chart?
What outputs can trigger a recommendation, bill, order, referral, or follow-up?
What does the product do when the reviewer disagrees?
What is logged and auditable?

The product may be safer if it says less. It may be more useful if it says more. The right answer depends on the claim, workflow, and review path. The wrong answer is pretending the review burden disappeared because the interface got prettier.
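A review path can be written down as a small policy table instead of living in someone's head. This sketch assumes made-up output types and a single "clinician" reviewer role; which outputs really need review is a governance decision, not something the code can settle.

```python
from typing import Dict, Optional

# Hypothetical policy table: output type -> required reviewer role.
# None means the output may go out unreviewed, and the table says so in writing.
REVIEW_POLICY: Dict[str, Optional[str]] = {
    "lab_explanation": None,
    "patient_message_draft": "clinician",
    "chart_write": "clinician",
    "order_or_referral": "clinician",
}


def release(output_type: str, approved_by: Optional[str] = None) -> bool:
    """Return True only if the policy allows this output to leave the product."""
    if output_type not in REVIEW_POLICY:
        return False  # unknown output types are blocked by default
    required = REVIEW_POLICY[output_type]
    return required is None or approved_by == required
```

The useful property is the default. An output type nobody thought about is blocked until someone adds it to the table, which makes "nobody reviews this" a written decision rather than an accident.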

6. Demand Change Control

Health AI products change under your feet. The model changes. The prompt changes. The retrieval layer changes. The lab panel changes. The wearable firmware changes. The score formula changes. The medical-record connection changes. The clinician-review workflow changes.

Ask:

Which changes are logged?
Which changes are visible to users?
Which changes require revalidation, buyer notice, or approval?
Which changes affect historical comparisons?
Can a prior answer be reproduced?
Can a user see which version produced it?
Can the vendor roll back a harmful change?

In medical-device software, regulators, notably the FDA, have been moving toward predetermined change-control plans for AI-enabled software. Consumer health products may sit outside that regime, but the product problem is the same.³ If a health AI system can change its behavior, the buyer or user needs to know which changes matter.
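Reproducing a prior answer starts with recording what produced it. Here is a minimal sketch, assuming a hypothetical `AnswerRecord` stored next to every answer; the field names and version strings are illustrative, and a real product would likely track more components than these four.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List

VERSION_FIELDS = (
    "model_version",
    "prompt_version",
    "retrieval_version",
    "reference_range_version",
)


@dataclass(frozen=True)
class AnswerRecord:
    """Hypothetical audit record stored next to every answer."""
    model_version: str
    prompt_version: str
    retrieval_version: str
    reference_range_version: str
    input_hash: str   # hash of the source data the answer was built from
    answer: str


def fingerprint(record: AnswerRecord) -> str:
    """Hash everything that could change the answer, but not the answer itself."""
    config = asdict(record)
    del config["answer"]
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_components(old: AnswerRecord, new: AnswerRecord) -> List[str]:
    """Name the versioned components that differ between two answers."""
    return [name for name in VERSION_FIELDS if getattr(old, name) != getattr(new, name)]
```

Two answers with the same fingerprint came from the same inputs and the same stack. If their text still differs, the product has a nondeterminism problem to explain. If the fingerprints differ, `changed_components` says where to start looking.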

7. Test The Ugly Cases

Do not test the clean demo. Test the cases health products actually meet:

a lab value with a unit conversion problem
a result inside range but far from the user's prior baseline
two labs with different reference intervals
a wearable recovery score that disagrees with symptoms
a copied-forward diagnosis that is no longer true
a medication list with stale entries
a PDF with OCR errors
a patient message that includes anxiety and a real warning sign
a missing clinician comment
a clinician comment that contradicts the generic product explanation
an abnormal result released before the ordering doctor has reviewed it
a user asking whether to change medication, supplements, training, or food intake

The question is not whether the model can produce a good answer once. The question is whether the product knows which kind of answer it is allowed to produce when the input is ugly.
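The ugly cases work best as a fixed suite that runs again after every change. A sketch under stated assumptions: the answer kinds on the right are my illustrative labels, and in practice the buyer or a clinician decides which kind is allowed for each case. `always_explains` is a stand-in for a product that answers everything in the same fluent way.

```python
from typing import Callable, List, Sequence, Tuple

# (case description, answer kind the reviewer decided is allowed)
# The allowed kinds are illustrative; the buyer or a clinician sets them.
UGLY_CASES: List[Tuple[str, str]] = [
    ("lab value with a unit conversion problem", "flag_data_issue"),
    ("result inside range but far from prior baseline", "explain_with_uncertainty"),
    ("wearable recovery score that disagrees with symptoms", "explain_with_uncertainty"),
    ("user asks whether to change medication", "escalate_to_clinician"),
]


def run_suite(
    product: Callable[[str], str],
    cases: Sequence[Tuple[str, str]] = UGLY_CASES,
) -> List[str]:
    """Return the cases where the product produced a kind it was not allowed to."""
    return [case for case, allowed in cases if product(case) != allowed]


def always_explains(case: str) -> str:
    # Stand-in for a product that answers every input the same fluent way.
    return "explain_with_uncertainty"


failures = run_suite(always_explains)
```

The stand-in passes the two cases where explaining is allowed and fails the two where it should have flagged the data or escalated. A fluent answer in the wrong category still counts as a failure.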

8. Ask What The User Does Next

Health AI does not fail only by being wrong. It can fail by being too confident, too vague, too anxious, too passive, or too eager to turn every number into a protocol.

What does the user do differently because this product spoke?

If the answer is "nothing," the product may be a toy or a nice explainer. If the answer is "changes behavior," the product needs a proof loop. If the answer is "changes medical care," the product needs clinical governance.

If the answer is "we are not responsible for what the user does," the product is still shaping behavior. The disclaimer does not make the influence disappear.

9. Require A Local Proof Plan

Benchmarks are not local proof. A health AI product can do well on a general benchmark and still fail in your specialty, EHR, population, language mix, note quality, lab panels, or user base.

Ask for a pilot plan that names:

the exact use case
the source data
the output type
the reviewer
the success metric
the failure metric
the escalation path
the stop condition
the update policy
the post-pilot decision rule
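That list can double as an intake form. In this sketch the `PilotPlan` field names are my shorthand for the ten items above, and `missing_fields` simply reports which ones the vendor left blank.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class PilotPlan:
    """Hypothetical intake form; one field per item a pilot plan should name."""
    use_case: str = ""
    source_data: str = ""
    output_type: str = ""
    reviewer: str = ""
    success_metric: str = ""
    failure_metric: str = ""
    escalation_path: str = ""
    stop_condition: str = ""
    update_policy: str = ""
    decision_rule: str = ""


def missing_fields(plan: PilotPlan) -> List[str]:
    """List every item the plan leaves blank or fills with whitespace."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]
```

A plan that names a use case and a reviewer but nothing else comes back with eight gaps, including the stop condition. The pilot should not start until that list is empty.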

For a consumer product, run the same logic at one-person scale: what are you trying to learn, what data is strong enough to use, what action will you take, what result would make you stop, what result would make you ask a clinician, and what result would prove the product was not useful for this question?

Without that loop, more information becomes more confident guessing.

10. Keep The Exit Door Visible

The buyer or user should be able to leave.

Ask:

Can data be exported?
Can source documents be removed?
Can model memory be reset?
Can generated summaries be deleted?
Can EHR connections be revoked?
Can the user tell what was sent to third parties?
Can the buyer audit vendor access?
Can the product separate account deletion from clinical-record retention requirements?

The product that wants to become your health memory should be unusually clear about how memory ends.

A Simple Score

Green: The product names its claims, preserves source provenance, separates measured values from interpretations, exposes uncertainty, records versions, defines review paths, and has a local proof loop.

Yellow: The product may be useful, but the trust boundary is incomplete. Narrow the use case, require evidence, or keep the product in an advisory role.

Red: The product cannot say where facts came from, treats derived scores as truth, blurs wellness and clinical claims, hides model or update behavior, gives action without a review path, or relies on disclaimers to carry the safety burden.
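The three colors reduce to a small rule if you accept my reading of them: any single red flag makes the product red, green requires every green criterion, and everything else is yellow. The criterion names below are my shorthand for the phrases above, not a standard vocabulary.

```python
from typing import AbstractSet

GREEN_CRITERIA = frozenset({
    "names_claims",
    "preserves_provenance",
    "separates_measured_from_interpreted",
    "exposes_uncertainty",
    "records_versions",
    "defines_review_paths",
    "has_local_proof_loop",
})

RED_FLAGS = frozenset({
    "cannot_source_facts",
    "derived_scores_as_truth",
    "blurs_wellness_and_clinical",
    "hides_model_or_updates",
    "action_without_review",
    "relies_on_disclaimers",
})


def score(met: AbstractSet[str], flags: AbstractSet[str]) -> str:
    """Return 'green', 'yellow', or 'red' for one product review."""
    unknown = (met - GREEN_CRITERIA) | (flags - RED_FLAGS)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    if flags:
        return "red"      # one red flag is enough, whatever else is in place
    if met == GREEN_CRITERIA:
        return "green"    # every criterion met and no red flags
    return "yellow"
```

Red takes precedence on purpose. A product that meets all seven green criteria but relies on disclaimers to carry the safety burden still scores red.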

What To Ask For

If you are buying, funding, or running diligence on the product, ask for these artifacts:

claim inventory
source map
lab interpretation policy
wearable data-quality policy
model and retrieval evaluation summary
hallucination and refusal test results
clinician-review workflow
EHR read and write field list
change-control policy
incident and complaint workflow
local pilot scorecard
export, deletion, and access-revocation policy

If the vendor cannot produce them, that is information.

The Phreable Lesson

Phreable started with the right instinct and the wrong center of gravity. The instinct was that people needed a better way to understand their health against the life they were trying to preserve. The wrong assumption was that enough data plus a fluent model would become intelligence.

It does not.

Data access is the first problem. Source quality is the second. Claim authority is the third. Behavior change is the fourth. Review is the fifth. Change control is the sixth.

The new health AI stack is solving the first problem faster than I expected. That makes the rest more important, not less.

¹ Market context: Google Health and Fitbit announcements on health data aggregation and Health Coach; WHOOP announcements on clinician access, EHR syncing, AI features, Advanced Labs, and Specialized Panels; ONC's TEFCA page.
² Lab context: Regenstrief's LOINC overview and research on reference intervals in clinical laboratory interpretation. QuestHealth and Function Health are useful examples of consumer lab access and interpretation surfaces; see QuestHealth, the QuestHealth FAQ, Function Health, and Function Health's "How it works" page.
³ Change-control context: FDA pages on AI-enabled software as a medical device and predetermined change-control plans.