May 2026 · Health AI diligence checklist

Before You Trust A Health AI Product

A practical screen for products that combine labs, wearables, medical records, and chat.

The health AI demo got easier. Connect a wearable. Pull in labs. Summarize a medical record. Ask a chatbot what the numbers mean. The screen can look useful in five minutes because the market is finally getting better input ports.

That is progress. It is not trust.

The hard question moved. It used to be: can the product get enough health context to say anything useful? Now it is: once the product has the context, what is it allowed to say?

Use this checklist before you buy, deploy, fund, or personally trust a health AI product. It is not medical advice or legal advice. It is the product review I wish I had run harder before building Phreable, a pre-ChatGPT personal health record effort that taught me how quickly fluent health software can outrun its evidence.

The Fast Screen

  1. What exact claim is the product making?
  2. What source did each fact come from?
  3. Which values are measured, derived, inferred, or guessed?
  4. How does the product handle stale, missing, contradictory, or low-quality data?
  5. What does it refuse to answer?
  6. What does it send to a clinician?
  7. What would a user do differently because the product spoke?
  8. What local evidence proves it works for this workflow or this person?
  9. What changes when the model, prompt, retrieval layer, lab panel, sensor, or reference range changes?
  10. Can the user or buyer audit what happened after the fact?

A product that cannot answer those questions may still be useful. It may still be worth testing. But it has not earned the posture the demo is asking you to take.

1. Name The Claim

Do not let "AI health platform" stay intact as a category. Break the product into claims:

Those are not the same claim. They do not carry the same evidence burden. Explaining a lab value is not the same as recommending a medication change. Summarizing a PDF is not the same as deciding what matters in the PDF.

What exactly is the product allowed to do?

If the answer changes between the website, sales demo, disclaimer, and actual product, that is the first finding.

2. Map The Inputs

More health data is not automatically better. A product that connects labs, wearables, medical records, nutrition logs, PDFs, photos, notes, and chat has more surface area for truth. It also has more surface area for error.

Ask for a source map:

The product should not collapse those into one bucket called "your data." "Your LDL-C was 118 mg/dL in a Quest panel on March 4" is different from "your cardiovascular risk looks higher." One is a measured value with a source. The other is an interpretation. Both may be useful. Only one is the fact.
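One way to keep that distinction alive in a data model is to tag every statement with its kind and its source. The sketch below is illustrative only. The `Kind` labels follow the measured / derived / inferred / guessed split from the fast screen; the `Fact` fields and the `is_source_bound` helper are hypothetical names, not any vendor's schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    MEASURED = "measured"   # reported directly by a lab or device
    DERIVED = "derived"     # computed from measured values
    INFERRED = "inferred"   # an interpretation by a model or rule
    GUESSED = "guessed"     # no traceable basis


@dataclass(frozen=True)
class Fact:
    statement: str
    kind: Kind
    source: Optional[str] = None       # hypothetical field: where the value came from
    observed_on: Optional[str] = None  # hypothetical field: date as reported by the source


def is_source_bound(fact: Fact) -> bool:
    """True only for a measured value that names both its source and its date."""
    return (
        fact.kind is Kind.MEASURED
        and fact.source is not None
        and fact.observed_on is not None
    )


ldl = Fact("LDL-C was 118 mg/dL", Kind.MEASURED, source="Quest panel", observed_on="March 4")
risk = Fact("Cardiovascular risk looks higher", Kind.INFERRED)

print(is_source_bound(ldl))   # True
print(is_source_bound(risk))  # False
```

The interpretation is not rejected here. It is just never allowed to pass as the fact.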

This matters more as the input floor improves. Google is moving Fitbit data, medical-record summaries, Health Connect, and Apple Health connections into its health surfaces. WHOOP has announced clinician access, EHR syncing, AI features, and lab offerings. ONC's TEFCA work is also meant to reduce barriers to health-record exchange. [1] Better aggregation raises the diligence bar. It does not remove it.

3. Treat Labs As Structured Data, Not Magic

Lab data is attractive because it is already structured. A result usually has a test name, value, unit, timestamp, and reference interval. Sometimes it has a LOINC code, a flag, or a short comment. [2] That structure is valuable. It is also easy to abuse.

Ask:

The reference range is the trap. "Normal" is a statistical statement about a reference population and an assay. It is not a statement that this value is ideal for this person. TSH, vitamin D, ferritin, testosterone, LDL-C, and other markers have lived through fights over cutoffs, targets, and interpretation. That does not make lab data useless. It makes the interpretation layer the product.

A trustworthy product should be able to say: this value is inside this lab's range; it has moved from your prior baseline; the clinical meaning is uncertain without context; here is what would change the interpretation.
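A minimal sketch of that kind of statement, assuming a simplified result record. The `LabResult` fields and the `describe` function are hypothetical, and the numbers in the example are placeholders, not a real assay or a real reference interval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LabResult:
    test_name: str
    value: float
    unit: str
    ref_low: Optional[float] = None   # lower bound of this lab's reference interval
    ref_high: Optional[float] = None  # upper bound of this lab's reference interval


def describe(result: LabResult, prior: Optional[float] = None) -> str:
    """Report range position and movement from baseline without claiming clinical meaning."""
    parts = []
    if result.ref_low is None and result.ref_high is None:
        parts.append("no reference interval was reported")
    elif result.ref_low is not None and result.value < result.ref_low:
        parts.append("below this lab's reference interval")
    elif result.ref_high is not None and result.value > result.ref_high:
        parts.append("above this lab's reference interval")
    else:
        parts.append("inside this lab's reference interval")

    if prior is None:
        parts.append("no prior value to compare against")
    else:
        delta = result.value - prior
        parts.append(f"changed by {delta:+g} {result.unit} from your prior value")

    parts.append("clinical meaning is uncertain without context")
    return f"{result.test_name} {result.value:g} {result.unit}: " + "; ".join(parts) + "."


# Placeholder values for illustration only.
example = LabResult("Example marker", 42.0, "units", ref_low=10.0, ref_high=50.0)
print(describe(example, prior=30.0))
# Example marker 42 units: inside this lab's reference interval; changed by +12 units from your prior value; clinical meaning is uncertain without context.
```

Note what the function never outputs: "normal," "optimal," or a color.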

That is less exciting than a giant green score. Good.

4. Treat Wearables As Instruments With Limits

Wearable data is not raw biology. It is sensor data plus algorithms.

Resting heart rate, HRV, sleep stages, strain, readiness, recovery, stress, blood pressure estimates, and energy scores do not all have the same evidentiary status. Some are closer to measurement. Some are model outputs. Some are composite scores with undocumented weights.

Ask:

The last question matters most. Composite scores are authority interfaces. A health AI product that imports those scores without source quality, confidence, or local validation is not reasoning from the body. It is outsourcing authority to another black box.

5. Show The Review Burden

Every health AI answer creates review work. Someone has to know whether the answer is source-bound, current, appropriate, and safe to act on. That reviewer may be the user, a clinician, a coach, a buyer's governance team, or nobody.

"Nobody" is still a review model. It is just a bad one.

Ask:

The product may be safer if it says less. It may be more useful if it says more. The right answer depends on the claim, workflow, and review path. The wrong answer is pretending the review burden disappeared because the interface got prettier.

6. Demand Change Control

Health AI products change under your feet. The model changes. The prompt changes. The retrieval layer changes. The lab panel changes. The wearable firmware changes. The score formula changes. The medical-record connection changes. The clinician-review workflow changes.

Ask:

In medical-device software, regulators have been moving toward predetermined change-control concepts for AI-enabled software. Consumer health products may sit outside that regime, but the product problem is the same. [3] If a health AI system can change its behavior, the buyer or user needs to know which changes matter.
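At minimum, a buyer can ask whether the product records a version for each moving part and can say what differed between two answers. The sketch below assumes a hypothetical per-answer manifest. The component names and version strings are illustrative, and this is not a description of any regulatory change-control plan.

```python
from typing import Dict, List

# Moving parts named in this section. The keys are illustrative, not a standard.
TRACKED = (
    "model",
    "prompt",
    "retrieval",
    "lab_panel",
    "wearable_firmware",
    "score_formula",
)


def changed_components(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return the tracked components whose recorded version differs between two manifests."""
    return [name for name in TRACKED if before.get(name) != after.get(name)]


# Hypothetical manifests stored alongside two answers.
january = {"model": "m-1", "prompt": "p-7", "retrieval": "r-2", "score_formula": "s-1"}
march = {"model": "m-2", "prompt": "p-7", "retrieval": "r-2", "score_formula": "s-1"}

print(changed_components(january, march))  # ['model']
```

A component missing from both manifests shows up as unchanged here. In a real review that silence is itself a finding: the product is not recording that version at all.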

7. Test The Ugly Cases

Do not test the clean demo. Test the cases health products actually meet:

The question is not whether the model can produce a good answer once. The question is whether the product knows which kind of answer it is allowed to produce when the input is ugly.

8. Ask What The User Does Next

Health AI does not fail only by being wrong. It can fail by being too confident, too vague, too anxious, too passive, or too eager to turn every number into a protocol.

What does the user do differently because this product spoke?

If the answer is "nothing," the product may be a toy or a nice explainer. If the answer is "changes behavior," the product needs a proof loop. If the answer is "changes medical care," the product needs clinical governance.

If the answer is "we are not responsible for what the user does," the product is still shaping behavior. The disclaimer does not make the influence disappear.

9. Require A Local Proof Plan

Benchmarks are not local proof. A health AI product can do well on a general benchmark and still fail in your specialty, EHR, population, language mix, note quality, lab panels, or user base.

Ask for a pilot plan that names:

For a consumer product, run the same logic at one-person scale: what are you trying to learn, what data is strong enough to use, what action will you take, what result would make you stop, what result would make you ask a clinician, and what result would prove the product was not useful for this question?
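The one-person version of that loop fits in a small record. This is a sketch with hypothetical field names that mirror the six questions above. The only logic is a check for which questions are still unanswered.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ProofPlan:
    question: str = ""           # what are you trying to learn
    usable_data: str = ""        # what data is strong enough to use
    action: str = ""             # what action will you take
    stop_result: str = ""        # what result would make you stop
    clinician_result: str = ""   # what result would make you ask a clinician
    not_useful_result: str = ""  # what result would prove the product was not useful


def unanswered(plan: ProofPlan) -> List[str]:
    """List the plan fields that are still blank, in the order they are defined."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# Illustrative, partially filled plan.
plan = ProofPlan(
    question="Does the sleep summary match how I feel?",
    action="Move bedtime earlier for two weeks",
)
print(unanswered(plan))
# ['usable_data', 'stop_result', 'clinician_result', 'not_useful_result']
```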

Without that loop, more information becomes more confident guessing.

10. Keep The Exit Door Visible

The buyer or user should be able to leave.

Ask:

The product that wants to become your health memory should be unusually clear about how memory ends.

A Simple Score

Green: The product names its claims, preserves source provenance, separates measured values from interpretations, exposes uncertainty, records versions, defines review paths, and has a local proof loop.

Yellow: The product may be useful, but the trust boundary is incomplete. Narrow the use case, require evidence, or keep the product in an advisory role.

Red: The product cannot say where facts came from, treats derived scores as truth, blurs wellness and clinical claims, hides model or update behavior, gives action without a review path, or relies on disclaimers to carry the safety burden.
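The three bands can be written down as a rule. This is a sketch, not a validated rubric. The criterion and flag names are hypothetical labels for the items listed above, and the precedence (any red flag wins, green requires every criterion) is an assumption.

```python
from typing import Set

GREEN_CRITERIA = frozenset({
    "names_claims",
    "source_provenance",
    "separates_measured_from_interpreted",
    "exposes_uncertainty",
    "records_versions",
    "defines_review_paths",
    "local_proof_loop",
})

RED_FLAGS = frozenset({
    "no_fact_sources",
    "derived_scores_as_truth",
    "blurs_wellness_and_clinical",
    "hides_model_or_updates",
    "action_without_review",
    "relies_on_disclaimers",
})


def score(criteria_met: Set[str], flags_found: Set[str]) -> str:
    """Assumed precedence: any red flag is red; all green criteria is green; else yellow."""
    if flags_found & RED_FLAGS:
        return "red"
    if GREEN_CRITERIA <= criteria_met:
        return "green"
    return "yellow"


print(score(set(GREEN_CRITERIA), set()))                       # green
print(score({"names_claims", "source_provenance"}, set()))     # yellow
print(score(set(GREEN_CRITERIA), {"relies_on_disclaimers"}))   # red
```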

What To Ask For

If you are buying, funding, or diligencing the product, ask for these artifacts:

If the vendor cannot produce them, that is information.

The Phreable Lesson

Phreable started with the right instinct and the wrong center of gravity. The instinct was that people needed a better way to understand their health against the life they were trying to preserve. The wrong assumption was that enough data plus a fluent model would become intelligence.

It does not.

Data access is the first problem. Source quality is the second. Claim authority is the third. Behavior change is the fourth. Review is the fifth. Change control is the sixth.

The new health AI stack is solving the first problem faster than I expected. That makes the rest more important, not less.


  1. Market context: Google Health and Fitbit announcements on health data aggregation and Health Coach; WHOOP announcements on clinician access, EHR syncing, AI features, Advanced Labs, and Specialized Panels; ONC's TEFCA page.
  2. Lab context: Regenstrief's LOINC overview and research on reference intervals in clinical laboratory interpretation. QuestHealth and Function Health are useful examples of consumer lab access and interpretation surfaces; see QuestHealth, the QuestHealth FAQ, Function Health, and Function Health's "How it works" page.
  3. Change-control context: FDA pages on AI-enabled software as a medical device and predetermined change-control plans.