The Human in the Loop

The honest over-engineering story behind an HR-guided steady-ride trainer app.

1. The Itch

I was riding Zone 2 on a smart trainer at 180 watts. My heart rate started at 132 bpm. Twenty minutes later it was 142 bpm. Same trainer. Same legs. Same power. Different physiological cost.

That is cardiac drift. You can call it plasma volume shifts, thermoregulation, sympathetic drive, stroke volume decay, or just what every rider already feels in their chest after a long steady block. The number on the trainer stays flat. The number in your body does not.

ERG mode does not care. It holds 180 watts because that is what it was told to hold. It treats the rider like the last flexible part of a lathe. Put in the watts, get out the workout. From the trainer's point of view, the human is just a compliance problem.

That felt wrong to me. Not philosophically wrong. Mechanically wrong. I am a biomedical engineer. When I see a system with one half modeled in detail and the other half ignored, I start wanting to fix it the way a mechanic wants to chase a knock in an engine before it turns into a thrown rod.

The gap was obvious once I saw it. Smart trainers can control power with stupid precision. The rider's heart rate is the thing that tells you what stress the body is actually experiencing. The two talk to each other with a long, messy delay. Commercial apps close the loop around the machine and leave the human open loop.

That is why the idea would not leave me alone. What if the trainer responded to my heart rate instead of a fixed power target? What if Zone 2 meant my actual Zone 2 on this day, in this room, with this level of fatigue and this amount of cardiac drift, instead of some wattage that was right forty minutes ago or on a better day?

The market only made the itch worse. TrainerRoad had the price point and a workout library measured in tens of millions. Zwift had scale, about a million subscribers, and money to light on fire. MyWhoosh was free and backed by the UAE. All of them were better platforms than anything I could build alone. None of them were doing the one thing that looked technically interesting to me. None of them were closing the loop on heart rate.

That mattered because the physiology literature keeps making the same point in different ways. Power is a proxy for internal load. Heart rate is one internal-load signal. At 60% VO2max over 45 minutes, heart rate can rise 17% while power stays fixed. After two hours at 90% of VT1, a rider can drift from 142 bpm to 151 bpm while the power associated with that effort falls from 217 watts to 196 watts. Every power-based app logs that as one stable workout. Your body does not.

The shipped product claim is narrower than the original itch. HR-guided resistance belongs on supported steady work when the trainer has usable headroom above its floor. Short intervals stay power-led with heart-rate guardrails, and floor-limited rides are labeled instead of dressed up as clean control.

So I said the dangerous sentence. I can model this.

That sentence usually means one of two things. Either you are about to find a clean abstraction and ship something simple. Or you are about to disappear into a rabbit hole because the thing you think is one system is actually three systems wearing a trench coat.

Zone Pedal became the second kind.

2. Down the Rabbit Hole

On February 14, 2026, the first live controller was exactly what any engineer would build on day one. A conventional PID loop. Sample heart rate. Compare it to the target. Push power up or down. Add some ramp limiting and a max-HR stop so nobody does anything too stupid. It was the obvious first draft.

It was also wrong for the same reason a new manager is wrong the first time they trust every contractor who says, "it's fine." Heart rate is late. By the time it tells you what your last power command did, you have already stacked several more power commands on top of it.

The published cycling data made that plain. The control-oriented plant is first-order plus dead time. Mean dead time is about 13.8 seconds. Typical time constant is on the order of 50 to 70 seconds. That means you can make a power change right now and not see the physiological bill for it until several control ticks later. If your controller is impatient, it is not controlling the rider. It is arguing with the ghost of what the rider was doing twenty seconds ago.

My first controller argued with ghosts.

The March 9, 2026 ride is where that became undeniable. The ride lasted 5.1 minutes. Within the 2.3-minute failed section, power swung from 50 watts to 107 watts and back again. Sixteen oscillation cycles. The logs showed the controller making correction after correction while heart rate barely moved, then all the delayed response arriving at once. It was unrideable.

The uglier part was how it failed. The early live controller targeted the midpoint of the zone, not the edges. In a Zone 1 band of 122 to 140 bpm, a perfectly acceptable heart rate of 139 bpm looked like an error of 131 - 139 = -8 bpm. The controller thought I was too high while I was still in zone. Then an overshoot guard piled a second correction on top of the first. The logs caught the exact absurdity: a slope of 0.0002 bpm/s triggered a minimum 6 watt correction every tick. At 4 Hz, that is 24 watts per second of panic over a signal that was effectively flat. Power collapsed from 105 watts to the 50 watt floor in 14 seconds while heart rate sat there unchanged. Then it ramped back in when HR finally dropped by a beat or two.
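The arithmetic in that failure is easy to reproduce. A minimal sketch, assuming the error is defined as setpoint minus heart rate; the function names are illustrative and are not taken from the app's code.

```python
def midpoint_error(hr, zone_min, zone_max):
    """Error as a midpoint-chasing controller sees it: zone midpoint minus HR."""
    return (zone_min + zone_max) / 2 - hr


def edge_error(hr, zone_min, zone_max):
    """Error measured from the nearest zone edge; zero anywhere inside the zone."""
    if hr < zone_min:
        return zone_min - hr
    if hr > zone_max:
        return zone_max - hr
    return 0.0


# Zone 1 band from the March 9 ride: 122 to 140 bpm, rider at 139 bpm.
print(midpoint_error(139, 122, 140))  # -8.0, "too high" while still in zone
print(edge_error(139, 122, 140))      # 0.0, nothing to correct
```

The same heart rate produces an 8 bpm complaint under one definition and no complaint under the other, which is the whole difference between chasing a point and respecting a band.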

That was not a tuning problem. That was a controller problem.

The fix was not better PID gains. The fix was changing the architecture. I stopped asking feedback to do the whole job and started asking it to do the part feedback is actually good at. Small corrections around a reasonable baseline. Cruise control, not autopilot.

That led to the feedforward-first design. On March 4, the loop turned into five mode-specific behaviors instead of one generic tick:

open_loop_warmup
calibrating
feedforward_trim
open_loop_recovery
power_assist_sprint

That split came straight from the physics. Warmup is not steady state. Recovery is not steady state. A 20 second sprint is definitely not steady state. HR-first control makes sense when the effort is long enough for the cardiovascular system to answer. It makes no sense when the interval is shorter than the lag.

The core inversion was simple on paper. If the rider's local heart-rate-to-power gain is k and the session intercept is beta0_day, then the steady-state power required for a target heart rate is P_ff = (HR_target - beta0_day - drift_estimate) / k. Get close with that, then trim.
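As a minimal sketch, the inversion is one line of arithmetic. The function name and the example numbers below are hypothetical; only the formula comes from the text.

```python
def feedforward_power(hr_target, beta0_day, k, drift_estimate=0.0):
    """Invert HR_ss = beta0_day + k * P (plus drift) for steady-state power in watts."""
    if k <= 0:
        raise ValueError("gain k must be positive (bpm/W)")
    return (hr_target - beta0_day - drift_estimate) / k


# Hypothetical rider: intercept 90 bpm, gain 0.5 bpm/W, target 135 bpm.
print(feedforward_power(135, 90, 0.5))                      # 90.0
print(feedforward_power(135, 90, 0.5, drift_estimate=3.0))  # 84.0
```

Note how directly k sits in the denominator. Every error in the gain estimate shows up as a proportional error in commanded power.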

The trim logic got simpler and better the moment I stopped chasing the zone midpoint. Deadband made the controller act like a person with patience. If HR is inside the zone, do nothing. If HR drops below the lower edge, add a little power. If it rises above the upper edge, cut a little power. In the live steady-state loop that became asymmetric trim: 0.3 W/bpm below the zone and 0.5 W/bpm above it, clipped to +/-20W. It is intentionally harder on the high side because overcooking the rider is worse than undercooking the workout.

The research kept pushing the same lesson. Hunt's published controller worked at a 5 second control cadence with about 2.5 to 3.1 bpm RMSE. My early loop was running at 4 Hz and behaving like it had somewhere urgent to be. The literature wanted patience. The rider wanted patience. The logs wanted patience.

So the controller got slower and more mode-aware. Short intervals moved to power-led logic because mean power, not HR, is what regulates a 20 second effort. Recovery stopped trying to chase heart rate down tick by tick and just got easy. Warmup got its own ceiling rules. A live drift heuristic even showed up in the product controller: after 600 seconds in zone and 60 seconds spent in the zone's upper quartile, shave off 1.5 watts every minute, capped at 15 watts. It was a heuristic, not a physiological model, but it was at least aimed in the right direction.
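One way to read that drift heuristic, as a sketch. The thresholds are from the text; the interpretation that the step-down counts whole minutes after both conditions are first met is an assumption, and the function names are hypothetical.

```python
def in_upper_quartile(hr, zone_min, zone_max):
    """True when HR sits in the top 25% of the zone band."""
    return hr >= zone_min + 0.75 * (zone_max - zone_min)


def drift_offset_watts(seconds_in_zone, seconds_in_upper_quartile, seconds_since_trigger):
    """Watts to subtract from the command under the live drift heuristic.

    Assumption: 1.5 W per whole minute elapsed since both conditions were
    first satisfied, capped at 15 W.
    """
    if seconds_in_zone < 600 or seconds_in_upper_quartile < 60:
        return 0.0
    whole_minutes = int(seconds_since_trigger // 60)
    return min(1.5 * whole_minutes, 15.0)


print(drift_offset_watts(700, 60, 120))     # 3.0
print(drift_offset_watts(3000, 600, 1200))  # 15.0, capped
```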

The guard logic also got real. Ignore heart rate spikes faster than 20 bpm/s. Stop if HR goes implausibly low or high. Warn if trainer power diverges from commanded power by more than 30 watts for 10 seconds. Drop 15 watts immediately if cadence collapses by more than 20 RPM in 5 seconds so the rider does not get dragged into an ERG death spiral.
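Three of those guards can be sketched as small pure functions. The thresholds come from the text; the function names, argument shapes, and the idea of passing a list of recent power errors are assumptions made for illustration.

```python
def hr_sample_is_plausible(prev_hr, hr, dt_s, max_rate_bpm_s=20.0):
    """Reject samples that imply a heart-rate change faster than 20 bpm/s."""
    return abs(hr - prev_hr) / dt_s <= max_rate_bpm_s


def cadence_collapse_relief(cadence_5s_ago, cadence_now, drop_rpm=20.0, relief_w=15.0):
    """Watts to shed immediately if cadence fell by more than 20 RPM in 5 s."""
    return relief_w if cadence_5s_ago - cadence_now > drop_rpm else 0.0


def power_divergence_warning(errors_w, sample_period_s, limit_w=30.0, hold_s=10.0):
    """True if |commanded - measured| stayed above 30 W for the latest 10 s.

    errors_w is a list of (commanded - measured) samples, oldest first.
    """
    needed = int(round(hold_s / sample_period_s))
    recent = errors_w[-needed:]
    return len(recent) == needed and all(abs(e) > limit_w for e in recent)
```

Keeping each guard stateless makes it easy to test against logged rides before trusting it on a live one.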

This all felt like progress because it was progress. The March 9 oscillation told me direct midpoint-chasing was bullshit for a delayed human system. Feedforward plus deadband trim was better. The trouble was hidden in the word "feedforward."

Feedforward only works if the estimate is close.

3. The Identifiability Crisis

On March 11, 2026, I learned how much "close" matters when the estimate is wrong.

The ride report is blunt. Warmup started at 80 watts and ramped to 145 watts over six minutes. Because warmup was still effectively HR-blind, my heart rate was already around 150 bpm before the first real work phase even started. Then the controller flipped into aerobic base mode and ramped from 145 watts to a 208 watt feedforward target in 3.3 seconds.

My actual Zone 2 power that day was roughly 100 to 130 watts.

The emergency ceiling did fire. That sounds reassuring until you do the arithmetic. Seventy percent of 208 watts is about 146 watts. The app dropped me from "way too high" to "still too high." My max HR hit 163 bpm. The ride ended at 451 seconds. My note in the report was short and accurate: legs are cooked.

That ride is the center of this whole story because it killed the comforting lie that the remaining work was just controller tuning. It was not. The trim was not the primary problem. The primary problem was identification.

The published population mean for k is about 0.39 bpm/W. The published range is 0.18 to 0.80. That is a 4.4x spread. If you ship ride one using the population default and the rider happens to be on the wrong side of that spread, your feedforward is junk before the first correction loop even starts.

March 11 was that failure in real life. The controller used a value that looked reasonable in a table and insane on a bike. Once the feedforward landed at 208 watts, everything downstream was working with poisoned inputs. The emergency ceiling was relative to the poisoned input. The trim started from the poisoned input. The warmup transition fed into the poisoned input. It was all downstream of a bad estimate.

That is when the problem changed shape for me. Up to that point I was mostly thinking like a control engineer. Make the loop stable. Handle the delay. Separate steady state from short intervals. After March 11 I started thinking like someone building a product that physically pushes on a human body. If your estimate is wrong high, you do not get a pretty Bode plot. You get a rider breathing over the bars while the screen calmly tells you everything is within control limits.

I tried to salvage the identification problem the normal way. v3 proposed a structured warmup ramp and a joint fit of k, beta0, and tau using batch fitting. In a notebook, that sounds perfectly respectable. In real data, it turns into mud.

The adversarial review made the failure math impossible to ignore. For a ramp input, the response looks like HR(t) = beta0_day + k * [P_ramp(t - Td) - tau * ramp_rate * (1 - exp(-(t - Td)/tau))]. Once the transient dies down, the slope mostly tells you k * ramp_rate. The intercept smears together beta0_day and k * tau * ramp_rate. If tau is wrong, beta0_day moves with it. If beta0_day moves, the feedforward inversion moves. Now your baseline and your gain are taking turns lying to you.
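The smearing is visible if the ramp response is written out for late times. This is a sketch under two assumptions not stated in the review: the ramp is P_ramp(t) = P_0 + r t with r the ramp rate, and the fit window sits well past the transient (t - Td much larger than tau).

```latex
% Assumes P_ramp(t) = P_0 + r t and t - T_d \gg \tau.
\begin{aligned}
HR(t) &= \beta_{0,\mathrm{day}} + k\left[P_0 + r\,(t - T_d) - \tau r\left(1 - e^{-(t - T_d)/\tau}\right)\right] \\
      &\approx \underbrace{\left(\beta_{0,\mathrm{day}} + k P_0 - k \tau r\right)}_{\text{intercept}}
         + \underbrace{k r}_{\text{slope}}\,(t - T_d)
\end{aligned}
```

The slope fixes k once r is known. The intercept is a single measured number that has to be split between beta0_day and k * tau * r, so an error of delta_tau in the assumed time constant moves the fitted beta0_day by k * r * delta_tau.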

Three independent reviews all landed on the same verdict. The Jacobian for the joint fit is ill-conditioned under realistic noise. Condition numbers in the 1000 to 9000 range. Rank-deficient behavior with ramp excitation alone. Same answer from different angles.

The adversarial review did not stop there. It also called out the parts that would have hurt a real rider if I kept pretending the identification issue was solved:

Warmup identification was ill-conditioned.
The emergency ceiling could still cook the rider because 70 percent of wrong is still wrong.
Drift was being double-counted in one version of the spec, so power would bleed down faster than physiology justified.
Optical sensors added 5 to 15 seconds of latency on top of the physiological dead time, which meant the controller could be predicting entirely inside the delay window.

Then came the edge cases. Beta blocker users. Atrial fibrillation noise. Integral wind-up during sensor dropout. Cadence collapse. Riders sharing an account and averaging each other into nonsense.

The research review made the bigger point even clearer. No published HR-power control system had solved a clean first-ride cold start. Hunt's group used a separate 30 minute identification session with square-wave excitation. The UNSW group used damped or constrained recursive least squares during open-loop phases. Nobody had a normal human warmup that magically produced reliable, personalized control on ride one.

I had tried to sneak past that fact by making the warmup do double duty as both a warmup and an identification experiment.

It failed.

That failure did something useful. It removed the fantasy that more cleverness inside the trim loop would save a bad baseline. It would not. A controller that only works when the estimate is already right is not a controller. It is wishful thinking in code.

4. The Bayesian Solution

On March 12, 2026, the redesign finally got honest.

The first honest move was to stop trying to estimate everything at once. If tau is what makes the ramp fit ill-conditioned, do not let tau float freely during the first ride. Replace the one hard estimation problem with a set of smaller ones you can actually solve.

That became the v4 warmup. Settle. Ramp. Dither.

Phase A is 150 seconds at 50 watts. Let the rider's heart rate settle. Measure HR_initial. Get a defensible baseline instead of pretending the first thirty seconds tell you anything.

Phase B is not an identification ramp anymore. It is just a bounded ramp to a conservative probe power P_probe. The probe itself is limited by the rider's onboarding power cap and the prior gain. The dither amplitude is scaled as A = clip(4.0 / k_prior, 8, 12). Low-gain riders get more excitation because their signal-to-noise ratio is worse. High-gain riders do not need to feel the trainer jerking around for no reason.

Phase C is where identification finally happens. Not with a monotone ramp. With zero-mean ternary dither around the probe level: -A, 0, +A, fifty seconds per level, two full cycles over five minutes. The mean power stays at P_probe. The perturbation is the signal. Baseline no longer has to fight gain for identifiability because the regression is now about deviations around a local operating point.
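The amplitude rule from Phase B and the Phase C schedule can be sketched together. The clip rule, the three levels, the fifty-second dwell, and the two cycles are from the text; the function names and the (start_second, power) output shape are hypothetical.

```python
def dither_amplitude(k_prior):
    """A = clip(4.0 / k_prior, 8, 12), in watts."""
    return min(max(4.0 / k_prior, 8.0), 12.0)


def dither_schedule(p_probe, amplitude, level_s=50, cycles=2):
    """Return (start_second, power_w) pairs stepping -A, 0, +A around p_probe."""
    offsets = (-amplitude, 0.0, amplitude)
    schedule = []
    for i in range(cycles * len(offsets)):
        schedule.append((i * level_s, p_probe + offsets[i % len(offsets)]))
    return schedule


# Hypothetical probe level of 100 W with a 10 W dither.
for start, power in dither_schedule(100.0, 10.0):
    print(start, power)
```

Six levels at fifty seconds each cover the five minutes, and because the offsets sum to zero over each cycle the mean commanded power stays at the probe level.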

The second honest move was to stop pretending a point estimate is the same thing as knowledge.

The recursive estimator runs a small tau model bank with 70 second candidates. Each model carries its own baseline m and gain k, updates them with a damped Bayesian step, and gets reweighted by how well it predicts the observed heart rate. The output is not just mu_k. It is mu_k, sigma_k, and a coefficient of variation CV_k = sigma_k / mu_k.

That number changes the whole control story.

Feedforward authority is now earned. w_ff = clip((0.25 - CV_k) / 0.10, 0, 1). If the estimate is loose, w_ff goes to zero and the system falls back to feedback-dominant behavior. If the posterior tightens, feedforward fades in. The controller is allowed to say, "I do not know enough yet, so I am not going to act like I do."
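A minimal sketch of that gating rule, assuming mu_k and sigma_k are the posterior mean and standard deviation of the gain. The function name is illustrative.

```python
def feedforward_weight(mu_k, sigma_k):
    """w_ff = clip((0.25 - CV_k) / 0.10, 0, 1) with CV_k = sigma_k / mu_k."""
    cv_k = sigma_k / mu_k
    return min(max((0.25 - cv_k) / 0.10, 0.0), 1.0)


print(feedforward_weight(0.40, 0.12))  # CV 0.30: loose posterior, weight 0.0
print(feedforward_weight(0.40, 0.04))  # CV 0.10: tight posterior, weight 1.0
```

Between a CV of 0.25 and 0.15 the weight ramps linearly, so feedforward fades in instead of switching on.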

The feedforward itself also stopped using the optimistic estimate. It now uses a conservative gain bound, k_safe = min(0.80, mu_k + 1.28 * sigma_k). A higher k means each watt costs more heartbeats, so biasing k upward makes the inversion ask for less power. That is the conservative direction: assume the rider needs less power than the mean estimate suggests. The local power command becomes P_ff = P_operating + w_ff * (HR_target - m_hat - drift_estimate) / k_safe.
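A sketch of the conservative bound and the local command. The formulas are from the text; the function names and example numbers are hypothetical, and m_hat is read here as the estimator's heart-rate baseline at the current operating point, which is an interpretation.

```python
def safe_gain(mu_k, sigma_k, k_cap=0.80, z=1.28):
    """Upper bound on the gain: min(0.80, mu_k + 1.28 * sigma_k)."""
    return min(k_cap, mu_k + z * sigma_k)


def local_feedforward(p_operating, w_ff, hr_target, m_hat, drift_estimate, k_safe):
    """P_ff = P_operating + w_ff * (HR_target - m_hat - drift_estimate) / k_safe."""
    return p_operating + w_ff * (hr_target - m_hat - drift_estimate) / k_safe


# Hypothetical state: 120 W operating point, half authority, 4 bpm to close.
print(local_feedforward(120.0, 0.5, 135.0, 130.0, 1.0, 0.5))  # 124.0
```

With w_ff at zero the command collapses to the current operating point, which is how an uncertain estimate turns into no feedforward move at all.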

That is the part I trust now. Not because it is elegant. Because it admits uncertainty and turns uncertainty into reduced control authority instead of into fake confidence.

The predictive supervisor closes the loop on that honesty. Every 5 seconds it projects heart rate forward about 45 seconds using worst-case parameters from the posterior. If the forecast threatens the ceiling, power increases get clipped before the rider gets there. If the threat gets close to max HR, power drops to 70 percent of current. If current HR exceeds max HR, the system goes to the floor.
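The supervisor's decision ladder can be sketched as follows. This is not the app's implementation. It assumes a closed-form first-order projection with power held constant, a static map HR_ss = baseline_bpm + k_worst * P, a 50 W floor taken from the earlier ride logs, and a hypothetical near_max_margin_bpm standing in for "close to max HR".

```python
import math


def project_hr(hr_now, power_w, baseline_bpm, k_worst, tau_s, dead_time_s, horizon_s=45.0):
    """Projected HR after horizon_s if power_w is held constant (first-order lag)."""
    hr_ss = baseline_bpm + k_worst * power_w
    effective_s = max(horizon_s - dead_time_s, 0.0)  # nothing moves inside dead time
    return hr_ss + (hr_now - hr_ss) * math.exp(-effective_s / tau_s)


def supervise(requested_w, current_w, hr_now, forecast_hr, zone_ceiling, hr_max,
              floor_w=50.0, near_max_margin_bpm=3.0):
    """Return the power the supervisor allows for this 5 s tick."""
    if hr_now > hr_max:
        return floor_w                        # already over max HR: go to the floor
    if forecast_hr >= hr_max - near_max_margin_bpm:
        return 0.70 * current_w               # forecast nears max HR: cut to 70%
    if forecast_hr > zone_ceiling:
        return min(requested_w, current_w)    # forecast threatens ceiling: no increases
    return requested_w
```

The ordering matters. The most severe condition is checked first so that a milder rule can never mask it.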

The validation numbers are what made this architecture feel real instead of just cleaner. A dedicated calibration ride passed over 98% of eligible synthetic riders, with no weak-excitation failures. The standard first-ride warmup was much messier: about 84% eligible pass rate, with hundreds of weak-excitation runs called out explicitly. That contrast is the whole point. A dedicated calibration protocol can work very well. A normal warmup is useful but uncertain. The controller now knows the difference.

That changed the project from "estimate the rider" to "estimate the rider, quantify how wrong you might be, and behave accordingly."

That is a much better sentence to put anywhere near a human body.

5. The Over-Engineering Phase

At some point the control problem stopped being the only thing on my screen. Like every good hobby project, it started accreting side quests.

I built a weekly adaptation engine. The validation for it used a four-week cycle with progressive overload at 1.08x, recovery deload at 0.85x, hard-session ratios, efficiency-factor trends, recovery sensitivity. It looked serious because it was serious. It also had the fatal weakness of every small training-planning product trying to pick a fight with TrainerRoad. They have years of history and a mountain of workout data. I had a nice theory and a repo. You do not beat a data advantage with vibes and a more complicated rules engine.

I built a readiness system. Sleep quality, stress level, energy level, muscle soreness. Score it to 100. Adapt the session if the number falls below a threshold. The tests even looked clean: a score of 42 trims a session to 80 percent, a score of 32.5 trims it to 60 percent. That sounds smart until you remember what this app already measures for free. If your heart-rate-to-power relationship drifts, the ride tells you. If the probe power that should feel easy is already pushing HR up, the ride tells you. Asking the rider to fill out a form so the app can rediscover what the physiology will reveal five minutes later felt like paperwork, not product.

I built backend-era plumbing too. Session generation, persistence, learned physiology, cloud-shaped seams all through the project. Some of that is still visible in the tests and specs. The public site now says the opposite on purpose: no account, no subscription, no server, no cloud coaching layer. Local-only. One-time purchase. That reversal is not branding. It is a product decision made after staring at how much surface area a server adds when the control math itself wants to run on the phone.

Server means support. Server means cost. Server means privacy policy, account recovery, stale cache bugs, sync bugs, billing, data deletion, and eventually someone emailing you because their workout disappeared in a tunnel. None of that helps the one thing Zone Pedal is supposed to do, which is take heart rate, power, and cadence on a bike in a garage and make the trainer behave.

I also spent time asking whether the whole idea should collapse back into pure power ERG with a nicer UI. That would be the easy out. It is also the thing I was trying to escape. ERG is good at controlling a machine. The whole thesis of Zone Pedal is that the human is not a machine. The rider's HR-power gain changes with fitness, fatigue, heat, medication, and the plain fact that living systems drift. If I end up back at fixed watts for everything, then all I did was build another app that ignores the interesting half of the problem.

This phase was not wasted effort. It was how the boundaries got sharp. Every feature I cut taught me a version of the same lesson. If the feature does not improve the ride that is happening right now, on this device, with this body, it is probably decoration.

6. The Product Decision

Once I stopped pretending Zone Pedal needed to be a platform, the product got easier to see.

Cyclists do not need a thousand HR-adaptive workouts. They need the handful where heart-rate-aware control is actually better than fixed power.

Zone 2 without drift. That is the bread-and-butter session where a fixed wattage slowly stops matching the intended stress. This is the cleanest use case for feedforward plus trim.
Tabata with power targets. A 20 second work interval is too short for HR to control directly, so the work phase has to stay power-led. Heart rate still matters in the rests because it tells you how recovery is going.
Norwegian 4x4 with HR ceilings. Four minute intervals are long enough for HR-adaptive control to matter and short enough that launch logic matters too. This is the second showcase because the system can back power off before the rider blows through the ceiling.

That is the product on the site now. Thirteen workouts across five categories: Calibration, Aerobic Threshold, Endurance, Power-Led Intervals, and High Intensity. No weekly plans. No cloud coaching layer. No account. No subscription. The math runs on the phone. Data stays on the phone. The app is sold as a one-time purchase.

That decision was less romantic than the earlier versions of the project. It was also smarter. TrainerRoad, Zwift, and MyWhoosh are platform businesses. I was never going to out-platform them. Zone Pedal only has a reason to exist if it is narrower and more technically opinionated than they are.

So I kept the thing that felt different and cut the rest.

7. The Math

The rider model that survived is simple enough to explain and annoying enough to matter.

At the control level, heart rate behaves like a first-order-plus-dead-time plant. The static map is HR_ss = beta0_day + k * P, where beta0_day is the session intercept and k is the rider's local gain in bpm/W. The dynamic part is a lag after a pure delay: after dead time Td, heart rate moves toward HR_ss with time constant tau. For cycling, the repo centers that around about Td = 14s, tau_up = 50s, and tau_down = 80s, with wide rider-to-rider spread.
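A minimal simulation of that plant, useful for seeing how long a power step stays invisible. It assumes forward-Euler integration at one sample per second and a rider who starts at steady state for the first power sample; the function name and the example rider are hypothetical.

```python
def simulate_hr(power_w, beta0_day, k, dead_time_s=14, tau_up_s=50.0,
                tau_down_s=80.0, dt_s=1.0):
    """Simulate heart rate for a list of power samples spaced dt_s apart."""
    delay_steps = int(round(dead_time_s / dt_s))
    hr = beta0_day + k * power_w[0]          # assume steady state at the start
    trace = []
    for i in range(len(power_w)):
        delayed_power = power_w[max(i - delay_steps, 0)]   # pure dead time
        hr_ss = beta0_day + k * delayed_power              # static map
        tau = tau_up_s if hr_ss > hr else tau_down_s       # asymmetric lag
        hr += (hr_ss - hr) * dt_s / tau                    # forward Euler step
        trace.append(hr)
    return trace


# Hypothetical rider: 90 bpm intercept, 0.4 bpm/W, step from 100 W to 150 W at t = 30 s.
trace = simulate_hr([100.0] * 30 + [150.0] * 600, beta0_day=90.0, k=0.4)
print(trace[43], trace[44], round(trace[-1], 1))
```

Heart rate does not move at all for the first 14 samples after the step. A controller ticking at 4 Hz would issue dozens of commands in that window with no feedback on any of them.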

That delay is why a naive feedback loop fails. You are always steering by looking in the rear-view mirror.

The feedforward inversion is the obvious answer once you trust the parameters enough: P_ff = (HR_target - beta0_day - drift_estimate) / k. In the v4 architecture, it becomes local and uncertainty-aware: P_ff = P_operating + w_ff * (HR_target - m_hat - drift_estimate) / k_safe. Same idea, more conservative implementation. Start from the current operating point, scale authority by confidence, and use the conservative bound on k, not the optimistic one.

Feedback still matters. It just has a smaller job. In steady aerobic work the deadband trim is:

below zone: trim = 0.3 * (zone_min - HR)
above zone: trim = 0.5 * (zone_max - HR)
inside zone: trim = 0

That trim is clipped to +/-20W, then a very slow bias term u_bias integrates residual error on a 400s time constant with a 30W clamp. The controller is allowed to notice a persistent miss. It is not allowed to start thrashing because one sample came in hot.
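A sketch of the trim and the slow bias term. The gains, the clip, the 400 s time constant, and the 30 W clamp are from the text. Treating the residual as a value in watts that is integrated with gain dt / 400 is an assumption about how the bias term is wired, and the function names are illustrative.

```python
def deadband_trim(hr, zone_min, zone_max, clip_w=20.0):
    """Asymmetric trim in watts: 0.3 W/bpm below the zone, 0.5 W/bpm above it."""
    if hr < zone_min:
        trim = 0.3 * (zone_min - hr)
    elif hr > zone_max:
        trim = 0.5 * (zone_max - hr)
    else:
        trim = 0.0
    return max(-clip_w, min(clip_w, trim))


def update_bias(u_bias, residual_w, dt_s, time_constant_s=400.0, clamp_w=30.0):
    """Slow integrator: accumulate residual_w * dt / 400 s, clamped to +/-30 W."""
    u_bias += residual_w * dt_s / time_constant_s
    return max(-clamp_w, min(clamp_w, u_bias))


print(deadband_trim(139, 122, 140))  # 0.0, inside the zone
print(deadband_trim(146, 122, 140))  # -3.0, six bpm over the top edge
print(deadband_trim(200, 122, 140))  # -20.0, clipped
```

A single hot sample changes the bias by a fraction of a watt, which is what keeps the integrator from thrashing.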

The drift model moved from a live heuristic to an explicit intensity-aware estimate. The defaults are 0.10 bpm/min in Zones 1 and 2, 0.20 in Zone 3, 0.30 in Zone 4, and about 0.35 for mixed HIIT sessions, with partial reset during recovery intervals. Older live logic used a cruder trick that still taught me something useful: after 600 seconds in zone and 60 seconds in the zone's upper quartile, step power down by 1.5W each minute, capped at 15W. Even the heuristic version was trying to answer the same physiological fact. Constant power does not mean constant internal load.

The emergency logic is where the product earns the right to exist. If HR > zone_max + 5 bpm, the command is clipped to the minimum of three ceilings:

relative ceiling: 0.70 * P_ff
k-based ceiling: ((zone_max - m_hat) / k_safe) * 0.80
onboarding absolute cap

That triple ceiling exists because March 11 proved a single relative ceiling is not enough. 70% of a bad estimate is still a bad estimate.
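The three ceilings can be sketched as below. The formulas and the 5 bpm trigger are from the text; the function names and the example rider values (m_hat, k_safe, the onboarding cap) are hypothetical and are not the March 11 parameters.

```python
def emergency_ceiling(p_ff, zone_max, m_hat, k_safe, onboarding_cap_w):
    """Minimum of the relative, k-based, and onboarding ceilings, in watts."""
    relative = 0.70 * p_ff
    k_based = ((zone_max - m_hat) / k_safe) * 0.80
    return min(relative, k_based, onboarding_cap_w)


def apply_emergency(command_w, hr, zone_max, ceiling_w):
    """Clip the command only when HR is more than 5 bpm over the zone."""
    if hr > zone_max + 5:
        return min(command_w, ceiling_w)
    return command_w


# A 208 W feedforward with hypothetical rider values.
ceiling = emergency_ceiling(p_ff=208.0, zone_max=142.0, m_hat=90.0,
                            k_safe=0.40, onboarding_cap_w=150.0)
print(round(ceiling, 1))
print(round(apply_emergency(208.0, hr=150.0, zone_max=142.0, ceiling_w=ceiling), 1))
```

In this example the relative ceiling alone would still allow about 146 W. The k-based ceiling does not depend on the feedforward value, so a bad P_ff cannot drag it upward.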

The predictor sitting above all of that is the part I would insist on if I had to cut everything else. It looks forward about 45 seconds with worst-case posterior parameters. If the forecast says the rider will threaten the zone ceiling or approach max HR, the supervisor clips power before the physiological response lands. That is the difference between react-and-correct and predict-then-correct. One is how you chase the system. The other is how you respect the delay.

The published benchmark I kept comparing against is not "perfect HR control." It is much more practical. Hunt's work reports about 2.5 to 3.1 bpm RMSE with a 5 second control cadence, and Waldron's 2024 cyclist study reported automated HR clamping at 2.8 bpm RMSE versus 3.2 bpm for manual adjustment. That is the right scale for honesty. If someone promises +/-1 bpm on a living human with transport delay, noise, drift, and sweat involved, they are selling magic.

Zone Pedal's philosophy ended up different from the classic control papers even when the goal was similar. The literature mostly says, "react carefully and filter heavily." Zone Pedal ended up saying, "predict conservatively, then trim carefully." Same target. Different attitude.

8. Closing

The part of this project that stayed with me was not the controller architecture. It was the line between interesting and honest.

A lot of software lets you be wrong in private. You get a bad chart. A failed request. A weird UI state. A rider's heart rate is not that kind of problem. If the feedforward is wrong high, you do not get a warning in a log file. You cook the rider. If the emergency ceiling is tied to the bad estimate, then 70 percent of garbage is still garbage. If your sensor latency assumption is wrong, your prediction lands inside dead time and calls itself foresight.

The failure cases are what made the project real. March 9 taught me that a delayed human system punishes impatience. March 11 taught me that personalization is not a nice-to-have layer on top of control. It is the control problem. The reviews taught me that clever math does not become trustworthy because it looks clean in a notebook. It becomes trustworthy when it survives the ugly cases, admits uncertainty, and still behaves conservatively.

That changed how I think about building any software that touches physiology. The hard part is not squeezing one more feature out of the model. The hard part is deciding what the model is allowed to do when it does not know enough.

I started this project because ERG mode ignored the human. I kept going because the human turned out to be the interesting part. Not the trainer. Not the app. The body in the loop.

The shipped app has grown a little since the strip-to-ship cut, but only in the directions the ride kept asking for. There is a Discover Your Cardiac Response calibration ride. There is a Your Body profile with VO2max and DFA-alpha1 aerobic-threshold estimates. There is .zwo import, because people already have workouts they care about. Those are product additions, not a return to platform ambition. They still point at the same constraint: help this ride understand this body better.

That realization pointed past this project. Not to another coaching platform. To the bigger problem hiding underneath all of this: how to model biological systems honestly enough that software can interact with them without lying to itself.

Mark Koivuniemi is a biomedical engineer in Colorado building software that models human physiology. Zone Pedal is available on the App Store.