How does personalized self-supervised learning work for wearables?

Similar to how models like GPT predict the next word, personalized self-supervised learning for wearables predicts missing bio-signal data based on surrounding data. This creates a personalized foundation model for each user, which can then be fine-tuned for specific health outcome predictions.

Why is it important to consider evaluation metrics beyond those provided by default in scikit-learn?

Default metrics like precision and F1 score can be misleading because they depend on population prevalence. Reporting sensitivity (another name for recall) and specificity, and carefully choosing the decision threshold, is crucial for ensuring AI models are useful and trusted in real-world clinical settings.

What are the implications of choosing different AI model evaluation thresholds?

Adjusting the decision threshold of an AI model can drastically change its output. For example, increasing sensitivity might catch more cases but lead to more false positives and alert fatigue, while increasing specificity reduces false positives but might miss some true cases.

What lessons were learned from the Parkinson's digital assessment study?

Key lessons from the Parkinson's study included the importance of accounting for medication timing, considering whether symptom prediction is more useful than diagnosis, recognizing asymmetric disease onset, and ensuring the AI solution fits within the clinical workflow to avoid increasing clinician burden.

How can bias in AI models for healthcare be addressed?

Bias can be measured with quantitative metrics like disparate impact, equalized odds, and equal opportunity, and then mitigated algorithmically. However, the critical question is not just *if* a model can be made less biased, but *what* bias metric should be optimized for in different contexts.

What is the difference between engagement and adherence in digital therapeutics?

Engagement refers to basic user interaction, like opening an app or performing simple tasks within it. Adherence, on the other hand, means actually following the recommended behavior change suggested by the therapeutic intervention, which is more challenging to measure and achieve.

Stanford CS547 HCI Seminar | Spring 2026 | HCI and Human-Centered AI for Digital Health

Stanford Online

Education | 8 min read | 51 min video

May 20, 2026 | 246 views

TL;DR

AI models for digital health can underperform dramatically due to default technical choices, leading to issues like clinician alert fatigue and patient disengagement.

Key Insights

Personalized AI models, in which a separate model is trained for each person, can substantially outperform generalized models in predicting health events, as seen on a stress dataset.

The choice of evaluation metrics for AI models is critical; default metrics like precision, recall, and F1 score are common, but precision (and therefore F1) depends on population prevalence, and no single metric guarantees clinical utility, as demonstrated by the sepsis prediction model's low precision despite high sensitivity and specificity.

Easy-to-implement AI modeling decisions, such as altering decision thresholds, can dramatically affect outcomes; a model with the same learned weights can shift from high sensitivity (catching all cancer cases but with false positives) to high specificity (only predicting cancer with high confidence, missing some cases).

Apple Watch's hypertension prediction model prioritized high specificity (around 92%) over sensitivity (low 40s, in percent) after consulting clinicians about alert fatigue, demonstrating the value of tuning based on clinician feedback.

Community engagement and co-design are crucial for digital health tools: a Parkinson's assessment initially overlooked medication timing, disease asymmetry, and clinician workflow fit until input was gathered from patients and clinicians.

Algorithmic bias mitigation is complex, with multiple quantitative fairness metrics (e.g., disparate impact, equalized odds) that often conflict, requiring careful consideration of which metric to optimize for in specific contexts.

Personalized AI for Just-in-Time Health Interventions

The field of human-centered AI for digital health focuses on leveraging data from wearables and mobile devices to predict intervenable health events, rather than solely for diagnosis. The core concept is 'just-in-time' digital interventions, aiming to deliver support precisely when needed. This often involves predicting repeat health events, such as substance use cravings or blood pressure spikes. Unlike traditional AI models trained on large, general datasets, the approach central to this research is personalized machine learning, where a separate AI model is trained for each individual. This personalization is enabled by self-supervised learning, akin to how foundation models like GPT are trained on vast amounts of text by predicting the next word. Applied to wearables, this means training models to predict missing bio-signal segments or infer relationships between different physiological data streams (e.g., heart rate, accelerometry, skin temperature) for each user. This personalized, self-supervised approach significantly reduces the need for explicit data labeling and allows models to better capture individual physiological variations. For instance, personalized models demonstrated substantially better performance compared to generalized models on a stress dataset, requiring very few labeled data points for convergence.
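The masked-prediction idea described above can be sketched in a few lines. This is a minimal illustration only, assuming a synthetic heart-rate-like signal and a linear least-squares model as a stand-in for the neural models used in practice; the function names (`make_masked_examples`, `fit_gap_filler`, `fill_gap`) and the window sizes are hypothetical and not taken from the lecture.

```python
import numpy as np

def make_masked_examples(signal, context=8, gap=4):
    """Slice one user's unlabeled signal into (surrounding context, hidden gap) pairs."""
    X, Y = [], []
    window = 2 * context + gap
    for start in range(len(signal) - window + 1):
        w = signal[start:start + window]
        left = w[:context]
        hidden = w[context:context + gap]   # the segment the model must reconstruct
        right = w[context + gap:]
        X.append(np.concatenate([left, right]))
        Y.append(hidden)
    return np.array(X), np.array(Y)

def fit_gap_filler(signal, context=8, gap=4):
    """Fit a per-user linear model that predicts the hidden gap from its context."""
    X, Y = make_masked_examples(signal, context, gap)
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return W

def fill_gap(W, left, right):
    """Predict the missing samples between two context segments."""
    return np.concatenate([left, right, [1.0]]) @ W

# Synthetic stand-in for one user's signal (assumption: not real wearable data).
rng = np.random.default_rng(0)
t = np.arange(600)
user_signal = 70 + 5 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.3, size=t.size)

train, test = user_signal[:500], user_signal[500:]
W = fit_gap_filler(train)            # no health labels are used anywhere

X_test, Y_test = make_masked_examples(test)
pred = np.hstack([X_test, np.ones((len(X_test), 1))]) @ W
model_mae = np.mean(np.abs(pred - Y_test))
# Naive baseline: fill the gap with the mean of the surrounding context.
baseline_mae = np.mean(np.abs(Y_test - X_test.mean(axis=1, keepdims=True)))
print(f"gap-filling MAE: model={model_mae:.2f}, context-mean baseline={baseline_mae:.2f}")
```

The point of the sketch is that the training targets come from the signal itself, so each user's model can be pretrained without any annotation; a small labeled set would only be needed for the later fine-tuning step.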

The failure of AI without HCI: Nurse stress data challenges

Despite the promise of personalized AI, significant challenges arise when human behavior and real-world complexities are involved. An illustrative example involved a dataset of nurses annotating their stress levels during the COVID-19 pandemic. Although personalized models can in theory account for individual differences, labeling was inconsistent across nurses and even within the same nurse over time, which rendered the personalized models ineffective. This situation highlights that AI innovation alone is insufficient to overcome fundamental problems rooted in human behavior and the limitations of data collection in stressful environments. These failures underscore the critical need for Human-Computer Interaction (HCI) principles to bridge the gap between AI capabilities and human realities in digital health applications.

Reducing user burden while maintaining intervention efficacy

A primary HCI challenge in digital health is minimizing the burden on users while ensuring the effectiveness of AI-driven interventions. This involves addressing how to personalize AI models without overwhelming patients (or clinicians) with data entry or complex interactions. Approaches like active learning, where the AI requests calibration data when it is least confident in its predictions, are being explored. However, even active learning needs refinement, as user availability and context (e.g., driving) must be considered. A more layered approach is proposed, integrating AI model outputs with user preferences and passive sensing (like location data) to infer context and optimize intervention timing. Understanding the causal relationships between intervention timing, content, user burden, receptivity, engagement, and adherence is crucial. This framework suggests that reducing burden directly impacts receptivity and subsequently adherence, emphasizing the importance of user-centric design in digital therapeutics.
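A rough sketch of the layered idea above: ask for a calibration label only when the model is uncertain and the inferred context suggests the user can respond. Everything here is an assumption for illustration, including the `Context` fields, the uncertainty formula, and the thresholds; the lecture summary does not specify an implementation.

```python
from dataclasses import dataclass

@dataclass
class Context:
    """Hypothetical context inferred from passive sensing and user preferences."""
    is_driving: bool
    prompts_today: int

def uncertainty(p):
    """Uncertainty of a binary probability: 1.0 at p=0.5, 0.0 at p=0 or p=1."""
    return 1.0 - abs(2.0 * p - 1.0)

def should_request_label(p, ctx, min_uncertainty=0.6, daily_cap=3):
    """Decide whether to prompt the user for a calibration label."""
    if ctx.is_driving:                    # never interrupt in an unsafe context
        return False
    if ctx.prompts_today >= daily_cap:    # limit burden from repeated prompts
        return False
    return uncertainty(p) >= min_uncertainty

print(should_request_label(0.55, Context(is_driving=False, prompts_today=0)))  # uncertain and available
print(should_request_label(0.55, Context(is_driving=True, prompts_today=0)))   # uncertain but driving
print(should_request_label(0.95, Context(is_driving=False, prompts_today=0)))  # confident, no prompt needed
```

Putting the context checks before the uncertainty check reflects the priority described in the text: user availability and burden override the model's desire for more data.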

The critical role of evaluation metrics and decision thresholds

The choice of AI evaluation metrics profoundly impacts how model performance is perceived and utilized, often with significant consequences for end-users. Standard metrics like precision, recall, and F1-score, readily available in libraries like scikit-learn, can be misleading if not understood in context. Precision, for example, measures the proportion of positive predictions that are actually correct, so it depends on the prevalence of the condition in the population. Precision that looks strong in a specialized clinic, where the condition is common, can drop sharply in a general screening population, where it is rare, even though the model itself has not changed. To mitigate this, metrics like sensitivity (recall) and specificity, which are computed separately within the people who have the condition and the people who do not and therefore depend far less on prevalence, are often preferred in healthcare. However, the interpretation and clinical utility of these metrics are not always straightforward. The lecture emphasized that a single AI model can exhibit different behaviors based solely on the chosen decision threshold. By adjusting this threshold, a model can be optimized for high sensitivity (catching more true positives, but increasing false positives) or high specificity (reducing false positives, but potentially missing true positives). This choice has direct implications for patient outcomes and clinician experience.
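To make the prevalence dependence concrete, the short calculation below holds sensitivity and specificity fixed and varies only the base rate. The 90%/90% operating point and the three prevalence values are illustrative assumptions, not figures from the lecture, and `precision_from_rates` is a hypothetical helper name.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision (positive predictive value) implied by Bayes' rule."""
    true_pos = sensitivity * prevalence                    # share of population: correct alerts
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # share of population: false alerts
    return true_pos / (true_pos + false_pos)

# Same model quality, three different populations.
for prevalence in (0.01, 0.10, 0.50):
    p = precision_from_rates(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  precision={p:.2f}")
# prevalence=0.01  precision=0.08
# prevalence=0.10  precision=0.50
# prevalence=0.50  precision=0.90
```

Under these assumed numbers, the identical classifier goes from mostly false alarms to mostly correct alerts purely because the condition becomes more common.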

Sepsis prediction and alert fatigue: A case of low precision

The perils of judging a model by a narrow set of metrics were starkly illustrated by an Epic sepsis prediction model deployed in hospitals. While sensitivity (86%) and specificity (81%) appeared promising, the model's precision was a mere 34%. This meant that only about one in every three positive sepsis alerts generated by the model was a true sepsis case. Consequently, clinicians experienced severe alert fatigue, becoming desensitized to the notifications and eventually ignoring them. This eroded trust in the system, rendering a potentially useful AI tool ineffective. Clinicians facing this dilemma often prefer high specificity to avoid unnecessary alarms and the associated workload, even if it means missing a few true positives. This scenario underscores that technical performance metrics must align with the practical needs and realities of the clinical workflow to ensure adoption and effectiveness.
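The three reported figures can be reconciled with a little arithmetic. The sketch below backs out the sepsis prevalence that would make 86% sensitivity, 81% specificity, and 34% precision mutually consistent, then counts alerts per 1,000 patients. The prevalence is not stated in the summary; it is only what the numbers imply if all three describe the same population, and `implied_prevalence` is a hypothetical helper.

```python
def implied_prevalence(sensitivity, specificity, precision):
    """Solve Bayes' rule for the prevalence consistent with the three rates."""
    fpr = 1.0 - specificity
    return precision * fpr / (sensitivity * (1.0 - precision) + precision * fpr)

sens, spec, prec = 0.86, 0.81, 0.34        # figures quoted in the text
prev = implied_prevalence(sens, spec, prec)

n = 1000                                    # assumed cohort size for illustration
true_alerts = sens * prev * n
false_alerts = (1.0 - spec) * (1.0 - prev) * n

print(f"implied prevalence: {prev:.1%}")
print(f"per {n} patients: {true_alerts:.0f} true alerts, {false_alerts:.0f} false alerts")
print(f"false alerts per true alert: {false_alerts / true_alerts:.2f}")
```

Under this assumption roughly one patient in ten has sepsis, and each correct alert arrives alongside about two false ones, which is the workload behind the alert fatigue described above.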

Apple Watch hypertension model: Prioritizing specificity

Another compelling example of deliberately choosing an operating point comes from Apple's hypertension prediction model. The model exhibits low sensitivity (around 40%) but very high specificity (about 92%). This deliberate choice was made after consulting with clinicians who highlighted the critical issue of alert fatigue. By optimizing for high specificity, Apple keeps false alarms rare among users who do not have hypertension, so a flag from the watch is more likely to reflect a genuine concern. This approach, though sacrificing the detection of some true cases (lower sensitivity), prevents overwhelming users and discrediting the system with false alarms. This demonstrates that with the same underlying AI model, changing the decision threshold can lead to drastically different, contextually appropriate applications, highlighting the importance of understanding and engaging with end-user requirements during the design and evaluation phases.
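The effect of the threshold alone can be shown with a fixed set of model scores. The sketch below uses synthetic scores drawn from two overlapping normal distributions as an assumption; it does not reproduce Apple's model or its data, and the two thresholds are arbitrary.

```python
import numpy as np

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when predicting positive for scores >= threshold."""
    pred = scores >= threshold
    pos = labels == 1
    neg = ~pos
    sensitivity = (pred & pos).sum() / pos.sum()
    specificity = (~pred & neg).sum() / neg.sum()
    return float(sensitivity), float(specificity)

# Synthetic scores from one fixed "model": positives tend to score higher.
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
scores = np.concatenate([rng.normal(0.6, 0.15, 500), rng.normal(0.4, 0.15, 500)])

low_sens, low_spec = sens_spec(scores, labels, threshold=0.3)    # lenient: flag often
high_sens, high_spec = sens_spec(scores, labels, threshold=0.7)  # strict: flag rarely

print(f"threshold 0.3: sensitivity={low_sens:.2f}, specificity={low_spec:.2f}")
print(f"threshold 0.7: sensitivity={high_sens:.2f}, specificity={high_spec:.2f}")
```

The scores never change between the two calls; only the cutoff does, which is why the operating point is a design decision to be made with clinicians and users rather than a property learned by the model.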

The complexity of algorithmic fairness and stakeholder engagement

The development of AI for health extends beyond technical performance to encompass fairness and equity. In a Parkinson's assessment project, initial AI models performed better on Mac devices than on Windows devices, and better for right-handed than for left-handed individuals, highlighting potential social determinants of health at play. Addressing algorithmic bias involves navigating a landscape of diverse quantitative metrics like disparate impact, equalized odds, and equal opportunity, each with different implications and often conflicting with one another. While models can be engineered to be less biased, the critical question remains: what metric should be optimized for? This decision is context-dependent and requires deep HCI research to determine preferable approaches for various scenarios. Furthermore, the Parkinson's study revealed the profound importance of stakeholder engagement. Initially, the assessment failed to account for medication timing, disease asymmetry, and, above all, its fit within clinical workflows. Clinicians, already burnt out, need tools that seamlessly integrate into their practices without increasing their burden. Co-designing with patients and clinicians from the outset is essential to ensure that AI solutions are not only technically sound but also practically useful, actionable, and aligned with real-world needs.
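The three fairness metrics named above can be computed directly from predictions, which also shows how they can disagree. The toy data below is invented for illustration (the "right" and "left" group labels only echo the handedness example and are not study data), and the function names are hypothetical.

```python
import numpy as np

def group_rates(y_true, y_pred, group, g):
    """Selection rate, true positive rate, and false positive rate for one group."""
    m = group == g
    yt, yp = y_true[m], y_pred[m]
    return yp.mean(), yp[yt == 1].mean(), yp[yt == 0].mean()

def fairness_report(y_true, y_pred, group, a, b):
    sel_a, tpr_a, fpr_a = group_rates(y_true, y_pred, group, a)
    sel_b, tpr_b, fpr_b = group_rates(y_true, y_pred, group, b)
    return {
        # Disparate impact: ratio of positive-prediction rates (1.0 means parity).
        "disparate_impact": sel_a / sel_b,
        # Equal opportunity: gap in true positive rates (0.0 means parity).
        "equal_opportunity_gap": abs(tpr_a - tpr_b),
        # Equalized odds: worst gap across true and false positive rates.
        "equalized_odds_gap": max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)),
    }

# Toy data: the two groups have different base rates of the condition.
group = np.array(["right"] * 4 + ["left"] * 4)
y_true = np.array([1, 1, 0, 0, 1, 1, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 0])

report = fairness_report(y_true, y_pred, group, "right", "left")
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

In this toy case both groups are flagged at the same rate, so disparate impact shows parity, yet one true case in the second group is missed, so equal opportunity and equalized odds do not. Satisfying one metric does not imply satisfying the others, which is why the choice of metric has to be made per context.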

Regulatory and practical considerations for AI deployment

The deployment of AI in digital health also involves legal and regulatory considerations. For regulated medical devices, such as those requiring FDA approval, the specific model parameters and performance metrics must be locked and re-approved if changed. This regulatory oversight ensures a level of predictability and accountability. In contexts where FDA approval is not required, the question of liability—whether it lies with the AI model, the deploying clinician, or another entity—is an ongoing discussion. Beyond regulation, the practical integration of AI into existing systems like electronic health records (EHRs) presents significant HCI challenges focused on workflow fit. While clinicians may desire more granular data from patients, the design must meticulously balance information delivery with preventing additional stress or burden. This iterative process, involving community engagement, studying limitations, and adapting designs based on feedback, is fundamental to developing effective and trustworthy AI for digital health.

Common Questions

Human-centered AI for digital health focuses on using AI to predict intervenable health events and deliver timely digital interventions. It emphasizes understanding user behavior and reducing user burden while maintaining intervention efficacy and engagement.

Topics

Health & Longevity, AI & Machine Learning, Technology & Innovation, Human-computer Interaction, Wearable Technology, Machine Learning Evaluation, Personalized Medicine, Algorithmic Bias, Digital Therapeutics, Patient Engagement, Clinician Workflow

Mentioned in this video

Organizations

UCSF

The speaker works in human-centered AI for digital health at this institution.

FDA

The regulatory body that may require approval for AI models used in health products, influencing model specifications and changes.

NIH

National Institutes of Health, where the speaker's lab submits grant proposals for follow-up studies to account for lessons learned.

Hawaii Parkinson's Association

An organization whose members and president, Jerry Boster, were instrumental in co-designing and informing the Parkinson's digital assessment.

Products

Apple Watch

A smartwatch that includes a hypertension prediction model, optimized for specificity to avoid alert fatigue.

Fitbit

A wearable device used for collecting data such as stress and blood pressure.

Drugs & Medications

Levodopa

A common medication prescribed for Parkinson's disease to control motor symptoms, which acted as a confounder in early research because its effect rises and falls with the dosing cycle.

Software & Apps

ChatGPT

An example of a foundation model and fine-tuned model used to explain self-supervised learning.

Companies

OpenAI

The organization that trained the GPT foundation model by mining text from the internet.

People

Jerry Boster

President of the Hawaii Parkinson's Association who was heavily involved in the co-design of the Parkinson's digital assessment.

Books

npj Mental Health Research

A journal where a paper on predicting blood pressure spikes using Fitbit and stress data is set to be published.
