Stanford CS547 HCI Seminar | Spring 2026 | HCI and Human-Centered AI for Digital Health
Key Moments
AI models for digital health can underperform dramatically due to default technical choices, leading to issues like clinician alert fatigue and patient disengagement.
Key Insights
Personalized AI models, in which a separate model is trained for each person, can substantially outperform generalized models at predicting health events; on a stress dataset, personalized models performed much better than generalized ones.
The choice of evaluation metrics for AI models is critical. Default metrics such as precision, recall, and F1 score are common, but precision (and therefore F1) depends on how prevalent the condition is in the population, and none of these metrics automatically aligns with clinical utility. The sepsis prediction model is an example: its precision was low despite high sensitivity and specificity.
Easy-to-implement AI modeling decisions, such as altering decision thresholds, can dramatically affect outcomes; a model with the same learned weights can shift from high sensitivity (catching all cancer cases but with false positives) to high specificity (only predicting cancer with high confidence, missing some cases).
Apple Watch's hypertension prediction model prioritized high specificity (around 92%) over sensitivity (low 40s) after consulting clinicians about alert fatigue, demonstrating the value of tuning based on user feedback.
Community engagement and co-design are crucial for digital health tools: a Parkinson's assessment initially overlooked medication timing, disease asymmetry, and clinician workflow fit until input was gathered from patients and clinicians.
Algorithmic bias mitigation is complex, with multiple quantitative fairness metrics (e.g., disparate impact, equalized odds) that often conflict, requiring careful consideration of which metric to optimize for in specific contexts.
Personalized AI for Just-in-Time Health Interventions
The field of human-centered AI for digital health focuses on leveraging data from wearables and mobile devices to predict intervenable health events, rather than solely for diagnosis. The core concept is 'just-in-time' digital interventions, aiming to deliver support precisely when needed. This often involves predicting repeat health events, such as substance use cravings or blood pressure spikes. Unlike traditional AI models trained on large, general datasets, the approach central to this research is personalized machine learning, where a separate AI model is trained for each individual. This personalization is enabled by self-supervised learning, akin to how foundation models like GPT are trained on vast amounts of text by predicting the next word. Applied to wearables, this means training models to predict missing bio-signal segments or infer relationships between different physiological data streams (e.g., heart rate, accelerometry, skin temperature) for each user. This personalized, self-supervised approach significantly reduces the need for explicit data labeling and allows models to better capture individual physiological variations. For instance, personalized models demonstrated substantially better performance compared to generalized models on a stress dataset, requiring very few labeled data points for convergence.
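The "predict the missing bio-signal segment" idea can be sketched as a data-preparation step. This is a minimal illustration and not the speaker's pipeline: the function name `make_masked_pairs`, the window and mask lengths, and the synthetic heart-rate trace are all assumptions made for the example.

```python
import numpy as np

def make_masked_pairs(signal, window=32, mask_len=8, seed=0):
    """Build self-supervised training pairs from one person's unlabeled signal.

    The signal is cut into fixed windows; in each window a random segment is
    hidden (set to 0.0) and kept aside as the target the model must reconstruct.
    No human-provided labels are involved.
    """
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    n_windows = len(signal) // window
    windows = signal[: n_windows * window].reshape(n_windows, window)
    starts = rng.integers(0, window - mask_len + 1, size=n_windows)
    masked = windows.copy()
    targets = np.empty((n_windows, mask_len))
    for i, s in enumerate(starts):
        targets[i] = windows[i, s : s + mask_len]   # what the model should predict
        masked[i, s : s + mask_len] = 0.0           # what the model actually sees
    return masked, targets, starts

# Synthetic stand-in for one user's heart-rate stream (hypothetical data).
t = np.arange(640)
heart_rate = 70 + 5 * np.sin(2 * np.pi * t / 64)
masked, targets, starts = make_masked_pairs(heart_rate)
```

Any sequence model could then be trained per user to map `masked` to `targets`; the small number of labeled examples mentioned above would only be needed afterwards, to adapt that pretrained model to the actual prediction task.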
The failure of AI without HCI: Nurse stress data challenges
Despite the promise of personalized AI, significant challenges arise when human behavior and real-world complexities are involved. An illustrative example involved a dataset of nurses annotating their stress levels during the COVID-19 pandemic. In theory, personalized models can account for individual differences, but the project ran into severe labeling inconsistency, both across nurses and within the same nurse over time. This inconsistency rendered the personalized models ineffective. AI innovation alone is insufficient to overcome fundamental problems rooted in human behavior and the limitations of data collection in stressful environments. These failures underscore the critical need for Human-Computer Interaction (HCI) principles to bridge the gap between AI capabilities and human realities in digital health applications.
Reducing user burden while maintaining intervention efficacy
A primary HCI challenge in digital health is minimizing the burden on users while ensuring the effectiveness of AI-driven interventions. This involves addressing how to personalize AI models without overwhelming patients (or clinicians) with data entry or complex interactions. Approaches like active learning, where the AI requests calibration data when it is least confident in its predictions, are being explored. However, even active learning needs refinement, as user availability and context (e.g., driving) must be considered. A more layered approach is proposed, integrating AI model outputs with user preferences and passive sensing (like location data) to infer context and optimize intervention timing. Understanding the causal relationships between intervention timing, content, user burden, receptivity, engagement, and adherence is crucial. This framework suggests that reducing burden directly impacts receptivity and subsequently adherence, emphasizing the importance of user-centric design in digital therapeutics.
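The combination of uncertainty-driven label requests with a context check can be sketched as a simple gate. This is an illustrative assumption, not the method described in the talk: `should_request_label`, the `uncertainty_band` value, and the `user_available` flag (which a real system might infer from passive sensing) are hypothetical.

```python
def should_request_label(prob_event, user_available, uncertainty_band=0.15):
    """Decide whether to prompt the user for a calibration label.

    prob_event: the model's predicted probability of the health event.
    user_available: context flag, e.g. False while the user is driving.
    uncertainty_band: how close to 0.5 counts as "least confident" (assumed value).
    """
    model_is_uncertain = abs(prob_event - 0.5) <= uncertainty_band
    # Even an uncertain model stays silent when the moment is a bad one.
    return model_is_uncertain and user_available

ask_now = should_request_label(0.55, user_available=True)       # uncertain and free
ask_driving = should_request_label(0.55, user_available=False)  # uncertain but busy
ask_confident = should_request_label(0.95, user_available=True) # model already confident
```

Only the first call results in a prompt, which is the burden-reducing behavior the paragraph describes: the user is interrupted only when the label is both informative and convenient to give.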
The critical role of evaluation metrics and decision thresholds
The choice of AI evaluation metrics profoundly impacts how model performance is perceived and utilized, often with significant consequences for end-users. Standard metrics like precision, recall, and F1-score, readily available in libraries like scikit-learn, can be misleading if not understood in context. Precision, for example, measures the proportion of positive predictions that are actually correct, but it depends on the prevalence of the condition in the population. A model that appears to perform well in one population might fare poorly in another, such as a general screening population versus a specialized clinic, because the base rate of the condition differs. To mitigate this, metrics like sensitivity (recall) and specificity, which are less sensitive to prevalence, are often preferred in healthcare. However, the interpretation and clinical utility of these metrics are not always straightforward. The lecture emphasized that a single AI model can exhibit different behaviors based solely on the chosen decision threshold. By adjusting this threshold, a model can be optimized for high sensitivity (catching more true positives, but increasing false positives) or high specificity (reducing false positives, but potentially missing true positives). This choice has direct implications for patient outcomes and clinician experience.
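The threshold effect can be shown with a small sketch. The labels and scores below are made-up illustration data, and `sensitivity_specificity` is a helper written for this example; the point is only that the same scores (that is, the same learned weights) yield different operating points.

```python
import numpy as np

def sensitivity_specificity(y_true, scores, threshold):
    """Compute sensitivity and specificity when predicting positive at score >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(scores) >= threshold
    tp = np.sum(y_pred & y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical model outputs: 4 true cases, 6 non-cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

# Low threshold: every true case is caught, at the cost of false positives.
sens_low, spec_low = sensitivity_specificity(y_true, scores, 0.25)    # 1.0, 0.5
# High threshold: no false positives, but half the true cases are missed.
sens_high, spec_high = sensitivity_specificity(y_true, scores, 0.75)  # 0.5, 1.0
```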
Sepsis prediction and alert fatigue: A case of low precision
The perils of relying solely on default AI metrics were starkly illustrated by an Epic sepsis prediction model deployed in hospitals. While its sensitivity (86%) and specificity (81%) appeared promising, the model's precision was a mere 34%. In other words, only about one in three sepsis alerts generated by the model corresponded to a true sepsis case. Consequently, clinicians experienced severe alert fatigue, becoming desensitized to the notifications and eventually ignoring them. This eroded trust in the system, rendering a potentially useful AI tool ineffective. Faced with this trade-off, clinicians often prefer high specificity to avoid unnecessary alarms and the associated workload, even if it means missing a few true positives. This scenario underscores that technical performance metrics must align with the practical needs and realities of the clinical workflow to ensure adoption and effectiveness.
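How a model with 86% sensitivity and 81% specificity can end up with precision near 34% follows from Bayes' rule once prevalence is taken into account. The sketch below uses the sensitivity and specificity quoted above; the prevalence values are assumptions chosen for illustration, since the summary does not state the sepsis rate in the deployed population.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence               # share of all patients: sick and flagged
    false_pos = (1 - specificity) * (1 - prevalence)  # share of all patients: healthy but flagged
    return true_pos / (true_pos + false_pos)

# Assumed prevalences (not from the source): 10%, 2%, and 50%.
ppv_10 = precision_from_rates(0.86, 0.81, 0.10)  # about 0.33
ppv_02 = precision_from_rates(0.86, 0.81, 0.02)  # about 0.08
ppv_50 = precision_from_rates(0.86, 0.81, 0.50)  # about 0.82
```

Under these assumptions, a prevalence of roughly 10% reproduces a precision close to the reported 34%, and the same model would look far better or far worse as the base rate changes, which is the prevalence sensitivity discussed earlier.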
Apple Watch hypertension model: Prioritizing specificity
Another compelling example of prioritizing specific evaluation metrics comes from Apple's hypertension prediction model. The model exhibits low sensitivity (around 40%) but very high specificity (about 92%). This deliberate choice was made after consulting with clinicians who highlighted the critical issue of alert fatigue. By optimizing for high specificity, Apple keeps false alarms rare among users who do not have hypertension, so a flag from the watch is more likely to reflect a genuine concern. This approach sacrifices the detection of some true cases (lower sensitivity), but it avoids overwhelming users and discrediting the system with false alarms. It also shows that, with the same underlying AI model, changing the decision threshold can produce drastically different, contextually appropriate behavior, which is why understanding and engaging with end-user requirements during design and evaluation matters.
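One way to operationalize a "specificity first" requirement is to pick the decision threshold from validation scores so that a target specificity is met, then report whatever sensitivity results. The sketch below is an assumption about how such tuning could be done, not a description of Apple's procedure; the function name and the synthetic scores (deliberately constructed to land near the figures quoted above) are hypothetical.

```python
import numpy as np

def threshold_for_specificity(neg_scores, target_specificity):
    """Smallest threshold whose specificity on the negative cases meets the target.

    A case is predicted positive when score >= threshold, so specificity is the
    fraction of negative cases scoring strictly below the threshold.
    """
    neg = np.asarray(neg_scores, dtype=float)
    candidates = np.append(np.unique(neg), np.inf)  # sorted; inf always satisfies the target
    for thr in candidates:
        if np.mean(neg < thr) >= target_specificity:
            return float(thr)

# Synthetic validation scores (illustration only).
neg_scores = np.arange(100) / 100                    # users without the condition
pos_scores = np.array([0.95, 0.93, 0.85, 0.6, 0.5])  # users with the condition

thr = threshold_for_specificity(neg_scores, 0.92)
specificity = float(np.mean(neg_scores < thr))
sensitivity = float(np.mean(pos_scores >= thr))
```

With this toy data the selected threshold gives 92% specificity and 40% sensitivity, making the trade-off explicit: the specificity target is fixed first, and the sensitivity is accepted as a consequence.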
The complexity of algorithmic fairness and stakeholder engagement
The development of AI for health extends beyond technical performance to encompass fairness and equity. In a Parkinson's assessment project, initial AI models performed better on Mac devices than on Windows devices, and better for right-handed than for left-handed individuals, highlighting potential social determinants of health at play. Addressing algorithmic bias involves navigating a landscape of diverse quantitative metrics like disparate impact, equalized odds, and equal opportunity, each with different implications and often conflicting with one another. While models can be engineered to be less biased, the critical question remains: what metric should be optimized for? This decision is context-dependent and requires deep HCI research to determine preferable approaches for various scenarios. Furthermore, the Parkinson's study revealed the profound importance of stakeholder engagement. Initially, the assessment failed to account for medication timing, disease asymmetry, and, most importantly, its fit within clinical workflows. Clinicians, already burnt out, need tools that seamlessly integrate into their practices without increasing their burden. Co-designing with patients and clinicians from the outset is essential to ensure that AI solutions are not only technically sound but also practically useful, actionable, and aligned with real-world needs.
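A toy calculation can show why fairness metrics may disagree. The sketch below computes disparate impact (the ratio of positive-prediction rates between two groups) and an equalized-odds gap (summarized here as the larger of the TPR and FPR differences, which is one common convention). The data, group names, and helper functions are invented for illustration.

```python
import numpy as np

def selection_rate(y_pred):
    """Fraction of cases predicted positive."""
    return float(np.mean(y_pred))

def tpr_fpr(y_true, y_pred):
    """True- and false-positive rates; assumes the group has both positives and negatives."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    return float(np.mean(y_pred[y_true])), float(np.mean(y_pred[~y_true]))

# Hypothetical groups with different base rates of the condition.
a_true, a_pred = [1, 1, 0, 0], [1, 1, 0, 0]  # group A: 2 of 4 have the condition
b_true, b_pred = [1, 0, 0, 0], [1, 1, 0, 0]  # group B: 1 of 4 has the condition

disparate_impact = selection_rate(b_pred) / selection_rate(a_pred)
tpr_a, fpr_a = tpr_fpr(a_true, a_pred)
tpr_b, fpr_b = tpr_fpr(b_true, b_pred)
equalized_odds_gap = max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))
```

Here both groups are flagged at the same rate, so disparate impact is 1.0 and looks ideal, yet group B has a false-positive rate of one third against zero for group A, so equalized odds is violated. Which of the two should be optimized is exactly the context-dependent question raised above.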
Regulatory and practical considerations for AI deployment
The deployment of AI in digital health also involves legal and regulatory considerations. For regulated medical devices, such as those requiring FDA approval, the specific model parameters and performance metrics are locked, and any change requires re-approval. This regulatory oversight ensures a level of predictability and accountability. In contexts where FDA approval is not required, the question of liability (whether it lies with the AI model, the deploying clinician, or another entity) is an ongoing discussion. Beyond regulation, the practical integration of AI into existing systems like electronic health records (EHRs) presents significant HCI challenges focused on workflow fit. While clinicians may desire more granular data from patients, the design must carefully balance information delivery against added stress or burden. This iterative process, involving community engagement, studying limitations, and adapting designs based on feedback, is fundamental to developing effective and trustworthy AI for digital health.
Common Questions
Human-centered AI for digital health focuses on using AI to predict intervenable health events and deliver timely digital interventions. It emphasizes understanding user behavior and reducing user burden while maintaining intervention efficacy and engagement.
Mentioned in this video
The speaker works in human-centered AI for digital health at this institution.
The regulatory body that may require approval for AI models used in health products, influencing model specifications and changes.
National Institutes of Health, where the speaker's lab submits grant proposals for follow-up studies to account for lessons learned.
An organization whose members and president, Jerry Boster, were instrumental in co-designing and informing the Parkinson's digital assessment.