Decentralized Web Salon
Lalana Kagal, Emily Jiang, Irene Tenison, Alice Chen

MIT OpenCourseWare
Education | 4 min read | 47 min video
Nov 6, 2023

TL;DR

MIT researchers discuss decentralized web, ethical AI, and privacy-preserving ML on microcontrollers.

Key Insights

1. The MIT Decentralized Information Group researches data ownership, policy compliance, bias mitigation, privacy, and accountability.
2. Decentralized web architectures like Solid and blockchain are explored for data control and accountability.
3. Retrieval Augmented Generation (RAG) offers a promising approach for privacy-preserving, interpretable LLM querying in healthcare.
4. A hierarchical RAG system is proposed for querying distributed clinical data while enforcing access controls.
5. On-device training and Federated Learning are explored for privacy-preserving AI on resource-constrained IoT microcontrollers.
6. Challenges in microcontroller ML include memory/power constraints, data heterogeneity, and communication bottlenecks.

CORE RESEARCH PILLARS

The MIT Decentralized Information Group's research, spanning two decades, is driven by two primary challenges: ensuring data ownership and control for users, and addressing the ethical use and misuse of data. This encompasses critical areas such as policy compliance, bias mitigation in AI, maintaining user privacy, and establishing accountability when data-related issues arise. These fundamental principles guide the group's exploration of various computing paradigms to develop more secure and user-centric data systems.

DECENTRALIZED WEB AND DATA CONTROL

The group has contributed to the decentralization of the web through frameworks like Solid, which aims to give users more control over their data. Their work extends to decentralized file storage and to policy compliance across distributed data sources. Research into databases and blockchain technology focuses on enabling accountability and enforcing usage control policies when analyzing data spread across multiple distributed databases.

AI AND PRIVACY-PRESERVING MACHINE LEARNING

Currently, the group is investigating AI, particularly machine learning. Acquiring large, high-quality datasets is crucial for robust models, but this data often originates from distributed sources. The key challenge is combining this diverse data in a privacy-preserving manner, adhering to policies, and mitigating bias. This involves developing techniques to handle sensitive information and ensure fairness without compromising model performance or data integrity.

DISTRIBUTED QUESTION ANSWERING FOR CLINICAL DATA

One project focuses on a distributed question-answering system for clinical data, enabling natural language queries across collaborating hospitals and research labs. The system uses Retrieval Augmented Generation (RAG), a technique that improves LLM responses by retrieving external documents at query time and supplying them to the model as context. RAG improves interpretability by citing sources, helps catch hallucinations, and allows domain adaptation without costly model fine-tuning, which makes it well suited to sensitive healthcare applications governed by privacy laws like HIPAA.
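The retrieve-then-cite pattern described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy notes, the document ids, and the bag-of-words cosine score standing in for a real embedding model. No LLM is called; the sketch only shows how retrieved sources end up in the prompt with ids that an answer could cite.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; a real system would index de-identified clinical notes.
DOCS = {
    "note-001": "Patient reports chest pain and shortness of breath.",
    "note-002": "Follow-up visit for type 2 diabetes; metformin dose adjusted.",
    "note-003": "Chest X-ray shows no acute abnormality.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Return the ids of the k most similar documents, dropping zero-score ones."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id) for doc_id, text in docs.items()]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:k] if score > 0]

def build_prompt(query, docs, k=2):
    """Assemble a prompt whose context lines carry citable source ids."""
    ids = retrieve(query, docs, k)
    context = "\n".join(f"[{i}] {docs[i]}" for i in ids)
    prompt = f"Answer using only the sources below and cite their ids.\n{context}\nQuestion: {query}"
    return prompt, ids

prompt, cited = build_prompt("chest pain", DOCS)
print(prompt)
```

Because the source ids travel with the context, a reader can check each claim in the generated answer against the cited note, which is the interpretability benefit the summary attributes to RAG.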

HIERARCHICAL RETRIEVAL AND ACCESS CONTROL

To address the distributed nature of clinical data, a hierarchical RAG system is proposed. This architecture modifies the retrieval step to federate retrieval across multiple levels, potentially with a retriever per hospital or department. Access control, managed via OAuth, ensures that users (like doctors) can only access data they are authorized to see. Clinical BERT embeddings are used at the leaf nodes for domain-specific retrieval, aiming to balance comprehensive querying with stringent privacy and security protocols.
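A minimal sketch of federated retrieval with per-node access checks may make the hierarchy concrete. The `Node` class, the scope strings, and the word-overlap score are all hypothetical: the scopes stand in for whatever an OAuth token would grant, and the overlap score stands in for Clinical BERT embeddings at the leaves.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    required_scope: str = ""                      # hypothetical scope; empty means open
    docs: dict = field(default_factory=dict)      # populated on leaf retrievers only
    children: list = field(default_factory=list)

def score(query, text):
    """Toy relevance: number of shared lowercase words (embedding stand-in)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(node, query, scopes, k=3):
    """Recursively gather hits, skipping any subtree the caller may not access."""
    if node.required_scope and node.required_scope not in scopes:
        return []
    hits = [(score(query, text), f"{node.name}/{doc_id}") for doc_id, text in node.docs.items()]
    for child in node.children:
        hits.extend(retrieve(child, query, scopes, k))
    hits = [h for h in hits if h[0] > 0]
    hits.sort(key=lambda h: (-h[0], h[1]))        # best score first, then stable by id
    return hits[:k]

cardiology = Node("hospital_a/cardiology", "hospital_a.cardiology",
                  docs={"n1": "chest pain on exertion"})
oncology = Node("hospital_a/oncology", "hospital_a.oncology",
                docs={"n2": "chest mass biopsy scheduled"})
hospital_a = Node("hospital_a", "hospital_a", children=[cardiology, oncology])
root = Node("root", children=[hospital_a])

# A cardiologist's token: access to hospital_a and its cardiology department only.
print(retrieve(root, "chest pain", {"hospital_a", "hospital_a.cardiology"}))
```

Checking the scope before descending means a denied subtree is never queried at all, so unauthorized documents cannot reach the generation step even as low-ranked context.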

FEDERATED LEARNING ON MICROCONTROLLERS

The second project explores applying machine learning to the vast network of IoT devices, specifically microcontrollers, which are extremely memory- and power-constrained. Federated Learning is employed, allowing models to be trained collaboratively across these devices without raw data ever leaving them. This approach addresses the privacy concerns of collecting sensitive IoT data centrally and aims to leverage underutilized edge data for applications ranging from healthcare monitoring to industrial automation.
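As a rough illustration of the idea, the sketch below uses federated averaging (FedAvg), a common Federated Learning algorithm; the summary does not say which algorithm the project uses, so this is an assumption. The one-parameter model `y = w * x`, the toy client datasets, and the hyperparameters are all made up. Only the locally updated weight leaves each client, never the raw `(x, y)` pairs.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    """Average the clients' locally trained weights, weighted by sample count."""
    total = sum(len(d) for d in clients)
    return sum(len(d) * local_update(w_global, d) for d in clients) / total

# Hypothetical per-device datasets; every point follows y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(w)
```

Since every toy client follows `y = 2x`, the global weight should approach 2.0. Real deployments face non-identical client data, where the weighted average and the number of local epochs matter far more.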

CHALLENGES IN MICROCONTROLLER ML

Implementing Federated Learning on microcontrollers presents significant challenges. These include adapting existing ML algorithms for extreme resource constraints through techniques like quantization and sparse updates, managing data and model heterogeneity across devices, and overcoming communication bottlenecks. Ensuring secure aggregation of model parameters, preventing information leakage, and handling device dropouts are critical for the success and reliability of these decentralized learning systems.
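Two of the techniques named above, sparse updates and quantization, can be sketched in a few lines. This is a generic illustration (top-k sparsification followed by symmetric 8-bit quantization), not the project's actual scheme; the function names and the sample update are hypothetical.

```python
def top_k_sparsify(update, k):
    """Keep only the k largest-magnitude entries, as an index-to-value map."""
    idx = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)[:k]
    return {i: update[i] for i in sorted(idx)}

def quantize_int8(values):
    """Map a non-empty list of floats to integers in [-127, 127] plus a shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # fall back to 1.0 if all zeros
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [x * scale for x in q]

update = [0.02, -0.9, 0.0, 0.5, -0.01]       # hypothetical model update
sparse = top_k_sparsify(update, k=2)          # only 2 of 5 entries are transmitted
q, scale = quantize_int8(list(sparse.values()))
restored = dequantize(q, scale)
print(sparse, q, restored)
```

Sending two indices, two small integers, and one scale instead of five floats reduces communication, at the cost of a rounding error of at most half the scale per transmitted entry and the loss of the dropped entries.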

PRIVACY MECHANISMS AND SECURITY CONSIDERATIONS

Both projects place a strong emphasis on privacy and security. In the clinical domain, a hierarchical RAG system with OAuth aims to prevent unauthorized data access. For IoT devices, Federated Learning limits exposure by keeping raw data on local devices, although shared model updates can still leak information. Techniques like data perturbation, differential privacy, and careful de-identification of datasets like MIMIC-III are considered to further mitigate risks. Robustness against potential security threats, such as prompt engineering attacks, is also a key consideration.
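One common way to apply perturbation in the spirit of differential privacy is to clip each model update and add Gaussian noise before sharing it. The sketch below assumes that approach; the summary does not specify the mechanism, and the parameter names (`clip_norm`, `noise_multiplier`) are illustrative. Turning the noise level into a formal privacy guarantee requires separate accounting that is not shown here.

```python
import math
import random

def clip_l2(update, clip_norm):
    """Scale the update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in update]

def privatize(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip the update, then add zero-mean Gaussian noise to every coordinate."""
    rng = rng or random.Random()
    clipped = clip_l2(update, clip_norm)
    sigma = noise_multiplier * clip_norm      # noise scale tied to the clipping bound
    return [v + rng.gauss(0.0, sigma) for v in clipped]

raw_update = [3.0, 4.0]                       # hypothetical update with L2 norm 5
noisy = privatize(raw_update, clip_norm=1.0, noise_multiplier=0.5, rng=random.Random(0))
print(noisy)
```

Clipping bounds how much any single device can influence the aggregate, and the noise masks what remains; a larger `noise_multiplier` gives more protection but slows or degrades training.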

MODEL INTERPRETABILITY AND INTERACTION

Interpretability is crucial, especially in healthcare. RAG provides a foundational level by citing the source documents used in generating responses. However, there is a recognized need for more sophisticated interpretability and user interaction design, so that the 'first arrow' (the conversation between the user and the system) is intuitive and trustworthy. This involves not just providing answers but also explaining how they were derived, so that users can feel confident in the results.

FUTURE DIRECTIONS AND EVALUATION

The research is currently in early design phases, seeking feedback on proposed solutions. Evaluation metrics focus on federation, security, robustness (noise rejection, negative rejection), and information integration. While direct physician input is resource-prohibitive for the current project timeline, evaluations will leverage large de-identified datasets like MIMIC-III. Future work aims to refine these decentralized and privacy-preserving AI systems for broader real-world application.
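To make one of these metrics concrete, the sketch below computes a negative rejection rate under an assumed definition: among questions whose answer is not present in the retrievable documents, the fraction the system correctly declines to answer. The summary does not define the metric, and the record fields (`answerable`, `declined`) are hypothetical.

```python
def negative_rejection_rate(results):
    """Fraction of unanswerable questions that the system declined to answer.

    Returns None when the evaluation set contains no unanswerable questions.
    """
    unanswerable = [r for r in results if not r["answerable"]]
    if not unanswerable:
        return None
    return sum(r["declined"] for r in unanswerable) / len(unanswerable)

# Hypothetical evaluation records, one per test question.
results = [
    {"answerable": True, "declined": False},
    {"answerable": False, "declined": True},
    {"answerable": False, "declined": False},   # answered despite having no evidence
]
print(negative_rejection_rate(results))
```

A system that answers without supporting evidence scores low on this measure, which matters in a clinical setting where a confident but unsupported answer is worse than a refusal.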

Common Questions

Clinical data is dynamic, distributed across institutions, and heavily protected by privacy laws like HIPAA. LLMs can also lack interpretability and are prone to hallucination, making them risky for healthcare decisions.
