Decentralized Web Salon: Lalana Kagal, Emily Jiang, Irene Tenison, Alice Chen
Key Moments
MIT researchers discuss decentralized web, ethical AI, and privacy-preserving ML on microcontrollers.
Key Insights
The MIT Decentralized Information Group researches data ownership, policy compliance, bias mitigation, privacy, and accountability.
Decentralized web architectures like Solid and blockchain are explored for data control and accountability.
Retrieval Augmented Generation (RAG) offers a promising approach for privacy-preserving, interpretable LLM querying in healthcare.
A hierarchical RAG system is proposed for querying distributed clinical data while enforcing access controls.
On-device training and Federated Learning are explored for privacy-preserving AI on resource-constrained IoT microcontrollers.
Challenges in microcontroller ML include memory/power constraints, data heterogeneity, and communication bottlenecks.
CORE RESEARCH PILLARS
The MIT Decentralized Information Group's research, spanning two decades, is driven by two primary challenges: ensuring data ownership and control for users, and addressing the ethical use and misuse of data. This encompasses critical areas such as policy compliance, bias mitigation in AI, maintaining user privacy, and establishing accountability when data-related issues arise. These fundamental principles guide the group's exploration of various computing paradigms to develop more secure and user-centric data systems.
DECENTRALIZED WEB AND DATA CONTROL
The group has significantly contributed to the decentralization of the web through frameworks like Solid, which aims to give users more control over their data. Their work extends to decentralized file storage solutions and ensuring policy compliance across distributed data sources. Additionally, research into databases and blockchain technology focuses on enabling accountability and enforcing usage control policies when analyzing data spread across multiple distributed databases, highlighting a commitment to robust data governance.
AI AND PRIVACY-PRESERVING MACHINE LEARNING
Currently, the group is investigating AI, particularly machine learning. Acquiring large, high-quality datasets is crucial for robust models, but this data often originates from distributed sources. The key challenge is combining this diverse data in a privacy-preserving manner, adhering to policies, and mitigating bias. This involves developing techniques to handle sensitive information and ensure fairness without compromising model performance or data integrity.
DISTRIBUTED QUESTION ANSWERING FOR CLINICAL DATA
One project focuses on a distributed question-answering system for clinical data, enabling natural language queries across collaborating hospitals and research labs. This system utilizes Retrieval Augmented Generation (RAG), a technique that improves LLM responses by incorporating external documents during query time. RAG enhances interpretability by citing sources, aids in catching hallucinations, and allows for domain adaptation without costly model fine-tuning, making it ideal for sensitive healthcare applications governed by privacy laws like HIPAA.
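The retrieve-then-generate flow described above can be sketched roughly as follows. This is a minimal illustration, not the group's implementation: the bag-of-words cosine scorer stands in for a learned embedding model, the notes in DOCS are invented, and the function names (tokenize, retrieve, build_prompt) are hypothetical. The generation step is left out; the sketch stops at the prompt that would be handed to an LLM.

```python
import math
import re
from collections import Counter

# Invented example notes; a real system would index de-identified records.
DOCS = {
    "note-1": "Patient reports chest pain and shortness of breath.",
    "note-2": "Follow-up visit for a fractured wrist, cast removed.",
    "note-3": "Chest x-ray shows no acute findings.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the ids of the k most similar documents, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id)
              for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(query, docs, k=2):
    """Assemble an LLM prompt that carries the retrieved sources and their ids."""
    ids = retrieve(query, docs, k)
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in ids)
    prompt = (
        "Answer using only the sources below and cite them by id.\n"
        f"{context}\n"
        f"Question: {query}"
    )
    return prompt, ids
```

Because the source ids travel with the prompt, the answer can cite them, which is what makes a generated claim checkable against the underlying note.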
HIERARCHICAL RETRIEVAL AND ACCESS CONTROL
To address the distributed nature of clinical data, a hierarchical RAG system is proposed. This architecture modifies the retrieval step to federate retrieval across multiple levels, potentially with a retriever per hospital or department. Access control, managed via OAuth, ensures that users (like doctors) can only access data they are authorized to see. Clinical BERT embeddings are used at the leaf nodes for domain-specific retrieval, aiming to balance comprehensive querying with stringent privacy and security protocols.
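One way to picture the federated retrieval step is a tree walk that skips any subtree the caller is not authorized for. The sketch below is an assumption-laden illustration: the Node class, the scope strings, and the word-overlap scorer are all hypothetical, the scope set stands in for whatever an OAuth token would actually grant, and a real leaf would use an embedding model such as Clinical BERT rather than word overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A retriever in the hierarchy (network, hospital, or department)."""
    name: str
    required_scope: str = ""  # empty string means no restriction (assumption)
    docs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def federated_retrieve(node, query, scopes, k=3):
    """Collect the top-k hits from every subtree the caller may read."""
    if node.required_scope and node.required_scope not in scopes:
        return []  # unauthorized: neither this node nor its children are searched
    hits = [(overlap(query, text), doc_id, node.name)
            for doc_id, text in node.docs.items()]
    for child in node.children:
        hits.extend(federated_retrieve(child, query, scopes, k))
    hits = [h for h in hits if h[0] > 0]
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits[:k]


# Invented example hierarchy: two hospitals, one with two departments.
cardiology = Node("A/cardiology", "a.cardiology",
                  {"a-c1": "patient with atrial fibrillation on warfarin"})
oncology = Node("A/oncology", "a.oncology",
                {"a-o1": "patient with lymphoma receiving chemotherapy"})
hospital_a = Node("A", "a", children=[cardiology, oncology])
hospital_b = Node("B", "b",
                  docs={"b-1": "patient with atrial flutter after surgery"})
root = Node("root", children=[hospital_a, hospital_b])
```

A cardiologist holding only the scopes {"a", "a.cardiology"} would get hits from A/cardiology alone; adding "b" would widen the search to hospital B without changing the query.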
FEDERATED LEARNING ON MICROCONTROLLERS
The second project explores applying machine learning to the vast network of IoT devices, specifically microcontrollers, which are extremely memory- and power-constrained. Federated Learning is employed, allowing models to be trained collaboratively across these devices without raw data ever leaving the device. This approach addresses privacy concerns inherent in collecting sensitive IoT data centrally and aims to leverage underutilized edge data for applications ranging from healthcare monitoring to industrial automation.
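The collaborative training loop can be illustrated with federated averaging on a one-parameter model. This is a sketch under stated assumptions, not the project's code: the model (y = w * x), the learning rate, the round count, and the three invented client datasets are chosen only to keep the example small. Each client takes a few local gradient steps on its own data, and only the resulting weight is sent to the server.

```python
# Invented per-device datasets of (x, y) pairs, all drawn from y = 3 * x.
CLIENTS = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]


def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one device's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error of y = w * x with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def fed_avg(client_weights, client_sizes):
    """Server step: average client weights, weighted by sample count."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


def train(clients, rounds=50, w=0.0):
    """Alternate local training and server averaging for a fixed number of rounds."""
    sizes = [len(data) for data in clients]
    for _ in range(rounds):
        updates = [local_update(w, data) for data in clients]
        w = fed_avg(updates, sizes)
    return w
```

Calling train(CLIENTS) moves w toward the slope shared by all three datasets. The server only ever sees the per-client weights and sample counts, never the (x, y) pairs.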
CHALLENGES IN MICROCONTROLLER ML
Implementing Federated Learning on microcontrollers presents significant challenges. These include adapting existing ML algorithms for extreme resource constraints through techniques like quantization and sparse updates, managing data and model heterogeneity across devices, and overcoming communication bottlenecks. Ensuring secure aggregation of model parameters, preventing information leakage, and handling device dropouts are critical for the success and reliability of these decentralized learning systems.
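Two of the techniques named above, quantization and sparse updates, can be sketched in a few lines. The scheme shown (symmetric 8-bit linear quantization and top-k magnitude sparsification) is one common choice and is an assumption here; the source does not say which variants the project uses, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    largest = max(abs(v) for v in values)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]


def top_k_sparsify(update, k):
    """Keep only the k largest-magnitude entries, stored as index -> value."""
    order = sorted(range(len(update)), key=lambda i: -abs(update[i]))
    return {i: update[i] for i in sorted(order[:k])}
```

Quantization shrinks each transmitted value from a 32-bit float to one byte plus a shared scale, while sparsification sends only the entries that changed most. Both trade a small amount of accuracy for lower memory and communication cost.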
PRIVACY MECHANISMS AND SECURITY CONSIDERATIONS
Both projects place a strong emphasis on privacy and security. In the clinical domain, a hierarchical RAG system with OAuth aims to prevent unauthorized data access. For IoT devices, Federated Learning inherently protects data by keeping it on local devices. Techniques like data perturbation, differential privacy, and careful de-identification of datasets like MIMIC-III are considered to further mitigate risks. Robustness against potential security threats, such as prompt engineering attacks, is also a key consideration.
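As an illustration of how differential privacy is commonly applied to model updates, the sketch below clips an update to a maximum L2 norm and then adds Gaussian noise. The parameter names (max_norm, noise_multiplier) are assumptions, and the source does not specify a mechanism. The formal privacy guarantee also depends on an accounting step that is not shown.

```python
import math
import random


def clip_l2(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in update]


def privatize(update, max_norm, noise_multiplier, rng):
    """Clip the update, then add zero-mean Gaussian noise to every coordinate."""
    clipped = clip_l2(update, max_norm)
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    return [v + rng.gauss(0.0, sigma) for v in clipped]


# Usage: a seeded generator keeps the example reproducible.
noisy_update = privatize([3.0, 4.0], max_norm=1.0, noise_multiplier=0.5,
                         rng=random.Random(0))
```

Clipping bounds how much any single device can influence the aggregate, and the noise scale is set relative to that bound.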
MODEL INTERPRETABILITY AND INTERACTION
Interpretability is crucial, especially in healthcare. RAG provides a foundational level by citing the source documents used in generating responses. However, there is a recognized need for more sophisticated interpretability and user interaction design, ensuring that the 'first arrow' (the conversation between the user and the system) is intuitive and trustworthy. This involves not just providing answers but also explaining how they were derived, so that users can feel confident in the results.
FUTURE DIRECTIONS AND EVALUATION
The research is currently in early design phases, seeking feedback on proposed solutions. Evaluation metrics focus on federation, security, robustness (noise rejection, negative rejection), and information integration. While direct physician input is resource-prohibitive for the current project timeline, evaluations will leverage large de-identified datasets like MIMIC-III. Future work aims to refine these decentralized and privacy-preserving AI systems for broader real-world application.
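Two of the robustness metrics can be made concrete with a small scoring sketch. The definitions used here are assumptions, since the source does not spell them out: negative rejection is taken as the share of unanswerable questions the system declines to answer, and noise rejection as accuracy on answerable questions whose retrieved context includes irrelevant documents. The record fields and the sample RUNS are hypothetical.

```python
# Invented evaluation records; field names are assumptions for illustration.
RUNS = [
    {"answerable": False, "noisy": False, "refused": True, "correct": False},
    {"answerable": False, "noisy": True, "refused": False, "correct": False},
    {"answerable": True, "noisy": True, "refused": False, "correct": True},
    {"answerable": True, "noisy": True, "refused": False, "correct": False},
    {"answerable": True, "noisy": False, "refused": False, "correct": True},
]


def negative_rejection_rate(records):
    """Share of unanswerable questions where the system declined to answer."""
    unanswerable = [r for r in records if not r["answerable"]]
    if not unanswerable:
        return None
    return sum(r["refused"] for r in unanswerable) / len(unanswerable)


def accuracy_under_noise(records):
    """Accuracy on answerable questions whose context contained noise."""
    noisy = [r for r in records if r["answerable"] and r["noisy"]]
    if not noisy:
        return None
    return sum(r["correct"] for r in noisy) / len(noisy)
```

Both functions return None when no record qualifies, so an empty test split is not mistaken for a score of zero.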
Common Questions
Clinical data is dynamic, distributed across institutions, and heavily protected by privacy laws like HIPAA. LLMs can also lack interpretability and are prone to hallucination, making them risky for healthcare decisions.
Mentioned in this video
Leads the Decentralized Information Group at CSAIL, discussing research on data ownership, policy compliance, bias, privacy, and accountability.
Introduced Retrieval Augmented Generation (RAG) in 2020.
A large, de-identified patient EHR dataset from 2001-2012, used for evaluating the Federated RAG model.
A survey by this organization projects about 125 billion active IoT devices by 2030.
An LLM being considered for the generation step in the RAG system, though noted to be less impressive and prone to rambling compared to newer models.
PhD student in the Decentralized Information Group; discusses machine learning on microcontrollers and the 'Tiny' project for federated learning on IoT devices.
Lalana Kagal leads the Decentralized Information Group at this organization.
An embedding model fine-tuned on clinical notes, planned for use at the leaf nodes of the retrieval hierarchy.
An instruction-tuned LLM considered for the generation step in the RAG system, noted for potentially good performance.
From the Library Innovation Lab, shares insights on RAG models for legal information retrieval and emphasizes the importance of the human-computer interaction aspect.