How AI on the Cloud Is Changing Everything | Narendra Mangala | TEDxGaya College of Engineering
Key Moments
AI is automating data pipelines from raw input to executive summaries, but its intelligence must be coupled with robust governance and ethical considerations to build trust.
Key Insights
The Medallion architecture (bronze, silver, gold layers) transforms raw data into business insights, with adoption leading to a 40% reduction in data processing failures.
Databricks Unity Catalog provides centralized governance, enabling granular access control and PII tagging, crucial for managing data across 50+ business units.
AI adoption creates a need for 'MLOps-ready pipelines': data pipelines designed from the outset for training machine learning models at scale.
Prompt-driven data transformations, using LLMs such as GPT-4, can generate production-ready ELT pipeline code for straightforward tasks with minimal editing, acting as an accelerator rather than a replacement for engineers.
Microsoft Fabric offers a unified platform and a single logical data lake (OneLake) for AI and BI workloads, potentially reducing data duplication and governance issues.
Agentic data pipelines, where AI agents autonomously manage data flow, quality, and issue resolution, reduce mean time to failure discovery by 60%.
The evolution from data silos to intelligent cloud ecosystems
The landscape of enterprise data processing has dramatically shifted over the past 15 years. Previously, data resided in isolated silos – sales data in one system, finance in another, operations in a third. This fragmentation made holistic business analysis incredibly difficult, with C-suite executives waiting days for analysts to gather and process data to answer basic questions like why revenue dropped in a specific region. The backbone of this era was Extract, Transform, Load (ETL) processes, which were notoriously brittle, prone to failure (often at 3:00 a.m.), and inflexible to schema changes. The arrival of cloud computing platforms like AWS, Azure, and GCP, initially met with skepticism regarding security and control, fundamentally changed this paradigm. Instead of building infrastructure to fit data, cloud philosophy allowed infrastructure to be scaled to fit the data, marking a revolution in data management. My research, spanning from 2021 to 2026, has focused on building the infrastructure that enables this transformation, moving data from dusty servers to critical decision-making tables, with AI and the cloud acting as the invisible engine.
The Medallion architecture for data refinement
A key challenge in the early cloud adoption phase (around 2021) was handling the sheer volume of messy, raw data flowing into enterprise systems and ensuring it was trustworthy for business executives. The solution lay in architectural design, specifically the Medallion architecture. This layered approach functions like an oil refinery. The 'bronze' layer ingests raw, unfiltered data directly from sources, preserving it as evidence. The 'silver' layer refines this data by cleaning duplicates, correcting data types, and applying business rules. The 'gold' layer then enriches and aggregates this refined data, creating business-ready insights for dashboards, reports, and machine learning models. My research in 2021 demonstrated that optimizing ETL pipelines with the Medallion architecture on platforms like Azure Data Lake led to a 40% reduction in data processing failures and significantly faster time to insights, providing data with a defined journey and purpose. This architectural pattern is crucial for transforming chaotic data streams into reliable business intelligence.
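The bronze/silver/gold flow can be made concrete with a minimal, illustrative sketch in plain Python. This is not the talk's actual Spark code; the sample rows and the `to_silver`/`to_gold` helpers are invented here to show the refinement steps (deduplication and type correction in silver, aggregation in gold).

```python
# Hypothetical raw feed: duplicated rows, string-typed amounts (bronze layer).
bronze = [
    {"order_id": "1001", "amount": "250.00", "region": "east"},
    {"order_id": "1001", "amount": "250.00", "region": "east"},  # duplicate
    {"order_id": "1002", "amount": "99.50",  "region": "west"},
]

def to_silver(rows):
    """Silver: deduplicate on order_id and cast amount to a numeric type."""
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({**r, "amount": float(r["amount"])})
    return out

def to_gold(rows):
    """Gold: aggregate to a business-ready view, here revenue per region."""
    revenue = {}
    for r in rows:
        revenue[r["region"]] = revenue.get(r["region"], 0.0) + r["amount"]
    return revenue

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'east': 250.0, 'west': 99.5}
```

Note that bronze is never mutated: the raw evidence survives each refinement step, which is what makes the pattern auditable.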
Performance benchmarks and the importance of processing language
As data volumes grew into billions of rows, even minor improvements in processing efficiency became critical. Recognizing that milliseconds count in large-scale data transformations, I conducted research in 2021 benchmarking PySpark against Scala for distributed data processing. The choice of processing language can drastically alter pipeline runtime, distinguishing between a 2-hour job and a 20-minute one. Hundreds of tests were performed across various data volumes, transformation complexities, and cluster configurations. These findings influenced how enterprises select processing frameworks, shifting focus from trendy options to measurably fast solutions tailored to specific workloads. This emphasis on performance ensures that data pipelines can keep pace with the demands of modern business analytics.
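The benchmarking discipline described above can be sketched in miniature with plain Python. The harness below is invented for illustration (the real study compared PySpark and Scala jobs on clusters), but it shows the same method: run each candidate implementation of the same transformation repeatedly and keep the best wall-clock time to reduce warm-up and scheduling noise.

```python
import time

def benchmark(fn, data, repeats=5):
    """Run fn over data `repeats` times and return the best wall-clock time.

    Best-of-N filters out warm-up and scheduler jitter, the same discipline
    applied (at far larger scale) when comparing distributed frameworks.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best

# Two candidate implementations of the same transformation (illustrative).
rows = list(range(100_000))
def loop_impl(xs):
    return [x * 2 for x in xs]
def map_impl(xs):
    return list(map(lambda x: x * 2, xs))

t_loop = benchmark(loop_impl, rows)
t_map = benchmark(map_impl, rows)
print(f"comprehension: {t_loop:.4f}s, map: {t_map:.4f}s")
```

The point is not which toy variant wins, but that the choice is made from measurements over realistic inputs rather than from fashion.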
Establishing trust through data governance
Building efficient data pipelines is only part of the equation; earning business trust is paramount. Data trust extends beyond accuracy to encompass robust governance – understanding who accessed what data, when, and why. This includes identifying Personally Identifiable Information (PII), ensuring compliance with regulations like GDPR, and enforcing internal policies. My research shifted focus in 2022 to address this critical governance gap, with Databricks Unity Catalog becoming a pivotal tool. Unity Catalog acts as a centralized control tower, cataloging, tagging, and securing every data asset. It allows for granular access controls, enabling the tagging of PII columns for automated encryption. A particularly challenging use case involved implementing a unified governance framework across 50 diverse business units, each with unique data assets and compliance needs. The solution adopted was a federated model, balancing local autonomy with centralized oversight and enterprise-level policy enforcement across all domains. This approach bridges the divide between independent data management and consistent organizational standards.
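The tag-driven access control described above can be illustrated with a small in-memory model. This is a hedged sketch of the idea, not Unity Catalog's API: the catalog structure, principals, and masking behaviour below are all invented for the example.

```python
# Hypothetical in-memory catalog: columns carry tags, principals hold grants.
catalog = {
    "sales.customers": {
        "columns": {"id": set(), "email": {"pii"}, "region": set()},
        "grants": {
            "analysts": {"id", "region"},
            "compliance": {"id", "email", "region"},
        },
    }
}

def read(table, principal):
    """Return what a principal may see; PII-tagged columns are masked
    unless the principal holds an explicit grant on them."""
    meta = catalog[table]
    allowed = meta["grants"].get(principal, set())
    visible = {}
    for col, tags in meta["columns"].items():
        if "pii" in tags and col not in allowed:
            visible[col] = "***MASKED***"
        else:
            visible[col] = "<value>"
    return visible

print(read("sales.customers", "analysts"))    # email masked
print(read("sales.customers", "compliance"))  # email visible
```

A federated model as described in the text would layer this: each business unit manages its own `grants`, while enterprise policy dictates which tags (such as `pii`) trigger masking everywhere.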
AI's demand for ML-ready pipelines
By 2023, AI had dominated boardroom conversations, presenting both excitement and anxiety for data engineers. AI promised unprecedented capabilities, from real-time anomaly detection to generating SQL from natural language. However, AI models are data-hungry, requiring clean, structured, and versioned data. If data pipelines are not architected for AI consumption, the result is not just poor AI but expensive and unreliable AI. My research in 2023 focused on 'MLOps-ready pipelines' – pipelines designed from inception with AI model training at scale in mind. The 'gold' layer of the Medallion architecture evolved into a 'feature store,' a curated repository of pre-computed attributes vital for ML models (e.g., customer purchasing frequency, average basket size). My 2023 paper documented how to build these semantic data sets for ML training on platforms like Azure Databricks, ensuring that AI initiatives have the high-quality data they need to succeed.
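The feature-store idea reduces to pre-computing attributes once so every training job reads the same values. The sketch below is illustrative only (the transactions and the `build_features` helper are invented), showing the two example features named above: purchase frequency and average basket size.

```python
from collections import defaultdict

# Hypothetical transaction log; in practice this would be the silver layer.
transactions = [
    {"customer": "c1", "amount": 40.0},
    {"customer": "c1", "amount": 60.0},
    {"customer": "c2", "amount": 15.0},
]

def build_features(txns):
    """Pre-compute per-customer features (purchase count, average basket
    size) so ML training jobs consume them instead of re-deriving them
    from raw data on every run."""
    grouped = defaultdict(list)
    for t in txns:
        grouped[t["customer"]].append(t["amount"])
    return {
        cust: {
            "purchase_count": len(amounts),
            "avg_basket": sum(amounts) / len(amounts),
        }
        for cust, amounts in grouped.items()
    }

features = build_features(transactions)
print(features["c1"])  # {'purchase_count': 2, 'avg_basket': 50.0}
```

Versioning these feature tables is what makes the pipeline 'MLOps-ready': a model can always be retrained against exactly the features it was first trained on.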
AI as an accelerator for pipeline development
The year 2024 brought a new frontier: AI assisting in the creation of data pipelines themselves. My research explored prompt-driven data transformations, testing the ability of large language models (LLMs) such as GPT-4 to generate ELT pipeline code from plain English instructions. For straightforward transformations, such as joining customer transaction data with product catalogs to calculate monthly spending, the LLM-generated code was remarkably production-ready with minimal manual edits. However, for complex, domain-specific logic or custom business rules, human expertise remained indispensable. This highlighted a crucial distinction: AI acts as a powerful accelerator, not a wholesale replacement for human data engineering expertise. This finding is critical for understanding the practical integration of AI into existing workflows.
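To make the 'straightforward transformation' concrete, here is the kind of join-and-aggregate an LLM handled well in the study, written as a plain-Python sketch. The input shapes and the `monthly_spend` function are invented for illustration; production versions would be Spark or SQL.

```python
from collections import defaultdict

# Hypothetical inputs: a transaction log and a product catalog.
transactions = [
    {"customer": "c1", "product_id": "p1", "qty": 2, "month": "2024-01"},
    {"customer": "c1", "product_id": "p2", "qty": 1, "month": "2024-01"},
    {"customer": "c2", "product_id": "p1", "qty": 3, "month": "2024-02"},
]
products = {"p1": {"price": 10.0}, "p2": {"price": 25.0}}

def monthly_spend(txns, catalog):
    """Join transactions to the product catalog and total spend per
    (customer, month) -- a mechanical transformation with no hidden
    business rules, which is exactly where LLM codegen shines."""
    totals = defaultdict(float)
    for t in txns:
        price = catalog[t["product_id"]]["price"]
        totals[(t["customer"], t["month"])] += t["qty"] * price
    return dict(totals)

print(monthly_spend(transactions, products))
# {('c1', '2024-01'): 45.0, ('c2', '2024-02'): 30.0}
```

The fully specified inputs and outputs are what make this LLM-friendly; the moment a rule like "exclude returns processed under the 2019 regional policy" appears, a human engineer has to supply the context the prompt cannot.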
Navigating the platform landscape: Databricks vs. Microsoft Fabric
By 2024, the enterprise data engineering landscape was dominated by two major platforms: Azure Databricks, known for its developer-first approach, raw compute power, and ML flexibility, and Microsoft Fabric, a newer, all-in-one platform aiming to unify data engineering, BI, and AI within a single governed environment. My comparative research in 2024 aimed to provide practical guidance to organizations grappling with this choice. Databricks excels in performance, ML depth, and flexibility, while Fabric shines in unified governance, end-to-end integration (from ingestion to Power BI), and a potentially lower total cost of ownership for mid-sized enterprises. The key insight for practitioners is that the question isn't which platform is universally 'better,' but which is the optimal fit for an organization's specific data maturity, team skills, and strategic AI ambitions over the next three years. This nuanced comparison helps organizations make informed decisions about their core data infrastructure.
The architectural breakthrough of unified data lakes
Microsoft Fabric's OneLake, a unified storage layer, became a focus of my 2025 research, representing a significant architectural breakthrough. For the first time, a single logical data lake could serve both AI workloads and business intelligence reporting without data duplication, complex synchronization pipelines, or the notorious governance headaches associated with data scattered across multiple systems. This capability, while seemingly simple, is revolutionary for anyone who has managed data silos, offering a streamlined approach to data management that reduces complexity and enhances efficiency.
Agentic data pipelines for autonomous operations
In 2025, my research ventured into 'agentic data pipelines,' where AI agents, rather than human engineers or pre-written scripts, make decisions about data flow, transformation, and routing. These agents monitor data quality in real time, investigate anomalies, identify root causes, and either fix issues autonomously or escalate with detailed diagnoses. The first enterprise-scale implementation of this pattern, documented in my 2025 paper, 'Agentic Data Pipelines: Autonomous ELT Orchestration Using AI Agents on Microsoft Fabric and Databricks,' yielded extraordinary results. Organizations adopting agentic orchestration saw a 60% reduction in the mean time to discover pipeline failures, as the AI agents operate continuously and efficiently. Furthermore, research into real-time feature engineering for streaming AI workloads addressed the critical need for AI systems to act on data that is seconds old, not days. This enables applications like fraud detection to analyze transactions in real time, enriching them with contextual features and making decisions within milliseconds, a leap from outdated batch processing methods.
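One decision cycle of such an agent can be sketched in plain Python. This is a heavily simplified, invented illustration (real agents would use LLM reasoning and platform APIs); the quality probes, thresholds, and the detect / attempt-fix / escalate flow below exist only to show the shape of the pattern.

```python
def check_quality(batch, expected_rows=3):
    """Simple data-quality probes: null rate and row-count shortfall.
    Thresholds are invented for the sketch."""
    if not batch:
        return ["empty batch"]
    issues = []
    null_rate = sum(1 for r in batch if r.get("amount") is None) / len(batch)
    if null_rate > 0.05:
        issues.append(f"null rate {null_rate:.0%} exceeds 5% threshold")
    if len(batch) < expected_rows * 0.5:
        issues.append(f"row count {len(batch)} below half of expected {expected_rows}")
    return issues

def agent_step(batch):
    """One cycle of a (much simplified) pipeline agent: detect an issue,
    attempt an autonomous repair, otherwise escalate with a diagnosis."""
    issues = check_quality(batch)
    if not issues:
        return ("ok", batch)
    repaired = [r for r in batch if r.get("amount") is not None]
    remaining = check_quality(repaired)
    if remaining:  # the fix was not enough -> escalate with details
        return ("escalate", remaining)
    return ("auto_fixed", repaired)

status, result = agent_step([{"amount": 10.0}, {"amount": None}, {"amount": 5.0}])
print(status, len(result))  # auto_fixed 2
```

The 60% reduction in discovery time comes from running a loop like this continuously against every batch, instead of waiting for a downstream consumer to notice a broken dashboard.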
Embedding responsibility and compliance into AI architecture
As data systems become faster, smarter, and more autonomous, the ethical implications are becoming urgent. My 2026 research, 'Responsible AI Data Architecture: Embedding GDPR and PII Compliance into MLOps Pipelines at Enterprise Scale,' directly tackles these critical questions. Who is accountable when an AI pipeline errs? What safeguards prevent biased historical data from creating bias at an industrial scale? How can GDPR compliance be maintained when AI models are trained on data potentially containing individuals' personal details without explicit consent? The core argument is that compliance cannot be an afterthought; it must be an architectural principle embedded from the initial 'bronze' layer through to model training. Practices such as PII tagging in Unity Catalog, differential privacy in feature engineering, comprehensive audit trails, and integrated consent management at the ingestion layer are now essential entry points for any regulated industry operating with AI. Trustworthy data, underpinned by ethical and compliant architectural principles, is the foundation for responsible AI and a smarter world.
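'Consent management at the ingestion layer' can be illustrated with a minimal sketch. The consent registry, record shapes, and `ingest` function here are invented; the point is architectural: non-consented records never reach the bronze layer, and the drop itself is recorded for audit.

```python
# Hypothetical consent registry; in a real system this would live alongside
# the governance catalog and be consulted at ingestion, not after training.
consent = {"u1": True, "u2": False}

def ingest(records, consent_registry):
    """Admit only records whose subjects granted consent, and keep an
    audit trail of what was dropped -- compliance as a pipeline step
    rather than an afterthought."""
    admitted, audit = [], []
    for r in records:
        if consent_registry.get(r["user_id"], False):
            admitted.append(r)
        else:
            audit.append({"user_id": r["user_id"], "action": "dropped_no_consent"})
    return admitted, audit

admitted, audit = ingest(
    [{"user_id": "u1", "value": 1}, {"user_id": "u2", "value": 2}], consent
)
print(len(admitted), len(audit))  # 1 1
```

Because the filter runs before bronze, no downstream layer, feature store, or trained model can ever contain data the subject did not consent to, which is far easier to demonstrate to a regulator than after-the-fact deletion.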
Common Questions
AI systems can now automatically process vast amounts of sales data overnight, detect spending patterns, flag risks, and summarize findings into clear bullet points, eliminating the need for manual analyst work and speeding up decision-making.
Mentioned in this video
Azure Databricks: the platform on which the research documented how to build gold-layer semantic data sets for ML model training.
A cloud platform mentioned as part of the revolution in cloud computing that offered a new philosophy for data infrastructure.
Platform used in research optimizing ETL pipelines with Medallion architecture, demonstrating a 40% reduction in data processing failures.
A processing language benchmarked against Scala for distributed data transformations, crucial for processing billions of rows efficiently.
A programming language benchmarked against PySpark for distributed data transformations, with findings influencing enterprise processing framework choices.
A centralized governance layer that catalogs, tags, and secures data assets, enabling granular access control and PII protection.
A tool released in late 2022 that brought AI into mainstream boardroom conversations, generating both excitement and anxiety among data engineers.
A challenger all-in-one platform that unifies data engineering, BI, and AI, excelling in unified governance and lower TCO for mid-sized enterprises.
A business intelligence tool that integrates with platforms like Microsoft Fabric for end-to-end data visualization.
OneLake, Microsoft Fabric's unified storage layer, allowing a single logical data lake for AI and BI workloads without data duplication or complex synchronization.
A service used with PySpark for real-time feature engineering of streaming AI workloads, enabling decisions based on data seconds old.
A component that brought governance and trust to data assets, strengthening AI pipelines.
A three-layer system (bronze, silver, gold) for refining crude data into structured insights, improving data processing and time to insights.
A concept related to building and maintaining machine learning models in production, requiring pipelines designed for scale, repetition, and reliability.
Pipelines where AI agents, not humans or scripts, make decisions about data flow, transformation, and routing, monitoring quality and fixing issues autonomously.
Research focusing on embedding GDPR and PII compliance into ML Ops pipelines at enterprise scale, making compliance an architectural principle.