How AI on the Cloud Is Changing Everything | Narendra Mangala | TEDxGaya College of Engineering
Key Moments
AI is automating data pipelines from raw input to executive summaries, but its intelligence must be coupled with robust governance and ethical considerations to build trust.
Key Insights
The Medallion architecture (bronze, silver, gold layers) transforms raw data into business insights, with adoption leading to a 40% reduction in data processing failures.
Databricks Unity Catalog provides centralized governance, enabling granular access control and PII tagging, crucial for managing data across 50+ business units.
AI adoption creates a need for 'MLOps-ready pipelines': data pipelines designed from the outset for training machine learning models at scale.
Prompt-driven data transformations, using LLMs such as GPT-4, can generate production-ready ELT pipeline code for straightforward tasks with minimal editing, acting as an accelerator rather than a replacement for engineers.
Microsoft Fabric offers a unified platform and a single logical data lake (OneLake) for AI and BI workloads, potentially reducing data duplication and governance issues.
Agentic data pipelines, where AI agents autonomously manage data flow, quality, and issue resolution, reduce mean time to failure discovery by 60%.
The evolution from data silos to intelligent cloud ecosystems
The landscape of enterprise data processing has dramatically shifted over the past 15 years. Previously, data resided in isolated silos – sales data in one system, finance in another, operations in a third. This fragmentation made holistic business analysis incredibly difficult, with C-suite executives waiting days for analysts to gather and process data to answer basic questions like why revenue dropped in a specific region. The backbone of this era was Extract, Transform, Load (ETL) processes, which were notoriously brittle, prone to failure (often at 3:00 a.m.), and inflexible to schema changes. The arrival of cloud computing platforms like AWS, Azure, and GCP, initially met with skepticism regarding security and control, fundamentally changed this paradigm. Instead of building infrastructure to fit data, cloud philosophy allowed infrastructure to be scaled to fit the data, marking a revolution in data management. My research, spanning from 2021 to 2026, has focused on building the infrastructure that enables this transformation, moving data from dusty servers to critical decision-making tables, with AI and the cloud acting as the invisible engine.
The Medallion architecture for data refinement
A key challenge in the early cloud adoption phase (around 2021) was handling the sheer volume of messy, raw data flowing into enterprise systems and ensuring it was trustworthy for business executives. The solution lay in architectural design, specifically the Medallion architecture. This layered approach functions like an oil refinery. The 'bronze' layer ingests raw, unfiltered data directly from sources, preserving it as evidence. The 'silver' layer refines this data by cleaning duplicates, correcting data types, and applying business rules. The 'gold' layer then enriches and aggregates this refined data, creating business-ready insights for dashboards, reports, and machine learning models. My research in 2021 demonstrated that optimizing ETL pipelines with the Medallion architecture on platforms like Azure Data Lake led to a 40% reduction in data processing failures and significantly faster time to insights, providing data with a defined journey and purpose. This architectural pattern is crucial for transforming chaotic data streams into reliable business intelligence.
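The bronze/silver/gold flow can be made concrete with a minimal, illustrative sketch in plain Python. This is not the talk's actual Spark code; the sample rows and the `to_silver`/`to_gold` helpers are invented here to show the refinement steps (deduplication and type correction in silver, aggregation in gold).

```python
# Hypothetical raw feed: duplicated rows, string-typed amounts (bronze layer).
bronze = [
    {"order_id": "1001", "amount": "250.00", "region": "east"},
    {"order_id": "1001", "amount": "250.00", "region": "east"},  # duplicate
    {"order_id": "1002", "amount": "99.50",  "region": "west"},
]

def to_silver(rows):
    """Silver: deduplicate on order_id and cast amount to a numeric type."""
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({**r, "amount": float(r["amount"])})
    return out

def to_gold(rows):
    """Gold: aggregate to a business-ready view, here revenue per region."""
    revenue = {}
    for r in rows:
        revenue[r["region"]] = revenue.get(r["region"], 0.0) + r["amount"]
    return revenue

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'east': 250.0, 'west': 99.5}
```

Note that bronze is never mutated: the raw evidence survives each refinement step, which is what makes the pattern auditable.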
Performance benchmarks and the importance of processing language
As data volumes grew into billions of rows, even minor improvements in processing efficiency became critical. Recognizing that milliseconds count in large-scale data transformations, I conducted research in 2021 benchmarking PySpark against Scala for distributed data processing. The choice of processing language can drastically alter pipeline runtime, distinguishing between a 2-hour job and a 20-minute one. Hundreds of tests were performed across various data volumes, transformation complexities, and cluster configurations. These findings influenced how enterprises select processing frameworks, shifting focus from trendy options to measurably fast solutions tailored to specific workloads. This emphasis on performance ensures that data pipelines can keep pace with the demands of modern business analytics.
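The benchmarking discipline described above can be sketched in miniature with plain Python. The harness below is invented for illustration (the real study compared PySpark and Scala jobs on clusters), but it shows the same method: run each candidate implementation of the same transformation repeatedly and keep the best wall-clock time to reduce warm-up and scheduling noise.

```python
import time

def benchmark(fn, data, repeats=5):
    """Run fn over data `repeats` times and return the best wall-clock time.

    Best-of-N filters out warm-up and scheduler jitter, the same discipline
    applied (at far larger scale) when comparing distributed frameworks.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best

# Two candidate implementations of the same transformation (illustrative).
rows = list(range(100_000))
def loop_impl(xs):
    return [x * 2 for x in xs]
def map_impl(xs):
    return list(map(lambda x: x * 2, xs))

t_loop = benchmark(loop_impl, rows)
t_map = benchmark(map_impl, rows)
print(f"comprehension: {t_loop:.4f}s, map: {t_map:.4f}s")
```

The point is not which toy variant wins, but that the choice is made from measurements over realistic inputs rather than from fashion.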
Establishing trust through data governance
Building efficient data pipelines is only part of the equation; earning business trust is paramount. Data trust extends beyond accuracy to encompass robust governance – understanding who accessed what data, when, and why. This includes identifying Personally Identifiable Information (PII), ensuring compliance with regulations like GDPR, and enforcing internal policies. My research shifted focus in 2022 to address this critical governance gap, with Databricks Unity Catalog becoming a pivotal tool. Unity Catalog acts as a centralized control tower, cataloging, tagging, and securing every data asset. It allows for granular access controls, enabling the tagging of PII columns for automated encryption. A particularly challenging use case involved implementing a unified governance framework across 50 diverse business units, each with unique data assets and compliance needs. The solution adopted was a federated model, balancing local autonomy with centralized oversight and enterprise-level policy enforcement across all domains. This approach bridges the divide between independent data management and consistent organizational standards.
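The tag-driven access control described above can be illustrated with a small in-memory model. This is a hedged sketch of the idea, not Unity Catalog's API: the catalog structure, principals, and masking behaviour below are all invented for the example.

```python
# Hypothetical in-memory catalog: columns carry tags, principals hold grants.
catalog = {
    "sales.customers": {
        "columns": {"id": set(), "email": {"pii"}, "region": set()},
        "grants": {
            "analysts": {"id", "region"},
            "compliance": {"id", "email", "region"},
        },
    }
}

def read(table, principal):
    """Return what a principal may see; PII-tagged columns are masked
    unless the principal holds an explicit grant on them."""
    meta = catalog[table]
    allowed = meta["grants"].get(principal, set())
    visible = {}
    for col, tags in meta["columns"].items():
        if "pii" in tags and col not in allowed:
            visible[col] = "***MASKED***"
        else:
            visible[col] = "<value>"
    return visible

print(read("sales.customers", "analysts"))    # email masked
print(read("sales.customers", "compliance"))  # email visible
```

A federated model as described in the text would layer this: each business unit manages its own `grants`, while enterprise policy dictates which tags (such as `pii`) trigger masking everywhere.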
AI's demand for ML-ready pipelines
By 2023, AI had dominated boardroom conversations, presenting both excitement and anxiety for data engineers. AI promised unprecedented capabilities, from real-time anomaly detection to generating SQL from natural language. However, AI models are data-hungry, requiring clean, structured, and versioned data. If data pipelines are not architected for AI consumption, the result is not just poor AI but expensive and unreliable AI. My research in 2023 focused on 'MLOps-ready pipelines' – pipelines designed from inception with AI model training at scale in mind. The 'gold' layer of the Medallion architecture evolved into a 'feature store,' a curated repository of pre-computed attributes vital for ML models (e.g., customer purchasing frequency, average basket size). My 2023 paper documented how to build these semantic data sets for ML training on platforms like Azure Databricks, ensuring that AI initiatives have the high-quality data they need to succeed.
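The feature-store idea reduces to pre-computing attributes once so every training job reads the same values. The sketch below is illustrative only (the transactions and the `build_features` helper are invented), showing the two example features named above: purchase frequency and average basket size.

```python
from collections import defaultdict

# Hypothetical transaction log; in practice this would be the silver layer.
transactions = [
    {"customer": "c1", "amount": 40.0},
    {"customer": "c1", "amount": 60.0},
    {"customer": "c2", "amount": 15.0},
]

def build_features(txns):
    """Pre-compute per-customer features (purchase count, average basket
    size) so ML training jobs consume them instead of re-deriving them
    from raw data on every run."""
    grouped = defaultdict(list)
    for t in txns:
        grouped[t["customer"]].append(t["amount"])
    return {
        cust: {
            "purchase_count": len(amounts),
            "avg_basket": sum(amounts) / len(amounts),
        }
        for cust, amounts in grouped.items()
    }

features = build_features(transactions)
print(features["c1"])  # {'purchase_count': 2, 'avg_basket': 50.0}
```

Versioning these feature tables is what makes the pipeline 'MLOps-ready': a model can always be retrained against exactly the features it was first trained on.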
AI as an accelerator for pipeline development
The year 2024 brought a new frontier: AI assisting in the creation of data pipelines themselves. My research explored prompt-driven data transformations, testing the ability of large language models (LLMs) such as GPT-4 to generate ELT pipeline code from plain English instructions. For straightforward transformations, such as joining customer transaction data with product catalogs to calculate monthly spending, the LLM-generated code was remarkably production-ready with minimal manual edits. However, for complex, domain-specific logic or custom business rules, human expertise remained indispensable. This highlighted a crucial distinction: AI acts as a powerful accelerator, not a wholesale replacement for human data engineering expertise. This finding is critical for understanding the practical integration of AI into existing workflows.
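To make the 'straightforward transformation' concrete, here is the kind of join-and-aggregate an LLM handled well in the study, written as a plain-Python sketch. The input shapes and the `monthly_spend` function are invented for illustration; production versions would be Spark or SQL.

```python
from collections import defaultdict

# Hypothetical inputs: a transaction log and a product catalog.
transactions = [
    {"customer": "c1", "product_id": "p1", "qty": 2, "month": "2024-01"},
    {"customer": "c1", "product_id": "p2", "qty": 1, "month": "2024-01"},
    {"customer": "c2", "product_id": "p1", "qty": 3, "month": "2024-02"},
]
products = {"p1": {"price": 10.0}, "p2": {"price": 25.0}}

def monthly_spend(txns, catalog):
    """Join transactions to the product catalog and total spend per
    (customer, month) -- a mechanical transformation with no hidden
    business rules, which is exactly where LLM codegen shines."""
    totals = defaultdict(float)
    for t in txns:
        price = catalog[t["product_id"]]["price"]
        totals[(t["customer"], t["month"])] += t["qty"] * price
    return dict(totals)

print(monthly_spend(transactions, products))
# {('c1', '2024-01'): 45.0, ('c2', '2024-02'): 30.0}
```

The fully specified inputs and outputs are what make this LLM-friendly; the moment a rule like "exclude returns processed under the 2019 regional policy" appears, a human engineer has to supply the context the prompt cannot.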
Navigating the platform landscape: Databricks vs. Microsoft Fabric
By 2024, the enterprise data engineering landscape was dominated by two major platforms: Azure Databricks, known for its developer-first approach, raw compute power, and ML flexibility, and Microsoft Fabric, a newer, all-in-one platform aiming to unify data engineering, BI, and AI within a single governed environment. My comparative research in 2024 aimed to provide practical guidance to organizations grappling with this choice. Databricks excels in performance, ML depth, and flexibility, while Fabric shines in unified governance, end-to-end integration (from ingestion to Power BI), and a potentially lower total cost of ownership for mid-sized enterprises. The key insight for practitioners is that the question isn't which platform is universally 'better,' but which is the optimal fit for an organization's specific data maturity, team skills, and strategic AI ambitions over the next three years. This nuanced comparison helps organizations make informed decisions about their core data infrastructure.
The architectural breakthrough of unified data lakes
Microsoft Fabric's OneLake, a unified storage layer, became a focus of my 2025 research, representing a significant architectural breakthrough. For the first time, a single logical data lake could serve both AI workloads and business intelligence reporting without data duplication, complex synchronization pipelines, or the notorious governance headaches associated with data scattered across multiple systems. This capability, while seemingly simple, is revolutionary for anyone who has managed data silos, offering a streamlined approach to data management that reduces complexity and enhances efficiency.
Agentic data pipelines for autonomous operations
In 2025, my research ventured into 'agentic data pipelines,' where AI agents, rather than human engineers or pre-written scripts, make decisions about data flow, transformation, and routing. These agents monitor data quality in real time, investigate anomalies, identify root causes, and either fix issues autonomously or escalate with detailed diagnoses. The first enterprise-scale implementation of this pattern, documented in my 2025 paper, 'Agentic Data Pipelines: Autonomous ELT Orchestration Using AI Agents on Microsoft Fabric and Databricks,' yielded extraordinary results. Organizations adopting agentic orchestration saw a 60% reduction in the mean time to discover pipeline failures, as the AI agents operate continuously and efficiently. Furthermore, research into real-time feature engineering for streaming AI workloads addressed the critical need for AI systems to act on data that is seconds old, not days. This enables applications like fraud detection to analyze transactions in real time, enriching them with contextual features and making decisions within milliseconds, a leap from outdated batch processing methods.
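One decision cycle of such an agent can be sketched in plain Python. This is a heavily simplified, invented illustration (real agents would use LLM reasoning and platform APIs); the quality probes, thresholds, and the detect / attempt-fix / escalate flow below exist only to show the shape of the pattern.

```python
def check_quality(batch, expected_rows=3):
    """Simple data-quality probes: null rate and row-count shortfall.
    Thresholds are invented for the sketch."""
    if not batch:
        return ["empty batch"]
    issues = []
    null_rate = sum(1 for r in batch if r.get("amount") is None) / len(batch)
    if null_rate > 0.05:
        issues.append(f"null rate {null_rate:.0%} exceeds 5% threshold")
    if len(batch) < expected_rows * 0.5:
        issues.append(f"row count {len(batch)} below half of expected {expected_rows}")
    return issues

def agent_step(batch):
    """One cycle of a (much simplified) pipeline agent: detect an issue,
    attempt an autonomous repair, otherwise escalate with a diagnosis."""
    issues = check_quality(batch)
    if not issues:
        return ("ok", batch)
    repaired = [r for r in batch if r.get("amount") is not None]
    remaining = check_quality(repaired)
    if remaining:  # the fix was not enough -> escalate with details
        return ("escalate", remaining)
    return ("auto_fixed", repaired)

status, result = agent_step([{"amount": 10.0}, {"amount": None}, {"amount": 5.0}])
print(status, len(result))  # auto_fixed 2
```

The 60% reduction in discovery time comes from running a loop like this continuously against every batch, instead of waiting for a downstream consumer to notice a broken dashboard.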
Embedding responsibility and compliance into AI architecture
As data systems become faster, smarter, and more autonomous, the ethical implications are becoming urgent. My 2026 research, 'Responsible AI Data Architecture: Embedding GDPR and PII Compliance into MLOps Pipelines at Enterprise Scale,' directly tackles these critical questions. Who is accountable when an AI pipeline errs? What safeguards prevent biased historical data from creating bias at an industrial scale? How can GDPR compliance be maintained when AI models are trained on data potentially containing individuals' personal details without explicit consent? The core argument is that compliance cannot be an afterthought; it must be an architectural principle embedded from the initial 'bronze' layer through to model training. Practices such as PII tagging in Unity Catalog, differential privacy in feature engineering, comprehensive audit trails, and integrated consent management at the ingestion layer are now essential entry points for any regulated industry operating with AI. Trustworthy data, underpinned by ethical and compliant architectural principles, is the foundation for responsible AI and a smarter world.
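'Consent management at the ingestion layer' can be illustrated with a minimal sketch. The consent registry, record shapes, and `ingest` function here are invented; the point is architectural: non-consented records never reach the bronze layer, and the drop itself is recorded for audit.

```python
# Hypothetical consent registry; in a real system this would live alongside
# the governance catalog and be consulted at ingestion, not after training.
consent = {"u1": True, "u2": False}

def ingest(records, consent_registry):
    """Admit only records whose subjects granted consent, and keep an
    audit trail of what was dropped -- compliance as a pipeline step
    rather than an afterthought."""
    admitted, audit = [], []
    for r in records:
        if consent_registry.get(r["user_id"], False):
            admitted.append(r)
        else:
            audit.append({"user_id": r["user_id"], "action": "dropped_no_consent"})
    return admitted, audit

admitted, audit = ingest(
    [{"user_id": "u1", "value": 1}, {"user_id": "u2", "value": 2}], consent
)
print(len(admitted), len(audit))  # 1 1
```

Because the filter runs before bronze, no downstream layer, feature store, or trained model can ever contain data the subject did not consent to, which is far easier to demonstrate to a regulator than after-the-fact deletion.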
Common Questions
AI systems can now automatically process vast amounts of sales data overnight, detect spending patterns, flag risks, and summarize findings into clear bullet points, eliminating the need for manual analyst work and speeding up decision-making.
Mentioned in this video
Azure Databricks: the platform on which the research documented how to build gold-layer semantic data sets for ML model training.
A cloud platform mentioned as part of the revolution in cloud computing that offered a new philosophy for data infrastructure.
Platform used in research optimizing ETL pipelines with Medallion architecture, demonstrating a 40% reduction in data processing failures.
A processing language benchmarked against Scala for distributed data transformations, crucial for processing billions of rows efficiently.
A programming language benchmarked against PySpark for distributed data transformations, with findings influencing enterprise processing framework choices.
A centralized governance layer that catalogs, tags, and secures data assets, enabling granular access control and PII protection.
A tool released in late 2022 that brought AI into mainstream boardroom conversations, generating both excitement and anxiety among data engineers.
A challenger all-in-one platform that unifies data engineering, BI, and AI, excelling in unified governance and lower TCO for mid-sized enterprises.
A business intelligence tool that integrates with platforms like Microsoft Fabric for end-to-end data visualization.
OneLake, Microsoft Fabric's unified storage layer, allowing a single logical data lake for AI and BI workloads without data duplication or complex synchronization.
A service used with PySpark for real-time feature engineering of streaming AI workloads, enabling decisions based on data seconds old.
A component that brought governance and trust to data assets, strengthening AI pipelines.
A three-layer system (bronze, silver, gold) for refining crude data into structured insights, improving data processing and time to insights.
A concept related to building and maintaining machine learning models in production, requiring pipelines designed for scale, repetition, and reliability.
Pipelines where AI agents, not humans or scripts, make decisions about data flow, transformation, and routing, monitoring quality and fixing issues autonomously.
Research focusing on embedding GDPR and PII compliance into ML Ops pipelines at enterprise scale, making compliance an architectural principle.