Jitendra Malik: Computer Vision | Lex Fridman Podcast #110

Lex Fridman · Science & Technology · 102 min video · 4 min read
Jul 21, 2020 · 72,729 views
TL;DR

Jitendra Malik discusses computer vision, its challenges, and the future, emphasizing learning like a child and the importance of interaction.

Key Insights

1. Computer vision is chronically underestimated due to the subconscious nature of human vision, leading to a "fallacy of the successful first step."

2. True understanding in computer vision requires confronting higher-level cognitive reasoning, especially for dynamic and unpredictable scenarios like autonomous driving.

3. Learning like a child, with multimodal input, interactivity, and exploration, is crucial for developing robust AI systems beyond current supervised methods.

4. The ultimate goal of computer vision is to guide action, mirroring the evolutionary link between perception and movement in biological systems.

5. Video understanding is significantly behind image recognition, partly due to computational challenges, but progress is accelerating.

6. Understanding the "long-form" behavior of agents in videos, including their goals and intentionality, remains a major unsolved problem.

7. Explainability in AI is complex; while it is important in critical applications, humans are also black boxes, suggesting a balance between performance and interpretability.

8. Current AI threats are not just future AGIs but also present-day issues like biases in recommendation systems and the potential for errors in deployed AI.

THE UNDERESTIMATED CHALLENGE OF COMPUTER VISION

Jitendra Malik highlights that computer vision is inherently more complex than it appears, largely because human visual processing is subconscious. This effortless perception creates a "fallacy of the successful first step," where initial progress is rapid but achieving near-perfect results becomes exponentially harder and more time-consuming. Unlike chess or theorem proving, which engage conscious thought, vision operates largely below our awareness, obscuring its true difficulty. This underestimation has historically led AI researchers to misjudge the problem's scope.

THE NECESSITY OF COGNITIVE REASONING AND PREDICTION

Malik emphasizes that truly solving computer vision, especially in real-world applications like autonomous driving, requires sophisticated cognitive reasoning. Peripheral processing can handle simpler tasks, but unpredictable scenarios demand higher-level understanding. This includes predicting the future based on current observations and understanding the potential behaviors of other agents, which he considers part of perception. The challenge is that these systems need to go beyond recognizing what is, to anticipating what *will* happen to ensure safe and effective action.

LEARNING LIKE A CHILD: A PARADIGM SHIFT

Current AI, particularly in computer vision, often relies on "tabula rasa" supervised learning with vast datasets. Malik contrasts this with human learning, where foundational visual understanding is built from birth through exploration and interaction. He advocates for AI systems that learn incrementally, drawing on multimodal inputs (vision, touch, sound), interactivity, and a deeper understanding of physics. This child-like learning approach, including self-supervision and active experimentation, is seen as essential for developing more robust and generalizable intelligence.

THE FUNDAMENTAL LINK BETWEEN VISION AND ACTION

Malik posits that perception, particularly vision, evolved to guide action. From simple biological systems seeking food or avoiding predators to complex human activities, vision's primary purpose is to enable purposeful interaction with the environment. This principle suggests that the ultimate goal of computer vision should be to create systems capable of acting effectively in the world, whether navigating a robot or making nuanced decisions in complex scenarios.

THE FRONTIER OF VIDEO UNDERSTANDING AND LONG-FORM ANALYSIS

While significant progress has been made in static image recognition, video understanding, especially long-form analysis, lags considerably. Malik suggests video recognition is about a decade behind object recognition in terms of performance. The challenges involve not just processing more data but understanding temporal dynamics, agent behavior, goals, and intentionality over extended periods. This requires integrating sophisticated cognitive concepts like "schemas" and "scripts," moving beyond short-term action recognition to a deeper comprehension of dynamic scenes.

CHALLENGES IN 3D UNDERSTANDING AND THE ROLE OF EXPLAINABILITY

Achieving rich, unified 3D understanding from varied viewpoints remains an open problem. Current methods often rely on multi-view geometry or single-view supervised learning with artificial 3D models, neither of which fully captures how humans build internal 3D representations. Regarding AI's 'black box' nature, Malik notes that humans are also largely unexplainable. While explainability is crucial in critical fields like medicine, he believes high-performing, even opaque, systems are valuable, acknowledging that narrative explanations can bridge the gap between system performance and human trust.

THE FUTURE OF INTELLIGENCE AND PRESENT-DAY AI RISKS

Malik is optimistic about the *possibility* of achieving human-level or superhuman intelligence but views it as a long-term endeavor, likely beyond the next 20 years, with many "unknown unknowns." He stresses that AI risks are not confined to future superintelligence; they are present today. Issues like algorithmic bias in recommendation systems and the deployment of AI in critical applications (e.g., self-driving cars) necessitate continuous vigilance concerning safety, fairness, and unintended consequences. The power of AI at scale, as seen in social media algorithms, already exerts significant influence.

MENTORSHIP, PROBLEM SELECTION, AND THE ART OF THE SOLUBLE

Reflecting on his career, Malik emphasizes the importance of fortunate timing in scientific progress and the crucial role of mentorship. He believes his contribution to students lies in fostering a "taste" for picking the right problems: those that are not yet solved but are solvable, with a "soft underbelly." Drawing on intellectual breadth from psychology and neuroscience, he aims to guide students toward impactful research, and he takes particular pride in students who achieve sustained success long after their time with him.

Common Questions

Why is computer vision so often underestimated?

Computer vision is often underestimated because much of human visual processing happens subconsciously, making it seem effortless. Historically, AI researchers have also fallen victim to the "fallacy of the successful first step," where initial progress seems easy but achieving robust, high-performance systems is extremely challenging and time-consuming.

Mentioned in this video

People
Peter Medawar

British Nobel laureate who wrote 'The Art of the Soluble,' which describes research as finding problems that are not yet solved but approachable.

Seymour Papert

At MIT, he proposed the 'Summer Vision Project' in 1966, outlining many computer vision tasks that are still relevant today.

Andrej Karpathy

Works with Elon Musk on Tesla's Autopilot system.

Yann LeCun

Mentioned as a friend who was comfortable with deep learning systems working robustly even before the broader community's acceptance.

Alison Gopnik

A colleague who co-authored the book 'The Scientist in the Crib,' which refers to children as scientists performing controlled experiments to build causal models.

Alan Turing

His paper 'Computing Machinery and Intelligence' and the Turing Test are referenced, particularly his suggestion to simulate a child's mind.

Elon Musk

Working with Andrej Karpathy on Tesla's Autopilot system, which is a vision-based approach to autonomous driving.

Hans Moravec

His calculations from the 1990s about computing power comparable to the brain are referenced, noting that modern GPUs might be reaching those levels.

Judea Pearl

Mentioned as someone who has extensively discussed the neglect of causality in AI, describing deep learning successes as merely 'curve fitting.'

Noam Chomsky

His belief that language may be at the core of human cognition is discussed, while Malik argues vision is more fundamental.

David Hilbert

Proposed 23 open problems in mathematics in 1900, which inspired the idea of 'Hilbert problems of computer vision.'

Ernst Dickmanns

Demonstrated autonomous driving capabilities in freeway conditions in the 1980s in Munich.

Fyodor Dostoevsky

Writer of 'The Idiot,' from which the concluding quote of the podcast episode is taken.

Donald Rumsfeld

His quote about 'known knowns, known unknowns, and unknown unknowns' is used to frame the current challenges in AI research.
