Jitendra Malik: Computer Vision | Lex Fridman Podcast #110
Key Moments
Jitendra Malik discusses computer vision, its challenges, and the future, emphasizing learning like a child and the importance of interaction.
Key Insights
Computer vision is chronically underestimated due to the subconscious nature of human vision, leading to a "fallacy of the successful first step."
True understanding in computer vision requires confronting higher-level cognitive reasoning, especially for dynamic and unpredictable scenarios like autonomous driving.
Learning like a child, with multimodal input, interactivity, and exploration, is crucial for developing robust AI systems beyond current supervised methods.
The ultimate goal of computer vision is to guide action, mirroring the evolutionary link between perception and movement in biological systems.
Video understanding is significantly behind image recognition, partly due to computational challenges, but progress is accelerating.
Understanding the "long-form" behavior of agents in videos, including their goals and intentionality, remains a major unsolved problem.
Explainability in AI is complex; while important in critical applications, humans are also black boxes, suggesting a balance between performance and interpretability.
Current AI threats are not just future AGIs but also present-day issues like biases in recommendation systems and the potential for errors in deployed AI.
THE UNDERESTIMATED CHALLENGE OF COMPUTER VISION
Jitendra Malik highlights that computer vision is inherently more complex than it appears, largely because human visual processing is subconscious. This effortless perception creates a 'fallacy of the successful first step,' where initial progress is rapid but achieving near-perfect results becomes exponentially harder and more time-consuming. Unlike chess or theorem proving, which engage conscious thought, vision operates largely below our awareness, obscuring its true difficulty. This underestimation has historically led AI researchers to misjudge the problem's scope.
THE NECESSITY OF COGNITIVE REASONING AND PREDICTION
Malik emphasizes that truly solving computer vision, especially in real-world applications like autonomous driving, requires sophisticated cognitive reasoning. Peripheral processing can handle simpler tasks, but unpredictable scenarios demand higher-level understanding. This includes predicting the future based on current observations and understanding the potential behaviors of other agents, which he considers part of perception. The challenge is that these systems need to go beyond recognizing what is, to anticipating what *will* happen to ensure safe and effective action.
LEARNING LIKE A CHILD: A PARADIGM SHIFT
Current AI, particularly in computer vision, often relies on "tabula rasa" supervised learning with vast datasets. Malik contrasts this with human learning, where foundational visual understanding is built from birth through exploration and interaction. He advocates for AI systems that learn incrementally, drawing on multimodal inputs (vision, touch, sound), interactivity, and a deeper understanding of physics. This child-like learning approach, including self-supervision and active experimentation, is seen as essential for developing more robust and generalizable intelligence.
THE FUNDAMENTAL LINK BETWEEN VISION AND ACTION
Fundamentally, Malik posits that perception, particularly vision, evolved to guide action. From simple biological systems seeking food or avoiding predators to complex human activities, vision's primary purpose is to enable purposeful interaction with the environment. This principle suggests that the ultimate goal of computer vision should be to create systems capable of acting effectively in the world, whether it's navigating a robot or making nuanced decisions in complex scenarios.
THE FRONTIER OF VIDEO UNDERSTANDING AND LONG-FORM ANALYSIS
While significant progress has been made in static image recognition, video understanding, especially long-form analysis, lags considerably. Malik suggests video recognition is about a decade behind object recognition in terms of performance. The challenges involve not just processing more data but understanding temporal dynamics, agent behavior, goals, and intentionality over extended periods. This requires integrating sophisticated cognitive concepts like 'schemas' and 'scripts,' moving beyond short-term action recognition to a deeper comprehension of dynamic scenes.
CHALLENGES IN 3D UNDERSTANDING AND THE ROLE OF EXPLAINABILITY
Achieving rich, unified 3D understanding from varied viewpoints remains an open problem. Current methods often rely on multi-view geometry or single-view supervised learning with artificial 3D models, neither of which fully captures how humans build internal 3D representations. Regarding AI's 'black box' nature, Malik notes that humans are also largely unexplainable. While explainability is crucial in critical fields like medicine, he believes high-performing, even opaque, systems are valuable, acknowledging that narrative explanations can bridge the gap between system performance and human trust.
THE FUTURE OF INTELLIGENCE AND PRESENT-DAY AI RISKS
Malik is optimistic about the *possibility* of achieving human-level or superhuman intelligence but views it as a long-term endeavor, likely beyond the next 20 years, with many "unknown unknowns." He stresses that AI risks are not confined to future superintelligence; they are present today. Issues like algorithmic bias in recommendation systems and the deployment of AI in critical applications (e.g., self-driving cars) necessitate continuous vigilance concerning safety, fairness, and unintended consequences. The power of AI at scale, as seen in social media algorithms, already exerts significant influence.
MENTORSHIP, PROBLEM SELECTION, AND THE ART OF THE SOLUBLE
Reflecting on his career, Malik emphasizes the importance of fortunate timing in scientific progress and the crucial role of mentorship. He believes his contribution to students lies in fostering a sense of 'taste' for picking the right problems—those that are not yet solved but are 'solvable,' with a "soft underbelly." Combined with intellectual breadth, drawing from psychology and neuroscience, he aims to guide students toward impactful research. He takes particular pride in students who achieve sustained success long after their time with him.
Common Questions
Why is computer vision often underestimated?
Computer vision is often underestimated because much of human visual processing happens subconsciously, making it seem effortless. Historically, AI researchers have also fallen victim to the 'fallacy of the successful first step,' where initial progress seems easy but achieving robust, high-performance systems is extremely challenging and time-consuming.
Mentioned in this video
Tesla, the company behind the Autopilot system for autonomous driving; the speaker owns one of its cars and gives examples of the system's safety features and limitations.
The speaker notes that even at Google's scale, processing all YouTube video content remains very challenging from a computer vision perspective.
Mentioned in the context of a self-driving car incident in Arizona that caused a fatality, highlighting real-world AI risks.
Mentioned as a large company that still struggles with video computing at scale, despite advancements in computer vision.
Used as an example of how human perception is tied to action, specifically subscribing to a service based on enjoyment.
Mentioned as a large company that still struggles with video computing at scale, and where Jitendra Malik's group developed the Habitat simulation environment.
Cited as an example of a platform using recommendation systems powered by machine learning algorithms that control access to information and influence ideas.
A deep neural network with 50 layers, used as an example of common AI architecture in computer vision, contrasted with the shallower networks in the human brain.
Tesla's vision-based system for autonomous driving, featuring eight cameras and a single multi-task neural network called Hydronet.
Tesla's multi-task neural network used in their Autopilot system, designed to handle multiple vision tasks while forming a common core representation.
A visually photorealistic simulation environment developed at Facebook AI Research, designed for training agents in virtual houses and urban spaces.
A book co-authored by Alison Gopnik, which conceptualizes children as scientists who perform controlled experiments to develop causal models of the world.
A novel by Fyodor Dostoevsky, from which the concluding quote 'Beauty will save the world' is taken.
A book by Nobel laureate Peter Medawar, which defines research as the art of finding problems that are not yet solved but are approachable.
A concept proposed by Jitendra Malik, inspired by David Hilbert's mathematical challenges, to define key open problems in computer vision, such as long-form video understanding and comprehensive 3D understanding.
A 1966 proposal by Seymour Papert at MIT to have 10 students solve computer vision over a summer, highlighting the underestimation of the problem's difficulty.
Graphics Processing Units, which now have amazing computing power, potentially comparable to the human brain based on 1990s calculations, but are far more power-hungry.
A test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Malik suggests it's no longer the right way to channel AI research.
A famous unsolved problem in mathematics, mentioned as an example of a Hilbert problem.
British Nobel laureate who wrote 'The Art of the Soluble,' which describes research as finding problems that are not yet solved but approachable.
At MIT, he proposed the 'Summer Vision Project' in 1966, outlining many computer vision tasks that are still relevant today.
Works with Elon Musk on Tesla's Autopilot system.
Mentioned as a friend who was comfortable with deep learning systems working robustly even before the broader community's acceptance.
A colleague who co-authored the book 'The Scientist in the Crib,' which refers to children as scientists performing controlled experiments to build causal models.
His paper 'Computing Machinery and Intelligence' and the Turing Test are referenced, particularly his suggestion to simulate a child's mind.
Working with Andrej Karpathy on Tesla's Autopilot system, which is a vision-based approach to autonomous driving.
His calculations from the 1990s about computing power comparable to the brain are referenced, noting that modern GPUs might be reaching those levels.
Mentioned as someone who has extensively discussed the neglect of causality in AI, describing deep learning successes as merely 'curve fitting.'
His belief that language may be at the core of human cognition is discussed, while Malik argues vision is more fundamental.
Proposed 23 open problems in mathematics in 1900, which inspired the idea of 'Hilbert problems of computer vision.'
Demonstrated autonomous driving capabilities in freeway conditions in the 1980s in Munich.
Writer of 'The Idiot,' from which the concluding quote of the podcast episode is taken.
His quote about 'known knowns, known unknowns, and unknown unknowns' is used to frame the current challenges in AI research.
The institution where Seymour Papert proposed the Summer Vision Project.
Had approaches to autonomous driving in the 2000s.
Had approaches to autonomous driving in the 1990s.
The research group where Jitendra Malik's team worked on the Habitat simulation environment.
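The "single multi-task neural network with a common core representation" described for Hydronet above can be illustrated with a toy sketch. This is plain NumPy, and the task names, layer sizes, and architecture are invented for illustration only; Tesla's actual system is not public here.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class MultiTaskNet:
    """Toy multi-task network: one shared 'core' backbone feeds
    several task-specific output heads (hypothetical tasks)."""

    def __init__(self, in_dim, core_dim, task_dims):
        # Shared backbone weights, used by every task.
        self.W_core = rng.normal(scale=0.1, size=(in_dim, core_dim))
        # One small linear head per task, reading the common representation.
        self.heads = {name: rng.normal(scale=0.1, size=(core_dim, out_dim))
                      for name, out_dim in task_dims.items()}

    def forward(self, x):
        core = relu(x @ self.W_core)  # common core representation
        return {name: core @ W for name, W in self.heads.items()}

# Hypothetical task names and sizes, for illustration only.
net = MultiTaskNet(in_dim=64, core_dim=32,
                   task_dims={"lane_detection": 4, "object_class": 10})
batch = rng.normal(size=(8, 64))      # a batch of 8 "camera" feature vectors
outputs = net.forward(batch)
print({name: out.shape for name, out in outputs.items()})
```

The design point the description makes is that the backbone computation is done once and shared, so adding a new vision task costs only a new head rather than a whole new network.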