Key Moments

Heroes of NLP: Oren Etzioni

DeepLearning.AI
Science & Technology · 6 min read · 35 min video
Oct 13, 2020 | 3,206 views
TL;DR

Oren Etzioni discusses NLP's evolution, Semantic Scholar, ethical AI, and advice for aspiring professionals.

Key Insights

1. Deep learning's unexpected commercial success stemmed from fundamental AI research questions.

2. Open Information Extraction aimed to map vast web data into comprehensive knowledge bases.

3. Semantic Scholar uses AI to help researchers navigate the overwhelming volume of scientific publications.

4. The COVID-19 pandemic highlighted the importance of AI in rapidly processing and disseminating critical research.

5. Successful AI startups require meticulous planning around data acquisition and labeling strategies.

6. Self-labeling data in sequential processes, like language modeling, is crucial for training large models.

7. The growth of AI models is likely to continue, balanced by a focus on efficiency and 'Green AI'.

8. Career paths in AI/NLP can be optimized for compensation and excitement (industry) or freedom and fundamental questions (academia).

9. Regulation should focus on specific AI applications rather than basic research, emphasizing auditing over explanations.

10. Aspiring NLP professionals should build strong fundamentals, leverage online resources, and gain practical experience.

ORIGINS AND EVOLUTION OF AI INTEREST

Oren Etzioni's fascination with Artificial Intelligence began in high school after reading Douglas Hofstadter's 'Gödel, Escher, Bach.' This sparked a profound interest in the fundamental questions surrounding intelligence and the creation of intelligent machines. His early exploration involved learning Lisp, an ancient programming language, which he found 'endlessly fun.' This foundational interest continued into college, where he focused on computer science as the pathway to AI. He notes how the pursuit of these fundamental intellectual questions has unexpectedly led to powerful technologies and commercial success, particularly with the rise of deep learning.

PIONEERING OPEN INFORMATION EXTRACTION

Etzioni was a pioneer in 'open information extraction' (Open IE) from the web. The core idea was to map unstructured sentences from the web into structured database tuples, moving beyond narrowly defined event extraction like M&A or terrorist events. His motto, 'no sentence left behind,' aimed to extract information from any sentence to build a comprehensive knowledge base. This required generalizing techniques beyond traditional supervised learning by developing unsupervised methods that could learn from the vast, diverse language used online. He observed linguistic invariants and regularities in how relationships are expressed, providing strong signals for learning algorithms.
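The sentence-to-tuple mapping can be illustrated with a deliberately naive sketch. The function, relation list, and regex below are hypothetical and hard-coded for illustration; Etzioni's actual Open IE systems, such as TextRunner and ReVerb, learn their extractors from data rather than matching fixed patterns:

```python
import re

# Hypothetical relation phrases; real Open IE systems learn these from text.
RELATIONS = r"acquired|founded|is located in|works at"

PATTERN = re.compile(rf"^(?P<arg1>.+?)\s+(?P<rel>{RELATIONS})\s+(?P<arg2>.+?)\.?$")

def extract_triple(sentence):
    """Map a sentence to a structured (arg1, relation, arg2) tuple, or None."""
    m = PATTERN.match(sentence)
    if m:
        return (m.group("arg1"), m.group("rel"), m.group("arg2"))
    return None

print(extract_triple("Microsoft acquired Farecast."))
# ('Microsoft', 'acquired', 'Farecast')
```

The 'no sentence left behind' ambition is exactly what this toy version lacks: it only fires on a handful of relation phrases, whereas Open IE generalizes to relations it has never seen spelled out in advance.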

SEMANTIC SCHOLAR: NAVIGATING SCIENTIFIC LITERATURE

As CEO of the Allen Institute for AI (AI2), Etzioni leads initiatives focused on 'AI for the common good.' One significant project is Semantic Scholar, designed to combat the 'Moore's Law of scientific publication'—the rapid exponential growth of research papers. Semantic Scholar uses AI to help scientists and the public access relevant papers more efficiently. It offers features like 'extreme summaries' (TLDRs) for papers and uses computer vision to extract figures and tables, enabling users to quickly assess a paper's relevance. This saves researchers valuable time amid an increasing deluge of information.

AI'S ROLE IN THE COVID-19 PANDEMIC

Semantic Scholar played a crucial role during the COVID-19 pandemic. In early 2020, the White House reached out to AI2 because of its capabilities in processing large collections of papers. AI2, in collaboration with organizations like the Chan Zuckerberg Initiative and Microsoft, rapidly assembled and continues to maintain a machine-readable corpus of over 200,000 COVID-19-related papers. This open dataset, known as CORD-19, enabled AI systems to accelerate research and answer critical questions about the virus far more quickly than traditional methods allowed.

DATA AND LABELING IN AI STARTUPS

Etzioni emphasizes the critical importance of data for AI-based startups. He highlights the 'dirty little secret' that success often hinges not just on big data, but on plentiful labels. Entrepreneurs must carefully consider where their data will come from and how it will be labeled. He uses his experience with Farecast, a successful airfare prediction company, as an example. Farecast generated 'a trillion labeled data points' by leveraging the sequential nature of temporal data; as flight prices changed, predictions were automatically validated over time, creating a self-labeling dataset without manual intervention.
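Temporal self-labeling of this kind can be sketched in a few lines. The price history and seven-day horizon below are invented for illustration; this shows the general pattern, not Farecast's actual pipeline:

```python
from datetime import date, timedelta

# Hypothetical fare history for one route: (observation_date, fare).
history = [
    (date(2020, 1, 1), 320),
    (date(2020, 1, 3), 310),
    (date(2020, 1, 6), 290),
    (date(2020, 1, 10), 335),
    (date(2020, 1, 15), 330),
]

def self_label(history, horizon_days=7):
    """Label each observation with whether the fare drops within the horizon.

    The label comes "for free" by looking ahead in the same time series,
    so no manual annotation is needed.
    """
    examples = []
    for i, (day, fare) in enumerate(history):
        window = [f for d, f in history[i + 1:]
                  if d - day <= timedelta(days=horizon_days)]
        if window:  # only observations with a lookahead window get a label
            examples.append((day, fare, min(window) < fare))
    return examples

for day, fare, dropped in self_label(history):
    print(day, fare, "drop ahead" if dropped else "no drop")
```

Each observation is both a feature vector and, once time passes, the label for an earlier prediction — which is how a modest price feed compounds into an enormous labeled dataset.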

SELF-LABELING DATA AND THE RISE OF LARGE MODELS

The concept of self-labeling data extends to modern NLP, particularly with large language models like BERT and GPT-3. These models effectively 'label themselves' by predicting held-out words (masked words within a sentence in BERT's case, the next word in GPT-3's). The inherent sequential nature of language allows for a form of self-supervision: if a model predicts a word, it can compare its prediction to the actual word in the corpus. This capability has fueled the dramatic success and growth of these models, enabling them to learn from vast amounts of text with less reliance on explicit human labeling, a significant breakthrough for the field.
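A minimal sketch of this self-supervision loop follows. The toy corpus and one-word masking scheme are simplifications assumed for illustration; BERT-style training actually masks a fraction of subword tokens across large corpora:

```python
import random

corpus = "the model predicts the masked word from its context".split()

def make_training_example(tokens, rng):
    """Hide one token; the hidden token itself becomes the training label."""
    i = rng.randrange(len(tokens))
    inputs = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    return inputs, tokens[i]

rng = random.Random(0)  # seeded for reproducibility
inputs, label = make_training_example(corpus, rng)
print(" ".join(inputs), "-> label:", label)
```

No annotator ever touches this data: the corpus supplies both the input and the answer, which is why such models can scale to essentially all available text.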

THE FUTURE TREND OF MODEL AND DATA SCALING

While acknowledging his past predictions on model size plateauing were incorrect, Etzioni believes the trend toward larger models and datasets will continue due to the persistent hunger for performance. However, he anticipates a dual trajectory: continued scaling alongside a growing focus on optimization. This includes developing more data-efficient strategies and computational efficiencies. He draws an analogy to chess, which evolved from requiring supercomputers to running on laptops via better algorithms, alongside simultaneous scaling up in complexity to games like Go, suggesting both brute-force scaling and refinement will characterize AI's future.

EMERGING FOCUS ON 'GREEN AI' AND ACCESSIBILITY

Recognizing the significant computational cost and barrier to entry for massive models, AI2 is exploring 'Green AI.' This approach emphasizes efficiency and accessibility, aiming to achieve state-of-the-art results with fewer resources. The goal is to enable researchers to work effectively with smaller budgets and datasets, fostering broader participation. Concepts like 'NLP in a Box' explore delivering powerful NLP capabilities on devices like laptops or phones, addressing privacy concerns and the intermittent-connectivity challenges of edge computing.

ACADEMIA VS. INDUSTRY CAREER PATHS

Etzioni likens choosing between academia and industry in AI to optimizing for different goals. The private sector, particularly startups, often appeals to those optimizing for compensation and adrenaline-fueled excitement, similar to a car race or poker game. Conversely, academia offers maximum freedom to pursue fundamental intellectual questions deeply and on one's own terms, without external pressures. He has experienced both, valuing the deep, uninterrupted intellectual exploration in academia and the exhilarating challenge of building and succeeding with a team in the commercial sector.

REGULATING AI: APPLICATIONS OVER RESEARCH

Discussing AI regulation, Etzioni strongly advocates for regulating specific applications rather than basic underlying research. He warns against legislating values into technology. Bias in NLP, for instance, is a serious issue, but the focus should be on how that bias manifests in applications like resume scanning software, which can be legally challenged and audited. Regulating the applications ensures accountability for their problematic impacts, while allowing fundamental research to flourish and ensuring that technological advancements are not stifled by overly broad regulations.

AUDITING AND TRANSPARENCY IN AI

Etzioni highlights the critical difference between demanding explanations from AI models and enabling auditing. Deep learning models, with their vast numbers of parameters, may struggle to provide truly understandable explanations. He suggests that mandating a 'right to audit' is more practical and robust, giving regulatory agencies or third parties access to probe model behavior for bias or fairness. A marketplace of auditors, involving stakeholders such as journalists and nonprofits, can provide checks and balances, fostering greater transparency and accountability than potentially inscrutable or misleading explanations.
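A black-box audit of the kind described above needs only query access to a model, not a view of its internals. The sketch below is hypothetical: the `model`, `applicants`, and `group` field are invented to illustrate a disparate-selection-rate check on a resume screener:

```python
def audit_selection_rates(model, applicants, group_key):
    """Compare a black-box model's positive-decision rate per group."""
    counts = {}
    for person in applicants:
        g = person[group_key]
        n, k = counts.get(g, (0, 0))
        counts[g] = (n + 1, k + (1 if model(person) else 0))
    return {g: k / n for g, (n, k) in counts.items()}

# Toy screening model that keys on years of experience only.
model = lambda p: p["years_experience"] >= 5
applicants = [
    {"group": "A", "years_experience": 6},
    {"group": "A", "years_experience": 7},
    {"group": "B", "years_experience": 3},
    {"group": "B", "years_experience": 8},
]
print(audit_selection_rates(model, applicants, "group"))
# {'A': 1.0, 'B': 0.5}
```

A real audit would use many more queries and a statistical test, but the point stands: an auditor can surface a disparity like this without any explanation of how the model works inside.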

ADVICE FOR ASPIRING NLP PROFESSIONALS

For those looking to enter or grow in NLP, Etzioni stresses the importance of mastering fundamental skills in statistics, computer science, and machine learning. He recommends leveraging cost-efficient and accessible online courses, including DeepLearning.AI's NLP specialization. Crucially, he emphasizes that theoretical knowledge must be complemented by hands-on practice. Taking on real problems with actual datasets is essential to truly understand concepts, troubleshoot challenges, and potentially uncover new ideas or inventions, solidifying learning through direct experience.

Building a Career in NLP: Advice from Oren Etzioni

Practical takeaways from this episode

Do This

Master the fundamentals: statistics, computer science, and machine learning.
Utilize accessible online courses for cost-efficient learning.
Gain practical experience by taking on real problems with datasets.
Focus on efficiency and cost in AI development (Green AI).
When choosing between academia and industry, optimize for what matters most: freedom for fundamental research or compensation/adrenaline for startups.
Regulate AI applications, not basic research.
Advocate for auditing mechanisms over potentially misleading explanations for AI models.

Avoid This

Don't solely rely on 'flavor of the month' trends in AI; build a strong foundation.
Don't underestimate the importance of data sources and labeling for AI startups.
Avoid rushing to build massive models without considering efficiency and accessibility.
Do not try to legislate values into technology; regulate specific applications where bias is problematic.
Don't expect regulators to provide perfect, understandable explanations from complex deep learning models; focus on audit rights instead.

Common Questions

How did Oren Etzioni first become interested in AI?

Oren Etzioni became fascinated with AI in high school after reading Douglas Hofstadter's 'Gödel, Escher, Bach', which raised fundamental questions about intelligence. He then began studying Lisp before college and pursued computer science at Harvard.
