How does Bruce Schatz define semantics in the context of knowledge federation?

Schatz distinguishes between syntax (bits in a file), structure (parts of a document), semantics (meaning or context), and pragmatics (task-dependent usage). He focuses on scalable semantics, which involves understanding context and relationships between entities across broad datasets.

What are the challenges of federating all the world's knowledge?

Challenges include syntactic complexity (merging different query syntaxes and results), structural ambiguity (defining authorship or roles uniformly), and semantic depth (truly understanding meaning vs. just context). The sheer scale and informal nature of much online data exacerbate these issues.

What is scalable semantics?

Scalable semantics is an engineering approach that aims to achieve deep meaning across a broad range of data. It moves beyond narrow, deep AI systems to handle diverse topics by focusing on entities and their contexts, allowing for broader application even if it means less absolute precision.

How does the BSpace system work?

BSpace utilizes scalable semantics to create and manipulate 'spaces' of knowledge. It allows users to navigate, merge, and analyze collections of information, moving beyond word-based search to concept-based interaction, dynamically processing data on the fly.

What is involved in semantic federation?

Semantic federation involves going inside phrases to understand their meaning and context. It focuses on matching similar concepts across different data sources uniformly, enabling deeper analysis than traditional syntax or structure federation.

How does BSpace handle context and meaning compared to traditional search engines?

BSpace moves beyond just extracting entities; it builds context graphs showing how entities co-occur. This allows for suggestion facilities and deeper exploration of related concepts, rather than just returning a list of links like traditional search engines.

What are the main operations within the BSpace system?

Key operations include 'extract' (identifying distinguishing terms), 'mapping' (breaking down collections into clusters), 'space algebra' (merging or intersecting spaces), and 'summarization' (providing context-aware summaries of entities within a space).

What is the future vision for organizing all the world's knowledge?

The future likely involves not one massive database, but many small, dynamic, and interconnected 'spaces' or collections, potentially distributed across many servers, mirroring virtual worlds in interactivity.

What is the proposed grand project for leveraging semantic knowledge?

A grand project could involve capturing all digital communication and data within a university, using it for research and education rather than advertising, potentially building a semantically-based social network.

Can semantic relationships help debug language itself?

Yes, by analyzing patterns and regularity within language, semantic systems can potentially identify incoherent expressions or flag areas where terminology is insufficient, though human interpretation often remains superior to automated analysis.

How do systems like PubMed handle entity synonyms automatically?

PubMed uses pre-compiled translation tables for common synonyms and scientific terms. Automated systems can also employ linguistic processing and heuristics to find equivalent terms, though human-generated lists are generally more accurate.
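The translation-table idea in the answer above can be sketched in a few lines. This is only an illustration: the table entries and the function name `expand_query` are hypothetical, not PubMed's actual data or API.

```python
# Hypothetical hand-built synonym table (illustrative entries only).
SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "fruit fly": ["drosophila melanogaster"],
}


def expand_query(query: str, table: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the normalized query followed by any equivalents in the table."""
    key = query.strip().lower()
    expanded = [key]
    for term, equivalents in table.items():
        group = [term, *equivalents]
        if key in group:
            # Add every other member of the synonym group exactly once.
            expanded.extend(t for t in group if t != key and t not in expanded)
    return expanded
```

A search front end could run each expanded term against the index and merge the hits. A heuristic or linguistic matcher would replace the fixed table in a fully automated variant.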

Key Moments

Towards Telesophy: Federating All the World's Knowledge

Google Talks

Education | 6 min read | 66 min video

Aug 22, 2012 | 145 views

googlevideo


TL;DR

The internet has connected us to vast knowledge, but current search engines only organize it; they do not truly understand or synthesize it. We need to move beyond 'tele' (far) to 'sophy' (wisdom) to truly leverage information for problem-solving.

Key Insights

The evolution of the net has progressed from data transmission to information retrieval, and is now moving towards knowledge navigation in the 'Interspace'.

Google excels at organizing and accessing knowledge, but does little in the subsequent stages of analysis and synthesis for problem-solving.

Semantics federation involves understanding the meaning of phrases within documents and matching similar concepts across different sources.

Scalable semantics aims to achieve deep meaning across a broad range of topics, moving from 'meaning type 869' to recognizing broader entities like people, places, and things.

The BSpace system demonstrates 'space algebra' for manipulating and merging collections of knowledge, enabling on-the-fly analysis and summarization.

Future systems like 'hive mind' aim for true knowledge synthesis and problem-solving, building on current federated knowledge approaches.

The evolution from 'cyberspace' to 'interspace' knowledge

Bruce Schatz's 2007 talk at Google outlines the progression of the internet's capabilities, from early data transmission (the Internet) to information retrieval (the Web), and now towards deeper knowledge navigation within an 'Interspace'. He notes that while systems like Google have largely achieved the goal of federating the world's knowledge for access and organization, they fall short in the crucial next steps of analysis, synthesis, and problem-solving. This gap represents the remaining half of the journey towards a 'hive mind' or collective wisdom: we can access information from afar ('tele'), but have yet to derive true wisdom ('sophy') from it. Schatz emphasizes that Google, despite its success, operates at the level of research from about 10 years earlier, and that to remain competitive it must engage with the more advanced stages of knowledge processing.

The linguistic hierarchy: syntax, structure, and semantics

Schatz introduces a linguistic framework for understanding knowledge federation. Syntax refers to the raw data, like bits in a file or words in a document. Structure involves identifying the parts of a document, such as author names, introduction, or methods sections, enabling more targeted searches. Semantics concerns the meaning of phrases, as opposed to their surrounding context alone. Meaning is often treated as static (e.g., a gene's function), but its interpretation can depend on the task at hand, which is the domain of pragmatics. Schatz highlights that current systems often substitute context for true meaning, a practical shortcut that has proven effective in systems like Google, which leverage the context of web links. The final level, pragmatics, involves using knowledge in task-dependent applications, which is complex but crucial for practical problem-solving, such as in healthcare.

Federation across knowledge levels: from syntax to semantics

Schatz details different types of federation. Syntax federation, pioneered by systems like Telesophy and largely employed by Google, involves sending the same query to multiple sources, which requires managing network access, query syntax, and result merging, especially to eliminate duplicates. Structure federation, demonstrated by the DELIVER project, enables structured queries based on document parts (e.g., finding papers with 'nanostructures' in the figure caption within the last 10 years). This requires uniform markup, which is challenging because definitions of authors or creators vary across different media. Semantics federation, the focus of his talk, aims to extract and match meaning from phrases across distributed data, a far more complex task. Structure federation has not yet significantly penetrated mass systems, because only a limited amount of correctly structured text is available online.
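Syntax federation as described above (one query, many sources, merged results without duplicates) can be sketched as follows. This is a minimal illustration, not the design of any system from the talk: the source functions, the record shape, and the choice of a normalized title as the duplicate key are all assumptions.

```python
from typing import Callable

# A source is any callable that takes a query string and returns records.
Source = Callable[[str], list[dict]]


def federated_search(query: str, sources: dict[str, Source]) -> list[dict]:
    """Send one query to every source and merge results, dropping duplicates."""
    merged: dict[str, dict] = {}
    for name, search in sources.items():
        for record in search(query):
            # Assumed duplicate key: lowercased title with collapsed whitespace.
            key = " ".join(record["title"].lower().split())
            if key in merged:
                merged[key]["sources"].append(name)
            else:
                merged[key] = {"title": record["title"], "sources": [name]}
    return list(merged.values())


# Two stand-in sources with overlapping content (hypothetical data).
def source_a(query: str) -> list[dict]:
    docs = ["Honey bee behavior", "Gene expression in insects"]
    return [{"title": t} for t in docs if query.lower() in t.lower()]


def source_b(query: str) -> list[dict]:
    docs = ["honey bee  behavior", "Bee foraging genes"]
    return [{"title": t} for t in docs if query.lower() in t.lower()]


results = federated_search("bee", {"a": source_a, "b": source_b})
```

A real federation layer would also translate the query into each source's own syntax and handle network failures; both are left out here to keep the merge step visible.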

Scalable semantics: bridging breadth and depth

The concept of 'scalable semantics' is an oxymoron, as semantics implies deep meaning while scalability requires broad coverage. Historically, semantics focused on deep parsing of specific topics. However, research, particularly from DARPA programs, found that broad approaches, like identifying entities (people, places, things) and noun phrases, scaled better and became more practical with increasing machine speed. Semantics has thus shifted from understanding a phrase's exact meaning to recognizing its type and its co-occurrence with other entities. This shift makes semantics an engineering problem: working globally by understanding all possible knowledge but acting locally by analyzing narrow collections precisely. This necessitates moving away from centralized, monolithic index systems towards distributed approaches to handle the complexity and variety of information.

Entities: identifying and tagging information units

Identifying entities is crucial for scalable semantics. This can be done through hand-tagged markup (like XML for the Semantic Web) or more practically through automatic machine tagging using training sets. The process involves extracting phrases, recognizing parts of speech, and then identifying entities like people, places, or specific domain terms (e.g., genes in biology). While manually tagged data is ideal, the informal nature of much online content necessitates automatic methods. Biology and medicine provide good examples, with entities like genes or protein kinases being frequently mentioned. However, entities vary in their ease of tagging, with organism names being straightforward while behaviors or functions are more challenging, requiring larger training sets. A significant challenge is the domain-specificity of entities, requiring separate efforts for biology, medicine, physics, and everyday subjects.
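The tagging step can be illustrated with a deliberately simple stand-in: a lexicon built from a tiny hand-tagged training set, applied by longest-match lookup. This is not the statistical tagging the talk refers to; the training phrases, labels, and function names are hypothetical.

```python
import re

# Tiny hand-tagged training set (hypothetical): phrase -> entity type.
TRAINING = [
    ("protein kinase", "GENE"),
    ("honey bee", "ORGANISM"),
    ("foraging", "BEHAVIOR"),
]


def build_lexicon(examples):
    """Map each lowercased phrase to its entity type."""
    return {phrase.lower(): label for phrase, label in examples}


def tag_entities(text, lexicon):
    """Return (phrase, type) pairs found in text, preferring longer phrases."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    max_len = max(len(p.split()) for p in lexicon)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(max_len, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                found.append((candidate, lexicon[candidate]))
                i += n
                break
        else:
            i += 1  # no phrase starts at this token
    return found


lexicon = build_lexicon(TRAINING)
entities = tag_entities("A protein kinase affects foraging in the honey bee.", lexicon)
```

A lookup like this only finds phrases it has already seen, which is one reason harder entity types such as behaviors need much larger training sets.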

Context graphs and concept navigation for enhanced retrieval

Extracted entities can be used to build 'context graphs' that map the co-occurrence of terms within a collection. This graph can enhance search by suggesting related terms if a direct search fails, acting as a sophisticated suggestion facility. Schatz illustrates how advances in computing power, particularly the rise of clusters of workstations and then supercomputers, enabled the processing of larger collections and the in-memory computation of these vast relationship graphs. This allows for 'on-the-fly' analysis, such as clustering data or finding interrelated graphs, without the extensive pre-computation that traditional centralized systems demanded. This shift from pre-computation to dynamic, real-time analysis on powerful, distributed hardware is key to handling the scale of global knowledge.
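A context graph of this kind can be sketched as a table of co-occurrence counts. The example below is a minimal illustration under the assumption that each document has already been reduced to a list of entities; the sample data and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_context_graph(docs):
    """Count how often each pair of entities co-occurs in the same document."""
    graph = defaultdict(Counter)
    for entities in docs:
        # Count each unordered pair once per document.
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def suggest(graph, term, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    return [t for t, _ in graph[term].most_common(k)] if term in graph else []


docs = [
    ["honey bee", "foraging", "protein kinase"],
    ["honey bee", "foraging"],
    ["drosophila", "protein kinase"],
]
graph = build_context_graph(docs)
```

Calling `suggest(graph, "honey bee", 1)` returns the strongest neighbor of that term, which is the kind of related-term suggestion the paragraph describes. Because the graph is built in memory from whatever collection is passed in, it can be recomputed on the fly for each new collection.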

BSpace: a system for dynamic knowledge manipulation

The BSpace system, developed by Schatz's team, exemplifies a new paradigm for knowledge interaction, moving beyond traditional search. It focuses on creating and manipulating 'spaces' – dynamic collections of knowledge. Key operations include 'extracting' distinguishing terms from a space, 'mapping' to break down and cluster documents within it, and performing 'space algebra' such as intersection and merging. This allows users to navigate and refine knowledge iteratively. For instance, a search for 'behavioral maturation' in insects can be refined by automatically identifying key terms, clustering results, and then intersecting those clusters with other relevant spaces. The system can also dynamically summarize entities within a given space on the fly, providing a deeper understanding than a simple list of search results. This approach emphasizes interactive exploration and manipulation of knowledge rather than passive retrieval.
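The 'extract' and 'space algebra' operations can be sketched with a toy model in which a space is a dictionary from document ID to a list of terms. This model, the scoring rule in `extract`, and the sample data are assumptions made for illustration, not BSpace's actual implementation.

```python
from collections import Counter


def merge(a, b):
    """Space algebra: union of two spaces."""
    return {**a, **b}


def intersect(a, b):
    """Space algebra: documents present in both spaces."""
    return {doc_id: terms for doc_id, terms in a.items() if doc_id in b}


def extract(space, background, k=3):
    """Terms most over-represented in `space` relative to `background`."""
    if not space or not background:
        return []
    inside = Counter(t for terms in space.values() for t in set(terms))
    overall = Counter(t for terms in background.values() for t in set(terms))
    # Score: document frequency in the space minus frequency in the background.
    score = {t: inside[t] / len(space) - overall[t] / len(background) for t in inside}
    return sorted(score, key=lambda t: (-score[t], t))[:k]


# Hypothetical background collection and two spaces drawn from it.
background = {
    1: ["bee", "foraging"],
    2: ["bee", "gene"],
    3: ["fly", "gene"],
    4: ["fly", "maturation"],
}
bees = {i: background[i] for i in (1, 2)}
genes = {i: background[i] for i in (2, 3)}
```

Here `intersect(bees, genes)` keeps only document 2, and `extract(bees, background)` ranks "bee" first because it appears in every document of the space but only half of the background. Clustering ('mapping') and summarization would be further operations over the same structure.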

The future: hive minds and semantically-based social networks

Schatz envisions future systems evolving towards 'hive minds,' capable of true knowledge synthesis and problem-solving. This involves moving away from centralizing all knowledge towards a distributed network of 'spaces' that can be dynamically manipulated. He proposes grand projects, like capturing all the knowledge generated within a university (emails, documents, communications) to build semantically-based social networks that facilitate deep understanding and collaboration, rather than just sharing content. Such a system could have profound implications for education, research, and even social interaction, moving beyond current paradigms to a more integrated and intelligent use of collective knowledge. He concludes by noting that the disappearance of bees is a complex problem for which the scientific community still lacks a definitive answer, reflecting the broader challenge of deep knowledge synthesis.


Common Questions

Telesophy is a concept introduced by Bruce Schatz aiming to federate all the world's knowledge. It's described as having two parts: 'tele' (broadcasting/access) and 'sophy' (wisdom/analysis), with current systems excelling at the former but needing to advance in the latter.

Topics

Mindset & Self-Improvement, AI & Machine Learning, Technology & Innovation, Science & Mathematics, Distributed Systems, Natural Language Processing, Information Retrieval, Knowledge Federation, Scalable Semantics, Concept Navigation, Data Mining

Mentioned in this video

Products

G Phone

A rumored phone product, part of the hypothetical Google project to capture university data, with the caveat that it was not yet announced.

Concepts

Telesophy

A term introduced by Bruce Schatz in the 1980s, representing a project focused on federating all the world's knowledge, with the 'tele' part (broadcasting) being more advanced than the 'sophy' part (wisdom/analysis).

Semantic Web

A web of data that can be more easily processed by machines, discussed as an attempt to create languages for structure and semantics, though not yet widely adopted.

Companies

Yahoo

Mentioned as having a strategy of classifying all web knowledge, similar to classification efforts that could be applied to entities across different subject areas.

Google

The company hosting the talk, serving as a prime example of a large-scale knowledge organization system that the speaker critiques for its focus on access and organization over analysis and synthesis.

Bellcore

Where Bruce Schatz introduced the concept of Telesophy in the 1980s.

IBM

Mentioned in the context of supercomputer limitations in the past, contrasting with the rise of PCs and distributed computing.

Microsoft

Where the DARPA-funded concept space project members went after the project ended; suggested that their work might appear in future Windows versions.

Elsevier

A publisher whose digital library project failed, contrasted with the success of the University of Illinois's project due to better data cleaning and tagging.

Organizations

PubMed

A medical literature database used as an example to illustrate synonym recognition automation and the role of human curation in data quality.

CANIS

Community Architectures for Network Information Systems, a project directed by Bruce Schatz at the University of Illinois.

Medline

A database of biomedical literature, mentioned in comparison to the size of the web in 1998 and used as an example for the BSpace system's data source.

NCSA

National Center for Supercomputing Applications, located at the University of Illinois, where a large-scale computation for information retrieval and entity relation discovery was performed. Also mentioned in relation to Bruce Schatz's early work and computing advancements.

Xerox PARC

A renowned research lab where Bruce Schatz first presented his ideas on federating knowledge over 20 years prior to the talk.

DARPA

Funded a project that developed concept space systems, but later pulled the plug in 2000, leading the project members to Microsoft.

Software & Apps

Mosaic

One of the first web browsers, derived from Tim Berners-Lee's work, which emerged about 10 years after early federated search concepts were explored.

Caenorhabditis elegans

A tiny nematode worm of roughly a thousand cells, mentioned as the first multicellular organism whose genome was completely sequenced. Bruce Schatz introduced the concept of creating a database for research on this organism.

DELIVER

A project Bruce Schatz worked on at UIUC, focused on digital libraries.

BSpace

A system developed by Bruce Schatz that organizes concept spaces to assist in deeper understanding of knowledge, particularly in biology and medicine.

Drosophila

The fruit fly, mentioned as an organism studied within the BSpace system, specifically in relation to behavioral maturation and gene summaries.

Google Books

A program the University of Illinois library has joined, providing context for the grand project idea of capturing and relating university knowledge.

Cyc

An ambitious, largely failed attempt to encode common sense knowledge for automatic reasoning, discussed in comparison to automated template derivation methods.

Gmail

A free email service offered as part of the hypothetical grand project to capture and utilize university knowledge.

People

Tim Berners-Lee

The inventor of the World Wide Web, whose work inspired the development of browsers like Mosaic.

Greg Chesson

Mentioned by the host in jest for wearing a coat.

Bruce Schatz

The speaker, a director at the University of Illinois at Urbana-Champaign, presenting on telesophy, scalable semantics, and concept navigation.

Marc Andreessen

Mentioned as being involved with the first browsers (Mosaic) during Bruce Schatz's time.

Vint Cerf

Mentioned by Bruce Schatz in the context of discussing the semantic web, indicating prior conversations or familiarity.

Legislation & Policy

DARPA Trek program

A program where base technologies for scalable semantics were developed, focused on identifying potential terrorists by reading newspaper articles.

Media

Neopets

A virtual world game mentioned as an example of online environments children engage with, contrasted with the more advanced concept spaces discussed.

Second Life

A virtual world platform mentioned as an example of online environments, contrasted by the speaker as potentially something older audiences might disengage from compared to newer, more complex systems.
