Key Phrase Indexing With Controlled Vocabularies

Google Talks
Education · 45 min video · Aug 22, 2012

Key Moments

TL;DR

Controlled vocabularies guide keyphrase selection; KEA++ blends candidate extraction with vocabulary semantics.

Key Insights

1. Human indexers show limited agreement on individual terms, but semantic relations can boost measured consistency across a collection.
2. KEA++ combines candidate phrase extraction with a controlled vocabulary and machine learning to select keyphrases.
3. Incorporating semantic relations (exact matches, general relatedness, hierarchical relations) improves evaluation metrics over pure keyword matching.
4. Lexical knowledge (lexical chains and lexical nets) is proposed to improve candidate generation and document topic coverage.
5. Current automatic methods lag human-human consistency but show potential; dataset size is a major limiting factor.
6. Evaluation blends traditional metrics with semantic-aware measures to gauge system quality against human indexing.

INTRODUCTION AND MOTIVATION

Keyphrase indexing with controlled vocabularies is the focus of the talk, which builds on Craig Nevill-Manning's earlier keyphrase extraction work. The PhD project, funded by Google, aims to assign domain terms from a controlled vocabulary to documents so that the assigned terms reflect each document's main topics. A key experiment involves the Food and Agriculture Organization (FAO) and Agrovoc, a vocabulary with 17,000 descriptors and 11,000 linked non-descriptors that promote consistency (e.g., linking obesity and overweight). The vocabulary is hierarchical, with broader/narrower relations as well as non-specific related terms. This setup shows how a controlled vocabulary structures domain knowledge for indexing.

EXPERIMENTAL SETUP AND FINDINGS

The FAO experiment uses 10 documents and six professional indexers who assign terms from Agrovoc. Each indexer selects between five and eleven terms per document, for a total of 33 distinct terms per document; overweight is the only term agreed upon by all six indexers. A handful of terms are agreed upon by at least three indexers, while most are chosen by only one. A visualization of how the chosen terms relate semantically within Agrovoc reveals that terms connected to more concepts tend to be more significant for the document. This underscores the difficulty of automatic indexing that relies solely on exact term matches.
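Counting how many indexers agree on each term is a simple tally over the assignment sets. The term sets below are invented for illustration; the real FAO assignments are not reproduced here.

```python
from collections import Counter

# Hypothetical term assignments by six indexers for one document.
assignments = [
    {"overweight", "nutrition", "obesity"},
    {"overweight", "nutrition", "diet"},
    {"overweight", "diet", "public health"},
    {"overweight", "obesity"},
    {"overweight", "nutrition"},
    {"overweight", "food policy"},
]

# Count how many indexers chose each term.
counts = Counter(term for terms in assignments for term in terms)

agreed_by_all = [t for t, c in counts.items() if c == len(assignments)]
agreed_by_3plus = sorted(t for t, c in counts.items() if c >= 3)

print(agreed_by_all)    # ['overweight']
print(agreed_by_3plus)  # ['nutrition', 'overweight']
```

Even in this toy example the pattern from the talk appears: one term with full agreement, a few with partial agreement, and a long tail of singletons.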

HUMAN CONSISTENCY AND EVALUATION METRICS

Inter-indexer consistency is traditionally measured by comparing overlaps in term assignments across indexers, and it often reveals low agreement on specific terms even when indexers align on concepts. The talk discusses standard measures from library science as well as adaptations that incorporate semantic relatedness. A vector-based approach models each indexer's term set as a binary vector over the vocabulary and computes cosine similarity between indexers. Weights capture exact matches, general relatedness, and hierarchical relations. Baseline consistency is around 0.38 with the standard measures, 0.49 with the pure vector approach, and about 0.51 once semantic relations are incorporated, illustrating the benefit of semantics but also the gap that remains below full agreement.
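The vector-based measure can be sketched as plain cosine similarity over binary term vectors; the five-term vocabulary and the two indexer vectors below are invented for illustration. Semantic weighting would replace the implicit identity matrix in the dot product with a relatedness matrix, crediting near-miss term choices rather than only exact overlaps.

```python
import math

def cosine(u, v):
    """Cosine similarity between two binary term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Five-term toy vocabulary; 1 = indexer assigned the term.
#            obesity nutrition diet fisheries soil
indexer_a = [1,      1,        1,   0,        0]
indexer_b = [1,      1,        0,   1,        0]

print(round(cosine(indexer_a, indexer_b), 3))  # 0.667
```

Two shared terms out of three each gives 2 / (√3 · √3) = 2/3, so partial overlap is rewarded smoothly instead of all-or-nothing.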

KEA++: A HYBRID KEYPHRASE APPROACH

KEA++ blends extraction with controlled-vocabulary mapping and machine learning. It first extracts candidate noun phrases and maps them to valid Agrovoc terms using pseudo-phrase normalization: stopwords are removed, the remaining words are stemmed, and the words are sorted alphabetically so that surface variations collapse to a single phrase. If a candidate matches a non-descriptor, the corresponding descriptor is used instead. A Naive Bayes model trained on manually indexed data then scores the candidates. Features include term frequency, position of first occurrence, phrase length, and semantic degree, and the highest-probability phrases are selected as the final keyphrases.
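Pseudo-phrase normalization (lowercase, drop stopwords, stem, sort) can be sketched as follows. The stopword list and the crude suffix stripper standing in for a real stemming algorithm are simplifications for illustration.

```python
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the"}

def stem(word: str) -> str:
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def pseudo_phrase(phrase: str) -> str:
    """Lowercase, drop stopwords, stem, and sort so variants collapse together."""
    words = [stem(w) for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(words))

# Word-order and inflection variants map to the same pseudo-phrase:
print(pseudo_phrase("algorithms for learning"))  # algorithm learn
print(pseudo_phrase("learning algorithms"))      # algorithm learn
```

Because both variants normalize to the same key, a single vocabulary lookup catches them all, which is exactly what makes the mapping step robust to surface variation.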

SEMANTIC RELATIONS AND LEXICAL KNOWLEDGE

To improve the model, semantic relations are integrated via two matrices: a symmetric general-relatedness matrix and an asymmetric hierarchical-generality matrix. The associated weights are tuned so that their sum equals one, and the combined measure blends exact matches, general relatedness, and hierarchical relations into a single consistency score across all indexers and documents. The gains from semantics are modest: around 0.22 for general relatedness and 0.15 for hierarchical relations. The talk also introduces lexical-knowledge approaches, lexical chains and lexical nets, as potential enhancements for candidate generation and topic coverage, illustrated with examples such as natural disasters.
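The weighted combination can be sketched as a convex blend of the three match scores. The weight values below are placeholders, not the tuned weights from the talk.

```python
def combined_consistency(exact: float, related: float, hierarchical: float,
                         weights=(0.6, 0.25, 0.15)) -> float:
    """Blend exact-match, general-relatedness, and hierarchical scores.

    The weights must sum to one so the result remains a valid consistency score.
    """
    a, b, c = weights
    assert abs(a + b + c - 1.0) < 1e-9, "weights must sum to 1"
    return a * exact + b * related + c * hierarchical

# Perfect agreement on all three components yields a score of 1 (up to float rounding).
print(combined_consistency(1.0, 1.0, 1.0))
```

Constraining the weights to sum to one keeps the combined score on the same 0-to-1 scale as the individual measures, so results with and without semantics stay directly comparable.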

FUTURE CHALLENGES, DATA COLLECTION, AND CONCLUSION

A major challenge is the small dataset: with only 10 documents, robust evaluation is difficult and generalizability is in question. The speaker discusses strategies for obtaining more data, such as enlisting students to index course articles or bootstrapping training data from publicly indexed web content (e.g., Delicious bookmarks). KEA++ makes clear progress but remains below human levels of inter-indexer consistency; still, the approach offers a clear path to improvement through lexical chains/nets and domain adaptation. The ultimate goal is an automatic system that matches or exceeds human consistency while remaining scalable.

Keyphrase Indexing with Controlled Vocabularies — Quick Dos and Don'ts

Practical takeaways from this episode

Do This

Map candidates to the controlled vocabulary early (include non-descriptors via descriptor mapping).
Use a mix of syntactic features (e.g., TF, first occurrence) with semantic cues (relations in the vocabulary) to score candidates.
Evaluate using both exact-match and semantically-aware metrics (e.g., consider related terms as hits).
Leverage training data with human-annotated phrases; use Naive Bayes or similar probabilistic models for scoring.
Consider lexical chains or lexical nets to improve topic coverage and identify cohesive structures.

Avoid This

Rely solely on surface-form phrases; ignore semantic relationships in the vocabulary.
Assume all high-frequency phrases are the most important without considering their connections to other terms.
Ignore the need for a consistent evaluation baseline (hard to compare to humans if you don’t measure inter-indexer consistency).

Common Questions

What is Agrovoc?

Agrovoc is a domain-specific controlled vocabulary with thousands of descriptors and linked non-descriptors that help indexers assign consistent terms to agricultural documents. The study uses its 17,000 descriptors to index 10 documents by six FAO professionals, illustrating how free-text phrases map onto a structured domain vocabulary.

