Key Phrase Indexing With Controlled Vocabularies
Key Moments
Controlled vocabularies guide keyphrase selection; KEA++ blends extraction with semantics.
Key Insights
Human indexers show limited agreement on terms, but semantic relations can boost consistency across a collection.
KEA++ combines candidate phrase extraction with a controlled vocabulary and machine learning to select keyphrases.
Incorporating semantic relations (exact matches, general relatedness, hierarchical relations) improves evaluation metrics over pure keyword matching.
Lexical knowledge (lexical chains and lexical nets) is proposed to improve candidate generation and document topic coverage.
Current automatic methods lag human-human consistency but show potential; data size is a major limiting factor.
Evaluation blends traditional metrics with semantic-aware measures to gauge system quality against human indexing.
INTRODUCTION AND MOTIVATION
Keyphrase indexing with controlled vocabularies is the focus of the talk, building on Craig Nevill-Manning's earlier keyphrase extraction work (the KEA algorithm). The PhD project, funded by Google, aims to assign domain terms from a controlled vocabulary to documents so that they reflect the documents' main topics. A key experiment involves the Food and Agriculture Organization (FAO) and Agrovoc, a vocabulary with 17,000 descriptors and 11,000 linked non-descriptors that promote consistency (e.g., obesity and overweight). The vocabulary is hierarchical, with broader/narrower relations as well as non-specific related terms. This setup highlights how controlled vocabularies structure domain knowledge for indexing.
EXPERIMENTAL SETUP AND FINDINGS
The FAO experiment uses 10 documents and six professional indexers who assign terms from Agrovoc. Each indexer selects between five and eleven terms per document, yielding about 33 distinct terms per document across all six indexers; overweight is the only term agreed upon by all six. A handful of terms are agreed upon by at least three indexers, while most are chosen by only one. A visualization of how the assigned terms relate semantically within Agrovoc reveals that terms connected to more concepts tend to be more significant for the document. This underscores the difficulty of evaluating automatic indexing by exact term matches alone.
HUMAN CONSISTENCY AND EVALUATION METRICS
Inter-indexer consistency is traditionally measured by comparing overlaps in term assignments across indexers, often revealing low agreement on exact terms but some alignment on concepts. The talk discusses standard measures from library science as well as adaptations that incorporate semantic relatedness. A vector-based approach models each indexer's term set as a binary vector over the vocabulary and computes cosine similarity; weights capture exact matches, general relatedness, and hierarchical relations. Baseline consistency is around 0.38 with the standard measures, 0.49 with the pure vector approach, and about 0.51 once semantic relations are incorporated, illustrating the measurable benefit of semantics as well as how far short current automatic methods fall of this human benchmark.
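The vector-based measure above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the five-term vocabulary and the two indexers' term sets are invented for the example.

```python
from math import sqrt

def cosine_consistency(terms_a, terms_b, vocabulary):
    """Cosine similarity between two indexers' term sets,
    modelled as binary vectors over the controlled vocabulary."""
    va = [1 if t in terms_a else 0 for t in vocabulary]
    vb = [1 if t in terms_b else 0 for t in vocabulary]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sqrt(sum(va)) * sqrt(sum(vb))
    return dot / norm if norm else 0.0

# Toy vocabulary and two hypothetical indexers' assignments:
vocab = ["obesity", "overweight", "nutrition", "body weight", "diet"]
indexer_a = {"obesity", "nutrition", "diet"}
indexer_b = {"obesity", "overweight", "diet"}
print(round(cosine_consistency(indexer_a, indexer_b, vocab), 2))  # 2 shared of 3 each: 0.67
```

Extending this toward the talk's semantic version means replacing the 0/1 dot product with weighted credit for related and hierarchically linked terms, as discussed later.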
KEA++: A HYBRID KEYPHRASE APPROACH
KEA++ blends extraction with controlled-vocabulary mapping and machine learning. It begins by extracting candidate noun phrases and mapping them to valid Agrovoc terms, using a pseudo-phrase normalization that removes stopwords, stems each word, and orders the stems alphabetically so that variant phrasings map to a single phrase. If a matched term is a non-descriptor, its corresponding descriptor is used instead. A Naive Bayes model is trained on manually indexed data to score candidates and select the most significant phrases. Features include term frequency, first occurrence, phrase length, and the term's degree of semantic connectedness in the vocabulary, with the highest-probability phrases chosen as the final keyphrases.
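The pseudo-phrase normalization step can be sketched as follows. This is a toy version: the stopword list is abbreviated, and `toy_stem` is a crude suffix-stripper standing in for a real stemmer (e.g., Porter), so its outputs are illustrative only.

```python
# Pseudo-phrase normalization as described in the talk: drop stopwords,
# stem each remaining word, and sort the stems alphabetically so that
# surface variants of a phrase collapse onto one canonical key.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the"}

def toy_stem(word):
    """Crude stand-in for a real stemmer (assumption, not KEA++'s)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def pseudo_phrase(phrase):
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(toy_stem(w) for w in words))

# Variant phrasings of the same concept map to the same key:
print(pseudo_phrase("cultivation of rice"))  # "cultivation rice"
print(pseudo_phrase("rice cultivation") == pseudo_phrase("cultivation of rice"))  # True
```

Candidates whose pseudo-phrase matches a vocabulary term's pseudo-phrase are then treated as occurrences of that term, which is what lets the system tolerate word order and inflection differences.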
SEMANTIC RELATIONS AND LEXICAL KNOWLEDGE
To refine the measure, semantic relations are integrated via two matrices: a symmetric general-relatedness matrix and an asymmetric hierarchical-generality matrix. Weights are tuned so that they sum to one, and the measure combines exact matches, general relatedness, and hierarchical relations into a single consistency score across all indexers and documents. The semantic contributions are modest: around 0.22 for general relatedness and 0.15 for hierarchical relations. The talk also introduces lexical-knowledge approaches (lexical chains and lexical nets) as potential enhancements for candidate generation and topic coverage, illustrated with examples like natural disasters.
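One way the weighted combination could look in code is sketched below. The weights and the tiny relatedness/generality matrices are invented for illustration; the talk's tuned weights and the Agrovoc-derived matrices are not reproduced here.

```python
def weighted_overlap(terms_a, terms_b, related, broader,
                     w_exact=0.6, w_rel=0.25, w_hier=0.15):
    """Score each of A's terms against B's: full credit for an exact
    match, otherwise partial credit from the symmetric relatedness
    matrix and the asymmetric hierarchical matrix. The three weights
    sum to one, so the result stays in [0, 1]."""
    assert abs(w_exact + w_rel + w_hier - 1.0) < 1e-9
    total = 0.0
    for t in terms_a:
        if t in terms_b:
            total += 1.0  # exact match: full credit
        else:
            rel = max((related.get(frozenset((t, u)), 0.0) for u in terms_b), default=0.0)
            hier = max((broader.get((t, u), 0.0) for u in terms_b), default=0.0)
            total += w_rel * rel + w_hier * hier
    return total / len(terms_a) if terms_a else 0.0

# Illustrative matrices: frozenset keys make relatedness symmetric,
# ordered tuples keep the hierarchical relation asymmetric.
related = {frozenset(("obesity", "overweight")): 1.0}
broader = {("obesity", "body weight"): 1.0}
print(weighted_overlap({"obesity", "nutrition"}, {"overweight", "nutrition"},
                       related, broader))  # 0.625
```

Averaging this score over all indexer pairs and documents would yield a single collection-wide consistency figure of the kind the talk reports.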
FUTURE CHALLENGES, DATA COLLECTION, AND CONCLUSION
A major challenge is the small dataset: only 10 documents, making robust evaluation difficult and raising questions about generalizability. The speaker discusses strategies for obtaining more data, such as enlisting students to index class articles or leveraging publicly indexed web data (e.g., del.icio.us bookmarks) to bootstrap training data. Although KEA++ makes clear progress, it still falls short of human inter-indexer consistency, but the approach offers a clear path to improvement through lexical chains/nets and domain adaptation. The ultimate goal is an automatic system that matches or exceeds human consistency while remaining scalable.
Common Questions
Agrovoc is the FAO's domain-specific controlled vocabulary, with thousands of descriptors and linked non-descriptors that help indexers assign consistent terms to agricultural documents. The study uses its 17,000 descriptors to map terms for 10 documents indexed by six FAO professionals, illustrating how phrases map onto a structured domain thesaurus.
Mentioned in this video
●WordNet: lexical knowledge base referenced as a resource for lexical nets/chains (used in NLP for semantic relations).
●Roget's Thesaurus: lexical resource mentioned alongside WordNet as a semantic resource.
●Craig Nevill-Manning: researcher who worked on the predecessor keyphrase extraction algorithm (KEA); described as the supervisor's prior student.
●del.icio.us: web bookmarking service mentioned as a potential source of annotated training data gathered from web pages.