How does this similarity measure differ from traditional graph similarity methods?

Unlike traditional methods that rely on path counts, this approach captures the 'structured position' of a node within its graph, allowing for high similarity between disconnected nodes that play analogous roles.

What is the computational cost of generating the similarity matrix?

The computation is generally inexpensive, with a cost linear in the number of edges for many applications. However, if comparing a graph to itself, it can become quadratic in the number of nodes.

How is the similarity matrix method related to John Kleinberg's HITS algorithm?

The method is an extension of the HITS algorithm, which assigns hub and authority scores to web pages. The new method extends this by associating more than two scores to nodes, allowing for more nuanced similarity comparisons.

What are the key applications of this similarity matrix technique?

Applications include analyzing telecommunication networks, identifying communities, extracting synonyms from dictionaries, and understanding web graph structures.

Can this method be applied to weighted graphs?

Yes, the method can be extended to weighted graphs, where the weights on the arcs (e.g., call frequency in a phone network) are incorporated into the similarity calculation.

How is the similarity matrix normalized, and what does a perfect similarity score look like?

There isn't a 'perfect similarity' score like 1 universally. The matrix is often normalized such that the sum of squares equals one, making it relative. The focus is on the relative values of entries within the matrix.

Can this method find the best match for an entire graph, not just individual nodes?

Finding an exact match for an entire graph is related to the computationally hard problem of graph isomorphism. While the method can approximate matches and potentially rediscover structures, it doesn't solve exact graph isomorphism.

Key Moments

Extracting Information from Large Graphs by Computing Similarities Between Node

Google Talks

Education5 min read43 min video

Aug 22, 2012|1,616 views|12

googlevideo

Save to Pod

Want to know something specific about what's covered?

We've already dissected every moment. Ask and we will deliver (with timestamps).

Key Moments

TL;DR

A new algorithm can compute node similarity in large graphs by analyzing their structural positions, costing only linear time per edge but potentially taking quadratic time when comparing a graph to itself.

Key Insights

The proposed similarity measure focuses on a node's structural position within a graph, differing from traditional methods that rely on path counts or proximity.

The computation cost of the similarity matrix is linear in the number of edges for comparing different graphs and can extend to weighted graphs.

The method is an extension of John Kleinberg's HITS algorithm, which assigns 'hub' and 'authority' scores to web pages.

The algorithm can be applied to identify communities in telecommunication networks and was able to distinguish between French and Flemish speaking regions in Belgium.

In synonym extraction from dictionaries, the central score method shows strong performance compared to other automatic methods and human-made dictionaries, though it struggles with multi-word terms.

The approach converges to a dominant eigenmatrix and is an efficient method for large-scale graph analysis, even if it doesn't solve the graph isomorphism problem.

A novel approach to graph node similarity

This talk introduces a new method for quantifying similarity between nodes in graphs, focusing on their structural positions rather than traditional metrics like path counts. Unlike existing methods that define similarity based on closeness or the number of paths between nodes, this technique captures how nodes are embedded structurally within their respective networks. This means two nodes can be considered highly similar even if they have no direct connections, provided they play analogous roles within their network structures. This approach is particularly useful for identifying nodes that share similar functions or community roles, even across different graphs. The resulting similarity matrix is inexpensive to compute, with a cost linear to the number of edges, making it suitable for very large graphs.

Extending Kleinberg's HITS algorithm

The proposed method is an extension of John Kleinberg's HITS (Hyperlink-Induced Topic Search) algorithm. HITS assigns two scores to each web page: an 'authority' score (for pages with valuable content) and a 'hub' score (for pages that link to good authorities). A page is a good hub if it points to good authorities, and a good authority if it is pointed to by good hubs. This recursive definition is solved iteratively by assigning initial scores and repeatedly updating them. The process converges to dominant eigenvectors of a matrix derived from the graph's adjacency matrix, fundamentally linked to singular value decomposition. The innovation lies in generalizing this two-score system to a multi-dimensional score system based on a smaller 'pattern' graph, thus expanding the concept of similarity beyond web page authority and hub roles.

The generalized similarity computation

The core idea of the generalized method is to use a small, representative graph (let's call it the 'pattern graph') to define a set of desired structural roles. For a three-node pattern graph with 'beginning,' 'center,' and 'end' nodes, each node in a large target graph is assigned three scores corresponding to its similarity to these roles. The update rule for these scores is a linear transformation, analogous to HITS, where a node's score for a particular role is derived from the scores of nodes it connects to (or is connected from) in the target graph, based on the pattern graph's structure. For instance, a node's 'beginning' score might increase if it points to nodes with high 'center' scores in the target graph. This iterative process converges to a 'dominant eigenmatrix' which, when properly normalized, forms the similarity matrix between the nodes of the target graph and the roles defined by the pattern graph.

Application in telecommunication network analysis

One significant application demonstrated is the analysis of cellular phone networks. Using call data from 2 million customers and 600 million calls over six months, the researchers aimed to understand network structure, identify potential customer churn, community detection, and facilitate viral marketing. They represented the network as a large graph where nodes are customers and edges represent calls. By applying the similarity measure, they identified key 'leader' nodes and communities within the network. Strikingly, when data on spoken languages (French and Flemish) was overlaid, the community structures identified by the algorithm revealed distinct geographic separations, even highlighting the bilingual nature of Brussels, validating the algorithm's ability to capture meaningful structural patterns.

Automatic synonym extraction from dictionaries

The method is also applied to automatically extract synonyms from monolingual dictionaries. Here, words are nodes, and an edge exists from word U to word V if V appears in the definition of U. A subgraph is constructed around a target word by considering words in its definition and words that use the target word in their definitions. The 'central score' (similarity to the pattern graph's central node) of nodes within this subgraph is then computed. Nodes with high central scores are ranked as potential synonyms. Experiments on dictionaries like Webster's showed this 'central score' method outperformed other automatic synonym extraction techniques, although it has limitations in handling compound words. The output shows a ranked list of potential synonyms, providing a computationally efficient way to augment or verify dictionary entries.

Computational efficiency and limitations

The algorithm's computational cost is a key advantage, being linear in the number of edges for comparing distinct graphs. However, when comparing a graph to itself (e.g., to find similarities within a single network), the complexity becomes quadratic in the number of nodes, as the output similarity matrix is N x N. While the method is efficient and converges quickly in practice, often requiring only a few iterations, it does not solve the NP-hard graph isomorphism problem. This means it can approximate structural similarities but cannot definitively determine if two graphs are identical under node relabeling. The normalization of the similarity matrix is also a point of consideration; the current sum-of-squares normalization to one is for relative comparison, and other normalizations are possible.

Mentioned in This Episode

●Software & Apps

●Companies

●Organizations

●Books

●Concepts

●People Referenced

Common Questions

The similarity matrix computes how structurally similar nodes are within a graph or between different graphs, focusing on their position and role rather than just proximity or path counts.

Topics

Technology & Innovation Science & Mathematics Machine Learning Network Analysis Computational Mathematics Graph Theory Data Mining Node Similarity Algorithm Development

Mentioned in this video

Companies

Microsoft

Company whose human-made dictionary for the word 'sugar' provided a comparison point for the automatic synonym extraction method.

Belgian mobile phone company

A company in Belgium that provided data for analyzing call patterns and identifying communities within their network.

People

John Kleinberg

Researcher from Cornell who introduced the HITS method, which the new similarity method extends.

Vincent Blondel

Professor of applied mathematics at Universite Catholique de Louvain, who developed the similarity matrix method discussed.

Books

Le Petit Robert

A French dictionary that is considering implementing the discussed synonym extraction method in its CD-ROM version.

SIAM review

Journal where a paper co-authored by Maureen and Vincent Blondel on the similarity matrix tool was published.

Organizations

MIT

Massachusetts Institute of Technology, where Vincent Blondel is doing a sabbatical year.

Universite Catholique de Louvain

University where Vincent Blondel is a professor of applied mathematics and head of the Department of Mathematical Engineering.

Concepts

PageRank

An algorithm used by Google Search to rank websites in their search engine results, mentioned as an alternative to HITS.

Software & Apps

Google Scholar

A citation index for searching scholarly literature, used as a way to demonstrate the attention received by Kleinberg's method.

Ask anything from this episode.

Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.

Get Started Free