Key Moments

Extracting Information from Large Graphs by Computing Similarities Between Node

Google TalksGoogle Talks
Education5 min read43 min video
Aug 22, 2012|1,616 views|12
Save to Pod

Want to know something specific about what's covered?

We've already dissected every moment. Ask and we will deliver (with timestamps).

TL;DR

A new algorithm can compute node similarity in large graphs by analyzing their structural positions, costing only linear time per edge but potentially taking quadratic time when comparing a graph to itself.

Key Insights

1

The proposed similarity measure focuses on a node's structural position within a graph, differing from traditional methods that rely on path counts or proximity.

2

The computation cost of the similarity matrix is linear in the number of edges for comparing different graphs and can extend to weighted graphs.

3

The method is an extension of John Kleinberg's HITS algorithm, which assigns 'hub' and 'authority' scores to web pages.

4

The algorithm can be applied to identify communities in telecommunication networks and was able to distinguish between French and Flemish speaking regions in Belgium.

5

In synonym extraction from dictionaries, the central score method shows strong performance compared to other automatic methods and human-made dictionaries, though it struggles with multi-word terms.

6

The approach converges to a dominant eigenmatrix and is an efficient method for large-scale graph analysis, even if it doesn't solve the graph isomorphism problem.

A novel approach to graph node similarity

This talk introduces a new method for quantifying similarity between nodes in graphs, focusing on their structural positions rather than traditional metrics like path counts. Unlike existing methods that define similarity based on closeness or the number of paths between nodes, this technique captures how nodes are embedded structurally within their respective networks. This means two nodes can be considered highly similar even if they have no direct connections, provided they play analogous roles within their network structures. This approach is particularly useful for identifying nodes that share similar functions or community roles, even across different graphs. The resulting similarity matrix is inexpensive to compute, with a cost linear to the number of edges, making it suitable for very large graphs.

Extending Kleinberg's HITS algorithm

The proposed method is an extension of John Kleinberg's HITS (Hyperlink-Induced Topic Search) algorithm. HITS assigns two scores to each web page: an 'authority' score (for pages with valuable content) and a 'hub' score (for pages that link to good authorities). A page is a good hub if it points to good authorities, and a good authority if it is pointed to by good hubs. This recursive definition is solved iteratively by assigning initial scores and repeatedly updating them. The process converges to dominant eigenvectors of a matrix derived from the graph's adjacency matrix, fundamentally linked to singular value decomposition. The innovation lies in generalizing this two-score system to a multi-dimensional score system based on a smaller 'pattern' graph, thus expanding the concept of similarity beyond web page authority and hub roles.

The generalized similarity computation

The core idea of the generalized method is to use a small, representative graph (let's call it the 'pattern graph') to define a set of desired structural roles. For a three-node pattern graph with 'beginning,' 'center,' and 'end' nodes, each node in a large target graph is assigned three scores corresponding to its similarity to these roles. The update rule for these scores is a linear transformation, analogous to HITS, where a node's score for a particular role is derived from the scores of nodes it connects to (or is connected from) in the target graph, based on the pattern graph's structure. For instance, a node's 'beginning' score might increase if it points to nodes with high 'center' scores in the target graph. This iterative process converges to a 'dominant eigenmatrix' which, when properly normalized, forms the similarity matrix between the nodes of the target graph and the roles defined by the pattern graph.

Application in telecommunication network analysis

One significant application demonstrated is the analysis of cellular phone networks. Using call data from 2 million customers and 600 million calls over six months, the researchers aimed to understand network structure, identify potential customer churn, community detection, and facilitate viral marketing. They represented the network as a large graph where nodes are customers and edges represent calls. By applying the similarity measure, they identified key 'leader' nodes and communities within the network. Strikingly, when data on spoken languages (French and Flemish) was overlaid, the community structures identified by the algorithm revealed distinct geographic separations, even highlighting the bilingual nature of Brussels, validating the algorithm's ability to capture meaningful structural patterns.

Automatic synonym extraction from dictionaries

The method is also applied to automatically extract synonyms from monolingual dictionaries. Here, words are nodes, and an edge exists from word U to word V if V appears in the definition of U. A subgraph is constructed around a target word by considering words in its definition and words that use the target word in their definitions. The 'central score' (similarity to the pattern graph's central node) of nodes within this subgraph is then computed. Nodes with high central scores are ranked as potential synonyms. Experiments on dictionaries like Webster's showed this 'central score' method outperformed other automatic synonym extraction techniques, although it has limitations in handling compound words. The output shows a ranked list of potential synonyms, providing a computationally efficient way to augment or verify dictionary entries.

Computational efficiency and limitations

The algorithm's computational cost is a key advantage, being linear in the number of edges for comparing distinct graphs. However, when comparing a graph to itself (e.g., to find similarities within a single network), the complexity becomes quadratic in the number of nodes, as the output similarity matrix is N x N. While the method is efficient and converges quickly in practice, often requiring only a few iterations, it does not solve the NP-hard graph isomorphism problem. This means it can approximate structural similarities but cannot definitively determine if two graphs are identical under node relabeling. The normalization of the similarity matrix is also a point of consideration; the current sum-of-squares normalization to one is for relative comparison, and other normalizations are possible.

Common Questions

The similarity matrix computes how structurally similar nodes are within a graph or between different graphs, focusing on their position and role rather than just proximity or path counts.

Topics

Mentioned in this video

More from GoogleTalksArchive

View all 79 summaries

Ask anything from this episode.

Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.

Get Started Free