Key Moments

Clustering Aggregation

Google TalksGoogle Talks
Education5 min read55 min video
Aug 22, 2012|581 views|2|1
Save to Pod

Want to know something specific about what's covered?

We've already dissected every moment. Ask and we will deliver (with timestamps).

TL;DR

Clustering aggregation combines multiple clustering algorithms to improve results, but a simple approach can be twice as far from optimal as the best input clustering.

Key Insights

1

The clustering aggregation problem aims to find a new clustering that disagrees as little as possible with a given set of input clusterings, measured by the sum of disagreements.

2

The clustering aggregation problem can be viewed as a graph partitioning problem (correlation clustering), where edges within clusters are penalized based on disagreement, and edges cut are penalized by 1 minus disagreement.

3

A simple 'best clustering' algorithm for clustering aggregation, which selects the best performing input clustering, provides a factor of 2 approximation guarantee.

4

For correlation clustering, an algorithm that iteratively clusters points around a 'center' point within a certain distance (alpha < 1/2) yields a factor of 3 approximation.

5

Empirical results suggest that the clustering aggregation framework, particularly with a local search approach, can perform competitively with specialized algorithms on categorical data, while automatically discovering the number of clusters.

6

For network immunization, a heuristic based on the eigenvector centrality of the system's matrix works better than immunizing max-degree nodes on small-world graphs for both independent cascade and birth-death virus propagation models.

The challenge of combining multiple clusterings

The core problem addressed is finding a 'consensus' clustering from a set of existing, diverse clusterings. This is motivated by the idea that different clustering algorithms capture different aspects of the data, and combining them could lead to a more robust and accurate overall clustering than any single algorithm could achieve. The speaker, Aristides Gionis, introduces the concept of 'clustering aggregation' where the goal is to minimize disagreement with a collection of input clusterings. This problem arises naturally in various contexts where multiple analytical tools or perspectives are available, similar to how search engine results are often aggregated.

Defining disagreement and the aggregation objective

To formalize the problem, Gionis defines 'disagreement' between two clusterings on a pair of data points. Two clusterings disagree if they make different decisions about a pair of points: either one places them in the same cluster while the other splits them. The disagreement distance between two clusterings is the sum of these pairwise disagreements. If two clusterings both place points in the same cluster or both split them, they agree, and the disagreement is zero. The overall objective for clustering aggregation is to find a new clustering that minimizes the sum of disagreement distances to all the given input clusterings.

Correlation clustering: a graph-based generalization

The clustering aggregation problem can be reframed as a graph partitioning problem known as 'correlation clustering'. Here, the data points are vertices in a complete graph, and edge weights represent the degree of agreement. If two points are placed in the same cluster, the cost incurred is related to the disagreement (i.e., the fraction of input clusterings that split them). If they are split into different clusters, the cost is related to their agreement (i.e., 1 minus the disagreement). This formulation is more general because one can derive edge weights directly from a graph without necessarily starting from input clusterings. A key advantage of this framework is that it naturally handles the problem of choosing the number of clusters (k), as it penalizes both intra-cluster and inter-cluster edges.

A simple factor-2 approximation for clustering aggregation

A straightforward but theoretically sound algorithm for clustering aggregation emerges from the observation that the disagreement distance induces a metric space on the set of all possible clusterings. The 'best clustering' algorithm simply iterates through all the provided input clusterings and selects the one that yields the minimum disagreement score. This simple approach achieves a factor of 2 approximation, meaning its score is at most twice as bad as the optimal solution. While intuitive, this method might not always be insightful if one expects a truly novel clustering that is a blend of the inputs, rather than just a selection from them.

An iterative algorithm for correlation clustering

For the more general correlation clustering problem, an iterative algorithm provides a factor of 3 approximation. This algorithm works by picking an arbitrary vertex 'u' and forming a 'ball' of other vertices within a distance of alpha (typically set to 0.5) from 'u'. If such a ball can be formed, these packed points are clustered together. If the average distance to 'u' is greater than alpha, 'u' is considered an outlier and forms its own cluster. This process iteratively removes points (either the ball or just 'u') and repeats until all points are clustered. This approach leverages the triangle inequality to ensure that points close to a center are also close to each other.

Practical considerations and experimental results

The speaker discusses practical aspects, including local search algorithms that can refine initial clusterings and the scalability challenges of quadratic complexity for these algorithms. Sampling techniques are proposed to improve scalability. Experimental evaluations on categorical data, using metrics like purity and classification error, demonstrate that clustering aggregation methods (especially local search) can be competitive with specialized algorithms like ROCK and LIMBO, with the added benefit of automatically determining the number of clusters. For instance, on the congressional voting dataset, the aggregation approach achieved a classification error of 12%, comparable to other methods but without needing to pre-specify 'k'.

Network immunization strategies

Transitioning to a different research area, Gionis discusses network immunization. The problem is to select a small set of nodes to immunize to minimize the spread of a virus or harmful agent through a network, given a budget. Two virus propagation models are considered: the 'independent cascade' model (where an infected node attempts to infect neighbors with a probability 'p') and a 'birth-death' process model (involving propagation and healing probabilities). For the independent cascade model, a heuristic greedy algorithm that aims to minimize the expected size of the infected component performs better than simply immunizing nodes with the highest degree, especially on small-world graphs.

Heuristics for virus propagation models

For the birth-death process, the goal is to select nodes to immunize that minimize the dominant eigenvalue of the system matrix, as this directly relates to the infection rate. A heuristic strategy involves immunizing nodes corresponding to the largest coordinates of the system's principal eigenvector. While not theoretically proven to be optimal, this heuristic is found to perform well in practice, often matching or exceeding the performance of immunizing max-degree nodes, particularly on small-world network structures. The speaker notes that max-degree immunization is effective on scale-free networks but less so on small-world ones.

Common Questions

Clustering aggregation is a technique that combines multiple existing clustering algorithms to produce a new clustering with improved quality. The goal is to find a new clustering that disagrees as little as possible with the input clusterings.

Topics

Mentioned in this video

More from GoogleTalksArchive

View all 48 summaries

Ask anything from this episode.

Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.

Get Started Free