Computational and Applied Sciences Colloquium Series (On Clusterings -- Good, Bad, and Spectral)

Announcing the Computational & Applied Sciences Colloquium Series

Dr. Santosh Vempala
Assistant Professor
Massachusetts Institute of Technology
Friday, February 23, 2001
Center for Environmental Sciences & Technology Management (CESTM) Auditorium
University at Albany
4 p.m. to 6 p.m.

Refreshments Served
Sponsored by the University at Albany�s Division for Research

On Clusterings — Good, Bad and Spectral

Clustering, or partitioning into "similar" groups, is a classical problem in mathematics as well as the applied sciences. The availability of vast amounts of data has revitalized algorithmic research on the problem. There are several measures of the quality of a clustering that are seductively simple to define (e.g., k-median, min sum, min diameter), but have the shortcoming that they are easy to fool. That is, there are simple examples where it is clear what the "right" clustering is, but optimizing the measures defined produces undesirable solutions. Furthermore, a heuristic that often does well in practice, spectral clustering, performs poorly under these theoretical measures. Roughly speaking, spectral clustering is the general technique of partitioning the rows of a matrix according to their components in the top few singular vectors.
We propose a new measure of the quality of a clustering. A simple recursive algorithm is shown to have a (polylog) worst-case guarantee under the new measure. We then present two results about spectral clustering. One proffers worst-case guarantees while the other shows that if there exists a "good" clustering, then the algorithm will find one close to it.
In addition, spectral clustering can be made very fast, via randomized algorithms for approximating a given m x n matrix by a low-rank matrix. Standard methods take time that is polynomial in m and n. In contrast, the running time of our first algorithm depends only on the quality of the desired approximation and not on the size of the matrix, but it assumes that the entries of the matrix can be sampled in a specific manner. Our second algorithm needs no assumptions and has a running time that is linear in the number of non-zero entries. This talk will be self-contained and will include a demo of clustering web searches in real-time.
Dr. Santosh Vempala is an assistant professor of mathematics at MIT. He graduated from Carnegie Mellon in 1997 with his Ph.D. Following one year as a Miller fellow at Berkeley, Dr. Vempala joined MIT's mathematics department. His research interests are in geometry, randomness and combinatorics, especially in the context of polynomial-time algorithms. Each year, he teaches a graduate course called "An Eye for Elegance."
For directions to CESTM go to https://www.albany.edu/about_the_university/maps/uptown.html.
For additional information call (518) 442-3332.