Practical SAHN Clustering for Very Large Data Sets and Expensive Distance Metrics

Abstract

Sequential agglomerative hierarchical non-overlapping (SAHN) clustering techniques are among the classical clustering methods and are heavily applied in many application domains, e.g., in cheminformatics. Asymptotically optimal SAHN clustering algorithms are known for arbitrary dissimilarity measures, but their quadratic time and space complexity, even in the best case, still limits their applicability to small data sets. We present a new pivot-based heuristic SAHN clustering algorithm that exploits the properties of metric distance measures to obtain a best-case runtime of O(n log n) for input size n. Our approach requires only linear space and supports median and centroid linkage. It is especially suitable for expensive distance measures, as it needs only a linear number of exact distance computations. This aspect is demonstrated in our extensive experimental evaluation, where we apply our algorithm to large graph databases in combination with computationally demanding graph distance metrics. We compare our approach to exact state-of-the-art SAHN algorithms in terms of quality and runtime on real-world and synthetic instances, including vector and graph data. The evaluations show a subquadratic runtime in practice and a very low memory footprint. Our approach yields high-quality clusterings and is able to rediscover planted cluster structures in synthetic data sets.
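
As an illustration of why pivots can reduce the number of exact distance computations under a metric, the triangle inequality gives |d(a,p) - d(b,p)| <= d(a,b) <= d(a,p) + d(b,p) for any pivot p. The following is a minimal sketch of that general bounding idea only, not the algorithm described in the paper. The function names `pivot_table` and `bounds`, the Euclidean metric, and the choice of pivots are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    # Stand-in for an expensive metric (assumption: any metric would work here).
    return math.dist(a, b)


def pivot_table(
    items: Sequence[Point],
    pivots: Sequence[Point],
    dist: Callable[[Point, Point], float],
) -> List[List[float]]:
    # One exact distance per (item, pivot) pair: n * k computations in total.
    return [[dist(x, p) for p in pivots] for x in items]


def bounds(row_a: Sequence[float], row_b: Sequence[float]) -> Tuple[float, float]:
    # Triangle inequality: for every pivot p,
    #   |d(a,p) - d(b,p)| <= d(a,b) <= d(a,p) + d(b,p).
    # Take the tightest lower and upper bound over all pivots.
    lower = max(abs(da - db) for da, db in zip(row_a, row_b))
    upper = min(da + db for da, db in zip(row_a, row_b))
    return lower, upper


items = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
pivots = [(0.0, 0.0), (6.0, 0.0)]
table = pivot_table(items, pivots, euclidean)

# Bound d(items[1], items[2]) without computing it exactly.
low, high = bounds(table[1], table[2])
```

With a fixed number k of pivots, building the table costs n * k exact distance computations, which is linear in n; pairwise distances can then be estimated from the table alone.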

Authors

Kriege, Nils M.
Schäfer, Till
Mutzel, Petra

Shortfacts

Category	Journal Paper
Divisions	Data Mining and Machine Learning
Journal or Publication Title	Journal of Graph Algorithms and Applications
ISSN	1526-1719
Date	December 2014