Practical SAHN Clustering for Very Large Data Sets and Expensive Distance Metrics

Practical SAHN Clustering for Very Large Data Sets and Expensive Distance Metrics

Abstract

Sequential agglomerative hierarchical non-overlapping (SAHN) clustering techniques belong to the classical clustering methods applied heavily in many application domains, e.g., in cheminformatics. Asymptotically optimal SAHN clustering algorithms are known for arbitrary dissimilarity measures, but their quadratic time and space complexity even in the best case still limits the applicability to small data sets. We present a new pivot based heuristic SAHN clustering algorithm exploiting the properties of metric distance measures in order to obtain a bestcase runtime of O(nlogn) for the input size n. Our approach requires only linear space and supports median and centroid linkage. It is especially suitable for expensive distance measures, as it needs only a linear number of exact distance computations. This aspect is demonstrated in our extensive experimental evaluation, where we apply our algorithm to large graph databases in combination with computationally demanding graph distance metrics. We compare our approach to exact state-of-the-art SAHN algorithms in terms of quality and runtime on real-world and synthetic instances including vector and graph data. The evaluations show a subquadratic runtime in practice and a very low memory footprint. Our approach yields high-quality clusterings and is able to rediscover planted cluster structures in synthetic data sets.

Grafik Top
Authors
  • Kriege, Nils M.
  • Schäfer, Till
  • Mutzel, Petra
Grafik Top
Shortfacts
Category
Journal Paper
Divisions
Data Mining and Machine Learning
Journal or Publication Title
Journal of Graph Algorithms and Applications
ISSN
1526-1719
Date
December 2014
Export
Grafik Top