Published on Wed Jul 18 2018

On the Interaction Effects Between Prediction and Clustering

Matt Barnes, Artur Dubrawski

Machine learning systems increasingly depend on pipelines of multiple algorithms to provide high-quality and well-structured predictions. This paper argues that interaction effects between clustering and prediction can cause subtle adverse behaviors during cross-validation.

Abstract

Machine learning systems increasingly depend on pipelines of multiple algorithms to provide high-quality and well-structured predictions. This paper argues that interaction effects between clustering and prediction (e.g., classification, regression) algorithms can cause subtle adverse behaviors during cross-validation that may not be initially apparent. In particular, we focus on the problem of estimating the out-of-cluster (OOC) prediction loss given an approximate clustering with a probabilistic error rate. Traditional cross-validation techniques exhibit significant empirical bias in this setting, and the few attempts to estimate and correct for these effects are intractable on larger datasets. Further, no previous work has characterized the conditions under which these empirical effects occur or, when they do occur, what properties they have. We answer these questions precisely by providing theoretical properties that hold in various settings, and we prove that expected out-of-cluster loss behavior rapidly decays with even minor clustering errors. Fortunately, we are able to leverage these same properties to construct the hypothesis tests and scalable estimators necessary for correcting the problem. Empirical results on benchmark datasets validate our theoretical results and demonstrate how scaling techniques provide solutions to new classes of problems.
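
To make the out-of-cluster setting concrete, here is a minimal sketch of a cross-validation split that holds out whole clusters, so that no cluster contributes records to both the training and the test side. It is an illustration only and not the estimator or hypothesis test proposed in the paper; the function name `out_of_cluster_folds`, the greedy fold assignment, and the toy cluster labels are all assumptions made for this example.

```python
from collections import defaultdict


def out_of_cluster_folds(cluster_ids, n_folds):
    """Yield (train_idx, test_idx) pairs in which no cluster spans both sides.

    cluster_ids[i] is the (possibly approximate) cluster label of record i.
    """
    # Group record indices by their cluster label.
    members = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        members[c].append(i)

    # Greedy assignment (an assumption of this sketch): place each whole
    # cluster, largest first, into the fold that currently has the fewest records.
    folds = [[] for _ in range(n_folds)]
    for c in sorted(members, key=lambda c: -len(members[c])):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[c])

    # Each fold serves once as the held-out (out-of-cluster) test set.
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j in range(n_folds) if j != k for i in folds[j])
        yield train, test


# Toy labels, hypothetical: four clusters over eight records.
cluster_ids = ["a", "a", "b", "b", "b", "c", "d", "d"]
splits = list(out_of_cluster_folds(cluster_ids, n_folds=2))
```

The split is only as clean as the labels it is given. If `cluster_ids` comes from an approximate clustering, records that truly belong together can still land on opposite sides of a split, which is the situation the abstract describes as biasing the estimated out-of-cluster loss.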

Mon Jun 15 2020
Machine Learning
Selecting the Number of Clusters with a Stability Trade-off: an Internal Validation Criterion
Clustering stability has emerged as a natural and model-agnostic principle. But stability alone is not a well-suited tool to determine the number of clusters. We propose a new principle for clustering validation: a good clustering should be stable.
Sat Aug 10 2019
Machine Learning
A Critical Note on the Evaluation of Clustering Algorithms
Experimental evaluation is a major research methodology for investigating clustering algorithms and many other machine learning algorithms. Benchmark datasets have been widely used in the literature, and their quality plays a key role in the value of the research.
Thu Mar 29 2018
Machine Learning
On Hyperparameter Search in Cluster Ensembles
Quality assessments of models in unsupervised learning and clustering have been a long-standing problem in machine learning research. The lack of robust and universally applicable cluster validation scores often makes algorithm selection and hyperparameter evaluation a tough guess.
Tue May 22 2018
Machine Learning
Clustering - What Both Theoreticians and Practitioners are Doing Wrong
Unsupervised learning is widely recognized as one of the most important challenges facing machine learning. Current theoretical understanding and practical implementations of such tasks are very rudimentary. I claim that the most significant challenge for clustering is model selection.
Tue Aug 29 2017
Machine Learning
EC3: Combining Clustering and Classification for Ensemble Learning
Classification and clustering algorithms have been proved to be successful individually in different contexts. In this paper, we propose a novel algorithm, called EC3, that merges classification and clusters. EC3 is based on a principled combination of multiple classification and multiple clustering methods using an optimization function.
Fri Sep 04 2020
Machine Learning
The Area Under the ROC Curve as a Measure of Clustering Quality
The Area Under the Receiver Operating Characteristics (ROC) Curve, or AUC, is a well-known performance measure in the supervised learning domain. In this work, we explore AUC as a performance measure for cluster analysis. We show that the AUCC of a given candidate clustering solution has an expected value.