Published on Thu Feb 21 2008

Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables

Benhuai Xie, Wei Pan, Xiaotong Shen


Abstract

Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures; hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which, however, may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with the means, in the more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables, via inclusion or exclusion of a group of variables altogether, is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions, in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.
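To make the shrinkage-and-thresholding idea in the M-step concrete, here is a minimal sketch of a penalized mean update for a Gaussian mixture. It assumes standardized data with (for simplicity) unit variances, so an L1 penalty on the cluster means reduces to a soft-thresholding operator that zeroes out noise variables; the paper's full method also shrinks cluster-specific variances and supports grouped penalties, which this toy sketch omits. The function names (`soft_threshold`, `penalized_mean_update`) are illustrative, not from the paper.

```python
import numpy as np

def soft_threshold(x, lam):
    # Soft-thresholding operator: shrinks values toward 0 and sets
    # entries with magnitude below lam exactly to 0.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def penalized_mean_update(X, resp, lam):
    """Toy penalized M-step for cluster means under an L1 penalty.

    X    : (n, p) standardized data matrix
    resp : (n, K) responsibilities from the E-step
    lam  : penalty parameter; larger values zero out more variables
    """
    nk = resp.sum(axis=0)                  # effective cluster sizes
    mu_mle = (resp.T @ X) / nk[:, None]    # unpenalized weighted means
    # The penalized update soft-thresholds each mean; variables whose
    # cluster means are all shrunk to 0 are effectively excluded (noise).
    return soft_threshold(mu_mle, lam / nk[:, None])
```

Variables whose means are thresholded to zero in every cluster contribute nothing to the cluster assignment, which is how the penalty performs variable selection inside EM.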

Sun Jan 29 2012
Machine Learning
A robust and sparse K-means clustering algorithm
In many situations where the interest lies in identifying clusters one might expect that not all available variables carry information about these groups. Data quality (e.g. outliers or missing entries) might present a serious and sometimes hard-to-assess problem for large and complex datasets.
Wed Oct 05 2016
Machine Learning
Non-Parametric Cluster Significance Testing with Reference to a Unimodal Null Distribution
Cluster analysis is an unsupervised learning strategy that can be employed to identify subgroups of observations in data sets of unknown structure. This strategy is particularly useful for analyzing high-dimensional data such as microarray gene expression data.
Sat Jan 02 2016
Machine Learning
Joint Estimation of Precision Matrices in Heterogeneous Populations
The method uses a Laplacian shrinkage penalty to encourage similarity among estimates from different subpopulations. The method can be extended to allow for faster computations in high dimensions.
Thu Sep 26 2019
Machine Learning
CS Sparse K-means: An Algorithm for Cluster-Specific Feature Selection in High-Dimensional Clustering
Feature selection is an important and challenging task in high dimensional clustering. In genomics, it is highly likely that a gene can only define one subtype against all the other subtypes. In this paper, we propose a K-means based clustering algorithm that discovers informative features.
Mon Oct 07 2019
Machine Learning
Gaussian Mixture Clustering Using Relative Tests of Fit
We consider clustering based on significance tests for Gaussian Mixture Models (GMMs). We study the limiting distribution and power of this approach in some examples. We then introduce a new test based on the idea of relative fit.
Sun Feb 21 2010
Machine Learning
Partition Decoupling for Multi-gene Analysis of Gene Expression Profiling Data
The Partition Decoupling Method (PDM) can reveal non-linear and non-convex geometries present in the data. The PDM can be used to find sets of mechanistically-related genes.