Prior to starting an analysis, certain data transformations can be applied to
improve analysis performance. These include log transformation, mean- or
median-centering, and scaling. Usually, mean- or median-centering is done
on log2-transformed data.
Different types of adjustments may be applied on top of one another in the following
sequence: log2 transformation, centering, and scaling.
If both mean-centering and scaling are performed, this is equivalent to
"standardization".
2. Hierarchical Clustering
Hierarchical cluster analysis is conducted with the R amap package.
2.1. Distance measures
Suppose we have two vectors x and y; the distance between them can be calculated in the following ways:
- Non-centered Pearson: 1 - sum(x_i y_i) / sqrt[ sum(x_i^2) sum(y_i^2) ].
- Correlation (centered Pearson): 1 - corr(x, y).
- Euclidean: Usual distance between the two vectors (2 norm), sqrt( sum( (x_i - y_i)^2 ) ).
- Maximum: Maximum distance between two components of x and y (supremum norm).
- Manhattan: Absolute distance between the two vectors (1 norm), sum( |x_i - y_i| ).
- Canberra: sum( |x_i - y_i| / |x_i + y_i| ). Terms with zero numerator and
denominator are omitted from the sum and treated as if the values were
missing.
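For example, these dissimilarities can be computed directly with the amap package
(a sketch, assuming the matrix expr from above; Dist() works on the rows of the matrix):

    library(amap)

    d.cor <- Dist(expr, method = "correlation")  # 1 - Pearson correlation
    d.euc <- Dist(expr, method = "euclidean")    # 2 norm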
2.2. Clustering methods
- Single Linkage: The distances are measured between each member
of one cluster and each member of the other cluster. The minimum of these
distances is considered the cluster-to-cluster distance. It adopts a "friends
of friends" clustering strategy, tends to find "loose" clusters, and may
suffer from a "chaining" effect in microarray data analysis.
- Average Linkage: The average of the distances between each member of one
cluster and each member of the other cluster is used as the
cluster-to-cluster distance.
- Complete Linkage: The distances are measured between each
member of one cluster and each member of the other cluster. The maximum of these
distances is considered the cluster-to-cluster distance. It tends to find
compact, spherical clusters.
- Ward's
minimum variance method aims at finding compact, spherical clusters. The
distance between two clusters is the ANOVA sum of squares between the
two clusters added up over all the variables. At each generation, the
within-cluster sum of squares is minimized over all partitions obtainable by
merging two clusters from the previous generation. Ward's method tends to join
clusters with a small number of observations, and it is strongly biased toward
producing clusters with roughly the same number of observations. It is also
very sensitive to outliers (Milligan 1980).
- Centroid
method: The distance between two clusters is defined as the (squared)
Euclidean distance between their centroids or means. The centroid method is
more robust to outliers than most other hierarchical methods but in other
respects may not perform as well as Ward's method or average linkage (Milligan
1980). The centroid method was originated by Sokal and Michener (1958).
- The other methods can be regarded as aiming
for clusters with characteristics somewhere between the single and complete
link methods.
Based on previous experience, average linkage
and complete linkage may be the preferred methods for microarray data analysis.
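As a minimal sketch, hierarchical clustering with average linkage on the correlation-based
distance object d.cor from the sketch above (base R hclust() is shown here; the amap
package offers the analogous hcluster() function, which combines the distance and linkage steps):

    hc <- hclust(d.cor, method = "average")   # average linkage on 1 - correlation distances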
2.3. Output:
- Dendrogram with all observations.
- Heatmap of all observations.
- The expression line graph and heatmap for each
of the sub-clusters, for a user-specified number of clusters.
- The probe sets in each sub-cluster, which can
be saved as data sets for refined analysis or comparison.
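A minimal sketch of producing the outputs listed above with base R, assuming the matrix
expr and the hc object from the previous sketches (k = 4 is a hypothetical choice of
cluster number):

    plot(hc)                                   # dendrogram with all observations
    heatmap(expr, Rowv = as.dendrogram(hc))    # heatmap ordered by the dendrogram
    groups <- cutree(hc, k = 4)                # cut the tree into k = 4 sub-clusters
    split(rownames(expr), groups)              # probe sets belonging to each sub-cluster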
3. Partitioning Methods
There are two partitioning methods provided. Both require the user to pre-define the
number of cluster centers.
3.1 PAM,
or partitioning around medoids, is one of the k-medoids methods. Different from the
usual k-means approach, it also accepts a dissimilarity matrix, and it is more
robust because it minimizes a sum of dissimilarities instead of a sum of squared
Euclidean distances.
The PAM-algorithm is based on the search for 'k' representative objects or
medoids among the observations of the dataset, which should represent the
structure of the data. After finding a set of 'k' medoids, 'k' clusters are
constructed by assigning each observation to the nearest medoid. The goal is to
find 'k' representative objects which minimize the sum of the dissimilarities of
the observations to their closest representative object.
The distance metrics available for calculating
dissimilarities between observations are "euclidean" and "manhattan". Euclidean
distances are root sums of squares of differences, and manhattan distances are
sums of absolute differences.
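A minimal sketch with the cluster package, assuming the matrix expr from above
(k = 4 and the metric are hypothetical choices):

    library(cluster)

    fit <- pam(expr, k = 4, metric = "manhattan")  # k medoids, manhattan dissimilarities
    fit$medoids                                    # the representative observations
    table(fit$clustering)                          # size of each cluster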
3.2 The K-means method chooses a predefined
number of cluster centers so as to minimize the within-class sum of squares from the
centers. It uses Euclidean distance. When finished, all cluster centers
are at the mean of their Voronoi sets (the set of data points which are nearest
to the cluster center). The algorithm of Hartigan and Wong (1979) is used. It is
most appropriate for suitably scaled continuous variables.
The starting points can be chosen by hierarchical clustering, using
Euclidean distance and average linkage, or they can be selected
randomly during computation.
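A sketch of both starting strategies with base R, assuming the matrix expr
(k = 4 and the object names are hypothetical):

    k      <- 4
    # starting centers from hierarchical clustering (Euclidean distance, average linkage)
    hc.e   <- hclust(dist(expr), method = "average")
    groups <- cutree(hc.e, k)
    starts <- apply(expr, 2, function(col) tapply(col, groups, mean))  # k x ncol matrix of group means
    km1    <- kmeans(expr, centers = starts)   # Hartigan-Wong algorithm by default

    # or random starting centers
    km2 <- kmeans(expr, centers = k)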
3.3. Output:
- For PAM, a plot showing the average silhouette width
over different cluster numbers, aiding in finding the optimal number of clusters.
- For each tried number of partitions, a bivariate cluster plot visualizing the
partition (clustering) of the data. All observations are represented by points in
the plot, using principal components or multidimensional scaling, and an ellipse
is drawn around each cluster.
- The expression line graph and heatmap for each
of the sub-clusters, for a user-specified number of clusters.
- The probe sets in each sub-cluster, which can
be saved as data sets for refined analysis or comparison.
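A sketch of these diagnostics with the cluster package, assuming the matrix expr
(the range of cluster numbers is hypothetical):

    library(cluster)

    # average silhouette width for k = 2, ..., 8 clusters
    widths <- sapply(2:8, function(k) pam(expr, k)$silinfo$avg.width)
    plot(2:8, widths, type = "b",
         xlab = "Number of clusters", ylab = "Average silhouette width")

    # bivariate cluster plot (first two principal components) for one partition
    fit <- pam(expr, k = 4)
    clusplot(expr, fit$clustering)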
4. Self-Organizing Maps (SOM)
4.1. Description
It was proposed by Kohonen (1995), and used in microarray data analysis by
Tamayo (1999). The implementation used by BarleyBase is GeneSOM, the R
package by Jun Yan <jyan@stat.uiowa.edu>.
Default settings are used, except for the x-dimension and y-dimension, which can
be set by the user.
Kohonen, Hynninen, Kangas, and Laaksonen (1995), SOM-PAK, the Self-Organizing
Map Program Package (version 3.1). http://www.cis.hut.fi/research/papers/som_tr96.ps.Z
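A minimal sketch, assuming the som package on CRAN (the package name and interface
are assumptions; GeneSOM may expose a slightly different interface), with the matrix
expr from above and a user-chosen 4 x 3 grid:

    library(som)   # assumed package; GeneSOM may differ

    som.fit <- som(expr, xdim = 4, ydim = 3)   # x- and y-dimensions set by the user
    plot(som.fit)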
5. Principal Component Analysis (PCA)
Principal components analysis (PCA) is often used as a data reduction
technique. It finds a new coordinate system for multivariate data such that the
first coordinate, a linear combination of the columns of the data matrix,
has maximal variance, the second coordinate has maximal variance subject to
being orthogonal to the first, and so on. A singular value decomposition (SVD) is
carried out.
The results are shown as 2-D or 3-D scatter plots, where the first 2 or 3
principal components are used as the axes. The 3-D plots are
shown from all 6 different sides.
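A minimal sketch with base R, assuming the matrix expr (prcomp() performs the
singular value decomposition internally):

    pc <- prcomp(expr)                     # PCA of the observations (rows)
    plot(pc$x[, 1], pc$x[, 2],
         xlab = "PC1", ylab = "PC2")       # 2-D scatter plot of the first two components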
6. Multidimensional Scaling (MDS)
This is one of the multidimensional scaling (MDS) methods. It finds a new,
reduced-dimensionality coordinate system for multivariate data such that an
error criterion between distances in the given space and distances in the
result space is minimized.
It is provided to help users get an idea of whether clear cluster
structures exist in the data, and how many clusters are likely.
The analysis is run 30 times, and the 3 best results
are shown as 3-D scatter plots, each shown from all 6 different sides. Unfortunately, this analysis is very slow.
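The exact MDS routine is not named here; as an illustrative sketch only, a
criterion-minimizing 3-D configuration could be obtained with Sammon mapping from
the MASS package (assuming the matrix expr):

    library(MASS)

    d   <- dist(expr)
    fit <- sammon(d, k = 3)   # minimizes a stress criterion between original and mapped distances
    pairs(fit$points)         # pairwise views of the three coordinates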