Shared Flashcard Set

Details

Lecture eleven
Biocore: Clustering
24
Biology
Graduate
12/13/2009

Cards

Term

 

 

 

clustering provides a means of what?

Definition

 

 

 

finding structure in a collection of unlabeled data

Term

 

 

 

 

reasons for clustering

Definition

determine intrinsic groupings

 

 

1. classification

2. simplification

3. to create populations of types for downstream analysis

Term

 

 

 

two kinds of metrics to calculate similarity

give example of each

Definition

 

 

statistical

(Pearson correlation)

 

geometric

(Euclidean distance)
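
The two kinds of metric can disagree about which profiles are "similar". A minimal sketch, using made-up expression profiles (`gene_a`, `gene_b`, `gene_c` are hypothetical names, not from the lecture):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def euclidean(x, y):
    """Straight-line (geometric) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles measured over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # mirror image of gene_a

r_ab = pearson(gene_a, gene_b)    # perfectly correlated
r_ac = pearson(gene_a, gene_c)    # perfectly anti-correlated
d_ab = euclidean(gene_a, gene_b)
d_ac = euclidean(gene_a, gene_c)
```

Here the correlation groups `gene_a` with `gene_b`, while the Euclidean distance puts `gene_a` closer to `gene_c`, so the choice of metric changes the clustering.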

Term

 

 

 

Pearson Correlation

tells you what two things about correlation?

Definition

tells you degree and direction of correlation

 

measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.

 

used as a measure of the strength of linear dependence between two variables.

Term

 

 

 

distance metrics

if paying attention to the most deviant conditions, what method do you use to measure distance?

Definition

 

 

 

Chebyshev distance
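
Chebyshev distance is the largest absolute difference along any single dimension, so one strongly deviating condition dominates the result. A small sketch comparing it with Manhattan distance (the comparison metric and the example points are chosen here for illustration only):

```python
def chebyshev(x, y):
    """Largest absolute difference along any one dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

def manhattan(x, y):
    """Sum of absolute differences over all dimensions."""
    return sum(abs(a - b) for a, b in zip(x, y))

x = [1.0, 1.0, 1.0, 1.0]
y = [2.0, 2.0, 2.0, 2.0]  # differs a little in every condition
z = [1.0, 1.0, 1.0, 4.0]  # differs a lot in one condition only

cheb_xy = chebyshev(x, y)
cheb_xz = chebyshev(x, z)
manh_xy = manhattan(x, y)
manh_xz = manhattan(x, z)
```

Manhattan distance ranks `z` as closer to `x`, whereas Chebyshev distance ranks `y` as closer, because it only looks at the single most deviant condition.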

Term

 

 

 

k-means clustering

Definition

 

 

 

method of cluster analysis in which n objects are partitioned into k clusters, with each object belonging to the cluster with the nearest mean

Term

 

 

 

k-means clustering

advantages and disadvantages

Definition

 

advantages:

simplicity

speed

can use on large datasets

 

disadvantages:

can give different results on each run, because the starting centroids are picked at random

Term

 

 

 

in k-means clustering, how is each cluster primarily defined?

Definition

 

 

via its centroid

Term

 

 

 

 

centroid

Definition

 

 

 

mean over all cluster members for each dimension
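
As a quick illustration, the centroid can be computed one dimension at a time (the cluster members below are made up):

```python
def centroid(points):
    """Mean over all cluster members, computed separately for each dimension."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

# Three hypothetical cluster members in two dimensions.
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
c = centroid(cluster)
```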

Term

 

 

 

steps to k-means clustering

Definition

1. pick K random points and define as cluster centroids

2. add points to the cluster with the closest centroid (associate every observation with the nearest mean)

3. recalculate the means

4. repeat until you have reached the maximum number of iterations OR until the centroids no longer move when you recalculate them
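
The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, picks the initial centroids from the data points themselves, and the `seed` and `max_iter` parameters are additions for reproducibility.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k random data points and use them as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster with the closest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated hypothetical groups.
data = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
labels, centroids = kmeans(data, k=2)
```

The cluster numbers themselves are arbitrary and depend on the random start; only the grouping is meaningful.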

Term

 

 

 

why are outliers a problem in k-means clustering?

Definition

 

 

 

Every case is forced to join a cluster no matter how atypical or remote it might be, with the result that the classification can be substantially distorted.

Term

 

 

 

 

hierarchical clustering

Definition

 

 

 

find successive clusters using previously established clusters
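
One way to picture "successive clusters from previously established clusters" is a bottom-up (agglomerative) loop that repeatedly merges the two closest existing clusters. The sketch below assumes Euclidean distance and measures cluster-to-cluster distance by the closest pair of points; both choices are assumptions made for the example.

```python
import math

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]   # start with every point in its own cluster
    merge_heights = []                 # distance at which each merge happened
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Distance between clusters = closest pair of members (assumption).
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_heights.append(d)
        clusters[i] = clusters[i] + clusters[j]  # build on the existing clusters
        del clusters[j]
    return clusters, merge_heights

# Hypothetical one-dimensional data.
data = [[0.0], [1.0], [5.0], [6.0], [20.0]]
clusters, merge_heights = agglomerate(data, n_clusters=2)
```

The recorded merge heights are what a dendrogram draws on its distance axis.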

Term

 

 

 

two main classes of hierarchical clustering:

Definition

 

 

 

agglomerative

 

divisive

Term

 

 

 

 

greedy algorithm

example

Definition

k-means clustering

 

perform a single procedure over and over until it can't be done any more and see what kind of results it will produce.

 

the procedure tries to maximize the return based on examining local conditions, in the hope that this leads to a desired outcome for the global problem.

 

In some cases such a strategy is guaranteed to offer optimal solutions, and in some other cases it may provide a compromise that produces acceptable approximations.

Term

 

how to measure distance between two groups in agglomerative hierarchical clustering

3 ways

Definition

 

average linkage

 

single linkage

 

complete linkage

Term

 

 

 

single linkage

Definition

 

 

 

way to calculate distance between two groups in agglomerative hierarchical clustering

 

minimum distance between any pair of points, one taken from each cluster

Term

 

 

 

 

complete linkage

Definition

 

 

 

 

way to calculate distance between two groups in agglomerative hierarchical clustering

 

maximum distance between any pair of points, one taken from each cluster
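
The linkage rules can be compared side by side on the same pair of clusters. In this sketch, single and complete linkage follow the cards above; average linkage has no card of its own, so it is taken here in its usual sense, the mean of all between-cluster pairwise distances. Euclidean distance and the example clusters are assumptions.

```python
import math
from itertools import product

def pairwise(a, b):
    """All distances between a point in cluster a and a point in cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pairwise(a, b))      # closest pair

def complete_linkage(a, b):
    return max(pairwise(a, b))      # farthest pair

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)          # mean over all pairs

# Two hypothetical clusters lying on a line.
cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[3.0, 0.0], [5.0, 0.0]]

single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
```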

Term

 

 

 

 

how do you denote clusters in a dendrogram?

Definition

 

 

 

horizontal cuts

Term

 

 

 

 

Silhouette

what do s(i) values of 1, -1, and 0 tell you?

Definition

 

method of interpretation and validation of clusters of data

 

provides a succinct graphical representation of how well each object lies within its cluster.

 

value of 1 means the datum is in the appropriate cluster

value of 0 means it lies between two clusters

value of -1 means the datum should be in the neighboring cluster
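
The cards do not give the formula, so this sketch assumes the standard definition: s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. Euclidean distance and the example data are assumptions, and at least two clusters are required.

```python
import math

def silhouette_values(points, labels):
    """Return s(i) for every point, given one cluster label per point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_values(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_values(points, [0, 1, 1, 0])   # second point grouped with the far side
```

With the sensible grouping every value is close to 1; with the deliberately bad grouping the value for the second point turns negative.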

Term

 

what can computing the average silhouette value s(i) of a cluster and of the entire dataset tell you?

Definition

 

average s(i) of a cluster: how tightly grouped all the data in the cluster are

 

average s(i) of entire data set: measure of how appropriately the data has been clustered

Term

 

 

 

silhouette plots and averages are a powerful tool for determining what?

Definition

 

 

 

the natural number of clusters within a dataset.
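
A sketch of how the average silhouette might be used to choose the number of clusters: score several candidate clusterings and keep the one with the highest average s(i). The candidate labelings below are written by hand for illustration; in practice they would come from running a clustering algorithm once per candidate k. The s(i) formula is the standard (b - a) / max(a, b) definition, assumed here because the cards do not state it.

```python
import math

def silhouette_values(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for single-member clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

def average(values):
    return sum(values) / len(values)

points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]

# Hypothetical clusterings for k = 2 and k = 3.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}

# Average s(i) over the entire data set for each candidate k.
scores = {k: average(silhouette_values(points, labels))
          for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```

Splitting the right-hand group further lowers the overall average, so k = 2 is preferred for this toy data.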

Term

 

 

 

PAM

partitioning around medoids

Definition

 

more robust k-means

 

computes medoids instead of centroids as cluster centers

 

unlike k-means, it uses data to define cluster # (often silhouette)
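
The robustness comes from the choice of cluster center. A medoid is an actual cluster member, the one with the smallest total distance to all other members, so a single extreme point moves it far less than it moves a centroid. The sketch below shows only that center computation, not the full PAM procedure; Euclidean distance and the data are assumptions.

```python
import math

def centroid(points):
    """Per-dimension mean; need not coincide with any data point."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def medoid(points):
    """The member with the smallest summed distance to all other members."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster with one extreme outlier.
cluster = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [50.0, 50.0]]

c = centroid(cluster)  # dragged toward the outlier
m = medoid(cluster)    # stays within the bulk of the data
```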

 

 

Term

 

 

 

model based clustering

two characteristics

Definition

 

 

 

strategy for determination of # of clusters and cluster membership

 

two characteristics:

 

1. fits Gaussian distributions to data

2. uses BIC to pick # of distributions

Term

 

 

 

 

Bayesian information criterion

BIC

Definition

 

 

model selection criterion that balances goodness of fit against the number of parameters; helps determine the best clustering model along with the # of clusters
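
A rough sketch of the idea, under several assumptions: BIC is written as k ln(n) - 2 ln(L), so lower is better (some software reports the opposite sign); the data are one-dimensional; and the two-Gaussian model is built from a hand-made split of the data rather than fitted with a real mixture-model algorithm.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian(xs):
    """Maximum-likelihood mean and variance of a sample."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var

def mixture_log_likelihood(xs, components):
    """components is a list of (weight, mean, variance) tuples."""
    return sum(math.log(sum(w * normal_pdf(x, m, v) for w, m, v in components))
               for x in xs)

def bic(log_likelihood, n_params, n_obs):
    # Lower is better under this sign convention (an assumption of this sketch).
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical data with two obvious groups.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 9.0, 9.2, 8.8, 9.1, 8.9]
low, high = data[:5], data[5:]  # hand-made split, for illustration only

one_gaussian = [(1.0, *fit_gaussian(data))]
two_gaussians = [(len(g) / len(data), *fit_gaussian(g)) for g in (low, high)]

# Free parameters: mean + variance = 2 for one Gaussian;
# two means + two variances + one mixing weight = 5 for two Gaussians.
bic_one = bic(mixture_log_likelihood(data, one_gaussian), 2, len(data))
bic_two = bic(mixture_log_likelihood(data, two_gaussians), 5, len(data))
```

Despite the penalty for three extra parameters, the two-Gaussian model scores the lower BIC here, so it would be picked.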
