Shared Flashcard Set

Details

ML Final
MSAN ML cards
49
Other
Graduate
12/13/2015

Cards

Term
What can you say about the relationship between Bias and Variance
Definition
They are often inversely related
Term
What happens (generally) to Bias and Variance as the complexity of a model increases?
Definition

Bias decreases

Variance increases

(Note: the irreducible error is not affected by model complexity; it stays constant)
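
For reference, the trade-off is usually written as the standard decomposition of the expected test error at a point x_0. The symbols here are assumptions, not taken from the card: \hat f is the fitted model and \varepsilon is the noise term.

```latex
\mathbb{E}\big[(y_0 - \hat f(x_0))^2\big]
  = \mathrm{Var}\big(\hat f(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2
  + \mathrm{Var}(\varepsilon)
```

The last term is the irreducible error, which is why it does not move as complexity changes.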

Term
In what situation should we be worried about the variance of the Beta coefficients in CV?
Definition
When the data set is not sufficiently large
Term
NON-Parametric model
Definition

-Not restricted to assumptions on f

-f can take many shapes/forms

-works well for large n

-can quickly overfit data


e.g. smoothing splines, KNN (plain polynomial regression of a fixed degree is still parametric)

Term
Parametric Model
Definition

-makes assumptions about function form of f

-reduces to estimation of a set of parameters

-can produce overly simplistic model

 

e.g. linear/logistic regression

Term
Explain what 'leave-one-out' CV is doing
Definition
We can think of this as n-fold CV. Each iteration, the test set has 1 obs. and the train set has n-1 obs.
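
A minimal sketch of the leave-one-out loop, assuming a simple least-squares line y ~ a + b*x as the model. The function name `loocv_mse` and the toy data are made up for illustration.

```python
def loocv_mse(xs, ys):
    """Leave-one-out CV error for simple least-squares regression y ~ a + b*x."""
    n = len(xs)
    squared_errors = []
    for i in range(n):
        # Train on every observation except i (n - 1 observations).
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        mean_x = sum(train_x) / (n - 1)
        mean_y = sum(train_y) / (n - 1)
        sxx = sum((x - mean_x) ** 2 for x in train_x)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        # Test on the single held-out observation.
        prediction = intercept + slope * xs[i]
        squared_errors.append((ys[i] - prediction) ** 2)
    # Average the n single-observation test errors.
    return sum(squared_errors) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
exact_line = [2.0, 4.0, 6.0, 8.0, 10.0]   # exactly y = 2x
noisy_line = [2.3, 3.6, 6.4, 7.7, 10.5]   # same trend with noise
perfect_fit_error = loocv_mse(xs, exact_line)
noisy_fit_error = loocv_mse(xs, noisy_line)
```

The model is refit n times, once per held-out observation, which is why this is the same as n-fold CV.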
Term
Explain 'Best Subsets' model selection in a regression context and discuss its limitations
Definition

In Best Subsets, for each size k = 0, ..., p we fit every model containing exactly k predictors and keep the best one of each size (lowest RSS). We then choose among these p + 1 candidates using a criterion such as cross-validated error, AIC, BIC, or adjusted R^2. Because the search is exhaustive, it always finds the 'best' model under the chosen criterion.

 

Being exhaustive means we must fit 2^p different models, which becomes computationally infeasible for large p. Ways around this are approximate algorithms such as forward/backward stepwise regression.
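
To make the 2^p count concrete, here is a small sketch that only enumerates the candidate predictor subsets (it does not fit any models). The predictor names are hypothetical.

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every subset of the predictor names, grouped by size k = 0..p."""
    for k in range(len(predictors) + 1):
        for subset in combinations(predictors, k):
            yield subset

predictors = ["x1", "x2", "x3", "x4"]   # hypothetical predictor names
subsets = list(all_subsets(predictors))
n_models = len(subsets)                 # one candidate model per subset
```

With p = 4 this is already 16 models; each extra predictor doubles the count.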

Term
Tikhonov. wat?
Definition
= Ridge Regression
Term
In simple terms, what does it mean to say that a model has a high variance?
Definition
the fitted model, and therefore its predictions of Y, changes dramatically with small changes in the training data
Term
In Ridge Regression what happens if lambda = 0? lambda = infinity?
Definition

if lambda = 0, we are using OLS

if lambda -> infinity, beta_ridge -> 0 (every penalized coefficient is shrunk to zero)
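
Both limits can be seen in the simplest case. The sketch below assumes a single predictor and no intercept, where the ridge estimate has the closed form sum(xy) / (sum(x^2) + lambda); the data are made up.

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate for y ~ beta * x (no intercept): sum(xy) / (sum(x^2) + lambda)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
ols_slope = ridge_slope(xs, ys, 0.0)      # lambda = 0 reproduces the OLS slope
shrunk_slope = ridge_slope(xs, ys, 10.0)  # a moderate penalty shrinks the slope
tiny_slope = ridge_slope(xs, ys, 1e12)    # a huge penalty drives it toward 0
```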

Term
What key advantage does Ridge regression have over best subsets selection?
Definition
for each lambda we fit only a single model, so searching over lambda (e.g. by cross-validation) stays computationally feasible for data with large p. (best subsets requires 2^p fitted models)
Term
What is a key distinction between Ridge and Lasso regression in terms of the beta coefficients?
Definition
Ridge shrinks the betas toward zero, although never exactly to zero. Lasso can send some coefficients exactly to zero, so it also performs variable selection.
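
A hedged illustration of that distinction, assuming the textbook special case of an orthonormal design, where both estimators are simple functions of the OLS coefficient: ridge rescales it, lasso soft-thresholds it. The function names and the coefficient values are made up.

```python
def ridge_shrink(beta_ols, lam):
    """Orthonormal-design ridge: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Orthonormal-design lasso: soft-thresholding at lambda / 2."""
    if abs(beta_ols) <= lam / 2.0:
        return 0.0                      # small coefficients are set exactly to zero
    if beta_ols > 0:
        return beta_ols - lam / 2.0
    return beta_ols + lam / 2.0

ols = [3.0, -0.4, 0.1]                  # made-up OLS coefficients
lam = 1.0
ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]
```

Here every ridge coefficient is halved but stays nonzero, while lasso keeps only the large coefficient.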
Term
What is the difference in the way Ridge and Lasso penalize coefficients?
Definition

Ridge: lambda * sum of Beta_j^2 (L2 penalty)

Lasso: lambda * sum of abs(Beta_j) (L1 penalty)
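
Written out in full, the two objectives differ only in the penalty term. This is the standard formulation; the indexing (n observations, p predictors, unpenalized intercept \beta_0) is an assumption about notation, not from the card.

```latex
\begin{align*}
\hat\beta^{\text{ridge}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \\
\hat\beta^{\text{lasso}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
\end{align*}
```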

Term
Explain the curse of dimensionality
Definition
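including more predictors in a model does not necessarily improve predictions, despite obtaining a lower training MSE with more predictors. As the number of dimensions grows, the data become sparse and the extra predictors mostly add noise and variance.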
Term
What are the nice results of PCA?
Definition

-we reduce our dimensions from p to M

-with M<<p, we can significantly reduce the variance of coefficients

-PCs are uncorrelated linear combos of the p original variables, getting rid of multicollinearity

Term
What are some disadvantages of PCA?
Definition

-We introduce bias to parameter estimates

-hard to interpret

-Components are ordered by how much variance of the predictor matrix (X) they explain, not by their significance to the response

Term
What must we do to the data before performing PCA and why is this important?
Definition
scale/center the data. In PCA we choose PCs in the order of the amount of variability they explain in X. If we have unscaled data, variables with high variance relative to others will dominate the initial PCs.
Term
Explain the significance of sigma, the covariance matrix of an n x p data matrix X, in PCA.
Definition

We use the eigenvectors of sigma to calculate the principal components. The first PC, for example, is generated by multiplying the (centered) data matrix X by the first eigenvector of sigma, i.e. the one with the largest eigenvalue.

 

-tr(sigma) = total variation in X
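
A minimal pure-Python sketch of these steps, under stated assumptions: the data are only centered (not scaled), the first eigenvector is approximated with power iteration rather than a full eigendecomposition, and the toy matrix X and all function names are made up.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the column-centered data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    sigma = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
             for a in range(p)]
    return sigma, centered

def first_eigenvector(sigma, iterations=500):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    p = len(sigma)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(sigma[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]]
sigma, centered = covariance_matrix(X)
v1 = first_eigenvector(sigma)
# First PC scores: centered data multiplied by the first eigenvector.
pc1_scores = [sum(c[j] * v1[j] for j in range(2)) for c in centered]
total_variation = sigma[0][0] + sigma[1][1]          # tr(sigma)
pc1_variance = sum(s * s for s in pc1_scores) / (len(X) - 1)
share_explained = pc1_variance / total_variation
```

Because the two toy columns are strongly correlated, the first PC alone accounts for most of tr(sigma).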

Term
What must be true about the k PCs generated in PCA?
Definition

The k PCs need to explain a significant portion of the total variation in X.

Being principal components, their loading vectors are by construction orthonormal to each other, and the resulting component scores are uncorrelated.

Term
When do we want to use classification methods over linear regression and why?
Definition
When Y is categorical. Regression assumes there's meaning behind the ordering (and spacing) of the values of Y, whereas categorical variables often have no natural order.
Term
What two things must we know to build a Bayes classifier and why can this be difficult?
Definition

-class conditional probabilities

-prior probabilities

 

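-priors are easy to estimate empirically (the class proportions); class conditional probabilities are difficult because we do not know the true density of X within each class, and estimating it is hard, especially in high dimensions.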

Term
what happens when k gets large in KNN? k=1?
Definition

we approach always predicting the majority class (Smoothing). 

 

k=1 can result in overfitting, since we assign predictions based only on one observation. 
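
Both extremes show up in a tiny one-dimensional sketch. The data, labels, and function name `knn_predict` are made up for illustration.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to the query (1-D)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: abs(train_points[i] - query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

points = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "a", "b", "b", "b"]
# k = 1: the prediction follows the single nearest observation.
near_b_k1 = knn_predict(points, labels, 10.5, k=1)
# k = n: every point votes, so the overall majority class always wins.
near_b_kn = knn_predict(points, labels, 10.5, k=7)
```

The query sits in the middle of the "b" points, yet with k = n it is labeled "a" because "a" is the majority class.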

Term
strengths of KNN
Definition

-good for multinomial classes

-successful with irregular decision boundary

-easy/fast to train

- No Assumptions

Term
weaknesses of KNN
Definition

-have to choose k

-no model is stored; distances to all training obs. are recomputed for each new test obs., so prediction is slow

-accuracy is sensitive to k

 

For a binary classification problem, choose an odd k to avoid tied votes (with k = 2, many cases come down to a coin toss).

Term
key assumption for LDA/QDA
Definition
class conditional densities are Gaussian (LDA additionally assumes a common covariance matrix across classes)
Term
overall aim for LDA/QDA
Definition
approximate the Bayes classifier.
Term
QDA vs. LDA?
Definition

QDA is more flexible (a separate covariance matrix per class), so it has higher variance.

LDA will have higher bias if the assumption that every X | Y=j has the same covariance structure isn't met.

Term
what does logistic regression do that linear regression fails to do in the case of a binary outcome variable?
Definition
keeps probabilities between 0 and 1
Term
comment on the MLE estimate of Bj in logistic regression
Definition
for large n it has an approximately Gaussian distribution with mean Bj. Statistical inference therefore works much as in OLS, using z-tests (Wald tests) in place of t-tests.
Term
what is a key difference about the calculation of B in logistic regression versus linear regression
Definition
B_logistic has no closed-form solution that maximizes the log-likelihood. It is found iteratively using methods such as gradient descent (on the negative log-likelihood) or Newton-Raphson.
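
A minimal sketch of the iterative fit, assuming one predictor plus an intercept and plain gradient ascent on the average log-likelihood. The learning rate, step count, data, and function names are illustrative choices, not from the cards. The last two lines show the probability output and the thresholding step used for classification.

```python
import math

def sigmoid(z):
    """Map any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Gradient ascent on the average log-likelihood of P(y=1|x) = sigmoid(b0 + b1*x)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (y - p) times each input.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]            # classes overlap, so the MLE is finite
b0, b1 = fit_logistic(xs, ys)
probabilities = [sigmoid(b0 + b1 * x) for x in xs]       # always in (0, 1)
classes = [1 if p >= 0.5 else 0 for p in probabilities]  # threshold at 0.5
```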
Term
what is the high level outcome of logistic regression
Definition
a linear discriminant (a linear decision boundary in X)
Term
for a given test observation, what is the output from a logistic regression, and what extra step must be taken in the context of classification
Definition
a probability. to classify we must employ thresholding.
Term
what are the x and y axes on an ROC curve
Definition

y=true pos rate

x=false pos rate
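
A short sketch of how the points on the curve are computed, sweeping the threshold from strictest to loosest. The scores, labels, and function name are made up.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                 # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.2]        # predicted probabilities
labels = [1, 1, 0, 1, 0]                  # true classes
curve = roc_points(scores, labels)
```

The curve always runs from (0, 0) to (1, 1); a better classifier pulls it toward the top-left corner.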

Term
when does Kmeans perform well
Definition

clusters are: 

 

-spherical/elliptical

-similar in variance/spread

-similar in size

And 

- We have a good estimate for "k"

- We need results quickly (short execution time)

Term
binary splitting is seen as a "greedy" approach. What does this mean.
Definition
at each step we pick the split that is best at that moment, without looking ahead to later splits. Once a split has been made, we never revisit or tune it. Decisions are final.
Term
what is a weakness of binary splitting and how do we handle it?
Definition
prone to overfitting. Fix with cost-complexity (weakest link) pruning
Term
when choosing regions for a classification tree, name 3 methods that assess the mixture (impurity) of a region, and what do we do with them?
Definition

-classification error rate

-Gini Index

-cross-entropy

 

in each case, we want to minimize the measure
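
The three measures can be sketched as functions of the class proportions within a region. The function names and example proportions are illustrative; the natural log is assumed for cross-entropy.

```python
import math

def classification_error(proportions):
    """1 minus the share of the most common class in the region."""
    return 1.0 - max(proportions)

def gini_index(proportions):
    """Sum over classes of p_k * (1 - p_k)."""
    return sum(p * (1.0 - p) for p in proportions)

def cross_entropy(proportions):
    """Minus the sum over classes of p_k * log(p_k), skipping empty classes."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

pure_region = [1.0, 0.0]     # only one class present
mixed_region = [0.5, 0.5]    # evenly mixed: the worst case for two classes
pure_scores = [f(pure_region) for f in (classification_error, gini_index, cross_entropy)]
mixed_scores = [f(mixed_region) for f in (classification_error, gini_index, cross_entropy)]
```

All three are zero for a pure region and largest for an even mix, which is why smaller is better.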

Term
con of decision trees...whats the fix?
Definition
high variance. handled by averaging trees with bagging or random forests
Term
con of bagging. whats the fix?
Definition
We need de-correlated trees; unfortunately, bagging can lead to highly correlated trees, because every tree may split on the same few strong predictors. Random Forest considers only a random subset of the predictors at each split, which helps reduce the correlation between trees.
Term
Sampling in Random Forest
Definition

In Random Forest, we sample the variables (a random subset is considered at each split) as well as the data points (a bootstrap sample for each tree).

 

We also try to minimize the out-of-bag (OOB) error.
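
A small sketch of the two kinds of sampling, plus the rows left out of a bootstrap sample (the out-of-bag rows, which serve as a built-in test set for that tree). No trees are fit here; the sizes, seed, and function names are illustrative assumptions.

```python
import random

def bootstrap_with_oob(n_rows, rng):
    """Draw n_rows row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def split_candidates(n_features, m, rng):
    """Pick m distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(0)                       # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_with_oob(100, rng)
# m close to sqrt(p) is a common choice for classification.
candidates = split_candidates(n_features=9, m=3, rng=rng)
```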

Term
Unbalanced data
Definition
Unbalanced data is the case where one of the classes makes up the vast majority of the response (e.g. 99% or more).
Term
K means & Hierarchical clustering
Definition
K-means has a better execution time and works well for spherical clusters, whereas hierarchical clustering takes more time but may give more accurate results. Hierarchical clustering also does not depend on a choice of "K" up front; the dendrogram can be cut at any level afterwards.
Term
When do decision trees perform very badly?
Definition
When the true decision boundary does not split the predictor space into rectangular (axis-aligned) sub-spaces, e.g. a diagonal boundary. In this case we would get lots of errors in our predictions.
Term
Pruning
Definition
Grow the biggest possible tree and then shrink it to an optimal size based on purity conditions.
Term
What is Tuning
Definition
Optimizing the hyperparameters of a model to get the best possible performance given the predictors.
Term
2 types of Hierarchical Clustering
Definition

1) Divisive (go top down): not discussed in class

2) Agglomerative (go bottom up): discussed

Term
Classification Tree
Definition

When Building a Tree, consider: 

1) Gini Index 2) Cross Entropy

 

When Pruning a Tree, consider:

Minimizing Classification Error

Term
What are the components that make up the graph Laplacian? Explain them.
Definition

L = D - S

D = diagonal degree matrix. The i-th entry along the diagonal tells us how many nodes are connected to x_i.

S = adjacency matrix. 1s and 0s indicating whether x_i is connected to x_j.
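
A minimal sketch of building L from an adjacency matrix. The small graph (a triangle plus one isolated node) and the function name are made up for illustration.

```python
def graph_laplacian(adjacency):
    """Return L = D - S, where D holds each node's degree on its diagonal."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]        # diagonal entries of D
    return [[(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
            for i in range(n)]

# Nodes 0, 1, 2 form a triangle; node 3 is connected to nothing.
S = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0]]
L = graph_laplacian(S)
```

Every row of L sums to zero, since each degree on the diagonal is offset by that node's adjacency entries.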

Term
Explain what the normalized graph Laplacian is and how we use it in spectral clustering.
Definition
Lnorm = (D^-1)L, with L being the difference between the degree matrix D and the adjacency matrix S. Once we have Lnorm, we take the eigenvectors belonging to the k smallest eigenvalues (k is pre-specified) and use them as columns to generate an n x k matrix X. We then use K-means to cluster the rows of X.