Shared Flashcard Set

Details

Test 2
NA
77
Business
Graduate
03/21/2014

Cards

Term
What is the most general of the automatic clustering techniques
Definition
K-means clustering
Term
Cluster detection is a ___ data mining technique? Why?
Definition
undirected
It finds patterns in the data without a target variable.
Term
define a similarity cluster
Definition
grouping records into clusters based on their similarity.
E.g., credit card customers who:
maintain a high balance
use the card often
use it once in a while for a large purchase
Term
what is the cluster centroid
Definition
the average position of cluster members in each dimension
Term
___ is the most commonly used clustering algorithm
Definition
K-means.
Term
what does the "K" in K means clustering refer to?
Definition
the algorithm looks for a fixed number of clusters. K is specified by the user.
Term
the best assignment of cluster centers could be defined as:
Definition
one that minimizes the sum of the distance from every data point to its nearest cluster center
Term
k means uses an algorithm that alternates between two steps, ___ and ___
Definition
assignment and update
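
The two alternating steps can be sketched in a few lines of Python. This is only an illustration, assuming Euclidean distance and user-supplied starting centers; the function name `kmeans` and the sample points are made up for the example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Minimal k-means sketch on tuples of numbers; K is len(centers)."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, assignment

# Hypothetical data: two obvious groups, K = 2.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers, assignment = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Here `centers` ends up at the centroid of each group and `assignment` lists the cluster index of every point.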
Term
in K-means clustering, the cluster centers can be used to define a ___: describe this.
Definition
Voronoi diagram: a diagram whose lines mark the points that are equidistant from the two nearest seeds (cluster centers).
Term
how does clustering ID outliers?
Definition
It flags a record as an outlier when it lies beyond a threshold distance from the cluster.
Term
what are the two factors used when interpreting clusters?
Definition
1. what do cluster members have in common?
2. what distinguishes each cluster from the others?
Term
What are the cluster characteristics?
Definition
Diameter: maximum distance between two records in one cluster

Variance: sum of the squared distance from the centroid of cluster members

Silhouette: measure of cluster dispersion or goodness
Term
What can you do with a silhouette score?
Definition
choose an appropriate value of K

compare clusters produced by different random seeds

remove strong clusters for further analysis
Term
When clustering, you should impose a ___. Why
Definition
maximum cluster diameter. Otherwise the cluster keeps growing.
Term
what are the related variations of k means
Definition
k medians
k medoids
k modes
Term
what does k median clustering do?
Definition
looks for the set of centroids that minimizes the sum of distances from cluster members to cluster centroids.

(it tightens the grouping)
Term
k medians is less sensitive to ___
Definition
outliers
Term
what does k medoids clustering do?
Definition
like k-means, it alternates assignment and update, but each new center is the actual cluster member that best represents its cluster (the medoid).
Term
define scaling and weighting
Definition
scaling: adjusts the values of variables to take into account that different variables are measured in different units/ranges

weighting: is encoding the information that one variable is more or less important than the others
Term
scaling variables is calculated by ___
Definition
z score
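
A z score subtracts the variable's mean and divides by its standard deviation. The sketch below assumes the population standard deviation; the helper name `z_scores` and the income values are hypothetical.

```python
import statistics

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD; fails if all values are equal
    return [(v - mean) / sd for v in values]

# Hypothetical incomes measured in dollars.
incomes = [20000, 40000, 60000]
scaled = z_scores(incomes)
```

After scaling, a variable measured in dollars and one measured in years sit on comparable ranges.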
Term
what are the 3 types of clustering and their description?
Definition
K-means: Start with fixed number of clusters, and make clusters based on a criterion.

divisive: start with one cluster and keep breaking till some stopping rule is triggered

hierarchical: starts with every record in its own cluster
Term
what implements hierarchical clustering?
Definition
Ward's method
Term
how does hierarchical clustering work?
Definition
starts with every record in its own cluster and gradually merges them, forming larger groupings. Continues until all records are in one cluster
Term
what is principal component analysis used for?
Definition
finding an optimal way of combining variables into a small number of components (linear combinations of the original variables).
Term
principal components are sensitive to units, therefore you should ___
Definition
standardize the inputs
Term
What are the strengths of K-Nearest Neighbors (KNN)?
Definition
Simple and effective

doesn't make underlying distributional assumptions

fast training phase
Term
what are the weaknesses of K-nearest neighbors(KNN)?
Definition
doesn't produce a model, so no insights in relationships

slow classification phase

memory intensive

nominal and missing data need additional processing
Term
Unlike K-means clustering, with K-Nearest Neighbors there is a ___
Definition
target variable
Term

[image]

Fill in the chart for each predicted/actual variable and per arrow.

Definition
[image]
Term
What are the arguments in the kNN classification syntax?
Definition
train: a data frame containing numeric training data

test: a data frame containing numeric test data

class: factor vector with the class for each row in the training data

k: an integer indicating the number of nearest neighbors
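
To show how the four arguments fit together, here is a rough Python sketch of a k-nearest-neighbors vote. It is not the library call the card describes: `knn_predict` is a made-up name, `cl` stands in for the card's class argument (since `class` is a reserved word in Python), and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train, test, cl, k):
    """Classify each test row by majority vote of its k nearest training rows."""
    predictions = []
    for row in test:
        # Rank training rows by Euclidean distance to this test row.
        nearest = sorted(range(len(train)), key=lambda i: math.dist(row, train[i]))[:k]
        votes = Counter(cl[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Hypothetical numeric data with a two-level class label.
train = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
cl = ["low", "low", "high", "high"]
test = [(2.0, 1.5), (8.5, 9.0)]
predictions = knn_predict(train, test, cl, k=3)
```

There is no training step beyond storing the data, which is why training is fast and classification is slow.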
Term
in an artificial neural network, network topology describes:
Definition
the number of layers

number of nodes in each layer

if info is allowed to travel backward
Term
in an artificial neural network, training algorithm specifies:
Definition
how connection weights are set in order to inhibit or excite neurons in proportion to the input signal
Term
What type of data mining can artificial neural networks be used for?
Definition
classification

numeric prediction

unsupervised pattern recognition
Term
what activation functions are often used in artificial neural networks?
Definition
sigmoid activation function (output ranges from 0-1)

Radial basis function

linear activation function (results in a network similar to linear regression model)
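
The three functions can be written out directly. This is a sketch under assumptions: the radial basis function is taken to be Gaussian, and its `center` and `width` parameters are illustrative names.

```python
import math

def sigmoid(x):
    """Squashes any moderate input into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    """Passes the input through unchanged."""
    return x

def gaussian_rbf(x, center=0.0, width=1.0):
    """Peaks at 1 when x equals the center and decays with distance from it."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))
```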
Term
what are some strengths of artificial neural networks?
Definition
adapted to classification or numeric prediction problems

among most accurate modeling approaches

makes few assumptions about data's underlying relationships
Term
what are some weaknesses of artificial neural networks
Definition
computationally intensive and slow to train

easy to over- and underfit training data

difficult or impossible to interpret
Term
what is the goal of a support vector machine?
Definition
to create a flat boundary called a hyperplane, which leads to partitions of data on either side
Term
what is the maximum margin hyperplane?
Definition
a hyperplane that creates the greatest separation between two classes

support vectors are the points from each class that are closest to the maximum margin hyperplane (each class must have at least one support vector, but can have more)
Term
What is a key feature of support vector machines
Definition
the support vectors provide a very compact way to store a classification model
Term
with support vector machines with non-linear kernels, what must you do with the data?
Definition
standardize variables

convert nominal variables to dummies and ordinal to scale.
Term
what are some commonly used kernels?
Definition
linear

polynomial

sigmoid

Gaussian RBF kernel
Term
What is market basket analysis. Give an example of when it could be used
Definition
set of association rules that specify patterns of relationships among items in transactional data.

How many times peanut butter and jelly were purchased at the same time as bread
Term
how does market basket analysis form a set? What does the set mean?
Definition
with brackets {peanut butter, jelly}

it means that the item set appears in the data with some regularity
Term
what are association rules in market basket analysis used for?
Definition
They are used for unsupervised knowledge discovery in large databases, NOT for prediction.
Term
what are the strengths of market basket analysis?
Definition
good for large amounts of transactional data

results in rules that are easy to understand

good for discovering unexpected knowledge in databases
Term
what are the weaknesses of market basket analysis?
Definition
not good with small datasets

takes effort to separate the insight from the common sense

easy to draw spurious conclusions from random patterns
Term
what is the end result of the Apriori algorithm?
Definition
A reduced association rule search space. It relies on the Apriori property: all subsets of a frequent itemset must also be frequent, so any itemset with an infrequent subset can be skipped.
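
A simplified sketch of how that pruning shrinks the search, assuming transactions are sets of item names and a minimum support threshold chosen by the user. The function `frequent_itemsets` and the basket data are hypothetical, and real implementations are considerably more optimized.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search that only extends itemsets whose subsets are all frequent."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    size = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        size += 1
        previous = set(current)
        # Candidates are unions of frequent itemsets that are one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune any candidate with an infrequent subset before counting its support.
        current = [
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, size - 1))
            and support(c) >= min_support
        ]
    return frequent

# Hypothetical baskets.
baskets = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "jelly", "milk"},
]
freq = frequent_itemsets(baskets, min_support=0.4)
```

Because {jelly, milk} is infrequent here, the three-item set {bread, jelly, milk} is discarded without its support ever being counted.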
Term
how do you calculate support of an item in market basket analysis?
Definition
Support(X) = count(X)/N
count(X) = number of transactions the itemset X appears in
N = total transactions
Term
in market basket analysis, what is the definition of confidence
Definition
it is a measurement of a rule's predictive power or accuracy.
Term
for market basket analysis, describe the following arules functions:

inspect()
itemFrequency()
itemFrequencyPlot()
image()
Definition
inspect() = looks at the contents of the sparse matrix

itemFrequency() = lets you see the proportion of transactions that contain an item

itemFrequencyPlot() = allows you to produce a bar chart depicting the proportion of transactions with a certain item (shows support)

image() = helps with identification of potential data issues
Term
with market basket analysis, what are the issues with low and high confidence?
Definition
low = leads to many unreliable rules

high = leads to obvious or inevitable rules (smoke detector purchased with batteries)
Term
in market basket analysis, what is lift?
Definition
a measure of how much more likely one item is to be purchased relative to its typical purchase rate, given that you know another item has been purchased.
Term
in market basket analysis, what does a lift greater than 1 mean?
Definition
it suggests that items are found together more often than chance occurrence (it isn't by chance they are bought together)
Term
Define the following
True positive (TP)
True Negative (TN)
False positive (FP)
false negative (FN)
Definition
TP = correctly classified as the class of interest

TN = correctly classified as not the class of interest

FP = incorrectly classified as the class of interest

FN = incorrectly classified as not the class of interest
Term
sensitivity is known as ___
Definition
true positive rate
Term
specificity is known as ___
Definition
true negative rate
Term
in evaluating a model, what is a Type 1 error
Definition
When you predict a positive but the actual value is negative (a false positive). Known as crying wolf.
Term
in evaluating a model, what is a Type 2 error
Definition
When you predict a negative but the actual value is positive (a false negative). AKA missed detection.
Term
What is a kappa statistic.
Describe the agreement rates
Definition
it adjusts accuracy by accounting for the possibility of a correct prediction by chance alone.

Max value = 1
.8-1.0 = very good agreement
.6-.8 = good agreement
.4-.6 = moderate agreement
.2-.4 = fair agreement
less than .2 = poor agreement
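
As an illustration, kappa can be computed from a two-class confusion matrix. The sketch assumes the usual Cohen's kappa form, (observed agreement - chance agreement) / (1 - chance agreement); the function name and the counts are made up.

```python
def kappa(tp, tn, fp, fn):
    """Cohen's kappa for a two-class confusion matrix (assumed formula)."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n  # plain accuracy
    # Agreement expected by chance, from the actual and predicted class proportions.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    return (observed - expected) / (1 - expected)

# Hypothetical counts: 80% accuracy, with 50% agreement expected by chance.
k = kappa(tp=40, tn=40, fp=10, fn=10)
```

With these counts the accuracy of 0.8 shrinks to a kappa of about 0.6 once chance agreement is removed.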
Term
what is an F score?
Definition
a measure of model performance that combines precision and recall into a single number
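
One common way to combine the two is the harmonic mean (often called F1). The sketch assumes that balanced form; the card itself does not say which weighting is meant, and the function name is illustrative.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall (the balanced F1 form, an assumption)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision 0.75, recall 0.5.
f = f_score(tp=30, fp=10, fn=30)
```

The harmonic mean sits closer to the lower of the two values, so a model cannot score well by excelling at only one of them.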
Term
what is the goal of text analysis
Definition
Term
with text analytics, what is a tag.
Definition
it refers to information associated with a text document. It is info ABOUT the document, but not part of the document itself (metadata).
Term
During text analysis, how does the bag-of-words approach function?
Definition
It examines each word individually and without context (not tied into another word, or part of a sentence)
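
A small sketch of the idea, assuming lowercase English text split on non-letter characters. The helper name `bag_of_words` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count each word on its own, ignoring order and context."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

bag = bag_of_words("The service was good. The food was not good.")

# Word order is lost: these two sentences produce the same bag.
same = bag_of_words("dog bites man") == bag_of_words("man bites dog")
```

Note that "not good" contributes one count for "not" and one for "good", which shows how context is dropped.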
Term
what are the commonly used ways of text analysis.
Definition
bag of words

natural language processing
Term
What is the end product of text analytics?
Definition
word cloud
Term
What is sentiment analysis?
Definition
it is text analytics with a purpose.

use of text measures to learn about the past and make predictions about the future.

opinion mining
Term
What are some design text measures (that work)?
Definition
list based

item-weighted

models for text classification

training and test regimen in evaluation
Term
with sentiment analysis, what are the two lists that are computed?
Definition
Positive: the % of words in the review that match up with the positive word list

Negative: the % of words in the review that match up with the negative word list
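
The two percentages could be computed along these lines. The word lists and the review are invented for the example and are far shorter than any real sentiment lexicon.

```python
import re

# Hypothetical word lists; real lists would be much longer.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def sentiment_scores(review):
    """Return the percentage of words matching each list (review must not be empty)."""
    words = re.findall(r"[a-z']+", review.lower())
    pos = 100.0 * sum(w in POSITIVE for w in words) / len(words)
    neg = 100.0 * sum(w in NEGATIVE for w in words) / len(words)
    return pos, neg

pos, neg = sentiment_scores(
    "Great food, good service, but terrible parking and slow checkout"
)
difference = pos - neg  # positive minus negative
```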
Term
In sentiment analysis, there are six measures and modeling techniques. What are they?
Definition
Simple difference: difference scores (positive minus negative scores)

regression difference: use linear regression to determine the weights for combining positive and negative scores to predict ratings.

word/item analysis: use original 50 words and training data to ID positive/negative leaning words. Then +1/-1 accordingly.

Logistic regression: stepwise logistic regression to select useful predictors from the set of 50 sentiment words.

Support vector machines: effective technique in text classification problems with large numbers of explanatory variables

Random forests: ensemble method that uses thousands of tree structured classifiers to arrive at a single prediction
Term
How do you calculate support when targeting a single item during transactions?
Definition
Support(X) = count(X)/N
X = targeted item that appeared in purchases
N = total transactions
Term
How do you calculate confidence when targeting items purchased together during transactions?
Definition
Confidence(X, Y) = Support(X, Y)/Support(X)
Support(X, Y) = support of the targeted items X and Y purchased together
Support(X) = support of the individual item X
Term
How do you calculate lift for market basket analysis?
Definition
Lift(X, Y) = Confidence(X, Y)/Support(Y)
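
The support, confidence, and lift formulas chain together, as this sketch on a handful of invented transactions shows. Itemsets are passed as Python sets; the function names simply mirror the formulas.

```python
# Hypothetical transactions.
transactions = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "jelly", "milk"},
]

def support(itemset, transactions):
    """Share of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Support of X and Y together, divided by the support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X with Y, divided by the support of Y."""
    return confidence(x, y, transactions) / support(y, transactions)

s = support({"bread"}, transactions)
c = confidence({"bread"}, {"jelly"}, transactions)
l = lift({"bread"}, {"jelly"}, transactions)
```

In this toy data bread appears in 4 of 5 baskets, jelly accompanies bread half of the time, and the lift comes out above 1.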
Term
how do you calculate specificity?
Definition
(TN)/(TN+FP)
Term
how do you calculate sensitivity?
Definition
(TP)/(TP+FN)
Term
How do you calculate precision?
Definition
(TP)/(TP+FP)
Term
How do you calculate accuracy?
Definition
(TP+TN)/(TP+TN+FP+FN)
Term
How do you calculate error rate?
Definition
1-Accuracy
Accuracy = (TP+TN)/(TP+TN+FP+FN)
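
The evaluation formulas on these cards can be gathered into one helper. This is a sketch with invented counts; `confusion_metrics` is not a library function.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the card formulas from the four confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
    }

# Hypothetical counts for 100 test records.
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
```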