Term
3 factors behind the sudden popularity of data mining

Definition
a. It is now cheaper to store and process data, and improved hardware provides the ability to collect and accumulate more data
b. Increased database capacities and the availability of data analysis tools made companies realize they have untapped data and the tools to analyze it
c. Consolidating customer-level data from various sources in a data warehouse gives the ability to analyze from a more complete view


Term
6 examples of applications of data mining 

Definition
a. Identify successful therapies for illnesses and discover new drugs
b. Reduce fraudulent behavior (Insurance Claims, Credit Card Usage)
c. Identify customer buying patterns
d. Reclaim profitable customers
e. Aid in market-basket analysis
f. Better target customers/clients 


Term
_____ is used to describe knowledge discovery in databases. 

Definition
Data mining


Term
Data Mining uses ___, ___, and other techniques to extract and identify useful information and subsequent knowledge from large databases. 

Definition


Term
Data mining is also referred to as:
1. _____
2. _____
3. _____
4. _____
5. _____ 

Definition
1. Knowledge extraction
2. Data archaeology
3. Data exploration
4. Data dredging
5. Information harvesting 


Term
Data mining finds ___ and defines them in terms of mathematical rules. Those rules can then be used for prediction or association in an attempt to aid in decision making. 

Definition
Patterns


Term
Data mining algorithms fall into four broad categories:
1. ____ - find the commonly co-occurring groupings of things
2. ____ - tell the nature of future occurrences of certain events based on what has happened in the past
3. ____ - identify natural groupings of things based on their known characteristics
4. ____ - discover time-ordered events

Definition
1. Associations
2. Predictions
3. Clusters
4. Sequential Relationships 


Term
Two other data mining procedures
are __ __
and __ __ __ 

Definition
Data Visualization
Time Series Forecasting 


Term
__ __ are the most common of all data mining approaches 

Definition
Classification Procedures 


Term
Classification involves identifying patterns of data as belonging to a certain ____. Examples:
a. Credit Approval
b. Store Location
c. Target Marketing
d. Fraud Detection
e. Telecommunications
f. Route or Segmentation Decisions 

Definition


Term
The Basic Idea:
1. Define the ___
2. Use the data to develop a __ model
3. Use that model to predict unknown outcomes for __ __ 

Definition
Data
Mathematical
Future Observations 


Term
If the outcome (Y) is categorical, and the predictors (Xs) are either categorical or numeric, you would use a _____ 

Definition


Term
If the outcome (Y) is categorical, and the predictors (Xs) are all numeric and have normal distributions and equal variances, then you would use ___ ___ ___ 

Definition
linear discriminant analysis 


Term
If the outcome (Y) is continuous numeric, and the predictors (Xs) are numeric with normal distributions and equal variances, then you would use
____ ____ ____ 

Definition


Term
Organizations must use a standardized approach for conducting a ____ project.

Definition
Data mining


Term
Some proposed industry-standard models for data mining are:
1. _____ - one of the most popular non-proprietary standard methodologies for data mining
2. ____ - ordinarily used in manufacturing, service delivery, management, and other business activities that rely on eliminating defects and waste and on quality control.
3. ___ - developed by the SAS Institute.

Definition
1. CRISP-DM
2. Six Sigma
3. SEMMA


Term
CRISP-DM stands for: ___
There are 6 steps of the CRISP-DM Model:
1. ____
2. ____
3. ____
4. ____
5. ____
6. ____


Definition
Cross-Industry Standard Process for Data Mining
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment 


Term

Definition
Define, Measure, Analyze, Improve, Control 


Term

Definition
Sample, Explore, Modify, Model, Assess 


Term
____ places observations (rows, customers, students, etc.) into groups such that the members share similar characteristics but the groups themselves are highly different 

Definition
Cluster Analysis


Term
Cluster Analysis is different from ___ analysis in that the groups are unknown and created in cluster analysis, whereas the groups are distinct and known when conducting a __ analysis.

Definition
Classification
Classification 


Term
Market Segmentation is a common application of __ __ 

Definition
Cluster Analysis


Term
Market Segmentation is used to understand the ________ 

Definition
buyer behavior of customers 


Term
Market Segmentation is used to help retailers in targeting similar groups of customers for defining the _______ 

Definition
appropriate advertising campaign


Term
Association Analysis is aimed at establishing relationships between ___ 

Definition
items (variables, columns) 


Term
The goal of ___ is to group variables that are similar. 

Definition


Term
A common application of __ analysis is Market Basket Analysis 

Definition
Association


Term
____ - the semi-automatic process of extracting patterns from large amounts of unstructured data sources

Definition
Text mining


Term
Some of the most popular text mining analyses discussed in class are:
a. ____
b. ____
c. ____
d. ____ 

Definition
Summarization
Categorization/Classification
Clustering
Concept Linking (Association) 


Term
The most basic form of text mining used for summarization is ____ 

Definition


Term
The ______ matrix is used for Categorization/Classification, Clustering, and Concept Linking. 

Definition


Term
_____ maps unstructured information (in the form of a document of words) into a structured format (in the form of a feature/term vector) or a concept. 

Definition


Term
A __ vector, or __ vector, is a weighted list of words which defines a concept that describes unstructured information (document of words) 

Definition
Feature
Term


Term
Steps to creating a feature vector:
1. Eliminate ___
2. Replace words with their _ or _
3. Consider __ and __
4. Calculate the __ of the remaining terms 

Definition
articles and other stop words (the, and, other, etc.)
stems/roots
Synonyms and Phrases
Weights 
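The first two steps above can be sketched in Python. This is only a minimal illustration: the stop-word list, the suffix rules, and the names `STOP_WORDS`, `SUFFIXES`, `crude_stem`, and `preprocess` are assumptions made for this example, not something given in class. Synonyms, phrases, and weights (steps 3 and 4) are left out.

```python
import re

# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "other", "of", "to", "in", "is"}
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Step 1: drop stop words. Step 2: replace each word with a crude stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


terms = preprocess("The miners mined the data and other records")
print(terms)
```

A real project would likely use an established stemmer and stop-word list instead of these hand-written rules.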


Term
To get the "TF" factor (term frequency), divide ___ by ___ 

Definition
Frequency
Total words left over 
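As a small worked example of the TF calculation above, the sketch below divides each term's frequency by the total number of words left over. The function name `term_frequencies` and the sample list of remaining terms are hypothetical.

```python
from collections import Counter


def term_frequencies(terms):
    """TF per term: its frequency divided by the total words left over."""
    total = len(terms)
    if total == 0:
        return {}
    counts = Counter(terms)
    return {term: count / total for term, count in counts.items()}


# Hypothetical terms remaining after stop words are removed and words are stemmed.
remaining = ["data", "mine", "data", "pattern"]
tf = term_frequencies(remaining)
print(tf["data"])  # "data" appears 2 times out of 4 remaining words
```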


Term
A ______ matrix is created where the ROWS represent the documents and the COLUMNS represent the terms (excluding stop terms), and the frequencies represent the number of times a term appears in a particular document

Definition
Term-document


Term
The text mining process can be defined in _ consecutive tasks.
1. Establish the ___
2. Create the _____
3. Extract the _____ 

Definition
1. Establish the corpus
2. Create the term-document matrix
3. Extract the knowledge 
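Step 2 above can be sketched as follows, assuming the corpus has already been reduced to lists of cleaned terms. Rows are documents, columns are terms, and each cell is the number of times the term appears in that document. The function name `term_document_matrix` and the tiny corpus are made up for illustration.

```python
from collections import Counter


def term_document_matrix(docs):
    """Rows are documents, columns are terms, cells are raw term counts."""
    vocabulary = sorted({term for doc in docs for term in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical corpus of two already-cleaned documents.
corpus = [
    ["data", "mine", "data"],
    ["text", "mine"],
]
vocabulary, matrix = term_document_matrix(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```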


Term
The largest data/text repository is ___ 

Definition
The Web


Term
Examples of information found on the web:
a. Whose __ __ is linked to which other pages
b. How many people have ___ on their own website to other websites
c. How a particular site is ___
d. Tracking __ to a site, __ on a search engine, __ on e-commerce sites

Definition
home page
hyperlinks
organized
visitors, searches, transactions 


Term
___ - the discovery of relationships from web data

Definition
Web mining


Term
The 3 areas of web mining:
1. Web __ mining
2. Web __ mining
3. Web __ mining 

Definition
1. Content
2. Structure
3. Usage


Term
Web __ mining extracts and uses content found within web pages. 

Definition
Content


Term
Web __ mining extracts useful information from the analysis of links found in web documents 

Definition
Structure


Term
Web __ mining extracts and uses information that is generated through web page visits, traffic, transactions, etc. 

Definition
Usage


Term
Web content mining is similar to ___ mining 

Definition
Text


Term
Web usage mining uses ____ data, which provides a trail of the user's activity and shows the user's browsing patterns: which sites are visited, pages accessed, time spent per page/site, etc. 

Definition
Clickstream


Term
Formulas:
1. Which predictor is best (given alpha)?
Compare the p-value of each advertising type (radio/newspaper) to the given alpha. p-value < alpha = good predictor; p-value > alpha = bad predictor.
2. Predict weekly sales (make sure adjusted R-squared is between 0 and 1):
Y = intercept coeff + (coeff * X1) + (coeff * X2), where Y = predicted weekly sales and each X = money spent per week on that type of advertising.
3. Calculate the LCF (linear classification function):
LCF0 or LCF1 = constant + (coeff * X1) + (coeff * X2) + (coeff * X3), where the coefficients come from the classification analysis output. The column to use depends on whether you are calculating LCF0 or LCF1: use column 0 for LCF0 and column 1 for LCF1. The Xn values are given.
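The three formulas can be sketched in Python as below. All coefficient values are invented for illustration; on an exam they come from the regression and classification output. The final step in `classify`, assigning the observation to the class with the larger LCF score, is the usual rule for classification functions and is an assumption here, since the notes do not state it. The function names are also hypothetical.

```python
def is_good_predictor(p_value, alpha):
    """Formula 1: a predictor is good when its p-value is below alpha."""
    return p_value < alpha


def linear_score(constant, coefficients, xs):
    """Formulas 2 and 3: constant + (coeff * X1) + (coeff * X2) + ..."""
    return constant + sum(c * x for c, x in zip(coefficients, xs))


def classify(lcf0, lcf1, xs):
    """Compute LCF0 and LCF1, then pick the class with the larger score.

    lcf0 and lcf1 are (constant, coefficients) pairs taken from columns
    0 and 1 of the classification output. The 'larger score wins' rule
    is an assumption, not something stated in the notes.
    """
    score0 = linear_score(lcf0[0], lcf0[1], xs)
    score1 = linear_score(lcf1[0], lcf1[1], xs)
    return 1 if score1 > score0 else 0


# Hypothetical numbers, for illustration only.
print(is_good_predictor(p_value=0.01, alpha=0.05))
weekly_sales = linear_score(10.0, [2.0, 0.5], [3.0, 4.0])
print(weekly_sales)
predicted_class = classify((-1.0, [0.5, 0.2]), (-3.0, [1.0, 0.4]), [4.0, 5.0])
print(predicted_class)
```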


Definition

