Shared Flashcard Set

Details

ISDS 2001 Test 4
ISDS 2001 Catanzaro Test 4 Ch 4-5 Data Mining
48
Computer Science
Undergraduate 2
04/03/2012

Cards

Term

3 factors behind the sudden popularity in data mining

Definition

a.       It is now cheaper to store and process data, and improved hardware provides the ability to collect and accumulate more data

b.      Increased database capacities and availability of data analysis tools made companies realize they have untapped data and the tools to analyze it

c.       Consolidating data at the customer level and from various sources in a data warehouse gives the ability to analyze it from a more complete view

Term

6 examples of applications of data mining

Definition

a.       Identify successful therapies for illnesses & to discover new drugs

b.      Reduce fraudulent behavior (Insurance Claims, Credit Card Usage)

c.       Identify customer buying patterns

d.      Reclaim profitable customers

e.       Aid in market-basket analysis

f.       Better target customers/clients

Term
_____ is used to describe knowledge discovery in databases.
Definition
Data mining
Term
Data Mining uses ___, ___, and other techniques to extract and identify useful information and subsequent knowledge from large databases.
Definition

Statistical

Mathematical

Term

Data mining is also referred to as:

1. _____

2. _____

3. _____

4. _____

5. _____

Definition

1. Knowledge extraction

2. Data archaeology

3. Data exploration

4. Data dredging

5. Information harvesting

Term
Data mining finds ___ and defines them in terms of mathematical rules. Those rules can then be used for prediction or association in an attempt to aid in decision making.
Definition

patterns

 

 

Term

Data mining algorithms fall into four broad categories:

 

1. ____ - find the commonly co-occurring groupings of things

2. ____ - tell the nature of future occurrences of certain events based on what has happened in the past

3. ____ - Identify natural groupings of things based on their known characteristics

4. ____ - discover time-ordered events

Definition

1. Associations

2. Predictions

3. Clusters

4. Sequential Relationships

Term

Two other data mining procedures are __ __ and __ __ __

Definition

Data Visualization

Time Series Forecasting

Term
__ __ are the most common of all data mining approaches
Definition
Classification Procedures
Term

Classification involves identifying patterns of data as belonging to a certain ____. Examples:

a. Credit Approval

b. Store Location

c. Target Marketing

d. Fraud Detection

e. Telecommunications

f. Route or Segmentation Decisions

Definition
Category
Term

The Basic Idea:

1. Define the ___

2. Use the data to develop a __ model

3. Use that model to predict unknown outcomes for __ __

Definition

Data

Mathematical

Future Observations

Term
If the outcome (Y) is categorical, and the predictors (Xs) are either categorical or numeric, you would use a _____
Definition
decision tree
Term
If the outcome (Y) is categorical, and the predictors (Xs)  are all numeric and have normal distributions and equal variances, then you would use ___ ___ ___
Definition
linear discriminant analysis
Term

If the outcome (Y) is continuous numeric, and the predictors (Xs) are numeric with normal distributions and equal variances, then you would use

____ ____ ____

Definition
linear regression
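
The three cards above amount to a small decision rule, which can be sketched in Python. This is only an illustration: the function name and the boolean flag are hypothetical, and preferring linear discriminant analysis whenever its assumptions hold is a choice made for this sketch, not something the cards state.

```python
def choose_method(outcome_type, xs_numeric_normal_equal_var):
    """Return the technique the cards suggest for a data situation.

    outcome_type: "categorical" or "continuous" (type of the outcome Y).
    xs_numeric_normal_equal_var: hypothetical flag, True when every
        predictor X is numeric with a normal distribution and equal variances.
    """
    if outcome_type == "categorical":
        if xs_numeric_normal_equal_var:
            return "linear discriminant analysis"
        # Predictors may be categorical or numeric.
        return "decision tree"
    if outcome_type == "continuous" and xs_numeric_normal_equal_var:
        return "linear regression"
    return None  # situation not covered by these cards


print(choose_method("categorical", False))
print(choose_method("continuous", True))
```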
Term
Organizations must use a standardized approach for conducting a ____ project.
Definition
data mining
Term

Some proposed industry-standard models for data mining are:

1. _____ - one of the most popular non-proprietary standard methodologies for data mining

2. ____ - Ordinarily used in manufacturing, service delivery, management, and other business activities that rely on quality control and on eliminating defects and waste.

3. ___ - developed by the SAS Institute.

Definition

CRISP-DM

 

DMAIC

 

SEMMA

Term

CRISP-DM stands for: ___

There are 6 steps of the CRISP-DM Model:

1. ____

2. ____

3. ____

4. ____

5. ____

6. ____

 

Definition

Cross-Industry Standard Process for Data Mining

 

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

Term
DMAIC stands for: ___
Definition
Define, Measure, Analyze, Improve, Control
Term
SEMMA stands for: ____
Definition
Sample, Explore, Modify, Model, Assess
Term
____ places observations (rows, customers, students, etc.) into groups such that the members share similar characteristics but the groups themselves are highly different
Definition
Cluster Analysis
Term
Cluster Analysis is different from ___ analysis in that the groups are unknown and created in cluster analysis, whereas the groups are distinct and known when conducting a __ analysis.
Definition

Classification

 

Classification

Term
Market Segmentation is a common application of __ __
Definition
Cluster Analysis
Term
Market Segmentation is used to understand the ________
Definition
buyer behavior of customers
Term
Market Segmentation is used to help retailers in targeting similar groups of customers for defining the _______
Definition
appropriate advertising campaign
Term
Association Analysis is aimed at establishing relationships between ___
Definition
items (variables, columns)
Term
The goal of ___ is to group variables that are similar.
Definition
Association
Term
A common application of __ analysis is Market Basket Analysis
Definition
Association
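
As a rough illustration of Market Basket Analysis, the sketch below counts how often pairs of items appear in the same basket. The baskets and the function name are made up for the example, and the share of baskets containing a pair is reported under its usual name, support, which these cards do not define.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one shopping basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def pair_support(baskets):
    """Return the share of baskets in which each item pair occurs together."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(baskets)
print(support[("bread", "milk")])
```

Pairs with high support are candidates for association rules; a full analysis would also weigh how strongly one item predicts the other.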
Term
____ - the semiautomatic process of extracting patterns from large amounts of unstructured data
Definition
Text Mining
Term

Some of the most popular text mining analyses discussed in class are:

a. ____

b. ____

c. ____

d. ____

Definition

Summarization

Categorization/Classification

Clustering

Concept Linking (Association)

Term
The most basic form of text mining used for summarization is ____
Definition
Term Extraction
Term
The ___-___ matrix is used for Categorization/Classification, Clustering, and Concept Linking.
Definition
Term-Document
Term
_____ maps unstructured information (in the form of a document of words) into a structured format (in the form of a feature/term vector) or a concept.
Definition
Text Mining
Term
A __ vector, or __ vector, is a weighted list of words which defines a concept that describes unstructured information (document of words)
Definition
Feature (term) vector
Term

Steps to creating a feature vector:

1. Eliminate ___

2. Replace words with their _ or _

3. Consider __ and __

4. Calculate the __ of the remaining terms

Definition

articles (the, and, other, etc)

 

stems/roots

 

Synonyms and Phrases

 

Weights

Term
To get the "TF" factor (term frequency), divide ___ by ___ 
Definition

Frequency

 

Total words left over
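
The TF calculation can be sketched in a few lines of Python. The word list is a hypothetical document that has already had its articles removed and its words stemmed; the function name is also an assumption made for this example.

```python
def term_frequencies(words):
    """Map each term to (its frequency) / (total words left over)."""
    total = len(words)
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {term: n / total for term, n in counts.items()}


# Hypothetical document after stop-word removal and stemming.
remaining = ["data", "mine", "find", "pattern", "data"]
tf = term_frequencies(remaining)
print(tf["data"])  # "data" appears 2 times out of 5 words left over
```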

Term
A ___-___ matrix is created where the ROWS represent the documents and the COLUMNS represent the terms (excluding stop terms), and the frequencies represent the number of times a term appears in a particular document
Definition
Term-Document Matrix
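
A minimal sketch of building such a matrix, assuming whitespace tokenization and a tiny hypothetical list of excluded terms (neither comes from the course material):

```python
# Hypothetical list of terms to exclude from the columns.
EXCLUDED_TERMS = {"the", "and", "a", "of"}


def term_document_matrix(docs):
    """Return (terms, matrix): one row per document, one column per term."""
    tokenized = [
        [w for w in doc.lower().split() if w not in EXCLUDED_TERMS]
        for doc in docs
    ]
    # Columns are the distinct remaining terms, in alphabetical order.
    terms = sorted({w for words in tokenized for w in words})
    # Each cell counts how often the column's term appears in the row's document.
    matrix = [[words.count(t) for t in terms] for words in tokenized]
    return terms, matrix


docs = [
    "the web and the data",
    "data mining of web data",
]
terms, matrix = term_document_matrix(docs)
print(terms)
for row in matrix:
    print(row)
```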
Term

The text mining process can be defined in _ consecutive tasks.

1. Establish the ___

2. Create the _____

3. Extract the _____

Definition

1. Establish the corpus

2. Create the term-document matrix

3. Extract the knowledge

Term
The largest data/text repository is ___
Definition
the web
Term

Examples of information found on the web:

a. Whose __ __ is linked to which other pages

b. How many people have ___ on their own website to other websites

c. How a particular site is ___

d. Tracking __ to a site, __ on a search engine, __ on e-commerce sites

Definition

home page

 

hyperlinks

 

organized

 

visitors, searches, transactions

Term
___ - the discovering of relationships from web data
Definition
Web Mining
Term

The 3 areas of web mining:

1. Web __ mining

2. Web __ mining

3. Web __ mining

Definition

Content

 

Structure

 

Usage

Term
Web __ mining extracts and uses content found within web pages.
Definition
Content
Term
Web __ mining extracts useful information from the analysis of links found in web documents
Definition
Structure
Term
Web __ mining extracts and uses information that is generated through web page visits, traffic, transactions, etc.
Definition
Usage
Term
Web content mining is similar to ___ mining
Definition
text
Term
Web usage mining uses ____ data, which provides a trail of the user's activity and shows the user's browsing patterns: which sites are visited, pages accessed, time spent per page/site, etc.
Definition
Clickstream
Term

Formulas:

1. Which predictor is best (given alpha)?

Compare P-Value of the type (radio/newspaper) to the given alpha.
 p-value < alpha = good predictor
 p-value > alpha = bad predictor.

 

2. Predict weekly sales (make sure Adj R-sq is between 0 and 1)
Y = intercept coeff + (coeff * X1) + (coeff * X2)
where
Y = predicted weekly sales
Xn = money spent per week on each type of advertising

 

3. Calculate LCF (linear classification function)
 LCF1 (or LCF0) = constant + (coeff * X1) + (coeff * X2) + (coeff * X3)
 WHERE
 -Coeff values are listed under classification analysis. The column to use
  depends on whether you are calculating LCF0 or LCF1: use column 1 for LCF1
  and column 0 for LCF0.
 -Xn = given

 

Definition
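
The three formulas can be worked through with a short Python sketch. Every number below is invented for illustration, and the function names are hypothetical. The last step, assigning the observation to the class with the larger LCF, is the usual rule for linear classification functions but is an assumption here, since the card does not state it.

```python
def is_good_predictor(p_value, alpha):
    """Formula 1: a predictor is good when its p-value is below alpha."""
    return p_value < alpha


def linear_score(constant, coeffs, xs):
    """Formulas 2 and 3: constant + (coeff * X1) + (coeff * X2) + ..."""
    return constant + sum(c * x for c, x in zip(coeffs, xs))


def classify(lcf0, lcf1):
    """Assumed rule: pick the class whose LCF is larger."""
    return 1 if lcf1 > lcf0 else 0


# 1. Hypothetical p-value compared against a given alpha.
print(is_good_predictor(0.01, 0.05))

# 2. Hypothetical intercept, coefficients, and weekly advertising spend.
weekly_sales = linear_score(100, [2, 3], [10, 20])
print(weekly_sales)

# 3. Hypothetical constants and coefficients from columns 0 and 1.
xs = [1, 2, 3]
lcf0 = linear_score(-5, [1, 0, 2], xs)
lcf1 = linear_score(-8, [2, 1, 1], xs)
print(lcf0, lcf1, classify(lcf0, lcf1))
```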