Term
Data mining

Definition
the process of extracting patterns from data; an increasingly important tool for turning data into information.


Term
Data mining: the confluence of disciplines 

Definition
statistics, visualization, algorithms, pattern recognition, machine learning, database technology 


Term
data mining and business intelligence 

Definition
pyramid layers, top to bottom: decision making ^ data presentation (visualization) ^ data mining (info discovery) ^ data exploration (statistical summary, querying, reporting) ^ data preprocessing/integration ^ data sources (paper, files, web docs, scientific experiments)


Term
Prediction methods

Definition
use some variables to predict unknown or future values of other variables


Term
Description methods

Definition
find human-interpretable patterns that describe the data


Term
Classification
Definition
predictive. goal: previously unseen records should be assigned a class as accurately as possible 


Term

Definition


Term
Classification Applications 

Definition
direct marketing: reduce the cost of mailing by targeting a set of consumers likely to buy a new product.
fraud detection: predict fraudulent cases in credit card transactions (e.g., American Express used this approach)
customer attrition/churn: predict whether a customer is likely to be lost to a competitor
sky survey cataloging: predict the class of sky objects, especially visually faint ones, based on telescopic survey images


Term
Set

Definition
an arbitrary collection of objects (A = {a, b, c})
union, intersection, subset, and difference are in notes week 2
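
The four set operations named above can be tried directly with Python's built-in set type. This is only an illustrative sketch; the sets A and B are made-up examples, not taken from the week 2 notes.

```python
# Hypothetical example sets; the names A and B are illustrative only.
A = {"a", "b", "c"}
B = {"b", "c", "d"}

union = A | B                # elements in A or in B (or both)
intersection = A & B         # elements in both A and B
difference = A - B           # elements in A but not in B
is_subset = {"a", "b"} <= A  # True when every element is also in A
```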


Term
Cartesian product

Definition
the set of all possible ordered pairs (x, y)


Term

Definition
a function in Matlab must be in this form:
function [output_parameter_list] = function_name(input_parameter_list) 


Term
Clustering

Definition
descriptive. given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that data points in one cluster are more similar to one another and data points in separate clusters are less similar to one another.


Term
Clustering applications
Definition
market segmentation: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
document clustering: to find groups of docs that are similar to each other based on the important terms appearing in them. 


Term
Association rule discovery 

Definition
descriptive. Given a set of records, each of which contains some number of items from a given collection, produce dependency rules which will predict occurrences of an item based on occurrences of other items.


Term
Association rule discovery applications 

Definition
marketing and sales promotion: find out what to put on sale to boost sales of another item
supermarket shelf management: to identify items that are bought together by sufficiently many customers
inventory management: a consumer appliance repair company wants to anticipate the nature of repairs on its customer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
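
As a rough illustration of the shelf-management idea (items bought together by sufficiently many customers), the sketch below counts how often each pair of items shares a transaction. The function name `frequent_pairs`, the threshold `min_count`, and the baskets are all hypothetical; this shows only the counting step and is not a full association rule algorithm.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_count):
    """Return item pairs that occur together in at least min_count transactions."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) gives every unordered pair one canonical form
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Made-up baskets for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
pairs = frequent_pairs(baskets, min_count=3)
```

With these baskets, only beer and diapers appear together three times, so that is the one pair kept.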


Term
Sequential pattern discovery 

Definition
descriptive. given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints


Term
Regression

Definition
predict the value of a continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency


Term
Deviation/Anomaly detection 

Definition
detect significant deviations from normal behavior 


Term
Deviation/Anomaly detection Applications 

Definition
credit card fraud and network intrusion detection. 


Term
Data
Definition
a collection of facts from which conclusions may be drawn
a collection of data objects and their attributes 


Term
Attribute

Definition
a property or characteristic of an object (columns)
a collection of attributes describes a data point


Term
Attribute values
Definition
numbers or symbols assigned to an attribute 


Term

Definition
the way an attribute is measured may not match the attribute's properties


Term
Types of attributes and examples 

Definition
Nominal: ID numbers, eye color, zip codes
Ordinal: rankings, grades, height in (tall, medium, short)
Interval: calendar dates, temperatures
Ratio: length, time, counts 


Term
properties of attribute values 

Definition
distinctness: = and ≠ (equal or not equal to)
order: < and >
addition: + and -
multiplication: * and /


Term
Nominal definition and properties 

Definition
provides only enough info to distinguish one object from another
Distinctness


Term
Ordinal definition and properties 

Definition
provide enough info to order objects
Distinctness and order 


Term
Interval definition and properties 

Definition
the differences between values are meaningful (i.e. a unit of value exists)
distinctness, order, and addition 


Term
Ratio definition and properties 

Definition
both differences and ratios are meaningful
Distinctness, order, addition, multiplication 


Term
Discrete attribute

Definition
has only a finite or countably infinite set of values (zip codes, counts, set of words in a collection of docs)
often represented as integer variables


Term
Continuous attribute

Definition
has real numbers as attribute values (temp, height, weight)
continuous attributes are typically represented as floating-point variables


Term
Types of data sets
Definition
Record: data matrix, doc data, transaction data
Graph: WWW, molecular structures
Ordered: spatial data, temporal data, sequential data, genetic sequence data. 


Term
Characteristics of structured data 

Definition
Dimensionality: curse of dimensionality (as the number of dimensions grows, the data become increasingly sparse in the space they occupy and hard to analyze)
Sparsity: only presence counts
Resolution: patterns depend on the scale
Attribute and class imbalance: small number of non-zero elements (related to sparsity)


Term
Record data
Definition
data that consists of a collection of records, each of which consists of a fixed set of attributes 


Term
Data matrix

Definition
if data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. This data can be represented in an m-by-n matrix where m = rows (one for each object) and n = columns (one for each attribute)


Term
Document data

Definition
each document becomes a term vector; each term is a component (attribute) of the vector; the value of the component is the number of times the corresponding term occurs in the document
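
A minimal sketch of building such a term vector, assuming whitespace tokenization and a fixed, hand-picked vocabulary (both are simplifications; the helper name `term_vector` is hypothetical):

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(document.lower().split())
    # One component per vocabulary term; terms that never occur get 0.
    return [counts[term] for term in vocabulary]

vocab = ["data", "mining", "pattern"]
vec = term_vector("Data mining finds pattern in data", vocab)
```

Here `vec` has one entry per term in `vocab`, in the same order.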


Term
Transaction data
Definition
a special type of record data where each record (transaction) involves a set of items (ex, a grocery store purchase per customer) 


Term

Definition


Term
Load data from excel files 

Definition


Term
Load data from text files 

Definition
use textscan and related functions


Term

Definition


Term
Reading custom file types 

Definition
use fopen, fclose, fgetl, fgets for text files
use fread, fwrite, fseek, ftell for binary files


Term
Data quality problems (examples) 

Definition
noise and outliers, missing values, duplicate data


Term
Noise
Definition
refers to modification of original values (static on television) 


Term
Outliers
Definition
data objects with characteristics that are considerably different than most of the other data objects in the data set 


Term
Missing values
Definition
reasons for missing values: information not collected or attributes may not be applicable to all cases
Handling missing values: eliminate data objects or estimate missing values 
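
The two handling strategies can be sketched as follows, assuming missing entries are marked with `None` and that estimation means substituting the mean of the observed values (one common choice among several). The function names are hypothetical.

```python
def drop_missing(records):
    """Eliminate data objects that contain any missing (None) value."""
    return [r for r in records if None not in r]

def impute_mean(values):
    """Estimate missing (None) entries with the mean of the observed ones.

    Assumes at least one value is observed.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

Dropping is simplest but discards data; estimating keeps every object at the cost of introducing values that were never measured.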


Term
Duplicate data

Definition
data set may include objects that are duplicates (this is a major issue when merging data from heterogeneous sources)


Term
Data cleaning
Definition
process of dealing with noise and duplicate data 


Term
Similarity

Definition
numerical measure of how alike two data points are; is higher when objects are more alike; often falls in the range [0,1]


Term
Dissimilarity

Definition
numerical measure of how different two data points are; lower when objects are more alike; minimum dissimilarity is often 0; upper limit varies


Term
Proximity
Definition
refers to both similarity and dissimilarity 


Term
Euclidean distance

Definition
the Minkowski distance with r = 2:
$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} (p_k - q_k)^2 \right)^{1/2}$
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
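
A small sketch of this distance in plain Python, assuming p and q are equal-length sequences of numbers (the function name `euclidean` is illustrative):

```python
import math

def euclidean(p, q):
    """Square root of the summed squared differences over all n attributes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

d = euclidean([0, 0], [3, 4])
```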


Term
Minkowski distance

Definition
Minkowski distance is a generalization of Euclidean distance
$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
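
The role of the parameter r can be seen in a short sketch. It assumes equal-length numeric sequences and treats an infinite r as the limiting case, where the distance reduces to the largest single-attribute difference; the function name is illustrative.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance with parameter r (r >= 1, or math.inf)."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        # Limit r -> infinity: only the largest difference survives.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)
```

With r = 2 this gives the Euclidean distance; r = 1 sums the absolute differences.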


Term

Definition


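Term
Supremum (L_max) distance

Definition
the Minkowski distance in the limit r → ∞; it equals the largest difference between corresponding attributes
$\mathrm{dist}(p, q) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} = \max_{k} |p_k - q_k|$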


Term
Simple matching coefficients 

Definition
= the number of matches/ number of attributes 


Term
Jaccard coefficient

Definition
number of 1-1 matches / number of attributes where the two values are not both zero
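
The difference between the simple matching coefficient and this ratio shows up clearly on sparse binary vectors. The sketch below assumes both inputs are equal-length lists of 0s and 1s; the function names and the example vectors are illustrative.

```python
def smc(x, y):
    """Simple matching coefficient: matching attributes / all attributes."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    """1-1 matches divided by attributes that are not both zero."""
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Returning 0.0 for two all-zero vectors is a convention chosen here.
    return both_one / not_both_zero if not_both_zero else 0.0

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
```

For these vectors the many shared zeros push the simple matching coefficient up to 0.7, while the 1-1 based ratio is 0 because the vectors share no 1s.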


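Term
Cosine similarity

Definition
$\cos(d_1, d_2) = \dfrac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
where $d_1 \cdot d_2$ is the vector dot product and $\|d\|$ is the length of vector d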
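
A plain-Python sketch of this measure, assuming d1 and d2 are equal-length numeric vectors and neither is all zeros (a zero vector would cause a division by zero); the function name is illustrative:

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)
```

Vectors pointing in the same direction score 1 regardless of their lengths, which is why the measure suits document term vectors of different sizes.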


Term
Summary statistics
Definition
numbers that summarize properties of the data.
properties include: frequency, location, and spread 


Term
Frequency

Definition
frequency of an attribute value is the percentage of time the value occurs in the data set


Term
Mode
Definition
the mode of an attribute is the most frequent attribute value 


Term
Mean
Definition
the most common measure of the location of a set of m points (sensitive to outliers) 


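Term
Median

Definition
middle number in ordered data. With the m values sorted as x_(1) ≤ ... ≤ x_(m):
if m is odd (m = 2r + 1): $\mathrm{median}(x) = x_{(r+1)}$
if m is even (m = 2r): $\mathrm{median}(x) = \frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right)$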
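
The odd and even cases can be sketched as below, assuming a non-empty list of numbers. Python lists are 0-based, so the (r+1)-th sorted value sits at index r.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if m is even."""
    x = sorted(values)
    m = len(x)
    r = m // 2
    if m % 2 == 1:
        # m = 2r + 1: the single middle value, at 0-based index r
        return x[r]
    # m = 2r: average the r-th and (r+1)-th values (indices r-1 and r)
    return (x[r - 1] + x[r]) / 2
```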


Term
Range
Definition
the difference between the max and min 


Term
Variance / standard deviation
Definition
the most common measure of the spread of a set of points (notes for equation) 


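Term
Correlation

Definition
measures the linear relationship between objects
to compute, standardize p and q, then take their dot product:
$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{stdev}(p)$, $q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{stdev}(q)$
$\mathrm{correlation}(p, q) = \dfrac{p' \cdot q'}{n - 1}$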
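
A sketch of the computation, assuming stdev is the sample standard deviation (divisor n - 1), p and q have the same length n ≥ 2, and neither is constant; the function name is illustrative.

```python
import math

def correlation(p, q):
    """Standardize p and q, then divide their dot product by n - 1."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    # Sample standard deviations (divisor n - 1) are an assumption here.
    std_p = math.sqrt(sum((x - mean_p) ** 2 for x in p) / (n - 1))
    std_q = math.sqrt(sum((x - mean_q) ** 2 for x in q) / (n - 1))
    p_std = [(x - mean_p) / std_p for x in p]
    q_std = [(x - mean_q) / std_q for x in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)
```

A perfectly increasing linear relationship gives a value of 1 and a perfectly decreasing one gives -1, up to floating-point rounding.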


Term
K-nearest neighbor algorithm (KNN)

Definition
a simple algorithm that stores all available data points (examples) and classifies new data points based on a similarity measure


Term

Definition
belongs to the class of "lazy" algorithms: there is no process of learning a model; examples are simply stored as the data is collected. The difficulty comes at the classification stage, where we need to calculate n distances and find the best K data points
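
The lazy behavior can be sketched in a few lines: nothing happens until a query arrives, and then all n distances are computed. This sketch assumes Euclidean distance (via `math.dist`, Python 3.8+), majority voting among the K nearest examples, and a full sort rather than a faster selection; the function name and data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(examples, query, k):
    """examples: list of (point, label) pairs. Returns the majority label
    among the k stored examples closest to the query point."""
    # All n distances are computed at classification time.
    by_distance = sorted(examples, key=lambda ex: math.dist(ex[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training examples for illustration only.
training = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
label = knn_classify(training, (1, 0), k=3)
```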


Term
Data exploration
Definition
a preliminary exploration of data to better understand its characteristics 


Term
Visualization
Definition
the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
one of the most powerful and appealing techniques of data exploration 


Term
Representation
Definition
the mapping of info to a visual format 


Term
Arrangement
Definition
the placement of visual elements within a display 


Term
Selection

Definition
the elimination or the de-emphasis of certain objects and attributes


Term
Histogram

Definition
usually shows the distribution of values of a single variable. Divide the values into bins and show a bar plot of the number of objects in each bin; the height of each bar indicates the number of objects. The shape of the histogram depends on the number of bins
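
The binning step can be sketched without any plotting library. The sketch assumes equal-width bins spanning the minimum to the maximum value and at least two distinct values; the function name is illustrative.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value sits on the right edge; keep it in the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
three_bins = histogram(values, 3)
two_bins = histogram(values, 2)
```

Running it on the same values with two and with three bins gives different bar heights, which is the dependence on bin count noted above.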


Term
Box plot

Definition
invented by J. Tukey. Another way of displaying the distribution of data


Term
Scatter plot

Definition
attribute values determine the position. Two-dimensional scatter plots are the most common. Additional attributes can often be displayed by using the size, shape, and color of the markers that represent the objects


Term
Contour plot

Definition
useful when a continuous attribute is measured on a spatial grid. They partition the plane into regions of similar values; the contour lines that form the boundaries of these regions connect points with equal values


Term
Matrix plot

Definition
can plot the data matrix; can be useful when objects are sorted according to class


Term

Definition
Parallel coordinates: used to plot the attribute values of high-dimensional data. Instead of using perpendicular axes, use a set of parallel axes.
Star plots: similar approach to parallel coordinates, but the axes radiate from a central point. The line connecting the values of an object is a polygon.
Chernoff faces: created by Herman Chernoff; associates each attribute with a characteristic of a face. The values of each attribute determine the appearance of the corresponding facial characteristic.


Term
Multidimensional measure of data quality 

Definition
accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility


Term
Data cleaning
Definition
fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies 


Term
Data integration
Definition
integration of multiple databases or files 


Term
Data transformation
Definition
normalization and aggregation 


Term
Data reduction
Definition
obtains reduced representation in volume but produces the same or similar analytical results 


Term
Data discretization
Definition
part of data reduction but with particular importance, esp for numerical data 


Term
Data cleaning tasks
Definition
fill in missing values, identify outliers and smooth out noisy data, correct inconsistent data, resolve redundancy caused by data integration 


Term
Attribute transformation
Definition
a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values 


Term
Types of sampling: simple random sampling 

Definition
there's an equal probability of selecting any particular item 


Term
Types of sampling: sampling without replacement 

Definition
as each item is selected, it's removed from the population 


Term
Types of sampling: sampling with replacement 

Definition
objects not removed from the population as they are selected for the sample. (the object can be picked up more than once) 


Term
Types of sampling: stratified sampling 

Definition
split the data into several partitions; then draw random samples from each partition
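
The sampling types in the cards above can be sketched with Python's `random` module. The function names, the `key` function that assigns each item to a partition, and the choice of drawing the same number of items from each partition are assumptions made for illustration.

```python
import random

def sample_without_replacement(items, n, rng):
    """Each selected item is removed from the pool, so no item repeats."""
    return rng.sample(items, n)

def sample_with_replacement(items, n, rng):
    """Items stay in the pool, so the same item can be picked more than once."""
    return [rng.choice(items) for _ in range(n)]

def stratified_sample(items, key, n_per_group, rng):
    """Partition items by key(item), then draw randomly from each partition."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(n_per_group, len(members))))
    return sample
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the draws repeatable.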


Term
Feature subset selection

Definition
another way to reduce the dimensionality of data
targets redundant features and irrelevant features


Term
Feature subset selection techniques 

Definition
brute force approach: try all possible feature subsets as input to data mining algorithm
embedded approaches: feature selection occurs naturally as part of the data mining algorithm
filter approaches: features are selected before the data mining algorithm is run
wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes 


Term
Feature creation
Definition
create new attributes that can capture the important info in a data set much more efficiently than the original attributes 


Term

Definition
mapping data to a new space 

