Shared Flashcard Set

Details

i211 midterm exam
advanced comp programming
95
Computer Science
Undergraduate 3
02/23/2010

Cards

Term
data mining
Definition
the process of extracting patterns from data. an increasingly important tool used to turn data into information.
Term
Data mining: the confluence of disciplines
Definition
statistics, visualization, algorithms, pattern recognition, machine learning, database technology
Term
data mining and business intelligence
Definition
decision making
^
data presentation (visualization)
^
data mining (info discovery)
^
data exploration (statistical summary, querying, reporting)
^
data preprocessing/integration
^
data sources (paper, files, web docs, scientific experiments)
Term
Predictive methods
Definition
use some variables to predict unknown or future values of other variables
Term
Descriptive methods
Definition
find human-interpretable patterns that describe the data
Term
Classification
Definition
predictive. goal: previously unseen records should be assigned a class as accurately as possible
Term
training set
Definition
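a collection of records, each containing a set of attributes, one of which is the class. it is used to build the model that assigns classes to previously unseen records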
Term
Classification Applications
Definition
direct marketing: reduce the cost of mailing by targeting a set of consumers likely to buy a new product.

fraud detection: predict fraudulent cases in credit card transactions (e.g., American Express used this approach)

customer attrition/churn: predict whether a customer is likely to be lost to a competitor

sky survey cataloging: predict the class of sky objects, especially visually faint ones, based on telescopic survey images
Term
set
Definition
an arbitrary collection of objects
(A= {a,b,c})

union, intersection, subset, and difference are in the week 2 notes
Term
Cartesian product AxB
Definition
the set of all ordered pairs (x,y) where x is in A and y is in B
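
As a small illustration, the Cartesian product can be built in Python with the standard-library itertools.product. The sets A and B below are made-up examples, not taken from the course notes.

```python
from itertools import product

# Hypothetical example sets (not from the course notes).
A = {"a", "b", "c"}
B = {1, 2}

# A x B: every ordered pair (x, y) with x taken from A and y taken from B.
AxB = set(product(A, B))

# The product has |A| * |B| pairs, and order inside a pair matters:
# ("a", 1) is in A x B, but (1, "a") is not.
print(len(AxB))
```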
Term
Matlab function
Definition
a function in Matlab must be in this form:

function [output_parameter_list] = function_name(input_parameter_list)
Term
clustering
Definition
descriptive. given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
-data points in one cluster are more similar to one another
-data points in separate clusters are less similar to one another
Term
Clustering Applications
Definition
market segmentation: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

document clustering: to find groups of docs that are similar to each other based on the important terms appearing in them.
Term
Association rule discovery
Definition
descriptive. Given a set of records, each of which contains some number of items from a given collection, produce dependency rules which will predict occurrences of an item based on occurrences of other items.
Term
Association rule discovery applications
Definition
marketing and sales promotion: find out what to put on sale to boost sales for another item

supermarket shelf management: identify items that are bought together by sufficiently many customers

inventory management: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
Term
Sequential pattern discovery
Definition
descriptive. given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

-rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints
Term
regression
Definition
predictive. predict the value of a continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
Term
Deviation/Anomaly detection
Definition
detect significant deviations from normal behavior
Term
Deviation/Anomaly detection Applications
Definition
credit card fraud and network intrusion detection.
Term
data
Definition
a collection of facts from which conclusions may be drawn

a collection of data objects and their attributes
Term
attribute
Definition
a property or characteristic of an object (columns)

a collection of attributes describes a data point
Term
attribute values
Definition
numbers or symbols assigned to an attribute
Term
Measure of length
Definition
the way you measure an attribute may not match the attribute's properties
Term
Types of attributes and examples
Definition
Nominal: ID numbers, eye color, zip codes

Ordinal: rankings, grades, height in {tall, medium, short}

Interval: calendar dates, temperatures in Celsius or Fahrenheit

Ratio: length, time, counts
Term
properties of attribute values
Definition
distinctness: equal or not equal to

order: < >

addition: + -

multiplication: * /
Term
Nominal definition and properties
Definition
provides only enough info to distinguish one object from another

distinctness
Term
Ordinal definition and properties
Definition
provide enough info to order objects

Distinctness and order
Term
Interval definition and properties
Definition
the differences between values are meaningful (i.e. a unit of value exists)

distinctness, order, and addition
Term
Ratio definition and properties
Definition
both differences and ratios are meaningful

Distinctness, order, addition, multiplication
Term
Discrete attribute
Definition
has only a finite or countably infinite set of values (zip codes, counts, set of words in a collection of docs)

-often represented as integer variables
Term
Continuous attribute
Definition
has real numbers as attribute values (temp, height, weight)

-continuous attributes are typically represented as floating-point variables
Term
Types of data sets
Definition
Record: data matrix, doc data, transaction data

Graph: WWW, molecular structures

Ordered: spatial data, temporal data, sequential data, genetic sequence data.
Term
Characteristics of structured data
Definition
Dimensionality: curse of dimensionality (as the number of dimensions grows, the data become increasingly sparse in the space they occupy, which makes them hard to analyze)

Sparsity: only presence counts

Resolution: patterns depend on the scale

Attribute and class imbalance: small number of non-zero elements (related to sparsity)
Term
Record data
Definition
data that consists of a collection of records, each of which consists of a fixed set of attributes
Term
Data matrix
Definition
if data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. This data can be represented in an m-by-n matrix where m=rows (one for each object) and n=columns (one for each attribute)
Term
Document data
Definition
each document becomes a term vector
-each term is a component (attribute) of the vector
-the value of the component is the number of times the corresponding term occurs in the document
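
A minimal Python sketch of turning one document into a term vector. The vocabulary, the sample sentence, and the whitespace-and-lowercase tokenization are assumptions made for illustration; the helper name term_vector is hypothetical.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    # Assumption: terms are separated by whitespace and compared in lowercase.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document.
vocabulary = ["data", "mining", "pattern"]
doc = "Data mining turns data into information"

# Each component is the number of times that term occurs in the document.
vector = term_vector(doc, vocabulary)
print(vector)
```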
Term
Transaction data
Definition
a special type of record data where each record (transaction) involves a set of items (ex, a grocery store purchase per customer)
Term
Load data from websites
Definition
use urlread function
Term
Load data from excel files
Definition
use xlsread function
Term
Load data from text files
Definition
use textscan and related functions
Term
Load data from CSV files
Definition
use csvread function
Term
Reading custom file types
Definition
use fopen, fclose, fgetl, fgets for text files
use fread, fwrite, fseek, ftell for binary files
Term
Data quality problems (examples)
Definition
-noise and outliers
-missing values
-duplicate data
Term
Noise
Definition
refers to modification of original values (static on television)
Term
Outliers
Definition
data objects with characteristics that are considerably different than most of the other data objects in the data set
Term
Missing values
Definition
reasons for missing values: information not collected or attributes may not be applicable to all cases

Handling missing values: eliminate data objects or estimate missing values
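
The two handling strategies can be sketched in Python as follows. Marking a missing value with None and estimating it with the mean of the observed values are assumptions for illustration; other estimates are possible, and the function names are hypothetical.

```python
def drop_missing(records):
    """Eliminate every data object (tuple) that has a missing attribute."""
    return [record for record in records if None not in record]

def impute_mean(values):
    """Estimate missing values of one attribute with the mean of the rest."""
    present = [v for v in values if v is not None]
    # Assumption: at least one value is present, otherwise this divides by zero.
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

# Hypothetical records: (height, weight), with None marking a missing value.
records = [(170, 65), (None, 80), (182, 77)]
complete_only = drop_missing(records)
filled_heights = impute_mean([170, None, 182])
```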
Term
Duplicate data
Definition
data set may include objects that are duplicates (this is major issue when merging data from heterogeneous sources)
Term
Data cleaning
Definition
process of dealing with noise and duplicate data
Term
Similarity
Definition
-numerical measure of how alike two data points are
-is higher when objects are more alike
-often falls in the range [0,1]
Term
Dissimilarity
Definition
-numerical measure of how different two data points are
-lower when objects are more alike
-minimum dissimilarity is often 0
-upper limit varies
Term
Proximity
Definition
refers to both similarity and dissimilarity
Term
Euclidean distance
Definition
Minkowski distance with r=2:

$$\mathrm{dist}(p,q) = \left( \sum_{k=1}^{n} (p_k - q_k)^2 \right)^{1/2}$$

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Term
Minkowski distance
Definition
minkowski distance is a generalization of euclidean distance

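$$\mathrm{dist}(p,q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$

where r is a parameter and n is the number of dimensions (attributes)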
Term
Manhattan distance
Definition
Minkowski distance with r=1:

$$\mathrm{dist}(p,q) = \sum_{k=1}^{n} |p_k - q_k|$$
Term
Lmax distance
Definition
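Minkowski distance as r goes to infinity; it equals the largest difference between any single pair of components:

$$\mathrm{dist}(p,q) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} = \max_{k} |p_k - q_k|$$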
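
The Euclidean, Manhattan, and Lmax cards are all special cases of the Minkowski distance, so one small Python function can cover them. This is a sketch under the assumption that p and q are equal-length numeric sequences; the function name and the sample points are made up.

```python
def minkowski(p, q, r):
    """Minkowski distance between equal-length sequences p and q."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if r == float("inf"):
        # Limit case (Lmax): the largest single-component difference.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

# Hypothetical two-dimensional points.
p, q = (0, 0), (3, 4)
manhattan = minkowski(p, q, 1)             # |3| + |4|
euclidean = minkowski(p, q, 2)             # sqrt(3^2 + 4^2)
l_max = minkowski(p, q, float("inf"))      # max(|3|, |4|)
```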
Term
Simple matching coefficients
Definition
SMC = number of matching attribute values / number of attributes
Term
Jaccard coefficients
Definition
J = number of 1-1 matches / number of attributes where the two values are not both zero
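
A short Python sketch comparing the simple matching coefficient and the Jaccard coefficient on two binary vectors. The vectors are invented for illustration, and the function names are hypothetical.

```python
def smc(x, y):
    """Simple matching coefficient: matches (1-1 and 0-0) / number of attributes."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / attributes that are not both zero."""
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: at least one attribute is non-zero, otherwise this divides by zero.
    return both_one / not_both_zero

# Hypothetical sparse binary vectors: mostly 0-0 matches, no 1-1 matches.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc_xy = smc(x, y)          # 7 matching positions out of 10
jaccard_xy = jaccard(x, y)  # 0 shared ones out of 3 not-both-zero positions
```

The two vectors share no items, yet the simple matching coefficient is high because it counts the many 0-0 matches; the Jaccard coefficient ignores them.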
Term
Cosine similarity
Definition
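$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

where d_1 · d_2 is the dot product of the two vectors and ||d|| is the length of vector d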
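
A minimal Python sketch of cosine similarity, assuming d1 and d2 are equal-length numeric vectors that are not all zeros. The function name and test vectors are illustrative only.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    # Assumption: neither vector is all zeros, otherwise this divides by zero.
    return dot / (norm1 * norm2)

# Vectors pointing the same way score 1; perpendicular vectors score 0.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
perpendicular = cosine_similarity([1, 0], [0, 1])
```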
Term
Summary statistics
Definition
numbers that summarize properties of the data.

-properties include: frequency, location, and spread
Term
Frequency
Definition
frequency of an attribute value is the percentage of time the value occurs in the data set
Term
Mode
Definition
the mode of an attribute is the most frequent attribute value
Term
Mean
Definition
the most common measure of the location of a set of m points (sensitive to outliers)
Term
Median
Definition
middle value of the ordered data x(1) <= x(2) <= ... <= x(m).

if m is odd (m=2r+1), median = x(r+1)
if m is even (m=2r), median = (x(r) + x(r+1)) / 2
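
The odd and even cases can be sketched in Python as below. The sample data are invented; they also show why the mean is described as sensitive to outliers while the median is not.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)
    m = len(ordered)
    r = m // 2
    if m % 2 == 1:
        # m = 2r + 1: the (r+1)-th smallest value, index r when counting from 0.
        return ordered[r]
    # m = 2r: average of the r-th and (r+1)-th smallest values.
    return (ordered[r - 1] + ordered[r]) / 2

# Hypothetical data with one outlier (100).
data = [1, 2, 3, 4, 100]
data_mean = sum(data) / len(data)   # pulled toward the outlier
data_median = median(data)          # unaffected by the outlier
```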
Term
Range
Definition
the difference between the max and min
Term
STDEV
Definition
the most common measure of the spread of a set of points (see notes for the equation)
Term
Correlation
Definition
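measures the linear relationship between the attributes of two objects

to compute, first standardize p and q:

$$p'_k = \frac{p_k - \mathrm{mean}(p)}{\mathrm{stdev}(p)}, \qquad q'_k = \frac{q_k - \mathrm{mean}(q)}{\mathrm{stdev}(q)}$$

then take the dot product and divide by n-1:

$$\mathrm{correlation}(p,q) = \frac{p' \cdot q'}{n-1}$$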
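
A Python sketch of the standardize-then-dot-product recipe. It assumes stdev is the sample standard deviation (dividing by n-1), which is what makes the final division by n-1 give a value between -1 and 1; the function names and data are illustrative.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    """Sample standard deviation (assumption: divide by n - 1)."""
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / (len(xs) - 1))

def correlation(p, q):
    """Standardize p and q, then take their dot product divided by n - 1."""
    n = len(p)
    mean_p, std_p = mean(p), stdev(p)
    mean_q, std_q = mean(q), stdev(q)
    p_std = [(x - mean_p) / std_p for x in p]
    q_std = [(x - mean_q) / std_q for x in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)

# A perfect increasing linear relationship and a perfect decreasing one.
positive = correlation([1, 2, 3], [2, 4, 6])
negative = correlation([1, 2, 3], [3, 2, 1])
```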
Term
K-nearest neighbor algorithm (KNN)
Definition
a simple algorithm that stores all available data points (examples) and classifies new data points based on a similarity measure
Term
Properties of KNN
Definition
-belongs to the class of "lazy" algorithms. there is no process of learning a model. examples are simply stored as the data is collected
-difficulty comes at the classification stage. we need to calculate n distances and find the best K data points
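
A minimal KNN sketch in Python. Using Euclidean distance as the (dis)similarity measure and a majority vote among the K nearest examples are assumptions; the cards only say that classification is based on a similarity measure. The training points and labels are made up.

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """Classify `query` by majority vote among its k nearest training examples.

    `training` is a list of (point, label) pairs; nothing is learned up front,
    all n distances are computed at classification time.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical stored examples: two groups in the plane.
training = [
    ((0, 0), "a"), ((0, 1), "a"),
    ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b"),
]
near_origin = knn_classify(training, (0.5, 0.5), k=3)
near_five = knn_classify(training, (5, 5), k=3)
```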
Term
Data exploration
Definition
a preliminary exploration of data to better understand its characteristics
Term
Visualization
Definition
the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

one of the most powerful and appealing techniques of data exploration
Term
Representation
Definition
the mapping of info to a visual format
Term
Arrangement
Definition
the placement of visual elements within a display
Term
Selection
Definition
the elimination or the de-emphasis of certain objects and attributes
Term
Histogram
Definition
-usually shows the distribution of values of a single variable
-divide the values into bins and show a bar plot of the number of objects in each bin
-the height of each bar indicates the number of objects
-shape of it depends on the number of bins
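
The binning step can be sketched in Python as follows. Equal-width bins spanning the minimum to the maximum value are an assumption, as are the function name and the sample data.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of `num_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    # Assumption: hi > lo, otherwise the bin width is zero.
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# The same hypothetical data looks different with 2 bins and with 4 bins.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
two_bins = histogram(data, 2)
four_bins = histogram(data, 4)
```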
Term
Box plots
Definition
-invented by J. Tukey
-another way of displaying the distribution of data
Term
Scatter plots
Definition
-attribute values determine the position
-2 dimensional scatter plots most common
-often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
Term
Contour plots
Definition
-useful when a continuous attribute is measured on a spatial grid
-they partition the plane into regions of similar values
- the contour lines that form the boundaries of these regions connect points with equal values
Term
Matrix plots
Definition
-can plot the data matrix
-can be useful when objects are sorted according to class
Term
Visualization techniques
Definition
Parallel coordinates: used to plot the attribute values of high-dimensional data. Instead of using perpendicular axes, use a set of parallel axes.

Star plots: similar approach to parallel coords but axes radiate from a central point. The line connecting the values of an object is a polygon.

Chernoff faces: created by Herman Chernoff. Each attribute is associated with a characteristic of a face, and the value of each attribute determines the appearance of the corresponding facial characteristic.
Term
Multi-dimensional measure of data quality
Definition
accuracy
completeness
consistency
timeliness
believability
value added
interpretability
accessibility
Term
data cleaning
Definition
fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Term
data integration
Definition
integration of multiple databases or files
Term
data transformation
Definition
normalization and aggregation
Term
data reduction
Definition
obtains reduced representation in volume but produces the same or similar analytical results
Term
data discretization
Definition
part of data reduction but with particular importance, esp for numerical data
Term
Data cleaning tasks
Definition
fill in missing values, identify outliers and smooth out noisy data, correct inconsistent data, resolve redundancy caused by data integration
Term
Attribute transformation
Definition
a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
Term
Types of sampling: simple random sampling
Definition
there's an equal probability of selecting any particular item
Term
Types of sampling: sampling without replacement
Definition
as each item is selected, it's removed from the population
Term
Types of sampling: sampling with replacement
Definition
objects not removed from the population as they are selected for the sample. (the object can be picked up more than once)
Term
Types of sampling: stratified sampling
Definition
split the data into several partitions; then draw random samples from each partition
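
The sampling types on the preceding cards can be sketched with Python's standard random module. The function names, the fixed seed, and the choice to draw the same number of items from each partition are assumptions for illustration.

```python
import random

def sample_without_replacement(population, n, rng):
    """Each selected item is removed, so no item can appear twice."""
    return rng.sample(population, n)

def sample_with_replacement(population, n, rng):
    """Items stay in the population, so the same item can be picked again."""
    return [rng.choice(population) for _ in range(n)]

def stratified_sample(records, key, n_per_stratum, rng):
    """Partition records by key(record), then sample randomly within each partition."""
    strata = {}
    for record in records:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(n_per_stratum, len(group))))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
without = sample_without_replacement(list(range(10)), 5, rng)
with_repl = sample_with_replacement(["x", "y", "z"], 20, rng)

# Hypothetical imbalanced data: 8 records of class "a", 2 of class "b".
records = [("a", i) for i in range(8)] + [("b", i) for i in range(2)]
stratified = stratified_sample(records, key=lambda r: r[0], n_per_stratum=2, rng=rng)
```

Drawing from each partition separately guarantees that the rare class "b" is represented, which simple random sampling of four records would not.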
Term
feature subset selection
Definition
another way to reduce dimensionality of data

removes redundant or irrelevant features
Term
Feature subset selection techniques
Definition
brute force approach: try all possible feature subsets as input to data mining algorithm

embedded approaches: feature selection occurs naturally as part of the data mining algorithm

filter approaches: features are selected before the data mining algorithm is run

wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes
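
A rough Python sketch of the brute force approach. The score function is a hypothetical stand-in for running the data mining algorithm on a feature subset and measuring how well it does; the feature names and the toy scoring rule are invented.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of features (2**n - 1 subsets for n features)."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def brute_force_select(features, score):
    """Try every subset and keep the one with the highest score."""
    return max(all_feature_subsets(features), key=score)

# Hypothetical features and a toy score: "age" and "income" are useful,
# and every extra feature carries a small penalty.
features = ["age", "id", "income"]

def toy_score(subset):
    return ("age" in subset) + ("income" in subset) - 0.1 * len(subset)

subsets = list(all_feature_subsets(features))
best = brute_force_select(features, toy_score)
```

The number of subsets doubles with each added feature, which is why brute force is only practical for a handful of features.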
Term
Feature creation
Definition
create new attributes that can capture the important info in a data set much more efficiently than the original attributes
Term
methodologies
Definition
mapping data to a new space