Shared Flashcard Set

Details

Title

Data Mining

Description

n/a

Total Cards

Subject

Computer Science

Level

Graduate

Created

11/27/2011

Term

What is a Type I error (false positive)?

Definition

A Type I error classifies a module as high risk when it is actually low risk. It is the less serious of the two errors in module classification.

Term

What is a Type II error (false negative)?

Definition

A Type II error classifies a module as low risk when it is actually high risk. It is the more important of the two errors.
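
The two error types can be counted directly from actual and predicted labels. This is only a minimal sketch: the labels "high" and "low", the helper name `error_rates`, and the sample data are assumptions made for illustration, not part of the original card.

```python
def error_rates(actual, predicted):
    """Return (type_i_rate, type_ii_rate) for "high" / "low" risk labels."""
    pairs = list(zip(actual, predicted))
    # Type I: actually low risk, but classified as high risk.
    type_i = sum(a == "low" and p == "high" for a, p in pairs)
    # Type II: actually high risk, but classified as low risk.
    type_ii = sum(a == "high" and p == "low" for a, p in pairs)
    n_low = sum(a == "low" for a in actual)
    n_high = sum(a == "high" for a in actual)
    return type_i / n_low, type_ii / n_high


# Hypothetical labels for six modules.
actual = ["low", "low", "low", "low", "high", "high"]
predicted = ["low", "high", "low", "low", "high", "low"]
rates = error_rates(actual, predicted)
```

Here one of the four low-risk modules is flagged as high risk (Type I) and one of the two high-risk modules is missed (Type II).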

Term

Similarities between logistic regression and multiple linear regression.

Definition

The right-hand sides of the two equations are very similar: both are a linear combination of the independent variables.

Term

Differences between logistic regression and multiple linear regression.

Definition

Multiple linear regression is used to predict a value. Logistic regression is used to predict a class.
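
The similarity and the difference can be seen side by side in a short sketch. The coefficients, the 0.5 threshold, and the function names below are hypothetical and chosen only for illustration; both models share the same linear predictor, and logistic regression passes it through the logistic function to get a class probability.

```python
import math


def linear_predictor(x, intercept, coefs):
    # Shared right-hand side: b0 + b1*x1 + ... + bn*xn
    return intercept + sum(b * xi for b, xi in zip(coefs, x))


def linear_regression_predict(x, intercept, coefs):
    # Multiple linear regression: the linear predictor is the predicted value.
    return linear_predictor(x, intercept, coefs)


def logistic_regression_predict(x, intercept, coefs, threshold=0.5):
    # Logistic regression: map the linear predictor to a probability,
    # then turn the probability into a class (1 or 0).
    p = 1.0 / (1.0 + math.exp(-linear_predictor(x, intercept, coefs)))
    return (1 if p >= threshold else 0), p


x = [2.0, 1.0]                      # hypothetical independent variables
value = linear_regression_predict(x, -1.0, [0.5, 0.25])
label, prob = logistic_regression_predict(x, -1.0, [0.5, 0.25])
```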

Term

Data splitting

Definition

Split the data into two parts: one for training and the other for testing.
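
A minimal sketch of data splitting, assuming a random shuffle and a one-third test fraction (the card does not state a proportion; `split_data`, the fraction, and the seed are illustrative choices):

```python
import random


def split_data(rows, test_fraction=1 / 3, seed=0):
    """Shuffle a copy of rows and return (training_rows, test_rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_data(list(range(30)))
```

Every row ends up in exactly one of the two parts, so the test data is never seen while the model is built.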

Term

re-substitution

Definition

Build a model using the training (fit) data set, then substitute that same data set back into the model and measure its accuracy and other performance metrics.

Term

subsequent project

Definition

You have one software project and one or more subsequent software projects. Use the first project to build the model, then use the subsequent projects as the test data.

Term

cross-validation

Definition

Split the data into 10 parts. Use 9 parts to build the model and the remaining part to test it. Repeat this 10 times, so that each part serves as the test set once, then combine the results.
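
The procedure can be sketched as follows. The "model" here is a deliberately trivial majority-class predictor, and the data set and function names are hypothetical; the point is only the 10-way split, the repeated build/test cycle, and the combined result.

```python
from collections import Counter


def k_fold_splits(rows, k=10):
    """Yield (train, test) pairs; each of the k parts is the test set once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        yield train, test


def cross_validated_accuracy(rows, k=10):
    correct = 0
    for train, test in k_fold_splits(rows, k):
        # "Build the model" on 9 parts: predict the most common training class.
        majority = Counter(label for _, label in train).most_common(1)[0][0]
        # Test on the held-out part and accumulate the results.
        correct += sum(label == majority for _, label in test)
    return correct / len(rows)


# Hypothetical data set: 50 modules, every fifth one fault prone (FP).
rows = [(i, "NFP" if i % 5 else "FP") for i in range(50)]
cv_accuracy = cross_validated_accuracy(rows)
```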

Term

Precision (efficiency)

Definition

The proportion of modules that are actually fault prone (correctly guessed) out of all the modules that were classified as fault prone.

Term

Recall (effectiveness)

Definition

The number of fault-prone modules that were correctly predicted, divided by the total number of modules that are actually fault prone.
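
Precision and recall share the same numerator and differ only in the denominator. A minimal sketch, with hypothetical labels (FP = fault prone, NFP = not fault prone) and an illustrative helper name:

```python
def precision_recall(actual, predicted, positive="FP"):
    """Return (precision, recall) for the fault-prone class."""
    correct = sum(a == positive and p == positive
                  for a, p in zip(actual, predicted))
    classified_fp = sum(p == positive for p in predicted)  # precision denominator
    actually_fp = sum(a == positive for a in actual)       # recall denominator
    return correct / classified_fp, correct / actually_fp


# Hypothetical labels for eight modules.
actual = ["FP", "FP", "FP", "NFP", "NFP", "NFP", "NFP", "NFP"]
predicted = ["FP", "FP", "NFP", "FP", "NFP", "NFP", "NFP", "FP"]
precision, recall = precision_recall(actual, predicted)
```

Two of the four modules classified as fault prone really are fault prone (precision 2/4), and two of the three actually fault-prone modules were found (recall 2/3).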

Term

For linear regression models, among the selection methods greedy, M5, and no selection, which method will use the greatest number of independent variables?

Definition

No selection, because none of the variables are pruned.

Term

An over-fitted numeric prediction or classification model

Definition

The model performs very well on the training data but very badly on the test data (the model is too good to be true).
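
One way to see this effect is a model that simply memorizes its training data. The sketch below assumes a 1-nearest-neighbor classifier on a single attribute; the data and function names are hypothetical. Evaluated on its own training data the model looks perfect, while on separate test data it does much worse.

```python
def predict_1nn(train, x):
    # Return the label of the single closest training row.
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train, rows):
    hits = sum(predict_1nn(train, x) == label for x, label in rows)
    return hits / len(rows)


# Hypothetical (attribute value, class) pairs with some noisy labels.
train = [(1, "NFP"), (2, "NFP"), (3, "FP"), (4, "NFP"),
         (8, "FP"), (9, "FP"), (10, "NFP"), (11, "FP")]
test = [(2.9, "NFP"), (10.2, "FP"), (1.2, "NFP"), (8.6, "FP")]

training_accuracy = accuracy(train, train)  # each row is its own neighbor
test_accuracy = accuracy(train, test)
```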

Term

#4
look on doc

Definition

class(xi) = NFP if (NFP/FP) > c
            FP otherwise

A mean value of 0.75 means 75% of the modules are NFP and 25% are FP, so
NFP/FP = 0.75/0.25 = 3

(a) c = 2: 3 > 2, so NFP
(b) c = 5: 3 < 5, so FP
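
The rule can be checked with a few lines of code. This is a sketch under the assumption that the values 2 and 5 in parts (a) and (b) are the thresholds c being compared with the ratio 3; the function name is illustrative.

```python
def classify(p_nfp, c):
    """Apply the rule: NFP if (NFP / FP) > c, otherwise FP."""
    ratio = p_nfp / (1.0 - p_nfp)   # 0.75 / 0.25 = 3
    return "NFP" if ratio > c else "FP"


result_a = classify(0.75, c=2)   # ratio 3 exceeds 2
result_b = classify(0.75, c=5)   # ratio 3 does not exceed 5
```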

Term

(5)

Definition

(A)
K ? ? 20
  falls between C and E
  # of faults: C = 0, E = 1; (0 + 1)/2 = 0.5
L ? ? 50
  falls between I and J
  # of faults: I = 7, J = 10; (7 + 10)/2 = 8.5
M ? ? 38
  falls between H and I
  # of faults: H = 5, I = 7; (5 + 7)/2 = 6
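
Part (A) predicts the number of faults of an unknown module as the average over its two closest known modules. In the sketch below the fault counts for C, E, H, I, and J come from the card, but the attribute values assigned to those modules are hypothetical placeholders (the original data table is not shown), picked only so that 20, 50, and 38 fall between the same pairs of modules.

```python
def predict_faults(modules, x, k=2):
    """Average the fault counts of the k modules closest to attribute value x."""
    nearest = sorted(modules, key=lambda m: abs(m[1] - x))[:k]
    return sum(m[2] for m in nearest) / k


# (name, attribute value [hypothetical], number of faults [from the card])
modules = [("C", 15, 0), ("E", 24, 1), ("H", 36, 5), ("I", 41, 7), ("J", 56, 10)]

faults_k = predict_faults(modules, 20)   # neighbors C and E
faults_l = predict_faults(modules, 50)   # neighbors I and J
faults_m = predict_faults(modules, 38)   # neighbors H and I
```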

(B)
dfp: average distance to the 2 closest FP modules
dnfp: average distance to the 2 closest NFP modules
class(xi) = NFP if (dfp/dnfp) > c
            FP otherwise
K ? ? 20
dfp: 35 and 40; (|35-20| + |40-20|)/2 = 17.5
dnfp: 21 and 22; (|21-20| + |22-20|)/2 = 1.5
dfp/dnfp = 17.5/1.5 > 0.5, so NFP
L ? ? 50
dfp: 40 and 55; (|40-50| + |55-50|)/2 = 7.5
dnfp: 30 and 29; (|29-50| + |30-50|)/2 = 20.5
dfp/dnfp = 7.5/20.5, which is not > 0.5, so FP
M ? ? 38
dfp: 35 and 40; (|35-38| + |40-38|)/2 = 2.5
dnfp: 30 and 29; (|29-38| + |30-38|)/2 = 8.5
dfp/dnfp = 2.5/8.5, which is not > 0.5, so FP
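
The distance-ratio rule in part (B) can be sketched as below, with c = 0.5 as on the card. Only the attribute values that appear on the card are used (FP: 35, 40, 55; NFP: 21, 22, 29, 30); treating them as the whole data set is an assumption, and the function names are illustrative. Each distance is measured from the module being classified, so for M the reference value is 38.

```python
def mean_distance_to_nearest(values, x, k=2):
    """Average distance from x to its k closest values."""
    distances = sorted(abs(v - x) for v in values)
    return sum(distances[:k]) / k


def classify(x, fp_values, nfp_values, c=0.5):
    d_fp = mean_distance_to_nearest(fp_values, x)
    d_nfp = mean_distance_to_nearest(nfp_values, x)
    # Far from the FP modules relative to the NFP modules means NFP.
    return "NFP" if d_fp / d_nfp > c else "FP"


fp_values = [35, 40, 55]        # attribute values of FP modules on the card
nfp_values = [21, 22, 29, 30]   # attribute values of NFP modules on the card

class_k = classify(20, fp_values, nfp_values)   # 17.5 / 1.5
class_l = classify(50, fp_values, nfp_values)   # 7.5 / 20.5
class_m = classify(38, fp_values, nfp_values)   # 2.5 / 8.5
```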

Term

MOM steps

Definition

1. Use numerical prediction to predict faults
2. Order the modules based on the predicted faults.
3. Evaluate the quality of the ordering by using the actual number of faults.
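
The three steps above can be sketched as follows. The predicted and actual fault counts are hypothetical, and the quality measure used here (the share of all actual faults found in the top-ranked modules) is one plausible choice rather than the measure the card necessarily has in mind.

```python
def ordering_quality(modules, cutoff_fraction):
    """Share of actual faults captured by the top-ranked modules."""
    # Step 2: order the modules by predicted faults, highest first.
    ranked = sorted(modules, key=lambda m: m["predicted"], reverse=True)
    n_top = max(1, round(len(ranked) * cutoff_fraction))
    # Step 3: judge the ordering with the actual number of faults.
    captured = sum(m["actual"] for m in ranked[:n_top])
    total = sum(m["actual"] for m in ranked)
    return captured / total


# Step 1 is assumed to have produced the "predicted" values already.
modules = [
    {"name": "A", "predicted": 0.4, "actual": 0},
    {"name": "B", "predicted": 6.1, "actual": 7},
    {"name": "C", "predicted": 2.0, "actual": 1},
    {"name": "D", "predicted": 8.5, "actual": 10},
    {"name": "E", "predicted": 1.2, "actual": 2},
]
quality = ordering_quality(modules, cutoff_fraction=0.4)
```

With these numbers the top 40% of modules (D and B) contain 17 of the 20 actual faults.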
