Shared Flashcard Set

Details

Data Mining
n/a
15
Computer Science
Graduate
11/27/2011

Cards

Term
What is a Type I error (false positive)?
Definition
A Type I error classifies a module as high risk when it is actually low risk. It is the less costly of the two errors in module classification.
Term
What is a Type II error (false negative)?
Definition
A Type II error classifies a module as low risk when it is actually high risk. It is the more important error because a high-risk module goes undetected.
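As a rough illustration of the two error types above, the sketch below counts them from paired actual and predicted labels. The label strings "high" and "low", the function name, and the sample data are assumptions made for this example only.

```python
def error_rates(actual, predicted):
    """Return (type_i_rate, type_ii_rate) for 'high' / 'low' risk labels."""
    pairs = list(zip(actual, predicted))
    # Type I: module is actually low risk but classified as high risk.
    false_pos = sum(1 for a, p in pairs if a == "low" and p == "high")
    # Type II: module is actually high risk but classified as low risk.
    false_neg = sum(1 for a, p in pairs if a == "high" and p == "low")
    n_low = sum(1 for a in actual if a == "low")
    n_high = sum(1 for a in actual if a == "high")
    type_i = false_pos / n_low if n_low else 0.0
    type_ii = false_neg / n_high if n_high else 0.0
    return type_i, type_ii


# Hypothetical sample: four low-risk modules and two high-risk modules.
actual = ["low", "low", "low", "low", "high", "high"]
predicted = ["low", "high", "low", "low", "high", "low"]
type_i, type_ii = error_rates(actual, predicted)
print(type_i, type_ii)
```

Here one of the four low-risk modules is flagged as high risk (Type I) and one of the two high-risk modules is missed (Type II).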
Term
Similarities between logistic regression and multiple linear regression.
Definition
The right-hand sides of the equations are very similar: both are a linear combination of the independent variables.
Term
Differences between logistic regression and multiple linear regression.
Definition
Multiple linear regression is used to predict a numeric value. Logistic regression is used to predict a class.
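The similarity and the difference can be sketched side by side. In this minimal example both models share the same linear right-hand side; the logistic version passes it through the logistic function and thresholds the resulting probability. The coefficients, the 0.5 threshold, and the class labels 1 and 0 are assumptions for illustration.

```python
import math


def linear_predictor(coefs, intercept, x):
    # Shared right-hand side: b0 + b1*x1 + ... + bn*xn
    return intercept + sum(b * xi for b, xi in zip(coefs, x))


def predict_value(coefs, intercept, x):
    # Multiple linear regression: the linear predictor is the prediction.
    return linear_predictor(coefs, intercept, x)


def predict_class(coefs, intercept, x, threshold=0.5):
    # Logistic regression: map the linear predictor to a probability,
    # then turn the probability into a class (1 or 0).
    p = 1.0 / (1.0 + math.exp(-linear_predictor(coefs, intercept, x)))
    return (1 if p > threshold else 0), p


coefs, intercept, x = [0.5, -1.0], 0.0, [4.0, 1.0]  # hypothetical numbers
value = predict_value(coefs, intercept, x)
label, prob = predict_class(coefs, intercept, x)
print(value, label, round(prob, 3))
```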
Term
Data splitting
Definition
Split the data into two parts: one for training (building the model) and the other for testing.
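A minimal sketch of data splitting, assuming a random shuffle and a 70/30 split. The card does not specify the proportion or how rows are assigned, so both are assumptions here, as is the function name.

```python
import random


def split_data(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten module records
train, test = split_data(rows)
print(len(train), len(test))
```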
Term
re-substitution
Definition
Build a model using the training (fit) data set, then substitute that same data set back into the model and measure its accuracy and other performance metrics.
Term
subsequent project
Definition
You have a software project and one or more subsequent software projects. Use one project to build the model, then use the subsequent projects as the test data.
Term
cross-validation
Definition
Split the data into 10 parts. Use 9 parts to build the model and the remaining part to test it. Repeat this 10 times, so that each part is used once for testing, then combine the results.
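The procedure can be sketched in plain Python. This version assumes the results are combined by pooling the correct predictions from all folds into one accuracy figure, and it uses a trivial majority-class model as a stand-in learner; both choices are assumptions for illustration.

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) so each index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]


def cross_validate(xs, ys, fit, predict, k=10):
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        # Build the model on k-1 parts, test it on the held-out part.
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        correct += sum(1 for j in test_idx if predict(model, xs[j]) == ys[j])
    return correct / len(xs)  # pooled accuracy over all folds


def fit_majority(xs, ys):
    # Hypothetical stand-in model: always predict the most common class.
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    return model


xs = list(range(20))
ys = ["NFP"] * 15 + ["FP"] * 5
accuracy = cross_validate(xs, ys, fit_majority, predict_majority)
print(accuracy)
```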
Term
Precision (efficiency)
Definition
The proportion of modules that are actually fault-prone (correctly predicted) out of all the modules that were classified as fault-prone.
Term
Recall (effectiveness)
Definition
The number of correctly predicted fault-prone modules divided by the total number of modules that are actually fault-prone.
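Both ratios can be computed from paired labels. The sketch below treats "FP" (fault-prone) as the class of interest; the function name and the sample labels are assumptions for illustration.

```python
def precision_recall(actual, predicted, positive="FP"):
    pairs = list(zip(actual, predicted))
    # Correctly predicted fault-prone modules.
    correct = sum(1 for a, p in pairs if a == positive and p == positive)
    predicted_pos = sum(1 for p in predicted if p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    precision = correct / predicted_pos if predicted_pos else 0.0
    recall = correct / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical sample: four fault-prone modules, three predicted as such.
actual = ["FP", "FP", "FP", "FP", "NFP", "NFP"]
predicted = ["FP", "FP", "NFP", "NFP", "FP", "NFP"]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

Two of the three modules classified as fault-prone really are fault-prone (precision 2/3), and two of the four actually fault-prone modules were found (recall 1/2).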
Term
For linear regression models, among the selection methods greedy, M5, and no selection, which method will use the largest number of independent variables?
Definition
No selection, because none of the variables are pruned.
Term
An over-fitted numerical prediction or classification model
Definition
The model performs very well on the training data but very poorly on the test data (the model is too good to be true).
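One way to see over-fitting is to compare accuracy on the training data (resubstitution) with accuracy on held-out test data. The sketch below uses a 1-nearest-neighbor classifier, which memorizes its training set, on a small made-up data set with one noisy label. The classifier choice and all numbers are assumptions for illustration.

```python
def predict_1nn(train, x):
    # Return the label of the training row whose value is closest to x.
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train, rows):
    hits = sum(1 for x, label in rows if predict_1nn(train, x) == label)
    return hits / len(rows)


# Hypothetical (metric value, class) rows; the row at 30 is a noisy label.
train = [(10, "NFP"), (20, "NFP"), (30, "FP"),
         (80, "FP"), (90, "FP"), (100, "FP")]
test = [(27, "NFP"), (33, "NFP"), (95, "FP"), (70, "FP")]

train_accuracy = accuracy(train, train)  # resubstitution
test_accuracy = accuracy(train, test)    # unseen data
print(train_accuracy, test_accuracy)
```

The model scores perfectly on the data it memorized, while the noisy training row misleads it on nearby test modules.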
Term
#4
look on doc
Definition
class(xi) = NFP, if (NFP/FP) > c
            FP, otherwise

A mean value of 0.75 means that 75% are NFP and 25% are FP.
(0.75)/(0.25) = 3

(a) c = 2: 3 > 2, so NFP
(b) c = 5: 3 < 5, so FP
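The rule on this card can be sketched as a small function. It assumes that the mean value 0.75 is the proportion of NFP, and that the numbers 2 and 5 in parts (a) and (b) are values of the threshold c; the original question is in a separate document, so that reading is an assumption.

```python
def classify(p_nfp, c):
    """Return 'NFP' if the ratio NFP/FP exceeds c, otherwise 'FP'."""
    p_fp = 1.0 - p_nfp
    if p_fp == 0:
        return "NFP"  # no FP share at all: the ratio is unbounded
    ratio = p_nfp / p_fp
    return "NFP" if ratio > c else "FP"


ratio = 0.75 / 0.25           # 3.0
result_a = classify(0.75, 2)  # assumed threshold for part (a)
result_b = classify(0.75, 5)  # assumed threshold for part (b)
print(ratio, result_a, result_b)
```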
Term
(5)
Definition
(A)
K ? ? 20
^ between C and E
# of faults C=0 E=1; (0+1)/2 = 0.5
L ? ? 50
^ between I and J
# of faults I=7 J=10; (7+10)/2 = 8.5
M ? ? 38
^ between H and I
# of faults H=5 I=7; (5+7)/2 = 6
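Part (A) predicts the number of faults for a new module by averaging the faults of its two nearest neighbors. A minimal sketch follows. The fault counts for C, E, H, I and J come from the card, but the metric values attached to them (18, 21, 35, 40, 55) are assumptions chosen so that the neighbors match the card; the real values are in the separate document.

```python
def predict_faults(modules, value, k=2):
    """Average the fault counts of the k modules nearest to value."""
    nearest = sorted(modules, key=lambda m: abs(m[1] - value))[:k]
    return sum(m[2] for m in nearest) / k


# (name, assumed metric value, number of faults from the card)
modules = [("C", 18, 0), ("E", 21, 1), ("H", 35, 5),
           ("I", 40, 7), ("J", 55, 10)]

pred_k = predict_faults(modules, 20)  # neighbors C and E
pred_l = predict_faults(modules, 50)  # neighbors I and J
pred_m = predict_faults(modules, 38)  # neighbors H and I
print(pred_k, pred_l, pred_m)
```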

(B)
For dfp, find the 2 closest FP modules and average their distances.
For dnfp, find the 2 closest NFP modules and average their distances.
class(xi) = NFP, if (dfp/dnfp) > c
            FP, otherwise
K ? ? 20
dfp 35 and 40; ((35-20)+(40-20))/2 = 17.5
dnfp 21 and 22; ((21-20)+(22-20))/2 = 1.5
dfp/dnfp: 17.5/1.5 > 0.5, so NFP
L ? ? 50
dfp 40 and 55; (|40-50|+|55-50|)/2 = 7.5
dnfp 30 and 29; (|29-50|+|30-50|)/2 = 20.5
dfp/dnfp: 7.5/20.5 is not > 0.5, so FP
M ? ? 38
dfp 35 and 40; (|35-38|+|40-38|)/2 = 2.5
dnfp 30 and 29; (|29-38|+|30-38|)/2 = 8.5
dfp/dnfp: 2.5/8.5 is not > 0.5, so FP
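The distance-ratio rule in part (B) can be checked with a short sketch. It uses only the FP values (35, 40, 55) and NFP values (21, 22, 29, 30) that appear on the card, with c = 0.5; whether the full data set contains other modules is unknown, and the function names are assumptions. Each distance is measured from the new module's own value, which for M = 38 gives 2.5 and 8.5.

```python
def mean_distance(values, x, k=2):
    """Average distance from x to its k closest values."""
    return sum(sorted(abs(v - x) for v in values)[:k]) / k


def classify(x, fp_values, nfp_values, c=0.5):
    d_fp = mean_distance(fp_values, x)
    d_nfp = mean_distance(nfp_values, x)
    if d_nfp == 0:
        return "NFP"  # x coincides with its nearest NFP modules
    # Far from FP relative to NFP (large ratio) means not fault-prone.
    return "NFP" if d_fp / d_nfp > c else "FP"


fp_values = [35, 40, 55]
nfp_values = [21, 22, 29, 30]
class_k = classify(20, fp_values, nfp_values)  # 17.5 / 1.5
class_l = classify(50, fp_values, nfp_values)  # 7.5 / 20.5
class_m = classify(38, fp_values, nfp_values)  # 2.5 / 8.5
print(class_k, class_l, class_m)
```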
Term
MOM (module-order model) steps
Definition
1. Use numerical prediction to predict the number of faults in each module.
2. Order the modules based on the predicted faults.
3. Evaluate the quality of the ordering using the actual number of faults.
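The three steps can be sketched as follows. The predicted faults are taken as given (step 1 would come from any numerical prediction model). The quality measure, the share of actual faults found in the top-ranked modules, is an assumption about what evaluating quality means here, and the data are made up.

```python
def order_modules(modules):
    """Step 2: rank modules by predicted faults, highest first."""
    return sorted(modules, key=lambda m: m[1], reverse=True)


def faults_captured(ranked, top_n):
    """Step 3 (assumed measure): share of actual faults in the top_n modules."""
    total = sum(m[2] for m in ranked)
    if total == 0:
        return 0.0
    return sum(m[2] for m in ranked[:top_n]) / total


# Hypothetical (name, predicted faults, actual faults) records.
modules = [("A", 4.0, 6), ("B", 0.5, 0), ("C", 9.0, 10),
           ("D", 2.0, 1), ("E", 1.0, 3)]

ranked = order_modules(modules)
order = [m[0] for m in ranked]
captured = faults_captured(ranked, top_n=2)
print(order, captured)
```

In this made-up example the two top-ranked modules account for 16 of the 20 actual faults.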