Shared Flashcard Set

Details

Psyc 7165
Reliability lecture
52
Psychology
Graduate
10/04/2011

Cards

Term

 

 

Psychometric Inference

Definition
The characteristics measured are assumed to represent the quantity of an attribute
Term
What types of characteristics can be measured by psychometric inference?
Definition

Characteristics may be overt (directly observable behavior) or covert (intelligence, self-esteem, working memory)

 
Term
What is the Major Problem in Measurement?
Definition

The major problem in measurement is that one often has no basis for assuming that a numerical score accurately reflects the quantity of interest

 
Term
What does assessment often involve?
Definition

Assessment often involves measurement of constructs

 
Term
What is a true score?
Definition
The average value an examinee would obtain over an infinite number of observed scores
 
Term
What do errors of measurement reflect in true scores?
Definition

Errors of measurement reflect discrepancy between observed scores & the true score

Standard error of measurement (SEM) is the standard deviation of that infinite number of observed scores
 
Term
What is the formula for estimating true scores?
Definition

True score can be estimated: Xtrue = rxx(X – M) + M

.90 (110 – 100) + 100 = 109
.90 (90 – 100) + 100 = 91
What explains the above? (Regression toward the mean: estimated true scores are pulled toward M.)
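The estimation formula above can be sketched in Python (a minimal illustration; the function name is mine, not from the lecture):

```python
def estimated_true_score(x, mean, rxx):
    """Estimated true score: X_true = rxx * (X - M) + M.
    Estimates regress toward the group mean M as reliability rxx drops."""
    return rxx * (x - mean) + mean

# The card's worked examples (M = 100, rxx = .90):
print(estimated_true_score(110, 100, 0.90))  # 109.0
print(estimated_true_score(90, 100, 0.90))   # 91.0
```

Note that both estimates land closer to M = 100 than the observed scores did, which is the regression effect the card asks about.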
 
Term
Can an individual have more than one true score?
Definition
Yes
Term

Tell me about Biological or Absolute True Scores

 

Definition
Absolute true score exists independent of measurement process used
Errors of measurement still occur
Individuals have only 1 absolute true score (e.g., DNA, blood pressure, cholesterol level, cancer, pregnancy)
Different lab tests may yield different results but one would never average tests to estimate an absolute true score
Term
Ways to measure response strength in operant behavior
Definition
Frequency
Duration
Latency
Interresponse time
Term
Sources of Error for operant behavior
Definition
Less than perfect IOA
Can we generalize reliability to all other observers?
Is there a standard error of measurement?
Term
Threats to Measurement of operant behavior
Definition
Observer bias
Observer drift
Code complexity
Term
What does accuracy refer to in operant behavior measurement?
Definition
Degree to which measurement reflects true value
Term
Types of Test Scores
Definition
Raw scores
Composite scores
Percentile ranks (1–99)
Stanines (M = 5, SD = 2)
Normal curve equivalents (M = 50, SD = 21.06)
Standard scores
z scores (M = 0, SD = 1)
T scores (M = 50, SD = 10)
Scale scores (M = 10, SD = 3)
DIQ scores (M = 100, SD = 15)
Equivalence of standard scores
z of 1 = T of 60 = DIQ of 115 = SS of 13 = NCE of 71.06
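The equivalence chain above is just rescaling through z. A small sketch (function name is mine, not from the lecture):

```python
def convert_score(score, from_mean, from_sd, to_mean, to_sd):
    """Convert a score between standard-score metrics by passing
    through z: z = (score - M)/SD, then rescale to the target metric."""
    z = (score - from_mean) / from_sd
    return to_mean + z * to_sd

# The equivalences listed above, starting from z = 1:
print(convert_score(1, 0, 1, 50, 10))     # T score: 60.0
print(convert_score(1, 0, 1, 100, 15))    # DIQ: 115.0
print(convert_score(1, 0, 1, 10, 3))      # scale score: 13.0
print(convert_score(1, 0, 1, 50, 21.06))  # NCE: ~71.06
```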
Term
Types of Norms
Definition
Age norms
Grade norms
Gender norms
Special group norms
Percentile rank norms
Standard score norms
National norms
State norms
District norms
School norms
Term
How long has Classical Test Theory (CTT) been around?
Definition
over 100 years
Term
What are test scores made of in CTT
Definition
Test scores are made up of a true score component & an error score component
X = True score + Error score
True score is hypothetical (never actually known but can be estimated)
True score is average score on infinite number of tests (model of parallel tests)
Standard deviation of these tests is standard error of measurement
Term
What reflects the reliability of the test in CTT?
Definition

The degree of error in a test reflects the reliability of the test: the more error, the lower the reliability

 

Term
How is reliability defined CTT?
Definition

Reliability is defined as true score variance relative to total variance in scores

R = True score variance / Total variance (R = 75/100 = .75)
Or R = 1 – Error variance / Total variance (R = 1 – .25 = .75)
Or True score variance = Total variance × R (100 × .75 = 75)
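The three variance identities can be checked with a few lines of Python (a minimal sketch; the function name is mine):

```python
def reliability_from_variances(true_var, total_var):
    """CTT reliability: the proportion of total test-score variance
    that is true-score variance."""
    return true_var / total_var

# The card's numbers: true-score variance 75, total variance 100
r = reliability_from_variances(75, 100)
print(r)             # 0.75
print(1 - 25 / 100)  # same answer via R = 1 - error/total: 0.75
print(100 * r)       # true-score variance recovered: 75.0
```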
Term
Fundamental Assumptions in CTT
Definition
Individuals possess stable traits or characteristics (true scores) that persist through time
Errors of measurement are completely random (due entirely to unsystematic variation in test scores)
Fallible scores are the result of the addition of true & error scores: Xobtained = Xtrue + Xerror
Based on the above logic, an individual's true score is exactly the same on all parallel tests
Also (based on the above logic), an individual's fallible score will vary from one parallel test to another (based on differences in reliability)
Also (and confusingly), an individual can have more than 1 true score (WISC–Stanford-Binet–Woodcock-Johnson)
Term
Major Sources of Error in Testing
Definition
Errors associated with specific situation
Errors associated with  different occasions (situational-centered & person-centered sources of variation)
Errors associated with different test contents (2 or more measures thought to be parallel may not be parallel)
Errors associated with subjective scoring systems (ratings, essay examinations, oral examinations, etc.)
Term
What Makes Tests Parallel?
Definition
If they have the same mean
If they have the same standard deviation
If they correlate the same with a set of true scores
If all their variance that is not explainable by true scores is pure random error
Term
Methods for Estimating Reliability of Tests
Definition
Correlations between scores on repetitions of same test (coefficient of stability)
Correlations among scores on parallel forms of a test (coefficient of equivalence)
Correlations between repetitions & parallel forms (coefficient of stability & equivalence)
Correlations between comparable halves of a test (split-half reliability)
Intercorrelations among all components of a test (inter-item correlation)
Term
Indices of Reliability and Error
Definition

 

Reliability Index:
Reliability Coefficient:
Coefficient Alpha:
Spearman-Brown Formula:
Standard Error of Measurement:
Standard Error of Estimate:
Correction for Attenuation:
Term
Reliability Index:
Definition

 

Correlation between true scores & fallible scores
 
Term
Reliability Coefficient:
Definition

 Correlation between scores on parallel test forms or how well scores on 1 parallel test can predict scores on another parallel test

Term
Coefficient Alpha:
Definition

 Reflects the average inter-item correlation or the reliability of an item sample in a content domain
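Coefficient alpha can be computed directly from item scores. A self-contained sketch using the standard formula α = k/(k−1) · (1 − Σ item variances / total-score variance); the data below are invented for illustration:

```python
def cronbach_alpha(items):
    """Coefficient alpha for a list of item-score columns
    (one equal-length list of examinee scores per item)."""
    k = len(items)

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))

# Hypothetical scores of 4 examinees on 3 items:
print(cronbach_alpha([[2, 4, 3, 5], [3, 5, 4, 6], [2, 5, 3, 5]]))
```

Because the three invented items rank the examinees almost identically, alpha comes out high; less consistent items would pull it down.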

Term
Spearman-Brown Formula:
Definition

Demonstrates the relationship between test length & measurement error

Term
Standard Error of Measurement:
Definition

Extent to which an individual's scores vary over a series of parallel tests

Term
Standard Error of Estimate:
Definition

Degree of measurement error in prediction from 1 variable to another

Term
Correction for Attenuation:
Definition
 Extent to which unreliability in test scores diminishes the correlation between 2 or more sets of test scores
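The standard disattenuation formula, r_true = r_xy / sqrt(rxx · ryy), can be sketched as follows (the function name and example numbers are mine, not from the lecture):

```python
def correct_for_attenuation(r_xy, rxx, ryy):
    """Estimate the correlation between two sets of TRUE scores from
    the observed correlation and each test's reliability."""
    return r_xy / (rxx * ryy) ** 0.5

# Hypothetical: observed r of .50 between tests with reliabilities .80 and .70
print(correct_for_attenuation(0.50, 0.80, 0.70))  # ~0.67
```

Unreliability in either test shrinks the observed correlation, so the corrected value is always at least as large as the observed one.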
Term
Differences between Standards
Definition

 

Standard Deviation (dispersion in set of test scores)
Standard Error of Measurement (error in a single test score)
Standard Error of Estimate (error in prediction from 1 test score to another)
Standard Error of the Mean (sampling error in average test score)
Standard Error of Correlation (sampling error in correlation of test scores)
Term
The Importance of Spearman-Brown
Definition

 

Major way of making tests more reliable is to make them longer
Major source of error in tests is content error
Longer tests have less content error than shorter tests
CTT assumes subject error is minimized because of large samples
Spearman-Brown Prophecy Formula: rkk = k·r11 / (1 + (k – 1)·r11)
What would be the reliability of a test that was increased by a factor of k? A 20-item test has a reliability of .70 and is increased to 60 items (k = 3).
rkk = 3(.7) / (1 + (3 – 1)(.7)) = 2.1/2.4 = .88
Spearman-Brown can also be used to estimate reliability when shortening a test, using the same formula: a 100-item test with r = .95 is shortened to a 50-item test (k = .50)
rkk = .50(.95) / (1 + (.50 – 1)(.95))
rkk = .475/.525 = .90
You can also rearrange the formula to estimate the length factor required to obtain a desired level of reliability:
k = rkk(1 – r11) / [r11(1 – rkk)]
k = .80(1 – .5) / [(.5)(1 – .8)] = .4/.1 = 4
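The Spearman-Brown calculations above can be reproduced with two short functions (names are mine, not from the lecture):

```python
def spearman_brown(r, k):
    """Projected reliability when test length changes by factor k:
    rkk = k*r / (1 + (k - 1)*r)."""
    return k * r / (1 + (k - 1) * r)

def length_factor(r, target):
    """Length factor k needed to reach a target reliability:
    k = target*(1 - r) / (r*(1 - target))."""
    return target * (1 - r) / (r * (1 - target))

# The card's worked examples:
print(spearman_brown(0.70, 3))    # ~.875, the card's .88
print(spearman_brown(0.95, 0.5))  # ~.905, the card's .90
print(length_factor(0.50, 0.80))  # ~4x as many items
```

The same function handles lengthening (k > 1) and shortening (k < 1), which is why the 100-to-50-item example uses k = .50.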
Term
Importance of Standard Error of Measurement
Definition

 

SEM reflects the degree of measurement error around an obtained score
SEM is the standard deviation of an infinite number of parallel tests
SEM reflects the degree of confidence one has in a test score
SEM is a function of test reliability and variability in test performance
SEM places confidence intervals around test scores
As the reliability of a test decreases, SEM approaches the SD of the test
±1 SEM = 68% CI; ±2 SEM = 95% CI; ±3 SEM = 99% CI
r = .93 & SD = 15, X = 76, SEM = 4 points
68% CI: 72–80
95% CI: 68–84
99% CI: 64–88
You can also place a SEM around the estimated true score:
Xtrue = rxx(X – M) + M
Xtrue = .93 (76 – 100) + 100 = 78
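The SEM arithmetic in this card follows the standard formula SEM = SD·√(1 − rxx); a minimal sketch (function names are mine):

```python
def sem(sd, rxx):
    """Standard error of measurement: SD * sqrt(1 - rxx)."""
    return sd * (1 - rxx) ** 0.5

def confidence_band(score, sd, rxx, n_sem):
    """Band of +/- n_sem SEMs around an obtained score
    (n_sem = 1, 2, 3 for roughly 68%, 95%, 99% confidence)."""
    e = sem(sd, rxx)
    return score - n_sem * e, score + n_sem * e

# The card's example: r = .93, SD = 15, X = 76
print(sem(15, 0.93))                     # ~3.97, which the card rounds to 4
print(confidence_band(76, 15, 0.93, 1))  # ~(72, 80): the 68% CI
print(confidence_band(76, 15, 0.93, 2))  # ~(68, 84): the 95% CI
```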
Term

Reliability Coefficient & SEM

 

Internal consistency reliability (α)

 

Definition

 

Invariably higher than test-retest
Produces smaller SEMs
Based on average inter-item correlation
Reflects precision of test score on a given day
IQ score on Monday
Term

Reliability Coefficient & SEM

Test-retest reliability (r)

 

Definition

 

Invariably lower than α
Produces larger SEMs
Based on correlation over time (Time 1/Time 2)
Reflects precision of test scores on any given day
IQ score in September vs IQ score in January
Term
Standard Error of Measurement:
A Practical Example
Definition

 

Forrest is administered a WAIS-IV and obtains a FSIQ of 71
Diagnosis of MR requires a score of 2 SDs below mean (70)
Based on Forrest's IQ of 71, is he MR?
rxx of FSIQ = .97, SD = 15 → SEM = 2 points: 68% CI: 68–74; 95% CI: 65–76; 99% CI: 62–79
r1,2 = .95, SD = 15 → SEM = 3 points: 68% CI: 68–74; 95% CI: 65–77; 99% CI: 62–80
Conclusion? What should we do with Forrest?
Term
Generalizability Theory:
An Alternative to CTT
Definition

 

Extends notion of measurement error beyond CTT
Offers way of assessing multiple sources of error (facets) concurrently
CTT lumps all sources of error into one estimate (cannot separate)
Term

Generalizability Theory:

Dependability

Definition

 

Accuracy of generalizing from a person's observed score on a test or measure to the average score that person would have received under all conditions of measurement
Single score obtained on one occasion on a form of a test with a single administrator is not fully dependable (multiple sources of error)
Term
Multifaceted Measurement Error
Definition

 

Persons
Occasions
Items
Raters
Settings
Term
G-theory Studies use what?
Definition

 

G Theory studies are investigated via ANOVA designs
ANOVA designs separate sources of variation in test scores
ANOVA designs can be:
Crossed
Nested
ANOVA designs can be:
Fixed effects

Random effects

Term

 

G Studies vs. D Studies
Definition

 

G studies collect information
D studies use the above information to make the best decision
Term
What do G-studies do?
Definition

 

Anticipates multiple uses of measurement
Provides as much information as possible about sources of error
Incorporates this information into proper test interpretation
Term
What do D-Studies do?
Definition

 

Makes use of information in G Study to design best application
Specifies which facets to be considered
Specifies proper interpretation
Estimates dependability based on increasing conditions (facets) of measurement
Term
G & D Studies
Systematic Direct Observations
Definition

 

14 students from 5th grade classroom
DV: on-task/off task behavior
5 observers (4 hour training session)
SDOs collected twice a day for 10 consecutive days using momentary time sampling
IOA=90%
G Study
Persons (62%)
Time (1%)
Setting  (0%)
Person x Time (0%)
Person x Setting (13%)
Time x Setting (0%)
Person x Time x Setting (24%)
Term

Decision Studies

and # of observations

Definition

 

Decision Study 1
1 observation per day for 10 days
G=.46
Decision Study 2
1 observation per day for 3 days
G=.25
Decision Study 3
2 observations per day for 20 days
G=.72
Decision Study 4
4 observations per day for 20 days
G=.83

 

Term
D Studies
What Does It All Mean
Definition

 

IOA was 90% but G reliability = .62 (Interpretation?)
Adequate reliability only obtained if observations collected 4 times per day for 4 weeks (40 days)
2400 minutes or 40 hours of observation (What would JT say?)
IOA as proxy for accuracy of measurement
No incontrovertible index with which to compare observed scores
SDOs should not be used in isolation (other methods needed)
SDOs certainly not a gold standard measurement method
Term
G Study
Behavior Rating Scales
Definition

 

G Study of BASC & TRF (Achenbach) Externalizing Behavior
6 teacher pairs Grades 1-5 rated 61 students
α = .90–.97; r1,2 = .70–.90; rinterrater = .60–.76
Dependability Coefficients
Externalizing Composites=.68
Aggression=.59
Oppositional Defiant=.58
Conduct Problems=.47
Dependability coefficients weaker than bivariate correlations
Dependability coefficients all in moderate range
Considering multiple sources of error attenuates dependability
Should not rely solely on rating scales in assessment
Rating scales certainly not a gold standard method
 
 
Term
G & D Studies
Direct Behavior Rating
Definition

 

Academic engagement
SDO
DBR
Data collected over 10 consecutive school days
DBR
Teacher cued to start observation
End of period teachers rated student behavior
100 mm line divided into 11 equal gradients (never-sometimes-always)
SDO
Momentary time sampling
15-s interval
Design:
Raters (nested in Methods) x Observation Periods (nested in Days) x Persons
p × (r:m) × (o:d)
Term

G & D Studies
Direct Behavior Rating

Results

Definition

 

12 persons x 4 raters x 10 days x 3 rating periods
1440 total ratings
ϕ=.77
D Studies
1 observation/day SDO
1 day-.50
5 days-.83
10 days-.91
15 days-.93
20 days-.98
100 days-.99
1 observation/day DBR
1 day-.48
5 days-.82
10 days-.91
15 days-.93
20 days-.97
100 days-.99
Term
Implications for Practice from G-studies
Definition

 

More dependable estimates obtained via SDO
SDO records behavior every 15 s vs. every 15 min (DBR)
Sufficient reliability with SDO after 3 sessions vs. 20 DBR ratings
Quick decisions best made using SDO
Findings inconsistent with Hintz & Matthews study
DBRs less intrusive & can measure low frequency behaviors
DBRs less time consuming & require less training effort