Term
What is measurement?

Definition
•Characteristics measured represent the quantity of an attribute


Term
What type of characteristics can be measured by psychological inference? 

Definition
Characteristics may be overt (directly observable behavior) or covert (intelligence, self-esteem, working memory)



Term
What is the Major Problem in Measurement 

Definition
The major problem in measurement is that one often has no basis for assuming that a numerical score accurately reflects the quantity of interest



Term
What does assessment often involve? 

Definition
Assessment often involves measurement of constructs



Term
What is a true score?

Definition
–Average value an examinee obtains over an infinite number of observed scores



Term
What do errors of measurement reflect in true scores? 

Definition
Errors of measurement reflect the discrepancy between observed scores & the true score
–Standard error of measurement (SEM) is the standard deviation of an infinite number of observed scores



Term
What is the formula for estimating true scores? 

Definition
True score can be estimated: rxx (X – M) + M
•.90 (110 – 100) + 100 = 109
•.90 (90 – 100) + 100 = 91
•What explains the above? Regression to the mean: estimated true scores are pulled toward the mean in proportion to the test’s unreliability
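The estimation formula above can be sketched in Python (the function name is illustrative, not from any particular package):

```python
def estimated_true_score(x, mean, rxx):
    """Regression-based true score estimate: rxx * (X - M) + M.

    Scores regress toward the mean in proportion to unreliability,
    which is why 110 is pulled down to 109 and 90 is pulled up to 91.
    """
    return rxx * (x - mean) + mean

print(estimated_true_score(110, 100, 0.90))  # 109.0
print(estimated_true_score(90, 100, 0.90))   # 91.0
```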



Term
Can an individual have more than one true score? 

Definition
Yes – in CTT a true score is defined relative to a particular test, so an individual can have a different true score on the WISC, Stanford-Binet, & Woodcock-Johnson

Term
Tell me about Biological or Absolute True Scores


Definition
–Absolute true score exists independent of measurement process used
–Errors of measurement still occur
–Individuals have only 1 absolute true score (ex. DNA, blood pressure, cholesterol level, cancer, pregnancy)
•Different lab tests may yield different results but one would never average tests to estimate an absolute true score 


Term
Ways to measure response strength in operant behavior 

Definition
–Frequency
–Duration
–Latency
–Interresponse time 


Term
Sources of Error for operant behavior 

Definition
–Less than perfect IOA
–Can we generalize reliability to all other observers?
–Is there a standard error of measurement? 


Term
Threats to Measurement of operant behavior 

Definition
–Observer bias
–Observer drift
–Code complexity 


Term
What does accuracy refer to in operant behavior measurement? 

Definition
Degree to which measurement reflects true value 


Term
Types of Test Scores

Definition
•Raw scores
•Composite scores
•Percentile ranks (1–99)
•Stanines (M = 5, SD = 2)
•Normal curve equivalents (M = 50, SD = 21.06)
•Standard scores
–z scores (M=0, SD=1)
–T scores (M=50, SD=10)
–Scale scores (M=10, SD=3)
–DIQ scores (M=100, SD=15)
•Equivalence of standard scores
–z of 1 = T of 60 = DIQ of 115 = SS of 13 = NCE of 71.06 
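The equivalence above follows from converting a z score onto each scale with score = M + z × SD; a minimal Python sketch (function name illustrative):

```python
def from_z(z, mean, sd):
    """Convert a z score to a standard-score scale: score = M + z * SD."""
    return mean + z * sd

z = 1.0
print(from_z(z, 50, 10))     # T score: 60.0
print(from_z(z, 100, 15))    # DIQ: 115.0
print(from_z(z, 10, 3))      # scale score: 13.0
print(from_z(z, 50, 21.06))  # NCE: ~71.06
```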


Term
Types of Norms

Definition
•Age norms
•Grade norms
•Gender norms
•Special group norms
•Percentile rank norms
•Standard score norms
•National norms
•State norms
•District norms
•School norms 


Term
How long has Classical Test Theory (CTT) been around? 

Definition


Term
What are test scores made of in CTT? 

Definition
•Test scores made up of true score component & error score component
•X = True score + Error score
•True score is hypothetical (never actually known but can be estimated)
•True score is average score on infinite number of tests (model of parallel tests)
•Standard deviation of these tests is standard error of measurement


Term
What reflects the reliability of the test in CTT? 

Definition
Degree of error in a test reflects the reliability of the test



Term
How is reliability defined in CTT? 

Definition
Reliability is defined as true score variance relative to total variance in scores
•R = True score variance/Total variance (R = 75/100 = .75)
•Or R = 1 – Error variance/Total variance (R = 1 – .25 = .75)
•Or True score variance = Total variance × R (100 × .75 = 75) 
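The three equivalent forms above, checked in Python with the worked numbers:

```python
# Reliability as the ratio of true score variance to total variance.
true_var, total_var = 75.0, 100.0
error_var = total_var - true_var    # 25.0

r = true_var / total_var            # 0.75
r_alt = 1 - error_var / total_var   # same value, via error variance
recovered = total_var * r           # back out true score variance: 75.0

assert r == r_alt == 0.75 and recovered == 75.0
```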


Term
Fundamental Assumptions in CTT 

Definition
•Individuals possess stable traits or characteristics (true scores) that persist through time
•Errors of measurement are completely random (due entirely to unsystematic variation in test scores)
•Fallible scores are the result of the addition of true & error scores: Xobtained = Xtrue + Xerror
•Based on above logic, an individual’s true score is exactly the same on all parallel tests
•Also (based on the above logic), an individual’s fallible score will vary from one parallel test to another (based on differences in reliability)
•Also (and confusingly), an individual can have more than 1 true score (WISC, Stanford-Binet, Woodcock-Johnson) 


Term
Major Sources of Error in Testing 

Definition
•Errors associated with specific situation
•Errors associated with different occasions (situation-centered & person-centered sources of variation)
•Errors associated with different test contents (2 or more measures thought to be parallel may not be parallel)
•Errors associated with subjective scoring systems (ratings, essay examinations, oral examinations, etc.) 


Term
What Makes Tests Parallel? 

Definition
•If they have the same mean
•If they have the same standard deviation
•If they correlate the same with a set of true scores
•If all their variance that is not explainable by true scores is pure random error 


Term
Methods for Estimating Reliability of Tests 

Definition
•Correlations between scores on repetitions of same test (coefficient of stability)
•Correlations among scores on parallel forms of a test (coefficient of equivalence)
•Correlations between repetitions & parallel forms (coefficient of stability & equivalence)
•Correlations between comparable halves of a test (split-half reliability)
•Intercorrelations among all components of a test (inter-item correlation) 


Term
Indices of Reliability and Error 

Definition
•Reliability Index:
•Reliability Coefficient:
•Coefficient Alpha:
•Spearman-Brown Formula:
•Standard Error of Measurement:
•Standard Error of Estimate:
•Correction for Attenuation: 


Term
Reliability Index: 

Definition
• Correlation between true scores & fallible scores



Term
Reliability Coefficient: 

Definition
Correlation between scores on parallel test forms or how well scores on 1 parallel test can predict scores on another parallel test


Term
Coefficient Alpha: 

Definition
Reflects the average inter-item correlation or the reliability of an item sample in a content domain


Term
Spearman-Brown Formula: 

Definition
Demonstrates the relationship between test length & measurement error


Term
Standard Error of Measurement: 

Definition
Extent to which an individual’s scores vary over a series of parallel tests


Term
Standard Error of Estimate: 

Definition
Degree of measurement error in prediction from 1 variable to another


Term
Correction for Attenuation: 

Definition
Extent to which unreliability in test scores diminishes the correlation between 2 or more sets of test scores 


Term
Differences between Standards 

Definition
•Standard Deviation (dispersion in set of test scores)
•Standard Error of Measurement (error in a single test score)
•Standard Error of Estimate (error in prediction from 1 test score to another)
•Standard Error of the Mean (sampling error in average test score)
•Standard Error of Correlation (sampling error in correlation of test scores) 


Term
The Importance of Spearman-Brown 

Definition
•Major way of making tests more reliable is to make them longer
•Major source of error in tests is content error
•Longer tests have less content error than shorter tests
•CTT assumes subject error is minimized because of large samples
•Spearman-Brown Prophecy Formula: rkk = k r11/(1 + (k – 1) r11)
•What would be the reliability of a test that was increased by a factor of k? A 20-item test has a reliability of .70 and is increased to 60 items (k = 3).
•rkk = 3(.7)/(1 + (3 – 1)(.7)) = 2.1/2.4 = .88
•Spearman-Brown can also be used to estimate the reliability of a shortened test using the same formula: a 100-item test with r = .95 shortened to a 50-item test (k = .50)
•rxx’ = .50 (.95)/(1 + (.50 – 1)(.95))
•rxx’ = .475/.525 = .90
•You can also rearrange the formula to estimate the lengthening factor k required to obtain a desired level of reliability:
•k = rkk (1 – r11)/(r11 (1 – rkk))
•k = .80 (1 – .5)/((.5)(1 – .8)) = .4/.1 = 4 
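The prophecy formula and its rearrangement can be sketched in Python (function names are illustrative):

```python
def spearman_brown(r11, k):
    """Projected reliability when test length changes by factor k."""
    return k * r11 / (1 + (k - 1) * r11)

def length_factor(r11, target):
    """Lengthening factor k needed to move reliability from r11 to target."""
    return target * (1 - r11) / (r11 * (1 - target))

print(spearman_brown(0.70, 3))    # ~0.875: the 20 -> 60 item example
print(spearman_brown(0.95, 0.5))  # ~0.905: the 100 -> 50 item example
print(length_factor(0.50, 0.80))  # ~4: quadruple the test length
```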


Term
Importance of Standard Error of Measurement 

Definition
•SEM reflects degree of measurement error around an obtained score
•SEM is standard deviation of infinite number of parallel tests
•SEM reflects degree of confidence one has in a test score
•SEM is a function of test reliability and variability in test performance
•SEM places confidence intervals around test scores
•As reliability of test decreases, SEM approaches the SD of a test
•±1 SEM = 68% CI; ±2 SEM = 95% CI; ±3 SEM = 99% CI
•r = .93 & SD = 15; X = 76; SEM = 4 points
•68% CI: 72 – 80
•95% CI: 68 – 84
•99% CI: 64 – 88
•You can also place a SEM around the estimated true score
•Xtrue = rxx (X – M) + M
•Xtrue = .93 (76 – 100) + 100 = 78 
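The SEM and its confidence intervals can be sketched in Python (function names are illustrative):

```python
import math

def sem(sd, rxx):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - rxx)

def confidence_interval(x, sd, rxx, n_sem):
    """Band of +/- n_sem SEMs around an obtained score (2 SEM ~ 95%)."""
    half = n_sem * sem(sd, rxx)
    return (x - half, x + half)

print(round(sem(15, 0.93)))                  # 4 points, as in the example
print(confidence_interval(76, 15, 0.93, 2))  # roughly the 68 - 84 band
```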


Term
Reliability Coefficient & SEM
Internal consistency reliability (α)


Definition
–Invariably higher than test-retest
–Produces smaller SEMs
–Based on average inter-item correlation
–Reflects precision of test score on a given day
–IQ score on Monday 


Term
Reliability Coefficient & SEM
Test-retest reliability (r)


Definition
–Invariably lower than α
–Produces larger SEMs
–Based on correlation over time (Time 1/Time 2)
–Reflects precision of test scores on any given day
–IQ score in September vs IQ score in January 


Term
Standard Error of Measurement: A Practical Example 

Definition
•Forrest is administered a WAIS-IV and obtains a FSIQ of 71
•Diagnosis of MR requires a score of 2 SDs below mean (70)
•Based on Forrest’s IQ of 71, is he MR?
•Internal consistency: rxx of FSIQ = .97, SD of FSIQ = 15, SEM = 2 points
–68% CI: 68–74; 95% CI: 65–76; 99% CI: 62–79
•Test-retest: r1,2 = .95, SD of FSIQ = 15, SEM = 3 points
–68% CI: 68–74; 95% CI: 65–77; 99% CI: 62–80
•Conclusion? What should we do with Forrest? 


Term
Generalizability Theory: An Alternative to CTT 

Definition
•Extends notion of measurement error beyond CTT
•Offers way of assessing multiple sources of error (facets) concurrently
•CTT lumps all sources of error into one estimate (cannot separate)


Term
Generalizability Theory:
Dependability


Definition
–Accuracy of generalizing from a person’s observed score on a test or measure to the average score that person would have received under all conditions of measurement
•Single score obtained on one occasion on a form of a test with a single administrator is not fully dependable (multiple sources of error)


Term
Multifaceted Measurement Error 

Definition
–Persons
–Occasions
–Items
–Raters
–Settings 


Term
G-theory studies use what? 

Definition
•G-theory studies are investigated using ANOVA designs
•ANOVA designs separate sources of variation in test scores
•ANOVA designs can be:
–Crossed
–Nested
•ANOVA designs can be:
–Fixed effects
–Random effects 


Term
G Studies vs. D Studies

Definition
–G studies collect information
–D studies use the above information to make the best decision 


Term
G Study

Definition
–Anticipates multiple uses of measurement
–Provides as much information as possible about sources of error
–Incorporates this information into proper test interpretation 


Term
D Study

Definition
–Makes use of information in G Study to design best application
–Specifies which facets to be considered
–Specifies proper interpretation
–Estimates dependability based on increasing conditions (facets) of measurement 


Term
G & D Studies Systematic Direct Observations 

Definition
•14 students from 5th grade classroom
•DV: ontask/off task behavior
•5 observers (4 hour training session)
•SDOs collected twice a day for 10 consecutive days using momentary time sampling
•IOA=90%
•G Study
–Persons (62%)
–Time (1%)
–Setting (0%)
–Person x Time (0%)
–Person x Setting (13%)
–Time x Setting (0%)
–Person x Time x Setting (24%) 


Term
Decision Studies
and # of observations 

Definition
•Decision Study 1
–1 observation per day for 10 days
–G=.46
•Decision Study 2
–1 observation per day for 3 days
–G=.25
•Decision Study 3
–2 observations per day for 20 days
–G=.72
•Decision Study 4
–4 observations per day for 20 days
–G=.83



Term
D Studies: What Does It All Mean? 

Definition
•IOA was 90% but G reliability = .62 (Interpretation?)
•Adequate reliability only obtained if observations collected 4 times per day for 40 school days
•2400 minutes or 40 hours of observation (What would JT say?)
•IOA as proxy for accuracy of measurement
•No incontrovertible index with which to compare observed scores
•SDOs should not be used in isolation (other methods needed)
•SDOs certainly not a “gold standard” measurement method 


Term
G Study Behavior Rating Scales 

Definition
•G Study of BASC & TRF (Achenbach) Externalizing Behavior
•6 teacher pairs in Grades 1–5 rated 61 students
•α = .90–.97; r1,2 = .70–.90; rinterrater = .60–.76
•Dependability Coefficients
–Externalizing Composites=.68
–Aggression=.59
–Oppositional Defiant=.58
–Conduct Problems=.47
•Dependability coefficients weaker than bivariate correlations
•Dependability coefficients all in moderate range
•Considering multiple sources of error attenuates dependability
•Should not rely solely on rating scales in assessment
•Rating scales certainly not a “gold standard” method



Term
G & D Studies Direct Behavior Rating 

Definition
•Academic engagement
–SDO
–DBR
•Data collected over 10 consecutive school days
•DBR
–Teacher cued to start observation
–End of period teachers rated student behavior
–100 mm line divided into 11 equal gradients (never–sometimes–always)
•SDO
–Momentary time sampling
–15s interval
•Design:
–Raters nested in Methods X Observation Periods nested in Days X Persons
–[p x (r:m) x (o:d)] 


Term
G & D Studies Direct Behavior Rating
Results 

Definition
•12 persons x 4 raters x 10 days x 3 rating periods
•1440 total ratings
•ϕ=.77
•D Studies
–1 observation/day SDO
•1 day: .50
•5 days: .83
•10 days: .91
•15 days: .93
•20 days: .98
•100 days: .99
–1 observation/day DBR
•1 day: .48
•5 days: .82
•10 days: .91
•15 days: .93
•20 days: .97
•100 days: .99 


Term
Implications for Practice from Gstudies 

Definition
•More dependable estimates obtained via SDO
•SDO records behavior every 15 s vs. every 15 min (DBR)
•Sufficient reliability SDO after 3 sessions vs. 20 DBR ratings
•Quick decisions best made using SDO
•Findings inconsistent with Hintz & Matthews study
•DBRs less intrusive & can measure low frequency behaviors
•DBRs less time consuming & require less training effort 

