Shared Flashcard Set

Details

Title

Psyc 7165

Description

Reliability lecture

Total Cards

Subject

Psychology

Level

Graduate

Created

10/04/2011


Term

Psychometric Inference

Definition

•Characteristics measured represent quantity of an attribute

Term

What types of characteristics can be measured by psychological inference?

Definition

Characteristics may be overt (directly observable behavior) or covert (intelligence, self-esteem, working memory)

Term

What is the major problem in measurement?

Definition

The major problem in measurement is that one often has no basis for assuming that a numerical score accurately reflects the quantity of interest

Term

What does assessment often involve?

Definition

Assessment often involves measurement of constructs

Term

What is a true score?

Definition

–Average value examinee obtains over infinite number of observed scores

Term

What do errors of measurement reflect in true scores?

Definition

Errors of measurement reflect discrepancy between observed scores & the true score

–Standard error of measurement (SEM) is standard deviation of infinite number of observed scores

Term

What is the formula for estimating true scores?

Definition

True score can be estimated: Xtrue = rxx (X - M) + M

•.90 (110 - 100) + 100 = 109

•.90 (90 - 100) + 100 = 91

•What explains the above?
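
The estimation formula on this card can be sketched in a few lines of Python. This is only an illustration: the function name `estimated_true_score` is made up for this example. One common reading of the card's closing question is regression toward the mean: because the reliability is below 1, both estimates are pulled from the observed score toward the mean of 100.

```python
def estimated_true_score(observed, mean, reliability):
    """Estimate a true score as rxx * (X - M) + M (hypothetical helper name)."""
    return reliability * (observed - mean) + mean


# The two worked examples from the card: reliability .90, mean 100.
high = estimated_true_score(110, 100, 0.90)
low = estimated_true_score(90, 100, 0.90)

# Both estimates sit closer to the mean than the observed scores do.
print(round(high, 2), round(low, 2))
```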

Term

Can an individual have more than one true score?

Definition

Yes

Term

Tell me about Biological or Absolute True Scores

Definition

–Absolute true score exists independent of measurement process used

–Errors of measurement still occur

–Individuals have only 1 absolute true score (e.g., DNA, blood pressure, cholesterol level, cancer, pregnancy)

•Different lab tests may yield different results but one would never average tests to estimate an absolute true score

Term

Ways to measure response strength in operant behavior

Definition

–Frequency

–Duration

–Latency

–Interresponse time

Term

Sources of Error for operant behavior

Definition

–Less than perfect IOA

–Can we generalize reliability to all other observers?

–Is there a standard error of measurement?

Term

Threats to Measurement of operant behavior

Definition

–Observer bias

–Observer drift

–Code complexity

Term

What does accuracy refer to in operant behavior measurement?

Definition

Degree to which measurement reflects true value

Term

Types of Test Scores

Definition

•Raw scores

•Composite scores

•Percentile ranks (1-99)

•Stanines (M = 5, SD = 2)

•Normal curve equivalents (M = 50, SD =21.06)

•Standard scores

–z scores (M=0, SD=1)

–T scores (M=50, SD=10)

–Scale scores (M=10, SD=3)

–DIQ scores (M=100, SD=15)

•Equivalence of standard scores

–z of 1=T of 60=DIQ of 115=SS of 13=NCE of 71.06
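
The equivalence line above follows from the fact that every standard score is a linear rescaling of z. A minimal sketch, assuming the means and SDs listed on this card; the dictionary `SCALES` and the function `convert` are names invented for the illustration:

```python
# (mean, SD) pairs as listed on the card
SCALES = {
    "z": (0.0, 1.0),
    "T": (50.0, 10.0),
    "scale": (10.0, 3.0),
    "DIQ": (100.0, 15.0),
    "NCE": (50.0, 21.06),
}


def convert(score, source, target):
    """Convert a score between two metrics by passing through z."""
    src_mean, src_sd = SCALES[source]
    tgt_mean, tgt_sd = SCALES[target]
    z = (score - src_mean) / src_sd
    return tgt_mean + z * tgt_sd


# Express z = 1 in every metric on the card.
for name in SCALES:
    print(name, round(convert(1.0, "z", name), 2))
```

Percentile ranks and stanines are left out on purpose: they are not linear transformations of z, so this simple rescaling does not apply to them.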

Term

Types of Norms

Definition

•Age norms

•Grade norms

•Gender norms

•Special group norms

•Percentile rank norms

•Standard score norms

•National norms

•State norms

•District norms

•School norms

Term

How long has Classical Test Theory (CTT) been around?

Definition

Over 100 years

Term

What are test scores made of in CTT

Definition

•Test scores made up of true score component & error score component

•X = True score + Error score

•True score is hypothetical (never actually known but can be estimated)

•True score is average score on infinite number of tests (model of parallel tests)

•Standard deviation of these tests is standard error of measurement

Term

What reflects the reliability of a test in CTT?

Definition

Degree of error in a test reflects the reliability of the test

Term

How is reliability defined in CTT?

Definition

Reliability defined as true score variance relative to total variance in scores

•R = True score variance/Total variance (R = 75/100 = .75)

•Or R = 1 - Error variance/Total variance (R = 1 - 25/100 = .75)

•Or True score variance = Total variance x R (100 x .75 = 75)
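
The three forms of the definition are the same ratio rearranged. A small sketch using the card's numbers (total variance 100, true score variance 75, error variance 25); the function names are hypothetical:

```python
def reliability(true_variance, total_variance):
    """R = true score variance / total variance."""
    return true_variance / total_variance


def reliability_from_error(error_variance, total_variance):
    """R = 1 - error variance / total variance."""
    return 1 - error_variance / total_variance


def true_score_variance(total_variance, r):
    """True score variance = total variance x R."""
    return total_variance * r


print(reliability(75, 100))
print(reliability_from_error(25, 100))
print(true_score_variance(100, 0.75))
```

Note that the third form returns a variance (75 in the card's units), not a reliability coefficient.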

Term

Fundamental Assumptions in CTT

Definition

•Individuals possess stable traits or characteristics (true scores) that persist through time

•Errors of measurement are completely random (due entirely to unsystematic variation in test scores)

•Fallible scores are the result of the addition of true & error scores Xobtained = Xtrue + Xerror

•Based on above logic, an individual’s true score is exactly the same on all parallel tests

•Also (based on the above logic), an individual’s fallible score will vary from one parallel test to another (based on differences in reliability)

•Also (and confusingly), an individual can have more than 1 true score (e.g., WISC, Stanford-Binet, Woodcock-Johnson)

Term

Major Sources of Error in Testing

Definition

•Errors associated with specific situation

•Errors associated with different occasions (situational-centered & person-centered sources of variation)

•Errors associated with different test contents (2 or more measures thought to be parallel may not be parallel)

•Errors associated with subjective scoring systems (ratings, essay examinations, oral examinations, etc.)

Term

What Makes Tests Parallel?

Definition

•If they have the same mean

•If they have the same standard deviation

•If they correlate the same with a set of true scores

•If all their variance that is not explainable by true scores is pure random error

Term

Methods for Estimating Reliability of Tests

Definition

•Correlations between scores on repetitions of same test (coefficient of stability)

•Correlations among scores on parallel forms of a test (coefficient of equivalence)

•Correlations between repetitions & parallel forms (coefficient of stability & equivalence)

•Correlations between comparable halves of a test (split-half reliability)

•Intercorrelations among all components of a test (inter-item correlation)

Term

Indices of Reliability and Error

Definition

•Reliability Index:

•Reliability Coefficient:

•Coefficient Alpha:

•Spearman-Brown Formula:

•Standard Error of Measurement:

•Standard Error of Estimate:

•Correction for Attenuation:

Term

Reliability Index:

Definition

• Correlation between true scores & fallible scores

Term

Reliability Coefficient:

Definition

Correlation between scores on parallel test forms or how well scores on 1 parallel test can predict scores on another parallel test

Term

Coefficient Alpha:

Definition

Reflects the average inter-item correlation or the reliability of an item sample in a content domain

Term

Spearman-Brown Formula:

Definition

Demonstrates the relationship between test length & measurement error

Term

Standard Error of Measurement:

Definition

Extent to which an individual’s scores vary over a series of parallel tests

Term

Standard Error of Estimate:

Definition

Degree of measurement error in prediction from 1 variable to another

Term

Correction for Attenuation:

Definition

Extent to which unreliability in test scores diminishes the correlation between 2 or more sets of test scores
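
The card gives no formula, so the sketch below assumes the standard Spearman correction, in which the observed correlation is divided by the square root of the product of the two reliabilities. The function name and the example values are hypothetical and not taken from the lecture.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores on X and Y.

    r_xy: observed correlation; r_xx, r_yy: reliabilities of the two tests.
    """
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values: observed r = .40, reliabilities .80 and .50.
corrected = correct_for_attenuation(0.40, 0.80, 0.50)
print(round(corrected, 3))
```

The less reliable the two measures, the larger the gap between the observed and the corrected correlation.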

Term

Differences between Standards

Definition

•Standard Deviation (dispersion in set of test scores)

•Standard Error of Measurement (error in a single test score)

•Standard Error of Estimate (error in prediction from 1 test score to another)

•Standard Error of the Mean (sampling error in average test score)

•Standard Error of Correlation (sampling error in correlation of test scores)
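
To make the distinctions concrete, here is a rough sketch of the first four quantities using their usual textbook formulas, which are an assumption here because the card does not state them. The scores and function names are invented for the example; the standard error of a correlation is omitted because several approximations are in use.

```python
import math
import statistics

scores = [85, 90, 100, 110, 115]  # hypothetical set of test scores


def sem(sd, r_xx):
    """Standard error of measurement: error in a single test score."""
    return sd * math.sqrt(1 - r_xx)


def see(sd_y, r_xy):
    """Standard error of estimate: error in predicting Y from X."""
    return sd_y * math.sqrt(1 - r_xy ** 2)


def se_mean(sd, n):
    """Standard error of the mean: sampling error in an average score."""
    return sd / math.sqrt(n)


sd = statistics.stdev(scores)  # dispersion in the set of scores
print(round(sd, 2))
print(round(sem(sd, 0.90), 2))
print(round(see(sd, 0.60), 2))
print(round(se_mean(sd, len(scores)), 2))
```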

Term

The Importance of Spearman-Brown

Definition

•Major way of making tests more reliable is to make them longer

•Major source of error in tests is content error

•Longer tests have less content error than shorter tests

•CTT assumes subject error is minimized because of large samples

•Spearman-Brown Prophecy Formula: rkk = k(r11) / [1 + (k - 1)(r11)]

•What would be the reliability of a test that was lengthened by a factor of k? A 20-item test has a reliability of .70 and is increased to 60 items (k = 3).

•rkk = 3(.70) / [1 + (3 - 1)(.70)] = 2.10/2.40 = .88

•Spearman-Brown can also be used to estimate reliability after shortening a test using the same formula: a 100-item test with r = .95 is shortened to a 50-item test (k = .50)

•rxx’ = .50(.95) / [1 + (.50 - 1)(.95)]

•rxx’ = .475/.525 = .90

•You can also rearrange the formula to estimate the factor K by which a test must be lengthened to obtain a desired level of reliability:

•K = rkk(1 - r11) / [r11(1 - rkk)]

•K = .80(1 - .50) / [(.50)(1 - .80)] = .40/.10 = 4
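
The three worked examples on this card can be checked with a short sketch. The function names `spearman_brown` and `length_factor_needed` are invented for the illustration; the formulas are the ones given above.

```python
def spearman_brown(r11, k):
    """Predicted reliability when test length is multiplied by k."""
    return k * r11 / (1 + (k - 1) * r11)


def length_factor_needed(r11, r_target):
    """Factor by which the test must be lengthened to reach r_target."""
    return r_target * (1 - r11) / (r11 * (1 - r_target))


# 20 items with r = .70 lengthened to 60 items (k = 3): about .88
print(spearman_brown(0.70, 3))

# 100 items with r = .95 shortened to 50 items (k = .50): about .90
print(spearman_brown(0.95, 0.5))

# Going from r = .50 to r = .80 needs a test about 4 times as long
print(length_factor_needed(0.50, 0.80))
```

Multiplying the factor by the current number of items gives the required test length, assuming the added items are comparable to the existing ones.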

Term

Importance of Standard Error of Measurement

Definition

•SEM reflects degree of measurement error around an obtained score

•SEM is standard deviation of infinite number of parallel tests

•SEM reflects degree of confidence one has in a test score

•SEM is a function of test reliability and variability in test performance

•SEM places confidence intervals around test scores

•As reliability of test decreases, SEM approaches the SD of a test

•+/- 1 SEM = 68% CI; +/- 2 SEM = 95% CI; +/- 3 SEM = 99% CI

•r = .93, SD = 15, X = 76: SEM = 15 x √(1 - .93) ≈ 4 points

•68% CI: 72-80

•95% CI: 68-84

•99% CI: 64-88

•You can also place a SEM around the estimated true score

•Xtrue = rxx (X – M) + M

•Xtrue= .93 (76-100) + 100 = 78
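
The numbers on this card can be reproduced with the usual formula SEM = SD x sqrt(1 - r), which is assumed here since the card only says that SEM depends on reliability and variability. The helper names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_band(score, sem, n_sems):
    """Band of +/- n_sems SEMs around a score."""
    return (score - n_sems * sem, score + n_sems * sem)


# Card example: r = .93, SD = 15, obtained score X = 76.
sem = round(standard_error_of_measurement(15, 0.93))  # rounded to whole points, as on the card
for n_sems in (1, 2, 3):
    print(n_sems, confidence_band(76, sem, n_sems))

# The same band can be centred on the estimated true score instead.
x_true = 0.93 * (76 - 100) + 100
print(confidence_band(round(x_true), sem, 1))
```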

Term

Reliability Coefficient & SEM

Internal consistency reliability (α)

Definition

–Invariably higher than test-retest

–Produces smaller SEMs

–Based on average inter-item correlation

–Reflects precision of test score on a given day

–IQ score on Monday

Term

Reliability Coefficient & SEM

Test-retest reliability (r)

Definition

–Invariably lower than α

–Produces larger SEMs

–Based on correlation over time (Time 1/Time 2)

–Reflects precision of test scores on any given day

–IQ score in September vs IQ score in January

Term

Standard Error of Measurement:
A Practical Example

Definition

•Forrest is administered a WAIS-IV and obtains a FSIQ of 71

•Diagnosis of MR requires a score of 2 SDs below mean (70)

•Based on Forrest’s IQ of 71, is he MR?

•Using rxx of FSIQ = .97, SD of FSIQ = 15: SEM = 2 points

–68% CI: 68-74

–95% CI: 65-76

–99% CI: 62-79

•Using r1,2 of FSIQ = .95, SD of FSIQ = 15: SEM = 3 points

–68% CI: 68-74

–95% CI: 65-77

–99% CI: 62-80

•Conclusion? What should we do with Forrest?
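
One way to approach the question is to check whether the cutoff of 70 falls inside the band around Forrest's score. The sketch below assumes SEM = SD x sqrt(1 - r) and keeps the SEM unrounded (about 2.6 and 3.4 points), so its bands can differ by a point from the whole-number figures on the card. All names are hypothetical.

```python
import math

CUTOFF = 70  # 2 SDs below the mean
FSIQ = 71    # Forrest's obtained score
SD = 15


def band(score, sd, reliability, n_sems):
    """Band of +/- n_sems unrounded SEMs around a score."""
    sem = sd * math.sqrt(1 - reliability)
    return (score - n_sems * sem, score + n_sems * sem)


# The two reliability values given on the card.
for label, r in (("rxx = .97", 0.97), ("r1,2 = .95", 0.95)):
    low, high = band(FSIQ, SD, r, 1)
    inside = low <= CUTOFF <= high
    print(label, round(low, 1), round(high, 1), "cutoff inside band:", inside)
```

With either reliability value the cutoff lies within one SEM of the obtained score, so the score alone cannot settle the question.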

Term

Generalizability Theory:
An Alternative to CTT

Definition

•Extends notion of measurement error beyond CTT

•Offers way of assessing multiple sources of error (facets) concurrently

•CTT lumps all sources of error into one estimate (cannot separate)

Term

Generalizability Theory:

Dependability

Definition

–Accuracy of generalizing from a person’s observed score on a test or measure to the average score that person would have received under all conditions of measurement

•Single score obtained on one occasion on a form of a test with a single administrator is not fully dependable (multiple sources of error)

Term

Multifaceted Measurement Error

Definition

–Persons

–Occasions

–Items

–Raters

–Settings

Term

G-theory Studies use what?

Definition

•G theory studies are investigated using ANOVA designs

•ANOVA designs separate sources of variation in test scores

•ANOVA designs can be:

–Crossed

–Nested

•ANOVA designs can be:

–Fixed effects

–Random effects

Term

G Studies vs. D Studies

Definition

–G studies collect information

–D studies use the above information to make the best decision

Term

What do G-studies do?

Definition

–Anticipates multiple uses of measurement

–Provides as much information as possible about sources of error

–Incorporates this information into proper test interpretation

Term

What do D-Studies do?

Definition

–Makes use of information in G Study to design best application

–Specifies which facets to be considered

–Specifies proper interpretation

–Estimates dependability based on increasing conditions (facets) of measurement

Term

G & D Studies
Systematic Direct Observations

Definition

•14 students from 5th grade classroom

•DV: on-task/off task behavior

•5 observers (4 hour training session)

•SDOs collected twice a day for 10 consecutive days using momentary time sampling

•IOA=90%

•G Study

–Persons (62%)

–Time (1%)

–Setting (0%)

–Person x Time (0%)

–Person x Setting (13%)

–Time x Setting (0%)

–Person x Time x Setting (24%)
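
As a rough illustration of how variance components turn into a generalizability coefficient, the sketch below assumes a fully crossed persons x time x setting random-effects design and relative (rank-order) decisions. That design, the function name `relative_g`, and the dictionary keys are assumptions, not details given on the card. For a single observation the formula gives about .626, close to the .62 quoted later in the set; it is not expected to reproduce the Decision Study values, which depend on design details the cards do not give.

```python
# Variance components from the card, as percentages of total variance
components = {
    "p": 62.0,    # persons
    "t": 1.0,     # time
    "s": 0.0,     # setting
    "pt": 0.0,    # person x time
    "ps": 13.0,   # person x setting
    "ts": 0.0,    # time x setting
    "pts": 24.0,  # person x time x setting (includes residual error)
}


def relative_g(vc, n_t, n_s):
    """Person variance over person variance plus relative error variance.

    Only interactions involving persons count as relative error, and each
    is divided by the number of conditions averaged over.
    """
    error = vc["pt"] / n_t + vc["ps"] / n_s + vc["pts"] / (n_t * n_s)
    return vc["p"] / (vc["p"] + error)


print(round(relative_g(components, n_t=1, n_s=1), 3))
```

Averaging over more occasions or settings shrinks the error term, which is the logic a D study uses when it asks how many observations are needed.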

Term

Decision Studies

and # of observations

Definition

•Decision Study 1

–1 observation per day for 10 days

–G=.46

•Decision Study 2

–1 observation per day for 3 days

–G=.25

•Decision Study 3

–2 observations per day for 20 days

–G=.72

•Decision Study 4

–4 observations per day for 20 days

–G=.83

Term

D Studies
What Does It All Mean

Definition

•IOA was 90% but G reliability = .62 (Interpretation?)

•Adequate reliability only obtained if observations collected 4 times per day for 4 weeks (20 school days)

•2400 minutes or 40 hours of observation (What would JT say?)

•IOA as proxy for accuracy of measurement

•No incontrovertible index with which to compare observed scores

•SDOs should not be used in isolation (other methods needed)

•SDOs certainly not a “gold standard” measurement method

Term

G Study
Behavior Rating Scales

Definition

•G Study of BASC & TRF (Achenbach) Externalizing Behavior

•6 teacher pairs Grades 1-5 rated 61 students

•α = .90-.97; r1,2 = .70-.90; rinterrater = .60-.76

•Dependability Coefficients

–Externalizing Composites=.68

–Aggression=.59

–Oppositional Defiant=.58

–Conduct Problems=.47

•Dependability coefficients weaker than bivariate correlations

•Dependability coefficients all in moderate range

•Considering multiple sources of error attenuates dependability

•Should not rely solely on rating scales in assessment

•Rating scales certainly not a “gold standard” method

Term

G & D Studies
Direct Behavior Rating

Definition

•Academic engagement

–SDO

–DBR

•Data collected over 10 consecutive school days

•DBR

–Teacher cued to start observation

–At end of period, teachers rated student behavior

–100 mm line divided into 11 equal gradients (never-sometimes-always)

•SDO

–Momentary time sampling

–15-s interval

•Design:

–Persons x Raters (nested within Methods) x Observation Periods (nested within Days)

–p x (r:m) x (o:d)

Term

G & D Studies
Direct Behavior Rating

Results

Definition

•12 persons x 4 raters x 10 days x 3 rating periods

•1440 total ratings

•ϕ=.77

•D Studies

–1 observation/day SDO

•1 day: .50

•5 days: .83

•10 days: .91

•15 days: .93

•20 days: .98

•100 days: .99

–1 observation/day DBR

•1 day: .48

•5 days: .82

•10 days: .91

•15 days: .93

•20 days: .97

•100 days: .99

Term

Implications for Practice from G-studies

Definition

•More dependable estimates obtained via SDO

•SDO records behavior every 15 s vs. every 15 min (DBR)

•Sufficient reliability with SDO after 3 sessions vs. 20 DBR ratings

•Quick decisions best made using SDO

•Findings inconsistent with Hintze & Matthews study

•DBRs less intrusive & can measure low frequency behaviors

•DBRs less time consuming & require less training effort
