Details

Tests and Measurement Exam 3
USF CLP 4433 Dr. Stark
55
Psychology
Undergraduate 4
11/16/2011

Cards

Term
Base Rate (BR)
Definition
proportion of people in population who can successfully do the job
Term
Selection Ratio (SR)
Definition
proportion of persons hired/admitted (SR = #selected / #applicants)
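Example (a minimal Python sketch of both definitions; the counts are invented for illustration):

    # Base rate: proportion of the population who can succeed at the job.
    # Selection ratio: proportion of applicants who are hired/admitted.
    def base_rate(n_can_do_job, n_population):
        return n_can_do_job / n_population

    def selection_ratio(n_selected, n_applicants):
        return n_selected / n_applicants

    print(base_rate(60, 100))        # BR = 0.6
    print(selection_ratio(20, 100))  # SR = 0.2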
Term
The accuracy of selection decisions based on test scores depends on 3 factors:
Definition
Base Rate
Selection Ratio
Test Validity
Term
There are four possible outcomes of every selection decision. These can be arranged in a table.
Definition
True Positive (TP)
False Positive (FP)
False Negative (FN)
True Negative (TN)
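A sketch of the implied 2x2 table (decision on the rows, actual job success on the columns):

                 Successful        Unsuccessful
    Selected     True Positive     False Positive
    Rejected     False Negative    True Negative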
Term
When the base rate is high
Definition
there will be more TP and FN.
Term
When the base rate is low
Definition
there will be more TN and FP.
Term
Taylor-Russell Table
Definition
--The relationship of validity coefficients to the practical effectiveness of tests in selection.
--Taylor and Russell catalogued expected hit rates (TP) for different base rates, selection ratios, and validity coefficients.
--These tables are based on the premise that an organization wants to maximize True Positive decisions; other outcomes are not considered.
--The tables assume bivariate normality, which can be violated if a test has floor or ceiling effects. Unless this violation is severe, the tables are reasonably accurate.
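Example (a hedged Python sketch of the Taylor-Russell logic under the bivariate-normality assumption; scipy is assumed to be available, and the function name is invented):

    from scipy.stats import norm, multivariate_normal

    def taylor_russell_hit_rate(validity, base_rate, selection_ratio):
        """Expected P(successful | selected) when test score X and job
        performance Y are bivariate normal with correlation = validity."""
        zx = norm.ppf(1 - selection_ratio)  # test-score cutoff
        zy = norm.ppf(1 - base_rate)        # job-performance cutoff
        bvn = multivariate_normal(mean=[0, 0],
                                  cov=[[1, validity], [validity, 1]])
        # P(X > zx and Y > zy) by inclusion-exclusion on the joint CDF
        p_tp = 1 - norm.cdf(zx) - norm.cdf(zy) + bvn.cdf([zx, zy])
        return p_tp / selection_ratio       # divide by P(selected) = SR

    # e.g., validity .50, base rate .60, selection ratio .20
    print(round(taylor_russell_hit_rate(0.50, 0.60, 0.20), 2))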
Term
Test reliability can be increased in two ways:
Definition
-- by adding items that correlate positively with others
--by removing items that are problematic (too wordy, tricky/confusing, too hard/easy)
Term
Multiple-choice items
1. Body of a multiple choice item
2. The choices that follow
3. Incorrect response options
Definition
1. Stem
2. Response Options
3. Distractors
Term
2 possibilities for scoring multiple-choice items
Definition
If response options can be ordered to reflect different degrees of correctness, a 3 might be assigned to the right answer, 2 for the next best, 1 for the next, and 0 for completely incorrect. This scheme awards points for partial knowledge. (polytomous scoring)
On the other hand, it is more common to simply assign 1 for choosing the correct response and 0 for all other responses (dichotomous scoring).
Term
According to Murphy, a "perfect test item" has two characteristics
Definition
--all people who know the answer will choose the correct response
--those who do not know the answer will choose randomly among the distractors, which implies (some respondents will guess correctly; each possible incorrect response will be equally popular)
Term
Distractors that are rarely chosen
Definition
decrease the difficulty of an item
Term
Test items may be scored DICHOTOMOUSLY
Definition
**two possible scores for each item
EX: math items have a right and a wrong answer --assign 0 if the wrong answer is chosen --assign 1 if the right answer is chosen
Term
Survey and multiple choice items may be scored POLYTOMOUSLY
Definition
**3 or more possible scores per item
EX: attitude surveys do not have right and wrong answers (e.g., -2 = SD, -1 = D, etc.)
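Example (a minimal Python sketch contrasting the two schemes; the key and weights are invented):

    # Dichotomous scoring: 1 for the keyed answer, 0 otherwise.
    KEY = "B"
    def score_dichotomous(response):
        return 1 if response == KEY else 0

    # Polytomous scoring: graded credit per response option,
    # e.g., a Likert attitude item keyed SD=-2, D=-1, N=0, A=1, SA=2.
    LIKERT = {"SD": -2, "D": -1, "N": 0, "A": 1, "SA": 2}
    def score_polytomous(response):
        return LIKERT[response]

    print(score_dichotomous("B"), score_dichotomous("C"))  # 1 0
    print(score_polytomous("A"), score_polytomous("SD"))   # 1 -2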
Term
Traditional methods of item analysis
Definition
judge the quality of items with respect to the intended sample of test takers. Three psychometric properties are important:
--how difficult/easy the item is for the target group of examinees
--how well the item discriminates among persons having different levels of ability
Good test items are of moderate difficulty and discriminate well among examinees.
Term
To examine the difficulty and discriminating power of items, we often consider 3 basic statistics:
Definition
** P-values
** Item-total correlations
** Inter-item correlations
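Example (a Python sketch computing all three statistics for a small 0/1 response matrix; numpy is assumed and the data are invented):

    import numpy as np

    # Rows = examinees, columns = items; 1 = correct, 0 = incorrect.
    X = np.array([[1, 1, 0, 1],
                  [1, 0, 0, 1],
                  [1, 1, 1, 1],
                  [0, 0, 0, 1],
                  [1, 1, 0, 0]])

    p_values = X.mean(axis=0)  # proportion answering each item correctly
    total = X.sum(axis=1)      # total test score per examinee
    # Item-total correlations (the item is included in the total here;
    # a "corrected" version would subtract it first).
    item_total = [np.corrcoef(X[:, j], total)[0, 1] for j in range(X.shape[1])]
    inter_item = np.corrcoef(X, rowvar=False)  # item x item correlation matrix

    print(p_values)
    print(np.round(item_total, 2))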
Term
P-Values
Definition
the proportion of persons correctly answering or endorsing an item
high p-value -> easy item (too high: > .9)
Term
Item-Total Correlations
Definition
correlation of responses to individual test items with the total test score
(items with correlations greater than .3 you would want to keep!)
Term
Inter-Item Correlations
Definition
(the correlation of items with each other)
--in general, a reliable test can be created by adding items that correlate positively with each other, even if the correlations are small (0.2)
--Inter-item correlations are LARGE when test content is homogeneous and small when heterogeneous
Term
Test reliability is influenced
Definition
by the variance of total test scores
Term
One way to increase test score variance is to
Definition
select items having p-values near 0.5
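The reason: a dichotomously scored item's variance is p(1 - p), which peaks at p = .5. A quick check:

    # Variance of a 0/1 item is p * (1 - p); it is largest when p = .5.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, round(p * (1 - p), 2))  # .09 .21 .25 .21 .09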
Term
Items considered bad (i.e., not useful for the target sample) have the following general properties:
Definition
*p-values less than 0.1 or greater than 0.9
*negative or very low (<.1) inter-item correlations
*negative or low item-total correlations (<0.3)
Term
P-values less than 0.1 or greater than 0.9
Definition
items having p-values less than 0.1 or greater than 0.9 contribute little to test variance. They don't differentiate among examinees of high and low ability. So, they can be dropped without loss of measurement precision.
Term
Negative or very low (<.1) inter-item correlations
Definition
Negative or very low (<.1) inter-item correlations suggest that the test is measuring more than 1 construct.
Removing items having negative or very low inter-item correlations will increase internal consistency reliability (recall, coefficient alpha assumes homogeneity)
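Example (a Python sketch of coefficient alpha, which rises as positive inter-item correlations rise; numpy assumed, score matrix X as in the earlier sketch):

    import numpy as np

    def cronbach_alpha(X):
        """Coefficient alpha for an examinee-by-item score matrix."""
        k = X.shape[1]                           # number of items
        item_vars = X.var(axis=0, ddof=1).sum()  # sum of item variances
        total_var = X.sum(axis=1).var(ddof=1)    # variance of total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)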
Term
Negative or low item-total correlations (<.3)
Definition
Items having negative correlations with the total test score must be dropped. People who did well on the test did poorly on the item, indicating a possible problem with content.
Items having low item-total correlations are also candidates for removal.
Term
Broad test
Definition
inter-item correlations will be small (heterogeneous content)
Term
Narrow Test
Definition
inter-item correlations will be larger (homogeneous content)

*if a test is broad, more items are needed to achieve acceptable reliability levels (.7 or more)
Term
Item Response Theory (IRT)
Definition
*is a relatively new and powerful methodology for examining the properties of test items. *Items can be compared using parameters that reflect difficulty, discrimination, and the effects of guessing.
*The Greek letter theta (θ) is used to represent an examinee's trait level (score) (ability, skill, or standing on the construct measured by a test). Scores are standard normal, ranging from about -3 to +3.
*The quality of items is examined using item response functions (IRFs), which graphically illustrate the relationship between trait level and the probability of a correct response.
(plotted IRFs have an S shape)
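Example (a Python sketch of an S-shaped IRF under the common three-parameter logistic (3PL) model; a = discrimination, b = difficulty, c = guessing, with invented values):

    import math

    def irf_3pl(theta, a=1.0, b=0.0, c=0.2):
        """P(correct response | trait level theta) under the 3PL model."""
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    for theta in (-3, -1, 0, 1, 3):
        print(theta, round(irf_3pl(theta), 2))  # rises from ~c toward 1.0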
Term
Computerized adaptive testing (applications of IRT methods)
Definition
*create tests tailored to examinee ability
*administer only items that provide high information about the examinee, thus reducing the error in a person's score
*Adaptive tests require only about half as many items as nonadaptive tests to obtain a similar level of accuracy.
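Example (a hedged Python sketch of "high information" item selection using the 2PL information function I(theta) = a^2 * P * (1 - P); the item parameters are invented):

    import math

    def p_2pl(theta, a, b):
        return 1 / (1 + math.exp(-a * (theta - b)))

    def info_2pl(theta, a, b):
        p = p_2pl(theta, a, b)
        return a**2 * p * (1 - p)  # Fisher information of a 2PL item

    items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5)]  # (a, b) pairs, invented
    theta_hat = 0.4                                # current ability estimate
    best = max(items, key=lambda ab: info_2pl(theta_hat, *ab))
    print(best)  # administer next: the most informative item at theta_hat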
Term
Detecting biased items (applications of IRT methods)
Definition
*by comparing IRFs across groups, one can determine whether a test item exhibits psychometric bias (a.k.a. differential item functioning, DIF)
*An item is said to be biased if its IRFs differ across groups of examinees (e.g., men and women) after a process called "linking"
Term
Adaptive testing
Definition
tailored examinations (applications to groups or individuals)
Term
Constructing the test
Definition
*Selecting item types
*Item Writing
*Item Content
*Item response alternatives (response format)
Term
Selecting item types
Definition
*constructed response (short answer/essay) demonstrations of skill
*Low-fidelity simulation (describe how something should be produced)
*High-fidelity simulation (actually develop the product or do the task)
Term
Item Writing
Definition
*the first step in test construction is to generate a pool of items
*generally need 2-3 times as many items in the pool as you desire in the final version of the test
*items will be selected based on both content and psychometric properties
Term
Guidelines for item writing
Definition
AVOID
*Long items
*Double negatives
*Double-barreled statements (mixing different concepts; do not ask two things in one item)
*Sexist, racist, or offensive language
*Slang that may go out of date quickly
*Big, complicated, or esoteric words (EX: the word HOT can have different meanings)
DO select appropriate reading level for target group (e.g., 5th grade)
Term
Item Content
Definition
*generally there are two approaches to scale development: rational and empirical
*often a hybrid (mix) of these two approaches is used; called the rational-empirical method.
Term
Rational Scales (item content)
Definition
*create items based on a theory of behavior; some underlying thought, belief, or rationale is used as the basis for selecting items. Items are chosen on theoretical grounds.
*Advantage- can use theory to make predictions about behavior; good face validity
*Disadvantage- items tend to be transparent (i.e., clear what they are measuring), so responses are subject to conscious (faking) or unconscious (self-deception) distortion.
Term
Empirical Scales
Definition
*generate broad range of items- not tied to any theory
*compute correlation between item responses and some criterion variable
*select and retain items that predict well (i.e., have the highest correlation with the external criterion) and those that differentiate among members of different groups. For example, select items that best differentiate between schizophrenics and "normal" individuals.
*Items are scored by empirical keying (aka criterion keying)
*Advantage- …
*Disadvantage- lower face validity
Term
Item Response Alternatives (response format) EXAMPLES
Definition
-the response format refers to the manner in which responses will be collected from the examinees. Ex: True-False, Multiple Choice, Free Response, Auditory Response, Likert Type, Forced Choice
-MC popular because it can be scored objectively; difficult to write good distractors
-Free Response: get rich information, but requires subjective judgment to score; must often examine inter-rater agreement
Term
Response Sets
Definition
-Psychologists frequently use self-report measures; *some questions are perceived as too invasive or personal *sometimes persons are concerned about confidentiality, so they consciously distort responses (fake good, fake bad, respond randomly)
-Test developers try to control these effects, which are called RESPONSE SETS; *use scales designed to detect unusual responses *use warnings that unusual response can be detected and that verifiable information will be examined for accuracy
Term
Examples of response sets
Definition
Social Desirability: IDEA: persons tend to answer in ways that present themselves in the best light (fake good) or worst light (fake bad), rather than answer honestly
-Intentionally distorting one's responses is known as FAKING or DISSIMULATION
-Faking is a big issue in noncognitive assessment (personality, biographical data, worker diaries, etc.)
-There is sharp disagreement about the ramifications of faking
-Can you correct for faking after a measure has been administered? (research suggests no)
-Can you prevent faking by strategic construction of items or tests? (Maybe)
Term
Random Responding....
How can you try to detect it?
Definition
occurs when examinees fail to attend to the content of items because they are unmotivated, in a hurry, or unwilling to cooperate
-Try to detect by: using scales containing a mix of negatively and positively worded items and applying mathematical models
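Example (a Python sketch of the mixed-wording heuristic: reverse-key the negatively worded items and check whether answers to the two sets agree; the items, scale, and example responses are invented):

    import numpy as np

    def inconsistency(responses, pos_idx, neg_idx, scale_max=5):
        """Mean gap between positively worded items and reverse-keyed
        negatively worded items; random responders show large gaps."""
        pos = np.array([responses[i] for i in pos_idx], dtype=float)
        neg_rev = np.array([scale_max + 1 - responses[i] for i in neg_idx],
                           dtype=float)
        return float(np.abs(pos - neg_rev).mean())

    careful = [5, 4, 5, 1, 2, 1]  # consistent across keyed directions
    random_ = [3, 1, 5, 4, 2, 5]  # inconsistent
    print(inconsistency(careful, [0, 1, 2], [3, 4, 5]))  # 0.0
    print(inconsistency(random_, [0, 1, 2], [3, 4, 5]))  # ~2.67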
Term
Response Styles
Definition
*tendency to answer in a certain way; a characteristic you bring to the test
-Acquiescence: tendency to agree with items without attending to their content
-Criticalness: tendency to disagree with items without attending to their content
-Dealing with response styles: consider whether response styles are elicited by items that are ambiguous or confusing. Try to detect by using negatively and positively worded items, and perhaps including statements that would be clearly true or false. EX: it would be odd if a person agreed with the statement "I've never drunk water."
Term
Negative Halo
Definition
once the rater sees you do something bad, they will score you low on everything
Term
Positive Halo
Definition
if the rater likes you, they will score you high on everything
Term
Normative Tests
Definition
-allow for inter-individual (between person) comparisons
-compare each person's score to those of a normative group
-give indication of amount or level of trait exhibited
**can compare scores across people
Term
Ipsative Tests
Definition
allow only for intra-individual (within person) comparisons
-use a forced-choice format (paired comparison) where examinee must express a preference between two alternatives (think about the Carrots or Broccoli example)
-With forced choice like the carrots and broccoli example, you don't know how strong the liking of any vegetable is.
-Thus, ipsative scores cannot be used for inter-individual comparisons, as in job selection. This is the "challenge" for developing "fake-resistant" personality tests.
Term
Normative Scales vs Ipsative Scales
Definition
Normative
-can be used for inter-individual comparisons
-provide information about absolute standing on trait(s) assessed
Ipsative
-can only be used for intra-individual comparisons
-provide information about relative standing on traits assessed
Term
Norming psychological tests
Definition
must choose samples that represent the target population: *good comparison groups provide a "representative" sample (demographic characteristics) *typically have several norm groups for each test; local norms preferred
Term
Steps in developing norms
Definition
-Defining the target population; *decide on the composition of the normative group based on the intended use of the test (EX: LSAT, MCAT, ACT)
-Selecting the sample; *obtain samples that are a cross-section of the population *regional samples (rural/urban) *geographical
-Standardization; *administer the test the same way to all individuals *standardization decreases error variance by keeping conditions uniform across administrations *use anchor items to equate scores from different test forms
Term
Test publication and revision
(WRITING THE MANUAL)
Definition
*state the purpose of the test and directions for administration and scoring, and describe test development and validity evidence (EX: describe validation samples, reliability, convergent and discriminant validity with other measures) *The manual must be revised with each new form or amendment
Term
Test publication and revision
(REVISING THE TEST)
Definition
should be revised when: *language is outdated *security is compromised *content has been disclosed *changes are made to content, format, medium of administration, or scoring
Term
Technical Manual
Definition
every 5 years or so you have to go back and update your manual because it’s out of date
*some tests are not as urgent to change/update as others. EX: a personality test is pretty stable