Term
| Observational unit |
|
Definition
| basic unit/individual that we are describing in the study |
|
|
Term
| Variable |
|
Definition
| data we are recording for each observational unit |
|
|
Term
| Qualitative variable |
|
Definition
| categorical (not-numbers) |
|
|
Term
| Quantitative variable |
|
Definition
| numerical (numbers) |
|
|
Term
| Observational study |
|
Definition
| the investigator simply records what is/has happened |
|
|
Term
| Experiment |
|
Definition
| the investigator imposes a treatment on the observational units |
|
|
Term
| Sample |
|
Definition
| The observational units on which we have data (***if we gave someone a questionnaire but they didn’t return it, they don’t count!***) |
|
|
Term
| Sampling frame |
|
Definition
| All observational units who had a chance of being selected in the sample |
|
|
Term
| Population |
|
Definition
The group of observational units we are ultimately trying to describe.
The population will depend on the question being asked
Sometimes the sample/sampling frame/population can be the same group. That is called a census. |
|
|
Term
| Parameter |
|
Definition
truth about the population
***Will almost always be in % unless the question explicitly asks for a number***
We usually do not know the true number, but we can describe it in words (% of UW students who live on campus) |
|
|
Term
| Statistic |
|
Definition
describes the sample
May not be given in % format, but should be converted to match the format of the parameter
We will usually be able to calculate this from the data given |
|
|
Term
| Parameter vs. Statistic |
|
Definition
Parameter: describes the population; fixed, will not change; true value may be unknown
Statistic: describes the sample; will vary when different samples are taken; can be computed from the information given |
|
|
Term
| Probability sample |
|
Definition
| includes SRS. Any type of design in which randomization is used to pick the observational units |
|
|
Term
| Convenience sample |
|
Definition
the investigator selects which observational units will be in the sample
**almost always biased** |
|
|
Term
| Voluntary response sample |
|
Definition
the observational units choose whether they want to be in the sample or not
**almost always biased** |
|
|
Term
| Bias vs. Variability |
|
Definition
Think of variability as how spread out my estimates are
Think of bias as how far away my estimates are from the truth
They are not one against the other: both can be high, both can be low, or one can be high while the other is low |
|
|
Term
| Sources of Variability (4) |
|
Definition
Random sampling error (sampling variability). ***This is the only variability accounted for by the margin of error.*** Any additional bias or variability caused by poor survey design will add extra variability.
Shortcut method for a 95% confidence interval: p̂ ± 1/√n, where n = sample size (see the sketch after this card).
Confidence statement: we are 95% confident that the true parameter lies within the confidence interval. ***95% of the time that I follow this same procedure and construct a confidence interval, it will cover the true parameter.***
When the sample size increases we can be more sure about our estimate, so we do not need as large a margin of error. |
|
|
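A minimal Python sketch of the shortcut interval on the card above; the sample proportion and sample size are made-up numbers for illustration:

```python
import math

# Shortcut 95% confidence interval from the card: p-hat +/- 1/sqrt(n).
p_hat = 0.62   # sample proportion (hypothetical survey result)
n = 400        # sample size (hypothetical)

margin_of_error = 1 / math.sqrt(n)
low, high = p_hat - margin_of_error, p_hat + margin_of_error

print(f"Margin of error: {margin_of_error:.3f}")   # 0.050
print(f"95% CI: ({low:.3f}, {high:.3f})")          # (0.570, 0.670)
```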
Term
| Sources of Bias (2; 1 with 4 possible) |
|
Definition
Undercoverage: when the sampling frame does not accurately reflect the population (ex. random digit dialing won’t include people without phones)
Non-sampling errors:
o Response error: people don’t answer truthfully (ex: how many times have you cheated on a test?)
o Non-response: when people don’t respond because they can’t be contacted or don’t cooperate
o Processing errors: typos when recording data
o Question wording: confusing questions, or questions which make a certain response more likely (leading questions) |
|
|
Term
| Explanatory variable |
|
Definition
| a variable that may cause a change in the response variable; the cause, usually the X variable |
|
|
Term
| Response variable |
|
Definition
| measures the outcome of an experiment; the effect, usually the Y variable |
|
|
Term
| Treatment |
|
Definition
| specific condition that is applied in an experiment; often the explanatory variable or mix of explanatory variables |
|
|
Term
| Lurking variable |
|
Definition
| variable that may have an effect on the response variable but is not measured |
|
|
Term
| Confounding |
|
Definition
| When two variables have effects on the response variable that cannot be distinguished from each other |
|
|
Term
| Statistically significant |
|
Definition
| ***The result we found would rarely occur simply by chance*** |
|
|
Term
| Placebo effect |
|
Definition
| The benefit derived from the psychological effect of receiving a treatment |
|
|
Term
| Double-blind experiment |
|
Definition
| Both the clinicians and subjects are “blind” to whether they are in the control or treatment group |
|
|
Term
| Randomization |
|
Definition
| using impersonal chance to assign subjects to either the treatment or control group |
|
|
Term
| Shape |
|
Definition
| Is the distribution skewed or symmetric? Is there one mode or multiple modes? |
|
|
Term
| Spread |
|
Definition
| where do most of the observations lie? What are the highest/lowest values? |
|
|
Term
| Center |
|
Definition
| What is the center point of the distribution? (mean, median or mode) |
|
|
Term
| Numerical Descriptions: Mean |
|
Definition
| Add up all observations and then divide the total by the number of observations. Highly affected by outliers; changes when you add to or multiply the data |
|
|
Term
| Numerical Descriptions: Median |
|
Definition
| Midpoint of the distribution. Sort all your observations and choose the middle observation, or average the middle two if there is an even number of observations. Less affected by outliers; changes when you add to or multiply the data |
|
|
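A quick Python sketch contrasting the two cards above, on a made-up data set: one extreme outlier pulls the mean but barely moves the median.

```python
# Median helper: sort, take the middle observation, or average
# the middle two when there is an even number of observations.
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

data = [2, 3, 3, 4, 5]                    # made-up observations
print(sum(data) / len(data), median(data))              # mean 3.4, median 3

data_out = data + [100]                   # add one extreme outlier
print(sum(data_out) / len(data_out), median(data_out))  # mean 19.5, median 3.5
```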
Term
| Numerical Descriptions: Mode |
|
Definition
| The most frequently occurring value in the distribution |
|
Term
| Numerical Descriptions: Percentiles |
|
Definition
The cth percentile of a distribution is defined so that (at least) c% of the observations are at or below it and (at least) (100-c)% of the observations are at or above it |
|
|
Term
| Numerical Descriptions: Five Number summary |
|
Definition
| Min, 25%, Median, 75%, Max |
|
|
Term
| Numerical Descriptions: Standard Deviation |
|
Definition
a measure of how spread out the data are. 68% of all observations lie within ±1 sd of the mean, 95% within 2 sd, 99.7% within 3 sd; changes when you add to or multiply the data
o First find x̄ (the mean)
o Then add up (x − x̄)² for each observation
o Divide that total by n − 1
o Take the square root of that ratio |
|
|
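A minimal Python sketch of the four steps above, on a made-up data set:

```python
import math

data = [4, 7, 6, 3, 5]                       # made-up observations

x_bar = sum(data) / len(data)                # step 1: find the mean
ss = sum((x - x_bar) ** 2 for x in data)     # step 2: sum of squared deviations
variance = ss / (len(data) - 1)              # step 3: divide by n - 1
sd = math.sqrt(variance)                     # step 4: take the square root

print(f"mean = {x_bar}, sd = {sd:.3f}")      # mean = 5.0, sd = 1.581
```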
Term
| Numerical Descriptions: Quartiles |
|
Definition
• At least 25% of observations are ≤ 1st Quartile, and at least 75% of observations are ≥ 1st quartile
• At least 75% of observations are ≤ 3rd Quartile, and at least 25% of observations are ≥ 3rd quartile
• Interquartile range = 3rd quartile – 1st quartile
• Changes when you add to or multiply the data |
|
|
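A short Python sketch of the five-number summary and IQR on made-up data. Note that statistics.quantiles implements one of several textbook conventions for quartiles, so results can differ slightly from hand calculations:

```python
import statistics

data = [1, 3, 4, 5, 5, 6, 7, 9, 12]               # made-up observations

q1, median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
five_number = (min(data), q1, median, q3, max(data))
iqr = q3 - q1                                     # interquartile range

print("five-number summary:", five_number)
print("IQR:", iqr)
```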
Term
| Scatterplot |
|
Definition
| plots two variables on same graph. Each point is one individual observation |
|
|
Term
| Correlation (r) |
|
Definition
measures the “strength” of the relationship between two variables
• Always between -1 and 1
• Positive correlation means positive association (as one increases, so does the other). Negative value means negative association (as one increases, the other decreases)
• ***Correlation does not imply causation!!***
• Must be linear (or football shaped) to be a valid measurement of association. No outliers |
|
|
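A small Python sketch computing r as the average product of standard scores (one standard definition of correlation), on made-up (x, y) pairs:

```python
import math

xs = [1, 2, 3, 4, 5]        # made-up explanatory values
ys = [2, 4, 5, 4, 6]        # made-up response values

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

mx, my, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
n = len(xs)

# r = average product of the standard scores of x and y
r = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)
print(f"r = {r:.3f}")   # about 0.853: positive association
```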
Term
| Ecological correlation |
|
Definition
| correlations based on averages or rates. Usually overstates the correlation |
|
|
Term
| Regression standard deviation |
|
Definition
| Regression sd is the “average size of error”: √(1 − r²) × s_y. ***Only use this when you are making a prediction involving prior information*** (think about the quiz: when we picked a random student and guessed their quiz 2 score, we used the quiz 2 average and sd, but when we knew their quiz 1 score, we used the regression sd) |
|
|
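A tiny Python sketch of the regression sd formula above; the correlation and sd values are hypothetical, not the actual quiz numbers:

```python
import math

r = 0.6      # correlation between quiz 1 and quiz 2 (hypothetical)
s_y = 10.0   # sd of quiz 2 scores (hypothetical)

# Regression sd = sqrt(1 - r^2) * s_y: the "average size of error"
# when predicting y using prior information about x.
regression_sd = math.sqrt(1 - r ** 2) * s_y
print(f"regression sd = {regression_sd:.1f}")  # 8.0, smaller than s_y:
# knowing x shrinks the prediction error
```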
Term
| Regression effect |
|
Definition
| observations that are extreme in the X-direction are not as extreme in the Y-direction |
|
|
Term
| Rules of probability |
|
Definition
P(A) must be between 0 and 1
Total probability must add up to 1
P(A not happening) = 1 − P(A) |
|
|
Term
| Normal curve |
|
Definition
Symmetric, bell shaped
Only need to know mean and standard deviation to define the whole curve
68% of all observations lie within +/- 1 sd of the mean, 95% within 2 sd, 99.7% within 3 sd
The standard score is the number of standard deviations an observation is away from the mean std. score = (obs - mean)/ SD
Once we have the standard score, we can look up P(X < standard score) in Table B of the book |
|
|
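A minimal Python sketch of the standard score from the card above, using NormalDist from the standard library in place of the Table B lookup; the mean, sd, and observation are made up:

```python
from statistics import NormalDist

mean, sd = 100, 15       # hypothetical normal distribution
obs = 130                # hypothetical observation

z = (obs - mean) / sd    # standard score: 2.0
p = NormalDist().cdf(z)  # area to the left of z, replacing Table B

print(f"z = {z}, P(X < {obs}) = {p:.4f}")   # about 0.9772
```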
Term
| Central Limit Theorem pt. 1 |
|
Definition
| As we take larger and larger samples, the sum or average (not product or ratio) will begin to look like a normal curve |
|
|
Term
| Central Limit Theorem pt. 2 |
|
Definition
| If we take many samples and compute the sample proportion each time, the distribution of those proportions will be a normal distribution with mean = p and standard deviation √ [ p (1-p) / n ] |
|
|
Term
| Central Limit Theorem pt. 3 |
|
Definition
| We would expect 95% of all p-hats to be within 2 sd of the mean, or p ± 2√ [ p (1-p) / n ] |
|
|
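A simulation sketch of pts. 2-3: repeatedly draw samples, compute p̂ each time, and check that roughly 95% of the p̂s fall within 2 sd of p. All the numbers below are arbitrary choices:

```python
import math
import random

random.seed(1)
p, n, reps = 0.3, 500, 10_000              # true proportion, sample size, repetitions
sd = math.sqrt(p * (1 - p) / n)            # sd of p-hat from the card

# Each repetition: draw n observations, compute the sample proportion.
p_hats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

inside = sum(abs(ph - p) <= 2 * sd for ph in p_hats) / reps
print(f"sd of p-hat: {sd:.4f}")
print(f"fraction within p +/- 2 sd: {inside:.3f}")  # close to 0.95
```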
Term
| Central Limit Theorem pt. 4 |
|
Definition
| When we don’t know p but have an estimate p̂, substitute the estimate into the formula: p̂ ± 2√ [ p̂ (1-p̂) / n ] |
|
|
Term
| Test of significance |
|
Definition
The basic idea is that we will reject the null hypothesis if our observation would be very unlikely to happen if the null hypothesis were true
Null hypothesis: the status quo, or the no-change option
Alternative hypothesis: usually what we are trying to prove |
|
|
Term
| Calculating test of significance |
|
Definition
Assume the null hypothesis is true, and calculate how likely our sample result would be:
1. Determine the mean and standard deviation of our “null distribution” (the distribution when the null is true): mean = p and sd = √ [ p (1-p) / n ]
2. Find the standard score of p̂: (p̂ - p) / √ [ p (1-p) / n ]
3. Look up the value in the table. ***You may need to subtract the value from 1 depending on whether you want the area to the left or to the right of the standard score***
This is the p-value: the probability that something as extreme or more extreme than our current observation would occur when the null is true
If the p-value is less than .05, reject the null hypothesis |
|
|
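A minimal Python sketch of the steps above, with made-up numbers (the null value, p̂, and n are all hypothetical):

```python
import math
from statistics import NormalDist

# Null hypothesis: p = 0.5.  We observed p-hat = 0.56 in a sample of n = 400.
p0, p_hat, n = 0.5, 0.56, 400

sd = math.sqrt(p0 * (1 - p0) / n)   # sd of the null distribution: 0.025
z = (p_hat - p0) / sd               # standard score of p-hat: 2.4

# One-sided p-value: area to the RIGHT of z (this is the
# "subtract from 1" caution on the card).
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")  # about 0.0082
if p_value < 0.05:
    print("Reject the null hypothesis")
```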