PSY440 - Evaluation


Chapter Four Notes (Nitko, 2001)


reliability - refers to the consistency of results.

Reliability is the degree to which results are consistent across two different occasions (stability), across two or more raters, and across alternative forms of an assessment covering the same tasks, given either on the same occasion (equivalence) or on different occasions (stability and equivalence).

Reliability determines how much confidence you can place in assessment results. Interpretations of results are more valid when reliability is higher.

Although validity requires good reliability, reliability does not ensure validity because many types of validity evidence are required.

Inconsistency is also called measurement error, and reliability and measurement error are complementary ways of speaking about the same phenomenon.

Inconsistencies in assessment results are due to many causes (e.g., testing conditions, poor item construction, time of day, administration errors).

Two factors that influence the consistency of results are the content or particular tasks on the assessment, and the occasion on which the assessment is given. They can also work together to influence consistency.

reliability coefficient - a way of quantifying the consistency of assessment results by computing a correlation coefficient for scores from two different occasions or two forms of the assessment (ranges from .00 to 1.00).

standard error of measurement - a numerical estimate of the amount that a student's score changes from one administration to the next.

 

Reliability Coefficients

test-retest reliability coefficient or stability coefficient - the correlation between scores on two occasions.

Typical test-retest reliability coefficients for standardized tests are between .80 and .90.
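As a minimal sketch of how a test-retest coefficient could be computed, the block below correlates scores from two occasions. The `pearson` helper and the five students' scores are made up for illustration and do not come from Nitko.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same five students on two occasions.
first_occasion = [10, 12, 15, 18, 20]
second_occasion = [11, 13, 14, 19, 20]

# The stability (test-retest) coefficient is the correlation between occasions.
test_retest_r = pearson(first_occasion, second_occasion)
print(round(test_retest_r, 2))
```

The same calculation applies to alternate forms: replace the second occasion's scores with scores from the second form.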

In general, the longer the time interval between administrations, the lower the test-retest reliability.

Changes in performance over longer periods of time would probably reflect actual changes in the student's ability rather than fluctuations due to circumstances.

delayed alternate forms reliability coefficient - the correlation between two assessments that contain equivalent but different tasks/items and are given on different occasions.

The two forms follow the same test blueprint or table of specifications, which eliminates the effects of students remembering specific items on the second administration (but it doesn't eliminate practice effects).

You cannot rely totally on statistics for assessing reliability; you also need to be knowledgeable about the domain of student performance being assessed, the theory and facts regarding factors that influence students' performance in that domain, and the intended interpretations and uses of the assessment results.

alternate forms reliability coefficient or equivalent-forms reliability coefficient - the correlation between two sets of scores from alternative forms of an assessment given on the same occasion.

parallel-forms reliability coefficient - the correlation between two forms of an assessment that are made up of tasks carefully matched to the same test blueprint.

Parallel forms should:

  1. have equal observed score means and standard deviations,
  2. measure students with equal accuracy (have equal standard errors of measurement),
  3. correlate equally with other measurements, and
  4. measure the same attribute in precisely the same way.

In practice, many assessment procedures have no parallel forms because the assessment will usually only be used once with each student, the very act of testing may change the student, there is usually only one way to assess the ability of interest, and it is too costly to build a parallel form.

Alternative methods of calculating reliability based on one administration of a test include the split-halves procedure, the Kuder-Richardson formula, and coefficient alpha. These methods are referred to as internal consistency reliability.

split-halves procedure - reliability is estimated by correlating students' scores on one half of the assessment with their scores on the other half.

Spearman-Brown double length (prophecy) formula - an estimate of the reliability of the full-length assessment that corrects the split-halves correlation (which is based on only half of the assessment).

odd-even split-halves procedure - splitting an assessment into odd-numbered and even-numbered items, then correlating scores on the two halves.
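The odd-even procedure and the Spearman-Brown correction can be sketched as follows. This is an illustration only: the function names and the response data (5 students by 6 items) are hypothetical, and the step-up used is the standard double-length form, 2r / (1 + r).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def odd_even_split_half(item_scores):
    """item_scores holds one list of item scores per student."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_totals, even_totals)
    # Spearman-Brown double-length formula: step up to full-length reliability.
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full

# Hypothetical right/wrong (1/0) responses: 5 students x 6 items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

r_half, r_full = odd_even_split_half(responses)
print(round(r_half, 3), round(r_full, 3))
```

The stepped-up value is higher than the half-test correlation because the full assessment is twice as long as either half.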

Kuder-Richardson formula (KR20) - similar to the split-halves procedure, except it is used on assessments where items are scored dichotomously (1 or 0), and it uses data based on the proportion of students answering each item correctly.
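A rough sketch of the KR20 calculation, using the usual textbook form (k / (k - 1)) * (1 - sum of p*q / variance of total scores). The data are hypothetical, and dividing the variance by n rather than n - 1 is an assumption made here; some texts use n - 1.

```python
def kr20(item_scores):
    """KR20 for dichotomous (1/0) items; item_scores is one list per student."""
    n_students = len(item_scores)
    k = len(item_scores[0])  # number of items
    # p = proportion of students answering each item correctly; q = 1 - p.
    p = [sum(row[i] for row in item_scores) / n_students for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    # Variance of students' total scores (assumption: divide by n).
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical right/wrong (1/0) responses: 5 students x 6 items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

kr20_estimate = kr20(responses)
print(round(kr20_estimate, 3))
```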

coefficient (Cronbach's) alpha - similar to the KR20 & KR21, except that it is used with assessments where items are scored other than dichotomously (e.g., on a scale of 1 to 5).
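Coefficient alpha replaces the p*q term of KR20 with the variance of each item, so it also works for items scored on a scale. The sketch below assumes population variances (Python's `statistics.pvariance`) and uses made-up ratings on a 1 to 5 scale.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Coefficient alpha; item_scores is one list of item scores per student."""
    k = len(item_scores[0])  # number of items
    # Variance of each item across students (assumption: population variance).
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of students' total scores.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings: 4 students x 3 items, each scored 1 to 5.
ratings = [
    [5, 4, 5],
    [4, 4, 3],
    [2, 3, 2],
    [1, 1, 2],
]

alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

When every item is scored 1 or 0, each item variance equals p*q, so this function gives the same value as KR20.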

These estimates of internal consistency are dependent on homogeneous items, that is, items that all measure the same trait or attribute. If items are not homogeneous, KR20 and coefficient alpha will be lower than the split-halves estimate (which is why they are referred to as lower bound estimates of reliability). KR20 and coefficient alpha are actually estimates based on all possible split halves.

KR20 and coefficient alpha increase as the assessment is lengthened. They should not be used with speeded assessments, and they do not take into account the occasion of sampling.
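The effect of length on reliability can be illustrated with the general form of the Spearman-Brown formula. This is a sketch that assumes the added items are comparable in quality and content to the existing ones; the function name and the starting reliability of .60 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when an assessment is made length_factor times as long.

    Assumes the added items are comparable to the existing items.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Hypothetical assessment with a reliability of .60.
doubled = spearman_brown(0.60, 2)  # twice as many items
tripled = spearman_brown(0.60, 3)  # three times as many items
print(round(doubled, 2), round(tripled, 2))
```

Each added block of items raises the predicted reliability by less than the previous one, so lengthening an assessment has diminishing returns.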

Another source of error in measurement comes from inconsistency between scores of raters (or machines) on an assessment.

inter-rater (scorer) reliability - an index (correlation) of the extent to which scorers are consistent in marking the same students (based not on actual scores but the rank ordering of students).

percent agreement - agreement in the absolute sense where consistency is based on the percent of agreement between scorers in assigning similar scores to each student.

The choice between using the correlation or percent of agreement index depends on the particular interpretation and use of scores. If a pass-fail decision is needed, then absolute scores are necessary, and percent agreement is the appropriate measure of consistency. If only the rank-ordering of students is needed, then correlation would be the appropriate measure of consistency.
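The difference between the two indices shows up clearly when one rater is consistently harsher than the other. The sketch below uses made-up ratings, and "agreement" is taken to mean identical scores, which is an assumption; a looser definition (e.g., within one point) would give a different percentage.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def percent_agreement(scores_a, scores_b):
    """Percent of students given exactly the same score by both raters."""
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return 100 * matches / len(scores_a)

# Hypothetical ratings: rater B always scores one point higher than rater A.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 3, 4, 5, 6]

inter_rater_r = pearson(rater_a, rater_b)        # students are ordered identically
agreement = percent_agreement(rater_a, rater_b)  # no score matches exactly
print(inter_rater_r, agreement)
```

Here the raters order the students identically, so the correlation is perfect, yet they never assign the same score, so a pass-fail decision could differ depending on who did the scoring.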

 

Standard Error of Measurement

obtained scores - the scores students actually receive on an assessment.

Obtained scores are actually the result of a person's true score plus error score.

true score - the hypothetical average (mean) of the observed scores a student would obtain if the assessment were administered to the student repeatedly under the same conditions.

error score - often referred to as the error of measurement, this is the inconsistency in a student's score based on either assessment errors, situational factors (student or environment characteristics), or administration errors.

standard error of measurement - the standard deviation, or spread, of scores around the mean (true score) of a student based on an estimated sampling distribution of repeated administrations of an assessment to the student.

The standard error of measurement is equal to the standard deviation of the obtained scores on a measurement times the square root of 1 minus the reliability of an assessment. In other words, the standard error of measurement depends on both the reliability coefficient and the standard deviation of the obtained scores.
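The formula just described, SEM = SD * sqrt(1 - reliability), can be sketched as below. The standard deviation of 10 and reliability of .84 are hypothetical values chosen to make the arithmetic easy to follow.

```python
from math import sqrt

def standard_error_of_measurement(sd_obtained, reliability):
    """SEM = standard deviation of obtained scores * sqrt(1 - reliability)."""
    return sd_obtained * sqrt(1 - reliability)

# Hypothetical assessment: SD of obtained scores = 10, reliability = .84.
sem = standard_error_of_measurement(10, 0.84)
print(round(sem, 2))

# With the same SD, a lower reliability gives a larger SEM.
sem_low_reliability = standard_error_of_measurement(10, 0.51)
print(round(sem_low_reliability, 2))
```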

The standard error of measurement can be considered a numerical estimate of how far a student's observed score is likely to deviate from his/her true score. It can also be used in conjunction with the normal curve: about 68% of the time a student takes a particular test, his/her obtained score will fall within plus/minus one standard error of measurement of the true score (this range is also referred to as the confidence interval or confidence band).
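A confidence band can be sketched by adding and subtracting the standard error of measurement from an obtained score. The score of 75 and SEM of 4 are hypothetical. The two-SEM band (roughly 95% under the normal curve) is a standard normal-curve result rather than something stated in these notes.

```python
def score_band(obtained_score, sem, n_sems=1):
    """Band of obtained_score plus/minus n_sems standard errors of measurement."""
    return obtained_score - n_sems * sem, obtained_score + n_sems * sem

# Hypothetical student: obtained score of 75 on a test with SEM = 4.
band_68 = score_band(75, 4)             # plus/minus 1 SEM, about 68%
band_95 = score_band(75, 4, n_sems=2)   # plus/minus 2 SEM, roughly 95%
print(band_68, band_95)
```

Reporting the band rather than the single score is one way to keep small score differences between students from being over-interpreted.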

Other factors that affect the standard error of measurement and reliability:

  1. Longer assessments are more reliable than shorter assessments.
  2. The numerical value of a reliability coefficient will fluctuate from one sample of students to another.
  3. The narrower the range of a group's ability, the lower the reliability coefficient tends to be.
  4. Students at different achievement levels may be assessed with different degrees of accuracy.
  5. The longer the time interval between administrations, the lower the reliability.
  6. More objectively scored assessments are more reliable.
  7. Different methods of assessing reliability will give different results.

decision consistency - the consistency of the decision made on the basis of a student's score, which matters more than the consistency of the exact score when the assessment is used for classification (reliability of mastery decisions). For example, would two different forms of an assessment classify the same students as masters or nonmasters of the material?

decision error or error of misclassification - when a student's true status is not revealed by the assessment results.

Factors that affect decision errors:

  1. The assessment product may contain tasks that have weak validity for assessing the type of mastery you have in mind (e.g., using a standardized test that doesn't match your curriculum materials).
  2. Longer assessment procedures usually lead to more accurate mastery decisions.
  3. Low inter-rater reliability is associated with high rates of mastery decision errors.
  4. The passing score you set for a mastery decision affects the rate of decision errors (very high and very low cutoff scores have more decision errors).
  5. Students whose true mastery status is very close to the passing score you set are the ones who are most likely to be misclassified.

percent agreement index for mastery decisions - the percent of agreement between two parallel forms of an assessment administered to the same students to determine the degree of consistency in the mastery decision. This is equal to the percent consistent for mastery decisions plus percent consistent for nonmastery decisions.
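The index can be sketched as follows. The scores on the two forms, the passing score of 70, and the rule that a score at or above the passing score counts as mastery are all assumptions made for this illustration.

```python
def mastery_agreement(form_a_scores, form_b_scores, passing_score):
    """Percent of students classified the same way by both forms."""
    both_master = 0
    both_nonmaster = 0
    for a, b in zip(form_a_scores, form_b_scores):
        a_pass = a >= passing_score  # assumption: at or above the cutoff = master
        b_pass = b >= passing_score
        if a_pass and b_pass:
            both_master += 1
        elif not a_pass and not b_pass:
            both_nonmaster += 1
    n = len(form_a_scores)
    # Percent consistent masters plus percent consistent nonmasters.
    return 100 * both_master / n + 100 * both_nonmaster / n

# Hypothetical scores for ten students on two parallel forms.
form_a = [85, 72, 90, 65, 78, 55, 81, 69, 95, 74]
form_b = [88, 68, 92, 70, 80, 50, 79, 71, 90, 76]

agreement = mastery_agreement(form_a, form_b, passing_score=70)
print(agreement)
```

In this made-up data, the students who are classified inconsistently all have scores within a few points of the passing score, which matches the point that students near the cutoff are the most likely to be misclassified.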

Two points to keep in mind when deciding how "high" a reliability coefficient should be:

  1. In general, standardized multiple-choice achievement tests have reliability coefficients between .85 and .95; open-ended paper-pencil assessments average between .65 and .80; portfolio assessments have typical reliabilities between .40 and .60.
  2. Moderate levels of reliability (.70) are acceptable for classroom assessments as long as decisions are being based on several pieces of information.

Ways to improve the reliability of assessment results:

  1. Lengthen the assessment procedure.
  2. Broaden the scope of the procedure.
  3. Improve objectivity.
  4. Use multiple scorers.
  5. Combine results from several assessments.
  6. Provide sufficient time to students.
  7. Teach students how to perform their best.
  8. Match assessment difficulty to the ability level of students.
  9. Differentiate among students.

This Webpage designed and updated (9/24/01) by Ron Dugan, University at Albany, State University of New York.