EPSY440 - Evaluation


Chapter Three Notes (Nitko, 2001)


validity - refers to the legitimacy of your interpretation and use of assessment results.

Three points to keep in mind concerning validity:

  1. Validity refers to your interpretation and is not inherent in the measure or procedure.
  2. Assessment results have different degrees of validity for different persons in different situations.
  3. Judgments about validity should be made only after several types of evidence have been gathered.

Four principles of validation help decide the degree to which assessment results are valid:

  1. The meaning (interpretation) you give to assessment results is valid only to the degree you can provide evidence.
  2. Uses of assessment results are valid only to the degree you can provide evidence.
  3. Interpretations and uses are valid only when the values implied by them are appropriate.
  4. Interpretations and uses are valid only when the consequences of these interpretations and uses are consistent with appropriate values.

Multiple interpretations can be made from one set of assessment results, and these interpretations are not mutually exclusive. For example, a reading test can be used to infer that reading comprehension has improved, that motivation to do well has improved, and that skill in answering the item types included on the test has improved.

There is a distinction between interpretation (meaning) and use of assessment results. Evidence should be provided separately for each intended use of assessment results. To validate a particular use, you must start from a valid interpretation of the results. In other words, an invalid interpretation invalidates any use that rests on it.

Interpretations and uses arise from educational and social values. Your choice of a particular test, of a particular format, and of the content all imply certain values held by the test user.

Intended and unintended consequences result from interpretations and uses of assessment results. For example, using test results for tracking and then stigmatizing students is an inappropriate use and consequence. Test users should strive for positive consequences and avoid negative consequences to maximize validity.

Judgment of validity depends on knowing the particular interpretation, use, value, and consequence of the assessment results.

 

Criteria for improving validity of scores from a classroom assessment:

Content Representativeness and Relevance

The assessment should:

  1. Emphasize what you taught (clearly identify learning targets and make sure you sample them).
  2. Represent the school's and state's stated curricular content.
  3. Represent current thinking about the subject.
  4. Contain content worth learning (assessments should assess combinations of skills and content).

Thinking Processes and Skills Represented

The assessment should:

  1. Require students to integrate and use several thinking skills (use taxonomies).
  2. Represent thinking processes and skills stated in the school's curriculum.
  3. Contain tasks that cannot be completed without using intended thinking skills.
  4. Allow enough time for students to use complex skills and processes.

Consistency With Other Classroom Assessments

The assessment should:

  1. Yield patterns of results consistent with your other assessments of the class.
  2. Contain individual tasks (items) that are neither too easy nor too difficult (either extreme results in all scores being alike).

Reliability and Objectivity

The assessment should:

  1. Use a systematic procedure for assigning quality ratings and marks for every student (reliability or consistency).
  2. Provide each student with several opportunities to demonstrate competence for each learning target assessed.

Fairness to Different Types of Students

The assessment should:

  1. Contain tasks that are interpreted appropriately by students with different backgrounds (cultural fairness).
  2. Accommodate students with disabilities or other learning difficulties if necessary.
  3. Be free of ethnic, racial, and gender bias (eliminate stereotypical language and content).

Economy, Efficiency, Practicality, Instructional Features

The assessment should:

  1. Require a reasonable amount of time for you to construct and administer.
  2. Represent appropriate use of the students' class time.
  3. Represent appropriate use of your class time.

Multiple Assessment Usage

The assessment should:

  1. Be used in conjunction with other assessment results for important decisions.

 

Validity of Extra-Classroom Assessments

Extra-classroom assessments include district- and state-mandated assessments, standardized achievement and aptitude tests, attitude inventories, individually administered tests, and others.

The previous types of evidence discussed for teacher-made (classroom) assessments apply to all assessment methods, but purposes for extra-classroom assessments are usually different, as are the emphases and combinations of evidence used to judge validity.

Three things to keep in mind:

  1. The importance of each type of evidence changes as interpretations and uses of assessment results change.
  2. Providing evidence is the responsibility of both the publisher and user of the test.
  3. Not being able to afford validity studies does not excuse you from validating assessment results (sometimes this requires admitting that you lack the resources to validate your uses, which leaves the validity of your interpretations low).

Validity should be thought of as a unitary concept, although several types of evidence are used to make this judgment (do not think of the different types of evidence as different types of validity).

Eight types of evidence need to be considered before validating a particular interpretation and use of an extra-classroom assessment:

Content representativeness and relevance (content evidence):

  1. How well do the assessment tasks represent the domain of important content?
  2. How well do assessment tasks represent the curriculum as you define it?
  3. How well do assessment tasks reflect current thinking about what should be taught/assessed?
  4. Is the content of the assessment tasks worth learning?

Types of thinking skills and processes required (substantive evidence):

  1. How much do assessment tasks require students to use critical thinking skills and processes?
  2. How well do assessment tasks represent types of thinking skills identified as important by the curriculum?
  3. Are the thinking skills required by the assessment tasks those actually claimed to be used?

Relationships among assessment tasks or parts of the assessment (internal structural evidence):

  1. Do the assessment tasks work together so each task contributes positively toward assessing the quality of interest?
  2. If different parts of the assessment provide unique information, do results support the uniqueness?
  3. If different parts of the assessment provide the same or similar information, do the results support this?
  4. Are students' responses scored in a manner consistent with the constructs and theory on which the assessment is based?

Relationships of assessment results to the results of other variables (external structure evidence):

  1. Are results of this assessment consistent with results of other assessments for these students?
  2. How well does performance on the assessment reflect the quality/trait measured by other tests?
  3. How well does performance on the assessment predict current/future performance on other criteria?
  4. How well can assessment results be used to select students for jobs, schools, etc.? What is the measurement error?
  5. How well can assessment results be used to assign students to different types of instruction? Does this improve their learning?

Reliability over time, assessors, and content domain (reliability evidence):

  1. Would the same students obtain similar results if the assessment were administered on another occasion? What is the measurement error?
  2. Would students' outcomes be the same if scored by different persons? What is the measurement error?
  3. If an alternate assessment is used with similar content, would the results be similar? What is the measurement error?
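
One simple way to look at the second question (consistency across scorers) is to count how often two scorers give the same rating. The sketch below is only an illustration: the rubric scores are made up, and exact agreement is just one of several possible indices, not necessarily the one used in the textbook.

```python
def exact_agreement(ratings_a, ratings_b):
    """Proportion of students given the same rating by both scorers."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return matches / len(ratings_a)

# Hypothetical rubric scores (1-4) from two teachers for the same eight essays.
teacher_1 = [4, 3, 3, 2, 4, 1, 2, 3]
teacher_2 = [4, 3, 2, 2, 4, 1, 3, 3]

agreement = exact_agreement(teacher_1, teacher_2)
print(f"Exact agreement: {agreement:.0%}")
```

A low proportion would suggest that outcomes depend heavily on who does the scoring.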

Generalization of interpretations over different types of people, under different conditions, or with special instruction/intervention (generalization evidence):

  1. Does the assessment procedure give significantly different results when it is used with students from different socioeconomic, ethnic, or ability backgrounds? If so, is this fair or biased?
  2. Would students' results be significantly altered if they were offered incentives, and if so, would this alter the interpretations?
  3. Will special interventions, changes in instructions, or special coaching significantly alter the results students obtain on the assessment? If so, should this change how the results are interpreted?

Value of intended and/or unintended consequences (consequential evidence):

  1. What do we expect to happen to the students if we interpret the results in a certain way? To what degree do these expected consequences happen, and is this good?
  2. What side effects do we anticipate happening to the students if we interpret and use results in a particular way? To what degree are these anticipated side effects happening, and are they positive or negative?
  3. What unanticipated negative side effects happened to the students for whom we interpreted and used the assessment results in this particular way? Can these negative side effects be avoided by using other assessment techniques or altering our interpretations?

Cost, efficiency, practicality, instructional features (practicality evidence):

  1. Can the assessment technique accommodate typical numbers of students?
  2. Is the assessment procedure easy for teachers to use?
  3. Can the assessment procedure give quick results to guide instruction?
  4. Do teachers agree that the theoretical concepts behind the assessment procedure reflect the key understandings they are teaching?
  5. Do the assessment results meaningfully explain individual differences?
  6. Do the assessment results identify misunderstandings that need to be corrected?
  7. Would an alternative assessment procedure be more efficient?

content representativeness - the extent to which tasks are a representative sample from the content domain.

content relevance - whether the assessment tasks are included in the user's definition of the content domain.

table of specifications - a tool for defining the domain which contains major categories and skills that are assessed.
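
As an illustration, a table of specifications can be sketched as a grid of content categories by thinking skills, with the number of planned items in each cell. The categories, skill labels, and item counts below are hypothetical and chosen only to show the idea.

```python
# Hypothetical blueprint: rows are content categories, columns are thinking
# skills, and each cell is the number of items planned for that combination.
blueprint = {
    "Fractions": {"Recall": 2, "Application": 4, "Analysis": 2},
    "Decimals": {"Recall": 3, "Application": 3, "Analysis": 1},
    "Percents": {"Recall": 1, "Application": 2, "Analysis": 2},
}

total_items = sum(sum(row.values()) for row in blueprint.values())

# Share of the assessment devoted to each content category.
content_weights = {
    category: sum(row.values()) / total_items
    for category, row in blueprint.items()
}

# Number and share of items devoted to each thinking skill.
skill_totals = {}
for row in blueprint.values():
    for skill, n in row.items():
        skill_totals[skill] = skill_totals.get(skill, 0) + n
skill_weights = {skill: n / total_items for skill, n in skill_totals.items()}

for category, weight in content_weights.items():
    print(f"{category}: {weight:.0%} of items")
```

Comparing these shares with the emphasis given to each category during instruction is one way to judge content representativeness.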

curricular relevance - a judgment of the degree of overlap between the assessment tasks and the school's curriculum learning targets.

A school's curricular framework is usually much larger than any single assessment instrument, which highlights the importance of multiple assessments.

Test developers should provide the user with:

  1. a detailed description of processes and abilities they claim to be assessing,
  2. a clear demonstration of how each task assesses each of these processes and abilities, and
  3. evidence from research studies demonstrating that students use the skills and abilities being assessed.

internal structure - the interrelationships among the tasks, and the relationships between the tasks and the total results.

Evidence that test developers provide is most often in the form of correlation coefficients (see below).

Evidence also comes from how well the assessment results correlate with other variables or criteria (e.g., SAT with college GPA).

external structure - a pattern of relationships between assessment results and external variables.

predictive validity evidence - the extent to which future performance on a criterion can be predicted from prior performance on an assessment instrument.

concurrent validity evidence - extent to which individuals' current status on a criterion can be estimated from current performance on an assessment instrument.

The length of time between obtaining results from two measures is usually related to the correlation between them (the longer the time interval, the lower the correlation).

correlation coefficient - a statistical index that quantifies the degree of relationship between scores on two different assessments (ranges from -1.00 to +1.00).

scattergram - an alternative way to study the relationship between scores on different assessments (or administrations of the same assessment) by plotting them on a graph.

Pearson product-moment correlation - the quantitative index used most often to represent the correlation coefficient, denoted by a lowercase, italicized r.

positive correlation - when scores on one assessment go up or down in conjunction with scores on another assessment (both go up or both go down together).

negative correlation - when scores on one assessment go up or down in opposition to scores on another assessment (as one goes up, the other goes down, or vice-versa).

(Appendix H in the textbook illustrates how to compute the correlation coefficient).
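
A minimal sketch of the computation follows, using the standard deviation-score formula for the Pearson r (not necessarily the layout used in Appendix H). The function name and all scores are hypothetical, made up to show one positive and one negative relationship.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of each student's deviations from the two means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each set of scores.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when all scores on one measure are alike")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for the same five students (made up for illustration).
reading = [10, 12, 15, 18, 20]
vocabulary = [22, 23, 31, 34, 41]  # tends to rise with reading: positive r
errors = [9, 8, 8, 4, 3]           # tends to fall as reading rises: negative r

r_pos = pearson_r(reading, vocabulary)
r_neg = pearson_r(reading, errors)
print(f"reading vs. vocabulary: r = {r_pos:.2f}")
print(f"reading vs. errors:     r = {r_neg:.2f}")
```

The function refuses to compute r when every score on one measure is identical, because the index is undefined without any spread in the scores.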

Correlations between assessments are rarely, if ever, perfect, due to measurement, personal, and administration errors.

Correlation does not imply causation, only a relationship; it could be that one variable causes changes in the other, or vice versa, or that a third variable affects both.

A correlation is computed from the scores of a sample of people and serves as an estimate of the relationship in the population from which the sample was drawn.

Factors that raise correlation coefficients include:

validity coefficient - a correlation between scores from an assessment instrument and criterion scores, used as predictive or concurrent validity evidence (e.g., the correlation between IQ and high achievement on a test of higher-order thinking skills).

No single number (i.e., correlation coefficient) is sufficient to judge the validity of assessment results.

expectancy table - a grid (or two-way table) that displays how likely it is for a person with a specific assessment score to attain a particular score on the criterion (e.g., how scores on a measure of reading skills are distributed across assignment of grades for English classes).
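
The sketch below shows one way such a table could be built from paired data. The score bands, grades, and student records are hypothetical, and the function name is an assumption made for this illustration.

```python
from collections import Counter

def expectancy_table(pairs, score_bands):
    """Return {band_label: {criterion: percent}} from (score, criterion) pairs.

    score_bands is a list of (label, low, high) tuples with inclusive bounds.
    """
    counts = {label: Counter() for label, _, _ in score_bands}
    for score, criterion in pairs:
        for label, low, high in score_bands:
            if low <= score <= high:
                counts[label][criterion] += 1
                break
    table = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        # Percent of students in this score band who reached each criterion level.
        table[label] = {
            level: round(100 * n / total) for level, n in counter.items()
        }
    return table

# Hypothetical (reading score, English grade) pairs, made up for illustration.
students = [
    (35, "A"), (38, "A"), (33, "B"), (31, "A"),
    (25, "B"), (22, "B"), (28, "A"), (21, "C"),
    (12, "C"), (15, "C"), (18, "B"), (10, "D"),
]
bands = [("30-40", 30, 40), ("20-29", 20, 29), ("10-19", 10, 19)]

table = expectancy_table(students, bands)
for label, _, _ in bands:
    print(label, table[label])
```

Each row reads as a set of likelihoods: in this made-up data, 75% of the students who scored 30-40 on the reading measure earned an A in English.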

the criterion problem - the difficulty in obtaining adequate criterion measures to use in validating assessment results.

Three types of criteria in education are: achievement test scores; ratings, marks, or other teacher judgments; and career data.

Any single criterion measure to judge validity is incomplete.

Judgments about the worth of criterion measures in a validity study take the following into account:

Ultimate criteria (e.g., real life performance in a particular field) usually do not occur for a very long time, so intermediate criteria (e.g., student teacher ratings by a cooperating teacher) are used.

Reliability (consistency of assessment results, discussed in Chapter 4) is related to validity in that low reliability limits the validity of inferences; high reliability supports valid inferences but does not by itself guarantee them.

Systematic errors in a criterion measure may lead to the wrong conclusion (e.g., if teachers favor one type of student over another and rate them according to this bias).

Practical considerations (e.g., availability of data) limit the degree to which criteria can be used in validity studies.

Factors found to affect students' scores on a test of reading comprehension (and therefore to affect generalizability, or the extent to which results of validity studies can be applied to other contexts) included:

content bias - tasks that include information on which students have a lot of prior knowledge.

passage independence - the extent to which items depend on the reading or comprehension of a passage.

speededness - if time limits are not generous, performance may depend on reading speed rather than comprehension.

attitude and motivation - highly motivated students are likely to perform better than less motivated students.

sophistication of students - how "test-wise" the students are.

test directions - if directions are misunderstood, the student's performance will deteriorate.

ability to mark answers - students differ in their speed and accuracy of marking correct answers.

Not all validity evidence will be included in a test publisher's manual, so you may need to do research to find studies that contain additional information.

Much of the controversy today surrounding the use of teacher-made classroom assessments or extra-classroom assessments (standardized tests) revolves around the consequences for instruction, learning, and equity (e.g., high-stakes assessments which affect school funding, teacher employment, etc.).

Assessments can be quite valid, yet practical concerns may impede their use, such as a procedure that is too complex for teachers to use, the need for prior teacher training, cost, and/or limited availability of supporting materials (e.g., computer printouts of results).

Validity is based on several pieces of evidence, not on any one piece.

The argument-based approach to validation requires you to:

  1. State clearly what interpretations you intend to make.
  2. Present a logically coherent argument that the assessment results can be interpreted as you intend.
  3. Support your logical argument by citing evidence for and against your intended interpretation.

This Webpage designed and updated (9/21/01) by Ron Dugan, University at Albany, State University of New York.