EPSY440 - Evaluation
Chapter Three Notes (Nitko, 2001)
validity - refers to the legitimacy of your interpretation and use of assessment results.
Three points to keep in mind concerning validity:
Four principles of validation help decide the degree to which assessment results are valid:
Multiple interpretations can be made from one set of assessment results, and these interpretations are not mutually exclusive. For example, a reading test can be used to infer that reading comprehension has improved, that motivation to do well has improved, and that skill in answering the item types included on the test has improved.
There is a distinction between interpretation (meaning) and use of assessment results. Evidence should be provided separately for each intended use of assessment results. And in order to validate a particular use, you must use a valid interpretation of the results. In other words, invalid interpretations negate your intended uses for those interpretations.
Interpretations and uses arise from educational and social values. Your choice of a particular test, of a particular format, and of the content all imply certain values held by the test user.
Intended and unintended consequences result from interpretations and uses of assessment results. For example, using test results for tracking and then stigmatizing students is an inappropriate use and consequence. Test users should strive for positive consequences and avoidance of negative consequences to maximize validity.
Judgment of validity depends on knowing the particular interpretation, use, value, and consequence of the assessment results.
Criteria for improving validity of scores from a classroom assessment:
Content Representativeness and Relevance
The assessment should:
- Emphasize what you taught (clearly identify learning targets and make sure you sample them).
- Represent the school's and state's stated curricular content.
- Represent current thinking about the subject.
- Contain content worth learning (assessments should assess combinations of skills and content).
Thinking Processes and Skills Represented
The assessment should:
- Require students to integrate and use several thinking skills (use taxonomies).
- Represent thinking processes and skills stated in the school's curriculum.
- Contain tasks that cannot be completed without using intended thinking skills.
- Allow enough time for students to use complex skills and processes.
Consistency With Other Classroom Assessments
The assessment should:
- Yield patterns of results consistent with your other assessments of the class.
- Contain individual tasks (items) that are neither too easy nor too difficult (either extreme results in all scores being alike).
Reliability and Objectivity
The assessment should:
- Use a systematic procedure for assigning quality ratings and marks for every student (reliability or consistency).
- Provide each student with several opportunities to demonstrate competence for each learning target assessed.
Fairness to Different Types of Students
The assessment should:
- Contain tasks that are interpreted appropriately by students with different backgrounds (cultural fairness).
- Accommodate students with disabilities or other learning difficulties if necessary.
- Be free of ethnic, racial, and gender bias (eliminate stereotypical language and content).
Economy, Efficiency, Practicality, Instructional Features
The assessment should:
- Require a reasonable amount of time for you to construct and administer.
- Represent appropriate use of the students' class time.
- Represent appropriate use of your class time.
Multiple Assessment Usage
The assessment should:
- Be used in conjunction with other assessment results for important decisions.
Validity of Extra-Classroom Assessments
Extra-classroom assessments include district- and state-mandated assessments, standardized achievement and aptitude tests, attitude inventories, individually administered tests, and others.
The previous types of evidence discussed for teacher-made (classroom) assessments apply to all assessment methods, but purposes for extra-classroom assessments are usually different, as are the emphases and combinations of evidence used to judge validity.
Three things to keep in mind:
Validity should be thought of as a unitary concept, although several types of evidence are used to make this judgment (do not think of the different types of evidence as different types of validity).
Eight types of evidence need to be considered before validating a particular interpretation and use of an extra-classroom assessment:
Content representativeness and relevance (content evidence):
Types of thinking skills and processes required (substantive evidence):
Relationships among assessment tasks or parts of the assessment (internal structural evidence):
Relationships of assessment results to the results of other variables (external structure evidence):
Reliability over time, assessors, and content domain (reliability evidence):
Generalization of interpretations over different types of people, under different conditions, or with special instruction/intervention (generalization evidence):
Value of intended and/or unintended consequences (consequential evidence):
Cost, efficiency, practicality, instructional features (practicality evidence):
content representativeness - the extent to which tasks are a representative sample from the content domain.
content relevance - whether the assessment tasks are included in the user's definition of the content domain.
table of specifications - a tool for defining the domain which contains major categories and skills that are assessed.
curricular relevance - a judgment of the degree of overlap between the assessment tasks and the school's curriculum learning targets.
A school's curricular framework is usually much larger than any single assessment instrument, which highlights the importance of multiple assessments.
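The table of specifications defined above can be sketched as a small grid of content categories by thinking skills. The example below is only an illustration: the unit, the category names, the skill labels, and the item counts are all hypothetical and do not come from the notes or the textbook. It shows how row and column totals reveal how heavily an assessment samples each part of the domain.

```python
# Hypothetical table of specifications (illustration only):
# rows are content categories, columns are thinking skills,
# and each cell holds the number of items planned for that combination.
spec = {
    "Equivalent fractions": {"Recall": 2, "Application": 3, "Analysis": 1},
    "Adding fractions": {"Recall": 1, "Application": 4, "Analysis": 2},
    "Word problems": {"Recall": 0, "Application": 3, "Analysis": 4},
}
skills = ["Recall", "Application", "Analysis"]

# Items per content category (row totals) and per thinking skill (column totals).
row_totals = {content: sum(cells.values()) for content, cells in spec.items()}
col_totals = {skill: sum(cells[skill] for cells in spec.values()) for skill in skills}
total_items = sum(row_totals.values())

# Share of the assessment devoted to each category and each skill.
for content, n in row_totals.items():
    print(f"{content}: {n} items ({100 * n / total_items:.0f}% of the assessment)")
for skill, n in col_totals.items():
    print(f"{skill}: {n} items ({100 * n / total_items:.0f}% of the assessment)")
```

Comparing these shares with the emphasis each category received in instruction is one way to judge content representativeness.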
Test developers should provide the user with:
internal structure - the interrelationships among the tasks, and the relationships between the tasks and the total results.
Evidence that test developers provide is most often in the form of correlation coefficients (see below).
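One common form of internal structure evidence is the correlation between each task and the total score. The sketch below is a minimal illustration under stated assumptions: the response matrix is hypothetical, the helper name `pearson_r` is chosen here for illustration, and the total score includes the item itself (an uncorrected item-total correlation).

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: one row per student, one column per item
# (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

# Total score for each student.
totals = [sum(row) for row in responses]

# Correlate each item's scores with the total scores.
item_total = []
for j in range(len(responses[0])):
    item_scores = [row[j] for row in responses]
    item_total.append(pearson_r(item_scores, totals))

for j, r in enumerate(item_total, start=1):
    print(f"item {j}: item-total r = {r:.2f}")
```

In this made-up data set the last item correlates negatively with the total, which would suggest that it does not behave like the other tasks and deserves a closer look.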
Evidence also comes from how well the assessment results correlate with other variables or criteria (e.g., SAT with college GPA).
external structure - a pattern of relationships between assessment results and external variables.
predictive validity evidence - the extent to which future performance on a criterion can be predicted from prior performance on an assessment instrument.
concurrent validity evidence - extent to which individuals' current status on a criterion can be estimated from current performance on an assessment instrument.
The length of time between obtaining results from two measures is usually related to the correlation between them (the longer the time interval, the lower the correlation).
correlation coefficient - a statistical index that quantifies the degree of relationships between scores on two different assessments (ranges from -1.00 to +1.00).
scattergram - an alternative way to study the relationship between scores on different assessments (or administrations of the same assessment) by plotting them on a graph.
Pearson product-moment correlation - the quantitative index used most often to represent the correlation coefficient, denoted by a lowercase, italicized r.
positive correlation - when scores on one assessment go up or down in conjunction with scores on another assessment (both go up or both go down together).
negative correlation - when scores on one assessment go up or down in opposition to scores on another assessment (as one goes up, the other goes down, or vice-versa).
(Appendix H in the textbook illustrates how to compute the correlation coefficient).
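As a rough companion to the definitions above, the following sketch computes r as the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. It is not taken from Appendix H; the function name `pearson_r` and all of the scores are hypothetical and chosen only to show one positive and one negative correlation.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each set of scores.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when all scores on one measure are equal")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for five students (illustration only).
quiz = [1, 2, 3, 4, 5]
exam = [2, 4, 5, 4, 5]      # tends to rise as quiz scores rise
errors = [5, 3, 4, 2, 1]    # tends to fall as quiz scores rise

r_pos = pearson_r(quiz, exam)
r_neg = pearson_r(quiz, errors)
print(f"quiz vs. exam:   r = {r_pos:.2f}")
print(f"quiz vs. errors: r = {r_neg:.2f}")
```

The first pair yields a positive r and the second a negative r, and both values fall inside the -1.00 to +1.00 range.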
Correlations between assessments are rarely, if ever, perfect, due to measurement, personal, and administration errors.
Correlation does not imply causation, only a relationship; it could be that one variable causes changes in the other, or vice-versa, or that a third variable influences both.
A correlation computed from a sample of people estimates the relationship between scores in the population from which the sample was drawn.
Factors that raise correlation coefficients include:
validity coefficient - a correlation between scores from an assessment instrument and criterion scores, used to provide predictive or concurrent validity evidence (e.g., the correlation between IQ and high achievement on a test of higher-order thinking skills).
No single number (i.e., correlation coefficient) is sufficient to judge the validity of assessment results.
expectancy table - a grid (or two-way table) that displays how likely it is for a person with a specific assessment score to attain a particular score on the criterion (e.g., how scores on a measure of reading skills are distributed across assignment of grades for English classes).
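An expectancy table of the kind described above can be sketched by counting how students in each score band are distributed across criterion levels. The records, the band labels, and the function name `expectancy_table` below are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict

def expectancy_table(pairs):
    """Return {score_band: {criterion_level: percent of students in that band}}."""
    counts = defaultdict(Counter)
    for band, level in pairs:
        counts[band][level] += 1
    table = {}
    for band, levels in counts.items():
        total = sum(levels.values())
        table[band] = {level: 100.0 * n / total for level, n in levels.items()}
    return table

# Hypothetical (reading-skills score band, English grade) pairs.
records = [
    ("high", "A"), ("high", "A"), ("high", "B"), ("high", "A"),
    ("middle", "B"), ("middle", "C"), ("middle", "B"), ("middle", "A"),
    ("low", "C"), ("low", "C"), ("low", "B"), ("low", "D"),
]

table = expectancy_table(records)

# One row per score band, one column per grade.
for band in ("high", "middle", "low"):
    row = ", ".join(f"{grade}: {table[band].get(grade, 0.0):.0f}%" for grade in "ABCD")
    print(f"{band:>6} | {row}")
```

Each row reads as a rough estimate of how likely a student in that score band is to earn each grade, based only on the students in the sample.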
the criterion problem - the difficulty in obtaining adequate criterion measures to use in validating assessment results.
Three types of criteria in education are: achievement test scores; ratings, marks, or other teacher judgments; and career data.
Any single criterion measure to judge validity is incomplete.
Judging the worth of criterion measures in a validity study involves evaluating:
Ultimate criteria (e.g., real life performance in a particular field) usually do not occur for a very long time, so intermediate criteria (e.g., student teacher ratings by a cooperating teacher) are used.
Reliability (consistency of assessment results, discussed in Chapter 4) is related to validity in that reliability is necessary, though not sufficient, for valid inferences: unreliable results limit how valid the inferences drawn from them can be.
Systematic errors in a criterion measure may lead to the wrong conclusion (e.g., if teachers favor one type of student over another and rate them according to this bias).
Practical considerations (e.g., availability of data) limit the degree to which criteria can be used in validity studies.
Factors found to affect students' scores on a test of reading comprehension (and therefore generalizability, or the extent to which results of validity studies can be applied to other contexts) included:
content bias - tasks that include information on which students have a lot of prior knowledge.
passage independence - the extent to which items depend on the reading or comprehension of a passage.
speededness - if time limits are not generous, performance may depend on reading speed rather than comprehension.
attitude and motivation - highly motivated students are likely to perform better than less motivated students.
sophistication of students - how "test-wise" the students are.
test directions - if directions are misunderstood, the student's performance will deteriorate.
ability to mark answers - students differ in their speed and accuracy of marking correct answers.
Not all validity evidence will be included in a test publisher's manual, so you may need to do research to find studies that contain additional information.
Much of the controversy today surrounding the use of teacher-made classroom assessments or external classroom assessments (standardized tests) revolves around the consequences for instruction, learning, and equity (e.g., high-stakes assessments which affect school funding, teacher employment, etc.).
Assessments can be quite valid, yet practical constraints may impede their use, such as a procedure that is too complex for teachers to use, the need for prior teacher training, cost, and/or the availability of supporting materials (e.g., computer printouts of results).
Validity is based on several pieces of evidence, not on any one piece.
The argument-based approach to validation requires you to:
This Webpage designed and updated (9/21/01) by Ron Dugan, University at Albany, State University of New York.