EPSY440 - Evaluation


Chapter Fifteen Notes (Nitko, 2001)


 

Raw Scores – the number of points or marks you actually assign to a student’s performance on an assessment.

·        Raw scores tell little about the meaning of a score.

 

Referencing Framework – a structure used to compare a student’s performance to something external to the assessment.

Norm-referencing framework – a student’s performance is compared to the performance of a well-defined group of students who have also taken the same assessment.

Criterion-referencing framework – used to infer performance based on what a student can do in a domain (referenced to a criterion).

 

Norm Referencing

norm group – the well-defined group to which a student’s performance is compared.

·        You need to determine who is in the norm group that you are going to compare scores to.

·        Norm-referenced interpretations are less valid when the norm group is ill-defined.

 

Norm referencing is made easier by the use of derived scores:

1.      percentile ranks – tell the percentage of persons in a norm group who scored lower than the student’s raw score.

2.      linear standard scores – tell the location of a particular raw score in relation to the mean and standard deviation of a norm group.

3.      normalized standard scores – tell the location of a particular raw score in relation to a normal distribution fitted to a norm group.

4.      grade-equivalent scores – tell the grade placement for which a particular raw score is the average for a norm group.

 

Norm referencing is not enough to fully interpret a score; we also need to know what specific learning targets a student has achieved.

 

Criterion Referencing

Criterion referenced scores need a well-defined domain of performance in order to be considered valid.

·        A poor sampling of the domain by the assessment items will also lead to lower validity.

·        Both the number of assessment items and the representativeness of the domain contribute to the validity of criterion-referenced interpretations.

 

Criterion-referenced assessments do not have derived scores, but can be expressed by the following:

1.      percentage – a number telling the percentage of the maximum possible points earned by a student.

2.      speed of performance – the amount of time a student takes to complete a task, or the number of tasks completed in a fixed amount of time.

3.      quality ratings – the quality level at which a student performs.

4.      precision of performance – the degree of accuracy with which a student completes a task.

 

Reasons for Assessing:

1.      To describe the performance a student has achieved within each subject area.

2.      To describe a student’s deficiencies which need improvement within each subject area.

3.      To describe which subject areas a student is strong/weak in across the curriculum.

4.      To describe the amount of educational development a student has made in each subject area over the years.

 

·        Numbers 1 & 2 require criterion-referencing, while numbers 3 & 4 require norm-referencing.

·        Standardized tests describe students’ relative strengths and weaknesses in different curricular areas because of the normative information they provide.

 

Norm groups provide the basis for defining educational development scales across different grade levels.

·        Students’ scores are referenced to the developmental scale one or two times a year.

·        Teachers are usually not required to construct growth scales or calculate scale scores, but they do need to know how to read and interpret them.

 

 

Types of Norm Groups

·        Test manuals report performance of large, representative samples of students (the norm groups).

·        A group’s current average does not tell how the group has achieved in relation to the state’s/district’s standards, but it does provide information on the range of performance you can expect from the students.

 

1.      Multiple norm groups – students usually belong to more than one norm group (e.g., females, Hispanic, 9th graders).

2.      Local norms – for most norm-referenced interpretations, the most appropriate group with which to compare a student is the local norm group (same grade, same school).

3.      National norms – most norm-referenced, standardized achievement and aptitude batteries have national norms, but each publisher uses a different definition of what constitutes a representative national sample, so norms from different publishers are not comparable.

·        Modal-age norms – include only those near the typical age for a certain grade level of students.

4.      Special norm groups – norm groups formed from special populations, such as the deaf, blind, students with MR, those in a certain course, or students in a certain region.

5.      School average norms – used when a comparison of a school’s average score is needed. They are a tabulation of the mean score from each school building in a national sample of schools, which provides information on the relative ordering of the means. School means are compared with other school means because individual scores vary too much to make accurate comparisons.

The most accurate comparisons are obtained when you use a norm group tested nearest to the time of year you are testing a student.

 

Published norms should satisfy three criteria:

1.      relevance – the norm group provided should be relevant to the students you will be comparing to it.

2.      representativeness – the manual should clearly tell you the norm data were based on a carefully planned sample and should also provide you with information about subclassifications (sex, age, etc.).

3.      recency – the data should be recent as curriculum, schooling, social, and economic factors change.

 

Overview of Norm-Referenced Scores

 

Norm Tables – tables the test publisher provides for converting raw scores into different kinds of norm-referenced scores.

 

The most useful and easily interpreted norm-referenced score is the percentile rank.

·        Tells the percentage of students in a norm group who fall below the raw score in question.

·        Multiple sets of percentile ranks are often reported when there are large differences between groups (male, female, overall).

·        Make sure you use norm tables that correspond to the time of year you are testing.

·        Due to measurement error, do not interpret percentile ranks too precisely (percentile or confidence bands, which are based on the standard error of measurement, are often reported).

·        To calculate percentile ranks, take one-half of the number of test takers falling at the score in question, add to this the number of test takers who fall below that score, divide this sum by the total number of test takers, then multiply by 100:

o       Example – 6 students scored an 88, 15 students scored below 88, and there were 24 students total: (6/2 + 15) / 24 x 100 = 75, so a score of 88 is at the 75th percentile rank.

 

·        Remember percentile ranks are specific to the group being referenced.

·        Publishers will often report local norms and national norms.

·        The main disadvantage of percentile ranks is that they are often confused with percentage correct on an assessment.
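
The percentile-rank calculation described above can be sketched in Python. This is only a minimal illustration; the function name percentile_rank and the list of scores are assumptions, made up so that the counts match the example (6 students at 88, 15 below, 24 in all).

```python
def percentile_rank(scores, raw_score):
    """Percent of the group below raw_score, counting half of those tied at it."""
    below = sum(1 for s in scores if s < raw_score)
    at = sum(1 for s in scores if s == raw_score)
    return (at / 2 + below) / len(scores) * 100


# Hypothetical class of 24: 15 scores below 88, 6 scores of 88, 3 above.
scores = [70] * 15 + [88] * 6 + [95] * 3
print(percentile_rank(scores, 88))  # 75.0
```

The result depends only on where the score stands within the group, not on the percentage of items answered correctly, which is why the two should not be confused.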

 

Linear Standard Scores tell how far the raw score is from the mean of the group in standard deviation (SD) units.

·        These are used to make two separate distributions more comparable.

·        z-Scores – the fundamental linear standard score, which tells the number of standard deviation units (see appendix H) a raw score is above or below the mean of a certain distribution:

o       z = (raw score – mean) / standard deviation

o       Example: for a raw score of 88, a mean of 86, and a standard deviation of 6: (88 – 86) / 6 = .33, so the person’s raw score is 1/3 of a standard deviation above the mean (because it is positive – if it were negative, that would signify the raw score fell below the mean).

o       z-scores between +/- 1 SD are considered typical or average achievement.

o       The main advantage of z-scores is they put scores from different scales on the same scale to make them comparable.

o       The main disadvantage of the z-score is it is difficult to explain to parents and students (but transforming them to SS-scores, discussed next, can resolve this problem).

·        SS-scores – a modification to z-scores that eliminates negative numbers and decimals by using a distribution with a mean of 50, a SD of 10, and rounding:

o       SS = 10z + 50

o       Example – using the z-score of .33 above: 10(.33) + 50 = 53.3, rounded to 53

o       As long as you know this score is based on a distribution with a mean of 50 and a SD of 10, you can interpret this score by mentally reverting to the z-score…that is, a score of 53 is 1/3 of a SD above the mean.

o       The main disadvantage of SS-scores is the person needs to understand the concepts of SD and linear transformation.
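
The two linear standard scores above can be sketched in Python. The function names z_score and ss_score are assumptions chosen for illustration, and the reading and math figures are hypothetical values, not data from the text.

```python
def z_score(raw, mean, sd):
    """Number of SD units a raw score lies above (+) or below (-) the mean."""
    return (raw - mean) / sd


def ss_score(z):
    """Linear standard score on a scale with mean 50 and SD 10, rounded.

    Note: Python's round() sends exact halves to the nearest even integer.
    """
    return round(10 * z + 50)


# Hypothetical scores for one student on two tests with different scales.
reading_z = z_score(45, mean=40, sd=10)
math_z = z_score(70, mean=64, sd=8)
print(reading_z, math_z)   # 0.5 0.75
print(ss_score(0.33))      # 53, matching the SS-score example above
```

Although the raw scores (45 and 70) are not directly comparable, the z-scores suggest this hypothetical student stands somewhat higher in the math group than in the reading group.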

 

Normal Distributions are a family of mathematical distributions, also called normal curves, to which scores can be referenced or transformed.

·        Every normal curve is smooth, continuous, and bell-shaped, but they can have different SDs and means which make them appear flatter and more spread out, or taller and less spread out (see Figure 17.4a and 17.4b).

·        Raw score distributions are more jagged and less symmetrical (think of the Unit 2 exam test distribution).

·        Normal curves are useful for interpretation purposes, but realize they are hypothetical, and not the actual distribution of scores.

·        Area under the curve – a normal distribution cut up into standard deviation units with percentages of scores falling under each section (see Figure 17.6).

o       68% of scores fall between +/- 1 SD

o       95% of scores fall between +/- 2 SDs

o       99.7% of scores fall between +/- 3 SDs
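
The percentages above can be checked with Python's standard library. This is a small sketch; the helper name area_within is an assumption introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within +/- {k} SD: {area_within(k) * 100:.1f}%")
```

The printed values (about 68.3%, 95.4%, and 99.7%) show that the figures in the list are rounded.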

 

Normalized Standard Scores transform scores from a non-normal distribution of raw scores into a normal curve.

·        Normalized z-Scores – these occur when z-scores have percentile ranks corresponding to what we would see under the normal curve:

o       zn = the z-score corresponding to a given percentile rank in a normal distribution.

o       Example – the above example of a percentile rank of 75 for the non-normal distribution of test scores (raw) had a z-score of .33: referring to a table of area under the curve in a basic statistics textbook would show the normalized z-score to be .675

o       These are the z-scores that would have been obtained if the distribution were normal.

·        Normalized T-scores – tell the location of a raw score in a distribution with a mean of 50 and a SD of 10 (this is the counterpart to the SS-scores). The difference between this score and the SS-score is that the z-score in the formula is a normalized z-score:

o       T = 10zn + 50

o       Example – using the normalized z-score above: 10(.675) + 50 = 56.75 rounded to 57.

o       The advantage of the T-score over the normalized z-score is the same as for the SS-score over the z-score (more interpretable without negative numbers or decimals), but it also allows looking up percentile ranks under the area under the normal curve.

·        Deviation IQ Scores – this is used with certain intelligence assessments, is equivalent to a normalized standard score, and is used in a distribution with a mean of 100 and a SD of 15 or 16, depending on the assessment being used.

o       The advantage over the old way of calculating IQ scores (mental age / chronological age x 100) is that the scores are calculated on chronological age regardless of grade placement (mental age).

o       DIQ = 16zn + 100

o       Example – using the normalized z-score above: DIQ = 16(.675) + 100 = 110.8 rounded to 111.

·        Stanine Scores – tell the location of a raw score in a specific section of a normal distribution (refer to Figure 17.7).

o       The normal distribution is divided into 9 segments, each of them one-half SD wide (except for Stanines 1 &  9 at the ends).

o       All persons falling within an interval are assigned the same Stanine for that interval.

o       Publishers recommend using Stanines for interpretation of achievement and aptitude scores.

o       The advantages are they are single digits, they have equal intervals, and they don’t require an exactness not warranted by the assessment.

o       Stanines have a mean of 5 and SD of 2.

o       Some argue the Stanine score is more problematic than percentile rank scores, because they are only single digits, and the scores are grouped in a coarse interval.

o       Stanines, like percentile ranks, are specific to the group from which they were calculated.

·        SAT Scores – from the Scholastic Assessment Test I, these are a normalized standard score based on a reference group of 1,052,000 students who graduated from high school in 1990 and who took the SAT in their junior or senior year.

o       The SAT score tells the location of a raw score in a distribution with a mean of 500 and a SD of 100.

o       SAT = 100zn + 500

o       Example – Using the normalized z-score calculated above: 100(.675) + 500 = 567.5, rounded to 568.

·        Normal Curve Equivalents – are normalized standard scores with a mean of 50 and a SD of 21.06. They are primarily used with federal evaluation programs to measure gains from educational programs that use different standardized tests.

o       NCE = 21.06zn + 50

o       Example – Using the same normalized z-score above: 21.06(.675) + 50 = 64.22, rounded to 64.

o       Refer to Table 17.7 to see the relationship between NCEs, Stanines, and percentile ranks. You will notice NCEs are closely related to Stanine scores in that if you move the decimal left one digit and round, you have approximately the equivalent Stanine score (64 to 6.4 to 6).
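
The normalized scores in this section all start from the same normalized z-score, which the following Python sketch makes explicit. The function names normalized_z and stanine are assumptions for illustration, and the stanine rule simply encodes the half-SD-wide bands described above. The library computes the normalized z-score for a percentile rank of 75 as about .6745 rather than the table value of .675, so unrounded results can differ slightly from the hand calculations.

```python
import math
from statistics import NormalDist


def normalized_z(percentile_rank):
    """z-score with the given percentile rank in a normal distribution.

    The percentile rank must lie strictly between 0 and 100.
    """
    return NormalDist().inv_cdf(percentile_rank / 100)


def stanine(z_n):
    """Stanine for a normalized z-score: half-SD-wide bands centered on 5."""
    return max(1, min(9, math.floor(2 * z_n + 5.5)))


z_n = normalized_z(75)        # about 0.6745 (tables often list .675)
t_score = 10 * z_n + 50       # normalized T-score
diq = 16 * z_n + 100          # SD of 16 assumed; some assessments use 15
sat = 100 * z_n + 500         # SAT-style scale
nce = 21.06 * z_n + 50        # normal curve equivalent

print(round(t_score), round(diq), round(nce), stanine(z_n))  # 57 111 64 6
```

With the more exact z-score, the SAT-style value comes out near 567.4 rather than 567.5, so it would round to 567 instead of 568; this is a reminder not to read such scores too precisely.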

 

Extended Normalized Standard Scores tell the location of a raw score on a scale of numbers that is anchored to a lower grade reference group.

·        Scale scores from assessment batteries need to be able to show growth over grade levels because they span several grades and ordinary scale scores do not reflect this.

o       Percentile ranks and T-scores could remain essentially the same without showing growth that did occur.

o       Extended normalized standard scores are anchored on a “ruler” from the lower grade which is a continuum that extends beyond the lower grade’s distribution; low scores represent the lowest level of achievement, and high scores the highest.

o       The process of extended normalized standard scores is as follows:

§        Choose a base or anchor group and calculate normalized z-scores that extend beyond the range of scores for this anchor group (i.e., beyond 3 SDs).

§        Administer a series of assessments with overlapping items for adjoining groups (grades 1 & 2, 2 & 3, 3 & 4, etc.).

§        Tabulate and normalize the scores for each grade.

§        Place all groups on the extended z-score scale from the anchor group.

·        Item Response Theory (IRT) – uses a mathematical equation that is fitted to the publisher’s norm sample of students’ item responses.

o       A student’s score depends on his/her pattern of right and wrong answers.

o       IRT models consider the difficulty and discrimination of each item across a range of ability levels.

o       Expanded standard scores can have lower measurement error and greater reliability than traditional number-right scores.

o       Program evaluators and school researchers prefer expanded standard scores, but they are not readily interpretable by teachers, parents, and students.

o       Expanded standard scores show different SDs between subjects and progressively increasing SDs from grade to grade.

·        Grade Equivalent Scores tell the grade placement at which a raw score is average. They are educational development scores, which are usually used with achievement tests at the lower grade levels.

o       GE scores are reported as decimals (e.g., 6.7) where the whole number reflects the grade level (e.g., 6th grade) and the decimal represents the month of the school year (e.g., 7th month, or March in a typical school year).

o       The typical school year runs 9 months, so students are assumed to gain the equivalent of 1 month of knowledge over the summer (which makes for 10 equal steps in the tenths position). This is an erroneous belief about growth, since knowledge usually decreases over the summer.

o       The match between what is taught and what is on the test, up until the testing point, is a major problem with GE scores.

o       GE scores also do not identify a student’s strengths and weaknesses across different subject matters.

o       Administrators, teachers and parents often misinterpret GE scores.

o       Test manuals usually provide conversion tables to aid interpretation. The development process is as follows:

§        The publisher creates a series of overlapping tests.

§        The tests are administered to large representative samples at each grade level.

§        Dates on which the tests are administered are called empirical norming dates.

§        Overlapping tests are then linked using expanded score scales (vertical linking or vertical equating).

§        The publisher then locates and graphs on a plot the median score in each grade’s norm group and the grade equivalent is assigned as 1.5, 2.5, 3.5, or whatever grade is in question.

§        Extrapolation (extending the line beyond the norm groups actually tested according to the trend of the medians) and interpolation (reading values that fall on the line connecting the medians of the groups actually tested) are then used to calculate GE scores for raw scores that fall in between or beyond the median raw scores (refer to Figure 17.11).

o       It is a misconception to say students should have the same placement as their grade equivalent scores:

§        Example – 3rd graders who score at the 5th grade level did so taking items on a 3rd grade test, not a 5th grade test.

§        By definition of the median used to calculate GE scores, half of the students in a grade will fall above the median.

o       To interpret mastery based on the fraction of a GE score is a misinterpretation.

§        The more closely the items on the test match what you taught, the higher above the norm group’s median your students’ scores will fall.

§        If teaching sequence and testing sequence are not aligned, inferring mastery is inappropriate.

o       Grade equivalent scores depend on the actual items placed on the test as well as the particular norm group used, which makes results from two different publishers’ assessments noncomparable.

o       Another misinterpretation is to compare GEs from different subject areas to each other.

§        Scores from one subject area are usually more spread out than those in another subject area, resulting in different patterns of interpolation.

§        If all students in the same norm group took the same tests in all the subjects, then percentile ranks are the appropriate scores to use.

o       Normal Growth View – a student should exhibit a growth of 1.0 GE score per year.

§        Because GEs are based on the median for each year, only students who score exactly at the median each year will achieve a growth of exactly 1.0 (and maintain percentile rank from year to year).

o       GE scores also do not have a one-to-one correspondence with the number of questions a student answers correctly on a test.

o       Because it is inappropriate to average grade equivalent scores (based on how they are calculated), grade mean equivalents are used to tell the grade placement of a group’s average expanded scale score:

§        First average the expanded scale scores, then look up the grade equivalent.

o       Because teachers and administrators have a need for some measure of educational growth, they should use grade equivalents only as coarse indicators, and they should report them with the corresponding percentile ranks.
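
The interpolation and extrapolation steps, and the reason grade mean equivalents are computed from averaged scale scores, can be sketched in Python. Everything in this example is hypothetical: the function name grade_equivalent, the three median scale scores, and the class scores are made up for illustration, and a straight line between medians is only an assumption about how a publisher might draw the curve.

```python
def grade_equivalent(score, medians):
    """Estimate a GE from (grade placement, median scale score) pairs.

    Assumes at least two pairs with strictly increasing medians.
    """
    points = sorted(medians, key=lambda p: p[1])
    # Pick the segment containing the score. Scores below the first median
    # use the first segment and scores above the last median use the last
    # segment, which amounts to extrapolating the line.
    for (g0, s0), (g1, s1) in zip(points, points[1:]):
        if score <= s1:
            break
    return round(g0 + (score - s0) * (g1 - g0) / (s1 - s0), 1)


# Hypothetical medians from norm groups tested at mid-year in grades 3, 4, 5.
medians = [(3.5, 20), (4.5, 28), (5.5, 34)]

print(grade_equivalent(24, medians))  # 4.0 (interpolated)
print(grade_equivalent(37, medians))  # 6.0 (extrapolated above grade 5.5)
print(grade_equivalent(16, medians))  # 3.0 (extrapolated below grade 3.5)

# Grade mean equivalent: average the scale scores first, then look up the GE.
class_scores = [16, 24, 37]
mean_score = sum(class_scores) / len(class_scores)
print(grade_equivalent(mean_score, medians))  # 4.2
```

Averaging the three individual GEs instead (3.0, 4.0, and 6.0) would give about 4.3, which differs from the grade mean equivalent of 4.2 because the distance between medians is not the same from grade to grade.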

 

Usually you will need to report normalized standard scores with their corresponding percentile ranks.

 

General guidelines for score interpretation:

1.      Look for unexpected patterns in the scores – assessments should confirm what the teachers already know for the most part.

2.      Seek an explanation for the patterns – check for motivation, special interests or difficulties when seeking an explanation for a change in scores.

3.      Don’t expect surprises for every student – most students’ scores should be as you expect.

4.      Small differences in subtest scores should be considered chance occurrences – use the standard error of measurement to determine actual differences.

5.      Use information from various assessments and observation to explain performance on other assessments.

 

You need to be prepared to answer parents’ questions, and therefore, you should be familiar with the responses and types of knowledge you should report to parents regarding their questions (see Table 17.13).

 

Always use a student’s classroom performance to complement and explain their standardized test results.

 

Common parental misunderstandings:

1.      Grade equivalent scores tell which grade a student should be in.

2.      Percentile rank and percent correct scores mean the same thing.

3.      The percentile rank norm group consists of only the students in a particular classroom.

4.      “Average” is the standard to beat.

5.      Small changes in percentile rank over time are meaningful.

6.      Percent correct scores below 70 are failing.

7.      If you get a perfect score, your percentile rank must be 99.

 

 

This Webpage designed and updated (12/3/01) by Ron Dugan, University at Albany, State University of New York.