EPSY440 - Evaluation


Chapter Nine Notes (Nitko, 2001)


Essays have a long (perhaps the longest) history among paper-and-pencil tests.

Teachers use essays to assess higher-order thinking skills such as explanation, communication, comparison, contrasting, analysis, synthesis, and evaluation, as well as to assess writing skills.

The two major types of essay items are:

restricted response - the item restricts or limits what the student is required to answer.

extended response - the item allows students free rein in expressing their ideas and in relating and organizing those ideas; no single answer is correct.

Essay items should ask students for more than simple recall; they should ask students to apply their knowledge to new situations.

interpretive exercises (context-dependent) - the student is required to write the essay based on accompanying material.

Extended response essays allow students to express subject-matter knowledge and general writing ability.

The unique quality of essays is that they give students an opportunity to show their ability to write about, organize, express, and explain interrelationships among concepts and ideas.

Preparing for essays encourages the studying of broad concepts (vs. studying for objective tests, which encourages the studying of facts).

There is conflicting evidence on the usefulness of essays for improving writing skills - some argue essays increase this skill and students write more, while others argue students may write more, but the writing isn't necessarily of any better quality.

A continually emphasized point throughout this course is that multiple assessments lead to more valid interpretations.

Essays limit the range of content that can be covered, but they allow more in-depth coverage of the learning target being assessed.

To compensate for the narrow range of targets covered by essays, multiple essays should be used over an extended period of time.

Factors that affect reliability of scoring essays include:

  1. inter-rater reliability - scoring may be inconsistent between raters.
  2. inconsistent standards - may vary between readers and between scoring periods.
  3. rater drift - the failure to pay attention to criteria over time.
  4. changes in topic and prompt - different items will elicit different scoring from the same rater.
  5. halo effect - judgments of one characteristic of a student influence judgments of other characteristics.
  6. carryover effect - the judgment of one item is carried over to judgment of other items.
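
As a rough illustration of the first factor, agreement between two raters can be checked by comparing the scores each gave to the same set of essays. The sketch below does not come from the chapter: the function name, the made-up scores, and the 1-point "adjacent" tolerance are all assumptions for illustration.

```python
def agreement_rates(rater_a, rater_b, tolerance=1):
    """Return (exact, adjacent) agreement proportions for two raters.

    rater_a, rater_b: scores given to the same essays, in the same order.
    tolerance: hypothetical cutoff for counting two scores as "adjacent".
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of essays")
    pairs = list(zip(rater_a, rater_b))
    # Proportion of essays on which the two raters gave identical scores.
    exact = sum(a == b for a, b in pairs) / len(pairs)
    # Proportion of essays on which the scores differ by at most `tolerance`.
    adjacent = sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)
    return exact, adjacent


# Made-up scores on a 1-5 scale for five essays.
scores_a = [4, 3, 5, 2, 4]
scores_b = [4, 2, 5, 4, 3]
exact, adjacent = agreement_rates(scores_a, scores_b)
print(f"exact agreement: {exact:.0%}, within 1 point: {adjacent:.0%}")
```

Low agreement on a check like this may suggest that the criteria or the scoring rubric need to be tightened before the scores are used.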

Scoring essays carelessly because of time restrictions and the large number of essays to grade is a violation of professional ethics and responsibilities.

A decision needs to be made early whether essays are the appropriate method for assessment to avoid careless scoring.

Using well defined criteria and scoring rubrics can assist in scoring essays appropriately and in a timely manner.

Checklist for judging the quality of essay items:

  1. Does the item assess an important aspect of the learning target?
  2. Does the item match the teacher's assessment plan (i.e., test blueprint or Table of Specifications)?
  3. Does the item require students to apply knowledge to a new or novel situation?
  4. In relation to other items, does the item add to content coverage?
  5. Is the item focused (specific directions)?
  6. Is the level of complexity consistent with the educational/maturational level of the students?
  7. Is more than just the recall of facts required to score well on the item?
  8. Is the item worded so that all students will interpret it in the same manner?
  9. Does the wording of the item specify the length, purpose, amount of time to be devoted, and the basis of evaluation?
  10. If the item asks for opinions, are logic and evidence emphasized over correctness of the answer?

Again, revision is an absolute and necessary step in the construction of items.

Items #1 and #2 above are considerations in any item format. It is best to focus on the type of response you want rather than on how well the learning target is stated.

Higher-level thinking is assessed best when knowledge is applied to new situations; otherwise you may be assessing the simple recall of factual material.

Your assessment should cover the wide range of content and thinking skills (i.e., the range of levels on Bloom's Taxonomy) that make up your assessment plan.

Because essays take time to both complete and score, they should be balanced with objective items that cover a range of skills and learning targets.

If essay items are not focused, it will be impossible to distinguish those who know the material from those who do not.

It is good practice to have colleagues or students review your essay items to ensure similar interpretation of what is being asked.

As with other item formats, the wording and vocabulary should be carefully controlled to avoid confusion and allow for maximum readability.

You should use short-answer, completion, T-F, MC, or matching items if you are assessing simple recall or recognition, and essay items for higher-order thinking (e.g., application, synthesis, evaluation).

Make sure the framework of the item is clearly outlined so students know what they are to respond to, in what amount of time, at what level, and for what audience.

Students need to know how they will be evaluated so they can focus their response accordingly.

Some key words and phrases that assist in constructing various types of essay items usually include:

Optional essay questions (a choice of which ones to respond to) have been found to lead to inequities in assessment because students cannot be compared as they are answering different questions.

If the only thing being evaluated is students' general writing ability, then the use of optional items would be appropriate.

Two general methods for scoring essays are analytic and holistic scoring rubrics.

analytic scoring rubric - an outline of the major elements students need to include in the response, along with the points assigned for each specific element (assignment of partial credit should also be outlined).
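
One way to picture an analytic rubric is as a table of elements and the maximum points for each, with partial credit allowed up to that maximum. The following is a minimal sketch under that assumption; the element names, point values, and the function `score_response` are hypothetical and are not taken from the chapter.

```python
# Hypothetical rubric: element -> maximum points (not taken from the text).
RUBRIC = {
    "states a clear position": 2,
    "supports it with evidence": 4,
    "explains relationships among ideas": 4,
}


def score_response(awarded, rubric=RUBRIC):
    """Total an analytic score, allowing partial credit per element.

    awarded: element -> points the scorer gave (missing elements count as 0).
    Returns (points earned, points possible).
    """
    unknown = set(awarded) - set(rubric)
    if unknown:
        raise ValueError(f"elements not in rubric: {sorted(unknown)}")
    total = 0
    for element, max_points in rubric.items():
        points = awarded.get(element, 0)
        if not 0 <= points <= max_points:
            raise ValueError(f"{element!r}: {points} is outside 0-{max_points}")
        total += points
    return total, sum(rubric.values())


earned, possible = score_response({
    "states a clear position": 2,
    "supports it with evidence": 3,  # partial credit
    # the third element was not addressed, so it earns 0
})
print(f"{earned} of {possible} points")
```

Writing the elements and their maximum points down before scoring begins is what lets the same standard be applied from paper to paper.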

holistic scoring rubric - a judgment is made of the overall quality of the response.

Holistic scoring rubrics are usually more appropriate for evaluating extended response essays which assess synthesis and creativity.

Holistic scoring rubrics can be set up by:

Holistic scoring rubrics help to score papers faster and view the paper as a working whole, but they don't point out details and can introduce your own bias and errors.

Analytic scoring rubrics can give more detailed information about student strengths and weaknesses by showing which areas gave students the most problems, and some elements can be weighted more heavily than others. However, they are also slower to use, well-defined elements are more difficult to come up with, and the time needed to prepare them can be frustrating.
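
The two advantages above (weighting some elements more heavily and seeing which areas gave students the most problems) can be sketched as follows. This is only an illustration: the element names, the integer weights, the 0-4 rating scale, and the student data are all made up.

```python
# Hypothetical element weights; heavier elements count more toward the total.
WEIGHTS = {"content": 5, "organization": 3, "mechanics": 2}

# Made-up ratings on an assumed 0-4 scale for each element.
CLASS_RATINGS = {
    "student_1": {"content": 3, "organization": 2, "mechanics": 4},
    "student_2": {"content": 2, "organization": 2, "mechanics": 2},
}


def weighted_score(ratings, weights=WEIGHTS):
    """Sum of each element's rating multiplied by its weight."""
    return sum(weights[element] * ratings[element] for element in weights)


def element_means(class_ratings, weights=WEIGHTS):
    """Average rating per element across all students."""
    n = len(class_ratings)
    return {
        element: sum(r[element] for r in class_ratings.values()) / n
        for element in weights
    }


for student, ratings in CLASS_RATINGS.items():
    print(student, weighted_score(ratings))

means = element_means(CLASS_RATINGS)
# The element with the lowest class average is a candidate for reteaching.
weakest = min(means, key=means.get)
print("lowest-scoring element:", weakest)
```

A per-element summary like this is the kind of detail a single holistic score does not provide.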

annotated holistic rubric - a combination (hybrid) of the analytic and holistic scoring rubrics: quality levels are defined and papers are scored holistically, then brief comments are made pointing out strengths and weaknesses of the response.

More scoring suggestions:

  1. use scoring rubrics to apply the same standards from paper to paper
  2. score one question at a time for all students, then go to the next one
  3. score subject-matter correctness separately from writing ability
  4. score essays anonymously to avoid halo effects
  5. give feedback to students as to why their essays were graded as they were (strengths and weaknesses)
  6. meet with students individually when possible
  7. when important decisions rest on essay scores, use more than one reader (independent scoring)
  8. evaluate the quality of your self-constructed scoring rubrics


This Webpage designed and updated (10/29/01) by Ron Dugan, University at Albany, State University of New York.