Assessment Handbook: A guide for developing assessment programs in Illinois schools

1995 Edition

ILLINOIS STATE BOARD OF EDUCATION

School and Student Assessment Section
100 North First Street
Springfield, Illinois 62777-0001

Louis Mervis, Chairperson
Joseph Spagnolo, State Superintendent

CHAPTER 1

LOCAL ASSESSMENT SYSTEMS
Overview of Assessment
Developing Local Assessment Systems

CHAPTER 2

THE QUALITY OF ASSESSMENT
Ensuring the Quality of Assessment
Other Important Topics

CHAPTER 3

SELECTION AND DEVELOPMENT OF ASSESSMENT PROCEDURES
Selecting Assessment Procedures
Developing Performance-Based Assessment Procedures
- Developing tasks
- Developing scoring scales

CHAPTER 4

INTERPRETATION, USE, AND REPORTING OF ASSESSMENT RESULTS
Reporting Results
- Audiences, information, and format

REFERENCES

Chapter 1

Local Assessment Systems

This handbook is intended to help schools and school districts develop and implement assessment programs that produce useful, high-quality information. Assessment is a critical element of the Illinois school improvement process.

Overview of Assessment

What is assessment?

Assessment is the major focus here, although testing will also be discussed. The terms test and assessment are frequently used interchangeably.

Technically, however, there are important differences between them. A third term, evaluation, is sometimes used interchangeably with the other two but is actually distinct. To avoid confusion, the three terms are defined below.

Test, the narrowest of the terms, usually refers to a specific set of questions or tasks that is administered to an individual or to all members of a group and measures a sample of behavior. It is highly structured and can be administered and scored consistently within and across groups of students, thus making it highly reliable. It requires a relatively limited period of time to administer.

Assessment is more encompassing and includes the collection of information from multiple sources. A test is one kind of assessment. Assessment may also include rating scales, observation of student performance, portfolios, individual interviews, and other procedures. Assessment may refer to groups or individuals. Group assessment may involve administering different performance tasks or subsets of items to different samples of students and reporting the results for groups but not for individuals. In addition, assessment often refers to a planned program or system.

Evaluation refers to making a value judgment about the implications of assessment information. This process is necessary for school improvement planning. While assessment involves obtaining achievement data through a variety of means, evaluation goes a step further - interpreting the data from an informed perspective. That perspective should also draw on knowledge of such matters as instructional content, community context, school climate, and dropout rate. Although this Handbook includes some material about interpreting assessment results, for the most part it does not address evaluation.

In summary, testing provides one isolated glimpse -- analogous to taking a picture with a camera -- of student achievement (individual or group) in specific skills or knowledge at a specific time. Assessment provides more comprehensive data from multiple measures administered over a period of time or, preferably, a variety of data-gathering approaches. Evaluation produces value judgments about the results of assessment.

Testing, assessment, and evaluation are strongly interdependent; the quality of one affects the quality of the others. Good tests strengthen assessment; well-planned assessment increases the probability of valid and accurate evaluation.

What is a comprehensive assessment system?

A comprehensive assessment system is a coordinated plan for periodically monitoring the progress of students at multiple grade levels in a variety of subjects. It specifies the procedures that will be used for assessment; indicates when and how those procedures will be administered; and describes plans for processing, interpreting, and using the resulting information. It takes information collected at various levels--classroom, school, district, and state--into consideration. A comprehensive assessment system includes:

a schedule for assessing students throughout the school year and at multiple grades,
a variety of assessment procedures that are used appropriately,
technical and ethical standards that protect the quality of student assessment, the validity of interpretations and uses of resulting information, the reliability of performance-based and other assessment procedures, and the fairness of the assessment,
provisions for collecting other relevant information (e.g., analyses of the instructional process, judgments about local conditions or needs, and contextual or system variables such as mobility and dropout rates) that can be used to supplement or help interpret achievement data during decision making, and
plans for processing and using the assessment results in ways that help schools and districts meet information needs, are efficient, and minimize the paperwork burden on staff.

Assessment types

There are numerous ways to categorize the many different types of assessment. The previous version of this Handbook used the following classification scheme:

Publisher's Standardized Shelf Tests
Publisher's Customized Tests
Publisher's Textbook Tests
District's Locally Developed Tests
Uniform Procedures Requiring Written Performance
Uniform Procedures Requiring Other (Nonwritten) Performance

This version classifies assessment into two types and discusses three sources of assessment.

The two types of assessment are:

Forced-choice
Performance-based (also known as "complex generated response")

The three sources of assessment are:

Commercial publishers (including textbook publishers)
Local (developed by local staff)
Other (adopted or adapted from elsewhere)

The major types and sources of assessment are discussed below, along with the main advantages and limitations of each.

Forced-choice assessment

Forced-choice assessment (referred to in some publications as "selected-response") requires students to select the correct response from two or more alternatives (e.g., multiple-choice, true-false, or matching test items) or supply a word, a phrase or several sentences to answer a question or complete a statement. This approach is sometimes described as "traditional," "standardized," or "objective." (However, these descriptions apply to all assessment procedures--performance-based and forced-choice--which will be aggregated or used for comparison. Both types of assessment are traditional in the sense that they have long histories. Both should be standardized in the sense that the same procedures are used with all students and are administered and scored uniformly. Both should be objective in the sense that they are analyzed systematically and results can be verified.) Most forced-choice assessments are paper-and-pencil tests. The "correct" responses are seldom debatable. Students' scores are likely to be the same regardless of who scores the test.

The following types of assessment are considered forced-choice:

Multiple choice
Matching
True-false
Multiple correct
Fill-in-the-blank
Completion

An advantage of forced-choice assessments is their efficiency in collecting information about student achievement. Questions are usually fairly brief. To respond, students fill in a bubble, check or circle the correct answer, or write a brief response. Thus, students can answer a large number of questions that cover a rather broad content area in a relatively short period of time. Responses can often be machine scored.

A limitation of forced-choice assessment is that it cannot be used to validly assess many skills which require students to actually perform (conduct a scientific experiment, run a 100-yard dash) or to make a product (write an essay, create a painting). Another limitation is that students can sometimes guess the correct response without actually having the knowledge assessed. Still another limitation is the difficulty of writing good questions or items which assess higher order thinking skills.
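The guessing limitation can be put in rough numerical terms. The sketch below is illustrative only and is not part of the handbook's procedures. It assumes that every item has the same number of answer options and that a student guesses blindly and independently on every item. The function names and the example figures (a 20-item test and a cut score of 12) are hypothetical.

```python
from math import comb


def expected_chance_score(num_items, num_options):
    """Average number of items answered correctly by blind guessing."""
    return num_items / num_options


def prob_at_least(num_items, num_options, cut_score):
    """Binomial probability of guessing at least cut_score items correctly."""
    p = 1 / num_options  # chance of guessing any single item correctly
    return sum(
        comb(num_items, k) * p**k * (1 - p) ** (num_items - k)
        for k in range(cut_score, num_items + 1)
    )


# Hypothetical 20-item test (assumed figures, for illustration only).
print(expected_chance_score(20, 4))  # four-option multiple choice: 5.0
print(expected_chance_score(20, 2))  # true-false: 10.0
print(prob_at_least(20, 4, 12))      # chance of reaching 12 correct by guessing
```

Under these assumptions, a true-false test yields a higher chance score than a four-option multiple-choice test of the same length, so scores near the chance level say little about what a student actually knows.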

Perhaps the most effective use of forced-choice assessment is to assess student knowledge, particularly if there is a need to cover a broad content area. Students can answer many forced-choice items in a relatively brief period of time.

Performance-based assessment

Performance-based assessment, also known as complex generated response, extended response, alternative, authentic, or constructed-response assessment, requires students to construct their own responses to questions or prompts--to actually perform or to develop a product. For example, a student might deliver a speech or play a musical instrument; the performances might be live or recorded. Or a student might assemble a portfolio containing descriptions of several cultures and representations of their literature and artwork, write an essay, show a mathematics proof, or make a drawing; such assessments would use paper and pencil. Rather than being treated as correct or incorrect, some responses may be considered qualitatively different from others.

Types of activities that qualify as performance-based assessments (if they are administered and scored uniformly) include:

Writing an essay
Preparing a report or research paper
Reviewing/analyzing/critiquing a performance or product
Solving a multiple-step mathematics problem
Conducting an investigation or experiment
Preparing and presenting an exhibition or demonstration
Developing a visual or graphic representation
Singing or playing a musical instrument
Composing, choreographing, or playwriting
Performing a physical task such as running
Designing a project

An advantage of this type of assessment is its effectiveness in validly assessing learning that involves demonstrating skills or other performances. Also, performance-based assessment often provides more in-depth information--which may be useful for diagnosing individual students' learning needs. It is also more likely to be directly related to real-life skills.

A limitation of performance-based assessment is that while it can validly assess the student behaviors that are directly included in a specific task, those behaviors may not adequately represent a broader domain of interest. It may seem obvious that a task which requires students to explain how they answered an algebra problem may not assess how well the students could explain their answers to a geometry problem. Other situations are not so obvious. For example, students' ability to demonstrate one kind of scientific experiment may not be evidence that they can conduct an experiment that requires different scientific principles or methods.

Another limitation is that scores will be reliable only if the same assessment procedures are used with all students and the assessments are scored by trained raters using uniform rating scales or rubrics. ("Scale" and "rubric" are used interchangeably here.) Such scoring is often time consuming and can be expensive. A further limitation is that performance-based assessments may require much more time to comprehensively assess a body of knowledge than forced-choice assessments.
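One simple way to check whether trained raters are applying a rubric consistently is to have two raters score the same set of student responses and compare the results. The sketch below is an illustration under stated assumptions, not a procedure prescribed by this handbook: it assumes integer rubric scores, uses two common but basic indices (exact agreement and agreement within one scale point), and the function name and scores are hypothetical.

```python
def agreement_rates(ratings_a, ratings_b):
    """Return (exact, adjacent) agreement proportions for two raters.

    exact    = share of responses given identical scores
    adjacent = share of responses scored within one point of each other
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical scores on a 4-point rubric for eight essays.
rater_1 = [4, 3, 2, 4, 1, 3, 2, 4]
rater_2 = [4, 3, 3, 4, 1, 2, 2, 2]

exact, adjacent = agreement_rates(rater_1, rater_2)
print(exact, adjacent)  # 0.625 0.875
```

Low agreement would suggest that the rubric needs clearer descriptions or that raters need more training before the scores are used.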

Some thoughts about this typology

Like any typology that might be used, this one has shortcomings. Assessment approaches cannot always be neatly categorized into forced-choice or performance-based. For example, completion items that require students to supply as much as several sentences certainly require them to generate their own responses. Completion items were categorized with forced-choice approaches because, generally, they are simpler and do not require students to generate their own responses in the same way as most performance-based responses. On the other hand, some performance-based tasks can be rather simplistic and lack some of the benefits commonly associated with performance-based assessment.

Some assessment approaches combine the two types presented here. One, known as "enhanced multiple-choice," may require students to explain their responses to multiple-choice items. Another approach uses thematic exercises that may, for example, require students to develop a theatrical character in response to a given stimulus, perform the character's role, and answer multiple-choice and essay questions about character development. An assessment task may require students to write an essay about a historical event and answer multiple-choice questions about it.

Videotaped assessments, which are currently being developed in Illinois to assess students' perceptual skills in the arts, may include forced-choice or performance-based questions. The tapes include excerpts from arts performances (such as in music or dance) and assessment items about them. Students watch and listen to the performances and then answer the questions. The questions may be of any type, such as multiple-choice or essay. They do not require students to perform. However, they do assess students' skills in interpreting and analyzing performances by others.

Portfolios may or may not qualify as performance-based approaches. For example, a collection of essays, stories, or poems written by a student and rated using a systematic scoring rubric can be considered a performance-based approach for assessing writing. Similarly, a collection of reports describing experiments conducted by a student might be appropriate performance-based assessment for science. A videotaped collection of a student's musical performances might be appropriate for the fine arts. To qualify as performance-based, portfolios must contain products or performances generated by students. Portfolios that are simply collections of forced-choice tests completed by students are not performance-based. Another condition that sometimes disqualifies portfolios is that they are simply used to assemble student work and the quality of the work is not judged. Unless the quality of portfolio contents is rated, the portfolios are not being used for assessment.

Written essays are not always clearly appropriate as performance-based assessment. If they are used to assess students' skills in writing or in higher-order thinking skills such as problem solving, analyzing, or critiquing, they are clearly appropriate. However, if they are used to estimate whether or not students have mastered a particular body of knowledge and simply require students to recite that knowledge (verbally or in writing), they do not require students to perform.

To help organize the materials in this document, assessment approaches will be discussed in these two categories. Readers should remember that distinctions between them sometimes blur and that teachers' most important assessment priority should be to ensure that individual assessment procedures are appropriate for their intended uses.

Assessment sources

As mentioned above, educators will probably want to consider three major sources of assessment procedures: (1) commercial publishers, (2) the local school or district, and (3) other educational organizations. The first and third sources would, of course, provide assessment procedures that have already been developed. However, they would have to be reviewed carefully; evidence of validity, reliability, and fairness may need to be collected. Also, revision or additional development may be necessary. The second source would require local educators to develop new assessment procedures specifically for use in the assessment system, although committees should also solicit and review existing assessment procedures that local staff have developed for their own use.

Most schools will probably use a combination of sources. Initially, it may be preferable to obtain assessment procedures from elsewhere in order to save the time and other resource costs of local development. These procedures might be used as they are, used as models, or otherwise adapted. On the other hand, assessment procedures that have been developed locally are likely to focus more directly on local outcomes and instruction and thus may provide more valid data.

Commercial publishers

The most commonly available commercially published assessments are standardized achievement tests (which may be norm referenced or criterion referenced), textbook tests, and tests in other published resources such as curriculum guides or textbook manuals for teachers. Other sources of commercially published assessments include customized tests which publishers tailor to local needs. Many of these tests are forced-choice, but more and more performance-based tests are being published--either by themselves or as components of test batteries which are primarily composed of forced-choice items.

Assessments published by testing corporations have several advantages. Many, especially standardized achievement tests, are likely to have been professionally constructed. It is highly probable that they have been reviewed carefully, pilot-tested, and analyzed for bias. Estimates of validity for various uses and of reliability are usually available in technical manuals and other sources. Item-difficulty data, which can be very useful in deciding how to use an item and in setting student expectations, may also be readily available. Scoring and reporting services can be obtained. Commercially published assessments may be appropriate for several uses, such as for accountability or program evaluation.

Textbook tests also have several advantages. They are likely to be readily available, closely aligned with instruction, and inexpensive. Schools and districts do not have to spend additional money or time to purchase or develop the assessments (although they may have to spend resources examining validity, reliability, and fairness).

Assessment procedures are sometimes also available from other published sources such as teachers' manuals, various resource materials, and books about assessment. Their quality varies considerably, as does their appropriateness in divergent local situations. They must be reviewed very carefully before being adopted. In addition, it may be necessary to obtain permission to copy and/or use them.

Before deciding to use commercially published assessment procedures, however, local planners should examine them carefully. The tests and tasks should be closely aligned with local learning outcomes. Commercial publishers sometimes provide information about alignment, but school personnel should independently examine assessment procedures to make sure that they agree with the publisher's interpretation of local curricula or outcomes and of what the items measure. The tests are unlikely to be valid locally unless such agreement exists. Educators should also review the alignment of textbook tests with the textbook, instruction, and local outcomes. Textbook tests are sometimes poorly aligned with the textbook. Even if textbook tests are aligned with the textbook and with instruction, the tests are not likely to provide useful information about student attainment of outcomes if instruction is poorly aligned with outcomes.

Local planners should also use information about validity, reliability, and fairness carefully. It may or may not be useful locally. Educators may need to ask textbook publishers for additional information about items or tasks. For example, for what uses will the assessment results be valid? What is known about the reliability and difficulty level of the items? Have the items been pilot-tested or reviewed for bias? (For more information about validity, reliability, and fairness, see Chapter 2.)

Before adopting commercially published assessments, local planning groups also need to make sure that assessment results can be aggregated or disaggregated as needed. For example, can a publisher or scoring service provide data for the specific clusters of items or tasks that measure particular local outcomes or objectives? If so, at what cost?
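When item-level results are available, reporting by clusters of items amounts to grouping item scores by the local outcome each item measures. The following is a minimal sketch of that idea, not a description of any publisher's scoring service. It assumes each item is scored 1 (correct) or 0 (incorrect) and is mapped to exactly one outcome; the item identifiers, outcome labels, and function name are hypothetical.

```python
from collections import defaultdict


def percent_correct_by_outcome(item_outcomes, student_responses):
    """Aggregate item scores into percent correct per local outcome.

    item_outcomes:     {item_id: outcome label}
    student_responses: list of {item_id: 1 or 0}, one dict per student
    """
    correct = defaultdict(int)
    attempted = defaultdict(int)
    for responses in student_responses:
        for item_id, score in responses.items():
            outcome = item_outcomes[item_id]
            correct[outcome] += score
            attempted[outcome] += 1
    return {o: 100 * correct[o] / attempted[o] for o in attempted}


# Hypothetical mapping of four items to two local outcomes.
item_outcomes = {"q1": "Outcome A", "q2": "Outcome A",
                 "q3": "Outcome B", "q4": "Outcome B"}

# Hypothetical item scores for two students.
student_responses = [
    {"q1": 1, "q2": 1, "q3": 0, "q4": 1},
    {"q1": 1, "q2": 0, "q3": 0, "q4": 1},
]

print(percent_correct_by_outcome(item_outcomes, student_responses))
# {'Outcome A': 75.0, 'Outcome B': 50.0}
```

If a publisher can supply only total scores, this kind of breakdown is not possible locally, which is why the question is worth asking before adoption.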

Local school or district

Locally developed assessment procedures are also appropriate to use as one of several types of measures. Since locally developed assessments will be administered on a smaller scale than published tests, they may permit more opportunities for such assessment alternatives as having students listen to audio recordings as they respond to questions about music, use videotapes with questions about dance or drama, or simulate scientific experiments on computers. Also, because local staff are involved in development, some audiences may feel greater ownership of the assessment and make more use of the results.

Developing local assessment procedures is a challenging - but rewarding - undertaking. Designing good assessment tasks and items can be time consuming. School or district personnel may have to develop several for each local learning outcome. As with assessments from other sources, staff must address the issues of validity, reliability, and fairness. They may also have to develop scoring scales and train staff to use them. Staff will need to develop processes for printing, scoring, and reporting. Computer software programs can be a great help.

Other educational organizations

Assessment procedures are often available from other schools and districts as well as other kinds of educational organizations (for example, regional education service agencies, professional organizations, colleges and universities, and ISBE). Local school assessment committees may want to adopt or adapt these. Staff should review them carefully, make certain that they have permission to use them, and not amend them without permission to do so. Staff may prefer not to use procedures they cannot adapt as needed. If procedures are revised substantially, they should be reviewed by additional staff and tried out with students.

Developing Local Assessment Systems

Developing a comprehensive assessment system is an important task which must be done with care. Decisions made during this stage will have a major impact on the utility of information collected for various purposes. The system might simply produce information that meets requirements imposed by a larger, external system such as a school district or state department of education. Preferably, however, the local assessment system will help improve student learning by producing information that is genuinely useful in the learning process. Results might be used to help monitor whether students are meeting local learning outcomes or objectives, to evaluate local programs and needs, or to make day-to-day instructional decisions.

Issues which should be addressed during the development of an assessment system include:

What should be the major purposes/functions of assessment? How will the results be used?
What information do various groups (such as teachers, administrators, community members) want and need?
What types of assessment procedures are most appropriate for each purpose? For each learning outcome?
When should assessment occur (grade, time of year, and frequency)?

Practical Tip: Developing an assessment system that provides all of the information needed, is of high quality, and functions smoothly may require several years. Furthermore, modifications will be needed occasionally--for example, as curricula change or student learning improves. Local educators can probably make the most progress if they begin to assess all learning areas and goals. Then, committees of teachers (and others) can expand and refine the assessment procedures for each content area. At the same time, others can work on issues of validity, reliability, and fairness.

The process that a school or district uses to plan its assessment program is critical to the program's success. That process helps determine the quality of the system and its appropriateness for local purposes. Also, the process may help elicit crucial support. The program must be credible - that is, valid and useful - to everyone who is involved, including teachers, administrators, parents, students, and others. One of the most effective ways to generate such support is to involve a wide variety of people in the program's development. The involvement and roles of various committees will be discussed below.

Identifying assessment purposes/functions

Early in the process of developing an assessment system, local planning teams (in consultation with others) need to decide what functions the system should serve. That decision will strongly influence the remainder of the process. A major criterion used to select assessment procedures should be that they will produce the kinds of information needed for various purposes.

Some assessment functions/purposes that may be considered include:

To monitor student progress more systematically and frequently (for example, to learn whether students are meeting local outcomes),
To help inform school improvement planning (for example, by identifying local strengths and weaknesses),
To compare local students with nationwide samples of students (This is sometimes useful for public relations purposes because data from norm-referenced tests provide another perspective on student learning to supplement information about attainment of locally important content.),
To meet legislative requirements (Although this may be a necessary function for most systems, it should be kept secondary. If it becomes primary, student learning is unlikely to improve and resources will have been wasted).

System developers may want to limit the purposes of a school or district assessment system to some or all of the functions listed above and leave others--such as diagnosing individual students' needs, placing students in instructional groups, or evaluating special programs--to the discretion of teachers or others. However, they may also want to consider other functions in order to minimize the assessment procedures that must be administered by selecting those which serve multiple functions.

Deciding what to assess

At some point, perhaps during this stage, planning needs to become specific so that (a) planning teams or committees have the information needed to make decisions regarding when students will be assessed and (b) people who are or will be involved in the assessment process (primarily committee members, school administrators, and teachers) develop clearer understandings of the scope of the assessment system and the commitments that will be needed to implement it. Those commitments will include planning time (for example, to select and develop specific assessment procedures), resources for purchasing assessment procedures as well as scoring and reporting services, and classroom time for administering the assessments.

During this stage, planning teams should decide what content areas (and perhaps what knowledge/skills/learning outcomes) need to be assessed in order to meet the purposes or functions that have been identified. They may want to simply name the content areas that will be assessed. However, more specific information will allow them to plan more effectively. They may even want to identify the learning outcomes or other subcategories for which information will be needed.

Selecting assessment types

When developing a broad outline for an assessment program, this stage is optional. The outline would specify the purposes/functions of resulting information, the outcomes or content which will be assessed, and a schedule that indicates when students will be assessed. However, if assessment types are identified at this point, school personnel can make better decisions about allocating planning time and other resources for developing and purchasing assessment procedures. Also, information about what types of assessment are most appropriate will facilitate later efforts to select or develop specific assessment procedures.

The most important consideration when selecting assessment types is the kind of evidence which will most validly represent the status of students' knowledge/skills in a particular content area or learning outcome. Frequently, more than one type of evidence will be needed. Most content areas and many learning outcomes, for example, describe what students should know and how they should be able to apply that knowledge. A forced-choice test might very efficiently provide comprehensive evidence about student knowledge, but a performance-based approach may be necessary to find out whether students can apply that knowledge. If applying the knowledge means that students should write analytically, written essays should be used for assessment. If it means that students should perform physical tasks such as running in relay races, the assessment procedure should probably be to observe and rate students' performances.

Another reason that an assortment of assessment methods is often needed is that many content areas and outcomes include multiple and varied skills which can be assessed most validly using different methods. In the biological and physical sciences, for example, students may need to be able to design and conduct scientific studies and prepare reports of the results. To determine whether students have attained these skills, the assessment should include procedures that require them to design, conduct, and report the results of scientific studies. Teachers would assess the quality of the designs (perhaps using "model designs" and rubrics), observe and rate students' actions in conducting the experiment (perhaps using accepted scientific standards and criteria and a rubric), and assess the quality of the reports (perhaps using exemplary reports and a rubric).

Establishing assessment schedules

As mentioned previously, assessment designs should specify the grades at which assessment procedures will be used and when they will be administered. Occasionally, such as for standardized assessments, schedules may specify a very limited time frame in which a test should be administered. More often, however, the schedule will indicate that the assessment should be given at the end of a particular instructional cycle or during a "window" of time that may be as long as several weeks.

Practical Tip: Don't allow assessment--or any element of school improvement planning--to be overly emphasized or burdensome. Incorporate assessments into instructional activities. Combine assessment tasks to address more than one learning outcome. Distribute assessment across the academic year. Do not try to formalize every classroom test, quiz, or performance-based task.

Assessment schedules should ensure that resulting information will be available when it is most useful. Ask the following questions:

At what grades is assessment information particularly needed for each content area? (To help identify those levels, examine local improvement plans, curricula, and organizational structures.)
When will the information be most useful to the major audience(s) for whom it is being collected (school improvement planning committees, teachers, etc.)?
Has sufficient time been allowed to score and compile the results so that they will be available when needed?
How and when can the assessment be integrated with instruction to help reduce intrusions on classroom time?

Involving staff and other constituent groups

To the extent possible, those who are affected by assessment results (district and school administrators, teachers, students, parents, and other community representatives) should participate in the assessment's design. Each group offers a unique and vital perspective on which skills are most important to assess, how to assess them, and how to use the results.

Depending on a school or district's particular needs, at least two different types of groups might participate in assessment development. These could include an overall committee (such as a school improvement team) and task forces. The composition and functions of each group are discussed in the following paragraphs.

Practical Tip: Involving teachers (and other key players) from the beginning of assessment planning should produce programs of better quality and result in less criticism. One reason that assessment generates so much controversy is that people feel they were not part of the process and that decisions were made without their knowledge -- leaving them with no option other than to "take it or leave it." Teachers often fear that inappropriate assessment procedures will produce misleading results which imply that their teaching is ineffective.

Overall committee (for policy recommendations)

The overall team or committee is likely to have general responsibility for developing the assessment system. It should probably be broadly representative of the school or district. The committee might include staff who will be involved in implementing the system or using the resulting data (e.g., assessment or curriculum directors), school and district administrators, teachers from different grade levels and subject areas, perhaps representatives of the teachers' association or union, and community members (e.g., school board members).

One of the committee's first responsibilities might be to develop a framework for the assessment system. That framework should include the functions or purposes of the system and the types of assessment procedures to be used. The framework can then be used to guide the development of more specific assessment plans.

This planning team can form an important link between teachers and administrators. Depending on local needs, the link may be broadened to include others. Throughout the planning process, the committee should inform others of its work and invite their suggestions. The committee will probably want to involve other staff more directly by establishing special-purpose task forces.

Task forces (for specific areas and functions)

The number, composition and major functions of the task forces will vary according to local factors such as development needs, school or district size, and type of district (elementary, secondary, or unit). The functions of the task forces may include selecting or developing assessment procedures or designing specific assessment system elements such as procedures for processing or reporting the assessment data.

Special considerations

Members of all committees and task forces should be selected carefully. Whether they are asked to serve or selected from a pool of volunteers, they should be people who are:

Interested in the group's task,
Knowledgeable about the educational program,
Supported and respected by teachers and administrators, and
Able to devote time to the task.

Developing a comprehensive assessment system is time-consuming. School and district administrators may need to identify strategies that make it easier for school personnel to participate. For example, they might schedule meetings during inservice or institute days, hire substitutes, offer salary-schedule or college credits to participants, ask other teachers to cover participants' classes (and perhaps reward them for doing so), or pay staff for working during the summer.

The school board's role

A well-informed school board can be one of the best allies of any assessment program. Proactive administrators keep their boards closely involved throughout the planning and implementation of assessment programs. Regularly scheduled board review can help ensure that assessment practices remain responsive to changing local needs.

At each stage of the planning and implementation of an assessment program, a skillful administrator provides the board with relevant decision-making information. The information is more likely to be used if it is timely, complete, easily understood, and targeted to the decisions at hand.

Practical Tip: Keep teachers involved. Don't allow a scenario such as the following to occur. Administrators in one school district entered into a contract with a private corporation to develop their local assessment system. Corporation representatives held three half-day meetings with teachers near the beginning and end of the development year. The assessment system the corporation developed impressed the university consultants who reviewed it. However, four years later, annual results from two independent measures of student achievement (a commercially published test and the state assessment) showed no growth in student learning. A subsequent investigation found that teachers ignored the assessment results provided by the district's contractor. The teachers had had little input into the local assessment system and little knowledge about it until after it was completed. Consequently, they did not view the contractor's assessment data as credible indicators of student learning.

During the initial planning stages of an assessment program, the board should be educated about:

the potential purposes of assessment,
characteristics of effective assessment programs,
proposed procedures for test selection or development,
the limitations of assessment,
state assessment requirements, and
estimated costs for various assessment options.

After an assessment program has been implemented, the board should be kept informed. When assessment results are available, the board should receive a report of school- and district-level performance. This report, delivered prior to public reporting of the scores, should be designed to help members understand the results and prepare them for questions/comments from the public.

Reviewing the assessment system

The assessment system should be reviewed regularly. For example, it should be reviewed as indicated in school improvement plans, whenever students do not meet locally defined expectations or when trend information indicates a decline in student performance. The review should include input (oral or written) from staff members who have used the assessment procedures and results. It should include their judgments of quality, appropriateness, and utility. Also, various audiences can be surveyed about their perceptions of the program's effectiveness. Staff recommendations for changes in the assessment program can be presented to the board along with survey results. If board members have been provided with appropriate information since the beginning of the assessment program, decisions made at this point should be especially sound.

Examine questions such as:

What do assessment results indicate about student performance?
Are assessment procedures appropriate to help examine local curriculum and instruction? Are they appropriate for local students? Are the results useful?
What revisions are needed in local outcomes, standards, expectations, teaching-learning activities, or assessment procedures?
How can the assessment system be improved (for example, by modifying the assessment approaches, administration, processing procedures, or reporting)?

Chapter 2

The Quality of Assessment

This chapter discusses five major factors that are integral to the quality of assessment. The first three factors--validity, reliability, and fairness--have been particularly emphasized in school improvement planning in Illinois. The other two--assessment administration and alignment--are also critical to the success of the school improvement process.

Ensuring the Quality of Assessment

Everyone who participates in the development or implementation of assessment systems is responsible for helping ensure that assessment is of high quality. Quality must be a concern at every stage--when designing assessment systems; selecting or developing assessment procedures; administering the procedures; and scoring, reporting, and using the results. Assessment which is of poor quality is of limited utility. The information it produces does not represent student learning well enough to inform decision makers about the types of changes that are needed to help improve the educational system. The time and other resources devoted to planning and administering the system will have been poorly spent. Three major indicators of quality--validity, reliability, and fairness--are discussed in this section.

Validity is the extent to which an assessment measures what is needed for a particular purpose and to which the results, as they are interpreted and used, meaningfully and thoroughly represent the specified knowledge or skill.

Reliability is the consistency or stability of assessment results. It is often defined as the degree to which an assessment is free from errors of measurement.

Fairness means that assessment procedures do not discriminate against a particular group of students (for example, students from various racial, ethnic, or gender groups, or students with disabilities).

All three indicators represent critical elements of good assessment systems. Assessment procedures or results with purported high validity but low reliability are actually of low validity and cannot be used confidently because differences in scores across students or across time are more likely due to random error than actual differences in achievement. Assessments with high reliability but low validity accurately measure something other than what was intended and the results are meaningless with regard to the intended purpose. Since they only poorly represent the knowledge or skills they supposedly measure, they are not useful. Assessments which claim to be highly valid and reliable but discriminate against certain groups of students should not be used because they are not fair to those students and thus are not valid.

Although the three indicators mentioned above are essential, validity is the most important. However, educators have often spent time and other resources gathering evidence about reliability to the neglect of validity. In recent years, many articles in education publications have proposed abandoning forced-choice assessment in favor of performance-based assessment. A major criticism of forced-choice assessment is that it is strong in reliability but weak in validity. Unfortunately, performance-based assessment may suffer from that same dilemma. In response to criticisms that scoring is inconsistent, major resources are sometimes devoted to documenting the reliability of scoring while neglecting validity--as well as other sources of unreliability.

Validity

As mentioned above, validity is the extent to which an assessment measures what is needed for a particular purpose and to which the results, as they are interpreted and used, meaningfully and thoroughly represent the specified knowledge or skill. This definition highlights the importance of considering purposes and intended uses when developing and selecting assessment procedures. Those procedures must assess the knowledge or skill (or learning goal, outcome, or objective) that they claim to measure. The type of information produced must be useful for the intended purposes.

Traditionally, measurement specialists have discussed several types of validity, including content, construct, concurrent, and predictive validity. Content validity is usually established through expert review of assessment contents. Examining assessments to estimate their content validity is likely to be both feasible and useful at the local level. Examining the other types of validity requires using statistical procedures and may be difficult and of limited usefulness locally. Linn, Baker, and Dunbar (1991) proposed eight criteria for evaluating validity in performance-based assessment. However, the criteria are also appropriate for forced-choice assessment and may help improve it more than traditional statistical validation procedures.

Figure 2: Summary of Criteria for Estimating Validity

Criterion: Focus
Consequences: Effects of assessment
Content coverage: Comprehensiveness
Content quality: Consistency with current content conceptualization
Transfer and generalizability: Representativeness of broader domain
Cognitive complexity: Level of knowledge assessed
Meaningfulness: Relevance to students
Fairness: Freedom from bias against members of a group
Cost and efficiency: Practicality and feasibility of assessment

Validity criteria

The eight criteria should help inform educators about whether they need to revise assessment procedures and the types of revisions they should make. Answers to the questions presented below should also influence how educators interpret and use assessment results.

When using the criteria, educators who plan to use two or more assessment procedures in combination with one another--for example, to estimate whether students meet a learning outcome--may want to ask whether the assessment procedures, considered together, meet the criteria. Alone, neither assessment procedure may, for example, cover content or represent cognitive complexity adequately. Together, they may be quite effective. This means, of course, that neither assessment procedure should be reported or used without the other.

1) Consequences

Focus: The effects of the assessment

Questions

Is the assessment likely to produce results that will be used to improve instructional programs or otherwise improve student learning?
Or is the assessment more likely to narrow curricula or change emphases because teachers view assessed concepts or skills as more important than others and emphasize only what is measured on the assessment?
Are instructional emphases likely to change because teachers will spend more time using practice exercises to prepare students for the assessment?
Might the assessment be used inappropriately to make important decisions regarding individual students (e.g., promotion or graduation decisions)?
What other positive and negative consequences are likely?

2) Content Coverage

Comment: This criterion may be of limited utility for individual assessment procedures. Educators will probably find it more useful to examine all procedures that, together, will assess a particular content unit (e.g., learning outcome). While individual assessment procedures must be appropriate for the content assessed, they may cover only a portion of it.

Focus: Comprehensiveness of assessment content

Questions

Does the assessment comprehensively cover the content and processes assessed?
Is the content covered in sufficient breadth and depth?
Does the assessment represent important (not trivial) components of the content?
Together, will the assessments provide sufficient evidence about the content?

3) Content Quality

Comment: This criterion is especially important in content areas (e.g., science) in which knowledge sometimes grows rapidly. An assessment that represents an outdated conceptualization of the content assessed is not likely to produce useful information and will waste students' and teachers' time.

Focus: Consistency with current content conceptualization

Questions

Is the assessment consistent with the best available conceptualization of the knowledge or skill assessed?
Does the assessment represent current, rather than outdated, perspectives?

4) Transfer and Generalizability

Comment: For different reasons, this criterion is of concern in both forced-choice and performance-based assessment. In the former, the nature of questions may make them poor indicators of student ability to deal with concepts in the domain assessed. In performance-based assessment, the small number of tasks makes it essential that each task or group of tasks is representative of the domain assessed. Also, continued re-use of any type of assessment may compromise generalizability, because teachers and students may focus on specific items or tasks from the test rather than on the larger domain.

Focus: The assessment's representativeness of a larger domain

Questions

Can the assessment results be generalized to the broader domain (knowledge, skill, or learning outcome) they are intended to represent?
Or do the results simply indicate how well students have achieved the specific knowledge or skills directly included in the assessment questions or tasks?

5) Cognitive Complexity

Focus: Whether level of knowledge assessed is appropriate

Questions

Do the assessment tasks or questions represent the cognitive complexity of the knowledge or skill that they are intended to assess? (For example, if an outcome includes higher order or critical thinking skills--such as problem solving or synthesis--does the assessment measure them?)
Does the assessment actually require students to use higher-level knowledge or skills, or can students simply respond from memory without having to think?

6) Meaningfulness

Comment: Assessment procedures must be meaningful to students in order to produce valid, useful information. Assessment that is relevant to students' personal experiences is likely to motivate students to perform as well as possible. However, some assessments cannot be made relevant to problems students encounter in real life, and educators should realize that contrived assessments may poorly represent the knowledge or skills assessed.

Focus: The relevance of the assessment in the minds of students

Questions

Are assessment items or tasks meaningful to students?
Is the assessment relevant to problems students will encounter again in school, work, or daily living?
Does the assessment provide students with worthwhile or meaningful experiences?

7) Fairness

Comment: For some assessment purposes (e.g., to measure the achievement of individual students in a content area), it may be important to consider whether all students have had similar opportunities to acquire the knowledge or skills assessed. For example, are some students at an advantage because ancillary skills (such as prior knowledge or reading ability) that are not relevant to the focus of the assessment enable them to score higher? On the other hand, if the purpose is to find out whether a group of students has achieved something (such as a learning outcome) and some students have not had an opportunity to learn it, the assessment is not unfair or invalid. Rather, instruction may not have been adequate.

Focus: Fairness to members of all groups

Questions

Is the assessment biased against students who are members of various racial, ethnic, and gender groups or students with disabilities? Does it contain stereotypes of any groups?
Do students of similar ability, regardless of group membership, score the same?
(For other questions, see the section on "Fairness".)

8) Cost and Efficiency

Focus: The practicality or feasibility of an assessment

Questions

Is the assessment a reasonable burden on teachers, instructional time, and finances?
Is resulting information worth the required costs in money, time, and effort?

Examining and improving validity

Educators who use assessment should be continually concerned about validity. They should explicitly design assessment systems so that the information collected will serve the purposes for which it was intended. They should select or develop assessment procedures that are likely to produce the information needed. When processing and reporting results, educators should ensure that scores accurately represent results and analyses present the type of information needed. When using results, educators and others should avoid making interpretations or drawing conclusions that are not warranted by the data.

A common, efficient way to examine validity is through review by qualified people (teachers, curriculum coordinators, etc.). A panel consisting of people with recognized expertise in the subject area examines assessment procedures using criteria such as those above: consequences, content coverage, content quality, transfer and generalizability, cognitive complexity, meaningfulness, fairness, and cost and efficiency.

Educators may want to assemble panels of members who are qualified in the subject under consideration to make informed judgments. The panels might meet during the development or selection of assessment procedures and again after results are available. Ideally, they might also assemble each time a new assessment is ready for piloting. Validity review committees which systematically review local assessment procedures can help ensure that the procedures remain valid. (Figure 3 below illustrates a worksheet that validity review panels might want to use as is or modify as needed.)

Figure 3: Worksheet: Assessment Validity Review

Directions: Indicate your agreement or disagreement with the following statements by answering "yes," "no," or "not sure."

Content quality: Does the assessment procedure represent the current, best conceptualization of the content? Is it sufficiently challenging to students, and not trivial or restrictive?
Transfer and generalizability: In your judgement, does the assessment procedure measure student attainment of the curriculum or outcome rather than only the specific knowledge or skills required by the questions or tasks included? Does it sufficiently represent the content of the curriculum or outcome?
Cognitive complexity: Does the assessment procedure require students to use the higher order or critical thinking skills included in the target of the assessment? Or, can students respond to the question or task using only memory?
Meaningfulness: Will the assessment procedure be meaningful to students? Are they likely to see it as relevant?
Fairness: Is the assessment procedure likely to be fair to all students? Will all students have the same opportunity to show what they know and can do? Will all have had similar opportunities to acquire the knowledge and skills assessed?
Cost and efficiency: Do the costs and time requirements of the assessment procedure (for purchase, administration, scoring, etc.) appear reasonable? Or, will it cost too much? Will it interfere with instructional time? Will it require too much time for scoring?
Consequences: Are the procedures likely to help improve student learning, or are they likely to narrow what is taught? Will assessment results be used appropriately?
Content coverage: Will the assessment procedures provide comprehensive information showing whether all important components of the curriculum or outcome have been met? Will they cover the breadth and depth of content enough to provide sufficient evidence?
Cognitive complexity: Do the assessment procedures as a whole require students to use the higher order learning or critical thinking skills included in the curriculum or outcome, even though individual assessment procedures may not?

For any items answered "no" or "not sure", explain below why the assessment procedures did not meet the criteria. For individual procedures, record which items or tasks receive a rating of "no" or "not sure".

Reliability

Reliability refers to the consistency or stability of assessment results. Traditionally, questions about reliability in assessment have asked whether students were likely to respond in the same way to a particular stimulus (such as a test) if it had been presented to them, for example, a week later or in a different but equivalent version. In addition, questions about reliability in performance-based assessment ask whether students' responses would have received the same ratings if different people had scored them--or if the same person had scored them at another time.

Reliability is a necessary condition of validity. An assessment procedure that is not reliable cannot be valid. A test that a student would respond to quite differently one day than the next (for example, because the test was very susceptible to the student's mood, which changed due to an argument at school or home) would not produce trustworthy results. Conversely, an assessment procedure can be reliable but not valid (i.e., not actually measuring what is intended). A multiple-choice spelling test may produce highly reliable scores and be validly interpretable as an indicator of student achievement in recognizing correct spelling. However, if one needs to know whether students can spell correctly when writing rather than simply recognize misspelled words, the reliable scores may not be valid indicators. A performance-based task which requires students to play a musical scale may be very reliable in the sense that individual students perform it in the same way repeatedly and different raters consistently assign the performances the same ratings. However, the task may be of low validity as an indicator of students' ability to play a piece of music.

Examining reliability

The process of examining reliability will be difficult for many schools. It requires analyzing assessment results statistically. The formulas for estimating the reliability of forced-choice assessments are sophisticated and can be applied much more easily using a computer. Although estimating the reliability of scoring in performance-based assessment is simpler, it involves so much information that computers are almost essential. Extensive, systematic procedures must be used if computers are not available. Several computer programs can help estimate reliability. The software program that is available for data management for Illinois school improvement planning provides a general measure of reliability (Kuder-Richardson Formula 21). Staff can also refer to several resources listed at the end of this chapter for further help in estimating reliability.
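For schools without access to such software, the Kuder-Richardson Formula 21 calculation can be sketched in a few lines. The example below is a minimal illustration, not a description of the Illinois software. It assumes each item is scored right (1) or wrong (0), uses the population variance of total scores (some references use the sample variance instead), and works on made-up scores. The function name kr21 is hypothetical.

```python
from statistics import mean, pvariance


def kr21(total_scores, num_items):
    """Estimate reliability with Kuder-Richardson Formula 21.

    total_scores: each student's number-correct score on the test.
    num_items: number of items, each assumed to be scored 1 or 0.
    """
    if num_items < 2:
        raise ValueError("KR-21 needs at least two items")
    m = mean(total_scores)
    # Assumption: population variance of the total scores.
    var = pvariance(total_scores)
    if var == 0:
        raise ValueError("all scores are identical; KR-21 is undefined")
    k = num_items
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))


# Hypothetical number-correct scores for ten students on a 20-item test.
scores = [12, 15, 9, 18, 14, 11, 16, 13, 17, 10]
print(round(kr21(scores, 20), 2))
```

KR-21 needs only the number of items and the mean and variance of total scores, which is why it is practical when item-level data have not been stored. It treats all items as equally difficult, so it should be read as a general estimate.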

Most schools may need several years to accumulate strong evidence of reliability. However, local committees should attend to reliability almost continuously, beginning when they select or develop assessment procedures. This is necessary to help ensure that (a) sufficient data for computing reliability will be available later, (b) the computations will not force a school to make major adjustments in its assessment program, and (c) assessment information used for school improvement planning in the meantime is of high quality. The next sections will describe methods that are commonly used to estimate reliability and offer suggestions to schools regarding their gradual accumulation of reliability evidence.

Forced-choice assessment

Frequently used methods for examining the reliability of forced-choice assessments are: 1) internal consistency reliability, 2) equivalent forms reliability and 3) repeated measures reliability. Each will be described briefly below.

Figure 4: Methods of Estimating Reliability

Kuder-Richardson (Measure of internal consistency) Give test once. Score total test and apply Kuder-Richardson formula.
Split-Half (Measure of internal consistency) Give test once. Score two equivalent halves of test (e.g., odd items and even items); correct reliability coefficient to fit whole test by Spearman-Brown formula.
Equivalent Forms (Measure of equivalence) Give two different forms of the test to the same group in close succession.
Test-retest with same test (Measure of stability) Give the same test twice to the same group with any time interval between tests from several minutes to several years.
Test-retest with equivalent forms (Measure of stability and equivalence) Give two different equivalent forms of the test to the same group with increased time interval between forms.

Internal consistency (or split-half) reliability provides an estimate of reliability without having to administer the test twice. It is calculated by splitting a test into two comparable halves (for example, separating odd- and even-numbered items), scoring the two halves separately, and correlating the results. Because each half is only half as long as the full test, the correlation is then adjusted with the Spearman-Brown formula to estimate the reliability of the whole test. The Kuder-Richardson formulas are another way to estimate internal consistency from a single administration; instead of depending on one particular split, they are computed from the number of items and summary statistics of students' scores. High split-half reliability means that individual students would receive approximately the same score on both halves. That is, the test questions are, for the most part, measuring the same content and contributing effectively to the overall content being measured.
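The split-half procedure can be sketched as follows. This is a minimal illustration under stated assumptions: items are scored 1 or 0, the halves are formed from odd- and even-numbered items, and the half-test correlation is adjusted with the Spearman-Brown formula named in Figure 4. The function names and the item data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """item_scores: one list per student of 0/1 item scores, in test order."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_totals, even_totals)
    # Spearman-Brown correction: estimate reliability of the full-length test.
    return 2 * r_half / (1 + r_half)


# Hypothetical responses of six students to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
print(round(split_half_reliability(responses), 2))
```

A real analysis would use far more students and items than this toy data set; with so few cases the estimate is unstable.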

Equivalent forms reliability is calculated by giving students two forms of a test (one set of questions on test form A and a parallel set on test form B) and correlating the results. With high reliability, students would receive roughly the same score on both forms. That is, form A and form B measure the same thing and result in the same measurement of the student's knowledge/skills.

Repeated measures (or test-retest) reliability is an estimate of an assessment's stability. It is calculated by administering a test to a group of students twice, with a given interval between the two testing times. Then, students' results are correlated. If reliability is high, each student's score would be roughly the same on the second test as on the first test. Given that other conditions were equal (for example, that the test was administered in the same way and that the student did not remember answers from the first administration or had not acquired new knowledge), a score on the test would be stable from one administration to another.
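Both equivalent forms and repeated measures reliability rest on the same calculation: correlating two sets of scores for the same students. A minimal sketch is shown below, assuming the Pearson correlation is the statistic used and that each student's two scores appear in the same position in both lists. The function name and the scores are hypothetical.

```python
from math import sqrt


def score_correlation(first, second):
    """Pearson correlation between two sets of scores for the same students.

    first, second: scores from the first and second administration (or from
    form A and form B), listed in the same student order.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two students")
    n = len(first)
    mean_1, mean_2 = sum(first) / n, sum(second) / n
    sxy = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    sxx = sum((a - mean_1) ** 2 for a in first)
    syy = sum((b - mean_2) ** 2 for b in second)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five students tested twice.
first_testing = [12, 15, 9, 18, 14]
second_testing = [13, 14, 10, 17, 15]
print(round(score_correlation(first_testing, second_testing), 2))
```

A coefficient near 1 indicates that students kept roughly the same relative standing on both occasions. The correlation does not show whether the whole group's scores rose or fell between testings, so means should be compared separately.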

When schools plan to adopt tests that were developed elsewhere, such as by commercial test publishers or other school districts, they should consult technical manuals or contact the organization that developed the tests to request evidence of reliability. If reliability information is not available, staff must decide whether to establish a reliability estimate themselves or abandon the test. (It is critical to note here that estimates calculated by others will probably be for entire tests--or, perhaps, subtests. If a school plans, for example, to use a cluster of items to measure a particular knowledge, skill, or outcome, the developer's reliability estimate will not provide satisfactory evidence.)

Schools must take care in how they report reliability estimates. The Standards for Educational and Psychological Testing note that: "Because there are many ways of estimating reliability, each influenced by different sources of measurement error, it is unacceptable to say simply, 'The reliability of test X is .90' " (p. 21).

In any statement about the results of reliability studies, it is important to present:

the method that was used to estimate reliability,
the student population to whom it was administered, and
the decision rule that was followed to determine if the reliability estimate was sufficient to ensure accuracy.

Performance-based assessment

In performance-based or complex generated response assessments, some of the above approaches might be adapted. However, they would not be sufficient. The approaches assume that scoring is consistent. As mentioned previously, judgment is required to score performance-based assessments. Therefore, evidence of reliability must include evidence that a scoring rubric or scale was used uniformly by all persons who rated the responses, thus ensuring that a student's performance received roughly the same score regardless of which rater scored it.

Interrater agreement is often used as an indicator of the consistency of scoring performance-based assessments. It is calculated as the percentage of agreement between judges.

Interrater agreement is often estimated by "double scoring" a sample of student responses and comparing the ratings assigned by different people. A sample of students' responses (such as essays, science experiments, or dance performances) is selected. Each response is rated by at least two people. The resulting scores are compared. The percentage of agreement is then calculated. The sample should include responses representing the top, middle, and bottom of the scale to help ensure that agreement exists throughout the scale.
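The double-scoring comparison can be sketched as below. The example computes the overall percentage of exact agreement and, to check that agreement holds throughout the scale, the percentage for each score level. Grouping by the first rater's score is an assumption made for illustration, as are the function names and the ratings on a four-point scale.

```python
from collections import defaultdict


def percent_agreement(first, second):
    """Percentage of responses given the same score by both raters."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equal-length, non-empty lists of ratings")
    matches = sum(1 for a, b in zip(first, second) if a == b)
    return 100.0 * matches / len(first)


def agreement_by_score(first, second):
    """Percentage of exact agreement, grouped by the first rater's score."""
    counts = defaultdict(lambda: [0, 0])  # score -> [matches, total]
    for a, b in zip(first, second):
        counts[a][1] += 1
        if a == b:
            counts[a][0] += 1
    return {score: 100.0 * m / t for score, (m, t) in sorted(counts.items())}


# Hypothetical ratings of eight double-scored responses on a 4-point scale.
rater_1 = [4, 3, 2, 1, 3, 4, 2, 1]
rater_2 = [4, 3, 2, 2, 3, 3, 2, 1]
print(percent_agreement(rater_1, rater_2))
print(agreement_by_score(rater_1, rater_2))
```

In this made-up sample the raters agree less often at the top and bottom of the scale than in the middle, which is the kind of pattern a sample drawn from all parts of the scale is meant to reveal.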

The size of the sample used to examine interrater agreement will vary according to the size of the local population and the consequences of the assessment. In some situations, it may be necessary to double score the responses of all students.

Before beginning an analysis of interrater agreement, local educators should decide what is required for agreement. Criteria for acceptable levels of agreement may vary according to tasks, scales, and the purposes of the assessment. On a limited scale, such as one with four points, raters should not be considered in agreement unless they assign exactly the same score to a performance.

On a larger scale, raters might be considered in agreement if they assign adjacent scores. For example, the state writing assessments in Illinois have 6-point rating scales. Interrater agreement is considered satisfactory if readers are in exact agreement on 65% - 70% or more of their ratings and are within one point of agreement on 90% or more of their ratings.

Interrater agreement should be established each time a new scoring scale (or assessment procedure) is used or a new group of teachers plans to use the scale. This will help provide evidence that the rubric or scale has been constructed well enough that different individuals can use it consistently.

All raters should be trained to use scoring rubrics. The training should include carefully reviewing the scales--which should have been developed so they could be understood clearly and used consistently. Ratings that others assigned to previous performances using the same rubric (sometimes referred to as "exemplars") may help new raters understand the scales. During the training, raters should use the scale--for example, to score students' written or videotaped responses to previous similar assessments or exercises. Interrater agreement should be examined to determine whether individual raters are using the scale correctly.

After interrater reliability has been established, other kinds of reliability can be examined. It may be important to learn, for example, whether students respond differently to performance-based assessments across brief time periods or with slight variations in the assessments.

Gradually accumulating evidence of reliability

As mentioned above, school improvement or assessment committees may consider it necessary to work over a period of several years in order to accumulate strong evidence of the reliability of assessment procedures. This section will suggest strategies that committees might use to accumulate such evidence and to help improve reliability in the meantime so that the eventual evidence is less likely to require that they revise assessment procedures substantially.

While designing assessment programs, local committees should develop plans specifying the methods they will use to estimate the reliability of assessment procedures, the information that will be needed to calculate the reliability estimates, and the procedures they will use to ensure that the information will be collected (for example, by requesting that scoring services compute specified types of reliability estimates or by keeping records showing what scores each rater assigned to each student's responses).

During the selection and development process, committees can:

Obtain estimates of the reliability of forced-choice assessment procedures which were developed elsewhere. Publishers' manuals generally contain such estimates. Write to developers for additional information. Even if the estimates are not sufficient because they refer to an entire assessment and local committees plan to use portions of it separately, the estimates will be helpful as indicators until local estimates are available.
Obtain or develop assessment administration procedures which will help make sure that all students take the assessment under uniform conditions.
Develop clear scoring rubrics for performance-based assessments as well as procedures for training people who will score the results. Collect samples of students' responses (for example, written essays or other products, or videotapes of performances) and use them to (a) illustrate responses that represent various points on the scoring scale, and/or (b) help train raters by having several of them score the same set of responses and later comparing the scores.
Compute preliminary estimates of reliability while pilot testing local assessment procedures using one of the statistical methods discussed previously.

During assessment administration, schools can help improve reliability by ensuring that the assessment administration is standardized. They can do this by making sure that everyone who administers the assessment procedure has a copy of administration instructions, understands them, and uses them to administer the assessment uniformly to all students within a designated time frame.

After assessments have been administered, schools can monitor scoring procedures. For example, they can send students' responses to commercial scoring services. They can also double score and analyze a sample of performance-based assessments.

By using the procedures suggested above, schools should help ensure that (a) reliability is high, and (b) appropriate information will be available to calculate reliability estimates. Thus, after several years, they are likely to assemble strong evidence of reliability.

Improving reliability

What is to be done if reliability estimates indicate an unreliable test? In forced-choice assessment, school personnel may want to look at data for individual items. Compare students' performance on an item with their performance on the overall test. Did students with the highest scores tend to respond correctly to the item? Did most students who did poorly on the test as a whole get the item incorrect? This can be examined by charting the performance of the item or by computing a point biserial correlation. A strong relationship, as indicated on the chart or by a high point biserial correlation, indicates that the item was consistent with the test as a whole. Low correlations help identify items which should be revised or deleted to improve reliability.
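The item check described above can be sketched in a few lines of Python. This is an illustration only: the function name `point_biserial` and the small response matrix are hypothetical, and the correlation is computed against the total score with the item itself included, which is one common convention rather than the only one.

```python
from math import sqrt

def point_biserial(item, totals):
    """Correlation between a 0/1 item and students' total test scores."""
    n = len(item)
    mean_i = sum(item) / n
    mean_t = sum(totals) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item, totals))
    var_i = sum((i - mean_i) ** 2 for i in item)
    var_t = sum((t - mean_t) ** 2 for t in totals)
    if var_i == 0 or var_t == 0:
        return float("nan")  # undefined when every student has the same score
    return cov / sqrt(var_i * var_t)

# Hypothetical data: one row per student, one column per item (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
totals = [sum(row) for row in responses]

for k in range(len(responses[0])):
    item = [row[k] for row in responses]
    print(f"item {k + 1}: point biserial = {point_biserial(item, totals):.2f}")
```

In this made-up data set, the last item is answered correctly mostly by the lower-scoring students, so its correlation is negative. That is the kind of item the text suggests revising or deleting.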

When reliability estimates are low in performance-based assessment, examining scoring rubrics as well as raters' perceptions of them may help identify inconsistencies which should be corrected. Ask questions such as:

Is the scoring scale clear with distinct and unambiguous values?
Have raters been adequately trained in its use?
Are conditions for examining the performance/product essentially the same for all raters?
Has the rubric been designed to capture the range of performance for all children?
Was the administration of the assessment always uniform?

Practical Tip: When information from different sources conflicts, it does not necessarily mean that some of the information is wrong. The various sources may (a) focus on different things (ability to write narratively or persuasively), (b) look at them from different perspectives (ability to analyze and answer a multiple-choice question about scientific procedures or to demonstrate their use), or (c) contain different kinds of error (performance ratings may represent the knowledge or skills assessed too narrowly; multiple-choice test scores may be influenced by test wiseness).

In addition, procedures similar to those described above for forced-choice assessment may be useful. Compare students' scores on multiple tasks, tests, and observations. Did students who usually demonstrated the most competence score highest on the performance-based assessment? Did most students who did poorly in the subject as a whole get lower scores on this particular assessment? This task can be facilitated by charting scores from various assessment procedures or by computing correlations between the performance-based assessment scores and one or more of the other measures. Relationships that are very low help identify assessment instruments or procedures which may need to be revised. However, information from various sources will not correspond perfectly. The various sources may focus on different aspects of student behavior or knowledge. They use different kinds of evidence such as teachers' impressions or students' check marks, written essays, or actual performance.

Fairness

School personnel must ensure that assessment procedures are fair to all students, that they do not discriminate against members of any ethnic/racial or gender group or against students with disabilities. Assessment procedures are not fair if they offend members of some groups, if the way they refer to some groups distracts students and lowers their scores, or if other qualities of the procedures reduce the ability of group members to answer questions correctly. The assessment results of students who are from different groups but have similar knowledge and skills should be similar. If students' scores are lowered due to their group membership, assessment procedures are discriminatory.

Educators can sometimes obtain evidence that commercial tests have been reviewed and/or statistically analyzed to eliminate bias. However, they should not simply accept such evidence. They should review the evidence to make sure that the assessments will be fair locally. They should review the methods used to examine fairness, the groups focused upon, the representativeness and credibility of the reviewers, and the issues and sensitivities the reviewers were most attentive to as they critiqued the assessments for potential bias. If the information provided does not ensure local fairness, educators will need to conduct additional analyses. They will also need to examine the fairness of assessment procedures obtained from organizations other than commercial test publishers (such as textbook publishers or other districts) or developed locally.

Bias review procedures may include several different types of questions:

Are any tasks or items likely to be offensive to members of some groups?
Do any items or tasks perpetuate stereotypes about a particular race, gender, or national origin?
Are various groups represented fairly in the assessment?
Have all groups of students had an equal opportunity to acquire the knowledge or skills that are assessed?
Do tasks or items require knowledge or skills that are not directly related to the outcome being measured?
Do answers to questions depend on knowledge that is not taught in school but that some groups are more likely than others to have acquired elsewhere?
Are the language and format of the assessment likely to be reasonably familiar to all groups?

Examining and improving fairness

Most procedures for reviewing fairness can be categorized as either judgmental or statistical. Judgmental review is usually conducted by a committee or panel which reads assessment procedures (including items, prompts, instructions, and scoring scales), asks questions such as those listed above, and identifies items or tasks which appear to be biased. Those items or tasks are then either revised or deleted. The committee (which may include community members as well as educators) should be sensitive to the groups under consideration, and its members should receive training in assessment fairness.

Statistical review involves examining assessment results for various groups of students. A simple but useful type of statistical information is the proportion of students from each group who answered items correctly (commonly known as the item-difficulty level or p-value) or received particular scores on performance-based tasks. Items and tasks which appear to have been more difficult for some groups than others should be reviewed judgmentally to determine whether the differences were caused by bias. Since many group differences may represent actual differences rather than bias in assessment procedures, it may be helpful to examine point biserial correlations for each group on each test item. Those correlations estimate whether students with higher ability are more likely to answer an item correctly than students with lower ability. Educators who examine point biserials in order to identify discriminatory items should be aware of two limitations. First, the point biserial statistic is not appropriate for many types of assessments--particularly performance-based assessments. Second, the statistic is appropriate for evaluating individual items; it cannot identify an entire test that is biased.
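A first pass at the group-by-group p-values described above might look like the following Python sketch. The group labels, the response data, and the 0.20 flagging threshold are all hypothetical. A flagged item is only a candidate for judgmental review, not evidence of bias in itself.

```python
from collections import defaultdict

def p_values_by_group(records):
    """records: (group, responses) pairs, where responses is a list of 0/1.
    Returns {group: [proportion correct on each item]}."""
    correct = {}
    counts = defaultdict(int)
    for group, responses in records:
        if group not in correct:
            correct[group] = [0] * len(responses)
        for k, r in enumerate(responses):
            correct[group][k] += r
        counts[group] += 1
    return {g: [c / counts[g] for c in correct[g]] for g in correct}

def flag_items(p_by_group, threshold=0.20):
    """1-based numbers of items whose p-values differ across groups
    by more than the (locally chosen) threshold."""
    groups = list(p_by_group.values())
    flagged = []
    for k in range(len(groups[0])):
        values = [g[k] for g in groups]
        if max(values) - min(values) > threshold:
            flagged.append(k + 1)
    return flagged

# Hypothetical responses to a three-item test from two groups of students.
records = [
    ("group A", [1, 1, 0]),
    ("group A", [1, 0, 1]),
    ("group A", [1, 1, 1]),
    ("group A", [0, 1, 1]),
    ("group B", [1, 0, 0]),
    ("group B", [1, 0, 1]),
    ("group B", [1, 1, 1]),
    ("group B", [0, 0, 1]),
]

p_by_group = p_values_by_group(records)
for group, p_values in p_by_group.items():
    print(group, [f"{p:.2f}" for p in p_values])
print("items to review judgmentally:", flag_items(p_by_group))
```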

Some bias review experts prefer either judgmental or statistical review procedures. However, a more thorough bias review can be conducted if the two types of procedures are used interactively. Each has different strengths and weaknesses. Neither is sufficient alone. To capitalize on the strengths of each method and counterbalance the weaknesses, both should be used.

Bias reviews should be conducted at several stages of the assessment cycle. Assessment developers, including developers of both forced-choice items and performance-based tasks, should be sensitized to the types and sources of bias. Later, the assessment procedures should be reviewed by others who are representative of various groups and knowledgeable about learning area content or statistical procedures. During the process of selecting specific assessment procedures, all procedures should be examined for bias. After assessment procedures have been administered, scoring should be monitored to ensure that scores were not influenced by bias. Results should be reviewed statistically.

Districts might establish bias review procedures that specify: (a) the types of procedures that will be used, (b) the types of committees or panels that will be involved in the process, and (c) the stages of assessment at which bias review will be conducted. Guidelines for bias review are included in Bias Issues in Test Development (National Evaluation Systems, Inc., 1987), which ISBE distributed to Illinois school districts previously.

Estimating Reliability - Forced-Choice Assessment

The split-half model and the Kuder-Richardson formula for estimating reliability will be described here. Given the demands on time and the need for all assessment to be relevant, school practitioners are unlikely to utilize a test-retest or equivalent forms procedure to establish reliability.

Reliability Estimation Using a Split-half Methodology

The split-half design in effect creates two comparable test administrations. The items in a test are split into two tests that are equivalent in content and difficulty. Often this is done by separating the odd- and even-numbered items. This assumes that the assessment is homogeneous in content. Once the test is split, reliability is estimated as the correlation between the two half-tests, with an adjustment for test length.

Other things being equal, the longer the test, the more reliable it will be when reliability concerns internal consistency. This is because the sample of behavior is larger. In split-half, it is possible to utilize the Spearman-Brown formula to correct a correlation between the two halves--as if the correlation used two tests the length of the full test (before it was split), as shown below.

For demonstration purposes, a small sample set is employed here: a test of 40 items for 10 students. The items are divided into even-numbered (X) and odd-numbered (Y) halves, which are treated as two simultaneous assessments.

Student	Score (40)	X Even (20)	Y Odd (20)	x	y	x²	y²	xy
A	40	20	20	4.8	4.2	23.04	17.64	20.16
B	28	15	13	-0.2	-2.8	0.04	7.84	0.56
C	35	19	16	3.8	0.2	14.44	0.04	0.76
D	38	18	20	2.8	4.2	7.84	17.64	11.76
E	22	10	12	-5.2	-3.8	27.04	14.44	19.76
F	20	12	8	-3.2	-7.8	10.24	60.84	24.96
G	35	16	19	0.8	3.2	0.64	10.24	2.56
H	33	16	17	0.8	1.2	0.64	1.44	0.96
I	31	12	19	-3.2	3.2	10.24	10.24	-10.24
J	28	14	14	-1.2	-1.8	1.44	3.24	2.16
MEAN	31.0	15.2	15.8
SUM						95.60	143.60	73.40
SD		3.26	3.99

From this information it is possible to calculate a correlation using the Pearson Product-Moment Correlation Coefficient, a statistical measure of the degree of relationship between the two halves.

Pearson Product Moment Correlation Coefficient:

where

x is each student's score on the even-numbered items minus the mean score on those items.
y is each student's score on the odd-numbered items minus the mean score on those items.
N is the number of students.
SD is the standard deviation. This is computed by

squaring the deviation (e.g., x²) for each student,
summing the squared deviations (e.g., Σx²),
dividing this total by the number of students minus 1 (N-1), and
taking the square root.
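The formula itself is not reproduced in this text. A standard form that is consistent with the definitions above (with each SD computed using N-1) is sketched below, together with the values from the table. Treat it as a reconstruction rather than the handbook's original notation.

```latex
r_{XY} = \frac{\sum xy}{(N-1)\, SD_X \, SD_Y}
       = \frac{73.40}{9 \times 3.26 \times 3.99}
       \approx 0.63
```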

The Spearman-Brown formula is usually applied in determining reliability using split halves. When applied, it steps the correlation between the two halves up to the full number of items, thus giving a reliability estimate for the number of items in the original test.
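The split-half computation for the ten-student example can be sketched in Python as follows. The function names are illustrative. The correlation is computed from deviation scores as in the table, and the Spearman-Brown step uses the standard two-halves form 2r / (1 + r), which is assumed here because the formula is not printed in this text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sum_x2 = sum((a - mean_x) ** 2 for a in xs)
    sum_y2 = sum((b - mean_y) ** 2 for b in ys)
    return sum_xy / sqrt(sum_x2 * sum_y2)

def spearman_brown(r_half):
    """Step a half-test correlation up to the full test length."""
    return 2 * r_half / (1 + r_half)

# Scores on the even-numbered (X) and odd-numbered (Y) items, students A-J.
even = [20, 15, 19, 18, 10, 12, 16, 16, 12, 14]
odd = [20, 13, 16, 20, 12, 8, 19, 17, 19, 14]

r_half = pearson_r(even, odd)
r_full = spearman_brown(r_half)
print(f"correlation between halves: {r_half:.2f}")   # about 0.63
print(f"split-half reliability:     {r_full:.2f}")   # about 0.77
```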

Estimating Reliability using the Kuder-Richardson Formula 20

Kuder and Richardson devised a procedure for estimating the reliability of a test in 1937. It has become the standard for estimating reliability for single administration of a single form. Kuder-Richardson measures inter-item consistency. It is tantamount to doing a split-half reliability on all combinations of items resulting from different splitting of the test. When schools have the capacity to maintain item level data, the KR20, which is a challenging set of calculations to do by hand, is easily computed by a spreadsheet or basic statistical package.

The rationale for Kuder and Richardson's most commonly used procedure is roughly equivalent to:

1) Securing the mean inter-correlation of the number of items (k) in the test,
2) Considering this to be the reliability coefficient for the typical item in the test,
3) Stepping up this average with the Spearman-Brown formula to estimate the reliability coefficient of an assessment of k items.

Item (k): 1 = correct, 0 = incorrect
Student (N)	1	2	3	4	5	6	7	8	9	10	11	12	X (score)	x = X - mean	x²
A	1	1	1	1	1	1	1	0	1	1	1	1	11	4.5	20.25
B	1	1	1	1	1	1	1	1	0	1	1	0	10	3.5	12.25
C	1	1	1	1	1	1	1	1	1	0	0	0	9	2.5	6.25
D	1	1	1	0	1	1	0	1	1	0	0	0	7	0.5	0.25
E	1	1	1	1	1	0	0	1	1	0	0	0	7	0.5	0.25
F	1	1	1	0	0	1	1	0	0	1	0	0	6	-0.5	0.25
G	1	1	1	1	0	0	1	0	0	0	0	0	5	-1.5	2.25
H	1	1	0	1	0	0	0	1	0	0	0	0	4	-2.5	6.25
I	1	1	1	0	1	0	0	0	0	0	0	0	4	-2.5	6.25
J	0	0	0	1	1	0	0	0	0	0	0	0	2	-4.5	20.25
Σ	9	9	8	7	7	5	5	5	4	3	2	1	65	0	74.50
													mean = 6.5		Σx² = 74.50
p-values	0.9	0.9	0.8	0.7	0.7	0.5	0.5	0.5	0.4	0.3	0.2	0.1
q-values	0.1	0.1	0.2	0.3	0.3	0.5	0.5	0.5	0.6	0.7	0.8	0.9
pq	0.09	0.09	0.16	0.21	0.21	0.25	0.25	0.25	0.24	0.21	0.16	0.09
Σpq	2.21

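Here,

Variance:

\[ s^2 = \frac{\sum x^2}{N-1} \]

Kuder-Richardson Formula 20:

\[ KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s^2}\right) \]

where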

p is the proportion of students passing a given item
q is the proportion of students that did not pass a given item
s² is the variance of the total score on this assessment, computed as follows:
x is the student score minus the mean score;
x is squared and the squares are summed (Σx²);
the summed squares are divided by the number of students minus 1 (N-1).
k is the number of items on the test.

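For the example, s² = 74.50 / 9 = 8.28, so

\[ KR_{20} = \frac{12}{11}\left(1 - \frac{2.21}{8.28}\right) \approx 0.80 \]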
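The KR20 computation for this example can be sketched in Python from the item-level data in the table. The function name `kr20` is illustrative. The variance is divided by N-1 to match the definition given in the text; other sources divide by N, which gives a slightly different value.

```python
# Item responses from the example: one row per student (A-J), 1 = correct.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0],  # B
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # C
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0],  # D
    [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],  # F
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # G
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # H
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # I
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # J
]

def kr20(rows):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores."""
    n = len(rows)       # number of students (N)
    k = len(rows[0])    # number of items (k)
    totals = [sum(r) for r in rows]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / (n - 1)  # s², using N-1
    sum_pq = 0.0
    for j in range(k):
        p = sum(r[j] for r in rows) / n   # proportion passing item j
        sum_pq += p * (1 - p)             # q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / variance)

print(f"KR20 = {kr20(responses):.2f}")  # about 0.80
```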

Estimating Reliability Using the Kuder-Richardson Formula 21

When item level data or technological assistance is not available to assist in the computation of a large number of cases and items, the simpler, and sometimes less precise, reliability estimate known as Kuder-Richardson Formula 21 is an acceptable general measure of internal consistency. The formula requires only the test mean (M), the variance (s²), and the number of items on the test (k). It assumes that all items are of approximately equal difficulty. (N = number of students.)

For this example, the data set used for computation of the KR 20 is repeated.

Student (N = 10)	X (score)	x = X - mean	x²
A	11	4.5	20.25
B	10	3.5	12.25
C	9	2.5	6.25
D	7	0.5	0.25
E	7	0.5	0.25
F	6	-0.5	0.25
G	5	-1.5	2.25
H	4	-2.5	6.25
I	4	-2.5	6.25
J	2	-4.5	20.25
	mean = 6.5		Σx² = 74.50

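Variance:

\[ s^2 = \frac{\sum x^2}{N-1} = \frac{74.50}{9} = 8.28 \]

Kuder-Richardson Formula 21:

\[ KR_{21} = \frac{k}{k-1}\left(1 - \frac{M(k-M)}{k s^2}\right) \]

where

M is the assessment mean (6.5),
k is the number of items in the assessment (12), and
s² is the variance (8.28).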

Therefore, in the example:

\[ KR_{21} = \frac{12}{11}\left(1 - \frac{6.5\,(12 - 6.5)}{12 \times 8.28}\right) \approx 0.70 \]
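Because KR21 needs only summary statistics, the computation can be sketched in Python from the total scores alone. The function name `kr21` is illustrative, and the variance is divided by N-1 as in the table above.

```python
def kr21(mean, variance, k):
    """Kuder-Richardson Formula 21 from the test mean, variance, and item count."""
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))

# Total scores for students A-J on the 12-item example test.
scores = [11, 10, 9, 7, 7, 6, 5, 4, 4, 2]
n = len(scores)
k = 12

mean = sum(scores) / n
variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # s², using N-1

print(f"mean = {mean:.1f}, variance = {variance:.2f}")   # 6.5 and about 8.28
print(f"KR21 = {kr21(mean, variance, k):.2f}")           # about 0.70
```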

The ratio [mean (k - mean)] / ks² in KR21 is a mathematical approximation of the ratio Σpq/s² in KR20. The formula simplifies the computation but will usually yield, as evidenced, a lower estimate of reliability. The differences are not great on a test with all items of about the same difficulty.

In addition to the split-half reliability estimates and the Kuder-Richardson formulas (KR20, KR21) mentioned above, there are many other ways to compute a reliability index. One of the most commonly used reliability coefficients is Cronbach's alpha (α). It is based on the internal consistency of items in the tests. It is flexible and can be used with test formats that have more than one correct answer. The split-half estimates and KR20 are closely related to Cronbach's alpha. When the test is divided into two parts and the scores and variances of the two parts are calculated, the split-half formula is algebraically equivalent to Cronbach's alpha. When the test format has only one correct answer, KR20 is algebraically equivalent to Cronbach's alpha. Therefore, the split-half and KR20 reliability estimates may be considered special cases of Cronbach's alpha.
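As an illustration of alpha with items scored on more than two points, the following Python sketch applies the usual formula, α = [k / (k - 1)] × [1 - (sum of item variances) / (variance of total scores)]. The formula is not printed in this handbook, and the rubric data and function name are hypothetical.

```python
from statistics import variance  # sample variance (divides by N-1)

def cronbach_alpha(rows):
    """Cronbach's alpha for a matrix of item scores (one row per student)."""
    k = len(rows[0])
    item_variances = [variance([r[j] for r in rows]) for j in range(k)]
    total_variance = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical scores for five students on three tasks, each rated 1-4.
rubric_scores = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 1],
    [3, 3, 2],
]

alpha = cronbach_alpha(rubric_scores)
print(f"Cronbach's alpha = {alpha:.2f}")
```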

Given the universe of concerns which daily confront school administrators and classroom teachers, the importance is not in knowing how to derive a reliability estimate, whether using split halves, KR20 or KR21. The importance is in knowing what the information means in evaluating the validity of the assessment. A high reliability coefficient is no guarantee that the assessment is well-suited to the outcome. It does tell you if the items in the assessment are strongly or weakly related with regard to student performance. If all the items are variations of the same skill or knowledge base, the reliability estimate for internal consistency should be high. If multiple outcomes are measured in one assessment, the reliability estimate may be lower. That does not mean the test is suspect. It probably means that the domains of knowledge or skills assessed are somewhat diverse and a student who knows the content of one outcome may not be as proficient relative to another outcome.

Establishing Interrater Agreement for Performance-Based or Product Assessments (Complex Generated Response Assessments)

In performance-based assessment, where scoring requires some judgment, an important type of reliability is agreement among those who evaluate the quality of the product or performance relative to a set of stated criteria. Preconditions of interrater agreement are:

A scoring scale or rubric which is clear and unambiguous in what it demands of the student by way of demonstration.
Evaluators who are fully conversant with the scale and how the scale relates to the student performance, and who are in agreement with other evaluators on the application of the scale to the student demonstration.

The end result is that all evaluators are of a common mind with regard to the student performance, that this common understanding is reflected in the scoring scale or rubric, and that all evaluators give the demonstration the same or nearly the same ratings. The consistency of rating is called interrater reliability. Unless the scale was constructed by those who are employing it and there was extensive discussion during its construction, training is necessary to establish this common perspective.

Training evaluators for consistency should include:

A discussion of the rating scale by all participating evaluators so that a common interpretation of the scale emerges and so diverse interpretations can be resolved or referred to an authority for determination.
The opportunity to review sample demonstrations which have been anchored to a particular score on the scale. These representative works were selected by a committee for their clarity in demonstrating a value on the scale. This will provide operational models for the raters who are being trained.
Opportunities to try out the scale and discuss the ratings. The results can be used to further refine common understanding. Additional rounds of scoring can be used to eliminate any evaluator who cannot enter into agreement relative to the scale.

Gronlund (1985) indicated that "rater error" can be related to:

Personal bias which may occur when a rater is consistently using only part of the scoring scale, either in being overly generous, overly severe or evidencing a tendency to the center of the scale in scoring.
A "halo effect" which may occur when a rater's overall perception of a student positively or negatively colors the rating given to a student.
A logical error which may occur when a rater confuses distinct elements of an analytic scale. This confounds rating on the items.

Proper training to an unambiguous scoring rubric is a necessary condition for establishing reliability for student performance or product. When evaluation of the product or performance begins in earnest, a percentage of student work must be double scored by two different raters to give an indication of agreement among evaluators. The sample of performances or products scored by two independent evaluators must be large enough to establish confidence that scoring is consistent. The smaller the number of cases, the larger the percentage of cases that should be double scored. When the data on the double-scored assessments are available, it is possible to compute a correlation of the raters' scores using the Pearson Product Moment Correlation Coefficient. This correlation indicates the relationship between the two scores given for each student. A correlation of .6 or higher would indicate that the scores given to the students are highly related.

Another method of indicating the relationship between the two scores is through the establishing of a rater agreement percentage--that is, to take the assessments that have been double scored and calculate the number of cases where there has been exact agreement between the two raters. If the scale is analytic and rather extensive, percent of agreement can be determined for the number of cases where the scores are in exact agreement or adjacent to each other (within one point on the scale). Agreement levels should be at 80% or higher to establish a claim for interrater agreement.

Establishing Rater Agreement Percentages

Two important decisions which precede the establishment of a rater agreement percentage are:

How close do scores by raters have to be to count as being in "agreement"? In a limited holistic scale (e.g., 1-4 points), it is most likely that you will require exact agreement among raters. If an analytic scale is employed with 30 to 40 points, it may be determined that exact and adjacent scores will count as being in agreement.
What percentage of agreement will be acceptable to ensure reliability? 80% agreement is promoted as a minimum standard above, but circumstances relative to the use of the scale may warrant exercising a lower level of acceptance. The choice of an acceptable percentage of agreement must be established by the school or district. It is advisable that the decision be consultative.

After agreement and the acceptable percentage of agreement have been established, list the ratings given to each student by each rater for comparison:

Student	Score: Rater 1	Score: Rater 2	Agreement
A	6	6	X
B	5	5	X
C	3	4
D	4	4	X
E	2	3
F	7	7	X
G	6	6	X
H	5	5	X
I	3	4
J	7	7	X

Dividing the number of cases where the two raters' scores are in agreement (7) by the total number of cases (10) gives the rater agreement percentage (70%).
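The tally for the table above can be sketched in Python. The function name is illustrative. The second figure counts scores within one point of each other as agreeing, which applies only if the district has decided that adjacent scores count.

```python
def agreement_percentages(rater1, rater2):
    """Return (exact %, exact-or-adjacent %) for two lists of ratings."""
    pairs = list(zip(rater1, rater2))
    n = len(pairs)
    exact = sum(1 for a, b in pairs if a == b)
    within_one = sum(1 for a, b in pairs if abs(a - b) <= 1)
    return 100 * exact / n, 100 * within_one / n

# Ratings for students A-J from the table.
rater1 = [6, 5, 3, 4, 2, 7, 6, 5, 3, 7]
rater2 = [6, 5, 4, 4, 3, 7, 6, 5, 4, 7]

exact, within_one = agreement_percentages(rater1, rater2)
print(f"exact agreement:             {exact:.0f}%")       # 70%
print(f"exact or adjacent agreement: {within_one:.0f}%")  # 100%
```

Under an exact-agreement rule these raters fall short of the 80% minimum suggested earlier. Under an exact-or-adjacent rule they meet it, which is why the definition of agreement needs to be settled before the percentage is computed.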

When there are more than two teachers, the consistency of ratings for two teachers at a time can be calculated with the same method. For example, if three teachers are employed as raters, rater agreement percentages should be calculated for

Rater 1 and Rater 2
Rater 1 and Rater 3
Rater 2 and Rater 3

Each pairwise percentage should meet or exceed the acceptable level of agreement. If there is occasion to use more than two raters for the same assessment performance or product, an analysis of variance using the scorers as the independent variable can be computed using the sum of squares.
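The pairwise comparison for three raters can be sketched in Python as follows. The ratings are hypothetical, and the 80% cutoff stands in for whatever level the school or district has adopted.

```python
from itertools import combinations

def exact_agreement(a, b):
    """Percentage of cases on which two raters gave the same score."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical ratings of the same ten performances on a 4-point scale.
ratings = {
    "Rater 1": [4, 3, 2, 4, 1, 3, 2, 4, 3, 2],
    "Rater 2": [4, 3, 2, 3, 1, 3, 2, 4, 3, 2],
    "Rater 3": [4, 2, 2, 4, 1, 3, 3, 4, 3, 1],
}
ACCEPTABLE = 80  # percent; an assumed local standard

results = {}
for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
    results[(name_a, name_b)] = exact_agreement(a, b)

for (name_a, name_b), pct in results.items():
    status = "acceptable" if pct >= ACCEPTABLE else "below standard"
    print(f"{name_a} and {name_b}: {pct:.0f}% ({status})")
```

In this made-up case only the first pair reaches the standard, which would suggest that Rater 3 needs further training on the scale before scoring continues.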

In discussion of the various forms of performance assessment, it has been suggested how two raters can examine the same performance to establish a reliability score. Unless at least two independent raters have evaluated the performance or product for a significant sampling of students, it is not possible to obtain evidence that the score obtained is accurate to the stated criteria.

Chapter 3

Selection and Development of Assessment Procedures

After deciding what to assess, local educators should decide what they need to learn from the assessment (such as whether students meet an outcome or what the school's strengths and weaknesses are). The first task will be to decide what types of assessment will most validly provide that information. Then, educators will need to select or develop specific tests and other assessment procedures.

Most local planning groups will probably want to use a combination of assessment procedures that are adopted or adapted from elsewhere as well as those that are developed locally. Because local development is difficult and time-consuming, many schools may want to conserve their resources by selecting as many assessment procedures from elsewhere as possible. However, local educators must carefully examine procedures developed elsewhere and will probably need to conduct special reviews or studies to estimate validity, reliability, and fairness.

Practical Tip: When feasible, use the same assessment procedures for school- and classroom-level assessment. This should improve the efficiency of assessment as well as alignment between instruction and assessment. At the same time, it should increase the time and other resources available for collecting whatever information will be most useful locally. Make certain, though, that the assessment procedures will provide useful information for both levels.

After local planning groups have selected assessment procedures from commercial publishers, other districts, and other sources, they are likely to need additional measures to complete their assessment systems. They will probably need to develop those procedures themselves. Although constructing assessment procedures is difficult and demands rigorous attention to quality, many resources are available to help. The rewards for producing a good assessment system that is particularly valid locally and provides high-quality information for school improvement planning can be worth the effort.

Selection and development will be discussed in this chapter. After the following discussion of selection, subsequent sections will address the development of forced-choice and performance-based assessments, respectively.

Selecting Assessment Procedures

As mentioned previously, assessment procedures are available from a variety of sources. Local educators will probably want to investigate those sources before beginning to develop local assessments. Assessment procedures should be selected carefully. The time and other resources that assessment requires should be used wisely. The information produced should be useful. It may be wise to use a committee process to select assessment procedures. Regardless, all procedures should be considered carefully before they are adopted.

This section discusses criteria which should be considered when assessment procedures are selected from commercial and other sources. Also, at least some of the criteria may be appropriate when considering whether to use local materials which were developed previously for another purpose.

General selection criteria

1. How well does the assessment match the targeted content or educational intent?

The most important criterion to use when deciding whether to adopt any type of assessment procedure is whether the content (knowledge/skills) it assesses matches local content (such as that included in a learning outcome). If the match is poor, the assessment procedure should be eliminated from consideration. It will not provide information which local educators can use validly. If an assessment procedure matches only a proportion of local content, educators will need to use it in combination with one or more other assessment procedures which measure the remainder of the content.

2. Is the type of assessment appropriate?

Another important criterion is whether the assessment requires that students demonstrate their knowledge/skills appropriately. For example, if educators need to learn whether students can communicate effectively, the assessment should require students to speak or write, not answer multiple-choice questions. If the assessment is going to be used to estimate whether students have acquired a wide variety of specified knowledge in a content area, forced-choice tests may cover the content more thoroughly than essays or other performance-based assessments.

3. Will the assessment produce the kind of information needed?

Before adopting a specific assessment tool, educators should make sure that they will be able to obtain the types of information they will need about the results. For example, test publishers may produce information only about results for an entire test or subtest. If educators plan to use other groupings of items, they should find out how and whether they can get results for the specific groupings.

4. What is known about the validity, reliability, and fairness of the assessment?

Information about validity, reliability, and fairness is more likely to be available from commercial test publishers than other sources of assessment procedures. However, it may also be available from those other sources, and seeking information from others may be worth the effort it takes. All information should be reviewed carefully. It is unlikely to show conclusively that an assessment will be valid, reliable, and fair locally. However, it may provide evidence that will be useful in the selection of assessment procedures and will serve as a starting point for local examinations of validity, reliability, and fairness. Furthermore, the existence of such information indicates that the educators who constructed the procedure were concerned about quality.

Educators should be cautious about using evidence of validity that was collected elsewhere for several reasons. It may not address the local situation, or the purposes for which the assessment will be used locally, well enough. It may only consider the content of the assessment, not whether that content is assessed in an appropriate manner or whether the kinds of information produced represent what is needed locally. In addition, the information may not be appropriate because (a) it addresses a test as a whole or predetermined subtests and local educators plan to use groups of items for separate purposes, or (b) local educators plan to modify tasks, thus creating new procedures to which the original information does not apply.

Reliability estimates should also be examined carefully. To determine whether the estimates can be used locally, educators should ask: What method(s) was used to estimate reliability? Is that method appropriate locally? Is the local student population similar to the population included in the reliability studies?

Fairness estimates should be questioned primarily on two factors. First, what methods were used to examine fairness? If only statistical methods were used, local educators may want to convene a panel to conduct judgmental reviews. Second, what groups were included in the fairness estimates? Did they include all racial/ethnic groups that are represented locally? Did they include students with disabilities? It might also be important to ask questions such as: Does it appear that the people who conducted the bias review would be sufficiently representative and credible locally? Did the bias reviewers attend to the questions, issues, and sensitivities that are most important locally?

5. Is the scoring scale appropriate and of high quality?

When selecting performance-based assessments, educators should carefully review scoring scales. Assessment procedures which will not produce useful, reliable information should not be used. The scoring criteria should be appropriate for determining the extent to which students have the knowledge/skills that will be assessed. Definitions of individual points on the scale should be clear and appropriate for measuring the knowledge/skills. Educators should be cautious about adopting procedures with scales or rubrics that appear to be difficult to learn to use or time consuming to apply.

6. What costs (including time) will be incurred while using the assessment?

Educators should consider several kinds of costs. They should consider whether each is a one-time cost or will recur annually.

The costs to consider include:

obtaining sufficient copies of the instrument or procedure (including costs to purchase, print, or copy tests),
scoring and analyzing results (including reporting services from test publishers, time and other costs for scoring performance-based assessments, and time required to assemble information for various reporting needs),
administering the assessment, especially the time required of teachers and students, and
collecting and analyzing additional information about validity, reliability, and fairness.

Developing Forced-Choice Assessment Procedures

Local staff who decide to construct forced-choice assessment procedures can use several different sources of items. They can purchase commercial item banks, network with other districts and schools in the creation of a shared item bank, and/or write their own items.

Writing your own assessment items

School and district personnel who write their own assessment items must devote considerable resources, including training time, to that task. (Regional education service agency staff may be able to assist.) Schools and districts may need to allocate several weeks or more to writing, piloting, and revising items in each content area.

Many guidelines are available to help local staff write assessment items.

Staff should try out the items by administering them to a representative sample of students. While doing this, they should ask several questions about each item:

Does the item appear to assess the intended knowledge/skill?
Do students clearly understand the instructions and the items? (Talk to some students about why they responded as they did.)
In comparison with other information such as teacher judgment, do those students who might be expected to respond correctly (or incorrectly) do so?
Are any distractors (incorrect response options) confusing or too blatantly incorrect?
Do the responses of students from different racial, ethnic and gender groups as well as students with disabilities indicate that the assessment is not biased against them?
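
Some of the tryout questions above can be informed by simple counts. The following is a minimal sketch, not a prescribed procedure: the function name, the sample data, and the choice to treat the top 27 percent of scorers as the "high-scoring" group are all assumptions made for illustration. It reports the proportion of students who answered an item correctly, the share who chose each distractor, and the share of high-scoring students who missed the item.

```python
from collections import Counter


def item_summary(responses, key, total_scores, top_fraction=0.27):
    """Summarize one forced-choice item from tryout data.

    responses    -- option chosen by each student, e.g. ["A", "C", ...]
    key          -- the correct option
    total_scores -- each student's total test score, in the same order
    top_fraction -- share of students treated as high scorers (an assumption)
    """
    n = len(responses)
    counts = Counter(responses)
    proportion_correct = counts[key] / n

    # Share of students choosing each incorrect option (distractor).
    # Options that no student chose do not appear in this dictionary.
    distractor_share = {opt: c / n for opt, c in counts.items() if opt != key}

    # Among the highest-scoring students, what share missed this item?
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    top = ranked[:max(1, round(n * top_fraction))]
    top_missed = sum(responses[i] != key for i in top) / len(top)

    return proportion_correct, distractor_share, top_missed


# Hypothetical tryout data for ten students on one item keyed "B".
responses = ["B", "B", "A", "B", "C", "B", "B", "A", "B", "B"]
scores = [38, 35, 12, 30, 15, 33, 29, 18, 36, 31]
p, distractors, top_missed = item_summary(responses, "B", scores)
```

Counts such as these supplement, but do not replace, talking with students and comparing results with teacher judgment.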

Schools, districts, and regional collaborative groups may want to assemble assessment items in item banks. After items have been tried out with students, staff can review the items and select those which they want to use. Item banks can be computerized or stored on paper. To increase the utility of the items, several types of information might be stored with each one. For example, the items should be coded to indicate the content area assessed (and perhaps more specific information such as the state goal and local learning outcome assessed). The items might also include data from students in the pilot or tryout--for example, the grade level of the students and the proportion of students who answered the item correctly.
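
For staff who choose a computerized item bank, the kinds of information described above might be stored in a record like the one sketched below. This is only an illustration: the field names, the goal and outcome labels, and the sample items are hypothetical, and a real item bank could just as well be kept in a spreadsheet or database.

```python
from dataclasses import dataclass, field


@dataclass
class BankItem:
    item_id: str
    content_area: str        # e.g. "mathematics"
    state_goal: str          # hypothetical label for the state goal assessed
    learning_outcome: str    # hypothetical label for the local outcome assessed
    stem: str
    options: dict            # option label -> option text
    key: str                 # label of the correct option
    # Tryout data: (grade level, proportion of students answering correctly)
    tryouts: list = field(default_factory=list)


def find_items(bank, content_area, grade=None):
    """Return items in a content area, optionally only those tried out at a grade."""
    matches = [item for item in bank if item.content_area == content_area]
    if grade is not None:
        matches = [item for item in matches
                   if any(g == grade for g, _ in item.tryouts)]
    return matches


# Two made-up items for illustration.
bank = [
    BankItem("M-001", "mathematics", "Goal 1", "Outcome 3",
             "What is 3/4 + 1/8?", {"A": "7/8", "B": "4/12", "C": "1/2"},
             "A", [(6, 0.62)]),
    BankItem("S-001", "science", "Goal 2", "Outcome 1",
             "Which of these is a mammal?", {"A": "trout", "B": "bat", "C": "frog"},
             "B", [(4, 0.81)]),
]
grade6_math = find_items(bank, "mathematics", grade=6)
```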

Constructing tests from items

Local staff who write or acquire a large number of test items have taken a major step toward constructing their own local tests. Test construction, however, involves more than simply assembling the items into booklets. To construct tests, staff will need to perform the following tasks:

Identify the test purpose. The intended use of test results will influence at least two types of decisions about test construction. First, if a test will be used to estimate whether students have mastered a particular body of knowledge or skills (such as a criterion-referenced test of learning outcomes), the difficulty of items should be different from items in a test that will be used to compare students (such as a norm-referenced test). A test for comparing students should include items at a wide range of difficulty levels, but a test of student mastery needs to concentrate on items that distinguish between students who have mastered the knowledge/skills and students who have not. Second, the purpose of a test will be critical when estimating validity. As discussed in the previous chapter, any test will be more valid for some purposes or uses than others. A test is likely to be more valid for purposes that are identified in advance and used to direct test construction than for other purposes. To maximize validity for a test's major purposes, it is important to identify those purposes at the beginning of test construction and to construct the test accordingly.

Develop test specifications. This stage, sometimes referred to as developing test "blueprints," involves making decisions about the composition of the test. Test blueprints specify the distribution of test items across one or more factors. For example, a blueprint might be a matrix with content categories across one axis and item difficulty level across the other axis. Each cell of the matrix would specify the number (or percent) of test items which should fall in the cell. Staff might want to distribute items equally across the cells. Or, they may decide that some cells are more important than others and should receive greater emphasis on the test.

Blueprints may or may not be in matrix format. Staff may want to distribute test items across more than the two factors or variables that can be included in a matrix. They may want a test to include items that are at several different cognitive levels. For example, they may decide that some items should assess factual knowledge, but that others should assess whether students can apply that knowledge. Staff may want to develop test specifications for several types of content categories, and at some point may want to become rather specific (for example, specifications for history, literature, or fine arts tests might indicate what proportion of the items should refer to each of several major historical periods or cultures).

Assemble the test. This stage includes three major activities: 1) selecting items, 2) arranging them into test booklets, and 3) developing instructions for standardized administration of the test. When selecting items, staff should refer to the test specifications. However, staff should also review each item carefully. They should consider whether it accurately reflects local learning outcomes/instruction and whether it assesses knowledge/skills they consider important. Local staff should also examine information that is available from previous administrations of the item. What proportion of students selected the correct response? Does the distribution of responses to incorrect alternatives suggest that the item is poor because one or two alternatives were so obviously incorrect that only a few students (e.g., fewer than 10%) chose them? Did a high percentage of students who performed well on the test as a whole select an incorrect response?

Standard administration instructions are important to ensure that the tests are given uniformly to all students. The instructions will indicate directions to be given to students, the resources students may use during the test (e.g., books or calculators), whether guessing is allowed or encouraged, and the amount of time allotted for the test.

Field-test and then revise the test. Before tests are used widely, they should be given to a small, representative sample of students using processes similar to those described previously for trying out items. Each time a test is revised, the new version should be field-tested. Following these processes with items, and then with tests, should limit the amount of time and resources that are lost because a test did not perform as expected.

Practical Tip: As staff develop assessment tools (items, tests, or other procedures), they may want to try them out with small groups of their own students before giving them to a larger sample. This may help them identify problems such as language that is difficult for students to interpret, make modifications, and thus reduce the need for modifications and tryouts after assessing the larger sample of students.

Review the test for validity and fairness. This stage needs to occur both before and after field-testing. Major changes in an assessment procedure will require additional reviews, and perhaps field-testing. The reviews should be conducted by panels which include teachers and others who are knowledgeable about the content area assessed or about bias review procedures. Ideally, these panels should be independent of the test construction committees. People who review tests for validity or nondiscrimination should not have been closely involved in the test's development.
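
The blueprint matrix described under "Develop test specifications" can be checked against a draft test with a simple tally. The sketch below is illustrative only: the content categories, difficulty labels, and target counts are invented, and the function name is hypothetical. It reports each cell in which the draft test has more or fewer items than the blueprint calls for.

```python
from collections import Counter

# Target number of items for each (content category, difficulty) cell.
# These categories and counts are made up for illustration.
blueprint = {
    ("number sense", "easy"): 2,
    ("number sense", "hard"): 1,
    ("geometry", "easy"): 1,
    ("geometry", "hard"): 2,
}


def blueprint_gaps(blueprint, selected_items):
    """Return cells where a draft test departs from the blueprint.

    selected_items -- one (content category, difficulty) pair per item
    The result maps each mismatched cell to (actual count - target count).
    """
    actual = Counter(selected_items)
    cells = set(blueprint) | set(actual)
    return {cell: actual[cell] - blueprint.get(cell, 0)
            for cell in cells
            if actual[cell] != blueprint.get(cell, 0)}


draft = [
    ("number sense", "easy"), ("number sense", "easy"), ("number sense", "hard"),
    ("geometry", "easy"), ("geometry", "easy"), ("geometry", "hard"),
]
gaps = blueprint_gaps(blueprint, draft)
# Here the draft has one easy geometry item too many and one hard item too few.
```

A blueprint with more than two factors (for example, adding cognitive level) could use longer tuples as the cell labels.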

Developing Performance-Based Assessment Procedures

The process of developing performance-based assessment procedures is somewhat like that of developing forced-choice assessments. Assessment purposes and specifications should be considered carefully. The assessments should be drafted, tried out with students, and--if necessary--revised and tried out with students again. Validity, reliability, and fairness should be reviewed at several stages.

In addition, scales or rubrics for scoring performance-based assessments must be available--whether they are constructed locally or are adapted or adopted from other sources. People who score forced-choice items generally use a scoring key and, thus, have common understandings of which responses are correct. However, each person rating a performance-based assessment must make decisions about the specific scores to assign. To help ensure that raters use the same criteria and have similar understandings of what kinds of performances to assign each value on a scoring scale, schools and districts must design explicit scoring scales and train raters to use them.

Students' scores must not vary according to who rates their performance. Scores that do vary from rater to rater are not reliable and should not be used.
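
One simple way to see whether scores depend on the rater is to have two raters score the same set of papers independently and compare their ratings. The sketch below is one possible check under stated assumptions, not a required method: the function name and the sample ratings are hypothetical. It computes the share of papers on which the raters agree exactly and the share on which they are within one scale point of each other.

```python
def rater_agreement(ratings_a, ratings_b):
    """Return (exact, adjacent) agreement rates for two raters.

    exact    -- share of papers given identical scores
    adjacent -- share of papers scored within one point (includes exact matches)
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty lists of ratings")
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical ratings of eight papers on a six-point scale.
rater_1 = [4, 3, 5, 2, 6, 4, 3, 5]
rater_2 = [4, 3, 4, 2, 6, 2, 3, 5]
exact, adjacent = rater_agreement(rater_1, rater_2)
```

What level of agreement is acceptable is a local decision. Low agreement suggests that the scale definitions or the rater training need more work.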

One of the first tasks in the development of performance-based assessments will be to decide what type of performance to use to assess a particular outcome or knowledge/skill. For example, should students be required to make an oral presentation, show a mathematical proof, or demonstrate how to use scientific equipment? What should be rated--performances, products, or processes used to make the products?

Developing Tasks

Once a school or district has decided what type of performance-based assessment to use, they should develop specific tasks for eliciting the desired student behavior. Tasks should be clear and appropriate. They should be directly related to the knowledge/skill they are being used to assess. They should include the instructions which will be given to students to help ensure that the assessment is administered uniformly. Tasks should meet the criteria discussed in Chapter 2: consequences, content coverage, content quality, transfer and generalizability, cognitive complexity, meaningfulness, fairness, and cost and efficiency.

Developers should also ask questions such as:

What are students asked to do?
To whom will the students be responding?
What response options are allowed?
What materials/resources are required, and how can they be obtained?
What limits are to be placed on time or design?
What should the scoring criteria be, and how will they be applied?

The information which is given to students is critical. Students should know what knowledge and skills the task is assessing, and the criteria that will be used to rate their performances. Unless students clearly understand what is expected of them, their responses to the task may not be relevant enough to permit the scoring scale to be applied appropriately and rigorously. Assessing students on criteria they are not aware of is unfair.

Developing Scoring Scales

As indicated above, the scoring scale or rubric is a major tool for helping ensure that performance assessments are uniform. Constructing a scoring scale requires three major activities. First, developers must decide what type of scale to use. Second, they should identify the criteria that will be used to judge the quality of the performance. Third, they should identify the specific values, or points, which will be used with each criterion and define each one clearly.

Types of Scoring Scales

There are two basic types of scoring scales: analytic and holistic. Analytic scales assign separate ratings to separate criteria. Holistic scales combine the criteria into one score.

These two basic types may be combined, as on the scoring scale used for the writing assessment in the Illinois Goal Assessment Program (IGAP). This scale uses both analytic criteria (conventions, focus, organization and support) and an overall score (integration). The integration score is informed by the analytic criteria. It is double weighted to emphasize its value. The scale is shown in Figure 5.

Figure 5

	SUMMARY OF KEY FEATURES FROM THE ILLINOIS WRITING ASSESSMENT
	Absent	Developing		Developed	Fully-Developed
FEATURES	1	2	3	4	5	6
FOCUS Degree to which idea/theme or point of view is clear and maintained.	Absent; unclear; insufficient writing to ascertain maintenance	Attempted; subject unclear or confusing; main point is unclear or shifts; resembles brainstorming; insufficient writing to sustain issue	Subject clear/ position is not; "underpromise, overdeliver"; "overpromise, underdeliver"; infer; two or more positions without unifying statement; abrupt ending	Bare bones; position clear; main point(s) clear and maintained; prompt dependent; launch into support w/o preview	Position announced; points generally previewed; has a closing	All main points are specified and maintained; effective closing; narrative event clear, importance/ significance stated or inferred
SUPPORT Degree to which main point/ elements are elaborated and/or explained by specific evidence and detailed reasons.	No support; insufficient writing	Support attempted; ambiguous/ confusing; unrelated list; insufficient writing	Some points elaborated; most general/some questionable; may be a list of related specifics; sufficiency?	Some second-order elaboration; some are general; sufficiency ok - not much depth	Most points elaborated by second-order or more	All major points elaborated with specific second-order support; balanced/ evenness
ORGANIZATION Degree to which logical flow of ideas and text plan are clear and connected.	No plan; insufficient writing to ascertain maintenance	Attempted; plan can be inferred; no evidence of paragraphing; confusion prevails; insufficient writing	Plan noticeable; inappropriate paragraphing; major digressions; sufficiency?	Plan is evident; minor digressions; some cohesion and coherence from relating to topic	Plan is clear; most points logically connected; coherence and cohesion demonstrated; most points appropriately paragraphed	All points logically connected and signaled with transitions and/or other cohesive devices; all appropriately paragraphed; no digressions
CONVENTIONS Use of conventions of standard English.*	Many errors, cannot read, problems with sentence construction; insufficient writing to ascertain maintenance	Many major errors; confusion; insufficient writing	Some major errors, many minor; sentence construction below mastery	Minimally developed; few major errors, some minor, but meaning unimpaired; mastery of sentence construction	A few minor errors, but no more than one major error	No major errors, few or no minor errors
INTEGRATION Evaluation of the paper based on a global judgement of how effectively the paper as a whole uses basic features to address the assignment.	Barely deals with topic; does not present most or all features; insufficient writing	Attempts to address assignment; some confusion or disjointedness; insufficient writing	Partially developed; some or one feature not developed, but all present; reader inference required	Only the essentials present; paper is simple, informative, and clear	Developed paper; each feature evident, but not all equally developed	Fully developed paper; all features evident and equally well developed
* Usage, sentence construction, spelling, punctuation/capitalization, paragraph format.
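
To show what double weighting means arithmetically, the sketch below combines one paper's feature ratings into a single total. It assumes that the total is a simple sum of the four analytic ratings plus twice the integration rating, with every feature rated from 1 to 6 as in Figure 5. That formula, the function name, and the sample ratings are assumptions for illustration. The actual IGAP scoring rules may differ.

```python
ANALYTIC_FEATURES = ("focus", "support", "organization", "conventions")


def weighted_total(ratings, integration_weight=2):
    """Combine feature ratings into one total, counting integration twice.

    ratings -- dict with a 1-6 rating for each analytic feature and "integration"
    The simple-sum formula is an assumption, not the documented IGAP rule.
    """
    for name in ANALYTIC_FEATURES + ("integration",):
        if not 1 <= ratings[name] <= 6:
            raise ValueError(f"{name} rating must be between 1 and 6")
    analytic = sum(ratings[name] for name in ANALYTIC_FEATURES)
    return analytic + integration_weight * ratings["integration"]


# Hypothetical ratings for one paper.
paper = {"focus": 4, "support": 3, "organization": 4,
         "conventions": 5, "integration": 4}
total = weighted_total(paper)  # 4 + 3 + 4 + 5 = 16, plus 2 x 4 = 8
```

Under these assumptions, totals can range from 6 (all ratings of 1) to 36 (all ratings of 6).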

The type of scale selected should be governed by the type of task to be assessed. If a task involves an integrated activity and performance on one criterion will influence all other criteria, judgment should be integrated and a holistic scale should be used. If examining specific criteria is important, an analytic scale may be most useful. A rubric that may be used either analytically or holistically is shown in Figure 6. It was developed for scoring open-ended items in mathematics.

Figure 6

MATHEMATICS SCORING RUBRIC: A GUIDE TO SCORING OPEN-ENDED ITEMS
SCORE LEVEL	MATHEMATICAL KNOWLEDGE	STRATEGIC KNOWLEDGE	COMMUNICATION
	Knowledge of mathematical principles and concepts which result in a correct solution to a problem.	Identification of important elements of the problem and the use of models, diagrams and symbols to systematically represent and integrate concepts.	Written explanation and rationale for the solution process.
4	shows complete understanding of the problem's mathematical concepts & principles; uses appropriate mathematical terminology & notation (e.g. labels answer as appropriate)¹; executes algorithms completely and correctly	identifies all the important elements of the problem and shows complete understanding of the relationship between elements; reflects an appropriate and systematic strategy for solving the problem; gives clear evidence of a complete and systematic solution process	gives a complete written explanation of the solution process employed; explanation addresses what was done, and why it was done; if a diagram is appropriate, there is a complete explanation of all the elements in the diagram
3	shows nearly complete understanding of the problem's mathematical concepts and principles; uses nearly correct mathematical terminology and notations; executes algorithms completely; computations are generally correct but may contain minor errors	identifies most of the important elements of the problem and shows general understanding of the relationships among them; reflects an appropriate strategy for solving the problem; solution process is nearly complete	gives a nearly complete written explanation of the solution process employed; may contain some minor gaps; may include a diagram with most of the elements explained
2	shows some understanding of the problem's mathematical concepts and principles; may contain major computational errors	identifies some important elements of the problem but shows only limited understanding of the relationships between them; appears to reflect an appropriate strategy but application of strategy is unclear; gives some evidence of a solution process	gives some explanation of the solution process employed, but communication is vague or difficult to interpret; may include a diagram with some of the elements explained
1	shows limited to no understanding of the problem's mathematical concepts and principles; may misuse or fail to use mathematic terms; may contain major computational errors	fails to identify important elements or places too much emphasis on unimportant elements; may reflect an inappropriate strategy for solving the problem; gives minimal evidence of a solution process; process may be difficult to identify; may attempt to use irrelevant outside information	provides minimal explanation of solution process; may fail to explain or may omit significant parts of the problem; explanation does not match presented solution process; may include minimal discussion of elements in diagram; explanation of significant elements is unclear
0	no answer attempted	no apparent strategy	no written explanation of the solution process is provided.
¹ "As appropriate" or "if appropriate" relates to whether or not the specific element is called for in the stem of the item.
Adapted from Lame (1993).

Rating scales or checklists might be used. Rating scales are, essentially, continua which place student performance at a particular level. Many have four or six points (because many people prefer even-numbered scales to avoid the tendency to select the middle point on odd-numbered scales). Individual points on scales may be numerical or qualitative (with verbal descriptions rather than numbers). Checklists can be used to indicate whether a criterion is present or absent. Again, the type of scale which is most appropriate depends on what is being assessed and how the results will be used.

Scales can be developmental. Generally, this means that one scale will be used across grades. Student growth will be shown by improved performance as students advance through the grades.

Selecting Criteria

The criteria which are used to rate student performance are absolutely critical. They are the dimensions or variables (also known by terms such as traits or components) upon which student performance is judged. The type of information that is available from performance-based assessment is dependent on the criteria used. The criteria tell raters what they should look at when judging the quality of a performance or product. Criteria must clearly represent the knowledge/skills for which the assessment is being used. Criteria should also represent what knowledgeable educators agree need to be assessed in order to judge the quality of the performance/product. Therefore, assessment criteria should be established and reviewed thoughtfully.

Practical Tip: As staff develop assessment systems, they should document their work as much as possible--so that the work doesn't have to start over when different teachers or administrators take responsibility for the system due to turnover in committee membership or school staff. The documentation should include descriptions of decisions made about the system (such as when assessment will occur and what types of assessment procedures will be used); examinations of validity, reliability, and fairness; and selection and development of assessment procedures.

Select criteria which are:

Easily understood
Relevant to the learning outcome
Compatible with other criteria used in the rubric
Precise
Representative of the vocabulary of the discipline
Observable, requiring minimal interpretation
Unique, not overlapping with another criterion or trait

A final consideration is the number of criteria or traits to include in the scale. The number should be reasonable, to avoid overwhelming raters and to limit the amount of time required for scoring. When too many criteria are used, scoring can become so burdensome that the gain in specificity is lost.

Identifying Scale Values

One effective approach to establishing different values on a scale is to use actual examples of the performance or product to make direct comparisons of different levels of student performance. These "anchors" can then be used to create mental images which may be as meaningful to raters as written descriptions. However, written descriptions--which may be based on summaries of the anchor performances/products--are necessary to establish definitions of the different value levels.

Another approach is to assign descriptors to value levels. These descriptors can be numerical (1, 2, 3, 4) or verbal (unacceptable, minimally acceptable, clearly acceptable, excellent).

Suggestions for scale developers

Investigate how the assessed discipline defines quality performance.
Gather sample rubrics for assessing the discipline as models to adapt.
Gather samples of students' and experts' work that demonstrate the range of performance or products.
Discuss with others the characteristics of these models that distinguish levels of quality.
Write descriptions of the various levels of quality.
Gather another sample of student work which reflects these descriptions.
Try out the criteria to see if they help in making accurate judgments about students.
Revise the criteria.
Continue trying the rubric until it captures the "quality" of the work.

Chapter 4

Interpretation, Use, and Reporting of Assessment Results

Three important functions of assessment are deciding what assessment results mean, what their implications are, and what changes or other decisions should be made.

The purpose of most assessment is to collect information about learning in order to improve decisions about how to help schools improve and students learn even more. Such information helps identify:

the extent to which a school is meeting its intents (as expressed in outcomes or objectives),
strengths and weaknesses in individual content areas,
the learning status and needs of students, and
which portions of a school's curriculum are most in need of assistance.

Because those decisions are critical, educators should make them thoughtfully and carefully. Assessment results should be interpreted and used cautiously. Factors which should be considered during interpretation and use will be discussed here.

Theoretically, interpreting and using assessment results are separate actions. Interpretation occurs when someone attaches meaning to the results. Use occurs when someone makes decisions or takes other actions on the basis of the results. However, because interpretation and use are sometimes difficult to distinguish and are closely related to one another, they will be discussed together.

Validity

Current scholars (including Messick, 1989; Shepard, 1993; and Linn, 1993) say that validity is a quality of assessment interpretation and use rather than of assessment procedures themselves. Validity is critical to anyone who attempts to attach meaning to assessment results or make decisions based on them. All assessment results are more valid for some interpretations and uses than for others.

The intended purposes and uses of assessment results are considered several times during the planning phase--for example, when an assessment system is designed, when assessment procedures are selected and developed, and when validity is examined. Those purposes and uses help determine the types of interpretation and use that are appropriate. New interpretations or uses, whether they are voluntary or in response to others such as legislators or superiors, require renewed planning. Otherwise, assessment may be misused.

After assessment results are available, people often want to interpret or use them for multiple purposes which go beyond those considered in the planning phase. For example, they may want to use an assessment that was designed to learn whether schools are meeting learning outcomes to make decisions about whether individual students should be promoted to the next grade. They may want to use the results of a large-scale (e.g., state-level) assessment to determine whether schools have met local outcomes. They may want to interpret the results of a test intended to help make college admissions decisions about individual students as indicators of the quality of instruction in individual schools. Interpretations and uses which exceed an assessment's limitations are inappropriate.

Generalization

People who use assessment results often draw conclusions about content domains that are much broader than those actually included in the assessment. For example, they may make statements about student achievement in world history or environmental sciences based on responses to just a few forced-choice test questions. Or, they may make general statements about student writing skills based on responses to a single prompt that required only expository writing, or about artistic abilities based on a crayon drawing of a house. These examples clearly represent unwarranted generalizations. Other inappropriate generalizations may be much more subtle. Educators should be careful when interpreting assessment results to avoid implying that they represent broader content or skill areas than the assessment actually measured.

Multiple Sources of Information

Many sources of error influence assessment results: a student misreads a question or went to bed late the previous night and has trouble staying awake; a teacher forgets to review sample problems with students or allows students to work beyond the specified time; a testing coordinator fails to open boxes in advance and discovers too late that there are not enough assessments for all students. The sources of error are likely to vary from one testing situation to another. If errors resulting from different sources of information--such as different kinds of assessment procedures--are random, they are likely to balance one another. For this reason, assessment interpretation and use will probably be improved if they are based on multiple sources of information.
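
The idea that random errors tend to balance one another can be illustrated with a small calculation. The sketch below rests on a standard statistical simplification that this handbook does not itself state: each source of information carries an independent random error with the same typical size (standard deviation). Under that assumption, the typical error in the average of several sources shrinks with the square root of the number of sources. The function name and the error size of 6 points are hypothetical.

```python
import math


def error_spread_of_average(error_sd, n_sources):
    """Typical size of the error in an average of n_sources results.

    Assumes the errors are independent and share the same standard deviation.
    """
    if n_sources < 1:
        raise ValueError("need at least one source of information")
    return error_sd / math.sqrt(n_sources)


# If a single assessment has a typical random error of 6 points:
spreads = {n: error_spread_of_average(6.0, n) for n in (1, 2, 4, 9)}
# One source leaves a spread of 6.0 points; four sources, 3.0; nine sources, 2.0.
```

Errors that affect every source in the same direction, such as a biased task, do not average out in this way.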

Multiple sources of information also help compensate when a single assessment does not cover the assessed content comprehensively. This may occur because, as discussed in the previous section about generalization, the assessment is brief and covers only a small portion of the content. Or, perhaps the content includes both knowledge and skills, and different types of assessment procedures (such as a paper-and-pencil test and a performance-based examination) should be used with different portions of it. Again, multiple sources of information should enable educators to make more valid interpretations and uses.

To illustrate, one could draw firmer conclusions about students' knowledge and skills related to the American Civil War from the results of several assessments that each focused on a different aspect of the war. The assessments might be of several types, such as multiple-choice or matching tests and essays about the causes of the war, social conditions during it, or its impact on the country. Together, the assessments would cover the war comprehensively. Multiple performance-based assessments might also be used--for example, requiring students to build models of battlefields, write and perform a simulated Lincoln cabinet meeting, and develop a script of Lee's conversation with Grant at Appomattox.

Results for Individual Students

Many assessments are explicitly intended to produce information about the achievement of individual students. Others, however, are for collecting evidence about the effectiveness of schools--often for accountability purposes. It may or may not be appropriate to use such information to learn about the status of individual students. Schools should use individual students' results from school-level assessments only after determining that such uses are appropriate.

Reporting Results

Reports of assessment results can be critical documents. They are the only source of information that some audiences receive about actual student learning. Therefore, educators should thoughtfully design and prepare reports about assessment results. They should consider the factors discussed above. Otherwise, they might mislead audience members and cause them to misinterpret and/or misuse the results. In addition, educators need to consider questions such as:

What audiences will each report address?
What do audience members want to know, and what else do we want them to learn?
In what form should the results be reported?

Audiences

As schools and districts develop plans for reporting assessment results, they will base many of their decisions on their perceptions of what various audiences want and need. Educators should consider several different types of local audiences, including:

Staff and administrators (school and district)
Parents and students
School board members
Teachers' associations
Community members
Media

Many educators will probably conclude that they can serve multiple audiences most effectively by issuing several reports, each tailored to one or more specific audiences.

Information

When deciding what types of information to include in various reports, educators should consider not only what various audiences want, but what additional information will give them a more meaningful understanding of what students are learning and of a school's effectiveness. In addition to data showing the proportion of students meeting local outcomes or objectives, the following types of information might be included:

demographic data that help describe the school population and identify special problems or needs (e.g., racial/ethnic distribution, mobility rate, proportion of students from low-income families),
other indicators of school effectiveness (e.g., attendance, dropout, and graduation rates; academic awards; constituent satisfaction),
analyses showing the proportion of students in various groups (e.g., racial/ethnic minority groups, Title I, students with disabilities) who meet objectives/outcomes,
descriptions of programs (regular or special programs),
comparisons with standards and expectations or with results from previous years,
other assessment results (e.g., from state assessment or commercial norm-referenced tests),
local interpretations regarding why some students are not meeting outcomes/objectives, and
descriptions of efforts to improve student learning (future, current, or previous).
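The group analyses mentioned in the list above amount to a simple tally. The following sketch shows one way to compute the proportion of students in each group who meet an outcome. The function name, the record layout, and the sample data are hypothetical and are invented only for illustration.

```python
from collections import defaultdict

def proportion_meeting_by_group(records):
    """records: iterable of (group_label, met_outcome) pairs, where
    met_outcome is True or False. Returns {group_label: proportion}."""
    met = defaultdict(int)
    total = defaultdict(int)
    for group, met_outcome in records:
        total[group] += 1
        if met_outcome:
            met[group] += 1
    return {group: met[group] / total[group] for group in total}

# Hypothetical data, invented for illustration only.
sample = [
    ("Title I", True), ("Title I", False),
    ("Title I", True), ("Title I", False),
    ("Not Title I", True), ("Not Title I", True),
    ("Not Title I", True), ("Not Title I", False),
]
results = proportion_meeting_by_group(sample)
for group, proportion in results.items():
    print(f"{group}: {proportion:.0%} met the outcome")
```

In this invented sample, two of four Title I students and three of four other students met the outcome. A report would present such proportions alongside the overall result.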

Educators should carefully prepare reports that will communicate effectively with each intended audience. Reports that contain too much information, especially extraneous information, discourage potential readers, and many audience members will ignore them. Indexing reports so that readers can quickly find information can help considerably.

Format

The format of assessment reports will influence readers' motivation to read the reports as well as the perceptions that readers acquire from the reports. Questions that educators might want to ask as they design reports include:

What level of detail should be reported? Should each learning outcome always be reported separately? If not, how should outcomes be clustered? Should results for each grade always be reported separately? Remember to avoid overwhelming readers. Also, be careful to avoid reporting results that can be linked to individual teachers. Even though assessment data often represent learning that students accumulated over several years, many readers are likely to forget that and interpret results as indicators of the effectiveness of individual teachers.
What types of charts, graphs, and tables should be used? These visual aids can often present information more efficiently and meaningfully than text. However, they must be used carefully. Charts and graphs are sometimes misleading, such as when the scale used is much smaller than the range of scores on a test and differences among groups or from one testing period to the next are exaggerated. Tables should be used sparingly. They can be used to present so much information so efficiently that they quickly confuse readers.
What should be the ratio of charts, graphs, and tables to text? Should information presented in charts, graphs, and tables always be repeated in the text? Some people prefer to examine charts, graphs, or tables in order to obtain quantitative information. They may become bored when information is repeated in the text. Others need the contextual information that is presented when numbers are embedded in text. Regardless, most reports should probably strike a balance between text and charts, graphs, or tables.
What should be the approximate length of the report? Educators should avoid allowing reports to become so long that few people will read them. To facilitate reading, educators should organize and index reports so that readers can quickly locate information they need. Devices such as executive summaries with page or chapter number references and detailed tables of contents can be very effective.

The answers to many of these questions will vary by audience. For example, some audiences may prefer brief, easy-to-read summaries while others want considerable detail. As indicated above, it may be necessary to prepare different reports for different audiences.
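The caution above about misleading graph scales can be made concrete with a small calculation. The sketch below compares how much taller one bar appears than another when the vertical axis starts at zero and when it starts just below the scores. The function name and the scores are hypothetical and are chosen only for illustration.

```python
def apparent_ratio(score_a: float, score_b: float, axis_start: float) -> float:
    """Ratio of the bar height for score_b to the bar height for score_a
    when the graph's vertical axis begins at axis_start instead of zero."""
    if axis_start >= min(score_a, score_b):
        raise ValueError("axis_start must be below both scores")
    return (score_b - axis_start) / (score_a - axis_start)

# Hypothetical average scores for two years on a 0-100 test.
last_year, this_year = 72.0, 75.0
full_scale = apparent_ratio(last_year, this_year, axis_start=0.0)
truncated = apparent_ratio(last_year, this_year, axis_start=70.0)
print(round(full_scale, 2), round(truncated, 2))
```

With the axis starting at zero, this year's bar is about 4 percent taller than last year's. With the axis starting at 70, the same three-point difference makes this year's bar two and one-half times as tall, which could easily lead readers to overestimate the change.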

References

Alaska Department of Education. (1986). Assessment Handbook: A Practical Guide for Assessing Alaska's Students. Juneau: State of Alaska, Department of Education.

A resource that was used extensively when both versions of the Illinois Assessment Handbook were developed.

American Educational Research Association, American Psychological Association and National Council on Measurement in Education. (1985). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.

The fourth version of a publication developed by a joint committee representing three major professional organizations. The document's purpose is to guide the development and use of tests.

Gronlund, N.E. (1985). Measurement and Evaluation in Teaching. New York: Macmillan Publishing Company, Inc.

The fifth edition of a classic textbook on assessing students. The book is intended for elementary and secondary teachers. It is practical and contains clear descriptions of major concepts in assessment.

Gronlund, N.E. (1988). How to Make Achievement Tests and Assessments. Boston: Allyn and Bacon.

The fifth edition of a practical guide for constructing forced-choice and--to a more limited extent--performance-based assessment procedures.

Herman, J.L., P.R. Aschbacher, and L. Winters. (1992). A Practical Guide to Alternative Assessment. Alexandria, VA: Association for Supervision and Curriculum Development.

A practical guide for developing and using performance-based assessments.

Illinois State Board of Education. (1993). An Overview of IGAP Performance Standards for Reading, Mathematics, Writing, Science, and Social Sciences.

This document describes the development of the standards, tells how they will be used in the evaluation of schools, and presents the standards.

Illinois State Board of Education. (1994a). The Illinois Public School Accreditation Process: Resource Document.

An overview of the Illinois Public School Accreditation Process that explains its three parts and explains how schools can use it.

Illinois State Board of Education. (1994b). The Illinois Public School Accreditation Process: School Improvement Plan Workbook.

A training manual on the Illinois School Improvement Plan.

Illinois State Board of Education. (1994c). Illinois School Improvement Plan: Assessment Systems. Brochure.

An information brochure on the assessment component of the Illinois School Improvement Plan.

Illinois State Board of Education. (1994d). Illinois School Improvement Plan: Introduction. Brochure.

An information brochure for teachers and the general public on the School Improvement Plan and the Illinois State Goals for Learning.

Illinois State Board of Education. (1994e). Illinois School Improvement Plan: Learning Outcomes, Standards, and Expectations. Brochure.

An information brochure on the learning outcomes, standards, and expectations component of the Illinois School Improvement System.

Illinois State Board of Education. (1994f). Learning Outcomes, Standards and Expectations: Linking Educational Goals, Curriculum and Assessment.

A theoretical paper on options for addressing learning outcomes and standards from various curriculum orientations.

Illinois State Board of Education. (1994g). Write on Illinois, III!

The third edition of a guide for understanding performance assessment in writing. It can help schools develop performance-based assessment in writing.

Illinois State Board of Education. (1995). Performance Assessment in Mathematics: Approaches to Open-Ended Problems.

This document, which includes the scoring rubric shown in Figure 6, provides guidelines and suggestions for creating and using open-ended performance items for problem solving.

Lane, S. (1993). The conceptual framework for the development of a mathematics performance instrument. Educational Measurement: Issues and Practice, 12, 16-23.

Linn, R.L. (1993). Educational assessment: Expanded expectations and challenges. Educational Evaluation and Policy Analysis, 15(1), 1-16.

Linn, R.L., E.L. Baker, and S.B. Dunbar. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-23.

Marzano, R.J., D. Pickering, and J. McTighe. (1993). Assessing Student Outcomes: Performance Assessment Using the Dimensions of Learning Model. Alexandria, VA: Association for Supervision and Curriculum Development.

Describes a practical approach to student assessment that recognizes the connections among teaching, learning, and assessment. Includes many generic rubrics for teachers to adapt and use.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). New York: Macmillan.

National Evaluation Systems, Inc. (1987). Bias Issues in Test Development. Amherst, MA: National Evaluation Systems, Inc.

Prepared under contract with the Illinois State Board of Education. Discusses assessment bias through language usage, stereotyping, representational unfairness, and content exclusion. Includes guidelines for avoiding bias.

Ory, J. C. and K. E. Ryan. (1993). Tips for Improving Testing and Grading. Newbury Park, CA: Sage.

A practical guide for developing forced-choice and performance-based assessments (including classroom-level assessments) and for assigning grades.

Shepard, L.A. (1993). Evaluating test validity. In Review of Research in Education, 19, pp. 405-484. Washington, DC: American Educational Research Association.

Stiggins, R. (1987, Spring). Design and development of performance assessments. Educational Measurement: Issues and Practice, 35.

Wilson, L. R., B. C. Sherbarth, H. M. Brickell, S. T. Mayo, and R. H. Paul. (1988). Determining Validity and Reliability of Locally Developed Assessments. Springfield, IL: Illinois State Board of Education.

Note: Most Illinois State Board of Education publications are available from ISBE and from regional education service agencies (although some may be out of print).

Table of Contents

Chapter 1

Local Assessment Systems

Overview of Assessment

What is assessment?

What is a comprehensive assessment system?

Assessment types

Forced-choice assessment

Performance-based assessment

Some thoughts about this typology

Assessment sources

Commercial publishers

Local school or district

Other educational organizations

DEVELOPING LOCAL ASSESSMENT SYSTEMS

Identifying assessment purposes/functions

Deciding what to assess

Selecting assessment types

Establishing assessment schedules

Involving staff and other constituent groups

Overall committee (for policy recommendations)

Task forces (for specific areas and functions)

Special considerations

The school board's role

Reviewing the assessment system

Chapter 2

The Quality of Assessment

Ensuring the Quality of Assessment

Validity

Figure 2: Summary of Criteria for Estimating Validity

Validity criteria

1) Consequences

2) Content Coverage

3) Content Quality

4) Transfer and Generalizability

5) Cognitive Complexity

6) Meaningfulness

7) Fairness

8) Cost and Efficiency

Examining and improving validity

Figure 3: Worksheet: Assessment Validity Review

Reliability

Examining reliability

Forced-choice assessment

Figure 4: Methods of Estimating Reliability

Performance-based assessment

Gradually accumulating evidence of reliability

Improving reliability

Fairness

Examining and improving fairness

OTHER IMPORTANT TOPICS

Assessment administration

Alignment

Steps toward improving alignment

Estimating Reliability - Forced-Choice Assessment

Reliability Estimation Using a Split-half Methodology

Chapter 3

Selection and Development of Assessment Procedures

Selecting Assessment Procedures

General selection criteria

Developing Forced-Choice Assessment Procedures

Writing your own assessment items

Constructing tests from items

Developing Performance-Based Assessment Procedures

Developing Tasks

Developing Scoring Scales

Types of Scoring Scales

Figure 5

Figure 6

Selecting Criteria

Identifying Scale Values

Chapter 4

Interpretation, Use, and Reporting of Assessment Results

Validity

Generalization

Multiple Sources of Information

Results for Individual Students