The Neutralization of Benefits in Standardized Testing
Standardized testing has become an important part of the United States education system. The concept of a test prepared and administered to allow a universal, unbiased assessment of all students who take it sounds like the Golden Fleece of education. Indeed, the standardized test is a well-integrated part of most school systems and is frequently used for system-wide assessment. Together, these school systems spend about $516 million annually on system-wide testing (Chelimsky, 2), and the tests soak up approximately twenty million school days (Coley, 1). A half-billion dollars is a large sum, and large expenditures of time and money often draw public scrutiny and accusations of inefficiency or ineffectiveness.
Many reasons exist for such an assault on standardized tests as they exist today. They stand accused of many things, including bias toward certain types of students and a mismatch with what students are actually taught (Haney and Madaus, 9). As more research on traditional practices emerges, the more inaccurate and biased those practices are revealed to be. In the scramble to replace them with new and better methods, new problems are created, such as added costs or unsettling results (Coley, 4). The result is a fundamental gap between the perceived worth of standardized tests and their actual worth to administrators and educators.
Now more than ever, the tests are the ones being tested. Many agencies, including the National Assessment of Educational Progress (NAEP), the National Council on Educational Standards and Testing (NCEST), and the Educational Testing Service (ETS), are examining testing methods and practices. The body of information about the tests themselves is growing, and a stronger understanding of the benefits and consequences of testing practices is beginning to form.
There are still many benefits attributed to standardized tests. Some are real; some are only perceived. Standardized tests can provide two useful sorts of information. They can be used for assessment, the weighing of an individual student's abilities in order to provide better teaching, and for accountability, the control mechanism by which outside agencies ensure that educators and students are doing their jobs. This information is often used to make important decisions. The defining feature of standardized testing is its universal nature. It is designed to be presented to students in a uniform fashion so that the method of administration does not become a variable in student performance. Likewise, the questions themselves are meant to be universally accessible to the students. These controlled variables lead to a perception and treatment of standardized test results as laboratory results (Haney and Madaus, 18).
One very popular use of standardized testing is to determine which students are suited for remediation. Standardized testing moved into the educational spotlight in the '70s for its ability to test "minimum competency" (Coley, 1). Because it treats all students equally, it is seen as a suitable judge for identifying lagging students objectively, singling them out by empirical methods.
Standardized tests can also be seen as a mandate from the state or federal government directing a school to concentrate on certain areas. If a state test required for graduation adds a new math section, local curricula will change to compensate. In this way, tests become messages from governments to school systems, communicating what lawmakers believe should be learned and deserves more attention. The greater the importance given to a standardized test, the greater its power over students and educators. Tests used to make decisions directly affecting the test-taker are referred to as "high-stakes." Remediation exams, exit exams, and entrance exams carry a great deal of influence.
One of the values of assessment is in showing where teaching needs to change in order to better suit the subject matter. The results of standardized testing can show which areas educators handle well and which are less developed. Much as it shapes curriculum, standardized testing can serve as a guideline for individual educators who need to adjust their methods to better focus on what should be learned.
However, standardized testing is not without its consequences. Each of the benefits above also has a negative side.
For instance, the local curriculum can be changed by an important test. If the test covers necessary skills, the curriculum focuses more on those skills. However, no test can cover a subject completely, and this leads to problems. If a certain bit of information appears on the tests, it shows up in the curriculum; educators may do this to help their students or to improve their own standing. The rest of the curriculum has to fight to stay included. With standardized testing, a sharp line is drawn between tested and non-tested information, and both students and educators begin to see this line as dividing relevant from irrelevant information (Haney and Madaus, 14).
While a main goal of accountability is to put pressure on educators and students to perform, undue pressure does more harm than good. The greater the importance a test is given, the greater its effect on the students. Given the concentration on tests expected of students, it is not surprising that they tend to push aside other areas of study when under the pressure of an important test. Teachers complain that students immediately glaze over when they find out that presented information will not be tested (O'Shea et al., 24). "If important decisions are presumed to be related to test results, then teachers will teach to the test." (Haney and Madaus, 12) If there is any undisputed fact associated with standardized testing, it is that teachers change their methods of teaching so that their students will do better on important tests.
Standardized tests are certainly not the Holy Grail for educators, as much as they have appeared to be in the past. After seeing the consequences caused by their implementation, one begins to wonder about their value to a school system. If their main benefits are to provide data to the administration and to provide structure to the educators but the data is faulty and the structure is domineering, the worth of standardized testing comes into question.
Many variables arise when dealing with standardized tests. The method of questioning, the grading method, and the scope of the tests all become characteristics that contribute to the value of testing. However, as will be shown, each testing characteristic has an Achilles' heel that either strips it of its value as a testing mechanism or makes it too difficult to employ.
The most significant recent research and experimentation in testing has focused on question methods. Yet even as older methods are shown to be inaccurate, they remain in use because they are too costly to replace. Multiple choice, for instance, is far and away the most widely used form of testing, largely because it is so easy to employ. Most educators can make a test machine-scorable, cutting down on grading time and the additional cost of human graders.
Multiple choice is also "least valued by state and local testing officials." (Chelimsky, 3) It encourages rote memorization of its subjects, and educators' methods adapt to emphasize that memorization. As with curriculum, teaching methods often conform to the questioning methods of "high-stakes" testing. "Teachers and students pay particular attention to the form of the questions on a high-stakes test . . . and adjust their instruction accordingly." (Haney and Madaus, 16) By limiting the breadth of the questioning method, the teaching method is also limited. Teaching becomes reduced to children "filling out ditto answer sheets or workbooks." (Haney and Madaus, 17)
Open-ended questions are another option. The idea of a traditional essay seeping into a standardized test scares some people who claim that essays cannot be graded in an unbiased manner. Some are worried that these types of questions "would not necessarily be comparable among themselves or over time." (Chelimsky, 3) ETS has made a strong effort to train multiple graders and construct grading formulas so that this is not the case, but their efforts fall apart as student abilities begin to vary more strongly.
A pattern that is apparent from the tables is a consistent tendency for the reliability statistics of the age 9/grade 4 papers to be the largest and reliability statistics of the age 13/grade 8 papers to be the smallest. . . . There are few 9 year-old students who receive the highest score, so the range is more restricted. The 13 year-olds probably have a wider range of abilities so that the opportunity of rater disagreement is increased. (Kaplan and Johnson, 8)
ETS labels current methods of grading open-ended test questions as "not a cost-effective way to reduce the standard errors of key statistics . . . " (Kaplan and Johnson, 9) Cost tends to be the strongest criticism anyone has of open-ended questions, but it is a very strong one. These performance-based tests are considerably more expensive in both development and implementation. Multiple-choice tests average $16 per student administered; open-ended tests cost roughly twice that, at $33 (Chelimsky, 2). A one-time development cost of $10 per student would also be needed (Chelimsky, 2). It is for this reason that 71% of standardized tests nationally are multiple-choice only, even after the many faults of multiple choice have been shown.
The latest entry into the fray is the machine-scored free-response question, meant to capture the strengths of both multiple choice and open-ended questions. There are two variants: free response and figural. Free response, used almost exclusively in mathematics, provides a bubble grid that allows students to fill in the answer they arrived at. This differs from multiple choice, which provides a range of answers for the student to choose from, a process that some feel "[manipulates] students into making process errors." (O'Shea et al., 26)
Preliminary results, under the guidance of ETS, have shown that free-response questions, if not a better estimate of ability than multiple choice, are at least more difficult. On tests with identical questions, multiple-choice users answered 7.4% more questions correctly (Coley, 5). Students seemed to agree that the new method was a better evaluation of their ability, with only 22% saying that multiple-choice tests were a better measure. ETS cautions that they "need to know more about how much difference there is, in terms of what is measured, between choosing among answers and constructing them." (Coley, 4) Figural questions are drawings left to be completed by the student; they do not provide answers for the student, but can still be machine-scored.
Machine-scored free-response questions seem to capture the benefits of multiple choice but still lack the depth associated with open-ended essay questions. Free response is best suited to math or science, where computation is involved; figural drawings, while broader in application, still cannot be applied to language evaluation. There is little argument that these question methods are a step in the right direction, but many skills still lack satisfactory testing.
The grading method itself is also a bit of a puzzle. Tests are referenced in one of two ways: norm-referenced or criterion-referenced. Norm-referenced tests report grades relative to the other students who took the test. This method is extremely useful for tracking and accountability purposes, since sub-par students and educators are easily identified. It is far less useful for students and for providing them a good education: students are described only in terms of other students' proficiency, and there are no educational absolutes. Criterion-referenced tests, which grade students directly on their individual mastery, are "increasingly hailed as superior to norm-referenced." (Haney and Madaus, 5) However, criterion-referenced tests, especially when purchased commercially, can differ greatly from the local curriculum, causing disruption in teaching methods, as mentioned earlier, and inaccurate results.
Another important item to be considered when dealing with standardized testing is the scope of the tests. Individual tests as well as testing systems change the way the nation views standardized testing.
Specific tests, like the SAT, have undermined their own validity through their popularity. They no longer test content but test-taking method. "[H]igh schools are increasingly offering courses to prepare students to take these tests, and commercial coaching schools are doing a land office business." (Haney and Madaus, 8) As mentioned in the debate over multiple-choice questions, teaching method over content can lead to problems, including a limited curriculum and a concentration on memorization.
The misuse of test data, especially SAT scores, can cause the test to reach unintended levels of importance. SAT scores are used for decisions like scholarship awarding and are even used by real estate agents to show the value of certain neighborhoods (Haney and Madaus, 19). When a certain "high-stakes" test is especially important, classes develop in order to better teach it. In Japan, extracurricular "cram schools" exist, called juku. In the U.S., entire high school classes can appear to teach one test. One English department head explains:
Because we now are devoting our best efforts to getting the largest number of students past the essay exam . . . we are teaching to the exam, with an entire course, English III, given over to developing one type of essay writing, the writing of a five-paragraph argumentative essay written under a time limit on a topic about which the author may or may not have knowledge, ideas, or personal opinions. (in Haney and Madaus, 16)
A possible solution to the growing popularity of certain nationally renowned tests could come from overseas. France, though it once had a single national test, has branched out into a "complex exam system with 28 different options and 23 different sets of questions." (Shafer, 3) This process helps in two ways. First, it is fairer to the students, allowing them a choice among tests to best determine their level of ability. Second, the individual tests do not acquire popularity and thus remain more useful in terms of statistics.
Again, this method falls just short of a solution. The multi-test process fractures data-taking and prohibits a "strict comparability across the entire country." (Shafer, 3) The students are no longer taking the same test, and their choice of tests is based on individual ability, not randomness, so the testing loses validity from a sampling standpoint as well. Further, the tests might not be basic enough: though students used to be assured of college entry by passing the exam, now "the most prestigious schools require students to pass further exams." (Shafer, 3)
All of this analysis leads to the realization that the fundamental aspects of standardized testing, at least as presently administered, are flawed. The concept of a single mastery test creates significant stress and invalidates itself by promoting method-based study instead of content-based study. While standardized tests are meant as a check to ensure that state and local curricula cover certain topics, they have been shown to warp curricula more than control them: classes become based on the tests and little else. Yet there is still a need for the two uses that standardized testing serves: assessment and accountability.
What needs to be done is to separate these two uses into two different tests. Testing for assessment (testing what is known) requires testing on an individual basis. Testing for accountability (testing what is taught) requires only broad knowledge. L. A. Shepard notes that "the differences between accountability and instructional assessment are so fundamental and necessary that it may not be desirable to merge the two purposes." (in O'Shea et al., 2)
For example, accountability testing is not nearly as urgent as assessment.
Every ten years the nation conducts a census. Its regular information comes from carefully constructed national samples of households, carried out by carefully trained interviewers, using instruments carefully developed and tested over many years. . . . This type of sampling could be used for accountability testing. (Coley, 4)
Accountability tests also need not be given to every individual student; they could be distributed to a smaller sample and still yield valid results. With accountability testing greatly minimized in its effect on students, little else remains. Of the 47 states that give system-wide tests at some point during education, 38 say the results are used for monitoring the school systems (Coley, 2).
Assessment tests are more selective and less routine. While accountability tests may be given to several grades several times a year, assessment tests normally fall along important boundaries: graduation, college acceptance, and so on. This creates a "knowledge bottleneck" around these dates and prompts schools to concentrate on preparing for these specific points. The bottleneck clogs further as curricula develop, even before the 12th-grade courses designed to prepare for a significant test, to "get these students ready for 12th grade." (O'Shea et al., 9)
To alleviate this problem, among others, a portfolio testing method seems promising. Portfolios of a student's work would be accumulated over the course of a year (or more) and then evaluated at the end of a term. Several standardized tests would take the place of one big one, alleviating both the stress on the student and the bottleneck effect. Portfolios could also be incorporated into a curriculum more easily, instead of dictating one.
In both the case of sampling and that of portfolios, the tests themselves would have to move away from the all-or-nothing construction of standardized tests. A mix of multiple-choice, free-response, and open-ended questions needs to be employed, for several reasons. First, cost is minimized while the value of the results is retained: though open-ended questions are generally regarded as better evaluations, their drawback is cost, and the opposite is true of multiple choice. Forged together like a sword, testing becomes more reliable but not too costly. Second, a test is in essence a picture of a student's performance. In football, there are multiple referees to ensure multiple angles of viewing the same play; they often disagree. In testing, multiple snapshots help define the overall picture. From these different angles of viewing one student, more is learned than from careful analysis of any one snapshot.
Standardized testing currently has many faults. Some solutions are emerging as research continues, but in general, standardized testing has gone mad. There is too much emphasis on the tests, too many decisions based on them, and too many agencies producing them. In spite of all the evidence against the validity of standardized testing, its use is growing (Haney and Madaus, 18). What is needed is a return to the teaching side of school and less concentration on the evaluations. While evaluations can be useful for determining the effects of teaching, no current evaluation justifies its interventions. I am reminded of a phrase I once heard from a farmer that applies well to standardized testing: "Cows grow faster when you feed them than when you weigh them."