The Good and the Bad of Standardized Tests in Schools
In this rather long blog, I point out why standardized tests started out as valid indicators of school effectiveness but became forces that dilute the ability of schools to prepare students for the AI age.
For most of the history of schooling, no standardized tests were given during the K-12 years. Teachers were assumed to be professionals who knew what they were doing, and each teacher developed her or his own approach to assessing student learning progress. Over time, though, successful completion of schooling through twelfth grade became an important indicator of suitability for employment. That raised the possibility that some schools might give diplomas to students who had not learned much, which was unfair to others trying to use their diplomas as qualifications for jobs or for entrance into universities. So, states like New York developed examinations that needed to be passed to get a diploma (the Regents’ Exams), and colleges formed the College Board to arrange development of a college entrance examination, the SAT. The Regents’ Exams require extended responses by students and not just multiple-choice answers, though multiple-choice and other “objective” items predominate. The SAT contains mostly objective items, though it also includes an essay element that some colleges consider while others do not.
Validity. Developers validate the items on standardized tests in several ways. Generally, the main consideration for the SAT is whether it predicts grades in the first year of college. In addition, a variety of statistical techniques are used to ensure that the test items all help in that prediction. Items vary in difficulty, of course, and techniques are used to make sure that the set of items on each form of the test includes a sufficient spread of item difficulty, so that the test is valid for people at all levels of school mastery, from very low achievement to extreme excellence.
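To make the idea of predictive validity concrete, here is a minimal sketch in Python. It assumes that "predicts first-year grades" is summarized as a plain Pearson correlation between test scores and first-year grade point averages; real validity studies are more elaborate. The function name pearson_r and all of the numbers are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data for six students: admission test score and first-year GPA.
test_scores = [900, 1000, 1100, 1200, 1300, 1400]
first_year_gpa = [2.4, 2.9, 2.7, 3.2, 3.4, 3.8]

validity = pearson_r(test_scores, first_year_gpa)
print(f"Correlation between score and first-year GPA: {validity:.2f}")
```

A correlation near 1 would mean the scores order students almost exactly as their first-year grades do, while a value near 0 would mean the test tells colleges very little.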
Fairness. Another big consideration is fairness. It is a problem if the SAT provides valid scores for rich children but not for poor children, or for white children but not for African American children. This can easily happen. Consider a simple analogy item like “Caesar is to salad as Napoleon is to _____.” The intended answer is brandy. Most likely, a rich kid would have a better chance of answering this item than a poor kid, simply because expensive brandy is more likely to be in the experience of a wealthy family. If that is the case, then among students with the same total test score, the rich kids should answer that item correctly more often than the poor kids. Developers use a statistical technique called differential item functioning analysis to detect such problems, and items that are unfair in this way are eliminated.
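For readers who want to see what a differential item functioning check can look like, here is a rough sketch. It uses the Mantel-Haenszel odds ratio, which is one common technique and not necessarily the one any particular testing company uses. Students are grouped by total score, and within each group the odds of answering the item correctly are compared for a "reference" group and a "focal" group. The function names, group labels, and response counts are all invented.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across total-score strata.

    records: iterable of (group, total_score, correct), where group is
    "reference" or "focal" and correct is True or False.
    """
    # One 2x2 table per total score: A/B = reference right/wrong,
    # C/D = focal right/wrong.
    strata = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for group, total_score, correct in records:
        cell = strata[total_score]
        if group == "reference":
            cell["A" if correct else "B"] += 1
        elif group == "focal":
            cell["C" if correct else "D"] += 1
        else:
            raise ValueError(f"unknown group: {group!r}")

    numerator = 0.0
    denominator = 0.0
    for cell in strata.values():
        n = sum(cell.values())
        numerator += cell["A"] * cell["D"] / n
        denominator += cell["B"] * cell["C"] / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator


def make_records(group, total_score, n_correct, n_incorrect):
    """Build invented response records for one group at one score level."""
    return ([(group, total_score, True)] * n_correct
            + [(group, total_score, False)] * n_incorrect)


# Invented responses to one item, at two total-score levels.
records = (
    make_records("reference", 10, 8, 2) + make_records("focal", 10, 5, 5)
    + make_records("reference", 20, 9, 1) + make_records("focal", 20, 7, 3)
)
alpha = mantel_haenszel_odds_ratio(records)
print(f"Mantel-Haenszel odds ratio: {alpha:.2f}")
```

An odds ratio close to 1 suggests that the item behaves the same way for both groups once overall score is taken into account. A value well above 1, as in this invented data, suggests that the item favors the reference group and deserves a closer look.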
Reliability. A third consideration is reliability. This is achieved by having many items on a test. So, if a particular item taps some experience that a particular child may not have had, other items compensate for that. Testing companies have large pools of items, and reliability means that the score a person gets should be about the same regardless of which items are sampled from the pool for a particular version of the test.
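The link between test length and reliability can be illustrated with the Spearman-Brown formula from classical test theory. This is a sketch under a strong assumption, namely that all items are equally good measures of the same thing. The function name spearman_brown and the single-item reliability of 0.10 are invented for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made length_factor times longer.

    Spearman-Brown formula: k * r / (1 + (k - 1) * r), where r is the
    reliability of the original test and k is the length factor.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    r, k = reliability, length_factor
    return k * r / (1 + (k - 1) * r)


# Invented starting point: one item on its own has reliability 0.10.
single_item_reliability = 0.10
for n_items in (1, 10, 25, 50, 100):
    predicted = spearman_brown(single_item_reliability, n_items)
    print(f"{n_items:3d} items -> predicted reliability {predicted:.2f}")
```

Under these assumptions, each added item helps, but by less and less as the test grows. That is one reason test developers want many items and not just a handful.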
Cheat-proof. Especially when government is involved, there is huge pressure to make tests cheat-proof, even if that slightly erodes validity. Tests are scored by machine or by independent scorers to ensure honesty in scoring. Today, even essay items can be machine-scored, though it is still possible that human scoring is better able to handle students with dramatically different but still effective writing styles. Overall, tests like the SAT, and the many subject-matter tests that states use to evaluate school performance, are valid, fair, and reliable.
From validating diplomas to checking up on schools. Once it was accepted that standardized tests can be produced that are valid, reliable, fair, and cheat-proof, pressure grew to build similar tests to evaluate the effectiveness of schools in teaching the various subjects in different grades throughout the K-12 years. Today, there is a rather large business around testing. For example, in 2000, Pearson PLC, a major educational publisher, paid $2.5 billion to acquire National Computer Systems, whose business was largely in standardized testing. Pearson is not the only company in the testing world, either. Today, a school district or a state has many choices for standardized tests aimed at the various subjects and the various grade levels, and the larger companies develop custom tests for the various states. It is a big business. So, what can go wrong with just checking to be sure that schools are working?
Test stresses. Parents are probably most worried about the stresses that these tests produce for students. In reality, almost all school-based standardized testing has minimal consequences for students. Indeed, even completion tests like the Regents’ Exams tend to have a variety of ways that a student who does not score well can still end up with a diploma. Moreover, as the supply of 18-year-olds continues to drop, few students are locked out of college by a bad test score, though it does make a difference in chances of admission to the top colleges. However, stress comes in other ways. I have seen cases of students being told before a test that their teachers or their principal might lose their jobs if the scores are low. I have even heard of cases where principals screamed at students because low results for their school threatened the principal’s bonus payment.
A second concern is what psychologists call stereotype threat. Claude Steele is a researcher who showed that when minority students are aware that test score averages will be broken down by race, simply knowing this makes them perform less well on the test. The bottom line is that no one wants to be labeled as part of a group that did not get good scores on these tests, and that reality leads to unhealthy stress.
The technology of standardized testing, combined with high stakes, can lead to a diluted curriculum. In order to be reliable, the tests need to have many items. That way, any peculiarities in how different students perceive a particular item are balanced out. Because there is strong pressure to limit the total time given up to testing in our schools, tests also need to be short. To fit many items on a short test, the items need to be questions that students can answer quickly. When there is strong pressure on schools to have high average test scores, it is not surprising that they focus on teaching students to handle questions of the kind the tests have.
So, what makes a question answerable quickly? First, it can be answered quickly if it just requires recalling some fact that has been learned. Also, some questions can be answered quickly if they require carrying out a procedure that has been extensively practiced. Related to both of these are questions where it is easy to pick the “safe” answer, the one that matches what you think everyone else is thinking.
What, in contrast, takes a long time? Complex problems that require integrating different bodies of knowledge to solve are one kind. Complex arguments where there are multiple points for and against a position, requiring a sophisticated strategy to take the best path but avoid unhealthy side effects, are another. Tasks that require searching out and evaluating information are a third. In fact, almost all the things that people do better than machines tend to take more time than test developers can allocate to a single test item. As a result, standardized tests place a premium on instruction in doing stuff that can be done fast.
Now, this does not mean that standardized tests cannot predict deeper cognitive competences. If teachers ignored tests and just taught deeper thinking and acting, the current tests would remain valid, since people who can do the complex stuff generally also can do the simple stuff. However, given the push to boost test scores, some schools and some teachers will be tempted to focus on stuff that can be done fast. As that happens, the tests become less valid, since being able to do the quick stuff does not guarantee being able to do the complex stuff.
Possible solutions. I will soon be writing a blog on how to avoid the negative side of standardized testing, but here are a few previews. First, one can remove the high stakes in testing. This will not be easy, since the public demands accountability and the press finds test scores a quick and easy kind of outcome to report. Indeed, the basic process of journalism also favors that which can be reported quickly and simply. Still, I think schools might experiment with tactics for reducing the high-stakes, anxiety-producing side of testing. A second approach is to build intelligent systems to extract valid assessments from children’s performance throughout the year. Valerie Shute uses the term “stealth assessment” to describe an approach to building assessments from tracking the details of student performance during learning games over extended periods. We are in the AI age, so why not use AI technologies to gather assessment data from the continuing stream of student learning activity? That broad approach has great promise. Some of it can be done today, but mainly when students are learning from online activity. On the horizon, though, are possibilities for assessments based upon recordings of student activity in the classroom during real human interactions. More on that in a future blog.