PME-NA 2009 Atlanta: Plenary plus Additional Resources

What does it mean for mathematics tests to be insensitive to instruction? There are at least two senses of this title that these materials engage. The first is an account of a set of data-driven analyses showing not just the specifics of what we mean when we say high-stakes mathematics tests are insensitive to instruction, but also the outlines of our proposal for how this instructional insensitivity emerges in relation to current psychometric practice. The second concerns how our psychologies (plural) of mathematics education stand in relation to mainstream educational psychologies whose roots extend back to Francis Galton, and which continue to center on decomposing intelligent, knowing-related behaviors into some variant of general reasoning ability (“g”) and specific factors or skills (“s”).

What it Means for Mathematics Tests to be Insensitive to Instruction

Presenter: Walter M. Stroup


Questions to Ask | Hearing | Ed Week | Apologists | Paper | TAKS Graph | Pham Thesis | Alternative Items | Feedback Form | Links | Copyright

Note: This web page was originally intended to make the plenary address widely available. Additional information and links have been added since then, but the structure of the page still reflects its original intent.

Questions to Ask (modified from response to Wiggins included below):

Two closely related questions should be asked by state legislators, governors, secretaries of education, parents, the business community, and anyone else having a stake in our educational system:

(1) What is the evidence, at scale, that the tests are adequately sensitive to the quality of instruction, or other school input factors, to be able to serve the goals of a high-stakes accountability system? Put another way, when implemented at scale (e.g., across a district or state), what fraction of the variance can be shown to be sensitive to instruction (or, more precisely, to differences in instruction) on anything like an annual basis? If these tests are meant to be used for accountability purposes "at scale", then they should be evaluated for their effectiveness ("sensitivity") at scale.

(2) Given that from the earliest days of high-stakes testing nearly everyone involved has acknowledged that some level of preparation for testing (e.g., what is sometimes discussed as a kind of familiarity) impacts results, we need to ask ourselves, not as a theoretical ideal but as a practical matter in real-world test development, what fraction of the overall variance in student outcomes we would be willing to tolerate as reflecting a test-taking ability? ... 5% ... 10% ... more? [The empirical results, based on analyses of the statewide data sets, are surprising to nearly everyone ... everyone, that is, except the test vendors (and their un-apologists) ... who now want us to see this flaw as a feature of their tests ... and where, in Texas, they've even managed to put their name for this test-taking ability in the title of the "new" tests ...]
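To make the phrase "fraction of the variance" in the two questions above concrete, here is a minimal sketch of a toy additive model. It is an illustration only: the model, the function names, and every parameter value are assumptions introduced here, not estimates from any statewide data set and not the analysis reported in the plenary paper.

```python
import random
import statistics


def simulate_scores(n_students=5000, trait_sd=1.0, instruction_sd=0.4,
                    noise_sd=0.5, seed=0):
    """Toy model (hypothetical): score = persistent trait + instruction + noise.

    The three standard deviations are arbitrary illustration values.
    """
    rng = random.Random(seed)
    trait = [rng.gauss(0, trait_sd) for _ in range(n_students)]
    instruction = [rng.gauss(0, instruction_sd) for _ in range(n_students)]
    noise = [rng.gauss(0, noise_sd) for _ in range(n_students)]
    score = [t + i + e for t, i, e in zip(trait, instruction, noise)]
    return trait, instruction, score


def variance_share(component, score):
    """Approximate fraction of score variance carried by one component.

    Valid as a decomposition only because the simulated components are
    drawn independently of one another.
    """
    return statistics.pvariance(component) / statistics.pvariance(score)


trait, instruction, score = simulate_scores()
print("share from persistent trait:", round(variance_share(trait, score), 3))
print("share from instruction:     ", round(variance_share(instruction, score), 3))
```

In a real data set the components are not observed directly, so the shares have to be estimated (for example from how stable student scores are across years); the sketch only shows what the quantity being asked about would mean if the components were known.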

Texas House of Representatives Public Education Committee Hearing - June 19, 2012

Notice of Public Hearing

Online Video of Full Hearing :: Standalone Version - 7 hours 32 minutes

TAMSA Member presents 5:35:54 (who also appears in KVUE video below)

Discussion of Insensitivity to Instruction begins at 6:45:00 (to ~7:14:30)

Texas AFT - Texas House Committee Hears Plenty from Critics of STAAR Testing

TAMSA - Press Release and Response to TAB Press Release and Letter to State Lawmakers

Austin American Statesman - Blog Comment Regarding Preliminary STAAR Results for Austin

KVUE TV Coverage - Video

Ed Week Article

What Bernie Madoff Can Teach Us About Accountability in Education - Stroup

Bridging Differences - Meier & Ravitch

A Publisher's (un)Apologists and Some Responses

Lobbyist's Opinion Article, which appeared in the Statesman during the period when STAAR legislation was being debated (March 17, 2011) [No mention is made of his substantial, longstanding relationship to the TAKS & STAAR vendor].

A Submitted but Unpublished Response to lobbyist's article (indeed, so far as we know, no substantive responses were ever published by the Statesman).

Texas Observer article discussing the influence of the publisher of TAKS/STAAR, including the relationship to this lobbyist (search for "Kress")

Concerns about increased testing called hysterical and irrational on a blog by influential curriculum expert Grant Wiggins. (Our responses can be found under "Comments" ... look for "Stroup".)

Press release announcing publisher's "exclusive partnership" with this same curriculum expert. "How could I go wrong with working for the best publisher in the world" - Grant Wiggins.

Plenary Paper (as Read) & Slides (2009)


What it Means for Mathematics Tests to be Insensitive to Instruction (DOC) (PDF)


PPT File :: PDF of PPT

Go to Feedback Form


Nothing about the transition from TAKS (Texas Assessment of Knowledge and Skills) to STAAR (State of Texas Assessments of Academic Readiness) is likely to fix this insensitivity to instruction (or tendency of the tests to put the children in the same rank order, year in and year out, in ways that are [overwhelmingly] not sensitive to meaningful content-specific instruction).

If the tests are mostly about this test-taking profile (on the order of 72% of the variance), then this aspect alone should disqualify their continued use as metrics of accountability (certainly for evaluating individual teachers whose students score at either end, or even in the middle, of the range of high-stakes test outcomes). The tests simply aren't sensitive enough to what educators and taxpayers think they are measuring to warrant their continued use ... and even more so given the implications that graphs like the one shown above will have, and are having, for the lives of children and their communities.

Indeed, if anything, the STAAR tests -- with their much greater emphasis on "readiness" standards over "support" standards -- are likely to be even less sensitive than the TAKS tests to meaningful content-specific instruction (that is, what most teachers and what most parents think taking a class called "Algebra" should be about ... and not simply about re-inscribing each student's relative position on a largely content-neutral, test-taking, profile that is now being sold to the public as some measure of "college and career" readiness).

Included below are links to a PowerPoint, based on information from the Texas Education Agency, explaining the shift in the relative emphasis on readiness vs. support standards in going from TAKS, or even from the state curriculum standards ("TEKS"), to STAAR. Implications for curricula are then illustrated as one way of responding, in the face of this insensitivity, that makes the best of the very small portion of the test results likely to be related to learning meaningful concepts and practices in science and mathematics. Teachers and other educators have to make the best of the world they find themselves in. The fact that they are willing to do this as part of their commitments to children and their communities should not distract us from the basic fact that the current generation of tests is almost entirely broken in terms of performing its intended function -- being an effective tool for holding the educational system accountable for student learning outcomes.

STAAR Notes ppt :: STAAR Notes pdf

Computer Modeling of the Instructionally Insensitive Nature of the Texas Assessment of Knowledge and Skills (TAKS) Exam - PhD Thesis by Vinh Pham

Stakeholders of the educational system assume that standardized tests are transparently about the subject content being tested and therefore can be used as a metric to measure achievement in outcome-based educational reform. Both analysis of longitudinal data for the Texas Assessment of Knowledge and Skills (TAKS) exam and agent based computer modeling of its underlying theoretical testing framework have yielded results that indicate the exam only rank orders students on a persistent but uncharacterized latent trait across domains tested as well as across years. Such persistent rank ordering of students is indicative of an instructionally insensitive exam. This is problematic in the current atmosphere of high stakes testing which holds teachers, administrators, and school systems accountable for student achievement. (PDF)
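The "persistent rank ordering" described in the abstract can be illustrated with a short simulation. This is a minimal sketch under stated assumptions, not the agent-based model developed in the thesis: each simulated student has a fixed latent trait plus independent year-specific noise, and the names `simulate_two_years` and `spearman` and all parameter values are hypothetical.

```python
import math
import random
import statistics


def simulate_two_years(n_students=2000, trait_sd=1.0, noise_sd=0.5, seed=1):
    """Two years of toy scores sharing one persistent latent trait (assumed)."""
    rng = random.Random(seed)
    trait = [rng.gauss(0, trait_sd) for _ in range(n_students)]
    year1 = [t + rng.gauss(0, noise_sd) for t in trait]
    year2 = [t + rng.gauss(0, noise_sd) for t in trait]
    return year1, year2


def ranks(values):
    """Rank positions (0 = lowest). Ties are not averaged; the simulated
    scores are continuous, so ties are not expected in practice."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank positions."""
    return pearson(ranks(x), ranks(y))


year1, year2 = simulate_two_years()
print("year-to-year rank correlation:", round(spearman(year1, year2), 3))
```

A rank correlation that stays high from year to year regardless of which teacher or curriculum a student encounters is the kind of pattern the abstract calls instructionally insensitive; setting `trait_sd=0.0` in the sketch removes the persistent trait and the correlation drops toward zero.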


Yes, there are clear alternatives. This is true even when imposing the constraint of having to use the "existing infrastructure" for implementing large-scale, multiple-choice tests. If, for example, we allow students to bubble in more than one answer, we can readily create and use at scale items that would be both rigorous and much more informative than the current items.

These "non-dichotomous multiple choice" (NDMC) items are also sensitive to instruction in ways that support much richer conversations among educators and the larger public about the kinds of learning and teaching we want in our schools. In other words, items like these would better support meaningful accountability at scale.

NDMC items are not dependent for their interpretation on comparisons to proprietary psychometric profiles (i.e., they are transparent in ways that would allow them to better serve the goals of "criterion referenced" testing) and they can be written by a wide range of responsible colleagues (including experienced classroom educators) with expertise in domain-situated learning (e.g., issues related to 12-year-olds learning Newton's Laws). A few examples are provided for both mathematics and science. [Note: in current practice the last response -- "none of the above" -- shown in these sample items is no longer included.]

NDMC items similar to these sample items have been used in various contexts in the US and Mexico (at scale with N=500,000 students).
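One way to see why allowing more than one bubbled answer can be more informative is to look at how such responses might be summarized. The sketch below is purely illustrative: the source does not specify a scoring or reporting scheme for NDMC items, so the functions `tally_patterns` and `option_rates`, the option labels, and the sample responses are all hypothetical.

```python
from collections import Counter


def tally_patterns(responses):
    """Count how many students bubbled each distinct combination of options."""
    return Counter(frozenset(response) for response in responses)


def option_rates(responses, options):
    """Fraction of students who bubbled each option, alone or in combination."""
    n = len(responses)
    return {opt: sum(opt in response for response in responses) / n
            for opt in options}


# Hypothetical responses from four students to one item with options A-D.
responses = [{"A", "C"}, {"A"}, {"A", "C"}, {"B"}]

for pattern, count in tally_patterns(responses).items():
    print(sorted(pattern), count)
print(option_rates(responses, ["A", "B", "C", "D"]))
```

Reporting the full pattern of selections, rather than a single right-or-wrong mark, keeps visible which ideas students hold together, which is the kind of information a dichotomous item discards.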

Non-Dichotomous Multiple Choice Items: PDF or PPT

Related Links

Conference Site:

Feedback Form

Stroup, W. M., Hills, T., & Carmona, L. (2012). "Computing the Average Square: An Agent-based Introduction to Aspects of Current Psychometric Practice." Technology, Knowledge and Learning. Download.


© generative design center, 2005-2012