For Immediate Release: March 28, 2018
Contact:
Tony Pals, tpals@aera.net
(202) 238-3235, (202) 288-9333 (cell)
Collin Boylin, cboylin@aera.net
(202) 238-3233, (860) 490-8326 (cell)
Study Snapshot: How Test Format May Influence Gender
Achievement Gaps on State Standardized Tests
Study: “The Relationship Between Test Item Format and Gender Achievement Gaps on Math and ELA Tests in Fourth and Eighth Grades”
Authors: Sean F. Reardon (Stanford University), Demetra Kalogrides (Stanford University), Erin M. Fahle (Stanford University), Anne Podolsky (Learning Policy Institute), and Rosalía C. Zárate (Stanford University)
Published online March 27, 2018, in Educational Researcher, a peer-reviewed journal of the American Educational Research Association
Main Finding:
- Measured achievement gaps between male and female students on state accountability tests are larger (more male-favoring) on tests with more multiple-choice questions and fewer constructed-response (i.e., open-ended) questions. Gaps are more female-favoring on tests with fewer multiple-choice questions and more constructed-response questions. Differences in the question format among states’ tests explain approximately 25 percent of the variation in achievement gaps across states and districts.
Details:
- The researchers examined whether there is a relationship between the format of test questions—multiple choice versus constructed response (open ended)—and differences between male and female students’ scores on state accountability tests, and whether the association varies across grades and subjects.
- The researchers used scores of roughly 8 million students tested in fourth and eighth grades in math and reading/English language arts (ELA) in 47 states during the 2008–09 school year to estimate state- and district-level subject-specific achievement gaps on each state’s accountability tests.
- The student achievement data were pulled from three primary sources: the National Center for Education Statistics EDFacts Database, National Assessment of Educational Progress data, and Northwest Evaluation Association Measures of Academic Progress data.
- In each subject and grade, the study found that achievement gaps were more female-favoring, on average, on tests with higher proportions of constructed-response questions than on tests with higher proportions of multiple-choice questions. Conversely, gaps were more male-favoring, on average, on tests with more multiple-choice questions and fewer constructed-response questions.
- The difference in the measured achievement gap that is attributable to test question types was moderately large. If we measure the male-female gap in terms of grade-level equivalents, the difference in male and female students’ average scores is, on average, one third of a grade-level larger (in favor of male students) in states where all of the questions are multiple choice than in states where only half of the questions are multiple choice.
- In other words, male students appear to perform 1/3 of a grade-level higher, relative to female students, on tests composed entirely of multiple-choice questions than they do on tests that are 50 percent multiple choice. Equivalently, female students appear to perform 1/3 of a grade-level higher, relative to male students, on tests where half the questions are open-ended than they do on tests with no open-ended questions. (An illustrative calculation of this implied relationship follows this list.)
- The authors found that test format explained approximately 25 percent of the variation in state- and district-level gender achievement gaps in the United States. The association appears stronger in ELA than in math, but the differences are not statistically significant.
- The findings are consistent with earlier studies suggesting that measured gender gaps are sensitive to item format on standardized tests. However, only a few such studies have analyzed recent state accountability tests, and those studies have generally focused on a single state. Because this study used state accountability test data from 47 states, it has broad generalizability to the kinds of high-stakes accountability tests used in the U.S.
- The authors caution that these findings do not necessarily mean that multiple-choice tests are biased in favor of boys or that open-ended tests are biased in favor of girls.
- It may be that the tests are biased—perhaps because male and female students differ, on average, in the test-taking skills that are rewarded on multiple-choice tests (for example, willingness to guess, which may favor male students) and the skills rewarded on constructed-response tests (such as handwriting, which may favor female students). Such skills are irrelevant to the domain (math or ELA) that the tests purport to measure, so differences in scores that result from item format would, in this case, reflect differences in “construct-irrelevant” skills.
- On the other hand, it may be that multiple-choice and constructed-response items are used to test different dimensions of math or reading skills (multiple-choice items may be used to test addition and subtraction, for example, while open-ended items may be used to test problem-solving skills). In this case, the differences in gender gaps between multiple-choice and open-ended questions may reflect real gender differences in these “construct-relevant” dimensions of skill.
- Although the study could not determine whether the association of gender gaps with test question format was due to gender differences in construct-relevant or construct-irrelevant skills, the findings suggest the differences are large enough to have meaningful consequences for students.
- “The evidence that how male and female students are tested changes the perception of their relative ability in both math and ELA suggests that we must be concerned with questions of test fairness and validity,” said Reardon. “Does the assessment measure the intended skills? Does it produce consistent scores for different student subgroups? Is the assessment appropriate for its intended use?”
- This implies that test developers and educators will need to attend more carefully to the mix of item types and the multidimensional sets of skills measured by tests. Policymakers, too, will need to be aware of how states’ use of different test formats or emphases on different skills may influence cross-state comparisons of gender gaps and funding decisions based on those results.
- One limitation of the study is that researchers used test data from 2008–09. To the extent that test content and item formats have changed, the study’s results may not generalize to some of the tests being used for accountability purposes today.
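For readers who want to see how the grade-level figures above translate into a general relationship, the sketch below assumes, purely as an illustration and not as a result reported in the study, that the association is linear in the share of multiple-choice items. The function name and the zero baseline are hypothetical; the only figure taken from the summary above is that the gap is roughly one third of a grade level more male-favoring on an all-multiple-choice test than on a test that is half multiple choice.

```python
# Illustrative sketch only. Assumes a linear relationship between the share of
# multiple-choice (MC) items and the measured male-female gap, calibrated to the
# press release's example: about 1/3 of a grade level more male-favoring at
# 100% MC than at 50% MC. The baseline value (0.0) is hypothetical.

def predicted_gap_shift(prop_multiple_choice, baseline_gap=0.0,
                        slope=(1.0 / 3.0) / 0.5):
    """Predicted shift in the male-female gap (grade-level equivalents,
    positive = more male-favoring) for a test with the given proportion of
    multiple-choice items, relative to a hypothetical all-constructed-response
    baseline."""
    return baseline_gap + slope * prop_multiple_choice

for p in (0.0, 0.5, 1.0):
    print(f"{p:.0%} multiple choice -> gap shift of "
          f"{predicted_gap_shift(p):+.2f} grade levels")
```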
Funding note: This research was supported by grants from the Institute of Education Sciences, the Spencer Foundation, and the William T. Grant Foundation.
To see the full study, click HERE.
Periodically, AERA will send out a brief overview, or snapshot, of a recent study that has been published in one of its peer-reviewed journals. AERA’s “Study Snapshots” provide a high-level glimpse into new education research.
To speak with study coauthor Sean F. Reardon, please contact Tony Pals at tpals@aera.net or Collin Boylin at cboylin@aera.net.
About AERA
The American Educational Research Association (AERA) is the largest national interdisciplinary research association devoted to the scientific study of education and learning. Founded in 1916, AERA advances knowledge about education, encourages scholarly inquiry related to education, and promotes the use of research to improve education and serve the public good. Find AERA on Facebook, Twitter, and Instagram.
# # #