No, the Sky Is Not Falling: Interpreting the Latest SAT Scores

Originally published on Brooking Institution's Brown Center Chalkboard, October 1, 2015

Earlier this month, the College Board released SAT scores for the high school graduating class of 2015. Both math and reading scores declined from 2014, continuing a steady downward trend that has been in place for the past decade. Pundits of contrasting political stripes seized on the scores to bolster their political agendas. Michael Petrilli of the Fordham Foundation argued that falling SAT scores show that high schools need more reform, presumably those his organization supports, in particular, charter schools and accountability.* For Carol Burris of the Network for Public Education, the declining scores were evidence of the failure of polices her organization opposes, namely, Common Core, No Child Left Behind, and accountability.

Petrilli and Burris are both misusing SAT scores. The SAT is not designed to measure national achievement; the score losses from 2014 were miniscule; and most of the declines are probably the result of demographic changes in the SAT population. Let’s examine each of these points in greater detail.

The SAT is not designed to measure national achievement

It never was. The SAT was originally meant to measure a student’s aptitude for college independent of that student’s exposure to a particular curriculum. The test’s founders believed that gauging aptitude, rather than achievement, would serve the cause of fairness. A bright student from a high school in rural Nebraska or the mountains of West Virginia, they held, should have the same shot at attending elite universities as a student from an Eastern prep school, despite not having been exposed to the great literature and higher mathematics taught at prep schools. The SAT would measure reasoning and analytical skills, not the mastery of any particular body of knowledge. Its scores would level the playing field in terms of curricular exposure while providing a reasonable estimate of an individual’s probability of success in college.

Note that even in this capacity, the scores never suffice alone; they are only used to make admissions decisions by colleges and universities, including such luminaries as Harvard and Stanford, in combination with a lot of other information—grade point averages, curricular resumes, essays, reference letters, extra-curricular activities—all of which constitute a student’s complete application.

Today’s SAT has moved towards being a content-oriented test, but not entirely. Next year, the College Board will introduce a revised SAT to more closely reflect high school curricula. Even then, SAT scores should not be used to make judgements about U.S. high school performance, whether it’s a single high school, a state’s high schools, or all of the high schools in the country. The SAT sample is self-selected. In 2015, it only included about one-half of the nation’s high school graduates: 1.7 million out of approximately 3.3 million total. And that’s about one-ninth of approximately 16 million high school students. Generalizing SAT scores to these larger populations violates a basic rule of social science. The College Board issues a warning when it releases SAT scores: “Since the population of test takers is self-selected, using aggregate SAT scores to compare or evaluate teachers, schools, districts, states, or other educational units is not valid, and the College Board strongly discourages such uses.”

TIME’s coverage of the SAT release included a statement by Andrew Ho of Harvard University, who succinctly makes the point: “I think SAT and ACT are tests with important purposes, but measuring overall national educational progress is not one of them.”

The score changes from 2014 were miniscule

SAT scores changed very little from 2014 to 2015. Reading scores dropped from 497 to 495. Math scores also fell two points, from 513 to 511. Both declines are equal to about 0.017 standard deviations (SD).^[i] To illustrate how small these changes truly are, let’s examine a metric I have used previously in discussing test scores. The average American male is 5’10” in height with a SD of about 3 inches. A 0.017 SD change in height is equal to about 1/20 of an inch (0.051). Do you really think you’d notice a difference in the height of two men standing next to each other if they only differed by 1/20^th of an inch? You wouldn’t. Similarly, the change in SAT scores from 2014 to 2015 is trivial.^[ii]

A more serious concern is the SAT trend over the past decade. Since 2005, reading scores are down 13 points, from 508 to 495, and math scores are down nine points, from 520 to 511. These are equivalent to declines of 0.12 SD for reading and 0.08 SD for math.^[iii] Representing changes that have accumulated over a decade, these losses are still quite small. In the Washington Post, Michael Petrilli asked “why is education reform hitting a brick wall in high school?” He also stated that “you see this in all kinds of evidence.”

You do not see a decline in the best evidence, the National Assessment of Educational Progress (NAEP). Contrary to the SAT, NAEP is designed to monitor national achievement. Its test scores are based on a random sampling design, meaning that the scores can be construed as representative of U.S. students. NAEP administers two different tests to high school age students, the long term trend (LTT NAEP), given to 17-year-olds, and the main NAEP, given to twelfth graders.

Table 1 compares the past ten years’ change in test scores of the SAT with changes in NAEP.^[iv] The long term trend NAEP was not administered in 2005 or 2015, so the closest years it was given are shown. The NAEP tests show high school students making small gains over the past decade. They do not confirm the losses on the SAT.

Table 1. Comparison of changes in SAT, Main NAEP (12th grade), and LTT NAEP (17-year-olds) scores. Changes expressed as SD units of base year.

SAT

2005-2015

Main NAEP

2005-2015

LTT NAEP

2004-2012

Reading

-0.12*

+.05*

+.09*

Math

-0.08*

+.09*

+.03

*p<.05

Petrilli raised another concern related to NAEP scores by examining cohort trends in NAEP scores. The trend for the 17-year-old cohort of 2012, for example, can be constructed by using the scores of 13-year-olds in 2008 and 9-year-olds in 2004. By tracking NAEP changes over time in this manner, one can get a rough idea of a particular cohort’s achievement as students grow older and proceed through the school system. Examining three cohorts, Fordham’s analysis shows that the gains between ages 13 and 17 are about half as large as those registered between ages nine and 13. Kids gain more on NAEP when they are younger than when they are older.

There is nothing new here. NAEP scholars have been aware of this phenomenon for a long time. Fordham points to particular elements of education reform that it favors—charter schools, vouchers, and accountability—as the probable cause. It is true that those reforms more likely target elementary and middle schools than high schools. But the research literature on age discrepancies in NAEP gains (which is not cited in the Fordham analysis) renders doubtful the thesis that education policies are responsible for the phenomenon.^[v]

Whether high school age students try as hard as they could on NAEP has been pointed to as one explanation. A 1996 analysis of NAEP answer sheets found that 25-to-30 percent of twelfth graders displayed off-task test behaviors—doodling, leaving items blank—compared to 13 percent of eighth graders and six percent of fourth graders. A 2004 national commission on the twelfth grade NAEP recommended incentives (scholarships, certificates, letters of recognition from the President) to boost high school students’ motivation to do well on NAEP. Why would high school seniors or juniors take NAEP seriously when this low stakes test is taken in the midst of taking SAT or ACT tests for college admission, end of course exams that affect high school GPA, AP tests that can affect placement in college courses, state accountability tests that can lead to their schools being deemed a success or failure, and high school exit exams that must be passed to graduate?^[vi]

Other possible explanations for the phenomenon are: 1) differences in the scales between the ages tested on LTT NAEP (in other words, a one-point gain on the scale between ages nine and 13 may not represent the same amount of learning as a one-point gain between ages 13 and 17); 2) different rates of participation in NAEP among elementary, middle, and high schools;^[vii] and 3) social trends that affect all high school students, not just those in public schools. The third possibility can be explored by analyzing trends for students attending private schools. If Fordham had disaggregated the NAEP data by public and private schools (the scores of Catholic school students are available), it would have found that the pattern among private school students is similar—younger students gain more than older students on NAEP. That similarity casts doubt on the notion that policies governing public schools are responsible for the smaller gains among older students.^[viii]

Changes in the SAT population

Writing in the Washington Post, Carol Burris addresses the question of whether demographic changes have influenced the decline in SAT scores. She concludes that they have not, and in particular, she concludes that the growing proportion of students receiving exam fee waivers has probably not affected scores. She bases that conclusion on an analysis of SAT participation disaggregated by level of family income. Burris notes that the percentage of SAT takers has been stable across income groups in recent years. That criterion is not trustworthy. About 39 percent of students in 2015 declined to provide information on family income. The 61 percent that answered the family income question are probably skewed against low-income students who are on fee waivers (the assumption being that they may feel uncomfortable answering a question about family income).^[ix] Don’t forget that the SAT population as a whole is a self-selected sample. A self-selected subsample from a self-selected sample tells us even less than the original sample, which told us almost nothing.

The fee waiver share of SAT takers increased from 21 percent in 2011 to 25 percent in 2015. The simple fact that fee waivers serve low-income families, whose children tend to be lower-scoring SAT takers, is important, but not the whole story here. Students from disadvantaged families have always taken the SAT. But they paid for it themselves. If an additional increment of disadvantaged families take the SAT because they don’t have to pay for it, it is important to consider whether the new entrants to the pool of SAT test takers possess unmeasured characteristics that correlate with achievement—beyond the effect already attributed to socioeconomic status.

Robert Kelchen, an assistant professor of higher education at Seton Hall University, calculated the effect on national SAT scores of just three jurisdictions (Washington, DC, Delaware, and Idaho) adopting policies of mandatory SAT testing paid for by the state. He estimated that these policies explain about 21 percent of the nationwide decline in test scores between 2011 and 2015. He also notes that a more thorough analysis, incorporating fee waivers of other states and districts, would surely boost that figure. Fee waivers in two dozen Texas school districts, for example, are granted to all juniors and seniors in high school. And all students in those districts (including Dallas and Fort Worth) are required to take the SAT beginning in the junior year. Such universal testing policies can increase access and serve the cause of equity, but they will also, at least for a while, lead to a decline in SAT scores.

Here, I offer my own back of the envelope calculation of the relationship of demographic changes with SAT scores. The College Board reports test scores and participation rates for nine racial and ethnic groups.^[x] These data are preferable to family income because a) almost all students answer the race/ethnicity question (only four percent are non-responses versus 39 percent for family income), and b) it seems a safe assumption that students are more likely to know their race or ethnicity compared to their family’s income.

The question tackled in Table 2 is this: how much would the national SAT scores have changed from 2005 to 2015 if the scores of each racial/ethnic group stayed exactly the same as in 2005, but each group’s proportion of the total population were allowed to vary? In other words, the scores are fixed at the 2005 level for each group—no change. The SAT national scores are then recalculated using the 2015 proportions that each group represented in the national population.

Table 2. SAT Scores and Demographic Changes in the SAT Population (2005-2015)

	Projected Change Based on Change in Proportions	Actual Change	Projected Change as Percentage of Actual Change
Reading	-9	-13	69%
Math	-7	-9	78%

The data suggest that two-thirds to three-quarters of the SAT score decline from 2005 to 2015 is associated with demographic changes in the test-taking population. The analysis is admittedly crude. The relationships are correlational, not causal. The race/ethnicity categories are surely serving as proxies for a bundle of other characteristics affecting SAT scores, some unobserved and others (e.g., family income, parental education, language status, class rank) that are included in the SAT questionnaire but produce data difficult to interpret.

Conclusion

Using an annual decline in SAT scores to indict high schools is bogus. The SAT should not be used to measure national achievement. SAT changes from 2014-2015 are tiny. The downward trend over the past decade represents a larger decline in SAT scores, but one that is still small in magnitude and correlated with changes in the SAT test-taking population.

In contrast to SAT scores, NAEP scores, which are designed to monitor national achievement, report slight gains for 17-year-olds over the past ten years. It is true that LTT NAEP gains are larger among students from ages nine to 13 than from ages 13 to 17, but research has uncovered several plausible explanations for why that occurs. The public should exercise great caution in accepting the findings of test score analyses. Test scores are often misinterpreted to promote political agendas, and much of the alarmist rhetoric provoked by small declines in scores is unjustified.

* In fairness to Petrilli, he acknowledges in his post, “The SATs aren’t even the best gauge—not all students take them, and those who do are hardly representative.”

[i] The 2014 SD for both SAT reading and math was 115.

[ii] A substantively trivial change may nevertheless reach statistical significance with large samples.

[iii] The 2005 SDs were 113 for reading and 115 for math.

[iv] Throughout this post, SAT’s Critical Reading (formerly, the SAT-Verbal section) is referred to as “reading.” I only examine SAT reading and math scores to allow for comparisons to NAEP. Moreover, SAT’s writing section will be dropped in 2016.

[v] The larger gains by younger vs. older students on NAEP is explored in greater detail in the 2006 Brown Center Report, pp. 10-11.

[vi] If these influences have remained stable over time, they would not affect trends in NAEP. It is hard to believe, however, that high stakes tests carry the same importance today to high school students as they did in the past.

[vii] The 2004 blue ribbon commission report on the twelfth grade NAEP reported that by 2002 participation rates had fallen to 55 percent. That compares to 76 percent at eighth grade and 80 percent at fourth grade. Participation rates refer to the originally drawn sample, before replacements are made. NAEP is conducted with two stage sampling—schools first, then students within schools—meaning that the low participation rate is a product of both depressed school (82 percent) and student (77 percent) participation. See page 8 of: http://www.nagb.org/content/nagb/assets/documents/publications/12_gr_commission_rpt.pdf

[viii] Private school data are spotty on the LTT NAEP because of problems meeting reporting standards, but analyses identical to Fordham’s can be conducted on Catholic school students for the 2008 and 2012 cohorts of 17-year-olds.

[ix] The non-response rate in 2005 was 33 percent.

[x] The nine response categories are: American Indian or Alaska Native; Asian, Asian American, or Pacific Islander; Black or African American; Mexican or Mexican American; Puerto Rican; Other Hispanic, Latino, or Latin American; White; Other; and No Response.