Measuring Up, SAT Scores, and Coding Interviews


Measuring Up: What Educational Testing Really Tells Us is a 2008 book by Professor Daniel Koretz of Harvard. Koretz, the Henry Lee Shattuck Professor of Education at Harvard, is a noted expert on educational assessment and testing policy. Professor Koretz is both an excellent writer and an excellent public speaker, as evidenced by many short videos at BigThink and YouTube.

Measuring Up is an “accessible” book that tries, and mostly succeeds, in teaching the basic concepts, both statistical and testing-specific, of educational achievement tests such as the SAT (the test formerly known as the Scholastic Aptitude Test), perhaps the most well-known and important standardized test in the United States. It does so with a minimum of mathematics and equations, relying on graphs and verbal descriptions instead. Some technical definitions such as the precise definition of the standard deviation are provided in footnotes. Measuring Up is based on a class with a similar purpose that Professor Koretz gave at Harvard, aimed at master’s degree students who need a good understanding of educational testing but lack the time and perhaps inclination to master the arcane statistics and mathematical methods used in the testing field. In many respects, most of us including parents, students, teachers, and public policy makers are in the same boat.

Professor Koretz’s views seem mostly moderate, middle-of-the-road, for lack of a better label. He supports educational testing and “accountability,” but expresses considerable frustration with over-reliance on tests and test scores and has a number of highly critical things to say about “high-stakes” testing, President Bush’s controversial No Child Left Behind (NCLB) education reform, and the “Texas Miracle” that preceded it. It should also be noted that President Obama’s Race to the Top education initiative is actually quite similar to Bush’s program, perhaps reflecting the agenda and beliefs of the financiers and businessmen who fund both political parties.

I found the book highly informative, especially the comparison of educational tests to political polls, which made me think of tests in a way I usually don’t. The key point is that a test, especially a standardized test like the SAT, is actually a tiny sample of a large domain of knowledge that the student is supposed to learn and ideally master. In the same way that a poll of a few thousand people can accurately predict the votes of more than a hundred million voters in a national election, an educational test attempts to evaluate mastery of a sometimes vast topic based on a small number, perhaps forty to eighty, of questions selected from the domain. It also means that a test can be highly misleading, in much the same way that some polls famously predicted that Thomas Dewey would defeat Harry Truman in the Presidential election in 1948.
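The poll analogy can be made concrete with a small simulation. This is only a sketch under assumptions of my own, not anything taken from Measuring Up: a hypothetical domain of 10,000 items, a student who has mastered 70 percent of it, and 1,000 different 60-question test forms drawn at random from the domain. The function name `sample_test_score` is likewise made up for illustration.

```python
import random


def sample_test_score(domain, n_questions, rng):
    """Fraction correct on a test of n_questions items drawn at random from the domain."""
    return sum(rng.sample(domain, n_questions)) / n_questions


rng = random.Random(1948)  # fixed seed so the run is repeatable

# Hypothetical numbers (assumptions, not from the book).
domain_size, true_mastery, n_questions = 10_000, 0.70, 60
n_known = round(domain_size * true_mastery)

# 1 marks an item the student knows, 0 an item the student does not know.
domain = [1] * n_known + [0] * (domain_size - n_known)

# Each entry is the student's score on a different randomly drawn test form.
scores = [sample_test_score(domain, n_questions, rng) for _ in range(1000)]

mean_score = sum(scores) / len(scores)
spread = (sum((s - mean_score) ** 2 for s in scores) / len(scores)) ** 0.5

print(f"true mastery:            {true_mastery:.2f}")
print(f"mean score over forms:   {mean_score:.3f}")
print(f"spread across forms:     {spread:.3f}")
print(f"lowest / highest score:  {min(scores):.2f} / {max(scores):.2f}")
```

On average the sampled score tracks the true mastery closely, but any single 60-question form can land several points above or below it. A student or teacher who knows which items will be sampled breaks the random-sampling assumption entirely, which is the opening for gaming a test.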

Specifically, standardized tests are susceptible to being “gamed” if the test-taker or the test-taker’s teachers, parents, or others know the specific questions or types of questions on the test, or simply cheat in some way. Professor Koretz makes a fairly convincing case that many “high-stakes” tests, such as the Texas Assessment of Academic Skills (TAAS) during George W. Bush’s tenure as governor, were gamed in some way; he calls this “score inflation.” If a test is “high-stakes,” meaning that something important such as admission to a selective college (the SAT) or funding for a school (TAAS) depends on the outcome of the test, there is a strong incentive to game the test, for example by “teaching to the test” or even outright cheating.

Although I like the book and highly recommend it, I have some serious reservations about some parts of it. What follows is a discussion of these reservations, especially the book’s treatment of the decline in SAT scores from 1963 to 1980, as well as a discussion of some of the implications of the key points in Measuring Up for the now widespread practice of coding interviews in the computer industry, an extreme example of high-stakes testing.

The Mysterious Decline in SAT Scores

From 1963 until 1980, scores on the SAT declined significantly in the United States, especially on the verbal test. The SAT at that time consisted of two tests: a verbal test and a mathematics test. Scores were reported on a scale from 200 to 800. This scale had been established in 1941 and normalized to data from 1941 so that the mean for both tests in 1941 was 500 with a standard deviation in the distribution of student scores of 100. This means that in 1941 about two-thirds of students scored between 400 and 600 on each test, and about ninety-five percent scored between 300 and 700. The SAT was designed to yield a normal distribution, the Bell Curve, for student scores.
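The two-thirds and ninety-five percent figures follow directly from the normal distribution. Here is a minimal check using Python’s standard library, assuming scores are exactly normal with mean 500 and standard deviation 100 (an idealization; real score distributions are only approximately normal and are clipped to the 200 to 800 scale). The helper `share_between` is my own name.

```python
from statistics import NormalDist

# Idealized 1941 reference scale: mean 500, standard deviation 100.
sat_1941 = NormalDist(mu=500, sigma=100)


def share_between(dist, low, high):
    """Fraction of the distribution that falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)


within_one_sd = share_between(sat_1941, 400, 600)
within_two_sd = share_between(sat_1941, 300, 700)

print(f"scoring 400-600: {within_one_sd:.1%}")  # 68.3%
print(f"scoring 300-700: {within_two_sd:.1%}")  # 95.4%
```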

By 1981 the mean verbal SAT score had declined to 424 and the mean math score to 466. Both rebounded slightly during the 1980’s. The mean verbal SAT score rose back to 428, hardly much of an improvement although statistically significant, and the mean math score to 482. In 1995 the scoring was rescaled so that the 428 became the new 500 for the verbal SAT and the 482 the new 500 for the math SAT, making historical comparisons more difficult for parents, students, and teachers with limited time to analyze the numbers. Everyone wins and everyone gets a prize 🙂
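To get a feel for the size of the decline, the 1981 means can be placed on the original 1941 scale. The sketch below again assumes an exactly normal reference distribution with mean 500 and standard deviation 100, so the percentiles are rough illustrations rather than official figures; the function name `place_on_1941_scale` is hypothetical.

```python
from statistics import NormalDist

# Idealized 1941 reference scale (an assumption: exactly normal).
reference_1941 = NormalDist(mu=500, sigma=100)


def place_on_1941_scale(mean_score):
    """Return (standard deviations from the 1941 mean, percentile on the 1941 scale)."""
    z = (mean_score - reference_1941.mean) / reference_1941.stdev
    return z, reference_1941.cdf(mean_score)


verbal_z, verbal_pct = place_on_1941_scale(424)  # 1981 mean verbal score
math_z, math_pct = place_on_1941_scale(466)      # 1981 mean math score

print(f"verbal 1981: {verbal_z:+.2f} SD, percentile {verbal_pct:.1%}")
print(f"math 1981:   {math_z:+.2f} SD, percentile {math_pct:.1%}")
```

Under that assumption, the average 1981 test-taker would have scored at roughly the 22nd percentile of the 1941 reference group on the verbal test and roughly the 37th percentile on the math test.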

This decline was not limited to the SAT, as Professor Koretz discusses clearly in his book. In fact, most educational tests, such as the National Assessment of Educational Progress (NAEP), showed similar declines. The decline was widespread. It occurred in both public and private schools and in Canada as well, at least suggesting something independent of US government policy. However, the SAT is the most well-known and probably the most important standardized educational test for Americans, and the decline in SAT scores played a central role in the ensuing controversies.

Specifically, political conservatives and education reformers, often critical of public schools and teachers’ unions, seized upon the decline in SAT scores. The decline played a central role in the Reagan administration’s A Nation at Risk report with its famous, widely quoted opening passage:

Our Nation is at risk. Our once unchallenged preeminence in commerce, industry, science, and technological innovation is being overtaken by competitors throughout the world. This report is concerned with only one of the many causes and dimensions of the problem, but it is the one that undergirds American prosperity, security, and civility. We report to the American people that while we can take justifiable pride in what our schools and colleges have historically accomplished and contributed to the United States and the well-being of its people, the educational foundations of our society are presently being eroded by a rising tide of mediocrity that threatens our very future as a Nation and a people. What was unimaginable a generation ago has begun to occur--others are matching and surpassing our educational attainments.

If an unfriendly foreign power had attempted to impose on America the mediocre educational performance that exists today, we might well have viewed it as an act of war. As it stands, we have allowed this to happen to ourselves. We have even squandered the gains in student achievement made in the wake of the Sputnik challenge. Moreover, we have dismantled essential support systems which helped make those gains possible. We have, in effect, been committing an act of unthinking, unilateral educational disarmament.

The decline in SAT and other scores from 1963 to 1980, and the relatively low scores since then when compared to 1941, is actually a very important issue in both educational testing and educational policy in the United States. It is a touchy topic and I found the discussion of it in Measuring Up a bit confusing. Early in the book, Professor Koretz writes:

The effect of compositional changes can be exacerbated when test taking is voluntary, and the decline in SAT scores was worsened by a major compositional change: a large increase in the proportion of SAT-takers drawn from historically lower-scoring groups. As college attendance became more common, the proportion of high-school graduates electing to take admissions tests rose, and many of those newly added to the rolls were lower-scoring students. This was studied in considerable detail by the College Entrance Examination Board in the 1970s, and the research showed clearly that a sizable share of the drop in SAT scores was the result of this compositional change. Had the characteristics of the test-taking group remained constant, the decline would have been much smaller.

Daniel M Koretz. MEASURING UP (Kindle Locations 912-916). Kindle Edition.

Just to be clear, the liberal/teachers’ union explanation for the decline is that the SAT in 1941 was taken primarily by rich white kids and by the 1960’s the test was being taken by non-rich, often non-white kids as well. No decline in teaching quality but rather an increase in opportunities, at least partly due to liberal reforms in the 1960’s and 1970’s. The technical term for this is “compositional changes.” Go Team Liberal! 🙂
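A toy calculation shows how a compositional change can lower the overall mean even when no group’s scores change, and how one might apportion a decline between composition and everything else. All of the numbers here are invented for illustration (two hypothetical groups, A and B, with made-up means and shares); they are not the College Board’s or Professor Koretz’s figures, and the split depends on the order in which the two effects are applied.

```python
def overall_mean(group_means, group_shares):
    """Weighted mean score for a population made up of several groups."""
    assert abs(sum(group_shares.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(group_means[g] * group_shares[g] for g in group_means)


# Hypothetical group means and population shares (assumptions for illustration).
means_before = {"A": 520.0, "B": 440.0}
shares_before = {"A": 0.9, "B": 0.1}

means_after = {"A": 490.0, "B": 410.0}  # every group drops 30 points
shares_after = {"A": 0.6, "B": 0.4}     # group B grows from 10% to 40%

before = overall_mean(means_before, shares_before)
# Change only the mix of test-takers, holding each group's mean fixed.
composition_only = overall_mean(means_before, shares_after)
after = overall_mean(means_after, shares_after)

total_decline = before - after
composition_decline = before - composition_only
composition_share = composition_decline / total_decline

print(f"overall mean before:        {before:.1f}")
print(f"after composition change:   {composition_only:.1f}")
print(f"after both changes:         {after:.1f}")
print(f"share due to composition:   {composition_share:.0%}")
```

In this made-up case the mix of test-takers alone accounts for 24 of the 54 points lost. Whether the real share was closer to three-quarters, a quarter, or 10 to 20 percent is exactly the question at issue.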

The problem is what exactly is meant by “a sizable share of the drop in SAT scores.” Later, Professor Koretz writes:

The available evidence about specific hypothesized causes of the score trends is not sufficient to evaluate all of them, but it is sufficient to rule out some of them and to estimate the size of the effects others might have had. The evidence suggests that a variety of both social and educational factors may have contributed to the trends but that no one factor can account for more than a modest share of the total. For example, by my estimate, changes in the demographic composition of the student population may have accounted for 10 or 20 percent of the decline and somewhat damped the subsequent increase in scores.

Daniel M Koretz. MEASURING UP (Kindle Locations 1400-1404). Kindle Edition.


These two passages really don’t seem consistent. Is 10 or 20 percent a “sizable share”? Most people probably mean a larger fraction when they use the term sizable share.

What is going on here? In the first passage, Professor Koretz is probably referring to a “blue ribbon” panel report produced for the College Board in 1977: On Further Examination: Report of the Advisory Panel on the Scholastic Aptitude Test Score Decline. This report actually has a lot of waffling in the fine print, but concludes:

Most--probably two-thirds to three-fourths--of the SAT score decline between 1963 and about 1970 was related to the “compositional” changes in the group of students taking this college entrance examination.

This was a period of major expansion in the number and proportion of students completing high school, resulting only in part from the post-World War II population wave, which came along then. The rest of the growth reflected the deliberate national undertaking during that period to expand and extend educational opportunity by reducing the high school drop-out rate, by trying to eliminate previous discrimination based on ethnicity or sex or family financial circumstance, and by opening college doors much wider.

From about 1970 on, the composition of the SAT-taking population has become comparatively more stabilized with respect to its economic, ethnic, and social background.

Yet the score decline continued and then accelerated; there were particularly sharp drops during the three-year period from 1972 to 1975. Only about a quarter of the decline since 1970 can be attributed to continuing change in the make-up of the test-taking group. With a handful of exceptions, the drop in scores in recent years has been virtually across the board, affecting high-scoring and lower-scoring groups alike.

Is a quarter (of the decline since 1970) a sizable fraction? In common usage, sizable fraction tends to imply at least a half.

I think the actual situation is that we don’t know beyond a reasonable doubt what caused the decline since 1941, and in fact modern SAT scores, properly scaled and adjusted for changes in the SAT tests, are on average lower than in 1941. As Professor Koretz writes later in the book, compositional effects probably contributed but something else must have happened as well. I tend to think Professor Koretz is tap-dancing around this because it has the potential to offend many parties. Compositional effects let parents, teachers, students, school administrators, politicians, the College Board, practically everybody off the hook.

Professor Koretz repeatedly makes the point that there is a strong incentive to game “high-stakes” tests such as the SAT in various ways. He cites a number of studies, including some of his own, that provide evidence that this has happened. There is a pretty good case that the rescaling of scores on the SAT in 1994/1995 is an example of this. There are certainly legitimate statistical reasons for the rescaling, but it clearly has the effect of hiding the long-term decline from cursory examination by busy parents, teachers, and students.

How Important was the Decline in SAT Scores in the Real World?

Not very. History has spoken. The Berlin Wall fell in 1989. Soviet troops pulled out of Eastern Europe, Afghanistan, and other regions. The Cold War ended, although it seems to be making a comeback lately.

A big concern in the 1980’s and early 1990’s was Japan. The menace of Japan, whose students consistently outperform US students on average in comparisons of math and other educational performance measures, appeared in popular culture, movies, best sellers such as Karel van Wolferen’s The Enigma of Japanese Power (I have a copy), and many other venues. Japan, superior test scores notwithstanding, faltered, experienced a financial crash, suffered economic stagnation, and is rarely cited as a concern today.

Competency versus Ranking

In the book, Professor Koretz exhibits a strong preference for tests used to compare individual students or groups, in fact those specifically designed to produce a normal distribution of scores, the Bell Curve, like the SAT, rather than simple pass/fail tests like a driver’s license exam that are used to evaluate competency.

Competency and ranking tests are quite different. Professor Koretz gives a good example of the difference at the start of the book. He discusses a simple vocabulary test of forty words. We can choose the vocabulary words to be common words such as bed, travel, and carpet that anyone who knows English should know. In this case, most test takers, if they are competent English speakers, will get every or nearly every question right. If someone scored less than ninety percent right on a test like this, we would rightly fail them and conclude they probably lack basic competence in English. A competency test generally won’t have a normal distribution (Bell Curve) of scores.

Professor Koretz doesn’t like tests like this. He also doesn’t like tests with obscure words like siliculose, vilipend, and epimysium that almost everyone will fail. Rather, he likes vocabulary tests with words like feckless, disparage, and minuscule that some test takers will know, others will not, and that typically produce a normal distribution, Bell Curve, of scores. He likes tests that let us compare individual students (Johnny has a larger vocabulary than Bobby) or groups (Harvard students have a larger vocabulary than Texas A&M students perhaps).
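The difference between the three kinds of vocabulary test shows up clearly in a simulation. This sketch uses a simple one-parameter logistic model of my own choosing, not anything from the book: each simulated test taker has a vocabulary “ability” drawn from a standard normal distribution, every word on a given test has the same hypothetical difficulty, and the chance of knowing a word depends on ability minus difficulty.

```python
import math
import random


def simulate_scores(item_difficulty, n_items=40, n_takers=2000, seed=0):
    """Simulate number-correct scores on a test whose items share one difficulty."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_takers):
        ability = rng.gauss(0.0, 1.0)
        # Logistic model: higher ability or lower difficulty raises the chance.
        p_correct = 1.0 / (1.0 + math.exp(-(ability - item_difficulty)))
        scores.append(sum(rng.random() < p_correct for _ in range(n_items)))
    return scores


def mean_and_sd(scores):
    """Mean and (population) standard deviation of a list of scores."""
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, sd


# Hypothetical difficulty settings for the three kinds of word list.
tests = {"common words": -4.0, "middling words": 0.0, "obscure words": 4.0}
results = {label: mean_and_sd(simulate_scores(d)) for label, d in tests.items()}

for label, (mean, sd) in results.items():
    print(f"{label:15s} mean {mean:5.1f} of 40, spread {sd:4.1f}")
```

The test of common words bunches nearly everyone close to a perfect score and the test of obscure words bunches nearly everyone close to zero, so neither can rank people. Only the middling test spreads scores out. The common-word test is the one that answers the competency question, however.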

The problem with this emphasis on ranking and comparison is that the main goal of education is not to identify the best students or groups of students. In mathematics, most students need to learn to balance their checkbook, evaluate prices in a store, formulate and track a personal or family budget, evaluate confusing statistics about medical products or school performance for their kids. 🙂 For these everyday activities most people, who are not professional mathematicians or something similar, need to be competent, and they need tests that tell them and their teachers whether they are competent, not whether they are the best or better or worse than other people. This is what a driver’s exam is for. Most people would be nonplussed to receive an SAT-like score of 429 on a driver’s exam; what does that mean? We want to know whether someone can safely drive a car. Not only is the score reported differently (pass/fail) but the test is designed differently.

Competency is quite important. For example, if American farmers were incompetent, the United States would starve. If, on the other hand, American farmers are competent but perhaps on average not quite as good as farmers in Japan, well, that is not really a big problem.

If American scientists and engineers were incompetent, indeed the nation would have been at risk in the 1980’s. But if American scientists and engineers were on average somewhat less good than Japanese scientists or engineers or than American scientists and engineers in 1941, that is not ideal but not really a big problem. In both cases, this inference about the relative quality of scientists and engineers is a big jump from differences in the test scores of K-12 students.

People from hyper-competitive environments like Harvard or Microsoft tend to confuse competency assessment and ranking. I mention Microsoft because of Bill Gates’s extensive activities in education and educational testing.

Case in point:

The Clueless CEO

Professor Koretz expresses exasperation with the attitude toward testing of some government officials and CEOs involved in education reform:

Early in his first term as president, George W. Bush, one of whose signature programs, No Child Left Behind, is built around testing, declared, “A reading comprehension test is a reading comprehension test. And a math test in the fourth grade--there’s not many ways you can foul up a test … It’s pretty easy to ‘norm’ the results.” Whatever one thinks of No Child Left Behind--and there are good arguments both for and against various aspects of it--this claim was entirely wrong: it is all too easy to foul up the design of a test, and it is even easy to foul up in interpreting test scores.

And Bush is hardly alone in this mistaken view. A few years ago, a representative of a prominent business group addressed a meeting of the Board on Testing and Assessment of the National Research Council, of which I was then a member. She complained that her bosses--some of the most prominent CEOs in America engaged in education reform--were exasperated because we in the measurement profession kept giving them far more complicated answers than they wanted. I responded that we gave them complex answers because the answers are in fact complex. One of her bosses had been the CEO of a computer company in which I then owned some stock, and I pointed out that my retirement savings would have taken a beating if that particular CEO had been foolish enough to demand only simple answers when his staff confronted him with problems of chip architecture or software design. She did not seem persuaded.

Daniel M Koretz. MEASURING UP (Kindle Locations 60-69). Kindle Edition.

Actually, from my own experience, it is quite conceivable that the CEO was foolish enough to demand only simple answers when his staff confronted him with problems of chip architecture or software design. 🙂 In defense of this statement, it should be noted that successful high tech companies actually have a high failure rate. Many don’t stay successful that long. Witness, for example, BlackBerry, once the king of the smartphone market.

The computer industry is the land of the sixty-second elevator pitch. High tech startup companies are routinely expected to present their “pitch” in a ten-minute, ten-slide PowerPoint “slide deck.” If you can’t make your point convincingly in the forty-character subject line of an e-mail read on an iPhone, you are not executive material.

In defense of the CEO, running even a small high tech business is a ton of work. The CEO must deal with hundreds of issues simultaneously: sales, marketing, finance, interpersonal squabbles, numerous technical issues and so on. The CEO is usually the public face of the company and must spend an enormous amount of time on the road meeting investors and potential investors, major customers and potential customers, attending key trade shows and conferences, and handling many other public relations activities. Most good CEOs try to limit the amount of work they have to deal with directly and to select subordinates who can either deal with the issues or simplify them to PowerPoint bullets so the CEO can make a simple, quick decision, one of hundreds or even thousands the CEO must make. A simple, reliable metric like a standardized test score can be a godsend to a harried executive.

Unfortunately, as Professor Koretz notes, some issues just can’t be simplified to a single metric or a few bullet points on a PowerPoint slide. They are inherently complex. The seductive appeal of a simple but wrong answer remains.

Coding Interviews

CEOs of computer companies don’t just promote high-stakes testing for school teachers and students. They are eating their own dog food. The current fad in job interviews in the computer industry is the “coding interview.” Coding interviews come in two main flavors. One is a grueling multiple-hour interview answering algorithm and coding questions drawn from the subject area of the job. The other flavor is a grueling multiple-hour interview answering algorithm and coding questions drawn from algorithms courses and books in the computer science (CS) curriculum that almost no one actually uses in the real world. Google, which has a reputation for hiring most employees right out of college, is noted for this latter type of coding interview.

In software design orthodoxy, software engineers are supposed to reuse code, not reinvent the wheel. Most of the algorithms taught in CS classes at colleges and universities were figured out long ago and are incorporated in widely available libraries and programming languages. Consequently, practicing software engineers rarely need to know how to implement these algorithms. In fact, software engineers developing or implementing cutting-edge algorithms usually spend their time working on algorithms not taught in school. Surprise, surprise.
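As a small illustration of reuse, here are three staples of the algorithms course (sorting, binary search, and heap-based selection) done with Python’s standard library rather than implemented by hand. The list of numbers is just sample data borrowed from the SAT means quoted earlier.

```python
import bisect
import heapq

# Sample data: SAT means mentioned earlier in this article.
scores = [482, 424, 500, 466, 428]

# Sorting: the built-in sorted() replaces a hand-written sort.
ordered = sorted(scores)

# Binary search: bisect_left finds the insertion point in a sorted list.
position = bisect.bisect_left(ordered, 466)

# Selection with a heap: nlargest returns the top items, largest first.
top_two = heapq.nlargest(2, scores)

print(ordered)   # [424, 428, 466, 482, 500]
print(position)  # 2
print(top_two)   # [500, 482]
```

A working programmer needs to know that these tools exist and roughly what they cost, which is a different skill from reproducing their internals on a whiteboard.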

A substantial proportion of actual software engineers are not formally trained in computer science. For example, many have degrees in other STEM (Science, Technology, Engineering, and Math) fields. A fair number are self-taught or are college, even high school, dropouts, including such noted figures as Bill Gates of Microsoft and Jan Koum of WhatsApp. Members of under-represented minority groups are especially likely to be self-taught and/or dropouts. Even practicing software engineers with formal CS training tend to forget the school algorithm courses that they rarely or never use. Thus these coding interviews tend to act like Koretz’s vocabulary test populated with words like siliculose, vilipend, and epimysium (not that bad, but the point remains valid).

With respect to Koretz’s “score inflation,” a coding interview is clearly a high-stakes test, with a job, often a high-paying desirable job, at stake. Not surprisingly, there are extensive efforts to “game” the coding interviews. There are dozens of books on how to ace a coding interview at Amazon, including the market leader, Gayle Laakmann McDowell’s Cracking the Coding Interview: 150 Programming Questions and Solutions (Fifth Edition). The author, a former engineer at Google and other big-name tech companies, has her own company, CareerCup, with several books, videos and other resources for interviewing for jobs at high tech companies. There is even a popular meetup group in Silicon Valley for practicing coding interviews.

The college-level CS algorithm coding interviews also exhibit the confusion between competency and ranking common in the highly competitive computer industry. The main reason for testing basic skills taught in school is to verify basic competency. As it happens, in this case, these basic skills taught in school are not necessary to program (and program well) in the real world, since real-world programmers reuse implementations of the algorithms in common libraries and programming languages. In practice, the companies compare candidates based on how well they do on these unrepresentative tests, using them for ranking rather than competence evaluation. This is highly unlikely to produce desirable results unless the company limits itself to recent computer science graduates, and maybe not even then. The basic CS algorithm coding interviews are neither good competency tests nor good ranking tests.


Read Measuring Up but take the book with a grain of salt. So too, use standardized tests but invest in careful design of the tests and exercise caution in using the results of the tests, combining them with other information and criteria as Professor Koretz recommends.

© 2015 John F. McGowan

About the Author

John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing gesture recognition for touch devices, video compression and speech recognition technologies. He has extensive experience developing software in C, C++, MATLAB, Python, Visual Basic and many other programming languages. He has been a Visiting Scholar at HP Labs developing computer vision algorithms and software for mobile devices. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at [email protected].