Non-Tech : Binary Hodgepodge

To: 10K a day who wrote (109)
Date: 5/20/2001 10:49:47 AM
From: Jon Koplik
 
(Long) NYT article on standardized testing scoring services (and mistakes made).

May 20, 2001

NONE OF THE ABOVE / First of two articles

Right Answer, Wrong Score: Test Flaws Take Toll

By DIANA B. HENRIQUES and JACQUES STEINBERG

One day last May, a few weeks before
commencement, Jake Plumley was pulled
out of the classroom at Harding High
School in St. Paul and told to report to his
guidance counselor.

The counselor closed the door and asked him to
sit down. The news was grim. Jake, a senior,
had failed a standardized test required for
graduation. To try to salvage his diploma, he had
to give up a promising job and go to summer
school. "It changed my whole life, that test,"
Jake recalled.

In fact, Jake should have been elated. He
actually had passed the test. But the company
that scored it had made an error, giving Jake and
47,000 other Minnesota students lower scores
than they deserved.

An error like this — made by NCS Pearson, the
nation's biggest test scorer — is every testing
company's worst nightmare. One executive
called it "the equivalent of a plane crash for us."

But it was not an isolated incident. The testing
industry is coming off its three most
problem-plagued years. Its missteps have
affected millions of students who took
standardized proficiency tests in at least 20
states.

An examination of recent mistakes and
interviews with more than 120 people involved
in the testing process suggest that the industry
cannot guarantee the kind of error-free,
high-speed testing that parents, educators and
politicians seem to take for granted.

Now President Bush is proposing a 50 percent
increase in the workload of this tiny industry —
a handful of giants with a few small rivals. The
House could vote on the Bush plan this week,
and if Congress signs off, every child in grades
3 to 8 will be tested each year in reading and
math. Neither the Bush proposal nor the
Congressional debate has addressed whether the
industry can handle the daunting logistics of this
additional business.

Already, a growing number of states use these
so-called high-stakes exams — not to be
confused with the SAT, the college entrance
exam — to determine whether students in
grades 3 to 12 can be promoted or granted a
diploma. The tests are also used to evaluate
teachers and principals and to decide how much
tax money school districts receive. How well
schools perform on these tests can even affect
property values in surrounding neighborhoods.

Each recent flaw had its own tortured history.
But all occurred as the testing industry was
struggling to meet demands from states to test
more students, with custom-tailored tests of
greater complexity, designed and scored faster
than ever.

In recent years, the four testing companies that
dominate the market have experienced serious
breakdowns in quality control. Problems at
NCS, for example, extend beyond Minnesota. In
the last three years, the company produced a
flawed answer key that incorrectly lowered
multiple-choice scores for 12,000 Arizona
students, erred in adding up scores of essay
tests for students in Michigan and was forced
with another company to rescore 204,000 essay
tests in Washington because the state found the
scores too generous. NCS also missed important
deadlines for delivering test results in Florida and
California.

"I wanted to just throw them out and hire a new
company," said Christine Jax, Minnesota's top
education official. "But then my testing director
warned me that there isn't a blemish-free testing company out there. That really shocked
me."

One error by another big company resulted in nearly 9,000 students in New York City
being mistakenly assigned to summer school in 1999. In Kentucky, a mistake in 1997 by
a smaller company, Measured Progress of Dover, N.H., denied $2 million in achievement
awards to deserving schools. In California, test booklets have been delivered to schools
too late for the scheduled test, were left out in the rain or arrived with missing pages.

Many industry executives attribute these errors to growing pains.

The boom in high-stakes tests "caught us somewhat by surprise," said Eugene T. Paslov,
president of Harcourt Educational Measurement, one of the largest testing companies.
"We've turned around, and responded to these issues, and made some dramatic
improvements."

Despite the recent mistakes, the industry says, its error rate is infinitesimal on the
millions of multiple-choice tests scored by machine annually. But that is only part of the
picture. Today's tests rely more heavily on essay-style questions, which are more
difficult to score. The number of multiple-choice answer sheets scored by NCS more
than doubled from 1997 to 2000, but the number of essay-style questions more than
quadrupled in that period, to 84.4 million from 20 million.

Even so, testing companies turn the scoring of these writing samples over to thousands
of temporary workers earning as little as $9 an hour.

Several scorers, speaking publicly for the first time about problems they saw,
complained in interviews that they were pressed to score student essays without
adequate training and that they saw tests scored in an arbitrary and inconsistent manner.

"Lots of people don't even read the whole test — the time pressure and scoring pressure
are just too great," said Artur Golczewski, a doctoral candidate, who said he has scored
tests for NCS for two years, most recently in April.

NCS executives dispute his comments, saying that the company provides careful,
accurate scoring of essay questions and that scorers are carefully supervised.

Because these tests are subject to error and subjective scoring, the testing industry's
code of conduct specifies that they not be the basis for life-altering decisions about
students. Yet many states continue to use them for that purpose, and the industry has
done little to stop it.

When a serious mistake does occur, school districts rarely have the expertise to find it,
putting them at the mercy of testing companies that may not be eager to disclose their
failings. The surge in school testing in the last five years has left some companies
struggling to find people to score tests and specialists to design them.

"They are stretched too thin," said Terry Bergeson, Washington State's top education
official. "The politicians of this country have made education everybody's top priority,
and everybody thinks testing is the answer for everything."

The Mistake: When 6 Wrongs Were Rights

The scoring mistake that plagued Jake Plumley and his Minnesota classmates is a
window into the way even glaring errors can escape detection. In fact, NCS did not
catch the error. A parent did.

Martin Swaden, a lawyer who lives in Mendota Heights, Minn., was concerned when his
daughter, Sydney, failed the state's basic math test last spring. A sophomore with
average grades, Sydney found math difficult and had failed the test before.

This time, Sydney failed by a single answer. Mr. Swaden wanted to know why, so he
asked the state to see Sydney's test papers. "Then I could say, `Syd, we gotta study
maps and graphs,' or whatever," he explained.

But curiosity turned to anger when state education officials sent him boilerplate e-mail
messages denying his request. After threatening a lawsuit, Mr. Swaden was finally given
an appointment. On July 21, he was ushered into a conference room at the department's
headquarters, where he and a state employee sat down to review the 68 questions on
Sydney's test.

When they reached Question No. 41, Mr. Swaden immediately knew that his daughter's
"wrong" answer was right.

The question showed a split-rail fence, and asked which parts of it were parallel. Sydney
had correctly chosen two horizontal rails; the answer key picked one horizontal rail and
one upright post.

"By the time we found the second scoring mistake, I knew she had passed," Mr. Swaden
said. "By the third, I was concerned about just how bad this was."

After including questions that were being field-tested for future use, someone at NCS
had failed to adjust the answer key, resulting in 6 wrong answers out of 68 questions.
Even worse, two quality control checks that would have caught the errors were never
done.

Eric Rud, an honor-roll student except in math, was one of those students mislabeled as
having failed. Paralyzed in both legs at birth, Eric had achieved a fairly normal school life,
playing wheelchair hockey and dreaming of becoming an architect. But when he was told
he had failed, his spirits plummeted, his father, Rick Rud, said.

Kristle Glau, who moved to Minnesota in her senior year, did not give up on high school
when she became pregnant. She persevered, and assumed she would graduate because
she was confident she had passed the April test, as, in fact, she had.

"I had a graduation party, with lots of presents," she recalled angrily. "I had my cap and
gown. My invitations were out." Finally, she said, her mother learned what her teachers
did not have the heart to tell her; according to NCS, she had failed the test and would not
graduate.

When the news of NCS's blunder reached Ms. Jax, the state schools commissioner, she
wept. "I could not believe," she said, "how we could betray children that way."

But when she learned that the error would have been caught if NCS had done the quality
control checks it had promised in its bid, she was furious. She summoned the chief
executive of NCS, David W. Smith, to a news conference and publicly blamed the
company for the mistake.

Mr. Smith made no excuses. "We messed up," he said. "We are extremely sorry this
happened." NCS has offered a $1,000 tuition voucher to the seniors affected, and is
covering the state's expenses for retesting. It also paid for a belated graduation ceremony
at the State Capitol.

Jake Plumley and several other students are suing NCS on behalf of Minnesota teenagers
who they say were emotionally injured by NCS's mistake. NCS has argued that its
liability does not extend to emotional damages.

The court cases reflect a view that is common among parents and even among some
education officials: that standardized testing should be, and can be, foolproof.

The Task: Trying to Grade 300 Million Test Sheets

The mistake that derailed Jake Plumley's graduation plans occurred in a bland building in
a field just outside Iowa City. From the driveway on North Dodge Street, the structure
looks like an overgrown suite of medical offices with a small warehouse in the back.

Casually dressed workers, most of them hired for the spring testing season, gather
outside a loading dock to smoke, or wander out for lunch at Arby's.

This is ground zero for the testing industry, NCS's Measurement Services unit. More of
the nation's standardized tests are scored here than anywhere else. Last year, nearly 300
million answer sheets coursed through this building, the vast majority without mishap. At
this facility and at other smaller ones around the country, NCS scores a big chunk of the
exams from other companies. What the company does in this building affects not only
countless students, but the reputation of the entire industry.

Inside, machines make the soft sound of shuffling cards as they scan in student answers
to multiple-choice questions. Handwritten answers are also scanned in, to be scored later
by workers.

But behind the soft whirring and methodical procedures is an often frenzied rush to meet
deadlines, a rush that left many people at the company feeling overwhelmed, current and
former employees said.

"There was a lack of personnel, a lack of time, too many projects, too few people,"
sighed Nina Metzner, an education assessment consultant who worked at NCS. "People
were spread very, very thin."

Those concerns were echoed by other current and former NCS employees, several of
whom said those pressures had played a role in the Minnesota error and other problems
at the company.

Mr. Smith, the NCS chief executive, disputed those reports. The company has sustained
a high level of accuracy, he said, by matching its staffing to the volume of its business.
The Minnesota mistake, he said, was not caused by the pressures of a heavy workload
but by "pure human error caused by individuals who had the necessary time to perform a
quality function they did not perform."

Betsy Hickok, a former NCS scoring director, said she had worked hard to ensure the
accurate scoring of essays. But that became more difficult, she said, as she and her
scorers were pressed into working 12-hour days, six days a week.

"I became concerned," Ms. Hickok said, "about my ability, and the ability of the scorers,
to continue making sound decisions and keeping the best interest of the student in mind."

Mr. Smith said NCS was "committed to scoring every test accurately."

The Workers: Some Questions About Training

The pressures reported by NCS executives are affecting the temporary workers who
score the essay questions in vogue today, said Mariah Steele, a former NCS scorer and a
graduate student in Iowa City.

In today's tight labor markets, Ms. Steele is the testing industry's dream recruit. She is
college-educated but does not have a full-time job; she lives near a major test-scoring
center and is willing to work for $9 an hour.

For her first two evenings, she and nearly 100 other recruits were trained to score math
tests from Washington State. This training is critical, scoring specialists say, to make
sure that scorers consistently apply a state's specific standards, rather than their own.

But one evening in late July, as the Washington project was ending, Ms. Steele said, she
was asked by her supervisor to stop grading math and switch to a reading test from
another state, without any training.

"He just handed me a scoring rubric and said, `Start scoring,' " Ms. Steele said. Perhaps a
dozen of her co-workers were given similar instructions, she added, and were offered
overtime as an inducement.

Baffled, Ms. Steele said she read through the scoring guide and scored tests for about 30
minutes. "Then I left, and didn't go back," she said. "I really was not confident in my
ability to score that test."

Two other former scorers for NCS say they saw inconsistent grading.

Renée Brochu of Iowa City recalled when a supervisor explained that a certain response
should be scored as a 2 on a two-point scale. "And someone would gasp and say, `Oh,
no, I've scored hundreds of those as a 1,' " Ms. Brochu said. "There was never the
suggestion that we go back and change the ones already scored."

Another former scorer, Mr. Golczewski, accused supervisors of trying to manipulate
results to match expectations. "One day you see an essay that is a 3, and the next day
those are to be 2's because they say we need more 2's," he said.

He recalled that the pressure to produce worsened as deadlines neared. "We are actually
told," he said, "to stop getting too involved or thinking too long about the score — to just
score it on our first impressions."

Mr. Smith of NCS dismissed these anecdotes as aberrations that were probably caught
by supervisors before they affected scores.

"Mistakes will occur," he said. "We do everything possible to eliminate those mistakes
before they affect an individual test taker."

New York City did not use NCS to score its essay-style tests; instead, like a few other
states, it used local teachers. But like the scorers in Iowa, they also complained that they
had not been adequately trained.

One reading teacher said she was assigned to score eighth-grade math tests. "I said I
hadn't been in eighth-grade math class since I was in eighth grade," she said.

Another teacher, she said, arrived late at the scoring session and was put right to work
without any training.

Roseanne DeFabio, assistant education commissioner in New York State, said she
thought the complaints were exaggerated. State audits each year of 10 percent of the
tests do not show any major problems, she said, "so I think it's unlikely that there's any
systemic problem with the scoring."

The Demand: States Pushing For More, Faster

Testing specialists argue that educators and politicians must share the blame for the rash
of testing errors because they are asking too much of the industry.

They say schools want to test as late in the year as possible to maximize student
performance, while using tests that take longer to score. Yet schools want the results
before the school year ends so they can decide about school financing, teacher
evaluations, summer school, promotions or graduation.

"The demands may just be impossible," said Edward D. Roeber, a former education
official who is now vice president for external affairs for Measured Progress.

Case in point: California. On Oct. 9, 1997, Gov. Pete Wilson signed into law a bill that
gave state education officials five weeks to choose and adopt a statewide achievement
test, called the Standardized Testing and Reporting program.

The law's "unrealistic" deadlines, state auditors said later, contributed to the numerous
quality control problems that plagued the test contractor, Harcourt Educational
Measurement, for the next two years.

That state audit, and an audit done for Harcourt by Deloitte & Touche, paint a
devastating portrait of what went wrong. There was not time to test the computer link
between Harcourt, the test contractor, and NCS, the subcontractor. When needed, it did
not work, causing delays. Some test materials were delivered so late that students could
not take the tests on schedule.

It got worse. Pages in test booklets were duplicated, missing or out of order. One
district's test booklets, more than two tons of paper, were dumped on the sidewalk
outside the district offices at 5 p.m. on a Friday — in the rain. Test administrators were
not adequately trained. When school districts got the computer disks from NCS that
were supposed to contain the test results, some of the data was inaccurate and some of
the disks were blank.

In 1998, nearly 700 of the state's 8,500 schools got inaccurate test results, and more
than 750,000 students were not included in the statewide analysis of the test results.

Then, in 1999, Harcourt made a mistake entering demographic data into its computer.
The resulting scores made it appear that students with a limited command of English
were performing better in English than they actually were, a politically charged statistic in
a state that had voted a year earlier to eliminate bilingual education in favor of a one-year
intensive class in English.

"There's tremendous political pressure to get tests in place faster than is prudent," said
Maureen G. DiMarco, a vice president at Houghton Mifflin, whose subsidiary, the
Riverside Publishing Company, was one of the unsuccessful bidders for California's
business.

Dr. Paslov, who became president of Harcourt Educational Measurement after the 1999
problems, said that the current testing season in California is going smoothly and that
Harcourt has addressed concerns about errors and delays.

But California is still sprinting ahead.

In 1999, Gov. Gray Davis signed a bill directing state education officials to develop
another statewide test, the California High School Exit Exam. Once again, industry
executives said, speed seemed to trump all other considerations.

None of the major testing companies bid on the project because of what Ms. DiMarco
called "impossible, unrealistic time lines."

With no bidders, the state asked the companies to draft their own proposals. "We had
just 10 days to put it together," recalled George W. Bohrnstedt, senior vice president for
research at the American Institutes for Research, which has done noneducational testing
but is new to school testing.

Phil Spears, the state testing director, said A.I.R. faced a "monumental task, building and
administering a test in 18 months."

"Most states," Mr. Spears said, "would take three-plus years to do that kind of test."

The new test was given for the first time this spring.

The Concern: Life Choices Based on a Score

States are not just demanding more speed; they are demanding more complicated exams.
Test companies once had a steady business selling the same brand-name tests, like
Harcourt's Stanford Achievement Test or Riverside's Iowa Test of Basic Skills, to school
districts. These "shelf" tests, also called norm-referenced tests, are the testing equivalent
of ready-to-wear clothing. Graded on a bell curve, they measure how a student is
performing compared with other students taking the same tests.

But increasingly, states want custom tailoring, tests designed to fit their homegrown
educational standards. These "criterion referenced" tests measure students against a fixed
yardstick, not against each other.

That is exactly what Arizona wanted when it hired NCS and CTB/McGraw-Hill in
December 1998. What it got was more than two years of errors, delays, escalating costs
and angry disappointment on all sides.

Some of the problems Arizona encountered occurred because the state had established
standards that, officials later conceded, were too rigorous. But the state blames other
disruptions on NCS.

"You can't trust the quality assurance going on now," said Kelly Powell, the Arizona
testing director, who is still wrangling with NCS.

For its part, NCS has thrown up its hands on Arizona. "We've given Arizona nearly $2 of
service for every dollar they have paid us," said Jeffrey W. Taylor, a senior vice
president of NCS. Mr. Taylor said NCS would not bid on future business in that state.

Each customized test a state orders must be designed, written, edited, reviewed by state
educators, field-tested, checked for validity and bias, and calibrated to previous tests —
an arduous process that requires a battery of people trained in educational statistics and
psychometrics, the science of measuring mental function.

While the demand for such people is exploding, they are in extremely short supply
despite salaries that can reach into the six figures, people in the industry said. "All of us
in the business are very concerned about capacity," Mr. Bohrnstedt of A.I.R. said.

And academia will be little help, at least for a while, because promising candidates are
going into other, more lucrative areas of statistics and computer programming, testing
executives say.

Kurt Landgraf, president of the Educational Testing Service in Princeton, N.J., the titan
of college admission tests but a newcomer to high-stakes state testing, estimated that
there are about 20 good people coming into the field every year.

Already, the strain on the test-design process is showing. A supplemental math test that
Harcourt developed for California in 1999 proved statistically unreliable, in part because it
was too short. Harcourt had been urged to add five questions to the test, state auditors
said, but that was never done.

Even more troubling, most test professionals say, is the willingness of states like Arizona
to use standardized tests in ways that violate the testing industry's professional standards.
For example, many states use test scores for determining whether students graduate. Yet
the American Educational Research Association, the nation's largest educational research
group, specifically warns educators against making high-stakes decisions based on a
single test.

Among the reasons for this position, testing professionals say, is that some students are
emotionally overcome by the pressure of taking standardized tests. And a test score, "like
any other source of information about a student, is subject to error," noted the National
Research Council in a comprehensive study of high-stakes testing in 1999.

But industry executives insist that, while they

(see next post ...)