Computers to Grade Written Answers on Student Exams in Texas


Students taking the State of Texas Assessment of Academic Readiness exams this week will have their written answers graded automatically by computers, under a new evaluation method being implemented by the Texas Education Agency, according to The Texas Tribune.

The “automated scoring engine,” which reportedly uses natural language processing, the same type of technology that underpins artificial intelligence chatbots, will be used to evaluate students’ answers to open-ended questions in reading, writing, science, and social studies.

The switch to computer grading will save the state agency approximately $15 million to $20 million per year that would otherwise have been spent hiring human graders, according to the Tribune.

Following a redesign of the STAAR test last year, there are now six to seven times more open-ended questions, known as constructed-response items, than multiple-choice questions. The exam measures students’ comprehension of the state-mandated core curriculum.

“We wanted to keep as many constructed open-ended responses as we can, but they take an incredible amount of time to score,” Jose Rios, director of student assessment at the Texas Education Agency, told the Tribune.

As students sit for their exams this spring, the computer will grade all of the constructed-response items; human scorers will then regrade a quarter of the responses.

If the computer has “low confidence” in the score it assigned, or if it encounters a type of response that its programming does not recognize, like slang or non-English words, then those responses will automatically be reassigned to a human grader.
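Taken together, the process described amounts to a simple routing rule: score every response automatically, send a fixed share to human regrading regardless, and escalate anything the engine scores with low confidence or does not recognize. The Python sketch below illustrates that logic only; every name, threshold, and heuristic here is an illustrative assumption, not a detail of TEA's actual system.

```python
# Hypothetical sketch of the hybrid grading workflow described above.
# All names, thresholds, and heuristics are illustrative assumptions,
# not details of TEA's actual scoring engine.
import random
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.80   # assumed cutoff for "low confidence"
HUMAN_AUDIT_RATE = 0.25       # a quarter of responses get a human regrade

@dataclass
class ScoredResponse:
    text: str
    score: int | None = None
    needs_human: bool = False

def engine_score(text: str) -> tuple[int, float]:
    """Stand-in for the automated engine: returns (score, confidence)."""
    # A real engine would use NLP features; this just fakes a result.
    return min(len(text) // 50, 4), random.random()

def is_unrecognized(text: str) -> bool:
    """Crude stand-in for input the engine wasn't trained on, e.g.
    non-English characters (detecting slang would be far harder)."""
    return not text.isascii()

def route(response: ScoredResponse) -> ScoredResponse:
    if is_unrecognized(response.text):
        response.needs_human = True      # unfamiliar input goes to a person
        return response
    score, confidence = engine_score(response.text)
    response.score = score
    # Low-confidence scores, plus a random audit sample, go to human graders.
    if confidence < CONFIDENCE_THRESHOLD or random.random() < HUMAN_AUDIT_RATE:
        response.needs_human = True
    return response

if __name__ == "__main__":
    print(route(ScoredResponse("The author argues that the evidence supports ...")))
```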

“We have always had very robust quality control processes with humans,” Chris Rozunick, division director for assessment development at the Texas Education Agency, told the Tribune.

The quality control for a computer system looks similar, he said.

Rozunick and other testing administrators will reportedly review a summary of results each day to ensure they match expectations, and will also review random samples of responses to confirm the accuracy of the computer’s work.
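Those checks resemble routine statistical monitoring: compare each day's score distribution against what is expected, and pull a random sample for human audit. A minimal sketch of that idea, with all thresholds and function names assumed for illustration rather than drawn from TEA's process:

```python
# Illustrative sketch of the daily quality-control checks described above;
# the statistics and thresholds are assumptions, not TEA's actual process.
import random
import statistics

def daily_summary_check(scores: list[int], expected_mean: float,
                        tolerance: float = 0.25) -> bool:
    """Flag the day's results if the average score drifts from expectations."""
    return abs(statistics.mean(scores) - expected_mean) <= tolerance

def random_sample_for_audit(responses: list[str], k: int = 100) -> list[str]:
    """Pull a random sample of responses for human review of the engine's work."""
    return random.sample(responses, min(k, len(responses)))
```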

While TEA officials have stressed the scoring engine will be overseen by humans and is not able to “learn” from one response to the next, the plan has sparked concern among educators and parents still coming to grips with machine learning and artificial intelligence.

Some educators told the Tribune they were surprised by TEA’s decision to score responses by computer.

“There ought to be some consensus about, Hey, this is a good thing, or not a good thing, a fair thing or not a fair thing,” Kevin Brown, executive director for the Texas Association of School Administrators and a former superintendent, said.

The STAAR tests are taken each year by students from third grade through high school, and the results are a key metric in the accountability system TEA uses to grade school districts and individual school campuses, according to the Tribune.

The Texas education commissioner is permitted by state law to intervene when campuses within a district are underperforming on the test, with actions ranging from suspending an elected school board and appointing a board of managers in its place to closing a school.

With so much on the line, there is some skepticism as to whether a computer can grade written responses as accurately as a human can.

“There’s always this sort of feeling that everything happens to students and to schools and to teachers and not for them or with them,” Carrie Griffith, policy specialist for the Texas State Teachers Association, told the Tribune.

Even if the automated scoring engine works exactly as designed, “it’s not something parents or teachers are going to trust,” Griffith, a former teacher, said.

© 2024 Newsmax. All rights reserved.