
PULLMAN, Wash. — A new study led by a Washington State University professor found that ChatGPT can be both inaccurate and inconsistent when asked to judge whether scientific hypotheses are true or false.
Associate professor Mesut Cicek and his colleagues tested ChatGPT by feeding it hypotheses from scientific papers and asking whether the underlying research had upheld each statement.
The team used more than 700 hypotheses and repeated each query 10 times.
The AI answered correctly 76.5% of the time when the experiment was run in 2024, and 80% when the test was repeated in 2025. But a true-or-false question can be answered correctly half the time by guessing alone, so the researchers rescaled the scores against that 50% baseline. Adjusted this way, performance works out to roughly 53% in 2024 and 60% in 2025, results the researchers described as closer to a low D than to high reliability.
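The article doesn't spell out the adjustment, but the standard chance-correction for a two-option task reproduces those figures; as an illustrative sketch:

```latex
% Chance-corrected score for a true/false task (illustrative reconstruction,
% not quoted from the study): adjusted = (observed - chance) / (1 - chance),
% with chance = 0.5 for a two-option question.
\[
  \frac{0.765 - 0.5}{1 - 0.5} = 0.53,
  \qquad
  \frac{0.80 - 0.5}{1 - 0.5} = 0.60.
\]
```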
The study found ChatGPT struggled most when the correct answer was false, identifying false hypotheses correctly just 16.4% of the time.
The researchers also found inconsistency across repeated prompts: when the same statement was posed 10 times, ChatGPT gave consistent, accurate answers for only 73% of the hypotheses.
“We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” Cicek said.
“We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it’s false. It’s true, it’s false, false, true. There were several cases where there were five true, five false,” he said.
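The paper's prompting code isn't reproduced in this article, but a minimal sketch of the repeated-prompt protocol Cicek describes might look like the following. It assumes the OpenAI Python client; the model name, prompt wording, and answer parsing are all illustrative, not taken from the study.

```python
# Minimal sketch of the repeated-prompt consistency check described above.
# Assumptions (not from the study): the OpenAI Python client, the model name,
# the prompt wording, and the simple TRUE/FALSE parsing are all illustrative.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_hypothesis(statement: str, n_repeats: int = 10) -> Counter:
    """Ask the model the same true/false question n_repeats times and tally answers."""
    tally = Counter()
    for _ in range(n_repeats):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumption; the study used the free ChatGPT interface
            messages=[{
                "role": "user",
                "content": (
                    "Did published research support the following hypothesis? "
                    "Answer with a single word, TRUE or FALSE.\n\n" + statement
                ),
            }],
        )
        answer = response.choices[0].message.content.strip().upper()
        tally["TRUE" if answer.startswith("TRUE") else "FALSE"] += 1
    return tally


# A split like Counter({'TRUE': 5, 'FALSE': 5}) would show the inconsistency
# Cicek describes, since all 10 prompts were identical.
print(judge_hypothesis("Higher ad spend increases brand recall."))
```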
The findings were published in the Rutgers Business Review. The researchers said the results reinforce the need for skepticism and caution when using AI for critical tasks, especially those involving nuance or complicated reasoning. They also said the results suggest that AI’s ability to produce fluent language is not matched by the ability to reason through complex questions, and that the arrival of artificial general intelligence that can truly “think” may be farther off than some predict.
“Current AI tools don’t understand the world the way we do — they don’t have a ‘brain,’” Cicek said. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about.”
The researchers used 719 hypotheses from scientific papers published in business journals since 2021. They ran the experiment with the free version of ChatGPT-3.5 in 2024, then repeated it with the free, updated ChatGPT-5 mini in 2025. The study said overall accuracy remained similar between the two versions.
The researchers concluded that business managers should verify AI results, treat them with skepticism, and provide training on what AI can and can’t do well. Cicek said he has run similar tests with other AI tools and found comparable results.
“Always be skeptical,” Cicek said. “I’m not against AI. I’m using it. But you need to be very careful.”