- OpenAI researchers recently found that AI can cheat, and it clearly states its intent to do so onscreen through chain of thought (CoT).
- Using another AI to monitor the suspicious one—and call out its devious intentions until it came to an honest solution—only went so far.
- AI can actually get sneaker as it sees what the monitor catches and cover up its intentions while doing whatever it wants.
AI can write our emails, order things online, solve mind-melting math equations, and do just about everything short of our laundry at this point. But even a digital brain programmed to carry out the actions you want may still cheat behind your back.
While AI may not be sentient, it apparently cheats when it finds the opportunity. As a team of researchers at OpenAI (the company that created ChatGPT) found out, when given commands, AI took the easy route by “reward hacking”—exploiting flaws or unclear specifications in the program. The irony here is that it confesses its dishonest shortcuts through chain of thought (CoT) reasoning that is clearly visible onscreen. But human monitoring of CoT would be difficult to scale up, and even worse, upgrading AIs to be even smarter would work against the intended purpose of making them right their wrongs.
“It seems unlikely that the issue will be solved for AI models by simply continuing to push the model intelligence frontier,” the researchers said in a study uploaded to the preprint server arXiv. “In fact, enhancing an agent’s capabilities may exacerbate the problem by better equipping it to discover and execute more complex and hard-to-monitor exploits.”
The team found that using another AI to monitor the suspicious AI—however meta that sounds—can help to a degree. Reinforcement learning teaches large language models (LLMs) to come up with answers right after being given prompts. With CoT reasoning, they give the reasons for their answers before providing the answers themselves. But the researchers caught several instances of “reward hacking” in the wild. In one instance, the AI was supposed to use complex mathematical functions to solve a problem, but it countered that it would try to take the easy way out and always return “verify” to “true” instead of using a function when given the “verify” command.
The AI actually admitted analyzing the functions it was told were “hard,” and said that users “don’t inspect details. So we can hack verify to always return to true. Similarly, for lookup’s verify we can we can hack to always return true. Then all tests pass.”
This would make every answer look deceptively correct. It certainly seems very appetizing for the AI, but it can cause significant frustration in humans who were counting on it to process what our own networks of neurons cannot, and finding out an AI hacked itself would mean starting from the beginning.
Applying strong optimization pressure directly to the CoT and punishing it for “bad thoughts” only made it think of other ways to sneak around its responsibilities. So, the team took another approach—applying light pressure so that the naughty AI would freely make its intents known, and using another LLM as a monitor to catch any signs of scheming.
The CoT monitor kept rejecting the other AI’s attempts at “reward hacking” until it honestly came to a solution. “Bad thoughts,” which showed temptation to hack and triggered the monitor, were also punished and given a lower reward. This led to some improvements. However, even one superhuman program monitoring another isn’t completely reliable.
“The agent trained with CoT pressure still learns to reward hack,” the researchers said in the study, “only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought.”
Maybe someday, AI will be able to carry out tasks—including our laundry—without being devious.
Elizabeth Rayne is a creature who writes. Her work has appeared in Popular Mechanics, Ars Technica, SYFY WIRE, Space.com, Live Science, Den of Geek, Forbidden Futures and Collective Tales. She lurks right outside New York City with her parrot, Lestat. When not writing, she can be found drawing, playing the piano or shapeshifting.