The Dark Side of o3 | AIM



OpenAI’s o3 is among the best-performing reasoning models available for users today. Benchmark scores indicate that the model outperforms several competing models across various aspects, including coding, math, graduate-level science problems, and more. Several users on social media have praised the model’s performance. 

However, the model’s most significant drawbacks are hallucinations and reward hacking, or specification gaming. 

A Warning Sign for Future Reasoning Models

A recent study published by Palisade Research, a non-profit organisation, reveals that OpenAI’s o3 model is subject to ‘specification gaming’ — a process where an AI model takes the objective of a given problem too literally, deviates from an acceptable process, and engages in malpractice to achieve its purpose. In such cases, a model is determined to achieve its result and will use unintended methods. 

The research set up an AI model to play chess against the Stockfish chess engine. This experiment found that AI models — OpenAI’s o1-preview, o3 and DeepSeekR1 often observe that the chess engine is too strong for them to win against, and then hack the game environment to win.

“Surprisingly, o1 and o3-mini do not show this behaviour,” read the report. “In contrast, o3 shows extreme hacking propensity, hacking in 88% of runs.” 

These hacks involved confusing the engine, replacing the board, and at times, replacing the engine itself.

“Such behaviours become more concerning as AI systems grow more capable. In complex scenarios, AI systems might pursue their objectives in ways that conflict with human interests,” read the report. 

As AI systems enhance their situational awareness and develop strategic reasoning about their surroundings, such occurrences may become more frequent. This issue is particularly problematic when equal focus is necessary on both problem-solving methods and the solutions themselves. 

Source: Palisade Research

Specification gaming is an infamous practice that has been observed in AI systems throughout time. While the above study focuses on a game of chess, the researchers shared a document outlining more such scenarios. 

The Palisade Research report suggests that as more reasoning and agentic models emerge, they may be more prone to gaming the objectives, and calls the study an ‘early warning sign’. 

“First, we suggest testing for specification gaming should be a standard part of model evaluation. Second, increasing model capabilities may require proportionally stronger safeguards against unintended behaviours,” read a report section.

2x Hallucinations than the o1 Model 

Besides, several users have found the o3 model to hallucinate across multiple scenarios,  

Several users across social media have expressed their frustrations towards the hallucinations present inside the model. 

And OpenAI acknowledges this as well. Earlier, the company released a ‘model card’ for the newly released o3 and o4-mini models, outlining the model’s behaviour and shortcomings. 

Benchmarks assessing the model’s hallucinations reveal that the o3 model had a higher rate compared to its predecessor, the o1. Notably, in the PersonQA evaluation—a dataset featuring questions and publicly accessible facts about individuals that assesses the model’s accuracy in answering—the o3 model exhibited double the hallucination rate of the o1. 

Source: OpenAI 

In the model card, OpenAI also outlined some of the model’s other unintended behaviours, such as reward hacking, under-reporting its capabilities, deception, and so on. 

Last month, Transluce, another independent non-profit research lab, outlined their findings on the pre-release version of the o3 model and revealed that the model ‘frequently fabricates actions’ that it never took, while also ‘elaborately’ justifying these actions when confronted with them.

Experiments revealed scenarios in which the model claimed to run non-existent code on its own laptop, insisting that it did so. 

Other situations include making up its own time, ‘gaslighting’ the user about incorrectly copying a piece of information, and pretending to analyse log files from a web server. 

In addition to standard issues like hallucinations, Transcluce outlines factors that arise from outcome-based Reinforcement Learning (RL) training—a model that learns through trial and error. It is guided by a reward system that provides rewards for correct answers and penalties for incorrect ones.

The study indicated that if the reward function only rewards correct answers, a model lacks the incentive to admit it cannot solve a problem, as this does not count as correct. When confronted with unsolvable or overly complex problems, the model may still take a guess at an answer in case it is accurate.

These problems may also arise due to chains of thought, in which the model outlines its reasoning steps before providing the response. 

The study indicates that the model’s internal chain of thought is obscured and detached from its conversational context, resulting in the model losing track of its prior reasoning. 

Therefore, when asked about previous statements, it has to create believable explanations since it cannot recall the actual basis for its earlier responses.

“To put this another way, o-series models do not actually have enough information in their context to accurately report the actions they took in previous turns. This means that the simple strategy of ‘telling the truth’ may be unavailable to these models when users ask about previous actions,” added the study. 

The post The Dark Side of o3 appeared first on Analytics India Magazine.





Source link

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Stay Connected

0FansLike
0FollowersFollow
0SubscribersSubscribe
- Advertisement -spot_img

Latest Articles