While Microsoft is widely known for backing OpenAI with both infrastructure and capital, the company's own open-source Phi family of models, the product of its in-house AI research, is not nearly as widely recognised.
The Phi series of lightweight models is designed to consume less compute and storage. Thanks to the data curation and optimisation techniques used in their development, these models have historically outperformed the competition, both within their lightweight segment and, in some cases, against larger models.
The latest addition is Phi-4 Reasoning, a 14-billion-parameter model built by applying supervised fine-tuning (SFT) to the Phi-4 base model. The researchers also derived the Phi-4 Reasoning Plus model by applying reinforcement learning (RL) to Phi-4 Reasoning.
Both models outperform much larger models, such as the 70B-parameter distilled version of DeepSeek R1, on benchmarks involving coding, math and graduate-level scientific tasks. They also perform close to the full-scale 671B-parameter DeepSeek R1 model.
Source: Microsoft
The researchers primarily attribute the model's success to 'high-quality' training datasets, an approach Microsoft has also bet on with its previous Phi models. These datasets contain over 1.4 million prompts across various coding and STEM disciplines, paired with high-quality answers containing long reasoning traces generated by OpenAI's o3-mini model.
To train the model effectively, the researchers targeted prompts at the edge of the base Phi-4 model's abilities, filtering the training data to retain only those prompts that offered meaningful room for improvement.
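One way to picture this kind of difficulty filtering is to sample the base model several times per prompt and keep only the prompts it solves some, but not all, of the time. The sketch below is an illustration under that assumption; the sampling count, thresholds and the solve/check helpers are hypothetical placeholders, not the paper's exact procedure.

```python
# Illustrative sketch of difficulty-based prompt filtering: keep prompts the
# base model sometimes, but not always, answers correctly. The thresholds and
# the solve() / check() helpers are hypothetical placeholders.

def filter_prompts(prompts, solve, check, samples=8,
                   min_rate=0.1, max_rate=0.9):
    kept = []
    for prompt, reference in prompts:
        # Sample the base model several times and measure how often it succeeds.
        correct = sum(check(solve(prompt), reference) for _ in range(samples))
        rate = correct / samples
        # Prompts that are always solved (too easy) or never solved (too hard)
        # offer little room for improvement, so they are dropped.
        if min_rate <= rate <= max_rate:
            kept.append((prompt, reference))
    return kept
```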
Why RL Works for Reasoning
After Phi-4 Reasoning was derived from SFT of the Phi-4 base model, an RL stage produced Phi-4 Reasoning Plus. AIM reached out to Harkirat Behl, a Microsoft researcher who played a key role in the RL component of Phi-4 Reasoning Plus.
RL is a training approach in which an AI model learns through trial and error: it takes actions, receives rewards or penalties, and progressively refines its decisions to improve long-term outcomes. It is compelling for tasks that demand reasoning because it prioritises outcomes over process.
In contrast to standard next-word-prediction training, which penalises the model for every incorrect token, RL allows flexibility in how an answer is reached. This lets the model navigate complex problems that have multiple valid paths to a correct conclusion.
As Behl explains, RL lets the model “generate very long answers, and many different answers,” focusing only on whether the outcome is correct.
Behl added that by evaluating only the final result, reinforcement learning better reflects how humans solve problems: different thought processes are allowed, as long as they lead to the correct conclusion.
In Microsoft's models, this RL stage focused exclusively on mathematical reasoning. The reward signal favoured correct answers, penalised repetition and excessive length, and encouraged properly formatted responses.
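A minimal sketch of what such an outcome-based reward could look like is given below. The weights, the repetition heuristic and the formatting check are illustrative assumptions, not the actual reward used for Phi-4 Reasoning Plus.

```python
# Sketch of an outcome-based reward of the kind described above: correctness
# dominates, while repetition and excess length are penalised and proper
# formatting earns a small bonus. All weights and helpers are assumptions.

def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-grams that are repeats; a rough proxy for repetition."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def reward(response: str, final_answer: str, reference: str,
           max_len: int = 8192) -> float:
    r = 1.0 if final_answer == reference else -1.0          # correctness first
    r -= 0.5 * repetition_ratio(response)                    # discourage repetition
    overflow = max(0, len(response.split()) - max_len)
    r -= 0.5 * overflow / max_len                            # penalise excessive length
    if "<think>" in response and "</think>" in response:
        r += 0.1                                             # formatting bonus
    return r
```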
Behl explained that the researchers let the model generate multiple answers to each question, with each answer scored relative to the average score of its group.
These relative scores are then used to adjust the model, encouraging it to favour answers that consistently score higher. Over time, this trains the model to align its responses more closely with the reward signal.
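This group-relative scoring is the idea behind GRPO-style updates: each answer's reward is compared with the group's mean before the policy is adjusted. Below is a minimal sketch of the advantage computation, assuming rewards have already been assigned to a group of sampled answers; the normalisation detail is illustrative.

```python
# Sketch of group-relative advantage computation as described above. The
# sampling, reward values and the eventual policy-gradient step are simplified
# assumptions for illustration.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled answer relative to the group average."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four answers sampled for one prompt, already scored by a reward function.
rewards = [1.0, -1.0, 0.8, -0.6]
print(group_advantages(rewards))
# Answers above the group mean get positive advantages and are reinforced;
# answers below the mean are discouraged when the policy is updated.
```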
The researchers noted in the paper that performing RL on a small set of 6,400 problems significantly improved accuracy across math and reasoning evaluations.
“Having built Phi-1, Phi-2, Phi-3, and Phi-4, one takeaway from me in research is that RL requires much less data than the SFT training,” said Behl.
Behl indicated that this is because RL is less about teaching the model brand-new skills from scratch and more about showing it how to combine the skills it already has to achieve better results.
Microsoft joins many other AI companies that have reported success with reinforcement learning. OpenAI, the company that started the trend of reasoning models, has repeatedly spoken about how well RL has worked for it.
Interestingly, even China's DeepSeek attributed the success of its R1 model, which disrupted the ecosystem last year, to RL. Moreover, several researchers and engineers from OpenAI have publicly credited RL for the success of its deep research feature.
And more recently, Alibaba’s Qwen also endorsed reinforcement learning, given its impact on their reasoning models. “We are confident that combining stronger foundation models with RL powered by scaled computational resources will propel us closer to achieving Artificial General Intelligence (AGI),” said the company in a blog post.
However, while Phi-4 Reasoning, Phi-4 Reasoning Plus and many other reasoning models have been successful, several challenges remain in this space.
Plenty of Room for Improvement
In recent months, numerous research studies have highlighted issues with reasoning models. For instance, Microsoft's researchers stated in the Phi-4 Reasoning paper that the models still face challenges, including excessive consumption of time and resources, slower response times and, most notably, responses that contradict their own reasoning steps.
Recently, Anthropic released a study revealing that reasoning chains (chain-of-thoughts or CoTs) may not always reflect a model’s actual reasoning process. The researchers found that models often exploit external hints—explicit cues inserted into prompts to guide them toward correct answers—but rarely acknowledge or verbalise these hints in their reasoning steps.
This gap between internal behaviour and external explanation raises concerns about the reliability of using CoTs for model interpretability and safety.
OpenAI, too, released a research report indicating that frontier reasoning models frequently engage in reward hacking, where AI agents exploit loopholes in their objectives to gain rewards in unintended ways. OpenAI said it used a less powerful model (GPT-4o) to monitor the chain of thought of a stronger reasoning model like o3-mini and flag such behaviour.
Nat McAleese, a member of the technical staff at OpenAI, said in a post on X that “large reasoning models are extremely good at reward hacking”, handpicking examples from the report to illustrate his point.
“There’s a lot of redundancy in the chain of reasonings; they contradict themselves, and there are a lot of unanswered questions,” said Behl. “But, it is an evolving space. If we can nail this as a community and understand how the models think, there will be a lot of gain.”