Ai2 releases Tülu 3 405B, a fully open-source model that bests DeepSeek v3 and matches GPT-4o with a novel post-training approach


The open-source model race just keeps on getting more interesting. 

Today, the Allen Institute for AI (Ai2) debuted its latest entry in the race with the launch of Tülu 3 405B, an open-source, 405-billion-parameter large language model (LLM). The new model not only matches the capabilities of OpenAI’s GPT-4o but also surpasses DeepSeek’s v3 model on key benchmarks.

This isn’t the first time Ai2 has made bold claims about a new model. In November 2024, the company released the first version of Tülu 3 in 8- and 70-billion-parameter versions. At the time, Ai2 claimed the model was on par with the latest GPT-4 model from OpenAI, Anthropic’s Claude and Google’s Gemini. The big difference is that Tülu 3 is open source. Ai2 also claimed back in September 2024 that its Molmo models were able to beat GPT-4o and Claude on some benchmarks.

While benchmark performance data is interesting, what’s perhaps more useful are the training innovations that enable the new Ai2 model.

Pushing post-training to the limit

The big breakthrough for Tülu 3 405B is rooted in an innovation that first appeared with the initial Tülu 3 release in 2024. That release used a combination of advanced post-training techniques to achieve better performance.

With the Tülu 3 405B model, those techniques have been pushed even further, using a post-training methodology that combines supervised fine-tuning (SFT), preference learning and a novel reinforcement learning approach that has proven exceptional at larger scales.
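
To make the recipe concrete, the first stage can be sketched with Hugging Face’s open-source TRL library. This is a minimal illustration under stated assumptions: the stand-in base model, output directory and data slice are not Ai2’s actual configuration, whose full recipes ship with its released code.

```python
# A minimal sketch of stage one (SFT) of a Tülu 3-style recipe using TRL.
# Assumptions: a small stand-in model with a chat template and default
# settings; Ai2's released code contains the real recipes and hyperparameters.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Ai2 released its SFT data openly; take a small slice to keep this a sketch.
train_data = load_dataset("allenai/tulu-3-sft-mixture", split="train").select(range(1000))

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # assumption: tiny stand-in, not Ai2's base
    train_dataset=train_data,
    args=SFTConfig(output_dir="tulu3-sft-sketch"),
)
trainer.train()

# Stage two applies preference learning (e.g. DPO) to this checkpoint,
# and stage three applies RLVR, sketched later in this article.
```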

“Applying Tülu 3’s post-training recipes to Tülu 3-405B, our largest-scale, fully open-source post-trained model to date, levels the playing field by providing open fine-tuning recipes, data and code, empowering developers and researchers to achieve performance comparable to top-tier closed models,” Hannaneh Hajishirzi, senior director of NLP Research at Ai2, told VentureBeat.

Advancing the state of open-source AI post-training with RLVR

Post-training is something that other models, including DeepSeek v3, undergo as well.

The key innovation that helps to differentiate Tülu 3 is Ai2’s “reinforcement learning from verifiable rewards” (RLVR) system. 

Unlike traditional training approaches, RLVR uses verifiable outcomes — such as solving mathematical problems correctly — to fine-tune the model’s performance. This technique, when combined with direct preference optimization (DPO) and carefully curated training data, has enabled the model to achieve better accuracy in complex reasoning tasks while maintaining strong safety characteristics.
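
To illustrate the idea, here is a minimal, self-contained sketch of a verifiable reward for math problems. The prompt format and function name are illustrative assumptions, not Ai2’s implementation; the point is that the reward is a binary, checkable signal rather than a learned preference score.

```python
# A minimal sketch of a verifiable reward in the RLVR style: the reward is
# 1.0 only when the model's final answer can be checked against ground truth.
# The "Answer: <value>" convention below is an assumption for illustration.
import re

def math_reward(completion: str, ground_truth: str) -> float:
    """Return a binary, verifiable reward for a math completion."""
    match = re.search(r"Answer:\s*(-?[\d.,/]+)", completion)
    if match is None:
        return 0.0  # no parseable final answer, no reward
    answer = match.group(1).strip().rstrip(".")
    return 1.0 if answer == ground_truth else 0.0

# The RL loop then reinforces sampled completions that earn reward 1.0.
print(math_reward("The total is 6 * 7.\nAnswer: 42", "42"))  # 1.0
print(math_reward("I think it is about 40.", "42"))          # 0.0
```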

Key technical innovations in the RLVR implementation include:

  • Efficient parallel processing across 256 GPUs
  • Optimized weight synchronization 
  • Balanced compute distribution across 32 nodes
  • Integrated vLLM deployment with 16-way tensor parallelism (see the sketch after this list)
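
As a concrete illustration of that last item, here is a minimal sketch of serving a 405B-class model with vLLM’s tensor parallelism. The Hugging Face model ID is an assumption, and actually running this requires a node with 16 suitable GPUs.

```python
# A minimal sketch of loading a 405B-class model with vLLM and 16-way
# tensor parallelism. The model ID below is an assumption for illustration,
# and a real run needs 16 large GPUs on a single node.
from vllm import LLM, SamplingParams

llm = LLM(
    model="allenai/Llama-3.1-Tulu-3-405B",  # assumption: published model ID
    tensor_parallel_size=16,                # shard each layer's weights across 16 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Prove that the sum of two even numbers is even."], params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism splits each layer’s weight matrices across GPUs, which is what makes it possible to fit a 405-billion-parameter model into aggregate GPU memory while keeping generation fast enough for RLVR’s sampling loop.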

The RLVR system showed improved results at the 405B-parameter scale compared to smaller models. It also demonstrated particularly strong results in safety evaluations, outperforming DeepSeek v3, Llama 3.1 and Nous Hermes 3. Notably, the RLVR framework’s effectiveness increased with model size, suggesting potential benefits from even larger-scale implementations.

How Tülu 3 405B compares to GPT-4o and DeepSeek v3

The model’s competitive positioning is particularly noteworthy in the current AI landscape.

Tülu 3 405B not only matches the capabilities of GPT-4o but also outperforms DeepSeek v3 in some areas, particularly on safety benchmarks.

Across a suite of 10 AI benchmarks, including safety evaluations, Ai2 reported that the Tülu 3 405B RLVR model had an average score of 80.7, surpassing DeepSeek v3’s 75.9. Tülu, however, is not quite as good as GPT-4o, which scored 81.6. Overall, the metrics suggest that Tülu 3 405B is at the very least extremely competitive with GPT-4o and DeepSeek v3 across the benchmarks.

Why open-source AI matters and how Ai2 is doing it differently

What makes Tülu 3 405B different for users, though, is how Ai2 has made the model available. 

There is a lot of noise in the AI market about open source. DeepSeek says its model is open-source, and so is Meta’s Llama 3.1, which Tülu 3 405B also outperforms.

With both DeepSeek and Llama, the models are freely available for use, and some, but not all, of the code is available.

For example, DeepSeek has released DeepSeek-R1’s model code and pre-trained weights, but not its training data. Ai2 is taking a different approach in an attempt to be more open.

“We don’t leverage any closed datasets,” Hajishirzi said. “As with our first Tülu 3 release in November 2024, we are releasing all of the infrastructure code.”

She added that Ai2’s fully open approach, which includes data, training code and models, ensures users can easily customize their pipeline for everything from data selection through evaluation. Users can access the full suite of Tülu 3 models, including Tülu 3-405B, on Ai2’s Tülu 3 page, or test the Tülu 3-405B functionality through Ai2’s Playground demo space.
