Industry News

DeepScaleR - Tiny 1.5B Model Beats OpenAI O1 in Math

Feb 12, 2025

7 min read

Turns out tiny, mission-specific models can deliver amazing bang for the buck. This new model - a fine-tuned version of the Deepseek-R1-Distilled-Qwen-1.5B model - surpasses OpenAI's frontier o1 model at math problem solving, at roughly 1/1000th of the size. This is a phenomenal demonstration of the power specialized models bring to the table!

The Model

The Agentica project - an open-source initiative founded by a group of Berkeley researchers - fine-tuned a 1.5B-parameter (tiny!) model for math problem solving and reasoning. This model achieved significantly better results on math problems than OpenAI's o1 across multiple benchmarks, while also being ~400x cheaper to run (o1 costs ~$60/1M tokens; a 1.5B-parameter model can be served for ~$0.15/1M tokens).

Some highlights:

  • Base Model - this model is based on Deepseek-R1-Distilled-Qwen-1.5B from the Deepseek team.

  • Training Data - the Agentica team used 40,000 high-quality math problems to train the model. All the data has been made available by the team.

  • Training Cost - the fine-tuning took 3,800 A100 hours - for a training cost of $4,500!

  • Training Process - the fine-tuning used reinforcement learning (RL) to train the model for math problem solving.

  • Results - the model achieves SOTA-level performance in math problem solving, surpassing OpenAI's flagship o1 model on various benchmarks:

    • AIME 2024:

      • OpenAI o1 - 40.0

      • DeepScaleR-1.5B-Preview - 43.1 (+8%)

    • MATH 500:

      • OpenAI o1 - 81.4

      • DeepScaleR-1.5B-Preview - 87.8 (+8%)
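The RL training described above hinges on rewarding verifiably correct final answers. As a minimal illustrative sketch (not the team's actual code), a binary reward for math problems might look like this, assuming the model is prompted to wrap its final answer in \boxed{}:

```python
import re

def extract_boxed_answer(text):
    """Pull the last \\boxed{...} answer from a model response.
    Assumes the \\boxed{} convention common in math benchmarks."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(response, gold_answer):
    """Binary verifiable reward: 1.0 if the extracted answer matches
    the reference, else 0.0. Real pipelines normalize expressions
    (e.g. with a CAS) before comparing; exact string match is a
    simplification here."""
    predicted = extract_boxed_answer(response)
    return 1.0 if predicted == gold_answer else 0.0

print(math_reward("The answer is \\boxed{42}.", "42"))  # 1.0
print(math_reward("I think it's \\boxed{41}.", "42"))   # 0.0
```

Because the reward is computed mechanically from the answer, no human labeling or reward model is needed during RL - one reason math is such a tractable domain for this approach.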

Why This Matters

This is HUGE news - showing that smaller models, fine-tuned to specialize in specific tasks, can not only be more cost-effective but also achieve better performance than larger generic models.

This strengthens the case for specialized language models. While larger generic LLMs can be helpful for accelerating development - providing “plug-and-play” functionality - moving to smaller, mission-specific models will ultimately deliver lower cost AND better performance.

Improved Efficiency

The resource footprint of an AI model is roughly determined by its parameter count - the more parameters, the more GPU memory is needed and the more operations each inference requires. Inference cost also scales roughly linearly with model size.

We don’t know the size of OpenAI’s o1 - but it is rumored to be in the 1-trillion-parameter range. In contrast, DeepScaleR-1.5B-Preview has just 1.5 billion parameters - three orders of magnitude smaller. This translates directly into inference cost (~400x savings), inference latency, and inference capacity. Further, it means specialized models like this can be deployed on a wide variety of hardware platforms - even without expensive top-tier GPUs.
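A quick back-of-envelope check makes the gap concrete. The numbers below are illustrative only: o1's size is an unconfirmed rumor, and the per-token prices are the figures quoted above.

```python
O1_PARAMS = 1_000_000_000_000      # rumored ~1T parameters, not confirmed
DEEPSCALER_PARAMS = 1_500_000_000  # 1.5B parameters

def fp16_weight_gb(params):
    """Approximate weight memory at 2 bytes per parameter (fp16)."""
    return params * 2 / 1e9

# ~3 GB of weights -> fits comfortably on a single consumer GPU
print(f"{fp16_weight_gb(DEEPSCALER_PARAMS):.1f} GB")   # 3.0 GB

# ~667x fewer parameters (three orders of magnitude)
print(f"{O1_PARAMS / DEEPSCALER_PARAMS:.0f}x smaller")  # 667x smaller

# Price gap from the quoted rates: $60 vs $0.15 per 1M tokens
print(f"{60 / 0.15:.0f}x cheaper")                      # 400x cheaper
```

Note that the price gap (~400x) is smaller than the parameter gap (~667x) - serving costs also reflect margins, batching efficiency, and hardware differences, not parameter count alone.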

Smaller Models Outperform

Traditionally, it was thought that “bigger is better” - the more parameters a model has, the better its answers. DeepScaleR demonstrates that small models can outperform larger ones by specializing in specific tasks. By hyper-specializing on a narrower scope of problems, specialized models deliver better accuracy.

Leveraging this insight, companies should opt for specialized, task-specific models to improve accuracy on their specific use cases.

Training is More Attainable Than Ever

It used to be widely accepted that fine-tuning custom models required immense amounts of data and staggering resources - beyond the reach of most smaller-scale companies. DeepScaleR shows that great results can be achieved with orders of magnitude less data and cost than originally thought.

DeepScaleR used a scant 40,000 samples and cost less than $5,000 to train - putting fine-tuning within reach of many more companies.
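The reported figures are internally consistent with ordinary cloud GPU pricing - a quick sanity check (illustrative arithmetic only):

```python
A100_HOURS = 3_800   # reported fine-tuning compute
TOTAL_COST = 4_500   # reported cost in USD

# Implied hourly rate: ~$1.18/A100-hour, in line with typical
# cloud/spot pricing - no exotic infrastructure required.
print(f"${TOTAL_COST / A100_HOURS:.2f}/A100-hour")  # $1.18/A100-hour
```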

The Future of SLMs

With this powerful demonstration of the potential of small language models (SLMs), we predict more companies will choose to train and deploy custom small models - for both efficiency and accuracy. The key to realizing these benefits is creating hyper-specialized models dedicated to very specific tasks.

The more concentrated the use case, the better the performance achievable with smaller models. With that in mind, more companies will opt to deploy a larger set of models - each dedicated to a subset of the tasks they perform. This hyper-specialization will enable far more efficient AI systems and increased inference accuracy.

Datawizz - Create your own SLM

Datawizz helps companies migrate from large generic LLMs like GPT-4o, o1, and Claude 3.5 to custom specialized models fine-tuned for their tasks. As a plug-and-play solution, Datawizz intercepts your application's LLM requests, uses them to train smaller, more efficient models, and deploys them on a serverless platform.

If you are ready to create better & more efficient AI - check out https://datawizz.ai!

DeepScaleR Resources

🌐 Website, 👨‍💻 Github, 🤗 HF Model, 🤗 HF Dataset, 📈 Wandb Logs, 🔎 Eval Logs
