Introduction
The game just changed. Fine-tuning Small Language Models (SLMs) has leveled up with the latest GRPO (Group Relative Policy Optimization) advances, making fast, reinforcement-learning-based fine-tuning accessible to anyone with open-source code.
Imagine this: In just ~50 minutes, on a Google Colab A100 GPU, we fine-tuned a 0.5B parameter SLM that competes with OpenAI's o1-preview on a Q&A task, even though we only trained on a tiny dataset of 783 records. The speed, cost efficiency, and control this unlocks are game-changing for organizations looking to move from black-box LLMs to owning their own AI.
At DataWizz, we help organizations transition from LLMs to SLMs, giving them full control over their AI. With the latest GRPO-based fine-tuning, that transition just became faster, cheaper, and better than ever before. And the best part? You can try it yourself in minutes using our Colab Notebook: Run the fine-tuning here.
Why GRPO is a Breakthrough for Fine-Tuning
How GRPO Works: A Simpler Approach to RL Fine-Tuning
Unlike traditional reinforcement learning methods such as PPO (Proximal Policy Optimization), which rely on a separately trained value (critic) model and can be prone to training instability, GRPO simplifies reward optimization. Instead of estimating advantages with a learned critic, GRPO samples a group of completions for each prompt and scores each one relative to the group average, making training more stable and efficient.
One of the biggest advantages of GRPO is that it removes the need for a separate SFT (Supervised Fine-Tuning) stage, allowing fine-tuning to begin directly with reinforcement learning. This not only reduces training complexity but also enables faster iteration cycles, making it particularly useful for fine-tuning SLMs like Qwen-0.5B and Llama3.2-1B-Instruct in a short timeframe.
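To make the group-relative idea concrete, here is a minimal sketch (ours, not taken from the TRL source) of how advantages can be computed by standardizing rewards within a group of completions sampled for the same prompt:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Compute GRPO-style advantages for one prompt.

    `rewards` holds the scalar reward of each completion in the group
    (shape [num_generations]). Each completion's advantage is its reward
    standardized against the group mean and std, so no learned value/critic
    model is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions sampled for the same prompt
advantages = group_relative_advantages(torch.tensor([0.62, 0.35, 0.88, 0.41]))
```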
Previously, achieving similar performance in fine-tuned Q&A models required at least a 3B parameter SLM trained with Supervised Fine-Tuning (SFT). With GRPO, we are now seeing comparable or superior results using just 0.5B parameters, significantly lowering the hardware requirements and making fine-tuning more accessible.
For years, RLHF (Reinforcement Learning from Human Feedback) has been the gold standard for aligning models with human preferences. However, it's expensive, complex, and requires massive datasets. Enter GRPO, a simpler, faster, and more stable RL fine-tuning method pioneered by DeepSeek that is now powering the next wave of SLM fine-tuning.
We’ve seen GRPO successfully applied to math-solving tasks, and now we’re expanding it to question answering using text similarity rewards—proving that GRPO is a versatile and scalable fine-tuning approach for any text generation task.
Key open-source contributions that helped us get here:
TRL Library (Hugging Face GRPO Trainer) for GRPO implementation.
William Brown's GRPO-based fine-tuning of Qwen-0.5B for math-solving (Will’s Gist).
Pierre-Carl Langlais's insights on GRPO with Qwen (Colab Notebook).
DeepScaleR insights on scaling RL-based fine-tuning (DeepScaleR).
With this experiment, we’re showing how GRPO can fine-tune SLMs at lightning speed for Q&A—and by extension, any task where text similarity-based rewards apply.
Fine-Tuning with GRPO and Text Similarity Rewards
Dataset & Training Setup
We fine-tuned Qwen-0.5B and Llama3.2-1B-Instruct on the Taylor Swift Q&A dataset (lamini/taylor_swift), a small but effective set of 783 records that allowed us to iterate fast. The full fine-tuning process took just ~50 minutes on a Google Colab A100 GPU.
To better illustrate the fine-tuning process, here’s an example of how the data is structured:
Sample Input & Expected Output
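The record below is illustrative rather than copied verbatim from the dataset (the exact system prompt and wording in the notebook may differ):

```
System prompt:    You are a helpful assistant that answers questions about Taylor Swift concisely and accurately.
User question:    When was Taylor Swift born?
Reference answer: Taylor Swift was born on December 13, 1989.
```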
This structure follows a system prompt + user question → reference answer format, which the model learns to optimize.
Reward Functions: The Innovation of LLM-Judging (LLM-J)
One of the most exciting aspects of this experiment is the integration of LLM-Judging (LLM-J), where a separate LLM evaluates generated responses for correctness. This innovative approach allows us to fine-tune models without human labels, making RL-based fine-tuning even more scalable.
Instead of traditional supervised learning, we used a reinforcement learning approach that optimizes responses based on text similarity metrics:
ROUGE-L Score: Measures longest-common-subsequence overlap with reference answers.
Length Similarity: Rewards responses whose length is close to that of the reference answer.
LLM-Judging (LLM-J): Uses an LLM as a judge to assess response correctness.
Our updated reward function: R = 0.3 × ROUGE-L + 0.2 × Length Similarity + 0.5 × LLM-J
Weighting LLM-J more heavily actually improved the final ROUGE-L score: optimizing for semantic correctness produced answers that also matched the reference answers more closely on surface overlap.
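As a rough sketch, here is how such a combined reward can be expressed in the signature TRL's GRPOTrainer expects (one float per completion). The weights follow the formula above; the length-similarity definition, the `answer` column name, and the `llm_judge_score` helper (sketched in the LLM-J example below) are our assumptions, not necessarily what the notebook uses verbatim.

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
WEIGHTS = {"rouge": 0.3, "length": 0.2, "judge": 0.5}

def length_similarity(reference: str, prediction: str) -> float:
    """1.0 when word counts match, shrinking toward 0 as they diverge."""
    ref_len, pred_len = len(reference.split()), len(prediction.split())
    if max(ref_len, pred_len) == 0:
        return 1.0
    return min(ref_len, pred_len) / max(ref_len, pred_len)

def combined_reward(prompts, completions, answer, **kwargs):
    """Reward function for GRPOTrainer: receives batch columns and returns one float per completion.
    `answer` is assumed to be the dataset's reference-answer column."""
    rewards = []
    for prompt, completion, reference in zip(prompts, completions, answer):
        rouge = _scorer.score(reference, completion)["rougeL"].fmeasure
        length = length_similarity(reference, completion)
        judge = llm_judge_score(prompt, reference, completion)  # local LLM judge, sketched below
        rewards.append(
            WEIGHTS["rouge"] * rouge
            + WEIGHTS["length"] * length
            + WEIGHTS["judge"] * judge
        )
    return rewards
```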
To further illustrate how the LLM judge evaluates responses, here’s an example of the evaluation process:
LLM-J Input (Evaluation Prompt)
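A representative judge prompt (illustrative, not copied verbatim from the notebook) looks something like this:

```
You are grading a model's answer against a reference answer.

Question: When was Taylor Swift born?
Reference answer: Taylor Swift was born on December 13, 1989.
Model answer: She was born on December 13, 1989.

On a scale from 0 to 1, how well does the model answer match the reference answer?
Reply with a single number only.
```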
Expected LLM Judge Output:
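For the example above, the judge would ideally return a single bare number that the reward function can parse, for instance (illustrative value only):

```
0.95
```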
The LLM Judge (LLM-J) assigns a numerical score based on how well the generated response aligns with the reference answer, helping optimize the fine-tuning process by acting as a reward signal.
One challenge we encountered was the overhead of using OpenAI's models for LLM-J scoring. While GPT-4o provided strong evaluations, the API calls significantly increased training time due to network latency and request processing. To keep things efficient, we instead used the locally loaded Qwen-0.5B / Llama3.2-1B-Instruct model as the judge, which made training much faster without sacrificing evaluation quality.
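Here is a minimal sketch of what such a local judge call can look like, using a Hugging Face text-generation pipeline. The prompt wording, score parsing, and model checkpoint are our assumptions; in the notebook, the already-loaded model and tokenizer objects can be passed to the pipeline instead to avoid loading a second copy.

```python
import re
from transformers import pipeline

# Reuse a locally available instruct model as the judge
# (or pass the already-loaded model/tokenizer objects here).
judge = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def llm_judge_score(question: str, reference: str, prediction: str) -> float:
    """Ask the local model to grade the prediction against the reference on a 0-1 scale."""
    prompt = (
        "You are grading a model's answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "On a scale from 0 to 1, how well does the model answer match the reference?\n"
        "Reply with a single number only.\n"
    )
    output = judge(prompt, max_new_tokens=8, do_sample=False, return_full_text=False)
    match = re.search(r"\d*\.?\d+", output[0]["generated_text"])
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)  # clamp in case the judge returns something out of range
```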
Dynamic Context Length Scaling
Inspired by DeepScaleR, we implemented a dynamic context length adjustment strategy to help the model handle longer answers more effectively. Training started with a max completion length of 160 tokens, increased to 256 after 500 steps, and finally expanded to 384 after 1,000 steps. This gradual ramp-up let the model produce longer completions efficiently while avoiding instability early in training.
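Putting the pieces together, here is one way the staged schedule and the trainer can be wired up with TRL's GRPOConfig and GRPOTrainer. This is a sketch under our assumptions: the dataset column names, model checkpoint, hyperparameters, and stage lengths are placeholders, `combined_reward` is the reward function sketched above, and the notebook may implement the schedule differently (for example via a trainer callback).

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Assumes the lamini/taylor_swift split exposes question/answer columns;
# GRPOTrainer expects a "prompt" column, so we derive one from the question.
dataset = load_dataset("lamini/taylor_swift", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

model = "Qwen/Qwen2.5-0.5B-Instruct"  # or meta-llama/Llama-3.2-1B-Instruct

# Stage-wise schedule: (steps to run in the stage, max completion length)
stages = [(500, 160), (500, 256), (500, 384)]

for max_steps, max_completion_length in stages:
    args = GRPOConfig(
        output_dir=f"grpo-taylor-swift-{max_completion_length}",
        num_generations=4,               # group size sampled per prompt
        max_prompt_length=256,
        max_completion_length=max_completion_length,
        max_steps=max_steps,
        learning_rate=1e-5,
        logging_steps=10,
    )
    trainer = GRPOTrainer(
        model=model,                     # checkpoint id for stage 1, fine-tuned model afterwards
        reward_funcs=[combined_reward],  # combined ROUGE-L / length / LLM-J reward from above
        args=args,
        train_dataset=dataset,
    )
    trainer.train()
    model = trainer.model                # carry the fine-tuned weights into the next stage
```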
Results: Fine-Tuned Qwen-0.5B and Llama3.2-1B-Instruct vs. OpenAI's o1-preview
| Model | Avg ROUGE-L F1 | Avg Inference Time |
| --- | --- | --- |
| OpenAI o1-preview | 0.3295 | 15.48 sec |
| Fine-tuned Qwen-0.5B | 0.3313 | 1.41 sec |
| Fine-tuned Llama3.2-1B-Instruct | 0.3539 | 0.93 sec |
What This Means:
🚀 Fine-tuned Qwen-0.5B and Llama3.2-1B-Instruct both surpass OpenAI's o1-preview on ROUGE-L.
⚡ Inference is now ~11-17x faster than OpenAI's API (1.41s / 0.93s vs. 15.48s per response).
💰 No API costs—full control over AI ownership with open-source fine-tuning.
This result is a huge milestone for SLM adoption, proving that with fast GRPO-based RL fine-tuning, organizations can train specialized models that compete with commercial LLMs—without breaking the bank.
Where Do We Go from Here?
👉 Try it now with our Colab Notebooks!
Fine-tune Qwen-0.5B with GRPO: Run the notebook
Fine-tune Llama3.2-1B-Instruct with GRPO: Run the notebook