
GRPO Fine-Tuning

The GRPO Fine-Tuning workflow allows you to enhance language models’ reasoning capabilities using Group Relative Policy Optimization. This implementation leverages Machine GPU runners to efficiently train models, significantly improving mathematical reasoning and structured output generation.

In this example, we will be fine-tuning the Qwen 2.5 3B model on the GSM8K dataset to strengthen its mathematical reasoning abilities.

Prerequisites

You will need to have completed the Quick Start guide.

Use Case Overview

Why might you want to use GRPO fine-tuning?

  • Enhance mathematical reasoning capabilities of language models
  • Enforce structured output formats with defined reasoning steps
  • Improve model performance on complex problem-solving tasks
  • Create models that provide step-by-step explanations of their reasoning process

How It Works

The GRPO Fine-Tuning workflow uses Unsloth to accelerate the fine-tuning process and implements Group Relative Policy Optimization techniques. The workflow is defined in GitHub Actions workflow files and can be triggered on-demand with configurable parameters.

The fine-tuning process:

  1. Loads a specified base model (e.g., Qwen 2.5 3B)
  2. Prepares the GSM8K dataset, which contains grade-school math problems
  3. Applies Low-Rank Adaptation (LoRA) for memory-efficient training
  4. Trains the model using GRPO techniques to improve reasoning and structured outputs
  5. Automatically saves checkpoints during training (in the retry-enabled workflow)
  6. Pushes the fine-tuned model to Hugging Face Hub
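
The training itself is driven by the qwen2_5_(3b)_grpo.py script that the workflow (shown in the next section) runs. As orientation, here is a minimal sketch of a script with this shape built from Unsloth and TRL; the model identifier, dataset handling, reward function, and Hub upload shown here are illustrative assumptions rather than the repository's exact code:

# Minimal sketch of a GRPO training script using Unsloth + TRL.
# Names, hyperparameters, and the reward function are illustrative assumptions.
import os
from unsloth import FastLanguageModel  # import Unsloth before other training libs
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Step 1: load the base model (4-bit) with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Step 2: prepare GSM8K; GRPOTrainer expects a "prompt" column
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

# Step 3: attach LoRA adapters for memory-efficient training
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=64,
)

# A toy reward: favour completions whose last token is a number
def numeric_answer_reward(completions, **kwargs):
    rewards = []
    for c in completions:
        tokens = c.strip().split()
        rewards.append(1.0 if tokens and tokens[-1].rstrip(".").isdigit() else 0.0)
    return rewards

# Step 4: train with GRPO
# (Step 5, periodic checkpointing to the Hub, is handled in the retry-enabled variant.)
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[numeric_answer_reward],
    args=GRPOConfig(max_steps=250, learning_rate=5e-6,
                    per_device_train_batch_size=1),
    train_dataset=dataset,
)
trainer.train()

# Step 6: push the LoRA adapters to the Hugging Face Hub
model.push_to_hub(os.environ.get("HF_REPO", "<your-hf-repo>"))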

Workflow Implementation

GRPO fine-tuning is implemented as GitHub Actions workflows that can be triggered manually. Here’s the basic workflow definition:

name: Training
on:
  workflow_dispatch:
    inputs:
      max_seq_length:
        type: string
        required: false
        description: 'The maximum sequence length'
        default: '1024'
      lora_rank:
        type: string
        required: false
        description: 'The lora rank'
        default: '64'
      max_steps:
        type: string
        required: false
        description: 'The maximum number of steps'
        default: '250'
      gpu_memory_utilization:
        type: string
        required: false
        description: 'The GPU memory utilization'
        default: '0.60'
      learning_rate:
        type: string
        required: false
        description: 'The learning rate'
        default: '5e-6'
      per_device_train_batch_size:
        type: string
        required: false
        description: 'The per device training batch size'
        default: '1'
      hf_repo:
        type: string
        required: true
        description: 'The Hugging Face repository to upload the model to'

jobs:
  train:
    name: Qwen 2.5 3B - GRPO LoRA Training (unsloth)
    runs-on:
      - machine
      - gpu=T4
      - cpu=4
      - ram=16
      - architecture=x64
    timeout-minutes: 180
    env:
      MAX_SEQ_LENGTH: ${{ inputs.max_seq_length }}
      LORA_RANK: ${{ inputs.lora_rank }}
      GPU_MEMORY_UTILIZATION: ${{ inputs.gpu_memory_utilization }}
      MAX_STEPS: ${{ inputs.max_steps }}
      LEARNING_RATE: ${{ inputs.learning_rate }}
      PER_DEVICE_TRAIN_BATCH_SIZE: ${{ inputs.per_device_train_batch_size }}
      HF_TOKEN: ${{ secrets.HF_TOKEN }}
      HF_HUB_ENABLE_HF_TRANSFER: 1
      HF_REPO: ${{ inputs.hf_repo }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run Training
        run: |
          python3 "qwen2_5_(3b)_grpo.py"

Advanced Retry Mechanism

For enhanced reliability, the repository also provides a workflow with automatic checkpointing and retry functionality:

name: Training with Retry
on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      max_attempts:
        type: number
        description: 'The maximum number of attempts'
        default: 5
      # Same parameters as in the basic workflow
      # ...

This implementation ensures training progress isn’t lost due to spot instance interruptions by:

  1. Automatically saving checkpoints to Hugging Face Hub during training
  2. Detecting spot instance interruptions using a custom GitHub Action
  3. Restarting the workflow with an incremented attempt number
  4. Resuming training from the latest checkpoint

The retry mechanism works through the following steps:

  1. The workflow starts a training job with a specified attempt number (default: 1)
  2. During training, checkpoints are periodically saved to Hugging Face Hub
  3. If the job completes successfully, the workflow ends
  4. If the job fails due to a spot instance interruption:
    • The check-runner-interruption action detects that the failure was due to a spot instance preemption
    • The workflow calculates the next attempt number
    • If within the maximum attempts limit, it triggers a new workflow run with an incremented attempt number
    • All original parameters are preserved for the new attempt
  5. When a new attempt starts, it downloads the latest checkpoint and resumes training from that point

This mechanism ensures that even if a spot instance is reclaimed, your training progress isn’t lost, and the job can continue from the last checkpoint on a new instance.
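In outline, this can be expressed as a follow-up job that runs only when training fails, checks whether the failure was a preemption, and re-dispatches the workflow. The sketch below is a hedged illustration: the actual action reference, its output names, the workflow file name, and the token permissions used by the repository may differ.

jobs:
  # ... the train job shown earlier ...
  retry:
    needs: train
    if: failure()
    runs-on: ubuntu-latest
    permissions:
      actions: write  # needed so the job can dispatch a new workflow run
    steps:
      - name: Check whether the runner was preempted
        id: interruption
        uses: machine/check-runner-interruption@v1  # assumed reference to the custom action
      - name: Re-dispatch training with an incremented attempt number
        if: steps.interruption.outputs.interrupted == 'true'  # assumed output name
        env:
          GH_TOKEN: ${{ github.token }}
          ATTEMPT: ${{ inputs.attempt }}
          MAX_ATTEMPTS: ${{ inputs.max_attempts }}
        run: |
          next=$(( ATTEMPT + 1 ))
          if [ "$next" -le "$MAX_ATTEMPTS" ]; then
            gh workflow run "Training with Retry" \
              --repo "$GITHUB_REPOSITORY" \
              -f attempt="$next" \
              -f max_attempts="$MAX_ATTEMPTS"
              # ...plus the original training inputs
          fi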

Using Machine GPU Runners

This fine-tuning process leverages Machine GPU runners to provide the necessary computing power. The workflow is configured to use:

  • T4 GPU: An entry-level ML training GPU with 16GB of VRAM, suitable for efficient training with Unsloth optimizations
  • Spot instance: To optimize for cost while maintaining performance
  • Configurable resources: CPU, RAM, and architecture specifications

You can also specify regions for the training to run in:

runs-on:
  - machine
  - gpu=T4
  - cpu=4
  - ram=16
  - architecture=x64
  - tenancy=spot
  - regions=us-east-1,us-east-2

What is GRPO?

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed to improve reasoning capabilities in language models. Originally introduced in the DeepSeekMath paper and later used for training DeepSeek-R1, GRPO offers several advantages over traditional Proximal Policy Optimization (PPO):

  1. No Value Function: GRPO eliminates the need for a separate value function model, reducing memory consumption and computational requirements.

  2. Group-Based Advantage Estimation: Instead of using a value function, GRPO samples multiple outputs for each prompt and uses the average reward within the group as a baseline. This approach better aligns with how reward models evaluate multiple outputs for a single input.

  3. Direct KL Divergence Integration: GRPO incorporates the KL divergence term directly into the loss function (rather than as part of the reward signal), helping to prevent the model from deviating too far from its original behavior.
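
For the outcome-reward setting described in the DeepSeekMath paper, these pieces combine into an objective of roughly the following form (a summary of the paper's formulation, not the repository's exact implementation):

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_{i,t},\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right]$$

where $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level probability ratio, $G$ is the number of outputs sampled per prompt $q$, and $\hat{A}_{i,t} = \big(r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})\big)/\operatorname{std}(\{r_j\}_{j=1}^{G})$ is the group-relative advantage shared by all tokens of output $o_i$.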

When applied to mathematical reasoning tasks like GSM8K, GRPO encourages models to show their work through structured thinking steps, significantly improving both accuracy and explainability of solutions.

The workflow typically involves:

  • Sampling multiple outputs for each prompt
  • Scoring each generation using reward functions (rule-based or outcome-based)
  • Calculating advantages relative to the group average
  • Optimizing the policy while maintaining reasonable proximity to the original model
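
To make the group-relative baseline concrete, here is a minimal sketch of the advantage computation for a single prompt, assuming one scalar reward per sampled output:

import statistics

def group_relative_advantages(rewards):
    """Advantage of each sampled output relative to its group of G samples."""
    baseline = statistics.mean(rewards)
    spread = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - baseline) / spread for r in rewards]

# Example: four completions for one GSM8K prompt, scored 1.0 if the final
# answer is correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# -> [1.0, -1.0, 1.0, -1.0]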

Getting Started

To run the GRPO Fine-Tuning workflow:

  1. Use the MachineHQ/grpo-fine-tune repository as a template
  2. Set up a Hugging Face access token with write permissions
  3. Add this token as a repository secret named HF_TOKEN in your GitHub repository settings
  4. Navigate to the Actions tab in your repository
  5. Select the “Training with Retry” workflow
  6. Click “Run workflow” and configure your parameters:
    • Adjust sequence length, LoRA rank, and training steps
    • Configure GPU memory utilization and learning rate
    • Specify your Hugging Face target repository
  7. Run the workflow and wait for results
  8. Access your fine-tuned model on Hugging Face Hub
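
If you prefer the command line, the same run can be dispatched with the GitHub CLI. The workflow file name and repository names below are placeholders for your own fork, not values confirmed by the repository:

gh workflow run training-with-retry.yml \
  --repo <your-username>/grpo-fine-tune \
  -f max_seq_length=1024 \
  -f lora_rank=64 \
  -f max_steps=250 \
  -f gpu_memory_utilization=0.60 \
  -f learning_rate=5e-6 \
  -f per_device_train_batch_size=1 \
  -f hf_repo=<your-username>/qwen2.5-3b-grpo-gsm8k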

Best Practices

  • Start with reasonable defaults: The default parameters have been tuned for good results on GSM8K
  • Adjust batch size for your GPU: Lower batch sizes if you encounter out-of-memory errors
  • Use checkpointing for longer runs: For extensive training sessions, use the retry-enabled workflow
  • Monitor training progress: Check workflow logs to observe training metrics
  • Test on mathematics problems: Evaluate the model specifically on math word problems to gauge improvement
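
For the batch-size advice in particular, a common pattern is to lower the per-device batch size and raise gradient accumulation so the effective batch size stays the same. The sketch below uses TRL's GRPOConfig; how the repository's script actually exposes these knobs may differ.

from trl import GRPOConfig

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
args = GRPOConfig(
    per_device_train_batch_size=1,   # lower this first if you hit out-of-memory errors
    gradient_accumulation_steps=4,   # raise this to keep the effective batch size constant
    learning_rate=5e-6,
    max_steps=250,
)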

Next Steps