
GRPO Fine-Tuning

The GRPO Fine-Tuning workflow allows you to enhance language models’ reasoning capabilities using Group Relative Policy Optimization. This implementation leverages Machine GPU runners to efficiently train models, significantly improving mathematical reasoning and structured output generation.

In this example, we will be fine-tuning the Qwen 2.5 3B model on the GSM8K dataset to strengthen its mathematical reasoning abilities.

Prerequisites

You will need to have completed the Quick Start guide.

Use Case Overview

Why might you want to use GRPO fine-tuning?

  • Enhance mathematical reasoning capabilities of language models
  • Enforce structured output formats with defined reasoning steps
  • Improve model performance on complex problem-solving tasks
  • Create models that provide step-by-step explanations of their reasoning process

How It Works

The GRPO Fine-Tuning workflow uses Unsloth to accelerate the fine-tuning process and implements Group Relative Policy Optimization techniques. The workflow is defined in GitHub Actions workflow files and can be triggered on-demand with configurable parameters.

The fine-tuning process:

  1. Loads a specified base model (e.g., Qwen 2.5 3B)
  2. Prepares the GSM8K dataset, which contains grade-school math problems
  3. Applies Low-Rank Adaptation (LoRA) for memory-efficient training
  4. Trains the model using GRPO techniques to improve reasoning and structured outputs
  5. Automatically saves checkpoints during training (in the retry-enabled workflow)
  6. Pushes the fine-tuned model to Hugging Face Hub
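
The training itself is driven by the qwen2_5_(3b)_grpo.py script that the workflow (shown in the next section) runs. As orientation, here is a minimal sketch of a script with this shape built from Unsloth and TRL; the model identifier, dataset handling, reward function, and Hub upload shown here are illustrative assumptions rather than the repository's exact code:

# Minimal sketch of a GRPO training script using Unsloth + TRL.
# Names, hyperparameters, and the reward function are illustrative assumptions.
import os
from unsloth import FastLanguageModel  # import Unsloth before other training libs
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Step 1: load the base model (4-bit) with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Step 2: prepare GSM8K; GRPOTrainer expects a "prompt" column
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

# Step 3: attach LoRA adapters for memory-efficient training
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=64,
)

# A toy reward: favour completions whose last token is a number
def numeric_answer_reward(completions, **kwargs):
    rewards = []
    for c in completions:
        tokens = c.strip().split()
        rewards.append(1.0 if tokens and tokens[-1].rstrip(".").isdigit() else 0.0)
    return rewards

# Step 4: train with GRPO
# (Step 5, periodic checkpointing to the Hub, is handled in the retry-enabled variant.)
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[numeric_answer_reward],
    args=GRPOConfig(max_steps=250, learning_rate=5e-6,
                    per_device_train_batch_size=1),
    train_dataset=dataset,
)
trainer.train()

# Step 6: push the LoRA adapters to the Hugging Face Hub
model.push_to_hub(os.environ.get("HF_REPO", "<your-hf-repo>"))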

Workflow Implementation

GRPO fine-tuning is implemented as GitHub Actions workflows that can be triggered manually. Here’s the basic workflow definition:

name: Training
on:
  workflow_dispatch:
    inputs:
      max_seq_length:
        type: string
        required: false
        description: 'The maximum sequence length'
        default: '1024'
      lora_rank:
        type: string
        required: false
        description: 'The lora rank'
        default: '64'
      max_steps:
        type: string
        required: false
        description: 'The maximum number of steps'
        default: '250'
      gpu_memory_utilization:
        type: string
        required: false
        description: 'The GPU memory utilization'
        default: '0.60'
      learning_rate:
        type: string
        required: false
        description: 'The learning rate'
        default: '5e-6'
      per_device_train_batch_size:
        type: string
        required: false
        description: 'The per device training batch size'
        default: '1'
      hf_repo:
        type: string
        required: true
        description: 'The Hugging Face repository to upload the model to'

jobs:
  train:
    name: Qwen 2.5 3B - GRPO LoRA Training (unsloth)
    runs-on:
      - machine
      - gpu=T4
      - cpu=4
      - ram=16
      - architecture=x64
    timeout-minutes: 180
    env:
      MAX_SEQ_LENGTH: ${{ inputs.max_seq_length }}
      LORA_RANK: ${{ inputs.lora_rank }}
      GPU_MEMORY_UTILIZATION: ${{ inputs.gpu_memory_utilization }}
      MAX_STEPS: ${{ inputs.max_steps }}
      LEARNING_RATE: ${{ inputs.learning_rate }}
      PER_DEVICE_TRAIN_BATCH_SIZE: ${{ inputs.per_device_train_batch_size }}
      HF_TOKEN: ${{ secrets.HF_TOKEN }}
      HF_HUB_ENABLE_HF_TRANSFER: 1
      HF_REPO: ${{ inputs.hf_repo }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run Training
        run: |
          python3 "qwen2_5_(3b)_grpo.py"

Advanced Retry Mechanism

For enhanced reliability, the repository also provides a workflow with automatic checkpointing and retry functionality:

name: Training with Retry
on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      max_attempts:
        type: number
        description: 'The maximum number of attempts'
        default: 5
      # Same parameters as in the basic workflow
      # ...

This implementation ensures training progress isn’t lost due to spot instance interruptions by:

  1. Automatically saving checkpoints to Hugging Face Hub during training
  2. Detecting spot instance interruptions using a custom GitHub Action
  3. Restarting the workflow with an incremented attempt number
  4. Resuming training from the latest checkpoint

The retry mechanism works through the following steps:

  1. The workflow starts a training job with a specified attempt number (default: 1)
  2. During training, checkpoints are periodically saved to Hugging Face Hub
  3. If the job completes successfully, the workflow ends
  4. If the job fails due to a spot instance interruption:
    • The check-runner-interruption action detects that the failure was due to a spot instance preemption
    • The workflow calculates the next attempt number
    • If within the maximum attempts limit, it triggers a new workflow run with an incremented attempt number
    • All original parameters are preserved for the new attempt
  5. When a new attempt starts, it downloads the latest checkpoint and resumes training from that point

This mechanism ensures that even if a spot instance is reclaimed, your training progress isn’t lost, and the job can continue from the last checkpoint on a new instance.
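In outline, this can be expressed as a follow-up job that runs only when training fails, checks whether the failure was a preemption, and re-dispatches the workflow. The sketch below is a hedged illustration: the actual action reference, its output names, the workflow file name, and the token permissions used by the repository may differ.

jobs:
  # ... the train job shown earlier ...
  retry:
    needs: train
    if: failure()
    runs-on: ubuntu-latest
    permissions:
      actions: write  # needed so the job can dispatch a new workflow run
    steps:
      - name: Check whether the runner was preempted
        id: interruption
        uses: machine/check-runner-interruption@v1  # assumed reference to the custom action
      - name: Re-dispatch training with an incremented attempt number
        if: steps.interruption.outputs.interrupted == 'true'  # assumed output name
        env:
          GH_TOKEN: ${{ github.token }}
          ATTEMPT: ${{ inputs.attempt }}
          MAX_ATTEMPTS: ${{ inputs.max_attempts }}
        run: |
          next=$(( ATTEMPT + 1 ))
          if [ "$next" -le "$MAX_ATTEMPTS" ]; then
            gh workflow run "Training with Retry" \
              --repo "$GITHUB_REPOSITORY" \
              -f attempt="$next" \
              -f max_attempts="$MAX_ATTEMPTS"
              # ...plus the original training inputs
          fi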

Using Machine GPU Runners

This fine-tuning process leverages Machine GPU runners to provide the necessary computing power. The workflow is configured to use:

  • T4 GPU: An entry-level ML training GPU with 16GB of VRAM, suitable for efficient training with Unsloth optimizations
  • Spot instance: To optimize for cost while maintaining performance
  • Configurable resources: CPU, RAM, and architecture specifications

You can also specify regions for the training to run in:

runs-on:
  - machine
  - gpu=T4
  - cpu=4
  - ram=16
  - architecture=x64
  - tenancy=spot
  - regions=us-east-1,us-east-2

What is GRPO?

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed to improve reasoning capabilities in language models. Originally introduced in the DeepSeekMath paper and later used for training DeepSeek-R1, GRPO offers several advantages over traditional Proximal Policy Optimization (PPO):

  1. No Value Function: GRPO eliminates the need for a separate value function model, reducing memory consumption and computational requirements.

  2. Group-Based Advantage Estimation: Instead of using a value function, GRPO samples multiple outputs for each prompt and uses the average reward within the group as a baseline. This approach better aligns with how reward models evaluate multiple outputs for a single input.

  3. Direct KL Divergence Integration: GRPO incorporates the KL divergence term directly into the loss function (rather than as part of the reward signal), helping to prevent the model from deviating too far from its original behavior.
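
For the outcome-reward setting described in the DeepSeekMath paper, these pieces combine into an objective of roughly the following form (a summary of the paper's formulation, not the repository's exact implementation):

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_{i,t},\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right]$$

where $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level probability ratio, $G$ is the number of outputs sampled per prompt $q$, and $\hat{A}_{i,t} = \big(r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})\big)/\operatorname{std}(\{r_j\}_{j=1}^{G})$ is the group-relative advantage shared by all tokens of output $o_i$.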

When applied to mathematical reasoning tasks like GSM8K, GRPO encourages models to show their work through structured thinking steps, significantly improving both accuracy and explainability of solutions.

The workflow typically involves:

  • Sampling multiple outputs for each prompt
  • Scoring each generation using reward functions (rule-based or outcome-based)
  • Calculating advantages relative to the group average
  • Optimizing the policy while maintaining reasonable proximity to the original model
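
To make the group-relative baseline concrete, here is a minimal sketch of the advantage computation for a single prompt, assuming one scalar reward per sampled output:

import statistics

def group_relative_advantages(rewards):
    """Advantage of each sampled output relative to its group of G samples."""
    baseline = statistics.mean(rewards)
    spread = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - baseline) / spread for r in rewards]

# Example: four completions for one GSM8K prompt, scored 1.0 if the final
# answer is correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# -> [1.0, -1.0, 1.0, -1.0]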

Getting Started

To run the GRPO Fine-Tuning workflow:

  1. Use the MachineHQ/grpo-fine-tune repository as a template
  2. Set up a Hugging Face access token with write permissions
  3. Add this token as a repository secret named HF_TOKEN in your GitHub repository settings
  4. Navigate to the Actions tab in your repository
  5. Select the “Training with Retry” workflow
  6. Click “Run workflow” and configure your parameters:
    • Adjust sequence length, LoRA rank, and training steps
    • Configure GPU memory utilization and learning rate
    • Specify your Hugging Face target repository
  7. Run the workflow and wait for results
  8. Access your fine-tuned model on Hugging Face Hub
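
If you prefer the command line, the same run can be dispatched with the GitHub CLI. The workflow file name and repository names below are placeholders for your own fork, not values confirmed by the repository:

gh workflow run training-with-retry.yml \
  --repo <your-username>/grpo-fine-tune \
  -f max_seq_length=1024 \
  -f lora_rank=64 \
  -f max_steps=250 \
  -f gpu_memory_utilization=0.60 \
  -f learning_rate=5e-6 \
  -f per_device_train_batch_size=1 \
  -f hf_repo=<your-username>/qwen2.5-3b-grpo-gsm8k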

Best Practices

  • Start with reasonable defaults: The default parameters have been tuned for good results on GSM8K
  • Adjust batch size for your GPU: Lower batch sizes if you encounter out-of-memory errors
  • Use checkpointing for longer runs: For extensive training sessions, use the retry-enabled workflow
  • Monitor training progress: Check workflow logs to observe training metrics
  • Test on mathematics problems: Evaluate the model specifically on math word problems to gauge improvement
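
For the batch-size advice in particular, a common pattern is to lower the per-device batch size and raise gradient accumulation so the effective batch size stays the same. The sketch below uses TRL's GRPOConfig; how the repository's script actually exposes these knobs may differ.

from trl import GRPOConfig

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
args = GRPOConfig(
    per_device_train_batch_size=1,   # lower this first if you hit out-of-memory errors
    gradient_accumulation_steps=4,   # raise this to keep the effective batch size constant
    learning_rate=5e-6,
    max_steps=250,
)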

Next Steps