GitHub Actions Workflow Setup

This guide provides step-by-step instructions for configuring GitHub Actions workflows to use Machine GPU runners. We’ll cover everything from basic setup to advanced configurations and troubleshooting.

Prerequisites

Before setting up your workflow, make sure you have:

  1. A GitHub repository where you want to run GPU-accelerated workflows
  2. Completion of the Quick Start Guide

Basic Workflow Setup

Step 1: Create a Workflow File

In your GitHub repository, create a new workflow file at .github/workflows/gpu-workflow.yml:

name: GPU Workflow
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  gpu-job:
    name: GPU-Accelerated Job
    runs-on:
      - machine
      - gpu=T4 # Choose your GPU type
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Run GPU task
        run: |
          # Your GPU-accelerated commands
          nvidia-smi
          # Additional commands...

Step 2: Customize GPU Selection

Choose the appropriate GPU type for your workload:

runs-on:
  - machine
  - gpu=T4G
  # OR
  - gpu=T4   # For general-purpose ML (16GB)
  # OR
  - gpu=L4   # For mid-range training/inference (24GB)
  # OR
  - gpu=A10G # For advanced training (24GB)
  # OR
  - gpu=L40S # For large model training (48GB)

Step 3: Commit and Run

Commit the workflow file to your repository. The workflow will run automatically when the specified events occur (e.g., push to main branch).
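
If you also want to trigger the workflow manually from the Actions tab (for example, to test the runner without pushing a commit), you can add a workflow_dispatch trigger alongside the existing events. This is an optional sketch, not part of the file above:

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch: # Allow manual runs from the Actions tab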

Advanced Configuration

Resource Customization

Tailor CPU, RAM, and other resources to match your workload requirements:

runs-on:
  - machine
  - gpu=A10G
  - cpu=16 # 16 CPU cores
  - ram=64 # 64GB RAM

Cost Optimization with Spot Instances

Use spot instances to reduce costs by up to 85%:

runs-on:
  - machine
  - gpu=L4
  - tenancy=spot # Use spot instances for cost savings

Region Selection

Specify regions for improved availability or to meet compliance requirements:

runs-on:
  - machine
  - gpu=T4
  - regions=us-east-1,us-east-2 # Run in either region
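
These runner labels can be combined in a single runs-on block. The following is a sketch only, assuming your account has access to the listed GPU type and regions and that the labels compose as shown:

runs-on:
  - machine
  - gpu=L4
  - cpu=16
  - ram=64
  - tenancy=spot
  - regions=us-east-1,us-east-2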

Working with Common ML Frameworks

PyTorch

Example workflow for PyTorch-based projects:

name: PyTorch Training
jobs:
  train:
    runs-on:
      - machine
      - gpu=A10G
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install torch torchvision
          pip install -r requirements.txt
      - name: Verify GPU access
        run: python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
      - name: Train model
        run: python train.py

TensorFlow

Example workflow for TensorFlow-based projects:

name: TensorFlow Training
jobs:
  train:
    runs-on:
      - machine
      - gpu=L4
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install tensorflow
          pip install -r requirements.txt
      - name: Verify GPU access
        run: python -c "import tensorflow as tf; print('GPUs:', tf.config.list_physical_devices('GPU'))"
      - name: Train model
        run: python train.py

Using Environment Variables and Secrets

Safely include API keys and configurations:

jobs:
  ml-job:
    runs-on:
      - machine
      - gpu=T4
    env:
      # Public environment variables
      BATCH_SIZE: 32
      EPOCHS: 10
    steps:
      - uses: actions/checkout@v3
      - name: Train with secrets
        env:
          API_KEY: ${{ secrets.API_KEY }}
        run: |
          python train.py --api-key "$API_KEY"

Parameterizing Workflows

Create parameterized workflows that accept inputs when triggered manually:

name: Parameterized ML Training
on:
  workflow_dispatch:
    inputs:
      model_size:
        description: 'Model size (small, medium, large)'
        required: true
        default: 'medium'
      epochs:
        description: 'Number of training epochs'
        required: true
        default: '10'
jobs:
  train:
    runs-on:
      - machine
      - gpu=A10G
    steps:
      - uses: actions/checkout@v3
      - name: Run training
        run: |
          python train.py \
            --model-size ${{ github.event.inputs.model_size }} \
            --epochs ${{ github.event.inputs.epochs }}
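
The example above is triggered manually with workflow_dispatch. If you also want other workflows in your repository to reuse the same job, GitHub Actions supports a workflow_call trigger that accepts inputs in a similar way. A minimal sketch, where the file name reusable-train.yml and the caller job are illustrative:

# .github/workflows/reusable-train.yml
name: Reusable Training
on:
  workflow_call:
    inputs:
      model_size:
        type: string
        required: false
        default: 'medium'
jobs:
  train:
    runs-on:
      - machine
      - gpu=A10G
    steps:
      - uses: actions/checkout@v3
      - name: Run training
        run: python train.py --model-size ${{ inputs.model_size }}

A caller workflow then invokes it with:

jobs:
  call-training:
    uses: ./.github/workflows/reusable-train.yml
    with:
      model_size: 'large'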

Scheduled Workflows

Run workflows on a schedule:

on:
  schedule:
    # Run daily at 2:00 AM UTC
    - cron: '0 2 * * *'
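
Combining the schedule trigger with the runner labels above, a nightly training workflow might look like the following sketch (the workflow name, spot tenancy, and train.py script are placeholders):

name: Nightly Training
on:
  schedule:
    # Run daily at 2:00 AM UTC
    - cron: '0 2 * * *'
jobs:
  nightly-train:
    runs-on:
      - machine
      - gpu=L4
      - tenancy=spot
    timeout-minutes: 180 # Stop the job if it runs longer than 3 hours
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python train.py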

Error Handling and Debugging

Debugging GPU Issues

If you encounter GPU-related issues:

  1. Check GPU availability:

    - name: Check GPU
      run: |
        nvidia-smi
        python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

  2. Verify CUDA version compatibility:

    - name: Check CUDA version
      run: |
        nvcc --version
        python -c "import torch; print('CUDA version:', torch.version.cuda)"

  3. Monitor GPU usage during training:

    - name: Train with monitoring
      run: |
        nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5 &
        python train.py

Common Issues and Solutions

Issue                       | Possible Solution
Out of memory errors        | Reduce batch size or use a larger GPU
CUDA version mismatch       | Specify compatible library versions in requirements.txt
Spot instance interruptions | Implement checkpointing and retries
Long startup times          | Pre-build and cache Docker images
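
One way to implement the checkpointing suggested above is to persist a checkpoint directory between workflow runs with the cache action, so a re-run job can resume instead of starting over. This is a sketch only: the checkpoints/ path and the --checkpoint-dir flag are illustrative, train.py is assumed to resume from an existing checkpoint, and surviving an abrupt spot interruption still requires saving checkpoints to external storage from inside the training script, since later steps do not run if the runner is reclaimed mid-job.

steps:
  - uses: actions/checkout@v4
  - name: Restore checkpoints
    uses: actions/cache/restore@v3
    with:
      path: checkpoints/
      key: checkpoints-${{ github.run_id }}
      restore-keys: |
        checkpoints-
  - name: Train model
    run: python train.py --checkpoint-dir checkpoints/ # Assumed to resume if a checkpoint exists
  - name: Save checkpoints
    if: always() # Save the cache even when the training step fails
    uses: actions/cache/save@v3
    with:
      path: checkpoints/
      key: checkpoints-${{ github.run_id }}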

Best Practices

  1. Use checkpointing to save progress during long-running jobs

  2. Cache dependencies to speed up workflow runs:

    - name: Cache pip packages
      uses: actions/cache@v3
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}

  3. Monitor GPU utilization to ensure efficient resource usage

  4. Use spot instances for non-critical or fault-tolerant workloads

  5. Implement timeouts to prevent runaway jobs:

    jobs:
      train:
        timeout-minutes: 120 # Fail after 2 hours

Next Steps