GitHub Actions Workflow Setup

This guide provides step-by-step instructions for configuring GitHub Actions workflows to use Machine GPU runners. We’ll cover everything from basic setup to advanced configurations and troubleshooting.

Prerequisites

Before setting up your workflow, make sure you have:

  1. A GitHub repository where you want to run GPU-accelerated workflows
  2. Completion of the Quick Start Guide

Basic Workflow Setup

Step 1: Create a Workflow File

In your GitHub repository, create a new workflow file at .github/workflows/gpu-workflow.yml:

name: GPU Workflow
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  gpu-job:
    name: GPU-Accelerated Job
    runs-on:
      - machine
      - gpu=T4 # Choose your GPU type
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Run GPU task
        run: |
          # Your GPU-accelerated commands
          nvidia-smi
          # Additional commands...

Step 2: Customize GPU Selection

Choose the appropriate GPU type for your workload:

runs-on:
  - machine
  - gpu=T4G
  # OR
  - gpu=T4   # For general-purpose ML (16GB)
  # OR
  - gpu=L4   # For mid-range training/inference (24GB)
  # OR
  - gpu=A10G # For advanced training (24GB)
  # OR
  - gpu=L40S # For large model training (48GB)

Step 3: Commit and Run

Commit the workflow file to your repository. The workflow will run automatically when the specified events occur (e.g., push to main branch).
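
If you also want to trigger the workflow manually from the Actions tab (for example, to test the runner without pushing a commit), you can add a workflow_dispatch trigger alongside the existing events. This is an optional sketch, not part of the file above:

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch: # Allow manual runs from the Actions tab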

Advanced Configuration

Resource Customization

Tailor CPU, RAM, and other resources to match your workload requirements:

runs-on:
  - machine
  - gpu=A10G
  - cpu=16 # 16 CPU cores
  - ram=64 # 64GB RAM

Cost Optimization with Spot Instances

Use spot instances to reduce costs by up to 85%:

runs-on:
  - machine
  - gpu=L4
  - tenancy=spot # Use spot instances for cost savings

Region Selection

Specify regions for improved availability or to meet compliance requirements:

runs-on:
  - machine
  - gpu=T4
  - regions=us-east-1,us-east-2 # Run in either region
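
These runner labels can be combined in a single runs-on block. The following is a sketch only, assuming your account has access to the listed GPU type and regions and that the labels compose as shown:

runs-on:
  - machine
  - gpu=L4
  - cpu=16
  - ram=64
  - tenancy=spot
  - regions=us-east-1,us-east-2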

Working with Common ML Frameworks

PyTorch

Example workflow for PyTorch-based projects:

name: PyTorch Training
jobs:
  train:
    runs-on:
      - machine
      - gpu=A10G
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install torch torchvision
          pip install -r requirements.txt
      - name: Verify GPU access
        run: python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
      - name: Train model
        run: python train.py

TensorFlow

Example workflow for TensorFlow-based projects:

name: TensorFlow Training
jobs:
  train:
    runs-on:
      - machine
      - gpu=L4
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install tensorflow
          pip install -r requirements.txt
      - name: Verify GPU access
        run: python -c "import tensorflow as tf; print('GPUs:', tf.config.list_physical_devices('GPU'))"
      - name: Train model
        run: python train.py

Using Environment Variables and Secrets

Safely include API keys and configurations:

jobs:
  ml-job:
    runs-on:
      - machine
      - gpu=T4
    env:
      # Public environment variables
      BATCH_SIZE: 32
      EPOCHS: 10
    steps:
      - uses: actions/checkout@v3
      - name: Train with secrets
        env:
          API_KEY: ${{ secrets.API_KEY }}
        run: |
          python train.py --api-key "$API_KEY"

Parameterizing Workflows

Create parameterized workflows that accept inputs when triggered manually:

name: Parameterized ML Training
on:
  workflow_dispatch:
    inputs:
      model_size:
        description: 'Model size (small, medium, large)'
        required: true
        default: 'medium'
      epochs:
        description: 'Number of training epochs'
        required: true
        default: '10'
jobs:
  train:
    runs-on:
      - machine
      - gpu=A10G
    steps:
      - uses: actions/checkout@v3
      - name: Run training
        run: |
          python train.py \
            --model-size ${{ github.event.inputs.model_size }} \
            --epochs ${{ github.event.inputs.epochs }}
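
The example above is triggered manually with workflow_dispatch. If you also want other workflows in your repository to reuse the same job, GitHub Actions supports a workflow_call trigger that accepts inputs in a similar way. A minimal sketch, where the file name reusable-train.yml and the caller job are illustrative:

# .github/workflows/reusable-train.yml
name: Reusable Training
on:
  workflow_call:
    inputs:
      model_size:
        type: string
        required: false
        default: 'medium'
jobs:
  train:
    runs-on:
      - machine
      - gpu=A10G
    steps:
      - uses: actions/checkout@v3
      - name: Run training
        run: python train.py --model-size ${{ inputs.model_size }}

A caller workflow then invokes it with:

jobs:
  call-training:
    uses: ./.github/workflows/reusable-train.yml
    with:
      model_size: 'large'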

Scheduled Workflows

Run workflows on a schedule:

on:
  schedule:
    # Run daily at 2:00 AM UTC
    - cron: '0 2 * * *'
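
Combining the schedule trigger with the runner labels above, a nightly training workflow might look like the following sketch (the workflow name, spot tenancy, and train.py script are placeholders):

name: Nightly Training
on:
  schedule:
    # Run daily at 2:00 AM UTC
    - cron: '0 2 * * *'
jobs:
  nightly-train:
    runs-on:
      - machine
      - gpu=L4
      - tenancy=spot
    timeout-minutes: 180 # Stop the job if it runs longer than 3 hours
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python train.py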

Error Handling and Debugging

Debugging GPU Issues

If you encounter GPU-related issues:

  1. Check GPU availability:

    - name: Check GPU
      run: |
        nvidia-smi
        python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

  2. Verify CUDA version compatibility:

    - name: Check CUDA version
      run: |
        nvcc --version
        python -c "import torch; print('CUDA version:', torch.version.cuda)"

  3. Monitor GPU usage during training:

    - name: Train with monitoring
      run: |
        nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5 &
        python train.py

Common Issues and Solutions

Issue                       | Possible Solution
Out of memory errors        | Reduce batch size or use a larger GPU
CUDA version mismatch       | Specify compatible library versions in requirements.txt
Spot instance interruptions | Implement checkpointing and retries
Long startup times          | Pre-build and cache Docker images
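
One way to implement the checkpointing suggested above is to persist a checkpoint directory between workflow runs with the cache action, so a re-run job can resume instead of starting over. This is a sketch only: the checkpoints/ path and the --checkpoint-dir flag are illustrative, train.py is assumed to resume from an existing checkpoint, and surviving an abrupt spot interruption still requires saving checkpoints to external storage from inside the training script, since later steps do not run if the runner is reclaimed mid-job.

steps:
  - uses: actions/checkout@v4
  - name: Restore checkpoints
    uses: actions/cache/restore@v3
    with:
      path: checkpoints/
      key: checkpoints-${{ github.run_id }}
      restore-keys: |
        checkpoints-
  - name: Train model
    run: python train.py --checkpoint-dir checkpoints/ # Assumed to resume if a checkpoint exists
  - name: Save checkpoints
    if: always() # Save the cache even when the training step fails
    uses: actions/cache/save@v3
    with:
      path: checkpoints/
      key: checkpoints-${{ github.run_id }}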

Best Practices

  1. Use checkpointing to save progress during long-running jobs

  2. Cache dependencies to speed up workflow runs:

    - name: Cache pip packages
      uses: actions/cache@v3
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}

  3. Monitor GPU utilization to ensure efficient resource usage

  4. Use spot instances for non-critical or fault-tolerant workloads

  5. Implement timeouts to prevent runaway jobs:

    jobs:
      train:
        timeout-minutes: 120 # Fail after 2 hours

Next Steps