# GitHub Actions Workflow Setup
This guide provides step-by-step instructions for configuring GitHub Actions workflows to use Machine GPU runners. We’ll cover everything from basic setup to advanced configurations and troubleshooting.
## Prerequisites
Before setting up your workflow, make sure you have:
- A GitHub repository where you want to run GPU-accelerated workflows
- Completed the Quick Start Guide
## Basic Workflow Setup
### Step 1: Create a Workflow File
In your GitHub repository, create a new workflow file at `.github/workflows/gpu-workflow.yml`:

```yaml
name: GPU Workflow

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  gpu-job:
    name: GPU-Accelerated Job
    runs-on:
      - machine
      - gpu=T4 # Choose your GPU type
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Run GPU task
        run: |
          # Your GPU-accelerated commands
          nvidia-smi
          # Additional commands...
```
### Step 2: Customize GPU Selection
Choose the appropriate GPU type for your workload:
```yaml
runs-on:
  - machine
  - gpu=T4G
  # OR
  - gpu=T4   # For general-purpose ML (16GB)
  # OR
  - gpu=L4   # For mid-range training/inference (24GB)
  # OR
  - gpu=A10G # For advanced training (24GB)
  # OR
  - gpu=L40S # For large model training (48GB)
```
### Step 3: Commit and Run
Commit the workflow file to your repository. The workflow will run automatically when the specified events occur (e.g., push to main branch).
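From the command line, that could look like this (assuming `main` is your default branch and the file path from Step 1):

```bash
# Commit the workflow file and push it to trigger the first run
git add .github/workflows/gpu-workflow.yml
git commit -m "Add GPU workflow"
git push origin main
```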
## Advanced Configuration
### Resource Customization
Tailor CPU, RAM, and other resources to match your workload requirements:
```yaml
runs-on:
  - machine
  - gpu=A10G
  - cpu=16 # 16 CPU cores
  - ram=64 # 64GB RAM
```
### Cost Optimization with Spot Instances
Use spot instances to reduce costs by up to 85%:
```yaml
runs-on:
  - machine
  - gpu=L4
  - tenancy=spot # Use spot instances for cost savings
```
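Spot capacity can be reclaimed mid-run, so spot jobs pair well with checkpointing. As a sketch (the `--resume-from` flag and `checkpoint.pt` path are hypothetical stand-ins for your own training script), you could persist checkpoints as workflow artifacts so a re-run can resume:

```yaml
jobs:
  train:
    runs-on:
      - machine
      - gpu=L4
      - tenancy=spot
    steps:
      - uses: actions/checkout@v4
      - name: Train (resumable)
        run: python train.py --resume-from checkpoint.pt # hypothetical flag
      - name: Upload checkpoint
        if: always() # upload even when the training step fails
        uses: actions/upload-artifact@v4
        with:
          name: checkpoint
          path: checkpoint.pt
```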
### Region Selection
Specify regions for improved availability or to meet compliance requirements:
```yaml
runs-on:
  - machine
  - gpu=T4
  - regions=us-east-1,us-east-2 # Run in either region
```
## Working with Common ML Frameworks
### PyTorch
Example workflow for PyTorch-based projects:
```yaml
name: PyTorch Training

on: [push] # adjust the trigger to suit your project

jobs:
  train:
    runs-on:
      - machine
      - gpu=A10G
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install torch torchvision
          pip install -r requirements.txt
      - name: Verify GPU access
        run: python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
      - name: Train model
        run: python train.py
```
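The workflow assumes a `train.py` at the repository root. A minimal, hypothetical placeholder that exercises the GPU might look like:

```python
# Hypothetical minimal train.py: use the GPU when available
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on {device}")

# Tiny forward pass to confirm tensors actually land on the GPU
model = torch.nn.Linear(4, 2).to(device)
batch = torch.randn(8, 4, device=device)
print("Output shape:", tuple(model(batch).shape))
```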
### TensorFlow
Example workflow for TensorFlow-based projects:
```yaml
name: TensorFlow Training

on: [push] # adjust the trigger to suit your project

jobs:
  train:
    runs-on:
      - machine
      - gpu=L4
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install tensorflow
          pip install -r requirements.txt
      - name: Verify GPU access
        run: python -c "import tensorflow as tf; print('GPUs:', tf.config.list_physical_devices('GPU'))"
      - name: Train model
        run: python train.py
```
## Using Environment Variables and Secrets
Safely include API keys and configurations:
```yaml
jobs:
  ml-job:
    runs-on:
      - machine
      - gpu=T4
    env:
      # Public environment variables
      BATCH_SIZE: 32
      EPOCHS: 10
    steps:
      - uses: actions/checkout@v4
      - name: Train with secrets
        env:
          API_KEY: ${{ secrets.API_KEY }}
        run: |
          python train.py --api-key "$API_KEY"
```
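Secrets are defined in your repository settings (Settings → Secrets and variables → Actions). If you use the GitHub CLI, you can also set them from a terminal:

```bash
# Requires an authenticated GitHub CLI; the value shown is a placeholder
gh secret set API_KEY --body "your-api-key-value"
```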
## Parameterizing Workflows
Create reusable workflows with inputs:
```yaml
name: Parameterized ML Training

on:
  workflow_dispatch:
    inputs:
      model_size:
        description: 'Model size (small, medium, large)'
        required: true
        default: 'medium'
      epochs:
        description: 'Number of training epochs'
        required: true
        default: '10'

jobs:
  train:
    runs-on:
      - machine
      - gpu=A10G
    steps:
      - uses: actions/checkout@v4
      - name: Run training
        run: |
          python train.py \
            --model-size ${{ github.event.inputs.model_size }} \
            --epochs ${{ github.event.inputs.epochs }}
```
## Scheduled Workflows
Run workflows on a schedule:
```yaml
on:
  schedule:
    # Run daily at 2:00 AM UTC
    - cron: '0 2 * * *'
```
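A schedule can also be combined with `workflow_dispatch`, so the same workflow can additionally be run on demand from the Actions tab:

```yaml
on:
  schedule:
    - cron: '0 2 * * *' # Run daily at 2:00 AM UTC
  workflow_dispatch: {} # Allow manual runs as well
```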
## Error Handling and Debugging
### Debugging GPU Issues
If you encounter GPU-related issues:
- Check GPU availability:

  ```yaml
  - name: Check GPU
    run: |
      nvidia-smi
      python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
  ```

- Verify CUDA version compatibility:

  ```yaml
  - name: Check CUDA version
    run: |
      nvcc --version
      python -c "import torch; print('CUDA version:', torch.version.cuda)"
  ```

- Monitor GPU usage during training:

  ```yaml
  - name: Train with monitoring
    run: |
      nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5 &
      python train.py
  ```
### Common Issues and Solutions
| Issue | Possible Solution |
|---|---|
| Out of memory errors | Reduce batch size or use a larger GPU |
| CUDA version mismatch | Specify compatible library versions in requirements.txt |
| Spot instance interruptions | Implement checkpointing and retries |
| Long startup times | Pre-build and cache Docker images |
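For the out-of-memory case in particular, one approach is to back off the batch size and retry. A sketch, assuming PyTorch (`run_epoch` is a hypothetical function standing in for your training loop):

```python
# Hypothetical sketch: halve the batch size whenever CUDA runs out of memory
import torch

def train_with_backoff(run_epoch, batch_size=64, min_batch_size=4):
    while batch_size >= min_batch_size:
        try:
            run_epoch(batch_size)
            return batch_size # this batch size fit in GPU memory
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache() # release cached blocks before retrying
            batch_size //= 2
            print(f"OOM: retrying with batch_size={batch_size}")
    raise RuntimeError("Could not fit even the minimum batch size on this GPU")
```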
## Best Practices
- Use checkpointing to save progress during long-running jobs (see the sketch after this list)
- Cache dependencies to speed up workflow runs:

  ```yaml
  - name: Cache pip packages
    uses: actions/cache@v3
    with:
      path: ~/.cache/pip
      key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
  ```

- Monitor GPU utilization to ensure efficient resource usage
- Use spot instances for non-critical or fault-tolerant workloads
- Implement timeouts to prevent runaway jobs:

  ```yaml
  jobs:
    train:
      timeout-minutes: 120 # Fail after 2 hours
  ```
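For the checkpointing practice above, a minimal sketch assuming PyTorch (the `checkpoint.pt` filename, model, and epoch count are arbitrary placeholders):

```python
# Minimal resumable training loop: load the last checkpoint if one exists
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

start_epoch = 0
if os.path.exists(CKPT_PATH):
    # Resume after a spot interruption or a re-run
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 10):
    # Stand-in for a real training epoch
    loss = model(torch.randn(32, 10)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Save state every epoch so an interrupted job can pick up where it left off
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,
    )
```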
## Next Steps
- Explore our Example Repository for a ready-to-use workflow template
- Check our GPU runner specifications for detailed hardware information
- Learn about Cost Optimization to maximize your budget