GitHub Actions Syntax
This reference guide provides comprehensive information on the GitHub Actions syntax for integrating Machine.dev GPU runners into your workflows.
Basic Workflow Structure
A GitHub Actions workflow is defined in YAML format and must include the following elements to use Machine.dev:
```yaml
name: My GPU Workflow

on:
  # Trigger events (push, pull_request, workflow_dispatch, etc.)
  workflow_dispatch:

jobs:
  gpu-job:
    name: GPU-Accelerated Job
    runs-on:
      - machine        # Required to use Machine.dev
      - gpu=a10g       # Specify GPU type
      - tenancy=spot   # Specify tenancy type (spot or on_demand)
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      # Add more steps...
```
Machine.dev Runner Labels
Required Label
```yaml
runs-on:
  - machine  # Required for all Machine.dev runners
```
Runner Type Labels
CPU Runners
For high-performance CPU runners (must specify the number of vCPUs):
```yaml
runs-on:
  - machine
  - cpu=16  # Required: specify number of vCPUs (2, 4, 8, 16, 32, 48, or 64)
```
CPU Runner Specifications:
- Available configurations: 2, 4, 8, 16, 32, 48, or 64 vCPUs
- RAM scales with vCPUs (e.g., 16 vCPUs = 32GB RAM, 64 vCPUs = 128GB RAM)
- X64 or ARM64 architecture options
- Ideal for builds, testing, and data processing
GPU Type Labels
Specify the type of GPU you want to use:
```yaml
runs-on:
  - machine
  - gpu=<gpu-type>
```
Available GPU types:
| Label | GPU Model | VRAM | CUDA Cores | Architecture | Use Cases |
|---|---|---|---|---|---|
| gpu=t4g | NVIDIA T4 (Graviton) | 16 GB | 2,560 | ARM64 | Entry-level ML, inference |
| gpu=t4 | NVIDIA T4 | 16 GB | 2,560 | X64 | Training, inference, computer vision |
| gpu=l4 | NVIDIA L4 | 24 GB | 7,680 | X64 | ML/DL training, inference, vision AI |
| gpu=a10g | NVIDIA A10G | 24 GB | 9,216 | X64 | Model training, rendering, simulation |
| gpu=l40s | NVIDIA L40S | 48 GB | 18,176 | X64 | Generative AI, large models, computer vision |
AWS AI Accelerators:
| Label | Accelerator | Memory | Architecture | Use Cases |
|---|---|---|---|---|
| gpu=trainium | AWS Trainium | 32 GB | X64 | High-performance training |
| gpu=inferentia2 | AWS Inferentia2 | 32 GB | X64 | Optimized inference |
Note: When using gpu=t4g, architecture is automatically set to arm64 regardless of any architecture label.
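For instance, the following fragment provisions an arm64 runner even though no architecture label is present:

```yaml
runs-on:
  - machine
  - gpu=t4g  # Graviton-based T4; runner architecture becomes arm64 automatically
```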
Tenancy Type Labels
Specify whether you want on-demand or spot instances:
```yaml
runs-on:
  - machine
  - gpu=a10g
  - tenancy=<tenancy-type>
```
Available tenancy types:
| Label | Description | Cost | Stability |
|---|---|---|---|
| tenancy=on_demand | On-demand instances with guaranteed availability | Higher cost | Highest stability |
| tenancy=spot | Spot instances that may be preempted | Up to 85% lower cost | May be interrupted |
Region Labels
Optionally specify one or more AWS regions where your runner should be provisioned. Use a comma-separated list of full region codes:
```yaml
runs-on:
  - machine
  - gpu=a10g
  - regions=us-east-1,us-west-2
```
Available regions:
| Region Code | Location |
|---|---|
| us-east-1 | US East (N. Virginia) |
| us-east-2 | US East (Ohio) |
| us-west-2 | US West (Oregon) |
| ca-central-1 | Canada (Central) |
| eu-south-2 | Europe (Spain) |
| ap-southeast-2 | Asia Pacific (Sydney) |
If no region is specified, Machine searches globally across all enabled regions to find the most cost-effective option.
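Omitting the label entirely is therefore a valid choice when cost matters more than locality; a minimal fragment:

```yaml
runs-on:
  - machine
  - gpu=a10g
  # No regions label: Machine searches all enabled regions for the cheapest option
```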
Storage Labels
Configure the EBS gp3 root volume for your runner:
```yaml
runs-on:
  - machine
  - gpu=a10g
  - disk_size=500         # Volume size in GB (default: 100, max: 16384)
  - disk_iops=10000       # IOPS (default: 6000, range: 6000-16000)
  - disk_throughput=750   # Throughput in MB/s (default: 250, range: 250-1000)
```
| Label | Description | Default | Range |
|---|---|---|---|
| disk_size=&lt;GB&gt; | Root volume size | 100 GB | 1-16,384 GB |
| disk_iops=&lt;IOPS&gt; | Provisioned IOPS | 6,000 | 6,000-16,000 |
| disk_throughput=&lt;MB/s&gt; | Provisioned throughput | 250 MB/s | 250-1,000 MB/s |
Note: Increasing IOPS and throughput above the defaults incurs additional EBS charges. See Pricing for details.
Metrics Labels
Control CloudWatch metrics collection for your runner:
```yaml
runs-on:
  - machine
  - gpu=a10g
  - metrics=true         # Enable/disable metrics (default: true)
  - metrics_interval=10  # Collection interval in seconds (default: 60, range: 1-60)
```
| Label | Description | Default | Valid Values |
|---|---|---|---|
| metrics=&lt;bool&gt; | Enable metrics collection | true | true, false |
| metrics_interval=&lt;seconds&gt; | Collection interval | 60 | 1-60 |
When enabled, metrics are collected for CPU, memory, disk, network, and GPU utilization. Results appear as sparkline charts on the Machine dashboard after job completion.
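If you don't need the dashboard charts, collection can be switched off with the same label; a minimal fragment:

```yaml
runs-on:
  - machine
  - gpu=a10g
  - metrics=false  # No metrics collected; nothing appears on the dashboard
```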
Complete Examples
CPU Runner Example
```yaml
name: CPU Build Example

on:
  push:
    branches: [main]

jobs:
  build:
    name: Build Application
    runs-on:
      - machine
      - cpu=16  # Required: specify number of vCPUs
      - tenancy=spot
      - regions=us-east-1,us-west-2
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Build project
        run: make -j$(nproc) all
```
GPU Runner Example
```yaml
name: GPU Training Example

on:
  workflow_dispatch:
    inputs:
      model_type:
        description: 'Type of model to train'
        required: true
        default: 'small'
        type: choice
        options:
          - small
          - medium
          - large

jobs:
  train-model:
    name: Train Machine Learning Model
    runs-on:
      - machine
      - gpu=a10g
      - tenancy=spot
      - regions=us-east-1
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Train model
        run: |
          python train.py --model-size=${{ github.event.inputs.model_type }}

      - name: Upload model artifacts
        uses: actions/upload-artifact@v3
        with:
          name: trained-model
          path: model/
```
Matrix Strategy for Multiple GPUs
You can use a matrix strategy to run your workflow on multiple GPU types:
```yaml
jobs:
  gpu-matrix:
    strategy:
      matrix:
        gpu: [t4, l4, a10g]
      fail-fast: false
    runs-on:
      - machine
      - gpu=${{ matrix.gpu }}
    steps:
      - name: Run benchmark on ${{ matrix.gpu }}
        run: |
          python benchmark.py --gpu-type=${{ matrix.gpu }}
```
Job Dependencies
You can coordinate between CPU and GPU jobs:
```yaml
jobs:
  build:
    runs-on:
      - machine
      - cpu=8  # Build with 8 vCPUs
    steps:
      - name: Build application
        id: build
        run: make build
    outputs:
      artifact_path: ${{ steps.build.outputs.path }}

  train-model:
    needs: build
    runs-on:
      - machine
      - gpu=a10g  # Train with GPU
    steps:
      - name: Train on prepared data
        run: |
          python train.py --artifact=${{ needs.build.outputs.artifact_path }}
```
Environment Variables
Machine.dev provides several environment variables that can be accessed in your workflow:
| Variable | Description | Example Value |
|---|---|---|
| MACHINE_GPU_TYPE | Type of GPU | a10g |
| MACHINE_GPU_COUNT | Number of GPUs | 1 |
| MACHINE_TENANCY | Tenancy type | spot |
| MACHINE_REGION | Region code | us-east-1 |
| CUDA_VISIBLE_DEVICES | CUDA devices available | 0 |
Example usage:
```yaml
steps:
  - name: Check environment
    run: |
      echo "GPU Type: $MACHINE_GPU_TYPE"
      echo "GPU Count: $MACHINE_GPU_COUNT"
      echo "Tenancy: $MACHINE_TENANCY"
      echo "Region: $MACHINE_REGION"
```
Error Handling for Spot Instances
When using spot instances, you should implement error handling to manage potential interruptions:
```yaml
jobs:
  spot-job:
    runs-on:
      - machine
      - gpu=a10g
      - tenancy=spot
    steps:
      # Save checkpoints frequently
      - name: Train with checkpoints
        run: |
          python train.py --checkpoint-freq=5

      # Upload partial results
      - name: Upload checkpoints
        if: always()  # Run even if job is cancelled
        uses: actions/upload-artifact@v3
        with:
          name: model-checkpoints
          path: checkpoints/
```
Best Practices
Set Appropriate Timeouts
- Set appropriate timeouts for your jobs:
```yaml
jobs:
  gpu-job:
    timeout-minutes: 60  # Limit to 1 hour
```
Use Conditional Execution
- Use conditional execution to skip GPU tasks when appropriate:
```yaml
steps:
  - name: GPU-intensive task
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    run: python heavy_task.py
```
Optimize Workflow Triggers
- Optimize workflow triggers to avoid unnecessary runs:
```yaml
on:
  push:
    paths:
      - 'model/**'
      - 'data/**'
    branches:
      - main
```
Use Workflow Concurrency
- Use workflow concurrency to avoid redundant runs:
```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```
Common Issues and Solutions
| Issue | Solution |
|---|---|
| Runner not available | Try a different GPU type or region |
| Out of memory error | Use a GPU with more VRAM or optimize batch size |
| Slow performance | Check data loading bottlenecks or use a faster GPU |
| Spot instance terminated | Use dedicated instances for critical workloads |
| Long provisioning time | Pre-warm runners with scheduled workflows |
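The last remedy above can be sketched as a scheduled workflow. The workflow name, cron expression, and GPU choice here are placeholders, and whether a later job actually reuses the warmed capacity depends on Machine.dev's runner pooling behavior:

```yaml
name: Pre-warm GPU Runner

on:
  schedule:
    - cron: '45 8 * * 1-5'  # Hypothetical: shortly before working hours, weekdays (UTC)

jobs:
  warm:
    runs-on:
      - machine
      - gpu=a10g
      - tenancy=spot
    steps:
      - name: No-op to provision the runner
        run: nvidia-smi  # Any trivial command; the job exists only to spin up capacity
```

Treat this as a starting point and verify the pre-warming behavior against your own provisioning times.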