Cost Optimization
This guide provides practical strategies to help you minimize costs while maximizing the effectiveness of your GPU-accelerated workflows on Machine. By implementing these optimization techniques, you can significantly reduce your spending without sacrificing performance.
Understanding Machine Pricing
Before diving into optimization strategies, it’s essential to understand how Machine pricing works:
- Credit-based system: You pay for GPU time using credits
- Usage-based billing: You only pay for the exact time your workflows run
- GPU-specific rates: Different GPU types have different credit consumption rates
- Resource-based pricing: Additional CPU cores and RAM affect pricing
Key Cost Optimization Strategies
1. Use Spot Instances with Intelligent Retries
Spot instances can save you up to 85% compared to on-demand instances. Combined with intelligent retry mechanisms, they offer a strong balance of cost and reliability:
```yaml
runs-on:
  - machine
  - gpu=a10g
  - tenancy=spot
```
Best for:
- Non-critical workloads
- Jobs that can be retried if interrupted
- Development and testing workflows
Implementation tips:
- Implement checkpointing to save progress regularly
- Set up automatic retry mechanisms for spot instance interruptions
- Use the intelligent retry patterns from our example workflows
Implementing Intelligent Retries
Our LLM Supervised Fine-Tuning and GRPO Fine-Tuning workflows demonstrate how to implement robust retry mechanisms:
```yaml
name: Workflow with Retry

on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      max_attempts:
        type: number
        description: 'The maximum number of attempts'
        default: 5
      # Other workflow parameters
```
The intelligent retry mechanism works through these steps:
- The workflow starts with a specified attempt number (default: 1)
- During execution, checkpoints are periodically saved to Hugging Face Hub or another storage location
- If the job completes successfully, the workflow ends
- If the job fails due to a spot instance interruption:
  - A custom GitHub Action detects that the failure was due to spot instance preemption
  - The workflow calculates the next attempt number
  - If within the maximum attempts limit, it triggers a new workflow run with an incremented attempt number
  - All original parameters are preserved for the new attempt
- When a new attempt starts, it downloads the latest checkpoint and resumes from that point
This ensures that even if a spot instance is reclaimed, your progress isn’t lost, and the job can continue from the last checkpoint on a new instance.
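The re-dispatch step itself can be expressed as a small follow-up job. The sketch below is illustrative rather than the custom action used in our example workflows: it treats any failure of the training job as retryable (the real workflows check specifically for spot preemption) and uses the `gh` CLI, preinstalled on GitHub-hosted runners, to trigger a new `workflow_dispatch` run with an incremented attempt number:

```yaml
jobs:
  # train: ...  (the GPU training job, omitted here)

  retry:
    needs: train
    if: failure()   # simplification: the real workflows check for spot preemption specifically
    runs-on: ubuntu-latest
    steps:
      - name: Re-dispatch with incremented attempt
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          next_attempt=$(( ${{ inputs.attempt }} + 1 ))
          if [ "$next_attempt" -le "${{ inputs.max_attempts }}" ]; then
            gh workflow run "${{ github.workflow }}" \
              --repo "${{ github.repository }}" \
              -f attempt="$next_attempt" \
              -f max_attempts="${{ inputs.max_attempts }}"
          fi
```

Running the retry job on a standard runner means it can still execute even after the spot runner hosting the training job has been reclaimed.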
2. Implement Checkpointing to Hugging Face
Save your progress regularly to avoid losing work due to spot instance interruptions:
```python
# Example checkpoint saving code
import torch
from huggingface_hub import HfApi

def save_checkpoint(model, optimizer, epoch, step, hf_repo_id):
    # Save model state
    checkpoint = {
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'epoch': epoch,
        'step': step
    }

    # Save to disk first
    torch.save(checkpoint, 'checkpoint.pt')

    # Push to Hugging Face Hub
    api = HfApi()
    api.upload_file(
        path_or_fileobj="checkpoint.pt",
        path_in_repo="checkpoint.pt",
        repo_id=hf_repo_id,
        repo_type="model"
    )

    print(f"Checkpoint saved at epoch {epoch}, step {step}")
```
To resume from a checkpoint:
```python
# Example checkpoint loading code
import torch
from huggingface_hub import hf_hub_download

def load_checkpoint(model, optimizer, hf_repo_id):
    try:
        # Download the latest checkpoint from Hugging Face Hub
        checkpoint_path = hf_hub_download(
            repo_id=hf_repo_id,
            filename="checkpoint.pt",
            repo_type="model"
        )

        # Load checkpoint
        checkpoint = torch.load(checkpoint_path)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        epoch = checkpoint['epoch']
        step = checkpoint['step']

        print(f"Resumed from epoch {epoch}, step {step}")
        return epoch, step
    except Exception:
        print("No checkpoint found, starting from scratch")
        return 0, 0
```
3. Right-size Your GPU Resources
Choose the smallest GPU that meets your needs:
| Workload | Recommended GPU | Why |
|---|---|---|
| Testing, small models | T4G/T4 (16GB) | Lowest cost per hour |
| Medium models | L4 (24GB) | Good balance of memory/performance |
| Large models | A10G (24GB) | More memory and compute |
| Very large models | L40S (48GB) | Maximum memory capacity |
Implementation:
```yaml
# Instead of always using the largest GPU:
runs-on:
  - machine
  - gpu=t4  # Choose the right-sized GPU for your task
```
4. Optimize Job Duration
The less time your job runs, the less you pay:
Use mixed precision training:
```python
# In PyTorch:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

with autocast():
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```
Implement efficient data loading:
```python
# Optimize PyTorch DataLoader:
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,      # Adjust based on CPU cores
    pin_memory=True,
    prefetch_factor=2
)
```
Use efficient model architectures:
- Consider more efficient model architectures (e.g., MobileNet vs. ResNet)
- Use pruning or quantization where possible
- Consider LoRA or other parameter-efficient fine-tuning methods (see the sketch below)
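As a concrete illustration of the last point, here is a minimal LoRA sketch using the Hugging Face `peft` library; the base model (`gpt2`) and the hyperparameters are placeholder choices for illustration, not values taken from our example workflows:

```python
# Minimal LoRA sketch with the Hugging Face peft library.
# The base model and hyperparameters below are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=16,                        # rank of the low-rank update matrices
    lora_alpha=32,               # scaling factor applied to the updates
    lora_dropout=0.05,
    target_modules=["c_attn"],   # attention projection layers in GPT-2
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model
```

Because only the small adapter matrices receive gradients, each optimizer step is cheaper, and checkpoints can be much smaller when only the adapter weights are saved, which also speeds up the resume cycle described above.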
5. Optimize Resource Allocation
Specify only the resources you actually need:
```yaml
runs-on:
  - machine
  - gpu=l4
  - cpu=4   # Only request what you need
  - ram=16  # Only request what you need
```
Monitoring tip: Run a test job with GPU monitoring to determine actual resource usage:
```yaml
steps:
  - name: Monitor resource usage
    run: |
      nvidia-smi dmon -s pucvmet -d 5 > gpu_metrics.log &
      NVIDIA_PID=$!
      vmstat 5 > cpu_metrics.log &
      VMSTAT_PID=$!

      # Run your workload
      python train.py

      # Stop monitoring
      kill $NVIDIA_PID $VMSTAT_PID

      # Print the metrics to the job log
      cat gpu_metrics.log cpu_metrics.log
```
6. Use Regional Selection Effectively
Different regions have different pricing:
```yaml
runs-on:
  - machine
  - gpu=t4
  - regions=us-east-1,us-west-2  # Regions with best pricing
```
Tips:
- Include multiple regions to ensure availability
- Consider data sovereignty requirements when selecting regions
7. Implement Smart Caching
Reduce computation by caching dependencies and intermediate results:
```yaml
steps:
  - name: Cache dependencies
    uses: actions/cache@v3
    with:
      path: ~/.cache/pip
      key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}

  - name: Cache preprocessed data
    uses: actions/cache@v3
    with:
      path: ./data/processed
      key: preprocessed-data-v1
```
8. Use Workflow Conditions and Filters
Only run GPU jobs when necessary:
```yaml
name: ML Pipeline

on:
  push:
    branches: [ main ]
    paths:
      - 'model/**'
      - 'data/**'

jobs:
  train:
    # Only runs when model or data files change
    runs-on:
      - machine
      - gpu=l4
```
Or use conditional execution:
```yaml
jobs:
  validate:
    runs-on: ubuntu-latest
    outputs:
      should_train: ${{ steps.check.outputs.should_train }}

    steps:
      - id: check
        run: |
          # Logic to determine if training is needed
          echo "should_train=true" >> $GITHUB_OUTPUT

  train:
    needs: validate
    if: ${{ needs.validate.outputs.should_train == 'true' }}
    runs-on:
      - machine
      - gpu=a10g
```
Monitoring and Analyzing Costs
Using the Machine Dashboard
The Machine dashboard provides detailed insights into your GPU usage and costs:
- Job tracking: The dashboard shows all your previously run jobs, currently running jobs, and queued jobs in one place
- Cost visibility: For completed jobs, you can see the exact runtime and cost in credits used
- Usage aggregation: View daily aggregates for all on-demand and spot credits consumed within a specified date range
- Resource utilization: See GPU, CPU, and memory allocation for each job
This information helps you identify optimization opportunities and track spending patterns over time.
Real-World Cost Optimization Examples
Example 1: LLM Fine-tuning with Retry Mechanism
The LLM Supervised Fine-Tuning workflow demonstrates effective cost optimization:
```yaml
name: Supervised Fine-Tuning with Retry

on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      # Other parameters...

jobs:
  train:
    name: Training
    runs-on:
      - machine
      - gpu=T4
      - cpu=4
      - ram=16
      - tenancy=spot  # Cost savings with spot instances

    steps:
      # Checkpoint handling steps
      - name: Download previous checkpoint if available
        run: |
          if [[ "${{ inputs.attempt }}" -gt "1" ]]; then
            python download_checkpoint.py --repo "${{ inputs.hf_repo }}"
          fi

      # Training with checkpointing
      - name: Run training
        run: |
          python train.py \
            --checkpoint-every 100 \
            --save-to-hf
```
Key cost optimization techniques:
- Using spot instances (up to ~85% cost reduction)
- Implementing automatic checkpointing and retry mechanisms
- Right-sizing resources (T4 GPU, 4 CPU cores)
- Using LoRA for parameter-efficient fine-tuning
Example 2: GRPO Fine-Tuning with Spot Instance Resilience
The GRPO Fine-Tuning workflow shows how to implement resilient training on spot instances:
```yaml
jobs:
  train:
    name: Training
    runs-on:
      - machine
      - gpu=L40S  # Needed for larger models
      - tenancy=spot

    steps:
      # Setup steps...

      # Checkpoint handling
      - name: Check for existing checkpoints
        id: check-checkpoint
        run: |
          python check_checkpoints.py \
            --hf-repo "${{ inputs.hf_repo }}" \
            --set-output

      # Training with progressive saving
      - name: Training
        run: |
          python train.py \
            --checkpoint-dir ./checkpoints \
            --save-steps 100 \
            --push-to-hub
```
This approach combines:
- Spot instance cost savings
- Automatic checkpoint detection and resumption
- Periodic saving to Hugging Face Hub
- Intelligent retries for interrupted jobs
Best Practices Summary
- Always use spot instances with intelligent retries for non-critical workloads
- Implement regular checkpointing to Hugging Face Hub to handle spot instance interruptions
- Right-size your GPU, CPU, and RAM for each specific task
- Use the Machine dashboard to monitor job costs and resource utilization
- Use mixed precision training where possible
- Cache dependencies and datasets to reduce job time
Next Steps
- Learn about GPU runner specifications to choose the right hardware
- Check out our Workflow Setup guide for detailed configuration instructions