Cost Optimization

This guide provides practical strategies to help you minimize costs while maximizing the effectiveness of your GPU-accelerated workflows on Machine. By implementing these optimization techniques, you can significantly reduce your spending without sacrificing performance.

Understanding Machine Pricing

Before diving into optimization strategies, it’s essential to understand how Machine pricing works:

  1. Credit-based system: You pay for GPU time using credits
  2. Usage-based billing: You only pay for the exact time your workflows run
  3. GPU-specific rates: Different GPU types have different credit consumption rates
  4. Resource-based pricing: Additional CPU cores and RAM affect pricing
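
Put together, the credit cost of a job is roughly the per-minute rate of the selected GPU (plus any extra CPU/RAM) multiplied by runtime, with a discount for spot capacity. The sketch below uses entirely made-up placeholder rates to show how these factors combine; it is not Machine's actual price list:

# Hypothetical illustration only: these rates are placeholders, not real Machine prices.
CREDITS_PER_MINUTE = {"t4": 1.0, "l4": 1.5, "a10g": 2.0, "l40s": 4.0}
SPOT_MULTIPLIER = 0.15   # spot can be up to ~85% cheaper than on-demand

def estimate_credits(gpu: str, runtime_minutes: float, spot: bool = False) -> float:
    """Rough estimate: GPU rate x runtime, discounted for spot (CPU/RAM surcharges omitted)."""
    rate = CREDITS_PER_MINUTE[gpu.lower()]
    if spot:
        rate *= SPOT_MULTIPLIER
    return rate * runtime_minutes

# e.g. a 90-minute run on a spot A10G instance, with these placeholder rates
print(estimate_credits("a10g", 90, spot=True))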

Key Cost Optimization Strategies

1. Use Spot Instances with Intelligent Retries

Spot instances can save you up to 85% compared to on-demand instances. Combined with intelligent retry mechanisms, they offer a strong balance of cost and reliability:

runs-on:
  - machine
  - gpu=a10g
  - tenancy=spot

Best for:

  • Non-critical workloads
  • Jobs that can be retried if interrupted
  • Development and testing workflows

Implementation tips:

  • Implement checkpointing to save progress regularly
  • Set up automatic retry mechanisms for spot instance interruptions
  • Use the intelligent retry patterns from our example workflows

Implementing Intelligent Retries

Our LLM Supervised Fine-Tuning and GRPO Fine-Tuning workflows demonstrate how to implement robust retry mechanisms:

name: Workflow with Retry

on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      max_attempts:
        type: number
        description: 'The maximum number of attempts'
        default: 5
      # Other workflow parameters

The intelligent retry mechanism works through these steps:

  1. The workflow starts with a specified attempt number (default: 1)
  2. During execution, checkpoints are periodically saved to Hugging Face Hub or another storage location
  3. If the job completes successfully, the workflow ends
  4. If the job fails due to a spot instance interruption:
    • A custom GitHub Action detects the failure was due to spot instance preemption
    • The workflow calculates the next attempt number
    • If within the maximum attempts limit, it triggers a new workflow run with an incremented attempt number
    • All original parameters are preserved for the new attempt
  5. When a new attempt starts, it downloads the latest checkpoint and resumes from that point

This ensures that even if a spot instance is reclaimed, your progress isn’t lost, and the job can continue from the last checkpoint on a new instance.
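
In the example workflows, the re-dispatch step is handled by a custom GitHub Action. As a rough illustration of the same idea, the sketch below triggers a new workflow_dispatch run through the GitHub REST API; the repository name, workflow file name, and use of the requests library are assumptions for the sketch, not the action's actual implementation.

# Illustrative sketch only -- the example workflows use a custom GitHub Action for this.
import os
import requests

def dispatch_next_attempt(repo, workflow_file, attempt, max_attempts, inputs, ref="main"):
    """Re-run the workflow with an incremented attempt number, preserving inputs."""
    next_attempt = int(attempt) + 1
    if next_attempt > int(max_attempts):
        print("Maximum attempts reached; not retrying.")
        return False
    url = f"https://api.github.com/repos/{repo}/actions/workflows/{workflow_file}/dispatches"
    payload = {
        "ref": ref,
        # Preserve all original parameters, overriding only the attempt counter
        "inputs": {**inputs, "attempt": str(next_attempt)},
    }
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    requests.post(url, json=payload, headers=headers, timeout=30).raise_for_status()
    print(f"Dispatched attempt {next_attempt} of {max_attempts}")
    return True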

2. Implement Checkpointing to Hugging Face

Save your progress regularly to avoid losing work due to spot instance interruptions:

# Example checkpoint saving code
import torch
from huggingface_hub import HfApi

def save_checkpoint(model, optimizer, epoch, step, hf_repo_id):
    # Bundle the model and optimizer state with training progress
    checkpoint = {
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'epoch': epoch,
        'step': step
    }
    # Save to disk first
    torch.save(checkpoint, 'checkpoint.pt')
    # Push to Hugging Face Hub
    api = HfApi()
    api.upload_file(
        path_or_fileobj="checkpoint.pt",
        path_in_repo="checkpoint.pt",
        repo_id=hf_repo_id,
        repo_type="model"
    )
    print(f"Checkpoint saved at epoch {epoch}, step {step}")

To resume from a checkpoint:

# Example checkpoint loading code
import torch
from huggingface_hub import hf_hub_download

def load_checkpoint(model, optimizer, hf_repo_id):
    try:
        # Download the latest checkpoint from Hugging Face Hub
        checkpoint_path = hf_hub_download(
            repo_id=hf_repo_id,
            filename="checkpoint.pt",
            repo_type="model"
        )
        # Restore model, optimizer, and training progress
        checkpoint = torch.load(checkpoint_path)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        epoch = checkpoint['epoch']
        step = checkpoint['step']
        print(f"Resumed from epoch {epoch}, step {step}")
        return epoch, step
    except Exception:
        print("No checkpoint found, starting from scratch")
        return 0, 0
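
A minimal sketch of how the two helpers fit into a training loop; model, optimizer, dataloader, and train_one_step are placeholders for your own code:

# Sketch only: resume from the last checkpoint, then save periodically.
CHECKPOINT_EVERY = 100                                 # steps between checkpoints
NUM_EPOCHS = 3
HF_REPO_ID = "your-username/your-checkpoint-repo"      # hypothetical repo id

start_epoch, step = load_checkpoint(model, optimizer, HF_REPO_ID)

for epoch in range(start_epoch, NUM_EPOCHS):
    for batch in dataloader:
        train_one_step(model, optimizer, batch)        # placeholder training step
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(model, optimizer, epoch, step, HF_REPO_ID)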

3. Right-size Your GPU Resources

Choose the smallest GPU that meets your needs:

Workload              | Recommended GPU | Why
----------------------|-----------------|------------------------------------
Testing, small models | T4G/T4 (16GB)   | Lowest cost per hour
Medium models         | L4 (24GB)       | Good balance of memory/performance
Large models          | A10G (24GB)     | More memory and compute
Very large models     | L40S (48GB)     | Maximum memory capacity

Implementation:

# Instead of always using the largest GPU:
runs-on:
  - machine
  - gpu=t4  # Choose the right-sized GPU for your task
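
To verify that a smaller GPU is actually sufficient, you can log peak GPU memory during a short test run. A minimal PyTorch sketch (the workload in the middle is whatever representative steps you choose):

import torch

# Reset the peak-memory counter before running a few representative steps
torch.cuda.reset_peak_memory_stats()

# ... run a handful of representative training or inference steps here ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Peak GPU memory: {peak_gb:.1f} GiB of {total_gb:.1f} GiB available")
# If peak usage fits comfortably on a 16GB T4, there is no need to pay for a larger GPU.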

4. Optimize Job Duration

The less time your job runs, the less you pay:

Use mixed precision training:

# In PyTorch:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

with autocast():
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Implement efficient data loading:

# Optimize PyTorch DataLoader:
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,      # Adjust based on CPU cores
    pin_memory=True,
    prefetch_factor=2
)

Use efficient model architectures:

  • Consider more efficient model architectures (e.g., MobileNet vs. ResNet)
  • Use pruning or quantization where possible
  • Consider LoRA or other parameter-efficient fine-tuning methods
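
For example, LoRA via the Hugging Face peft library trains only small adapter matrices instead of the full model, which shortens runtime and reduces memory. A minimal sketch; the base model name and target modules are illustrative and depend on your architecture:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model -- substitute the model you are fine-tuning
model = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # depends on the model architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model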

5. Optimize Resource Allocation

Specify only the resources you actually need:

runs-on:
  - machine
  - gpu=l4
  - cpu=4   # Only request what you need
  - ram=16  # Only request what you need

Monitoring tip: Run a test job with GPU monitoring to determine actual resource usage:

steps:
  - name: Monitor resource usage
    run: |
      nvidia-smi dmon -s pucvmet -d 5 > gpu_metrics.log &
      NVIDIA_PID=$!
      vmstat 5 > cpu_metrics.log &
      VMSTAT_PID=$!
      # Run your workload
      python train.py
      # Stop monitoring
      kill $NVIDIA_PID $VMSTAT_PID
      # Print metrics to the job log
      cat gpu_metrics.log cpu_metrics.log

6. Use Regional Selection Effectively

Different regions have different pricing:

runs-on:
  - machine
  - gpu=t4
  - regions=us-east-1,us-west-2  # Regions with best pricing

Tips:

  • Include multiple regions to ensure availability
  • Consider data sovereignty requirements when selecting regions

7. Implement Smart Caching

Reduce computation by caching dependencies and intermediate results:

steps:
  - name: Cache dependencies
    uses: actions/cache@v3
    with:
      path: ~/.cache/pip
      key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}

  - name: Cache preprocessed data
    uses: actions/cache@v3
    with:
      path: ./data/processed
      key: preprocessed-data-v1
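
On the Python side, the preprocessing step can then skip work whenever the cached directory is already populated. A minimal sketch; the paths and preprocess function are placeholders:

from pathlib import Path

PROCESSED_DIR = Path("./data/processed")   # matches the cached path above

def ensure_preprocessed_data(raw_dir="./data/raw"):
    """Run preprocessing only when the cache restore came back empty."""
    if PROCESSED_DIR.exists() and any(PROCESSED_DIR.iterdir()):
        print("Using cached preprocessed data")
        return PROCESSED_DIR
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    preprocess(raw_dir, PROCESSED_DIR)     # placeholder for your preprocessing logic
    return PROCESSED_DIR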

8. Use Workflow Conditions and Filters

Only run GPU jobs when necessary:

name: ML Pipeline

on:
  push:
    branches: [ main ]
    paths:
      - 'model/**'
      - 'data/**'

jobs:
  train:
    # Only runs when model or data files change
    runs-on:
      - machine
      - gpu=l4

Or use conditional execution:

jobs:
  validate:
    runs-on: ubuntu-latest
    outputs:
      should_train: ${{ steps.check.outputs.should_train }}
    steps:
      - id: check
        run: |
          # Logic to determine if training is needed
          echo "should_train=true" >> $GITHUB_OUTPUT

  train:
    needs: validate
    if: ${{ needs.validate.outputs.should_train == 'true' }}
    runs-on:
      - machine
      - gpu=a10g
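
The check step's logic depends on your project. One possible approach, sketched below, is to train only when files under model/ or data/ changed in the latest commit; it assumes a checkout with enough history (for example fetch-depth: 2) and writes the decision to the step's output:

# Illustrative check: train only if model/ or data/ changed in the last commit.
import os
import subprocess

changed = subprocess.run(
    ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

should_train = any(path.startswith(("model/", "data/")) for path in changed)

# Expose the decision to downstream jobs via the step output file
with open(os.environ["GITHUB_OUTPUT"], "a") as fh:
    fh.write(f"should_train={str(should_train).lower()}\n")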

Monitoring and Analyzing Costs

Using the Machine Dashboard

The Machine dashboard provides detailed insights into your GPU usage and costs:

  1. Job tracking: The dashboard shows all your previously run jobs, currently running jobs, and queued jobs in one place
  2. Cost visibility: For completed jobs, you can see the exact runtime and cost in credits used
  3. Usage aggregation: View daily aggregates for all on-demand and spot credits consumed within a specified date range
  4. Resource utilization: See GPU, CPU, and memory allocation for each job

This information helps you identify optimization opportunities and track spending patterns over time.

Real-World Cost Optimization Examples

Example 1: LLM Fine-tuning with Retry Mechanism

The LLM Supervised Fine-Tuning workflow demonstrates effective cost optimization:

name: Supervised Fine-Tuning with Retry

on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      # Other parameters...

jobs:
  train:
    name: Training
    runs-on:
      - machine
      - gpu=T4
      - cpu=4
      - ram=16
      - tenancy=spot  # Cost savings with spot instances
    steps:
      # Checkpoint handling steps
      - name: Download previous checkpoint if available
        run: |
          if [[ "${{ inputs.attempt }}" -gt "1" ]]; then
            python download_checkpoint.py --repo "${{ inputs.hf_repo }}"
          fi

      # Training with checkpointing
      - name: Run training
        run: |
          python train.py \
            --checkpoint-every 100 \
            --save-to-hf

Key cost optimization techniques:

  • Using spot instances (up to ~85% cost reduction)
  • Implementing automatic checkpointing and retry mechanisms
  • Right-sizing resources (T4 GPU, 4 CPU cores)
  • Using LoRA for parameter-efficient fine-tuning

Example 2: GRPO Fine-Tuning with Spot Instance Resilience

The GRPO Fine-Tuning workflow shows how to implement resilient training on spot instances:

jobs:
  train:
    name: Training
    runs-on:
      - machine
      - gpu=L40S  # Needed for larger models
      - tenancy=spot
    steps:
      # Setup steps...

      # Checkpoint handling
      - name: Check for existing checkpoints
        id: check-checkpoint
        run: |
          python check_checkpoints.py \
            --hf-repo "${{ inputs.hf_repo }}" \
            --set-output

      # Training with progressive saving
      - name: Training
        run: |
          python train.py \
            --checkpoint-dir ./checkpoints \
            --save-steps 100 \
            --push-to-hub

This approach combines:

  • Spot instance cost savings
  • Automatic checkpoint detection and resumption
  • Periodic saving to Hugging Face Hub
  • Intelligent retries for interrupted jobs

Best Practices Summary

  1. Always use spot instances with intelligent retries for non-critical workloads
  2. Implement regular checkpointing to Hugging Face Hub to handle spot instance interruptions
  3. Right-size your GPU, CPU, and RAM for each specific task
  4. Use the Machine dashboard to monitor job costs and resource utilization
  5. Use mixed precision training where possible
  6. Cache dependencies and datasets to reduce job time

Next Steps