Parallel Hyperparameter Tuning

The Parallel Hyperparameter Tuning workflow lets you systematically explore combinations of key training parameters to identify the optimal configuration for your machine learning models. It uses Machine GPU runners to run multiple training jobs concurrently, significantly reducing the time needed to find the best-performing configuration.

Prerequisites

You will need to have completed the Quick Start guide.

Use Case Overview

Why might you want to use parallel hyperparameter tuning?

  • Find optimal model configurations more efficiently by testing multiple parameter sets simultaneously
  • Reduce the total time needed for hyperparameter search
  • Systematically compare model performance across different configurations
  • Automate the process of identifying the best-performing models

How It Works

The Parallel Hyperparameter Tuning workflow uses GitHub Actions’ matrix strategy to run multiple training jobs concurrently. Each job trains a ResNet model on the CIFAR-10 dataset with a different combination of hyperparameters, and the workflow can be triggered on demand.

The tuning process:

  1. Defines a matrix of hyperparameter combinations to explore (the sketch after this list shows the grid those values expand to)
  2. Launches multiple GPU-powered training jobs concurrently, one for each combination
  3. Saves performance metrics from each training run as artifacts
  4. Aggregates and compares results across all runs
  5. Generates a comprehensive comparison report
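GitHub Actions expands the matrix into every combination of the listed values. Conceptually, this is the same grid you would get from itertools.product; the snippet below (purely illustrative, not part of the repository) prints the four jobs the example matrix produces:

from itertools import product

# The workflow's matrix values; GitHub Actions turns every pairing into its own job.
learning_rates = [0.001, 0.0005]
batch_sizes = [32, 64]

for lr, bs in product(learning_rates, batch_sizes):
    print(f"job: learning_rate={lr}, batch_size={bs}")
# Four combinations, hence four concurrent training jobs.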

Workflow Implementation

The Parallel Hyperparameter Tuning workflow is implemented in GitHub Actions and runs multiple jobs in parallel. Here’s the workflow definition:

name: ResNet Hyperparameter Tuning

on:
  workflow_dispatch:

jobs:
  hyperparameter_tuning:
    name: Hyperparameter Tuning
    runs-on:
      - machine
      - gpu=T4
      - cpu=4
      - ram=16
      - architecture=x64
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        learning_rate: [0.001, 0.0005]
        batch_size: [32, 64]
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install uv
        uses: astral-sh/setup-uv@v5
      - name: Install dependencies
        run: |
          uv venv .venv --python=3.10
          source .venv/bin/activate
          uv pip install -r requirements.txt
          deactivate
      - name: Train and Evaluate ResNet
        env:
          LEARNING_RATE: ${{ matrix.learning_rate }}
          BATCH_SIZE: ${{ matrix.batch_size }}
        run: |
          source .venv/bin/activate
          python train.py
          deactivate
      - name: Upload metrics artifact
        uses: actions/upload-artifact@v4
        with:
          name: metrics-${{ matrix.learning_rate }}-${{ matrix.batch_size }}
          path: metrics_*.json

  compare_tuning:
    needs: hyperparameter_tuning
    name: Compare Tuning Performance
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install uv
        uses: astral-sh/setup-uv@v5
      - name: Install dependencies
        run: |
          uv venv .venv --python=3.10
          source .venv/bin/activate
          uv pip install -r requirements.txt
          deactivate
      - name: Download all metrics
        uses: actions/download-artifact@v4
        with:
          path: metrics
      - name: Compare Metrics
        run: |
          source .venv/bin/activate
          python compare_metrics.py
          deactivate
      - name: Upload comparison results
        uses: actions/upload-artifact@v4
        with:
          name: comparison-results
          path: model_comparison.csv
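The “Train and Evaluate ResNet” step relies on train.py reading its hyperparameters from the LEARNING_RATE and BATCH_SIZE environment variables and writing a metrics_*.json file for the upload step to pick up. The actual script lives in the template repository; the following is only a minimal sketch of that contract, with placeholder metric names:

import json
import os

# Hyperparameters arrive from the matrix via the step's env block.
learning_rate = float(os.environ.get("LEARNING_RATE", "0.001"))
batch_size = int(os.environ.get("BATCH_SIZE", "32"))

# ... build the ResNet, train it on CIFAR-10, and evaluate it here ...
val_accuracy = 0.0  # placeholder: replace with the real evaluation result

# Write results to a file matching the workflow's metrics_*.json upload pattern.
metrics = {
    "learning_rate": learning_rate,
    "batch_size": batch_size,
    "val_accuracy": val_accuracy,
}
with open(f"metrics_lr{learning_rate}_bs{batch_size}.json", "w") as f:
    json.dump(metrics, f, indent=2)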

Key Features

The power of this implementation comes from several key features:

  1. Matrix Strategy: The workflow defines a matrix of hyperparameters, automatically creating separate jobs for each combination. In this example, we’re exploring two learning rates (0.001, 0.0005) and two batch sizes (32, 64), resulting in 4 concurrent training jobs.

  2. Parallel Execution: Each hyperparameter combination runs as a separate job on its own GPU runner, allowing multiple experiments to run simultaneously rather than sequentially.

  3. Metrics Collection: Each training job produces performance metrics that are saved as artifacts with names that indicate the hyperparameter values used.

  4. Automated Comparison: After all training jobs complete, a separate job downloads all metrics and generates a comparison report, making it easy to identify the best configuration (a sketch of such a comparison script follows this list).
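In outline, the comparison step is straightforward: download-artifact places every metrics-* artifact under metrics/, and compare_metrics.py folds the JSON files into a single model_comparison.csv. The real script is in the template repository; this is a minimal sketch that assumes each file is a flat JSON object containing a val_accuracy field:

import csv
import json
from pathlib import Path

# Each artifact is downloaded into its own subdirectory under metrics/,
# so search recursively for the metrics files produced by the training jobs.
rows = [json.loads(p.read_text()) for p in Path("metrics").rglob("*.json")]

# Rank runs by the assumed accuracy field, best first.
rows.sort(key=lambda r: r.get("val_accuracy", 0.0), reverse=True)

# Flatten everything into one CSV for a side-by-side comparison.
fieldnames = sorted({key for row in rows for key in row})
with open("model_comparison.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote model_comparison.csv with {len(rows)} runs")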

Using Machine GPU Runners

This hyperparameter tuning process leverages Machine GPU runners to provide the necessary computing power for efficient model training. The workflow is configured to use:

  • T4 GPU: An entry-level ML GPU with 16GB VRAM, well-suited for training moderate-sized models
  • Configurable resources: CPU, RAM, and architecture specifications optimized for each training job

The parallel nature of this approach means that you can complete a hyperparameter search in a fraction of the time it would take to run sequentially, even when using the same hardware resources per job.

Best Practices

  • Choose parameters wisely: Select hyperparameters that have the most impact on model performance
  • Start with a broad search: Begin with a wide range of values, then refine with narrower ranges around promising values
  • Consider resource allocation: Adjust CPU/RAM requirements based on your specific model and dataset needs
  • Set appropriate timeouts: Ensure your workflow timeout is sufficient for all jobs to complete
  • Use fail-fast: false so that every combination is evaluated even if some jobs fail, giving you a complete picture

Getting Started

To run the Parallel Hyperparameter Tuning workflow:

  1. Use the MachineHQ/parallel-hyperparameter-tuning repository as a template
  2. Navigate to the Actions tab in your repository
  3. Select the “ResNet Hyperparameter Tuning” workflow
  4. Click “Run workflow” to start the tuning process
  5. Wait for all jobs to complete
  6. Download the comparison-results artifact to identify the best hyperparameter configuration

Customizing the Workflow

You can easily adapt this workflow for your own models and hyperparameters:

  1. Modify the matrix in the workflow file to include your specific hyperparameters (see the sketch after this list)
  2. Update the training script (train.py) to work with your model and dataset
  3. Adjust the metrics collection to capture the performance indicators most relevant to your task
  4. Customize the comparison script (compare_metrics.py) to generate insights tailored to your needs
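As a hypothetical illustration of steps 1 and 2 together, suppose you add weight_decay: [0.0, 0.0001] to the workflow matrix and expose it to the training step as a WEIGHT_DECAY environment variable (both of these are assumptions, not part of the template). The training script then only needs to read and record the new value:

import os

# Hypothetical new hyperparameter, assumed to be added to the workflow matrix
# and exported in the training step's env block as WEIGHT_DECAY.
weight_decay = float(os.environ.get("WEIGHT_DECAY", "0.0"))

# ... pass weight_decay to your optimizer during training ...

# Record it alongside the existing values so it appears as a column in
# model_comparison.csv and the configurations stay distinguishable.
metrics_update = {"weight_decay": weight_decay}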

Next Steps