Parallel Hyperparameter Tuning
The Parallel Hyperparameter Tuning workflow allows you to systematically explore combinations of key training parameters to identify the optimal configuration for your machine learning models. This implementation leverages Machine GPU runners to run multiple training iterations concurrently, significantly reducing the time needed to find the best model configuration.
Prerequisites
You will need to have completed the Quick Start guide.
Use Case Overview
Why might you want to use parallel hyperparameter tuning?
- Find optimal model configurations more efficiently by testing multiple parameter sets simultaneously
- Reduce the total time needed for hyperparameter search
- Systematically compare model performance across different configurations
- Automate the process of identifying the best-performing models
How It Works
The Parallel Hyperparameter Tuning workflow uses GitHub Actions’ matrix strategy to run multiple training jobs concurrently. Each job trains a ResNet model on the CIFAR-10 dataset with a different combination of hyperparameters. The workflow is defined in GitHub Actions and can be triggered on-demand.
The tuning process:
- Defines a matrix of hyperparameter combinations to explore
- Launches multiple GPU-powered training jobs concurrently, one for each combination
- Saves performance metrics from each training run as artifacts
- Aggregates and compares results across all runs
- Generates a comprehensive comparison report
Workflow Implementation
The Parallel Hyperparameter Tuning workflow is implemented as a GitHub Actions workflow that runs multiple jobs in parallel. Here’s the workflow definition:
```yaml
name: ResNet Hyperparameter Tuning

on:
  workflow_dispatch:

jobs:
  hyperparameter_tuning:
    name: Hyperparameter Tuning
    runs-on:
      - machine
      - gpu=T4
      - cpu=4
      - ram=16
      - architecture=x64
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        learning_rate: [0.001, 0.0005]
        batch_size: [32, 64]
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install dependencies
        run: |
          uv venv .venv --python=3.10
          source .venv/bin/activate
          uv pip install -r requirements.txt
          deactivate

      - name: Train and Evaluate ResNet
        env:
          LEARNING_RATE: ${{ matrix.learning_rate }}
          BATCH_SIZE: ${{ matrix.batch_size }}
        run: |
          source .venv/bin/activate
          python train.py
          deactivate

      - name: Upload metrics artifact
        uses: actions/upload-artifact@v4
        with:
          name: metrics-${{ matrix.learning_rate }}-${{ matrix.batch_size }}
          path: metrics_*.json

  compare_tuning:
    needs: hyperparameter_tuning
    name: Compare Tuning Performance
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install dependencies
        run: |
          uv venv .venv --python=3.10
          source .venv/bin/activate
          uv pip install -r requirements.txt
          deactivate

      - name: Download all metrics
        uses: actions/download-artifact@v4
        with:
          path: metrics

      - name: Compare Metrics
        run: |
          source .venv/bin/activate
          python compare_metrics.py
          deactivate

      - name: Upload comparison results
        uses: actions/upload-artifact@v4
        with:
          name: comparison-results
          path: model_comparison.csv
```
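The training job depends on train.py honoring a simple contract: read the hyperparameters from the LEARNING_RATE and BATCH_SIZE environment variables set by the matrix, and write results to a file matching the metrics_*.json upload glob. Here is a minimal sketch of that contract; the placeholder accuracy and the exact file-naming scheme are illustrative, not the repository’s actual implementation:

```python
import json
import os

# Hyperparameters arrive via the environment variables set in the
# workflow's "Train and Evaluate ResNet" step.
learning_rate = float(os.environ.get("LEARNING_RATE", "0.001"))
batch_size = int(os.environ.get("BATCH_SIZE", "32"))

# ... the real train.py fits a ResNet on CIFAR-10 here ...
# A placeholder stands in for the evaluation result in this sketch.
final_accuracy = 0.0

# Write metrics to a file matching the metrics_*.json glob that the
# "Upload metrics artifact" step picks up.
metrics = {
    "learning_rate": learning_rate,
    "batch_size": batch_size,
    "accuracy": final_accuracy,
}
out_path = f"metrics_{learning_rate}_{batch_size}.json"
with open(out_path, "w") as f:
    json.dump(metrics, f, indent=2)
print(f"Wrote {out_path}")
```

Because the artifact name embeds the matrix values, each of the four jobs uploads a distinct, identifiable metrics file.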
Key Features
The power of this implementation comes from several key features:
- Matrix Strategy: The workflow defines a matrix of hyperparameters, automatically creating separate jobs for each combination. In this example, we’re exploring two learning rates (0.001, 0.0005) and two batch sizes (32, 64), resulting in 4 concurrent training jobs.
- Parallel Execution: Each hyperparameter combination runs as a separate job on its own GPU runner, allowing multiple experiments to run simultaneously rather than sequentially.
- Metrics Collection: Each training job produces performance metrics that are saved as artifacts with names that indicate the hyperparameter values used.
- Automated Comparison: After all training jobs complete, a separate job downloads all metrics and generates a comparison report, making it easy to identify the best configuration.
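The comparison job depends on compare_metrics.py gathering every downloaded metrics file and producing model_comparison.csv. A minimal sketch, assuming the metrics JSON schema used above; the sample files it creates stand in for the artifacts that actions/download-artifact would place in per-artifact subdirectories under metrics/:

```python
import csv
import json
from pathlib import Path

# Stand-in for downloaded artifacts: download-artifact places each
# artifact in its own subdirectory under metrics/ (values illustrative).
Path("metrics/run1").mkdir(parents=True, exist_ok=True)
Path("metrics/run2").mkdir(parents=True, exist_ok=True)
Path("metrics/run1/metrics_0.001_32.json").write_text(
    json.dumps({"learning_rate": 0.001, "batch_size": 32, "accuracy": 0.82}))
Path("metrics/run2/metrics_0.0005_64.json").write_text(
    json.dumps({"learning_rate": 0.0005, "batch_size": 64, "accuracy": 0.85}))

# Search recursively, since each artifact lives in its own subdirectory.
rows = []
for path in Path("metrics").rglob("metrics_*.json"):
    with open(path) as f:
        rows.append(json.load(f))

# Sort best-first by accuracy so the top row is the winning configuration.
rows.sort(key=lambda r: r.get("accuracy", 0.0), reverse=True)

with open("model_comparison.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["learning_rate", "batch_size", "accuracy"])
    writer.writeheader()
    writer.writerows(rows)

if rows:
    best = rows[0]
    print(f"Best config: lr={best['learning_rate']}, "
          f"batch={best['batch_size']}, accuracy={best['accuracy']}")
```

The resulting CSV is what the final step uploads as the comparison-results artifact.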
Using Machine GPU Runners
This hyperparameter tuning process leverages Machine GPU runners to provide the necessary computing power for efficient model training. The workflow is configured to use:
- T4 GPU: An entry-level ML GPU with 16GB VRAM, well-suited for training moderate-sized models
- Configurable resources: CPU, RAM, and architecture specifications optimized for each training job
The parallel nature of this approach means that you can complete a hyperparameter search in a fraction of the time it would take to run sequentially, even when using the same hardware resources per job.
Best Practices
- Choose parameters wisely: Select hyperparameters that have the most impact on model performance
- Start with a broad search: Begin with a wide range of values, then refine with narrower ranges around promising values
- Consider resource allocation: Adjust CPU/RAM requirements based on your specific model and dataset needs
- Set appropriate timeouts: Ensure your workflow timeout is sufficient for all jobs to complete
- Use fail-fast: false: This setting ensures all combinations are evaluated even if some jobs fail, giving you a complete picture
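The broad-then-narrow search above can be sketched as a simple grid refinement: run a coarse pass first, then build a finer grid around the best value and paste it back into the workflow’s matrix. The specific values and factors below are illustrative:

```python
# Coarse pass: widely spaced learning rates (illustrative values).
coarse_grid = [0.01, 0.001, 0.0001]

# Suppose the coarse search found 0.001 performed best (illustrative).
best_lr = 0.001

# Refined pass: a narrower grid centred on the best coarse value,
# ready to paste into the workflow's matrix.learning_rate list.
refined_grid = [round(best_lr * f, 6) for f in (0.5, 0.75, 1.0, 1.5, 2.0)]
print(refined_grid)
```

Two such passes typically locate a good configuration with far fewer total jobs than one exhaustive fine-grained grid.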
Getting Started
To run the Parallel Hyperparameter Tuning workflow:
- Use the MachineHQ/parallel-hyperparameter-tuning repository as a template
- Navigate to the Actions tab in your repository
- Select the “ResNet Hyperparameter Tuning” workflow
- Click “Run workflow” to start the tuning process
- Wait for all jobs to complete
- Download the comparison-results artifact to identify the best hyperparameter configuration
Customizing the Workflow
You can easily adapt this workflow for your own models and hyperparameters:
- Modify the matrix in the workflow file to include your specific hyperparameters
- Update the training script (train.py) to work with your model and dataset
- Adjust the metrics collection to capture the performance indicators most relevant to your task
- Customize the comparison script (compare_metrics.py) to generate insights tailored to your needs
Next Steps
- Explore the full MachineHQ/parallel-hyperparameter-tuning repository
- Learn about GPU runner specifications