Language Model Arena

The Language Model Arena allows you to benchmark and compare different large language models across a variety of standardized tasks. This implementation leverages Machine GPU runners to efficiently evaluate models, providing valuable insights into their performance characteristics.

Prerequisites

You will need to have completed the Quick Start guide.

Use Case Overview

Why might you want to evaluate language models?

  • Compare the performance of different models on specific tasks
  • Identify which model is best suited for your particular use case
  • Measure how well your fine-tuned models perform against baselines
  • Understand the strengths and weaknesses of various models across different reasoning tasks

How It Works

The Language Model Arena uses the lm-evaluation-harness framework to benchmark models on a set of standardized tasks. The workflow is defined in a GitHub Actions workflow file and triggered on-demand with configurable parameters.

The benchmarking process:

  1. Loads two specified models (from Hugging Face or local paths)
  2. Runs them through the same set of evaluation tasks
  3. Generates comparison charts to visualize their relative performance
  4. Stores the results as GitHub workflow artifacts
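
As a rough illustration of steps 1 and 2, the sketch below runs both models through the same tasks using the lm-evaluation-harness Python API (lm_eval.simple_evaluate). The model names, task subset, example limit, and output layout are modeled on the workflow defaults shown in the next section; the actual workflow steps may invoke the harness differently, so treat this as an assumption-laden sketch rather than the workflow's real code.

import json
import os

import lm_eval

# Defaults modeled on the workflow inputs shown in the next section; the
# output layout mirrors what the plotting step expects, but exact paths in
# the real workflow may differ.
MODELS = {
    "model_1": ("Qwen/Qwen2.5-3B-Instruct", "main"),
    "model_2": ("unsloth/Llama-3.1-8B-Instruct", "main"),
}
TASKS = ["hellaswag", "arc_easy", "gsm8k"]  # subset of the default task list
LIMIT = 100  # examples_limit default

for label, (model_id, revision) in MODELS.items():
    # Steps 1-2: load each model from Hugging Face and run the same tasks
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},revision={revision}",
        tasks=TASKS,
        limit=LIMIT,
    )
    # Store per-model metrics so they can be compared and charted later
    os.makedirs(f"benchmarks/{label}", exist_ok=True)
    with open(f"benchmarks/{label}/results_example.json", "w") as f:
        json.dump({"results": results["results"]}, f, indent=2)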

Workflow Implementation

The Language Model Arena is implemented as a GitHub Actions workflow that can be triggered manually. Here’s the workflow definition:

name: LM Eval Benchmarking
on:
  workflow_dispatch:
    inputs:
      model_1:
        type: string
        required: false
        description: 'The first model to benchmark'
        default: 'Qwen/Qwen2.5-3B-Instruct'
      model_1_revision:
        type: string
        required: false
        description: 'The first model revision to benchmark'
        default: 'main'
      model_2:
        type: string
        required: false
        description: 'The second model to benchmark'
        default: 'unsloth/Llama-3.1-8B-Instruct'
      model_2_revision:
        type: string
        required: false
        description: 'The second model revision to benchmark'
        default: 'main'
      tasks:
        type: string
        required: false
        description: 'The tasks to benchmark'
        default: 'hellaswag,arc_easy,mathqa,truthfulqa,drop,arc_challenge,gsm8k,mmlu_abstract_algebra,mmlu_college_mathematics'
      examples_limit:
        type: string
        required: false
        description: 'The number of examples to use for benchmarking'
        default: '100'

jobs:
  benchmark:
    name: LLM Eval Benchmarking
    runs-on:
      - machine
      - gpu=L40S
      - cpu=4
      - ram=32
      - architecture=x64
      - tenancy=spot
    steps:
      # Workflow steps for running the benchmark
      # ...
      - name: Generate Benchmark Comparison Chart
        run: |
          ls -l ./benchmarks/
          python ./llm_benchmark_plotting.py
      - name: Upload Benchmark Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmarks/
          retention-days: 90

Evaluation Tasks

The Language Model Arena evaluates models on a variety of reasoning tasks, including:

  • hellaswag: Common sense reasoning about events
  • arc_easy & arc_challenge: Multiple-choice science questions
  • mathqa: Mathematical reasoning and problem-solving
  • truthfulqa: Measuring truthfulness in model responses
  • drop: Reading comprehension with numerical reasoning
  • gsm8k: Grade school math word problems
  • mmlu_abstract_algebra & mmlu_college_mathematics: Advanced mathematics knowledge

These tasks are designed to test different aspects of a model’s reasoning capabilities, from simple common sense to complex mathematical problem-solving.
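
Task names passed in the tasks input must match the identifiers registered in lm-evaluation-harness. If you customize the list, a quick way to sanity-check it is sketched below; it assumes lm-eval 0.4+, where lm_eval.tasks.TaskManager exposes the registered task names.

from lm_eval.tasks import TaskManager

# Candidate value for the workflow's `tasks` input
wanted = "hellaswag,arc_easy,mathqa,gsm8k,mmlu_abstract_algebra".split(",")

# TaskManager.all_tasks lists every task and group name the harness knows about
available = set(TaskManager().all_tasks)

missing = [name for name in wanted if name not in available]
print("Unknown task names:", missing if missing else "none")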

Using Machine GPU Runners

This benchmark leverages Machine GPU runners to provide the necessary computing power for efficient evaluation. The workflow is configured to use:

  • L40S GPU: A powerful GPU with 48GB of VRAM for handling larger models
  • Spot instance: To optimize for cost while maintaining performance
  • Configurable resources: CPU, RAM, and architecture specifications

You can also specify regions for the benchmark to run in:

runs-on:
  - machine
  - gpu=L40S
  - cpu=4
  - ram=32
  - architecture=x64
  - tenancy=spot
  - regions=us-east-1,us-east-2

This ensures your benchmarks run efficiently while optimizing for cost. Machine will search for the lowest spot price within the specified regions.

Benchmark Results

After running the workflow, the benchmark produces:

  1. JSON files containing detailed performance metrics for each task
  2. Comparison charts visualizing the performance differences between models
  3. GitHub workflow artifacts that retain all results for 90 days

Here’s an example of how the benchmark plotting works:

import glob
import json
import os
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt

current_dir = Path(__file__).parent
model_results = {"model_1": "./benchmarks/model_1", "model_2": "./benchmarks/model_2"}  # illustrative result directories
tasks = set()
metrics = defaultdict(dict)

# Extract metrics from JSON files
for model, dir_path in model_results.items():
    result_files = glob.glob(os.path.join(dir_path, "results_*.json"))
    if result_files:
        # Use the most recently written results file for each model
        latest_file = max(result_files, key=os.path.getctime)
        with open(latest_file) as f:
            data = json.load(f)
        for task, task_metrics in data['results'].items():
            tasks.add(task)
            metrics[model][task] = task_metrics

# Generate comparison charts for each task
for task in sorted(tasks):
    plt.figure(figsize=(12, 7))
    plt.title(f'{task} Comparison: Model 1 vs Model 2')
    # ... Chart generation code ...
    output_path = current_dir / f'benchmarks/{task}_comparison.png'
    plt.savefig(output_path)
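
Once you have downloaded the benchmark-results artifact, you can also inspect the raw JSON directly. A minimal sketch follows; the directory name is illustrative, and the harness appends a timestamp to each results file name, so the glob pattern matches whatever is actually there.

import glob
import json

# Pick the newest results file for one model; the directory name is illustrative
latest = max(glob.glob("benchmarks/model_1/results_*.json"))

with open(latest) as f:
    data = json.load(f)

# Print every numeric metric the harness reported for each task
for task, task_metrics in sorted(data["results"].items()):
    for metric, value in task_metrics.items():
        if isinstance(value, (int, float)):
            print(f"{task:35s} {metric:25s} {value:.4f}")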

Getting Started

To run the Language Model Arena benchmark:

  1. Fork the MachineHQ/language-model-arena repository
  2. Navigate to the Actions tab in your repository
  3. Select the “LM Eval Benchmarking” workflow
  4. Click “Run workflow” and configure your parameters:
    • Select the models you want to compare
    • Choose which tasks to benchmark
    • Set the number of examples to evaluate
  5. Run the workflow and wait for results
  6. Download the benchmark artifacts to view the comparison charts
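
You can also trigger runs without the Actions UI by calling GitHub's workflow_dispatch REST endpoint. Below is a minimal sketch using requests; the owner, workflow file name, and token environment variable are placeholders you would replace with values from your fork.

import os

import requests

OWNER = "<your-github-username>"
REPO = "language-model-arena"
WORKFLOW_FILE = "lm-eval-benchmarking.yml"  # placeholder: use your workflow's actual file name

response = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "ref": "main",
        "inputs": {
            "model_1": "Qwen/Qwen2.5-3B-Instruct",
            "model_2": "unsloth/Llama-3.1-8B-Instruct",
            "tasks": "hellaswag,arc_easy,gsm8k",
            "examples_limit": "100",
        },
    },
    timeout=30,
)
response.raise_for_status()  # the API returns 204 No Content on success

The GitHub CLI (gh workflow run) offers the same capability if you prefer not to call the API directly.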

Best Practices

  • Balance depth vs. breadth: More examples provide more accurate results but take longer to evaluate
  • Choose appropriate tasks: Select tasks that are relevant to your use case
  • Compare similar models: Pair models with similar parameter counts for a more meaningful comparison
  • Use consistent evaluation settings: When comparing different versions of a model, keep all other parameters constant

Next Steps