Language Model Arena

The Language Model Arena allows you to benchmark and compare different large language models across a variety of standardized tasks. This implementation leverages Machine GPU runners to efficiently evaluate models, providing valuable insights into their performance characteristics.

Prerequisites

You will need to have completed the Quick Start guide.

Use Case Overview

Why might you want to evaluate language models?

  • Compare the performance of different models on specific tasks
  • Identify which model is best suited for your particular use case
  • Measure how well your fine-tuned models perform against baselines
  • Understand the strengths and weaknesses of various models across different reasoning tasks

How It Works

The Language Model Arena uses the lm-evaluation-harness framework to benchmark models on a set of standardized tasks. The workflow is defined in a GitHub Actions workflow file and triggered on-demand with configurable parameters.

The benchmarking process:

  1. Loads two specified models (from Hugging Face or local paths)
  2. Runs them through the same set of evaluation tasks
  3. Generates comparison charts to visualize their relative performance
  4. Stores the results as GitHub workflow artifacts
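
As a rough illustration of steps 1 and 2, the sketch below runs both models through the same tasks using the lm-evaluation-harness Python API (lm_eval.simple_evaluate). The model names, task subset, example limit, and output layout are modeled on the workflow defaults shown in the next section; the actual workflow steps may invoke the harness differently, so treat this as an assumption-laden sketch rather than the workflow's real code.

import json
import os

import lm_eval

# Defaults modeled on the workflow inputs shown in the next section; the
# output layout mirrors what the plotting step expects, but exact paths in
# the real workflow may differ.
MODELS = {
    "model_1": ("Qwen/Qwen2.5-3B-Instruct", "main"),
    "model_2": ("unsloth/Llama-3.1-8B-Instruct", "main"),
}
TASKS = ["hellaswag", "arc_easy", "gsm8k"]  # subset of the default task list
LIMIT = 100  # examples_limit default

for label, (model_id, revision) in MODELS.items():
    # Steps 1-2: load each model from Hugging Face and run the same tasks
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},revision={revision}",
        tasks=TASKS,
        limit=LIMIT,
    )
    # Store per-model metrics so they can be compared and charted later
    os.makedirs(f"benchmarks/{label}", exist_ok=True)
    with open(f"benchmarks/{label}/results_example.json", "w") as f:
        json.dump({"results": results["results"]}, f, indent=2)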

Workflow Implementation

The Language Model Arena is implemented as a GitHub Actions workflow that can be triggered manually. Here’s the workflow definition:

name: LM Eval Benchmarking
on:
  workflow_dispatch:
    inputs:
      model_1:
        type: string
        required: false
        description: 'The first model to benchmark'
        default: 'Qwen/Qwen2.5-3B-Instruct'
      model_1_revision:
        type: string
        required: false
        description: 'The first model revision to benchmark'
        default: 'main'
      model_2:
        type: string
        required: false
        description: 'The second model to benchmark'
        default: 'unsloth/Llama-3.1-8B-Instruct'
      model_2_revision:
        type: string
        required: false
        description: 'The second model revision to benchmark'
        default: 'main'
      tasks:
        type: string
        required: false
        description: 'The tasks to benchmark'
        default: 'hellaswag,arc_easy,mathqa,truthfulqa,drop,arc_challenge,gsm8k,mmlu_abstract_algebra,mmlu_college_mathematics'
      examples_limit:
        type: string
        required: false
        description: 'The number of examples to use for benchmarking'
        default: '100'

jobs:
  benchmark:
    name: LLM Eval Benchmarking
    runs-on:
      - machine
      - gpu=L40S
      - cpu=4
      - ram=32
      - architecture=x64
      - tenancy=spot
    steps:
      # Workflow steps for running the benchmark
      # ...
      - name: Generate Benchmark Comparison Chart
        run: |
          ls -l ./benchmarks/
          python ./llm_benchmark_plotting.py
      - name: Upload Benchmark Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmarks/
          retention-days: 90

Evaluation Tasks

The Language Model Arena evaluates models on a variety of reasoning tasks, including:

  • hellaswag: Common sense reasoning about events
  • arc_easy & arc_challenge: Multiple-choice science questions
  • mathqa: Mathematical reasoning and problem-solving
  • truthfulqa: Measuring truthfulness in model responses
  • drop: Reading comprehension with numerical reasoning
  • gsm8k: Grade school math word problems
  • mmlu_abstract_algebra & mmlu_college_mathematics: Advanced mathematics knowledge

These tasks are designed to test different aspects of a model’s reasoning capabilities, from simple common sense to complex mathematical problem-solving.
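
Task names passed in the tasks input must match the identifiers registered in lm-evaluation-harness. If you customize the list, a quick way to sanity-check it is sketched below; it assumes lm-eval 0.4+, where lm_eval.tasks.TaskManager exposes the registered task names.

from lm_eval.tasks import TaskManager

# Candidate value for the workflow's `tasks` input
wanted = "hellaswag,arc_easy,mathqa,gsm8k,mmlu_abstract_algebra".split(",")

# TaskManager.all_tasks lists every task and group name the harness knows about
available = set(TaskManager().all_tasks)

missing = [name for name in wanted if name not in available]
print("Unknown task names:", missing if missing else "none")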

Using Machine GPU Runners

This benchmark leverages Machine GPU runners to provide the necessary computing power for efficient evaluation. The workflow is configured to use:

  • L40S GPU: A powerful GPU with 48GB of VRAM for handling larger models
  • Spot instance: To optimize for cost while maintaining performance
  • Configurable resources: CPU, RAM, and architecture specifications

You can also specify regions for the benchmark to run in:

runs-on:
  - machine
  - gpu=L40S
  - cpu=4
  - ram=32
  - architecture=x64
  - tenancy=spot
  - regions=us-east-1,us-east-2

This ensures your benchmarks run efficiently while optimizing for cost. Machine will search for the lowest spot price within the specified regions.

Benchmark Results

After running the workflow, the benchmark produces:

  1. JSON files containing detailed performance metrics for each task
  2. Comparison charts visualizing the performance differences between models
  3. GitHub workflow artifacts that retain all results for 90 days

Here’s an example of how the benchmark plotting works:

import glob
import json
import os
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt

current_dir = Path(__file__).parent
model_results = {"model_1": "./benchmarks/model_1", "model_2": "./benchmarks/model_2"}  # illustrative result directories
tasks = set()
metrics = defaultdict(dict)

# Extract metrics from JSON files
for model, dir_path in model_results.items():
    result_files = glob.glob(os.path.join(dir_path, "results_*.json"))
    if result_files:
        # Use the most recently written results file for each model
        latest_file = max(result_files, key=os.path.getctime)
        with open(latest_file) as f:
            data = json.load(f)
        for task, task_metrics in data['results'].items():
            tasks.add(task)
            metrics[model][task] = task_metrics

# Generate comparison charts for each task
for task in sorted(tasks):
    plt.figure(figsize=(12, 7))
    plt.title(f'{task} Comparison: Model 1 vs Model 2')
    # ... Chart generation code ...
    output_path = current_dir / f'benchmarks/{task}_comparison.png'
    plt.savefig(output_path)
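
Once you have downloaded the benchmark-results artifact, you can also inspect the raw JSON directly. A minimal sketch follows; the directory name is illustrative, and the harness appends a timestamp to each results file name, so the glob pattern matches whatever is actually there.

import glob
import json

# Pick the newest results file for one model; the directory name is illustrative
latest = max(glob.glob("benchmarks/model_1/results_*.json"))

with open(latest) as f:
    data = json.load(f)

# Print every numeric metric the harness reported for each task
for task, task_metrics in sorted(data["results"].items()):
    for metric, value in task_metrics.items():
        if isinstance(value, (int, float)):
            print(f"{task:35s} {metric:25s} {value:.4f}")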

Getting Started

To run the Language Model Arena benchmark:

  1. Fork the MachineHQ/language-model-arena repository
  2. Navigate to the Actions tab in your repository
  3. Select the “LM Eval Benchmarking” workflow
  4. Click “Run workflow” and configure your parameters:
    • Select the models you want to compare
    • Choose which tasks to benchmark
    • Set the number of examples to evaluate
  5. Run the workflow and wait for results
  6. Download the benchmark artifacts to view the comparison charts
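
You can also trigger runs without the Actions UI by calling GitHub's workflow_dispatch REST endpoint. Below is a minimal sketch using requests; the owner, workflow file name, and token environment variable are placeholders you would replace with values from your fork.

import os

import requests

OWNER = "<your-github-username>"
REPO = "language-model-arena"
WORKFLOW_FILE = "lm-eval-benchmarking.yml"  # placeholder: use your workflow's actual file name

response = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "ref": "main",
        "inputs": {
            "model_1": "Qwen/Qwen2.5-3B-Instruct",
            "model_2": "unsloth/Llama-3.1-8B-Instruct",
            "tasks": "hellaswag,arc_easy,gsm8k",
            "examples_limit": "100",
        },
    },
    timeout=30,
)
response.raise_for_status()  # the API returns 204 No Content on success

The GitHub CLI (gh workflow run) offers the same capability if you prefer not to call the API directly.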

Best Practices

  • Balance depth vs. breadth: More examples provide more accurate results but take longer to evaluate
  • Choose appropriate tasks: Select tasks that are relevant to your use case
  • Compare similar models: Pair models with similar parameter counts for a more meaningful comparison
  • Use consistent evaluation settings: When comparing different versions of a model, keep all other parameters constant

Next Steps