CPU vs GPU
This page helps you pick the right Machine runner for your workload. The short version: use a GPU only if your code uses CUDA or calls a GPU library. For everything else, CPU is faster to start, cheaper, and simpler.
Decision tree
Recommended runner by workload
| Workload | Recommended GPU | $/min (spot) | Why |
|---|---|---|---|
| Inference, ≤7B model FP16 | T4G | $0.004 | 14 GB fits 16 GB VRAM with KV cache headroom |
| Inference, 7–13B 4-bit quantized | L4 | $0.006 | 13B quantized to ~7 GB fits 24 GB easily |
| Inference, 7–13B FP16 or ≤30B 4-bit | L40S | $0.016 | 13B FP16 needs ~26 GB which exceeds L4’s 24 GB |
| High-throughput inference at scale | Inferentia2 | $0.003 | Cheapest accelerator option; purpose-built for inference (not a general-purpose GPU) |
| Computer vision (real-time) | A10G | $0.011–$0.019 | Best latency-to-cost ratio for CV workloads |
| QLoRA fine-tune ≤13B | T4G or T4 | $0.004 | 4-bit quantization keeps memory under 16 GB |
| QLoRA fine-tune 30B | L4 | $0.006 | 20 GB fits 24 GB |
| QLoRA fine-tune 70B | L40S | $0.016 | 46 GB tight but fits 48 GB |
| LoRA fine-tune 7B (FP16) | L4 or A10G | $0.006–$0.011 | ~15 GB won’t fit a 16 GB card once headroom is added; needs 24 GB |
| LoRA fine-tune 13B (FP16) | L40S | $0.016 | ~28 GB exceeds 24 GB cards; the 48 GB L40S leaves headroom |
| Full fine-tune anything >7B | Multi-GPU or Trainium | varies | Single GPU not viable above 7B for full fine-tunes |
| AWS Neuron native training | Trainium | $0.006 | Built for AWS Neuron SDK workloads |
| Builds, tests, CI (no CUDA) | CPU | $0.0003–$0.010 | GPU is overkill and slower to start |
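The table above reduces to a simple rule: estimate VRAM, add headroom, pick the cheapest card that covers it. A minimal sketch of that rule, using only the GPU names, VRAM sizes, and spot prices listed above (A10G, Inferentia2, and Trainium are omitted for brevity):

```python
# Single-GPU options from the table above: (name, VRAM in GB, $/min spot).
GPUS = [
    ("T4G", 16, 0.004),
    ("L4", 24, 0.006),
    ("L40S", 48, 0.016),
]

def cheapest_gpu(vram_needed_gb, headroom=1.1):
    """Cheapest single GPU whose VRAM covers the estimate plus ~10% headroom.

    Returns None when no single card fits (multi-GPU territory).
    """
    for name, vram, price in sorted(GPUS, key=lambda g: g[2]):
        if vram >= vram_needed_gb * headroom:
            return name
    return None
```

For example, a 7B FP16 inference workload (~14 GB) resolves to the T4G, while 13B FP16 (~26 GB) skips past the L4’s 24 GB to the L40S, matching the table. The `headroom` factor is an assumption; tighten it deliberately (as the 70B QLoRA row does) only when you know the workload’s overhead.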
Memory math (the actual numbers)
These are approximate VRAM requirements per technique. Actual usage varies with batch size, sequence length, and overhead.
| Technique | 7B | 13B | 30B | 70B |
|---|---|---|---|---|
| Full fine-tune (FP16) | ~67 GB | ~125 GB | ~288 GB | ~672 GB |
| LoRA (FP16) | ~15 GB | ~28 GB | ~63 GB | ~146 GB |
| QLoRA (8-bit) | ~9 GB | ~17 GB | ~38 GB | ~88 GB |
| QLoRA (4-bit) | ~5 GB | ~9 GB | ~20 GB | ~46 GB |
| Inference (FP16) | ~14 GB | ~26 GB | ~60 GB | ~140 GB |
| Inference (4-bit) | ~4 GB | ~7 GB | ~16 GB | ~40 GB |
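The rows above scale close to linearly with parameter count, so they can be compressed into bytes-per-parameter multipliers. The multipliers below are back-solved from the table’s 7B column — rough planning numbers, not guarantees, since batch size, sequence length, and framework overhead all shift the real figure:

```python
# Approximate bytes per parameter, back-solved from the 7B column above.
BYTES_PER_PARAM = {
    "full_ft_fp16": 9.6,    # weights + gradients + optimizer states
    "lora_fp16": 2.2,
    "qlora_8bit": 1.3,
    "qlora_4bit": 0.7,
    "inference_fp16": 2.0,  # 2 bytes per weight, before KV cache
    "inference_4bit": 0.6,
}

def vram_gb(params_billion, technique):
    """Approximate VRAM in GB: parameters (billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[technique]
```

Checking against the table: `vram_gb(13, "inference_fp16")` gives 26 GB and `vram_gb(70, "full_ft_fp16")` gives 672 GB, matching the rows above within rounding.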
Sources: Modal, Runpod, Hyperstack, Spheron, Crusoe, Hugging Face Forum.
When to use a CPU runner instead
Use CPU runners for:
- Builds and compilation — even large C/C++/Rust projects fit in CPU memory and parallelize well
- Test suites — even GPU-related tests, if they use CPU-only mocks
- Data preprocessing without GPU acceleration
- Linting, formatting, code analysis
- Container builds (`docker build`)
- Web app builds (Next.js, Vite, Webpack, Astro)
- Database operations and migrations
- Anything that doesn’t import a CUDA library
CPU runners start faster, cost less, and offer up to 64 vCPUs and 128 GB RAM — more than enough for most non-GPU work.
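The “doesn’t import a CUDA library” test can be automated. This is a heuristic sketch, not an exhaustive check — the import patterns below are common GPU-library names, and a repo with no match is almost certainly fine on a CPU runner:

```python
import pathlib
import re

# Common GPU-library usage patterns (assumed list, extend as needed).
GPU_HINTS = re.compile(r"\b(torch\.cuda|cupy|pycuda|triton|tensorrt)\b")

def needs_gpu(repo_path):
    """True if any Python file in the repo mentions a known GPU library."""
    for path in pathlib.Path(repo_path).rglob("*.py"):
        if GPU_HINTS.search(path.read_text(errors="ignore")):
            return True
    return False
```

A `False` result here is a strong signal to default to CPU; a `True` result still deserves a look, since GPU-related tests running against CPU-only mocks (as noted above) don’t need the real hardware.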
Architecture: X64 vs ARM64 (CPU)
| Architecture | When to pick it |
|---|---|
| X64 (Intel/AMD) | Maximum compatibility — most prebuilt binaries, Docker images, and CI tooling assume X64. Default. |
| ARM64 (Graviton) | ~15–20% cheaper at the same vCPU count. Great for Go, Rust, Java, Python, and modern container workloads. The T4G GPU is also ARM64. |
Most modern Linux software runs cleanly on ARM64. If your toolchain has no X64-only dependency (such as a prebuilt binary with no ARM build), ARM64 saves money.
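If a build script needs to branch on the architecture it’s running on, the standard `platform` module reports it. A small sketch mapping Python’s machine strings to this page’s labels:

```python
import platform

def runner_arch():
    """Map the reported machine type to this page's architecture labels."""
    m = platform.machine().lower()
    # Linux reports "aarch64" on ARM64; macOS reports "arm64".
    return "arm64" if m in ("arm64", "aarch64") else "x64"
```

This is useful, for instance, for selecting the matching Docker image tag in a multi-arch build.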
Next steps
- GPU Runners — every GPU type and configuration
- CPU Runners — every CPU configuration
- Pricing — full per-minute rates
- Cost Optimization — spot, checkpointing, right-sizing