CPU vs GPU
This page helps you pick the right Machine runner for your workload. The short version: use a GPU only if your code uses CUDA or calls a GPU library. For everything else, CPU is faster to start, cheaper, and simpler.
Decision tree
Recommended runner by workload
| Workload | Recommended GPU | $/min (spot) | Why |
|---|---|---|---|
| Inference, ≤7B model FP16 | T4G | $0.004 | 14 GB fits 16 GB VRAM with KV cache headroom |
| Inference, 7–13B 4-bit quantized | L4 | $0.006 | 13B quantized to ~7 GB fits 24 GB easily |
| Inference, 7–13B FP16 or ≤30B 4-bit | L40S | $0.016 | 13B FP16 needs ~26 GB which exceeds L4’s 24 GB |
| High-throughput inference at scale | Inferentia2 | $0.003 | Cheapest accelerator option; purpose-built for inference (not a general-purpose GPU) |
| Computer vision (real-time) | A10G | $0.011–$0.019 | Best latency-to-cost ratio for CV workloads |
| QLoRA fine-tune ≤13B | T4G or T4 | $0.004 | 4-bit quantization keeps memory under 16 GB |
| QLoRA fine-tune 30B | L4 | $0.006 | 20 GB fits 24 GB |
| QLoRA fine-tune 70B | L40S | $0.016 | 46 GB tight but fits 48 GB |
| LoRA fine-tune 7B (FP16) | L4 or A10G | $0.006–$0.011 | ~15 GB won’t fit a 16 GB card once headroom is added; needs 24 GB |
| LoRA fine-tune 13B (FP16) | L40S | $0.016 | ~28 GB exceeds 24 GB cards; the 48 GB L40S leaves headroom |
| Full fine-tune anything >7B | Multi-GPU or Trainium | varies | Single GPU not viable above 7B for full fine-tunes |
| AWS Neuron native training | Trainium | $0.006 | Built for AWS Neuron SDK workloads |
| Builds, tests, CI (no CUDA) | CPU | $0.0003–$0.010 | GPU is overkill and slower to start |
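The table above reduces to a simple rule: estimate VRAM, add headroom, pick the cheapest card that covers it. A minimal sketch of that rule, using only the GPU names, VRAM sizes, and spot prices listed above (A10G, Inferentia2, and Trainium are omitted for brevity):

```python
# Single-GPU options from the table above: (name, VRAM in GB, $/min spot).
GPUS = [
    ("T4G", 16, 0.004),
    ("L4", 24, 0.006),
    ("L40S", 48, 0.016),
]

def cheapest_gpu(vram_needed_gb, headroom=1.1):
    """Cheapest single GPU whose VRAM covers the estimate plus ~10% headroom.

    Returns None when no single card fits (multi-GPU territory).
    """
    for name, vram, price in sorted(GPUS, key=lambda g: g[2]):
        if vram >= vram_needed_gb * headroom:
            return name
    return None
```

For example, a 7B FP16 inference workload (~14 GB) resolves to the T4G, while 13B FP16 (~26 GB) skips past the L4’s 24 GB to the L40S, matching the table. The `headroom` factor is an assumption; tighten it deliberately (as the 70B QLoRA row does) only when you know the workload’s overhead.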
Memory math (the actual numbers)
These are approximate VRAM requirements per technique. Actual usage varies with batch size, sequence length, and overhead.
| Technique | 7B | 13B | 30B | 70B |
|---|---|---|---|---|
| Full fine-tune (FP16) | ~67 GB | ~125 GB | ~288 GB | ~672 GB |
| LoRA (FP16) | ~15 GB | ~28 GB | ~63 GB | ~146 GB |
| QLoRA (8-bit) | ~9 GB | ~17 GB | ~38 GB | ~88 GB |
| QLoRA (4-bit) | ~5 GB | ~9 GB | ~20 GB | ~46 GB |
| Inference (FP16) | ~14 GB | ~26 GB | ~60 GB | ~140 GB |
| Inference (4-bit) | ~4 GB | ~7 GB | ~16 GB | ~40 GB |
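The rows above scale close to linearly with parameter count, so they can be compressed into bytes-per-parameter multipliers. The multipliers below are back-solved from the table’s 7B column — rough planning numbers, not guarantees, since batch size, sequence length, and framework overhead all shift the real figure:

```python
# Approximate bytes per parameter, back-solved from the 7B column above.
BYTES_PER_PARAM = {
    "full_ft_fp16": 9.6,    # weights + gradients + optimizer states
    "lora_fp16": 2.2,
    "qlora_8bit": 1.3,
    "qlora_4bit": 0.7,
    "inference_fp16": 2.0,  # 2 bytes per weight, before KV cache
    "inference_4bit": 0.6,
}

def vram_gb(params_billion, technique):
    """Approximate VRAM in GB: parameters (billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[technique]
```

Checking against the table: `vram_gb(13, "inference_fp16")` gives 26 GB and `vram_gb(70, "full_ft_fp16")` gives 672 GB, matching the rows above within rounding.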
Sources: Modal, Runpod, Hyperstack, Spheron, Crusoe, Hugging Face Forum.
When to use a CPU runner instead
Use CPU runners for:
- Builds and compilation — even large C/C++/Rust projects fit in CPU memory and parallelize well
- Test suites — even GPU-related tests, if they use CPU-only mocks
- Data preprocessing without GPU acceleration
- Linting, formatting, code analysis
- Container builds (`docker build`)
- Web app builds (Next.js, Vite, Webpack, Astro)
- Database operations and migrations
- Anything that doesn’t import a CUDA library
CPU runners start faster, cost less, and offer up to 64 vCPUs and 128 GB RAM — more than enough for most non-GPU work.
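The “doesn’t import a CUDA library” test can be automated. This is a heuristic sketch, not an exhaustive check — the import patterns below are common GPU-library names, and a repo with no match is almost certainly fine on a CPU runner:

```python
import pathlib
import re

# Common GPU-library usage patterns (assumed list, extend as needed).
GPU_HINTS = re.compile(r"\b(torch\.cuda|cupy|pycuda|triton|tensorrt)\b")

def needs_gpu(repo_path):
    """True if any Python file in the repo mentions a known GPU library."""
    for path in pathlib.Path(repo_path).rglob("*.py"):
        if GPU_HINTS.search(path.read_text(errors="ignore")):
            return True
    return False
```

A `False` result here is a strong signal to default to CPU; a `True` result still deserves a look, since GPU-related tests running against CPU-only mocks (as noted above) don’t need the real hardware.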
Architecture: X64 vs ARM64 (CPU)
| Architecture | When to pick it |
|---|---|
| X64 (Intel/AMD) | Maximum compatibility — most prebuilt binaries, Docker images, and CI tooling assume X64. Default. |
| ARM64 (Graviton) | ~15–20% cheaper at the same vCPU count. Great for Go, Rust, Java, Python, and modern container workloads. The T4G GPU is also ARM64. |
Most modern Linux software runs cleanly on ARM64. If your toolchain has no X64-only dependency (such as a prebuilt binary with no ARM build), ARM64 saves money.
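If a build script needs to branch on the architecture it’s running on, the standard `platform` module reports it. A small sketch mapping Python’s machine strings to this page’s labels:

```python
import platform

def runner_arch():
    """Map the reported machine type to this page's architecture labels."""
    m = platform.machine().lower()
    # Linux reports "aarch64" on ARM64; macOS reports "arm64".
    return "arm64" if m in ("arm64", "aarch64") else "x64"
```

This is useful, for instance, for selecting the matching Docker image tag in a multi-arch build.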
Next steps
- GPU Runners — every GPU type and configuration
- CPU Runners — every CPU configuration
- Pricing — full per-minute rates
- Cost Optimization — spot, checkpointing, right-sizing