Running Ollama on an AWS GPU Spot Instance for €0.30/Hour

Spin up GPU-powered Ollama on AWS spot instances (roughly €0.30–0.45/hr), run 70B+ models, and tear it down when done. Full automation script with real cost comparisons.

AWS EC2 spot instance terminal running Ollama with a 70B parameter model on an NVIDIA GPU for 0.30 euros per hour

Your homelab can run 7B and 13B models just fine on a decent CPU or a mid-range GPU. But the moment you want to experiment with Llama 3 70B, Mixtral 8x22B, or any model that demands 40GB+ of VRAM, your local hardware hits a wall.

You have two choices: spend €1,500+ on an RTX 4090 or A6000 that sits idle 95% of the time, or rent exactly the GPU you need for exactly as long as you need it.

AWS spot instances make the second option absurdly cheap. A g5.xlarge with an NVIDIA A10G GPU (24GB VRAM) goes for roughly €0.30–0.45/hour on spot pricing — that’s 60–80% less than on-demand. Run it for two hours, tear it down, pay less than a euro.

This guide automates the entire lifecycle: launch, configure, run Ollama, and terminate — all from a single script on your homelab machine.

When this makes sense (and when it doesn’t)

Use spot GPU instances when:

  • You want to test large models (70B+) that exceed your local VRAM
  • You need GPU compute for a few hours per week, not daily
  • You’re fine-tuning or benchmarking and need consistent GPU performance
  • Your local power costs make 24/7 GPU operation expensive

Stick with local hardware when:

  • You use AI models multiple hours every day (the break-even is roughly 2hrs/day — see the cost comparison below)
  • You need instant availability with no spin-up time
  • You’re running models that fit in 8–16GB VRAM (a €300 RTX 4060 handles this)
  • Privacy requirements mean data can’t leave your network

GPU instance options and pricing

Here’s what’s available for AI inference workloads on spot pricing in eu-central-1 (Frankfurt). Prices fluctuate — these are typical ranges:

Instance      GPU        VRAM   Spot price/hr   Good for
g5.xlarge     1× A10G    24GB   €0.30–0.45      13B–34B models, quantized 70B
g5.2xlarge    1× A10G    24GB   €0.45–0.65      Same GPU, more CPU/RAM for preprocessing
g5.4xlarge    1× A10G    24GB   €0.60–0.90      Same GPU, 64GB system RAM
g5.12xlarge   4× A10G    96GB   €1.80–2.70      Full 70B models, multi-GPU inference
p3.2xlarge    1× V100    16GB   €0.90–1.20      Older but capable, good for fine-tuning

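For most people, g5.xlarge is the sweet spot: it has the lowest spot price in the table above, and the A10G’s 24GB of VRAM holds 13B–34B models entirely on the GPU. A 70B model at Q4_K_M is a different matter. It weighs roughly 40GB, so it does not fit in 24GB, and Ollama has to offload the remaining layers to the CPU and system RAM, which is much slower and only works if the instance has enough memory. If you want a 70B model running fully in VRAM, g5.12xlarge is the option that does it.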
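
To sanity-check whether a model fits a given GPU before launching anything, a rough estimate is parameters × bits per weight ÷ 8. The sketch below assumes about 4.85 bits per weight for Q4_K_M, which is an approximation (real files vary by model), and it ignores the extra memory needed for the context cache and runtime overhead. The function names are made up for illustration.

```shell
#!/bin/bash
# Rough model-size estimate: billions of parameters x bits per weight / 8 = GB.
# 4.85 bits/weight for Q4_K_M is an assumed approximation, not an exact figure.
estimate_model_gb() {
  local params_b="$1" bits="${2:-4.85}"
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Succeeds if the estimated weights alone fit in the given VRAM (in GB).
# Context cache and runtime overhead are NOT included, so leave headroom.
fits_in_vram() {
  local params_b="$1" vram_gb="$2" bits="${3:-4.85}"
  awk -v p="$params_b" -v b="$bits" -v v="$vram_gb" \
    'BEGIN { exit !(p * b / 8 <= v) }'
}

estimate_model_gb 70   # 70B at the assumed Q4_K_M density
estimate_model_gb 13   # 13B at the assumed Q4_K_M density
fits_in_vram 70 24 || echo "70B Q4_K_M weights exceed 24GB of VRAM"
fits_in_vram 70 96 && echo "70B Q4_K_M weights fit in 96GB of VRAM"
```

Treat the result as a lower bound: if the estimate is already close to the card's VRAM, assume the model will spill over to system RAM.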

Prerequisites

On your homelab machine (the one you’ll launch from):

# AWS CLI v2 configured with credentials that can launch EC2 instances
aws sts get-caller-identity

# jq for JSON parsing
sudo apt install -y jq

You’ll also need a key pair and security group. If you followed the €10/month AWS stack guide, you already have these. Otherwise:

# Key pairs are regional: create it in the region the launch script uses
aws ec2 create-key-pair \
  --region eu-central-1 \
  --key-name ollama-gpu \
  --key-type ed25519 \
  --query 'KeyMaterial' --output text > ~/.ssh/ollama-gpu.pem
chmod 600 ~/.ssh/ollama-gpu.pem
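
Since the launch script depends on several tools, it can help to check for all of them up front rather than discover a missing one halfway through a launch. This is a minimal sketch; `require_cmds` is a hypothetical helper name, not part of the AWS CLI or the scripts in this guide.

```shell
#!/bin/bash
# Preflight check: report every missing tool at once instead of failing mid-launch.
# Returns 0 if all commands are found on PATH, 1 otherwise.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, at the top of a launch script:
#   require_cmds aws jq curl ssh || exit 1
```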

The launch script

This script handles everything: finds the cheapest availability zone, requests a spot instance, installs Ollama and your chosen model, and gives you a ready-to-use endpoint.

#!/bin/bash
# ollama-spot.sh — Launch an Ollama GPU spot instance on AWS
set -euo pipefail

# ─── Configuration ─────────────────────────────────────────
INSTANCE_TYPE="${1:-g5.xlarge}"
MODEL="${2:-llama3:70b-instruct-q4_K_M}"
REGION="eu-central-1"
KEY_NAME="ollama-gpu"
AMI_OWNER="099720109477"  # Canonical (Ubuntu)

# ─── Find the latest Ubuntu 24.04 AMI ─────────────────────
echo "Finding latest Ubuntu 24.04 AMI..."
AMI_ID=$(aws ec2 describe-images \
  --region "$REGION" \
  --owners "$AMI_OWNER" \
  --filters \
    "Name=name,Values=ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*" \
    "Name=state,Values=available" \
  --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
  --output text)
echo "AMI: $AMI_ID"

# ─── Find cheapest AZ for this instance type ──────────────
# SpotPrice comes back as a string, so convert it to a number before sorting
echo "Checking spot prices across availability zones..."
CHEAPEST=$(aws ec2 describe-spot-price-history \
  --region "$REGION" \
  --instance-types "$INSTANCE_TYPE" \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory | sort_by(@, &to_number(SpotPrice)) | [0]' \
  --output json)

SPOT_PRICE=$(echo "$CHEAPEST" | jq -r '.SpotPrice')
SPOT_AZ=$(echo "$CHEAPEST" | jq -r '.AvailabilityZone')
echo "Best spot price: \$${SPOT_PRICE}/hr in ${SPOT_AZ}"

# ─── Create a security group (if it doesn't exist) ────────
SG_NAME="ollama-spot-sg"
SG_ID=$(aws ec2 describe-security-groups \
  --region "$REGION" \
  --filters "Name=group-name,Values=$SG_NAME" \
  --query 'SecurityGroups[0].GroupId' --output text 2>/dev/null || true)

if [ "$SG_ID" = "None" ] || [ -z "$SG_ID" ]; then
  echo "Creating security group..."
  SG_ID=$(aws ec2 create-security-group \
    --region "$REGION" \
    --group-name "$SG_NAME" \
    --description "Ollama spot instance" \
    --query 'GroupId' --output text)

  # Look up your public IP once; with set -e the script stops here if the lookup fails
  MY_IP=$(curl -4sf ifconfig.me)

  # SSH access (restricted to your IP)
  aws ec2 authorize-security-group-ingress \
    --region "$REGION" \
    --group-id "$SG_ID" \
    --protocol tcp --port 22 --cidr "${MY_IP}/32"

  # Ollama API (restricted to your IP)
  aws ec2 authorize-security-group-ingress \
    --region "$REGION" \
    --group-id "$SG_ID" \
    --protocol tcp --port 11434 --cidr "${MY_IP}/32"
fi
echo "Security group: $SG_ID"

# ─── User data script ─────────────────────────────────────
USERDATA=$(cat << SETUP
#!/bin/bash
set -euo pipefail

# Install NVIDIA drivers
apt-get update
apt-get install -y linux-modules-nvidia-560-server-\$(uname -r) nvidia-utils-560-server
modprobe nvidia

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Configure Ollama to listen on all interfaces
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_NUM_PARALLEL=2"
EOF

systemctl daemon-reload
systemctl restart ollama

# Wait for Ollama to be ready
for i in {1..30}; do
  curl -sf http://localhost:11434/api/tags && break
  sleep 2
done

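# Pull the requested model. cloud-init may run user data without HOME set,
# which the ollama CLI expects, so set it explicitly.
export HOME=/root
ollama pull "$MODEL"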

# Signal that setup is complete
echo "READY" > /tmp/ollama-ready
SETUP
)

# ─── Request the spot instance ─────────────────────────────
echo "Requesting spot instance..."
INSTANCE_ID=$(aws ec2 run-instances \
  --region "$REGION" \
  --image-id "$AMI_ID" \
  --instance-type "$INSTANCE_TYPE" \
  --key-name "$KEY_NAME" \
  --security-group-ids "$SG_ID" \
  --placement "AvailabilityZone=$SPOT_AZ" \
  --instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"one-time","InstanceInterruptionBehavior":"terminate"}}' \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100,"VolumeType":"gp3","Iops":3000,"Throughput":250}}]' \
  --user-data "$USERDATA" \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=ollama-spot},{Key=Purpose,Value=ai-inference}]" \
  --query 'Instances[0].InstanceId' \
  --output text)

echo "Instance: $INSTANCE_ID"
echo "Waiting for instance to start..."
aws ec2 wait instance-running --region "$REGION" --instance-ids "$INSTANCE_ID"

PUBLIC_IP=$(aws ec2 describe-instances \
  --region "$REGION" \
  --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text)

echo ""
echo "════════════════════════════════════════════════════"
echo "  Ollama Spot Instance Launched"
echo "════════════════════════════════════════════════════"
echo "  Instance:  $INSTANCE_ID"
echo "  IP:        $PUBLIC_IP"
echo "  Type:      $INSTANCE_TYPE"
echo "  Spot:      \$${SPOT_PRICE}/hr"
echo "  Model:     $MODEL"
echo ""
echo "  SSH:       ssh -i ~/.ssh/ollama-gpu.pem ubuntu@$PUBLIC_IP"
echo "  API:       http://$PUBLIC_IP:11434"
echo ""
echo "  The instance is installing NVIDIA drivers and"
echo "  pulling the model. This takes 5-10 minutes."
echo "  Check progress:"
echo "    ssh -i ~/.ssh/ollama-gpu.pem ubuntu@$PUBLIC_IP 'tail -f /var/log/cloud-init-output.log'"
echo ""
echo "  When done, terminate with:"
echo "    aws ec2 terminate-instances --region $REGION --instance-ids $INSTANCE_ID"
echo "════════════════════════════════════════════════════"

# Save instance ID for easy termination
echo "$INSTANCE_ID" > /tmp/ollama-spot-instance-id

Make it executable and run:

chmod +x ollama-spot.sh

# Default: g5.xlarge with Llama 3 70B
./ollama-spot.sh

# Or specify instance type and model
./ollama-spot.sh g5.xlarge "mixtral:8x22b-instruct-q4_K_M"

Using the remote Ollama instance

Once the setup completes (5–10 minutes for driver install + model pull), you can use Ollama exactly like you would locally.
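
Rather than guessing when those 5–10 minutes are up, you can poll the remote API until the model shows up. This sketch assumes the model appears in the `/api/tags` response only once the pull has finished; `model_listed` and `wait_for_model` are hypothetical helper names.

```shell
#!/bin/bash
# Succeeds if the /api/tags JSON on stdin lists the given model name.
# A fixed-string match keeps this dependency-free; parsing with jq would be stricter.
model_listed() {
  grep -qF "\"$1\""
}

# Poll the remote Ollama API until the model appears or we run out of tries.
# Arguments: host, model, tries (default 60), delay in seconds (default 15).
wait_for_model() {
  local host="$1" model="$2" tries="${3:-60}" delay="${4:-15}" i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf --max-time 5 "http://$host:11434/api/tags" | model_listed "$model"; then
      echo "ready: $model on $host"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $model on $host" >&2
  return 1
}

# Example:
#   wait_for_model "$PUBLIC_IP" "llama3:70b-instruct-q4_K_M"
```

With the defaults this waits up to 15 minutes, which leaves some margin over the typical setup time.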

From the command line

# Direct API call (PUBLIC_IP is the address printed by the launch script)
curl "http://$PUBLIC_IP:11434/api/generate" -d '{
  "model": "llama3:70b-instruct-q4_K_M",
  "prompt": "Explain the difference between ext4 and btrfs for a homelab NAS",
  "stream": false
}'
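
Hand-written JSON breaks as soon as a prompt contains quotes or newlines. The sketch below builds the request body with jq (already a prerequisite of this guide) and prints only the `response` field of the non-streaming reply. `build_payload` and `ask` are hypothetical helper names.

```shell
#!/bin/bash
# Build the request body with jq so quotes and newlines in the prompt are escaped.
# Arguments: model, prompt.
build_payload() {
  jq -cn --arg model "$1" --arg prompt "$2" \
    '{model: $model, prompt: $prompt, stream: false}'
}

# Send one prompt and print only the generated text.
# Arguments: host, model, prompt.
ask() {
  curl -s "http://$1:11434/api/generate" -d "$(build_payload "$2" "$3")" \
    | jq -r '.response'
}

# Example:
#   ask "$PUBLIC_IP" "llama3:70b-instruct-q4_K_M" 'What does "set -euo pipefail" do?'
```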

Connect Open WebUI to the remote instance

If you’re running Open WebUI locally, you can point it at the remote Ollama instance temporarily:

# Stop your local Ollama, start Open WebUI with the remote endpoint
docker run -d --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://$PUBLIC_IP:11434 \
  ghcr.io/open-webui/open-webui:main

Now you get the familiar chat interface with a 70B model behind it. When you’re done, switch back to your local Ollama endpoint.

SSH tunnel (more secure)

If you’d rather not expose port 11434 publicly, remove the port from the security group and use an SSH tunnel:

ssh -i ~/.ssh/ollama-gpu.pem -L 11434:localhost:11434 -N ubuntu@$PUBLIC_IP &

# Now access Ollama on localhost as if it were local
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b-instruct-q4_K_M",
  "prompt": "Hello from my homelab through an SSH tunnel",
  "stream": false
}'
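
Removing the port from the security group can be scripted as well. This is a sketch using `aws ec2 revoke-security-group-ingress`; it assumes you pass the same CIDR the launch script authorized (your public IP with `/32`), and `close_ollama_port` is a hypothetical helper name.

```shell
#!/bin/bash
# Remove the public Ollama API rule so port 11434 is reachable only via the tunnel.
# Arguments: security group ID, the CIDR that was authorized, region (optional).
close_ollama_port() {
  local sg_id="$1" cidr="$2" region="${3:-eu-central-1}"
  aws ec2 revoke-security-group-ingress \
    --region "$region" \
    --group-id "$sg_id" \
    --protocol tcp --port 11434 --cidr "$cidr"
}

# Example (sg-0123456789abcdef0 is a placeholder ID):
#   close_ollama_port sg-0123456789abcdef0 "$(curl -4s ifconfig.me)/32"
```

If a local Ollama is already listening on 11434, pick a different local port for the tunnel, for example `-L 11435:localhost:11434`.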

The termination script

When you’re done, terminate immediately. Every hour the instance runs costs money:

#!/bin/bash
# ollama-spot-stop.sh — Terminate the Ollama spot instance
set -euo pipefail

REGION="eu-central-1"  # must match the region used by the launch script
INSTANCE_ID=$(cat /tmp/ollama-spot-instance-id 2>/dev/null || echo "")

if [ -z "$INSTANCE_ID" ]; then
  echo "No instance ID found. Check /tmp/ollama-spot-instance-id"
  exit 1
fi

echo "Terminating $INSTANCE_ID..."
aws ec2 terminate-instances --region "$REGION" --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-terminated --region "$REGION" --instance-ids "$INSTANCE_ID"

rm -f /tmp/ollama-spot-instance-id
echo "Instance terminated. No more charges."
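
The stop script depends on the ID file in /tmp, which does not survive a reboot of your homelab machine. As a safety net, the sketch below lists anything still running under the `Name=ollama-spot` tag that the launch script applies. `list_running_spot` is a hypothetical helper name.

```shell
#!/bin/bash
# List ollama-spot instances that are still pending or running.
# Relies on the Name=ollama-spot tag that the launch script applies.
# Argument: region (optional, defaults to eu-central-1).
list_running_spot() {
  local region="${1:-eu-central-1}"
  aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=tag:Name,Values=ollama-spot" \
              "Name=instance-state-name,Values=pending,running" \
    --query 'Reservations[].Instances[].[InstanceId,InstanceType,LaunchTime]' \
    --output text
}

# Example: warn if anything is still billing
#   [ -z "$(list_running_spot)" ] || echo "WARNING: ollama-spot instances still running"
```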

Cost comparison: spot vs. buying a GPU

Here’s the real math. The question is: at what point does buying local hardware save you money?

Scenario: 2 hours of GPU use per day

AWS spot (g5.xlarge):
  €0.35/hr × 2 hrs × 30 days = €21.00/month

Local RTX 4070 Ti Super (16GB VRAM):
  Card: €650 / 36 months = €18.06/month (amortized)
  Electricity: 200W × 2hrs × 30 days × €0.30/kWh = €3.60/month
  Total: €21.66/month

At 2 hours per day, the costs are nearly identical. With electricity counted, the card pays for itself after about 37 months: €650 / (€21.00 − €3.60) ≈ 37.4.

Scenario: occasional use (5 hours per week)

AWS spot:
  €0.35 × 5 hrs × 4.3 weeks = €7.53/month

Local RTX 4070 Ti Super:
  €18.06 + €1.29 electricity = €19.35/month

Spot instances are about 2.5× cheaper if you only use GPU compute a few hours per week. Owning saves just €7.53 − €1.29 = €6.24 per month in running costs, so the €650 card would take more than 8 years to break even, by which point it’s obsolete.

Scenario: heavy daily use (6+ hours/day)

AWS spot:
  €0.35 × 6 hrs × 30 days = €63.00/month

Local RTX 4070 Ti Super:
  €18.06 + €10.80 electricity = €28.86/month

If you’re running models 6+ hours daily, buy the GPU. Local hardware wins decisively for sustained workloads.
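
The three scenarios follow the same formula, so you can plug in your own numbers. This sketch computes how many months it takes for a purchased card to beat spot pricing, counting electricity; the default inputs in the examples are the figures used above, and `break_even_months` is a hypothetical helper name.

```shell
#!/bin/bash
# Months until a purchased card becomes cheaper than renting spot capacity.
# Arguments: card price (EUR), spot rate (EUR/hr), card power draw (W),
#            GPU hours per month, electricity price (EUR/kWh).
break_even_months() {
  awk -v card="$1" -v spot="$2" -v watts="$3" -v hours="$4" -v kwh="$5" 'BEGIN {
    local_rate = watts / 1000 * kwh        # EUR per hour to run the card
    saving = (spot - local_rate) * hours   # EUR saved per month by owning
    if (saving <= 0) { print "never"; exit }
    printf "%.1f\n", card / saving
  }'
}

break_even_months 650 0.35 200 60 0.30     # 2 hrs/day
break_even_months 650 0.35 200 21.5 0.30   # 5 hrs/week
break_even_months 650 0.35 200 180 0.30    # 6 hrs/day
```

The function prints "never" when running the card costs at least as much per hour as the spot rate, in which case buying cannot pay off on running costs alone.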

The VRAM argument

The cost math above ignores a critical factor: VRAM. A g5.xlarge gives you an A10G with 24GB VRAM. Getting 24GB of VRAM locally means an RTX 4090 (€1,600+) or an A5000 (€2,500+). With g5.12xlarge you get 96GB across 4 GPUs, and there’s no reasonable local equivalent.

If you need more than 16GB VRAM and use it less than 4 hours daily, spot instances are the clear winner.

Handling spot interruptions

Spot instances can be reclaimed by AWS with two minutes’ notice. For inference workloads this isn’t a big deal: you lose the current prompt, not data. But you should be aware of it.

Check for interruption warnings

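# On the instance: polls the instance metadata service for a termination notice.
# Uses an IMDSv2 session token, which also works where IMDSv1 is disabled.
while true; do
  TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60" || true)
  # The endpoint returns 404 until an interruption is scheduled, then 200
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/spot/instance-action || true)
  if [ "$STATUS" = "200" ]; then
    echo "SPOT INTERRUPTION: instance will terminate in 2 minutes"
    # Add notification here (webhook, Telegram, etc.)
    break
  fi
  sleep 5
done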

Interruption rates

In practice, g5.xlarge spot instances in eu-central-1 have an interruption rate under 5%. For a 2-hour inference session, the chance of being interrupted is very low. If it does happen, just re-run the launch script — it takes 5–10 minutes to get back up.
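
On the client side, a small retry wrapper lets a long-running job survive the gap while you relaunch. This is a sketch, not part of the guide's scripts: `retry`, `current_ip`, and `try_prompt` are hypothetical names, and the IP is looked up on every attempt because a relaunched instance gets a new public address.

```shell
#!/bin/bash
# Run a command up to N times, sleeping between attempts.
# Arguments: max attempts, delay in seconds, then the command and its arguments.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Look up the public IP of the currently running ollama-spot instance by tag.
current_ip() {
  aws ec2 describe-instances --region eu-central-1 \
    --filters "Name=tag:Name,Values=ollama-spot" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
}

# One attempt at sending a prompt; the JSON body is passed as the first argument.
try_prompt() {
  curl -sf --max-time 300 "http://$(current_ip):11434/api/generate" -d "$1"
}

# Example: keep trying for up to 10 minutes while the instance comes back
#   retry 20 30 try_prompt '{"model":"llama3:70b-instruct-q4_K_M","prompt":"Hello","stream":false}'
```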

Persistent model cache (optional)

The biggest cost of spinning up a fresh instance each time is re-downloading the model (5–40GB depending on size). You can cut this down by storing pulled models on an EBS snapshot:

# After pulling your models, snapshot the instance's root volume
# (the launch script stores the models there). First, find the volume ID.
INSTANCE_ID=$(cat /tmp/ollama-spot-instance-id)
VOL_ID=$(aws ec2 describe-instances \
  --region eu-central-1 \
  --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId' \
  --output text)

# Create snapshot
SNAP_ID=$(aws ec2 create-snapshot \
  --region eu-central-1 \
  --volume-id "$VOL_ID" \
  --description "Ollama models cache" \
  --query 'SnapshotId' --output text)

echo "Snapshot: $SNAP_ID"
echo "Use this in future launches to skip model downloads"

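Then launch future instances from an image built on this snapshot rather than from the stock Ubuntu AMI, for example by registering the snapshot as an AMI and using its ID in the launch script. Startup time drops from 10 minutes to under 3.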
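
One way to get a launchable image in a single step is `aws ec2 create-image`, which snapshots the root volume and registers an AMI from it. This is a sketch under assumptions: `create_model_image` is a hypothetical helper, the launch script would need its AMI lookup replaced with the returned ID (an edit not shown in this guide), and by default `create-image` reboots the instance while it takes the snapshot.

```shell
#!/bin/bash
# Create an AMI from the running instance once the model has been pulled.
# Arguments: instance ID, image name, region (optional). Prints the new AMI ID.
create_model_image() {
  local instance_id="$1" name="$2" region="${3:-eu-central-1}"
  aws ec2 create-image \
    --region "$region" \
    --instance-id "$instance_id" \
    --name "$name" \
    --description "Ubuntu with NVIDIA drivers, Ollama, and cached models" \
    --query 'ImageId' --output text
}

# Example:
#   AMI_ID=$(create_model_image "$(cat /tmp/ollama-spot-instance-id)" "ollama-cache-$(date +%Y%m%d)")
#   aws ec2 wait image-available --region eu-central-1 --image-ids "$AMI_ID"
```

Keep in mind that the snapshot behind the image is billed for storage for as long as you keep it.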

Next steps


The sweet spot is owning a modest local GPU for daily use and spinning up cloud instances when you need more power. Your homelab handles the 7B and 13B models you use every day; AWS handles the occasional 70B experiment. Best of both worlds.

Frequently Asked Questions

Which AWS GPU instance types work best for running LLMs with Ollama?
The g4dn.xlarge (NVIDIA T4, 16GB VRAM, ~€0.35/hr spot) handles 7B–13B models well. For quantized 70B models, this guide defaults to the g5.xlarge (NVIDIA A10G, 24GB VRAM, €0.30–0.45/hr spot); the g5.12xlarge (4× A10G, 96GB VRAM) is the option for multi-GPU inference. The p3.2xlarge (V100, 16GB VRAM) is an older option and, at €0.90–1.20/hr in the table above, costs more than a g5.xlarge.

Can a spot instance be interrupted while I am actively using Ollama?
Yes. AWS can reclaim spot capacity with a 2-minute warning. For interactive sessions this is disruptive: your model generation stops mid-stream. The launch script in this guide does not relaunch automatically, so re-run it to get a new instance in 5–10 minutes. Active sessions cannot be resumed. For long inference jobs, consider spot interruption handling in your client code.

How much does it actually cost to run a 70B model on AWS?
A g5.xlarge spot instance runs at roughly €0.30–0.45/hour in eu-central-1, depending on the availability zone and time of day. A typical session (loading the model, running 20–30 prompts, and shutting down) takes 2–3 hours and costs roughly €0.60–1.35. Compare that to a month of API calls to a cloud provider for the same volume of requests.

Do I need to keep the instance running between sessions?
No, and you should not. The termination script in this guide terminates the instance after use, so a fresh instance downloads its models again on first launch. To avoid re-downloading, keep the model cache in an EBS snapshot, as described in the optional section of the guide.


Sergej Voronko
SAP Basis · Senior Operations Manager · Linux infrastructure engineer