Ollama makes running large language models locally straightforward, but getting GPU acceleration working inside a container on Windows takes some effort. Podman Desktop uses a WSL2-backed Linux VM, and the GPU doesn't just appear inside containers automatically — you need the NVIDIA Container Toolkit and CDI (Container Device Interface) to bridge the gap.
This post walks through the complete setup: installing the toolkit inside the Podman machine, generating CDI specs, running Ollama with GPU passthrough, and verifying that inference is actually hitting the GPU. I also cover CPU-only mode for machines without a supported GPU, and include a reference table of what models fit on which GPUs.
Prerequisites
Before starting, you need:
- Windows 10 21H2+ or Windows 11 with WSL2 enabled
- Podman Desktop installed and running (with a WSL2-backed machine)
- NVIDIA GPU with up-to-date drivers installed on Windows (the driver handles WSL2 automatically)
- `nvidia-smi` working from a regular Windows terminal; if this doesn't work, fix your driver first
Verify your GPU is visible from Windows:
```bash
nvidia-smi
```

You should see your GPU model, driver version, and CUDA version. If this fails, install or update your NVIDIA drivers from nvidia.com/drivers.
Podman Machine Architecture on Windows
Podman on Windows doesn't run containers directly on the host. It creates a lightweight Linux VM (the "Podman machine") inside WSL2, and containers run inside that VM. This is important because GPU passthrough has to work at two levels: Windows → WSL2 VM, and WSL2 VM → container.
The good news is that NVIDIA's Windows drivers automatically expose CUDA libraries into WSL2 via /usr/lib/wsl/lib/. The missing piece is telling Podman how to map those libraries into individual containers — that's what the NVIDIA Container Toolkit does.
Running Ollama — CPU Only
If you don't have an NVIDIA GPU, or want to run without GPU acceleration, this is straightforward:
```bash
podman run -d --name ollama \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama:latest
```

This creates a container with:

- A named volume `ollama-models` for persistent model storage
- Port 11434 exposed for the Ollama API
- No GPU access; inference runs on CPU only
Pull and run a model:
```bash
podman exec ollama ollama pull qwen2.5:7b
podman exec ollama ollama run qwen2.5:7b "Hello, world"
```

CPU inference works but is significantly slower. A 7B model that generates ~150 tokens/sec on a modern GPU might only manage 10-20 tokens/sec on CPU. For anything beyond small models or quick tests, GPU acceleration is worth the setup effort.
Enabling NVIDIA GPU Passthrough
This is the main event. Three steps: verify the GPU is visible in the Podman machine, install the NVIDIA Container Toolkit, and generate CDI specs.
Step 1: Verify GPU Visibility in WSL2
First, confirm the Podman machine can see your GPU:
```bash
podman machine ssh -- "/usr/lib/wsl/lib/nvidia-smi"
```

You should see the same GPU info as from Windows. If this fails, your NVIDIA drivers may need updating; ensure you have a recent NVIDIA Windows driver installed.
Check that the CUDA libraries are present:
```bash
podman machine ssh -- "ls /usr/lib/wsl/lib/libcuda*"
```

You should see `libcuda.so`, `libcuda.so.1`, and `libcuda.so.1.1`. These are provided by the Windows NVIDIA driver and automatically mounted into WSL2.
Step 2: Install the NVIDIA Container Toolkit
The Podman machine runs Fedora. Install the toolkit using dnf:
```bash
# Add the NVIDIA Container Toolkit repository
podman machine ssh -- "curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo > /dev/null"

# Install the toolkit
podman machine ssh -- "sudo dnf install -y nvidia-container-toolkit"
```

Note: If your Podman machine runs a different distro, check the NVIDIA Container Toolkit install docs for the appropriate package manager commands.
Step 3: Generate CDI Specs
CDI (Container Device Interface) is a standard that tells container runtimes how to expose host devices to containers. Generate the NVIDIA CDI spec:
```bash
podman machine ssh -- "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"
```

You'll see output showing the detected GPU, driver store paths, and libraries being mapped. Verify the CDI device is available:

```bash
podman machine ssh -- "nvidia-ctk cdi list"
```

This should output:

```
nvidia.com/gpu=all
```

That's the device identifier you'll use when running containers with GPU access.
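For reference, the generated spec is plain YAML describing which device nodes and library mounts the runtime should inject. A heavily abridged sketch of what `/etc/cdi/nvidia.yaml` might contain on a WSL2 machine (versions, paths, and the single mount shown are illustrative; the real generated file lists many more entries):

```yaml
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/dxg            # WSL2's paravirtualized GPU device
      mounts:
        - hostPath: /usr/lib/wsl/lib/libcuda.so.1.1
          containerPath: /usr/lib/wsl/lib/libcuda.so.1.1
          options: ["ro", "nosuid", "nodev", "bind"]
```

You never edit this file by hand; `nvidia-ctk cdi generate` rewrites it from the installed driver.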
Step 4: Run Ollama with GPU
Now run Ollama with the `--device` flag to pass the GPU through:
```bash
podman run -d --name ollama \
  --device nvidia.com/gpu=all \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama:latest
```

The only difference from the CPU-only command is `--device nvidia.com/gpu=all`. This tells Podman to use the CDI spec to mount the NVIDIA libraries and devices into the container.

Migrating from CPU-only? If you already have an Ollama container without GPU access, remove and recreate it. Your models are safe on the `ollama-models` volume:

```bash
podman rm -f ollama
podman run -d --name ollama \
  --device nvidia.com/gpu=all \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama:latest
```
Verifying GPU Acceleration
Setting up GPU passthrough is one thing — confirming it actually works is another. Here are three verification methods, from quick sanity check to proper benchmarking.
Check 1: nvidia-smi Inside the Container
```bash
podman exec ollama nvidia-smi
```

You should see your GPU listed with its full VRAM. If this fails, the CDI device mapping isn't working; go back and verify the CDI spec was generated correctly.
Check 2: Ollama Startup Logs
The Ollama logs show exactly what GPU was detected:
```bash
podman logs ollama 2>&1 | grep -E "gpu|GPU|CUDA|inference compute"
```

Look for a line like:

```
inference compute id=GPU-xxx library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5090" total="31.8 GiB" available="26.3 GiB"
```

This confirms Ollama found the GPU via CUDA, detected its compute capability, and knows how much VRAM is available.
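If you want to script this check, the VRAM figures can be pulled straight out of that log line. A small Python sketch (the log format shown above is assumed and may change between Ollama versions):

```python
import re

def parse_vram(log_line: str) -> dict:
    """Extract total/available VRAM (GiB) from Ollama's 'inference compute' log line."""
    return {key: float(val)
            for key, val in re.findall(r'(total|available)="([\d.]+) GiB"', log_line)}

line = ('inference compute id=GPU-xxx library=CUDA compute=12.0 '
        'name=CUDA0 total="31.8 GiB" available="26.3 GiB"')
print(parse_vram(line))  # → {'total': 31.8, 'available': 26.3}
```

Piped after `podman logs ollama`, this gives you a machine-readable health check for CI or a startup script.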
Check 3: Verify Model Placement
After running a model, check where it's loaded:
```bash
podman exec ollama ollama run qwen2.5:7b "test" > /dev/null 2>&1
podman exec ollama ollama ps
```

The PROCESSOR column tells you exactly where the model is running:

```
NAME          SIZE    PROCESSOR    CONTEXT
qwen2.5:7b    4.9 GB  100% GPU     4096
```

100% GPU means all layers are on the GPU. If you see a split like 60% GPU / 40% CPU, the model is too large for your VRAM and is being partially offloaded; inference will be slower for the offloaded layers.
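The split Ollama reports follows from simple arithmetic: layers go onto the GPU until free VRAM runs out, and the rest stay on CPU. A rough sketch of that placement logic (illustrative only; Ollama's real scheduler also reserves VRAM for context buffers and compute overhead, and layer counts here are assumed):

```python
def gpu_offload_split(model_gb: float, n_layers: int, free_vram_gb: float) -> float:
    """Estimate the fraction of layers that fit on the GPU."""
    per_layer_gb = model_gb / n_layers
    layers_on_gpu = min(n_layers, int(free_vram_gb / per_layer_gb))
    return layers_on_gpu / n_layers

# A 4.9 GB model with 28 layers fits entirely in 8 GB of free VRAM:
print(f"{gpu_offload_split(4.9, 28, 8.0):.0%}")  # → 100%
# The same model with only 3 GB free gets partially offloaded:
print(f"{gpu_offload_split(4.9, 28, 3.0):.0%}")  # → 61%
```

This is why closing other GPU-heavy applications can flip a model from a split back to 100% GPU.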
Check 4: Benchmark with the API
For precise performance numbers, use the Ollama API with `stream: false` to get timing data:
```bash
curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:7b","prompt":"Write a haiku about containers","stream":false}' \
  | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f'Total: {d[\"total_duration\"]/1e9:.1f}s')
print(f'Load: {d[\"load_duration\"]/1e9:.1f}s')
print(f'Prompt: {d[\"prompt_eval_duration\"]/1e9:.2f}s ({d[\"prompt_eval_count\"]} tokens)')
print(f'Generation: {d[\"eval_duration\"]/1e9:.2f}s ({d[\"eval_count\"]} tokens)')
print(f'Speed: {d[\"eval_count\"]/(d[\"eval_duration\"]/1e9):.1f} tokens/sec')
"
```

On an RTX 5090, a warm 7B model generates at ~150 tokens/sec. A 30B model runs at ~50 tokens/sec. If you're seeing numbers in the 10-20 tok/s range, the model is likely running on CPU.

First run is slow: The first inference after starting or loading a model includes model load time (moving weights from disk into VRAM). Run the same prompt twice; the second run shows actual inference speed.
Performance Expectations
Real numbers from an RTX 5090 (32GB GDDR7) running in this exact Podman setup:
| Model | Parameters | VRAM Used | Speed (warm) |
|---|---|---|---|
| qwen2.5:1.5b | 1.5B | 1.4 GB | ~250 tok/s |
| qwen2.5:7b | 7B | 4.9 GB | ~150 tok/s |
| mistral:7b-instruct | 7B | 4.4 GB | ~145 tok/s |
| deepseek-coder:6.7b | 6.7B | 3.8 GB | ~155 tok/s |
| glm-4.7-flash | 29.9B | 19 GB | ~50 tok/s |
Thinking models feel slow. Models like `glm-4.7-flash` generate hidden chain-of-thought tokens before producing visible output. A 30B thinking model generating 1,800 internal tokens at 50 tok/s takes ~36 seconds before you see anything; it's not stuck, it's thinking.
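The arithmetic behind that wait is worth internalizing before concluding a model is hung:

```python
def time_to_first_visible_token(hidden_tokens: int, tok_per_sec: float) -> float:
    """Seconds spent on hidden chain-of-thought before any visible output."""
    return hidden_tokens / tok_per_sec

# 1,800 internal tokens at 50 tok/s on a 30B thinking model:
print(time_to_first_visible_token(1800, 50))  # → 36.0
```

The same model at CPU speeds (~15 tok/s) would sit silent for two minutes, which is why thinking models are particularly painful without GPU acceleration.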
GPU VRAM Compatibility Guide
Not every GPU can run every model. VRAM is the bottleneck — the model weights need to fit in GPU memory for full acceleration. These tables show what fits where, using Q4_K_M quantization (Ollama's typical default).
How to read these tables: "Comfortably" means the model loads with headroom for context. "Tight" means it loads but leaves little room. Models that don't fit will partially offload to CPU, which is significantly slower.
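You can also estimate these numbers yourself: Q4_K_M stores weights at roughly 4.5 bits per parameter, plus some fixed headroom for context and runtime buffers. A hedged sketch (the 4.5 bits/param and 1 GB overhead figures are approximations, not Ollama internals):

```python
def est_vram_gb(params_billion: float, bits_per_param: float = 4.5,
                overhead_gb: float = 1.0) -> float:
    """Rough VRAM needed to fully load a quantized model."""
    weights_gb = params_billion * bits_per_param / 8  # 1B params ≈ 1e9, bits → bytes
    return weights_gb + overhead_gb

print(f"{est_vram_gb(7):.1f} GB")   # → 4.9 GB
print(f"{est_vram_gb(14):.1f} GB")  # → 8.9 GB
```

The 7B estimate lands right on the 4.9 GB that `ollama ps` reports for qwen2.5:7b above; larger models accumulate proportionally more context overhead, so treat the output as a floor.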
NVIDIA GeForce RTX
| GPU | VRAM | Max Comfortable Model | Notes |
|---|---|---|---|
| RTX 50 Series | | | |
| RTX 5090 | 32 GB GDDR7 | 30B | 70B possible with CPU offload |
| RTX 5080 | 16 GB GDDR7 | 13-14B | High bandwidth (960 GB/s) |
| RTX 5070 Ti | 16 GB GDDR7 | 13-14B | |
| RTX 5070 | 12 GB GDDR7 | 7-8B | 13B is tight |
| RTX 40 Series | | | |
| RTX 4090 | 24 GB GDDR6X | 30B | Best previous-gen option |
| RTX 4080 Super | 16 GB GDDR6X | 13-14B | |
| RTX 4070 Ti Super | 16 GB GDDR6X | 13-14B | |
| RTX 4070 Super | 12 GB GDDR6X | 7-8B | 13B is tight |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 13B | 128-bit bus limits speed |
| RTX 4060 Ti 8GB | 8 GB GDDR6 | 7-8B | |
| RTX 4060 | 8 GB GDDR6 | 7-8B | |
| RTX 30 Series | | | |
| RTX 3090 | 24 GB GDDR6X | 30B | Same VRAM ceiling as 4090, lower bandwidth |
| RTX 3080 | 10 GB GDDR6X | 7-8B | 13B is marginal |
| RTX 3070 | 8 GB GDDR6 | 7-8B | |
| RTX 3060 | 12 GB GDDR6 | 13B | Budget 13B option |
| RTX 20 Series | | | |
| RTX 2080 Ti | 11 GB GDDR6 | 7-8B | Oldest supported generation |
| RTX 2080 | 8 GB GDDR6 | 7-8B | |
| RTX 2070 | 8 GB GDDR6 | 7-8B | |
| RTX 2060 | 6 GB GDDR6 | 3-7B | 7B is very tight |
Intel Arc
| GPU | VRAM | Max Comfortable Model | Notes |
|---|---|---|---|
| Arc A770 | 16 GB GDDR6 | 13-14B | Requires IPEX-LLM, not standard Ollama |
| Arc B580 | 12 GB GDDR6 | 7-8B | Requires IPEX-LLM |
| Arc B570 | 10 GB GDDR6 | 7-8B | Requires IPEX-LLM |
| Arc A750 | 8 GB GDDR6 | 7-8B | Requires IPEX-LLM |
Intel Arc requires a custom Ollama build. Standard Ollama doesn't support Intel GPUs. You need IPEX-LLM, which provides a patched Ollama build using SYCL kernels for Intel XPU hardware. This is a separate setup path and doesn't use the CDI/NVIDIA toolkit approach described in this post.
AMD Radeon RX
| GPU | VRAM | Max Comfortable Model | Notes |
|---|---|---|---|
| RX 7900 XTX | 24 GB GDDR6 | 30B | Native Windows Ollama + ROCm |
| RX 7900 XT | 20 GB GDDR6 | 13-14B | Native Windows Ollama + ROCm |
| RX 9070 XT | 16 GB GDDR6 | 13-14B | ROCm 6.4.1+ required |
| RX 9070 | 16 GB GDDR6 | 13-14B | ROCm 6.4.1+ required |
| RX 7800 XT | 16 GB GDDR6 | 13-14B | |
| RX 7700 XT | 12 GB GDDR6 | 7-8B | 13B is tight |
| RX 7600 | 8 GB GDDR6 | 7-8B | 128-bit bus limits speed |
AMD GPUs don't reliably work in containers on WSL2. ROCm traditionally requires `/dev/kfd`, which WSL2 doesn't expose; it provides `/dev/dxg` instead. While there are community workarounds for AMD GPU passthrough in WSL2 containers, support is experimental and many users report the GPU not being detected. For AMD GPUs, the most reliable path is native Windows Ollama with the ROCm package (`ollama-windows-amd64-rocm.zip`); no container needed. Or run on bare-metal Linux where ROCm has full hardware access.
Quick Model Sizing Reference
| Model Size | VRAM Needed (Q4_K_M) | Example Models |
|---|---|---|
| 1-3B | 1-3 GB | qwen2.5:1.5b, phi-3:mini, gemma:2b |
| 7-8B | 5-6 GB | llama3.1:8b, qwen2.5:7b, mistral:7b |
| 13-14B | 9-10 GB | llama2:13b, qwen2.5:14b |
| 30-34B | 20-22 GB | codellama:34b, yi:34b |
| 70B | 44-48 GB | llama3.1:70b (no single consumer GPU) |
Memory bandwidth matters too. A 16GB card with a 128-bit bus (RTX 4060 Ti 16GB) will be noticeably slower than a 16GB card with a 256-bit bus (RTX 5080) even when both can load the same model. VRAM determines what fits, bandwidth determines how fast.
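That relationship can be sketched numerically: during decode, generating each token requires streaming (roughly) every weight from VRAM once, so bandwidth divided by model size gives a ceiling on tokens per second. Illustrative only; real throughput lands well below this bound, and the bandwidth figures are approximate spec-sheet values:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tok/s: every token streams the full weights once."""
    return bandwidth_gb_s / model_gb

# Same 9 GB 14B model on a 256-bit RTX 5080 (~960 GB/s)
# vs a 128-bit RTX 4060 Ti (~288 GB/s):
print(round(decode_ceiling_tok_s(960, 9)))  # → 107
print(round(decode_ceiling_tok_s(288, 9)))  # → 32
```

Both cards load the model, but the narrower bus caps the 4060 Ti at roughly a third of the 5080's ceiling, matching the "128-bit bus limits speed" notes in the tables above.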
Troubleshooting
"No NVIDIA GPU detected" in Ollama Logs
Check that CDI was generated after installing the toolkit:
```bash
podman machine ssh -- "nvidia-ctk cdi list"
```

If empty, regenerate:

```bash
podman machine ssh -- "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"
```

Container Starts but No GPU
Make sure you included `--device nvidia.com/gpu=all` when creating the container. You can't add a device to a running container; you need to recreate it.
nvidia-smi Works in WSL but Not in Container
The CDI spec may be stale. After Windows driver updates, regenerate:
```bash
podman machine ssh -- "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"
```

Then recreate the Ollama container.
Model Shows "100% CPU" Despite GPU Being Available
The model might be too large for VRAM. Check available memory:
```bash
podman exec ollama nvidia-smi
```

If VRAM is nearly full from other processes, Ollama will fall back to CPU. Close other GPU-heavy applications or use a smaller model.
CDI Generation Fails
Ensure the toolkit is installed and the GPU is visible:
```bash
podman machine ssh -- "which nvidia-ctk"
podman machine ssh -- "/usr/lib/wsl/lib/nvidia-smi"
```

If `nvidia-smi` fails inside the Podman machine, the issue is at the WSL2 level; update your NVIDIA Windows drivers.
Performance Is Lower Than Expected
- First run penalty — The first inference loads the model into VRAM. Always benchmark on the second run.
- Thinking models — Models like `glm-4.7-flash` or `deepseek-r1` generate internal reasoning tokens before producing output. This isn't slowness, it's by design.
- Shared GPU — If you're running games, video editing, or other GPU workloads simultaneously, available VRAM and compute will be reduced.
Putting It All Together
The complete setup from scratch is five commands after Podman Desktop is installed:
```bash
# 1. Add NVIDIA Container Toolkit repo
podman machine ssh -- "curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo > /dev/null"

# 2. Install the toolkit
podman machine ssh -- "sudo dnf install -y nvidia-container-toolkit"

# 3. Generate CDI specs
podman machine ssh -- "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"

# 4. Run Ollama with GPU
podman run -d --name ollama \
  --device nvidia.com/gpu=all \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama:latest

# 5. Pull and run a model
podman exec ollama ollama run llama3.1:8b "Hello!"
```

Once set up, the GPU passthrough survives container restarts. You only need to regenerate CDI specs if you update your NVIDIA Windows drivers.
Running LLMs locally gives you privacy, zero API costs, and the ability to experiment freely. With a modern NVIDIA GPU and Podman on Windows, you get container isolation without sacrificing GPU performance.