Ollama makes running large language models locally straightforward, but getting GPU acceleration working inside a container on Windows takes some effort. Podman Desktop uses a WSL2-backed Linux VM, and the GPU doesn't just appear inside containers automatically — you need the NVIDIA Container Toolkit and CDI (Container Device Interface) to bridge the gap.
This post walks through the complete setup: installing the toolkit inside the Podman machine, generating CDI specs, running Ollama with GPU passthrough, and verifying that inference is actually hitting the GPU. I also cover CPU-only mode for machines without a supported GPU, and include a reference table of what models fit on which GPUs.
Prerequisites
Before starting, you need:
- Windows 10 21H2+ or Windows 11 with WSL2 enabled
- Podman Desktop installed and running (with a WSL2-backed machine)
- NVIDIA GPU with up-to-date drivers installed on Windows (the driver handles WSL2 automatically)
- `nvidia-smi` working from a regular Windows terminal; if this doesn't work, fix your driver first
Verify your GPU is visible from Windows:
```bash
nvidia-smi
```

You should see your GPU model, driver version, and CUDA version. If this fails, install or update your NVIDIA drivers from nvidia.com/drivers.
Podman Machine Architecture on Windows
Podman on Windows doesn't run containers directly on the host. It creates a lightweight Linux VM (the "Podman machine") inside WSL2, and containers run inside that VM. This is important because GPU passthrough has to work at two levels: Windows → WSL2 VM, and WSL2 VM → container.
The good news is that NVIDIA's Windows drivers automatically expose CUDA libraries into WSL2 via /usr/lib/wsl/lib/. The missing piece is telling Podman how to map those libraries into individual containers — that's what the NVIDIA Container Toolkit does.
Running Ollama — CPU Only
If you don't have an NVIDIA GPU, or want to run without GPU acceleration, this is straightforward:
```bash
podman run -d --name ollama \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama:latest
```

This creates a container with:

- A named volume `ollama-models` for persistent model storage
- Port 11434 exposed for the Ollama API
- No GPU access; inference runs on CPU only
Pull and run a model:
```bash
podman exec ollama ollama pull qwen2.5:7b
podman exec ollama ollama run qwen2.5:7b "Hello, world"
```

CPU inference works but is significantly slower. A 7B model that generates ~150 tokens/sec on a modern GPU might only manage 10-20 tokens/sec on CPU. For anything beyond small models or quick tests, GPU acceleration is worth the setup effort.
Enabling NVIDIA GPU Passthrough
This is the main event. Three steps: verify the GPU is visible in the Podman machine, install the NVIDIA Container Toolkit, and generate CDI specs.
Step 1: Verify GPU Visibility in WSL2
First, confirm the Podman machine can see your GPU:
```bash
podman machine ssh -- "/usr/lib/wsl/lib/nvidia-smi"
```

You should see the same GPU info as from Windows. If this fails, your NVIDIA drivers may need updating; ensure you have a recent NVIDIA Windows driver installed.
Check that the CUDA libraries are present:
```bash
podman machine ssh -- "ls /usr/lib/wsl/lib/libcuda*"
```

You should see `libcuda.so`, `libcuda.so.1`, and `libcuda.so.1.1`. These are provided by the Windows NVIDIA driver and automatically mounted into WSL2.
Step 2: Install the NVIDIA Container Toolkit
The Podman machine runs Fedora. Install the toolkit using dnf:
```bash
# Add the NVIDIA Container Toolkit repository
podman machine ssh -- "curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo > /dev/null"

# Install the toolkit
podman machine ssh -- "sudo dnf install -y nvidia-container-toolkit"
```

Note: If your Podman machine runs a different distro, check the NVIDIA Container Toolkit install docs for the appropriate package manager commands.
Step 3: Generate CDI Specs
CDI (Container Device Interface) is a standard that tells container runtimes how to expose host devices to containers. Generate the NVIDIA CDI spec:
```bash
podman machine ssh -- "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"
```

You'll see output showing the detected GPU, driver store paths, and libraries being mapped. Verify the CDI device is available:

```bash
podman machine ssh -- "nvidia-ctk cdi list"
```

This should output:

```
nvidia.com/gpu=all
```

That's the device identifier you'll use when running containers with GPU access.
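For reference, the generated spec is plain YAML describing which device nodes and library mounts the runtime should inject. A heavily abridged sketch of what `/etc/cdi/nvidia.yaml` might contain on a WSL2 machine (versions, paths, and the single mount shown are illustrative; the real generated file lists many more entries):

```yaml
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/dxg            # WSL2's paravirtualized GPU device
      mounts:
        - hostPath: /usr/lib/wsl/lib/libcuda.so.1.1
          containerPath: /usr/lib/wsl/lib/libcuda.so.1.1
          options: ["ro", "nosuid", "nodev", "bind"]
```

You never edit this file by hand; `nvidia-ctk cdi generate` rewrites it from the installed driver.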
Step 4: Run Ollama with GPU
Now run Ollama with the `--device` flag to pass the GPU through:
```bash
podman run -d --name ollama \
  --device nvidia.com/gpu=all \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama:latest
```

The only difference from the CPU-only command is `--device nvidia.com/gpu=all`. This tells Podman to use the CDI spec to mount the NVIDIA libraries and devices into the container.

Migrating from CPU-only? If you already have an Ollama container without GPU access, remove and recreate it. Your models are safe on the `ollama-models` volume:

```bash
podman rm -f ollama
podman run -d --name ollama \
  --device nvidia.com/gpu=all \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama:latest
```
Verifying GPU Acceleration
Setting up GPU passthrough is one thing — confirming it actually works is another. Here are three verification methods, from quick sanity check to proper benchmarking.
Check 1: nvidia-smi Inside the Container
```bash
podman exec ollama nvidia-smi
```

You should see your GPU listed with its full VRAM. If this fails, the CDI device mapping isn't working; go back and verify the CDI spec was generated correctly.
Check 2: Ollama Startup Logs
The Ollama logs show exactly what GPU was detected:
```bash
podman logs ollama 2>&1 | grep -E "gpu|GPU|CUDA|inference compute"
```

Look for a line like:

```
inference compute id=GPU-xxx library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5090" total="31.8 GiB" available="26.3 GiB"
```

This confirms Ollama found the GPU via CUDA, detected its compute capability, and knows how much VRAM is available.
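If you want to script this check, the VRAM figures can be pulled straight out of that log line. A small Python sketch (the log format shown above is assumed and may change between Ollama versions):

```python
import re

def parse_vram(log_line: str) -> dict:
    """Extract total/available VRAM (GiB) from Ollama's 'inference compute' log line."""
    return {key: float(val)
            for key, val in re.findall(r'(total|available)="([\d.]+) GiB"', log_line)}

line = ('inference compute id=GPU-xxx library=CUDA compute=12.0 '
        'name=CUDA0 total="31.8 GiB" available="26.3 GiB"')
print(parse_vram(line))  # → {'total': 31.8, 'available': 26.3}
```

Piped after `podman logs ollama`, this gives you a machine-readable health check for CI or a startup script.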
Check 3: Verify Model Placement
After running a model, check where it's loaded:
```bash
podman exec ollama ollama run qwen2.5:7b "test" > /dev/null 2>&1
podman exec ollama ollama ps
```

The PROCESSOR column tells you exactly where the model is running:

```
NAME          SIZE    PROCESSOR    CONTEXT
qwen2.5:7b    4.9 GB  100% GPU     4096
```

100% GPU means all layers are on the GPU. If you see a split like 60% GPU / 40% CPU, the model is too large for your VRAM and is being partially offloaded; inference will be slower for the offloaded layers.
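The split Ollama reports follows from simple arithmetic: layers go onto the GPU until free VRAM runs out, and the rest stay on CPU. A rough sketch of that placement logic (illustrative only; Ollama's real scheduler also reserves VRAM for context buffers and compute overhead, and layer counts here are assumed):

```python
def gpu_offload_split(model_gb: float, n_layers: int, free_vram_gb: float) -> float:
    """Estimate the fraction of layers that fit on the GPU."""
    per_layer_gb = model_gb / n_layers
    layers_on_gpu = min(n_layers, int(free_vram_gb / per_layer_gb))
    return layers_on_gpu / n_layers

# A 4.9 GB model with 28 layers fits entirely in 8 GB of free VRAM:
print(f"{gpu_offload_split(4.9, 28, 8.0):.0%}")  # → 100%
# The same model with only 3 GB free gets partially offloaded:
print(f"{gpu_offload_split(4.9, 28, 3.0):.0%}")  # → 61%
```

This is why closing other GPU-heavy applications can flip a model from a split back to 100% GPU.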
Check 4: Benchmark with the API
For precise performance numbers, use the Ollama API with `stream: false` to get timing data:
```bash
curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:7b","prompt":"Write a haiku about containers","stream":false}' \
  | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f'Total: {d[\"total_duration\"]/1e9:.1f}s')
print(f'Load: {d[\"load_duration\"]/1e9:.1f}s')
print(f'Prompt: {d[\"prompt_eval_duration\"]/1e9:.2f}s ({d[\"prompt_eval_count\"]} tokens)')
print(f'Generation: {d[\"eval_duration\"]/1e9:.2f}s ({d[\"eval_count\"]} tokens)')
print(f'Speed: {d[\"eval_count\"]/(d[\"eval_duration\"]/1e9):.1f} tokens/sec')
"
```

On an RTX 5090, a warm 7B model generates at ~150 tokens/sec. A 30B model runs at ~50 tokens/sec. If you're seeing numbers in the 10-20 tok/s range, the model is likely running on CPU.

First run is slow: The first inference after starting or loading a model includes model load time (moving weights from disk into VRAM). Run the same prompt twice; the second run shows actual inference speed.
Performance Expectations
Real numbers from an RTX 5090 (32GB GDDR7) running in this exact Podman setup:
| Model | Parameters | VRAM Used | Speed (warm) |
|---|---|---|---|
| qwen2.5:1.5b | 1.5B | 1.4 GB | ~250 tok/s |
| qwen2.5:7b | 7B | 4.9 GB | ~150 tok/s |
| mistral:7b-instruct | 7B | 4.4 GB | ~145 tok/s |
| deepseek-coder:6.7b | 6.7B | 3.8 GB | ~155 tok/s |
| glm-4.7-flash | 29.9B | 19 GB | ~50 tok/s |
Thinking models feel slow. Models like `glm-4.7-flash` generate hidden chain-of-thought tokens before producing visible output. A 30B thinking model generating 1,800 internal tokens at 50 tok/s takes ~36 seconds before you see anything; it's not stuck, it's thinking.
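The arithmetic behind that wait is worth internalizing before concluding a model is hung:

```python
def time_to_first_visible_token(hidden_tokens: int, tok_per_sec: float) -> float:
    """Seconds spent on hidden chain-of-thought before any visible output."""
    return hidden_tokens / tok_per_sec

# 1,800 internal tokens at 50 tok/s on a 30B thinking model:
print(time_to_first_visible_token(1800, 50))  # → 36.0
```

The same model at CPU speeds (~15 tok/s) would sit silent for two minutes, which is why thinking models are particularly painful without GPU acceleration.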
GPU VRAM Compatibility Guide
Not every GPU can run every model. VRAM is the bottleneck — the model weights need to fit in GPU memory for full acceleration. These tables show what fits where, using Q4_K_M quantization (Ollama's typical default).
How to read these tables: "Comfortably" means the model loads with headroom for context. "Tight" means it loads but leaves little room. Models that don't fit will partially offload to CPU, which is significantly slower.
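You can also estimate these numbers yourself: Q4_K_M stores weights at roughly 4.5 bits per parameter, plus some fixed headroom for context and runtime buffers. A hedged sketch (the 4.5 bits/param and 1 GB overhead figures are approximations, not Ollama internals):

```python
def est_vram_gb(params_billion: float, bits_per_param: float = 4.5,
                overhead_gb: float = 1.0) -> float:
    """Rough VRAM needed to fully load a quantized model."""
    weights_gb = params_billion * bits_per_param / 8  # 1B params ≈ 1e9, bits → bytes
    return weights_gb + overhead_gb

print(f"{est_vram_gb(7):.1f} GB")   # → 4.9 GB
print(f"{est_vram_gb(14):.1f} GB")  # → 8.9 GB
```

The 7B estimate lands right on the 4.9 GB that `ollama ps` reports for qwen2.5:7b above; larger models accumulate proportionally more context overhead, so treat the output as a floor.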
NVIDIA GeForce RTX
| GPU | VRAM | Max Comfortable Model | Notes |
|---|---|---|---|
| RTX 50 Series | | | |
| RTX 5090 | 32 GB GDDR7 | 30B | 70B possible with CPU offload |
| RTX 5080 | 16 GB GDDR7 | 13-14B | High bandwidth (960 GB/s) |
| RTX 5070 Ti | 16 GB GDDR7 | 13-14B | |
| RTX 5070 | 12 GB GDDR7 | 7-8B | 13B is tight |
| RTX 40 Series | | | |
| RTX 4090 | 24 GB GDDR6X | 30B | Best previous-gen option |
| RTX 4080 Super | 16 GB GDDR6X | 13-14B | |
| RTX 4070 Ti Super | 16 GB GDDR6X | 13-14B | |
| RTX 4070 Super | 12 GB GDDR6X | 7-8B | 13B is tight |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 13B | 128-bit bus limits speed |
| RTX 4060 Ti 8GB | 8 GB GDDR6 | 7-8B | |
| RTX 4060 | 8 GB GDDR6 | 7-8B | |
| RTX 30 Series | | | |
| RTX 3090 | 24 GB GDDR6X | 30B | Same VRAM ceiling as 4090, lower bandwidth |
| RTX 3080 | 10 GB GDDR6X | 7-8B | 13B is marginal |
| RTX 3070 | 8 GB GDDR6 | 7-8B | |
| RTX 3060 | 12 GB GDDR6 | 13B | Budget 13B option |
| RTX 20 Series | | | |
| RTX 2080 Ti | 11 GB GDDR6 | 7-8B | Oldest supported generation |
| RTX 2080 | 8 GB GDDR6 | 7-8B | |
| RTX 2070 | 8 GB GDDR6 | 7-8B | |
| RTX 2060 | 6 GB GDDR6 | 3-7B | 7B is very tight |
Intel Arc
| GPU | VRAM | Max Comfortable Model | Notes |
|---|---|---|---|
| Arc A770 | 16 GB GDDR6 | 13-14B | Requires IPEX-LLM, not standard Ollama |
| Arc B580 | 12 GB GDDR6 | 7-8B | Requires IPEX-LLM |
| Arc B570 | 10 GB GDDR6 | 7-8B | Requires IPEX-LLM |
| Arc A750 | 8 GB GDDR6 | 7-8B | Requires IPEX-LLM |
Intel Arc requires a custom Ollama build. Standard Ollama doesn't support Intel GPUs. You need IPEX-LLM, which provides a patched Ollama build using SYCL kernels for Intel XPU hardware. This is a separate setup path and doesn't use the CDI/NVIDIA toolkit approach described in this post.
AMD Radeon RX
| GPU | VRAM | Max Comfortable Model | Notes |
|---|---|---|---|
| RX 7900 XTX | 24 GB GDDR6 | 30B | Native Windows Ollama + ROCm |
| RX 7900 XT | 20 GB GDDR6 | 13-14B | Native Windows Ollama + ROCm |
| RX 9070 XT | 16 GB GDDR6 | 13-14B | ROCm 6.4.1+ required |
| RX 9070 | 16 GB GDDR6 | 13-14B | ROCm 6.4.1+ required |
| RX 7800 XT | 16 GB GDDR6 | 13-14B | |
| RX 7700 XT | 12 GB GDDR6 | 7-8B | 13B is tight |
| RX 7600 | 8 GB GDDR6 | 7-8B | 128-bit bus limits speed |
AMD GPUs don't reliably work in containers on WSL2. ROCm traditionally requires `/dev/kfd`, which WSL2 doesn't expose; it provides `/dev/dxg` instead. While there are community workarounds for AMD GPU passthrough in WSL2 containers, support is experimental and many users report the GPU not being detected. For AMD GPUs, the most reliable path is native Windows Ollama with the ROCm package (`ollama-windows-amd64-rocm.zip`); no container needed. Or run on bare-metal Linux where ROCm has full hardware access.
Quick Model Sizing Reference
| Model Size | VRAM Needed (Q4_K_M) | Example Models |
|---|---|---|
| 1-3B | 1-3 GB | qwen2.5:1.5b, phi-3:mini, gemma:2b |
| 7-8B | 5-6 GB | llama3.1:8b, qwen2.5:7b, mistral:7b |
| 13-14B | 9-10 GB | llama2:13b, qwen2.5:14b |
| 30-34B | 20-22 GB | codellama:34b, yi:34b |
| 70B | 44-48 GB | llama3.1:70b (no single consumer GPU) |
Memory bandwidth matters too. A 16GB card with a 128-bit bus (RTX 4060 Ti 16GB) will be noticeably slower than a 16GB card with a 256-bit bus (RTX 5080) even when both can load the same model. VRAM determines what fits, bandwidth determines how fast.
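That relationship can be sketched numerically: during decode, generating each token requires streaming (roughly) every weight from VRAM once, so bandwidth divided by model size gives a ceiling on tokens per second. Illustrative only; real throughput lands well below this bound, and the bandwidth figures are approximate spec-sheet values:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tok/s: every token streams the full weights once."""
    return bandwidth_gb_s / model_gb

# Same 9 GB 14B model on a 256-bit RTX 5080 (~960 GB/s)
# vs a 128-bit RTX 4060 Ti (~288 GB/s):
print(round(decode_ceiling_tok_s(960, 9)))  # → 107
print(round(decode_ceiling_tok_s(288, 9)))  # → 32
```

Both cards load the model, but the narrower bus caps the 4060 Ti at roughly a third of the 5080's ceiling, matching the "128-bit bus limits speed" notes in the tables above.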
Troubleshooting
"No NVIDIA GPU detected" in Ollama Logs
Check that CDI was generated after installing the toolkit:
```bash
podman machine ssh -- "nvidia-ctk cdi list"
```

If empty, regenerate:

```bash
podman machine ssh -- "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"
```

Container Starts but No GPU
Make sure you included `--device nvidia.com/gpu=all` when creating the container. You can't add a device to a running container; you need to recreate it.
nvidia-smi Works in WSL but Not in Container
The CDI spec may be stale. After Windows driver updates, regenerate:
```bash
podman machine ssh -- "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"
```

Then recreate the Ollama container.
Model Shows "100% CPU" Despite GPU Being Available
The model might be too large for VRAM. Check available memory:
```bash
podman exec ollama nvidia-smi
```

If VRAM is nearly full from other processes, Ollama will fall back to CPU. Close other GPU-heavy applications or use a smaller model.
CDI Generation Fails
Ensure the toolkit is installed and the GPU is visible:
```bash
podman machine ssh -- "which nvidia-ctk"
podman machine ssh -- "/usr/lib/wsl/lib/nvidia-smi"
```

If `nvidia-smi` fails inside the Podman machine, the issue is at the WSL2 level; update your NVIDIA Windows drivers.
Performance Is Lower Than Expected
- First run penalty — The first inference loads the model into VRAM. Always benchmark on the second run.
- Thinking models — Models like `glm-4.7-flash` or `deepseek-r1` generate internal reasoning tokens before producing output. This isn't slowness, it's by design.
- Shared GPU — If you're running games, video editing, or other GPU workloads simultaneously, available VRAM and compute will be reduced.
Putting It All Together
The complete setup from scratch is five commands after Podman Desktop is installed:
```bash
# 1. Add NVIDIA Container Toolkit repo
podman machine ssh -- "curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo > /dev/null"

# 2. Install the toolkit
podman machine ssh -- "sudo dnf install -y nvidia-container-toolkit"

# 3. Generate CDI specs
podman machine ssh -- "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"

# 4. Run Ollama with GPU
podman run -d --name ollama \
  --device nvidia.com/gpu=all \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama:latest

# 5. Pull and run a model
podman exec ollama ollama run llama3.1:8b "Hello!"
```

Once set up, the GPU passthrough survives container restarts. You only need to regenerate CDI specs if you update your NVIDIA Windows drivers.
Running LLMs locally gives you privacy, zero API costs, and the ability to experiment freely. With a modern NVIDIA GPU and Podman on Windows, you get container isolation without sacrificing GPU performance.