# Ollama Metal GPU crash on macOS 26 — use llama.cpp directly
## Problem
Ollama fails to run any model on macOS 26 (Tahoe / Darwin 25.x) with a Metal GPU initialization error. Every model (llama3.2:1b, gemma2:2b, etc.) crashes with:

```
Error: 500 Internal Server Error: llama runner process has terminated:
error:Error Domain=MTLLibraryErrorDomain Code=3
static_assert failed due to requirement
'__tensor_ops_detail::__is_same_v<bfloat, half>'
"Input types must match cooperative tensor types"
ggml_metal_init: error: failed to initialize the Metal library
ggml_backend_metal_device_init: error: failed to allocate context
```
Setting OLLAMA_NO_GPU=1 or OLLAMA_NUM_GPU=0 does not help — Ollama on macOS always attempts Metal initialization and has no CPU-only fallback.
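When automating around this failure, it helps to distinguish this specific crash from ordinary server errors. A minimal sketch, matching against the error text shown above (the `is_metal_bfloat_crash` helper name and the signature list are my own, not an Ollama API):

```python
# Sketch: classify an error body as the macOS 26 Metal bfloat/half crash.
# The signature strings are taken from the error output shown above.
METAL_BFLOAT_SIGNATURES = (
    "MTLLibraryErrorDomain",
    "__is_same_v<bfloat, half>",
    "ggml_metal_init: error",
)

def is_metal_bfloat_crash(error_text: str) -> bool:
    """Return True if the error text matches the Metal bfloat/half mismatch."""
    return any(sig in error_text for sig in METAL_BFLOAT_SIGNATURES)
```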
## Root Cause
The Metal Performance Primitives framework on macOS 26 has a type mismatch between bfloat and half in the cooperative tensor operations. Ollama’s bundled llama.cpp backend triggers this code path during Metal library compilation at runtime. This is a system-level incompatibility between Ollama’s Metal shaders and the macOS 26 Metal framework.
## Investigation Steps
- Tried multiple models — llama3.2:1b, gemma2:2b — all fail with the same Metal error
- Tried disabling GPU — `OLLAMA_NO_GPU=1`, `OLLAMA_NUM_GPU=0`, restarting the server — Ollama still attempts Metal initialization
- Checked whisper-cpp — whisper-cpp (same llama.cpp project family) works fine with Metal on the same machine, suggesting the issue is specific to Ollama's build or the model architectures it uses
## Solution
Use llama.cpp directly instead of Ollama. The Homebrew build of llama.cpp works correctly with Metal on macOS 26.
### Install

```sh
brew install llama.cpp
```
### Download a model

```sh
mkdir -p ~/.cache/llama-cpp
curl -L "https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf" \
  -o ~/.cache/llama-cpp/qwen2.5-1.5b-instruct-q4_k_m.gguf
```
### Run as a persistent server (recommended for speed)

```sh
llama-server -m ~/.cache/llama-cpp/qwen2.5-1.5b-instruct-q4_k_m.gguf -c 2048 --port 8384
```
This exposes an OpenAI-compatible API at http://localhost:8384/v1/chat/completions.
### Call it from Python (no pip deps)

```python
import json
import urllib.request

body = json.dumps({
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100,
    "temperature": 0.7,
}).encode()

req = urllib.request.Request(
    "http://localhost:8384/v1/chat/completions",
    data=body,
    headers={"content-type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    result = json.loads(resp.read())

print(result["choices"][0]["message"]["content"])
```
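llama-server also accepts `"stream": true` in the request body, returning OpenAI-style server-sent events. A minimal sketch of assembling the streamed text (the `collect_stream_text` helper name is mine; the chunk shape assumes the standard OpenAI delta format):

```python
import json

def collect_stream_text(sse_lines):
    """Assemble assistant text from OpenAI-style SSE chunks.

    Each data line looks like: data: {"choices":[{"delta":{"content":"…"}}]}
    and the stream ends with:  data: [DONE]
    """
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)
```

In practice you would feed this the decoded lines of the streaming HTTP response and print each piece as it arrives rather than joining at the end.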
### Performance comparison

| Approach | Cold start | Inference (100 tokens) |
|---|---|---|
| llama-cli (spawned per call) | ~50 seconds | ~1 second |
| llama-server (persistent) | 0 (pre-loaded) | ~0.3 seconds |
Always use llama-server for interactive use cases.
### CLI one-shot (if server not needed)

```sh
llama-cli -m ~/.cache/llama-cpp/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  -c 2048 -n 100 --temp 0.7 \
  --no-display-prompt -cnv --single-turn \
  -p "Your prompt here"
```
The `--single-turn` flag is critical — without it, `llama-cli` enters interactive mode and waits indefinitely for more input.
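When invoking the one-shot form from code, building the argument list in one place keeps the flags consistent. A sketch wrapping the exact flags shown above (the `llama_cli_args` helper name is my own):

```python
from pathlib import Path

def llama_cli_args(model: str, prompt: str, n_tokens: int = 100,
                   ctx: int = 2048, temp: float = 0.7) -> list[str]:
    """Build the llama-cli argv for a non-interactive single-turn call."""
    return [
        "llama-cli",
        "-m", str(Path(model).expanduser()),
        "-c", str(ctx),
        "-n", str(n_tokens),
        "--temp", str(temp),
        "--no-display-prompt",    # print only the completion
        "-cnv", "--single-turn",  # answer once, then exit (no interactive loop)
        "-p", prompt,
    ]
```

Pass the resulting list to `subprocess.run(..., capture_output=True, text=True)` to capture the completion.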
## Prevention
- When using local LLMs on macOS, prefer llama.cpp (Homebrew) over Ollama until Ollama fixes their Metal compatibility with macOS 26
- Design LLM integrations to be backend-agnostic — use the OpenAI-compatible chat completions API format so you can swap between llama-server, Ollama, or cloud APIs without code changes
- Always implement graceful fallback when the LLM is unavailable
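The last two points can be sketched together: point every backend at the same chat-completions shape and degrade gracefully when none responds. Everything here (class and function names, the fallback message) is illustrative, not an existing API:

```python
import json
import urllib.error
import urllib.request

class ChatBackend:
    """Any OpenAI-compatible chat completions endpoint (llama-server, Ollama, cloud)."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")  # e.g. "http://localhost:8384/v1"

    def chat(self, messages, max_tokens=100, temperature=0.7) -> str:
        body = json.dumps({
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
        }).encode()
        req = urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=body,
            headers={"content-type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]

def chat_with_fallback(backends, messages, default="(LLM unavailable)") -> str:
    """Try each backend in order; return a placeholder if all fail."""
    for backend in backends:
        try:
            return backend.chat(messages)
        except (urllib.error.URLError, OSError, KeyError):
            continue  # backend down or malformed reply; try the next one
    return default
```

Swapping llama-server for Ollama (or a cloud provider) then only changes the `base_url` passed to `ChatBackend`, not the calling code.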
## Related
- Voice Pilot plan
- Voice Pilot smart summary plan
- Ollama GitHub issue: Metal bfloat16 compatibility on macOS 26