# Ollama Metal GPU crash on macOS 26 — use llama.cpp directly
## Problem
Ollama fails to run any model on macOS 26 (Tahoe / Darwin 25.x) with a Metal GPU initialization error. Every model (llama3.2:1b, gemma2:2b, etc.) crashes with:

```
Error: 500 Internal Server Error: llama runner process has terminated:
error:Error Domain=MTLLibraryErrorDomain Code=3
static_assert failed due to requirement
'__tensor_ops_detail::__is_same_v<bfloat, half>'
"Input types must match cooperative tensor types"
ggml_metal_init: error: failed to initialize the Metal library
ggml_backend_metal_device_init: error: failed to allocate context
```
Setting OLLAMA_NO_GPU=1 or OLLAMA_NUM_GPU=0 does not help — Ollama on macOS always attempts Metal initialization and has no CPU-only fallback.
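When automating around this failure, it helps to distinguish this specific crash from ordinary server errors. A minimal sketch, matching against the error text shown above (the `is_metal_bfloat_crash` helper name and the signature list are my own, not an Ollama API):

```python
# Sketch: classify an error body as the macOS 26 Metal bfloat/half crash.
# The signature strings are taken from the error output shown above.
METAL_BFLOAT_SIGNATURES = (
    "MTLLibraryErrorDomain",
    "__is_same_v<bfloat, half>",
    "ggml_metal_init: error",
)

def is_metal_bfloat_crash(error_text: str) -> bool:
    """Return True if the error text matches the Metal bfloat/half mismatch."""
    return any(sig in error_text for sig in METAL_BFLOAT_SIGNATURES)
```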
## Root Cause
The Metal Performance Primitives framework on macOS 26 has a type mismatch between bfloat and half in the cooperative tensor operations. Ollama’s bundled llama.cpp backend triggers this code path during Metal library compilation at runtime. This is a system-level incompatibility between Ollama’s Metal shaders and the macOS 26 Metal framework.
## Investigation Steps
- Tried multiple models — llama3.2:1b, gemma2:2b — all fail with the same Metal error
- Tried disabling GPU — `OLLAMA_NO_GPU=1`, `OLLAMA_NUM_GPU=0`, restarting the server — Ollama still attempts Metal initialization
- Checked whisper-cpp — whisper-cpp (same llama.cpp project family) works fine with Metal on the same machine, suggesting the issue is specific to Ollama's build or the model architectures it uses
## Solution
Use llama.cpp directly instead of Ollama. The Homebrew build of llama.cpp works correctly with Metal on macOS 26.
### Install

```sh
brew install llama.cpp
```
### Download a model

```sh
mkdir -p ~/.cache/llama-cpp
curl -L "https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf" \
  -o ~/.cache/llama-cpp/qwen2.5-1.5b-instruct-q4_k_m.gguf
```
### Run as a persistent server (recommended for speed)

```sh
llama-server -m ~/.cache/llama-cpp/qwen2.5-1.5b-instruct-q4_k_m.gguf -c 2048 --port 8384
```
This exposes an OpenAI-compatible API at http://localhost:8384/v1/chat/completions.
### Call it from Python (no pip deps)

```python
import json
import urllib.request

body = json.dumps({
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100,
    "temperature": 0.7,
}).encode()

req = urllib.request.Request(
    "http://localhost:8384/v1/chat/completions",
    data=body,
    headers={"content-type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    result = json.loads(resp.read())

print(result["choices"][0]["message"]["content"])
```
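llama-server also accepts `"stream": true` in the request body, returning OpenAI-style server-sent events. A minimal sketch of assembling the streamed text (the `collect_stream_text` helper name is mine; the chunk shape assumes the standard OpenAI delta format):

```python
import json

def collect_stream_text(sse_lines):
    """Assemble assistant text from OpenAI-style SSE chunks.

    Each data line looks like: data: {"choices":[{"delta":{"content":"…"}}]}
    and the stream ends with:  data: [DONE]
    """
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)
```

In practice you would feed this the decoded lines of the streaming HTTP response and print each piece as it arrives rather than joining at the end.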
### Performance comparison

| Approach | Cold start | Inference (100 tokens) |
|---|---|---|
| llama-cli (spawned per call) | ~50 seconds | ~1 second |
| llama-server (persistent) | 0 (pre-loaded) | ~0.3 seconds |
Always use llama-server for interactive use cases.
### CLI one-shot (if server not needed)

```sh
llama-cli -m ~/.cache/llama-cpp/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  -c 2048 -n 100 --temp 0.7 \
  --no-display-prompt -cnv --single-turn \
  -p "Your prompt here"
```
The `--single-turn` flag is critical — without it, `llama-cli` enters interactive mode and waits indefinitely for more input.
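When invoking the one-shot form from code, building the argument list in one place keeps the flags consistent. A sketch wrapping the exact flags shown above (the `llama_cli_args` helper name is my own):

```python
from pathlib import Path

def llama_cli_args(model: str, prompt: str, n_tokens: int = 100,
                   ctx: int = 2048, temp: float = 0.7) -> list[str]:
    """Build the llama-cli argv for a non-interactive single-turn call."""
    return [
        "llama-cli",
        "-m", str(Path(model).expanduser()),
        "-c", str(ctx),
        "-n", str(n_tokens),
        "--temp", str(temp),
        "--no-display-prompt",    # print only the completion
        "-cnv", "--single-turn",  # answer once, then exit (no interactive loop)
        "-p", prompt,
    ]
```

Pass the resulting list to `subprocess.run(..., capture_output=True, text=True)` to capture the completion.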
## Prevention
- When using local LLMs on macOS, prefer llama.cpp (Homebrew) over Ollama until Ollama fixes their Metal compatibility with macOS 26
- Design LLM integrations to be backend-agnostic — use the OpenAI-compatible chat completions API format so you can swap between llama-server, Ollama, or cloud APIs without code changes
- Always implement graceful fallback when the LLM is unavailable
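The last two points can be sketched together: point every backend at the same chat-completions shape and degrade gracefully when none responds. Everything here (class and function names, the fallback message) is illustrative, not an existing API:

```python
import json
import urllib.error
import urllib.request

class ChatBackend:
    """Any OpenAI-compatible chat completions endpoint (llama-server, Ollama, cloud)."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")  # e.g. "http://localhost:8384/v1"

    def chat(self, messages, max_tokens=100, temperature=0.7) -> str:
        body = json.dumps({
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
        }).encode()
        req = urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=body,
            headers={"content-type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]

def chat_with_fallback(backends, messages, default="(LLM unavailable)") -> str:
    """Try each backend in order; return a placeholder if all fail."""
    for backend in backends:
        try:
            return backend.chat(messages)
        except (urllib.error.URLError, OSError, KeyError):
            continue  # backend down or malformed reply; try the next one
    return default
```

Swapping llama-server for Ollama (or a cloud provider) then only changes the `base_url` passed to `ChatBackend`, not the calling code.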
## Related
- Voice Pilot plan
- Voice Pilot smart summary plan
- Ollama GitHub issue: Metal bfloat16 compatibility on macOS 26