The only setup that actually works. Run Claude Code with local LLMs on Apple Silicon — real tool execution, real agentic loops, fully offline.
Every tutorial out there tells you to point Claude Code at Ollama or llama.cpp and call it a day. None of them work. The model generates text that looks like a tool call, but nothing executes. No files get created, no commands run, no code gets written. You're watching a convincing hallucination.
This project uses vllm-mlx — the only backend that speaks Claude Code's native language: the Anthropic Messages API with real tool_use content blocks. When the model decides to read a file, it actually reads the file. When it writes code, the code lands on disk. The agentic loop works — tool calls chain into tool results, the model iterates, and you get the real Claude Code experience running entirely on your hardware.
No API key. No cloud. No subscription. No data leaves your machine. Just ./install.sh and go.
- Apple Silicon Mac (M1/M2/M3/M4/M5)
- 16GB+ unified memory (24GB+ recommended)
- Claude Code installed
- Homebrew
```
git clone https://github.com/vitorallo/claude-code-local.git
cd claude-code-local
./install.sh
```
```
cclocal
```

The first run downloads the default model (~5GB, one-time), then starts vllm-mlx and launches Claude Code.
In Claude Code, type:
create a file called /tmp/test_tools.txt with "hello world"
Working: Claude Code calls the Write tool, creates the file, confirms. Broken: Claude Code generates text saying it created the file, but nothing exists on disk.
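A quick way to tell the two apart from a second terminal (plain shell; it only checks that the file really landed on disk):

```shell
# If Claude Code's Write tool really ran, the file exists with the content;
# a hallucinated tool call leaves nothing behind.
if [ -f /tmp/test_tools.txt ] && grep -q 'hello world' /tmp/test_tools.txt; then
  echo "tool loop works"
else
  echo "broken: the model only narrated the tool call"
fi
```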
| Flag | Model | Size | RAM needed | Notes |
|---|---|---|---|---|
| --gemma-light (default) | Gemma-4-E4B | ~5GB | 16GB+ | Clean tool calling, verified end-to-end |
| --gemma | Gemma-4-26B-A4B MoE | ~16GB | 24GB+ | Google MoE, 3.8B active params |
| --review | GLM-4.7-Flash | ~17GB | 24GB+ | Stronger reasoning |
| --coder | Qwen3-Coder-30B-A3B | ~18GB | 24GB+ | Heavier code model |
| --qwen3 | Qwen3.5-9B | ~5GB | 16GB+ | General reasoning — leaks plain-text thinking [1] |
| --coder7b | Qwen2.5-Coder-7B | ~5GB | 16GB+ | Code analysis — tool calls unreliable [2] |
| --light | (alias) | — | — | Back-compat alias for --gemma-light (v2.0.1 pointed at Qwen3.5-9B) |
| --model ID | Any MLX model | varies | varies | Custom HuggingFace model ID (not tested) |
[1] Qwen3.5 is a hybrid-thinking model that ignores enable_thinking=false at
the template level and emits plain-text "Thinking Process:" preamble outside
<think> tags. Known upstream issue; see
vllm-project/vllm#35574
and QwenLM/Qwen3#1625. Use
only if you want general reasoning and tolerate verbose output.
[2] Qwen2.5-Coder-7B hallucinates an XML tool-call format
(<Write path="..." content="..."/>) that no parser handles. Good for
non-agentic code analysis where you feed it whole files, not for Claude
Code's tool loop. Use --gemma-light for tool calling work instead.
```
cclocal                 # Interactive menu: pick model, see what's cached, manage cache
cclocal --gemma-light   # Direct launch, Gemma-4-E4B (default, clean tool calling)
cclocal --gemma         # Direct launch, Gemma-4-26B MoE
cclocal --review        # Direct launch, GLM-4.7-Flash
cclocal --coder         # Direct launch, Qwen3-Coder-30B-A3B
cclocal --list          # List cached models on disk
cclocal --rm            # Manage/delete cached models (interactive)
cclocal --server        # Start server only, connect Claude Code separately
cclocal -h              # Show all options
```

Running cclocal with no arguments opens an interactive menu that shows every supported model, indicates which are already cached on disk, and lets you pick one or jump to a cache-management screen. Use the model flags to skip the menu when you already know what you want.
```
cclocal --server
```

Then connect Claude Code from any terminal:

```
ANTHROPIC_BASE_URL=http://127.0.0.1:8000 \
ANTHROPIC_API_KEY=not-needed \
ANTHROPIC_MODEL=mlx-community/gemma-4-e4b-it-4bit \
claude --strict-mcp-config --mcp-config /path/to/claude-code-local/mcp-local.json \
  --tools "Bash,Read,Edit,Write,Glob,Grep,WebSearch,WebFetch"
```

Replace /path/to/claude-code-local with wherever you cloned the repo, or just use cclocal --server, which prints the full command for you.
Running Claude Code with a local model isn't just "point it at localhost". There are 15 problems that break the experience. This section documents every one and how run.sh handles it.
Problem: Ollama's Anthropic API adapter generates text that looks like tool calls but never emits real tool_use content blocks. Claude Code receives plain text, never executes anything. Tested with qwen3.5:9b, qwen3.5:35b-a3b, glm-4.7-flash — all produce fake tool calls.
Solution: Use vllm-mlx. It implements the native Anthropic Messages API with real tool_use / tool_result content blocks.
Problem: Claude Code needs stop_reason: "end_turn" to know the model finished. Backends returning "stop" (OpenAI convention) cause Claude Code to stop looping after the first response — no tool calls, no iteration.
Solution: vllm-mlx's native /v1/messages endpoint returns correct Anthropic stop reasons.
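The difference is visible in the raw response body. A minimal sketch with a canned response in the shape a working backend returns (field names are from the Anthropic Messages API; the exact text and input values are illustrative):

```shell
# A working backend returns a structured tool_use content block plus
# stop_reason "tool_use"; a broken adapter returns a single text block
# that merely describes the call in prose.
resp='{"stop_reason":"tool_use","content":[{"type":"text","text":"Creating the file."},{"type":"tool_use","name":"Write","input":{"file_path":"/tmp/test_tools.txt","content":"hello world"}}]}'
if printf '%s' "$resp" | grep -q '"type":"tool_use"'; then
  echo "real tool_use block -- Claude Code will execute it"
fi
# → prints: real tool_use block -- Claude Code will execute it
```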
Problem: Qwen 3.x and Gemma 4 models emit thinking/reasoning tokens. Claude Code doesn't expect these — causes garbage output and misparses tool calls.
Solution: run.sh sets VLLM_MLX_ENABLE_THINKING=false on the server, which passes enable_thinking=False to the chat template. This suppresses thinking tokens at the template level for all models.
Problem: Claude Code's attribution header changes every request, invalidating the KV cache. Follow-up responses go from 2s to 30s+.
Solution: CLAUDE_CODE_ATTRIBUTION_HEADER=0 disables the header (set by run.sh).
Problem: Claude Code calls claude-haiku-4-5-20251001 for background tasks. The local server doesn't recognize it, returns a 404, and the session hangs.
Solution: All model tier env vars (ANTHROPIC_DEFAULT_OPUS_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, CLAUDE_CODE_SUBAGENT_MODEL) are set to the same local model (set by run.sh).
Problem: Claude Code calls /v1/messages/count_tokens. Most local servers don't implement it.
Solution: vllm-mlx supports it. DISABLE_PROMPT_CACHING=1 reduces dependence on it.
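For reference, an Anthropic-style request to that endpoint looks roughly like this (sketch only; the || fallback keeps it safe to run when the server is down):

```shell
# Probe the token-count endpoint; a running vllm-mlx answers with a
# JSON body, otherwise the fallback message is printed.
curl -s http://127.0.0.1:8000/v1/messages/count_tokens \
  -H 'content-type: application/json' \
  -d '{"model": "mlx-community/gemma-4-e4b-it-4bit",
       "messages": [{"role": "user", "content": "hello"}]}' \
  || echo '(server not running)'
```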
Problem: Claude Code fires concurrent requests (main + background + subagents). Two concurrent 24K+ token prompts exceed the Metal GPU buffer limit on 24GB and crash the server.
Solution: Run in single-request mode (no --continuous-batching). Requests serialize instead of competing for Metal memory. Additionally, --kv-cache-quantization halves KV cache memory usage, giving more headroom before OOM.
Problem: Claude Code expects Anthropic SSE events. OpenAI-format streaming shows only the last token.
Solution: vllm-mlx uses native Anthropic SSE streaming.
Problem: Claude Code sends ALL tool definitions in every request. With plugins enabled, that's 200+ tools crammed into the system prompt. Even 30B models choke.
Solution: Two flags strip tools down to essentials:
--strict-mcp-config --mcp-config mcp-local.json # strips all plugin/MCP tools
--tools "Bash,Read,Edit,Write,Glob,Grep,WebSearch,WebFetch" # 8 built-in tools only
Your plugins remain available when running Claude Code normally with the cloud API.
Problem: Your real ANTHROPIC_API_KEY (sk-ant-...) is set in the shell. Claude Code detects it and may send it to the local server.
Solution: env -u ANTHROPIC_API_KEY -u ANTHROPIC_AUTH_TOKEN in run.sh explicitly unsets real keys before setting the dummy one.
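The pattern, in miniature (the sk-ant value below is a placeholder, not a real key):

```shell
# env processes -u removals first, then applies new assignments, so the
# real key is gone before the dummy one is set.
ANTHROPIC_API_KEY=sk-ant-placeholder \
  env -u ANTHROPIC_API_KEY -u ANTHROPIC_AUTH_TOKEN \
      ANTHROPIC_API_KEY=not-needed \
      sh -c 'echo "Claude Code sees: $ANTHROPIC_API_KEY"'
# → prints: Claude Code sees: not-needed
```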
Problem: Claude Code tries to check for updates and send telemetry on startup, which can hang or slow down local-only sessions.
Solution: Session env vars:
DISABLE_AUTOUPDATER=1
DISABLE_TELEMETRY=1
DISABLE_ERROR_REPORTING=1
| Model | Size | Free RAM left (on 24GB) | Status |
|---|---|---|---|
| Gemma-4-E4B | ~5GB | ~19GB | Default — verified tool loop |
| Qwen3.5-9B | ~5GB | ~19GB | Works but leaks plain-text thinking |
| Qwen2.5-Coder-7B | ~5GB | ~19GB | Code analysis only — tool calls unreliable |
| Gemma-4-26B-A4B MoE | ~16GB | ~8GB | Fast inference, tight on 24GB |
| GLM-4.7-Flash | ~16.9GB | ~7GB | Works single-request only |
Problem: Earlier versions of vllm-mlx serve crashed on startup with any model:
TypeError: cannot unpack non-iterable NoneType object
In vllm_mlx/utils/tokenizer.py, the function load_model_with_fallback() was missing a return statement on the success path.
Solution: Fixed upstream and present in our fork. install.sh installs from vitorallo/vllm-mlx@claude-code-local-patches, which carries the fix on top of a rebased foil-patches-rebased base, plus Gemma 4 channel-token cleanup patches for Claude Code compatibility (asymmetric <|channel>thought...<channel|> handling in both non-streaming and streaming paths).
Problem: Scripts polling for server readiness grep for "ok" but vllm-mlx returns "status":"healthy".
Solution: run.sh greps for "healthy".
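A bounded poll in the same spirit (the /health endpoint path is an assumption; the "healthy" response string is from the text above; the loop gives up after 30 attempts rather than hanging forever):

```shell
# Wait for the server to report healthy before launching Claude Code.
for attempt in $(seq 1 30); do
  if curl -sf http://127.0.0.1:8000/health 2>/dev/null | grep -q '"status":"healthy"'; then
    echo "server ready"
    break
  fi
  sleep 1
done
```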
Problem: Setting ANTHROPIC_MODEL=default causes 404. vllm-mlx requires the full HuggingFace model ID.
Solution: run.sh passes the full model ID (e.g., mlx-community/gemma-4-e4b-it-4bit).
| Symptom | Cause | Fix |
|---|---|---|
| vllm-mlx crashes on startup (TypeError: NoneType) | Using unpatched upstream | ./install.sh installs from our fork which has the fix |
| Model generates text about tools but nothing executes | Using Ollama | Switch to vllm-mlx — Ollama can't produce real tool_use blocks |
| Metal GPU OOM | Model too large for concurrent requests | Use the default model (--gemma-light) or accept single-request mode |
| Claude Code asks about "detected custom API key" | Real API key leaking | Use cclocal which unsets real keys |
| "Model does not exist" (404) | Wrong model name | Must use full HuggingFace ID, not "default" |
| Slow responses (30-60s) | Normal for local inference | Context grows each turn — 24K+ tokens at ~8 tok/s |
| Variable | Value | Purpose |
|---|---|---|
| ANTHROPIC_BASE_URL | http://127.0.0.1:8000 | Point Claude Code at local server |
| ANTHROPIC_API_KEY | not-needed | Dummy key (real key explicitly unset) |
| ANTHROPIC_MODEL | Full HuggingFace ID | Model identifier |
| ANTHROPIC_DEFAULT_*_MODEL | Same as above | Route all tiers (Opus/Sonnet/Haiku) locally |
| CLAUDE_CODE_SUBAGENT_MODEL | Same as above | Route subagent calls locally |
| CLAUDE_CODE_MAX_OUTPUT_TOKENS | 16384 (9B) / 4096 (large) | Output limit per model size |
| CLAUDE_CODE_ATTRIBUTION_HEADER | 0 | Prevents KV cache invalidation |
| DISABLE_PROMPT_CACHING | 1 | Local server doesn't support Anthropic caching |
| DISABLE_AUTOUPDATER | 1 | No update checks |
| DISABLE_TELEMETRY | 1 | No telemetry |
| DISABLE_ERROR_REPORTING | 1 | No error reporting |
| DISABLE_NON_ESSENTIAL_MODEL_CALLS | 1 | Reduce background model calls |
| Flag | Purpose |
|---|---|
| VLLM_MLX_ENABLE_THINKING=false | Disable thinking/reasoning tokens |
| --kv-cache-quantization | 8-bit KV cache — halves cache memory usage |
| --cache-memory-percent 0.35 | 35% of RAM for cache (~8.4GB on 24GB) |
| --prefill-step-size 4096 | Faster time-to-first-token on large prompts |
| --stream-interval 4 | Batch 4 tokens before streaming for throughput |
| --timeout 600 | 10 min timeout (default 300s caused disconnects) |
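Assembled into one invocation, the flags above look roughly like this (the exact `vllm-mlx serve` CLI shape is an assumption; the command is echoed rather than executed so the sketch is safe to paste anywhere). Note what is deliberately absent: --continuous-batching, since single-request mode avoids the Metal OOM described earlier.

```shell
# Illustrative server command assembled from the flag table; echoed,
# not run. There is intentionally no --continuous-batching flag.
cmd='VLLM_MLX_ENABLE_THINKING=false vllm-mlx serve \
  mlx-community/gemma-4-e4b-it-4bit \
  --kv-cache-quantization \
  --cache-memory-percent 0.35 \
  --prefill-step-size 4096 \
  --stream-interval 4 \
  --timeout 600'
echo "$cmd"
```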
| Flag | Purpose |
|---|---|
| --strict-mcp-config | Ignore global plugins |
| --mcp-config mcp-local.json | Empty config — no plugin tools |
| --tools "Bash,Read,..." | 8 essential built-in tools only |
```
claude-code-local/
  run.sh          # Launcher — starts vllm-mlx + Claude Code
  install.sh      # Setup — creates .venv, installs vllm-mlx, patches bugs, creates cclocal
  mcp-local.json  # Empty MCP config (strips plugins for local sessions)
  .venv/          # Local Python venv with vllm-mlx (created by install.sh)
  .gitignore
  README.md
```
- vllm-mlx — Anthropic-compatible MLX inference server
- Claude Code — Anthropic's CLI for Claude
- Why Claude Code Fails with Local LLMs — Detailed failure analysis
- Claude Code tool flooding issue — 259 tools sent to local models
- Ollama Anthropic Compatibility — Confirmed broken for tool_use
This project would not exist without vllm-mlx by Wayner Barrios — the native Apple Silicon MLX backend that makes real Anthropic tool-use blocks possible on local hardware. If you use vLLM-MLX in your research or project, please cite:
```bibtex
@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}
```