
Claude-Code-Local

The only setup that actually works. Run Claude Code with local LLMs on Apple Silicon — real tool execution, real agentic loops, fully offline.

Every tutorial out there tells you to point Claude Code at Ollama or llama.cpp and call it a day. None of them work. The model generates text that looks like a tool call, but nothing executes. No files get created, no commands run, no code gets written. You're watching a convincing hallucination.

This project uses vllm-mlx — the only backend that speaks Claude Code's native language: the Anthropic Messages API with real tool_use content blocks. When the model decides to read a file, it actually reads the file. When it writes code, the code lands on disk. The agentic loop works — tool calls chain into tool results, the model iterates, and you get the real Claude Code experience running entirely on your hardware.

No API key. No cloud. No subscription. No data leaves your machine. Just ./install.sh and go.

What you need

  • Apple Silicon Mac (M1/M2/M3/M4/M5)
  • 16GB+ unified memory (24GB+ recommended)
  • Claude Code installed
  • Homebrew

Quick start

git clone https://github.com/vitorallo/claude-code-local.git
cd claude-code-local
./install.sh
cclocal

The first run downloads the default model (~5GB, one time), then starts vllm-mlx and launches Claude Code.

Verify it works

In Claude Code, type:

create a file called /tmp/test_tools.txt with "hello world"

Working: Claude Code calls the Write tool, creates the file, confirms. Broken: Claude Code generates text saying it created the file, but nothing exists on disk.
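To double-check from outside the session, a quick test in a second terminal tells the two cases apart:

```shell
# Check from a second terminal whether the Write tool really touched disk,
# or whether the "file created" message was hallucinated.
if [ -f /tmp/test_tools.txt ]; then
  result="real tool execution: $(cat /tmp/test_tools.txt)"
else
  result="hallucinated: nothing on disk"
fi
echo "$result"
```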

Models

| Flag | Model | Size | RAM needed | Notes |
|------|-------|------|------------|-------|
| --gemma-light (default) | Gemma-4-E4B | ~5GB | 16GB+ | Clean tool calling, verified end-to-end |
| --gemma | Gemma-4-26B-A4B MoE | ~16GB | 24GB+ | Google MoE, 3.8B active params |
| --review | GLM-4.7-Flash | ~17GB | 24GB+ | Stronger reasoning |
| --coder | Qwen3-Coder-30B-A3B | ~18GB | 24GB+ | Heavier code model |
| --qwen3 | Qwen3.5-9B | ~5GB | 16GB+ | General reasoning — leaks plain-text thinking [1] |
| --coder7b | Qwen2.5-Coder-7B | ~5GB | 16GB+ | Code analysis — tool calls unreliable [2] |
| --light | (alias) | | | Back-compat alias for --gemma-light (v2.0.1 pointed at Qwen3.5-9B) |
| --model ID | Any MLX model | varies | varies | Custom HuggingFace model ID (not tested) |

[1] Qwen3.5 is a hybrid-thinking model that ignores enable_thinking=false at the template level and emits plain-text "Thinking Process:" preamble outside <think> tags. Known upstream issue; see vllm-project/vllm#35574 and QwenLM/Qwen3#1625. Use only if you want general reasoning and tolerate verbose output.

[2] Qwen2.5-Coder-7B hallucinates an XML tool-call format (<Write path="..." content="..."/>) that no parser handles. Good for non-agentic code analysis where you feed it whole files, not for Claude Code's tool loop. Use --gemma-light for tool calling work instead.

cclocal                # Interactive menu: pick model, see what's cached, manage cache
cclocal --gemma-light  # Direct launch, Gemma-4-E4B (default, clean tool calling)
cclocal --gemma        # Direct launch, Gemma-4-26B MoE
cclocal --review       # Direct launch, GLM-4.7-Flash
cclocal --coder        # Direct launch, Qwen3-Coder-30B-A3B
cclocal --list         # List cached models on disk
cclocal --rm           # Manage/delete cached models (interactive)
cclocal --server       # Start server only, connect Claude Code separately
cclocal -h             # Show all options

Running cclocal with no arguments opens an interactive menu that shows every supported model, indicates which are already cached on disk, and lets you pick one or jump to a cache management screen. Use the model flags to skip the menu when you already know what you want.

Server-only mode

cclocal --server

Then connect Claude Code from any terminal:

ANTHROPIC_BASE_URL=http://127.0.0.1:8000 \
ANTHROPIC_API_KEY=not-needed \
ANTHROPIC_MODEL=mlx-community/gemma-4-e4b-it-4bit \
claude --strict-mcp-config --mcp-config /path/to/claude-code-local/mcp-local.json \
  --tools "Bash,Read,Edit,Write,Glob,Grep,WebSearch,WebFetch"

Replace /path/to/claude-code-local with wherever you cloned the repo. Or just use cclocal --server which prints the full command for you.
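For a sanity check without Claude Code in the loop, you can fire a minimal Messages request by hand. This is a sketch: the payload shape follows the Anthropic Messages API, the model ID is the repo's default, and the curl line is left commented since it assumes the server is already up.

```shell
# Minimal Anthropic-Messages-style payload for a hand-rolled smoke test.
# The model ID matches the repo default; adjust if you launched another.
payload='{"model":"mlx-community/gemma-4-e4b-it-4bit","max_tokens":64,"messages":[{"role":"user","content":"say hi"}]}'

# Validate the JSON locally before sending it anywhere.
printf '%s' "$payload" | python3 -m json.tool > /dev/null && echo payload-ok

# With the server running:
# curl -s http://127.0.0.1:8000/v1/messages \
#   -H 'content-type: application/json' -d "$payload"
```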


Why this is hard (and how we solved it)

Running Claude Code with a local model isn't just "point it at localhost". There are 15 problems that break the experience. This section documents every one and how run.sh handles it.

1. Ollama can't produce real tool calls

Problem: Ollama's Anthropic API adapter generates text that looks like tool calls but never emits real tool_use content blocks. Claude Code receives plain text, never executes anything. Tested with qwen3.5:9b, qwen3.5:35b-a3b, glm-4.7-flash — all produce fake tool calls.

Solution: Use vllm-mlx. It implements the native Anthropic Messages API with real tool_use / tool_result content blocks.
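The difference is mechanical once you look at the response payload. A sketch (the two response dicts are illustrative, but the content-block shape follows the Anthropic Messages API):

```python
# A real Anthropic-style response carries a structured tool_use block;
# the Ollama-adapter failure mode is the same intent flattened into text.
real = {
    "stop_reason": "tool_use",
    "content": [
        {"type": "text", "text": "I'll create that file."},
        {"type": "tool_use", "id": "toolu_01", "name": "Write",
         "input": {"file_path": "/tmp/test_tools.txt", "content": "hello world"}},
    ],
}
fake = {
    "stop_reason": "end_turn",
    "content": [{"type": "text",
                 "text": "I used the Write tool to create /tmp/test_tools.txt."}],
}

def has_real_tool_call(msg):
    # Claude Code only executes structured tool_use blocks, never prose.
    return any(block["type"] == "tool_use" for block in msg["content"])

print(has_real_tool_call(real))   # True
print(has_real_tool_call(fake))   # False
```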

2. end_turn vs stop (the loop killer)

Problem: Claude Code needs stop_reason: "end_turn" to know the model finished. Backends returning "stop" (OpenAI convention) cause Claude Code to stop looping after the first response — no tool calls, no iteration.

Solution: vllm-mlx's native /v1/messages endpoint returns correct Anthropic stop reasons.
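For illustration, the convention mismatch boils down to a small mapping (values per the two public APIs; this table is an explanation, not code from this repo):

```python
# OpenAI-style finish_reason vs. the Anthropic stop_reason Claude Code
# expects. A backend that emits the left column unmapped breaks the loop.
FINISH_TO_STOP = {
    "stop": "end_turn",       # model finished its turn
    "length": "max_tokens",   # hit the output-token limit
    "tool_calls": "tool_use", # model wants a tool executed
}
print(FINISH_TO_STOP["stop"])  # end_turn
```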

3. Reasoning/thinking tokens (garbage output)

Problem: Qwen 3.x and Gemma 4 models emit thinking/reasoning tokens. Claude Code doesn't expect these — causes garbage output and misparses tool calls.

Solution: run.sh sets VLLM_MLX_ENABLE_THINKING=false on the server, which passes enable_thinking=False to the chat template. This suppresses thinking tokens at the template level for all models.
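If a model still leaks reasoning past the template switch (as Qwen3.5 does, see footnote [1] above), a client-side scrub is one possible workaround. This helper is a sketch, not part of run.sh:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove leaked reasoning: <think>...</think> blocks and any
    plain-text 'Thinking Process:' preamble before the first blank line."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"\AThinking Process:.*?\n\n", "", text, flags=re.DOTALL)
    return text.strip()

print(strip_thinking("<think>plan the edit</think>Here is the file."))
# Here is the file.
```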

4. KV cache invalidation (90% slowdown)

Problem: Claude Code's attribution header changes every request, invalidating the KV cache. Follow-up responses go from 2s to 30s+.

Solution: CLAUDE_CODE_ATTRIBUTION_HEADER=0 disables the header (set by run.sh).

5. Background Haiku model calls (crash)

Problem: Claude Code calls claude-haiku-4-5-20251001 for background tasks. The local server doesn't recognize it, returns a 404, and the session hangs.

Solution: All model tier env vars (ANTHROPIC_DEFAULT_OPUS_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, CLAUDE_CODE_SUBAGENT_MODEL) are set to the same local model (set by run.sh).

6. Token counting endpoint (silent failure)

Problem: Claude Code calls /v1/messages/count_tokens. Most local servers don't implement it.

Solution: vllm-mlx supports it. DISABLE_PROMPT_CACHING=1 reduces dependence on it.

7. Concurrent requests OOM

Problem: Claude Code fires concurrent requests (main + background + subagents). Two concurrent 24K+ token prompts exceed the Metal GPU buffer limit on 24GB and crash the server.

Solution: Run in single-request mode (no --continuous-batching). Requests serialize instead of competing for Metal memory. Additionally, --kv-cache-quantization halves KV cache memory usage, giving more headroom before OOM.
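The halving is straightforward arithmetic. A back-of-envelope sketch (the layer/head/dim numbers below are hypothetical, not read from any of these models):

```python
# Per-token KV cache cost: one K and one V tensor across every layer.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Hypothetical 9B-class shape: 36 layers, 8 KV heads, head_dim 128.
fp16 = kv_bytes_per_token(36, 8, 128, 2)   # 16-bit cache
int8 = kv_bytes_per_token(36, 8, 128, 1)   # --kv-cache-quantization
ctx = 24_000                               # one large Claude Code prompt
print(f"fp16: {fp16 * ctx / 2**30:.2f} GiB, int8: {int8 * ctx / 2**30:.2f} GiB")
# fp16: 3.30 GiB, int8: 1.65 GiB
```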

8. Streaming format mismatches (partial responses)

Problem: Claude Code expects Anthropic SSE events. OpenAI-format streaming shows only the last token.

Solution: vllm-mlx uses native Anthropic SSE streaming.

9. Tool flooding (259 tools overwhelm local models)

Problem: Claude Code sends ALL tool definitions in every request. With plugins enabled, that's 200+ tools crammed into the system prompt. Even 30B models choke.

Solution: Two flags strip tools down to essentials:

--strict-mcp-config --mcp-config mcp-local.json    # strips all plugin/MCP tools
--tools "Bash,Read,Edit,Write,Glob,Grep,WebSearch,WebFetch"  # 8 built-in tools only

Your plugins remain available when running Claude Code normally with the cloud API.

10. Real API key leaking to local server

Problem: Your real ANTHROPIC_API_KEY (sk-ant-...) is set in the shell. Claude Code detects it and may send it to the local server.

Solution: env -u ANTHROPIC_API_KEY -u ANTHROPIC_AUTH_TOKEN in run.sh explicitly unsets real keys before setting the dummy one.
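The pattern is plain env(1). A standalone sketch of what run.sh does, with an echo child standing in for the real server/CLI launch:

```shell
# Simulate a shell where a real key is exported...
export ANTHROPIC_API_KEY="sk-ant-real-key-do-not-leak"

# ...then launch the child with the real key scrubbed and a dummy set.
# env removes the named variables first, then applies the assignment.
env -u ANTHROPIC_API_KEY -u ANTHROPIC_AUTH_TOKEN \
    ANTHROPIC_API_KEY=not-needed \
    sh -c 'echo "child sees: $ANTHROPIC_API_KEY"'
# child sees: not-needed
```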

11. Autoupdater and telemetry (network-dependent startup)

Problem: Claude Code tries to check for updates and send telemetry on startup, which can hang or slow down local-only sessions.

Solution: Session env vars:

DISABLE_AUTOUPDATER=1
DISABLE_TELEMETRY=1
DISABLE_ERROR_REPORTING=1

12. Memory pressure on 24GB

| Model | Size | Free RAM | Status |
|-------|------|----------|--------|
| Gemma-4-E4B | ~5GB | ~19GB | Default — verified tool loop |
| Qwen3.5-9B | ~5GB | ~19GB | Works but leaks plain-text thinking |
| Qwen2.5-Coder-7B | ~5GB | ~19GB | Code analysis only — tool calls unreliable |
| Gemma-4-26B-A4B MoE | ~16GB | ~8GB | Fast inference, tight on 24GB |
| GLM-4.7-Flash | ~16.9GB | ~7GB | Works single-request only |

13. vllm-mlx critical bug: missing return statement (historical)

Problem: Earlier versions of vllm-mlx serve crashed on startup with any model:

TypeError: cannot unpack non-iterable NoneType object

In vllm_mlx/utils/tokenizer.py, the function load_model_with_fallback() was missing a return statement on the success path.

Solution: Fixed upstream and present in our fork. install.sh installs from vitorallo/vllm-mlx@claude-code-local-patches, which carries the fix on top of a rebased foil-patches-rebased base, plus Gemma 4 channel-token cleanup patches for Claude Code compatibility (asymmetric <|channel>thought...<channel|> handling in both non-streaming and streaming paths).

14. Health endpoint mismatch

Problem: Scripts polling for server readiness grep for "ok" but vllm-mlx returns "status":"healthy".

Solution: run.sh greps for "healthy".
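For scripts of your own, the check is a grep against the actual body. A sketch using the literal response shape (swap the hard-coded body for a `curl -sf http://127.0.0.1:8000/health` call against a live server):

```shell
# vllm-mlx health response body, per the shape described above:
body='{"status":"healthy"}'

# Wrong: greps for "ok", never matches, readiness poll waits forever.
printf '%s' "$body" | grep -q '"ok"' || echo 'grep for "ok": no match'

# Right: matches what the server actually returns.
printf '%s' "$body" | grep -q '"healthy"' && echo 'server ready'
```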

15. Model name default not recognized

Problem: Setting ANTHROPIC_MODEL=default causes 404. vllm-mlx requires the full HuggingFace model ID.

Solution: run.sh passes the full model ID (e.g., mlx-community/gemma-4-e4b-it-4bit).


Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| vllm-mlx crashes on startup (TypeError: NoneType) | Using unpatched upstream | ./install.sh installs from our fork which has the fix |
| Model generates text about tools but nothing executes | Using Ollama | Switch to vllm-mlx — Ollama can't produce real tool_use blocks |
| Metal GPU OOM | Model too large for concurrent requests | Use default model (9B) or accept single-request mode |
| Claude Code asks about "detected custom API key" | Real API key leaking | Use cclocal which unsets real keys |
| "Model does not exist" (404) | Wrong model name | Must use full HuggingFace ID, not "default" |
| Slow responses (30-60s) | Normal for local inference | Context grows each turn — 24K+ tokens at ~8 tok/s |

Configuration reference

Environment variables (set by run.sh per-session)

| Variable | Value | Purpose |
|----------|-------|---------|
| ANTHROPIC_BASE_URL | http://127.0.0.1:8000 | Point Claude Code at local server |
| ANTHROPIC_API_KEY | not-needed | Dummy key (real key explicitly unset) |
| ANTHROPIC_MODEL | Full HuggingFace ID | Model identifier |
| ANTHROPIC_DEFAULT_*_MODEL | Same as above | Route all tiers (Opus/Sonnet/Haiku) locally |
| CLAUDE_CODE_SUBAGENT_MODEL | Same as above | Route subagent calls locally |
| CLAUDE_CODE_MAX_OUTPUT_TOKENS | 16384 (9B) / 4096 (large) | Output limit per model size |
| CLAUDE_CODE_ATTRIBUTION_HEADER | 0 | Prevents KV cache invalidation |
| DISABLE_PROMPT_CACHING | 1 | Local server doesn't support Anthropic caching |
| DISABLE_AUTOUPDATER | 1 | No update checks |
| DISABLE_TELEMETRY | 1 | No telemetry |
| DISABLE_ERROR_REPORTING | 1 | No error reporting |
| DISABLE_NON_ESSENTIAL_MODEL_CALLS | 1 | Reduce background model calls |

vllm-mlx server flags (set by run.sh)

| Flag | Purpose |
|------|---------|
| VLLM_MLX_ENABLE_THINKING=false | Disable thinking/reasoning tokens |
| --kv-cache-quantization | 8-bit KV cache — halves cache memory usage |
| --cache-memory-percent 0.35 | 35% of RAM for cache (~8.4GB on 24GB) |
| --prefill-step-size 4096 | Faster time-to-first-token on large prompts |
| --stream-interval 4 | Batch 4 tokens before streaming for throughput |
| --timeout 600 | 10 min timeout (default 300s caused disconnects) |

Claude Code flags (set by run.sh)

| Flag | Purpose |
|------|---------|
| --strict-mcp-config | Ignore global plugins |
| --mcp-config mcp-local.json | Empty config — no plugin tools |
| --tools "Bash,Read,..." | 8 essential built-in tools only |

File structure

claude-code-local/
  run.sh                    # Launcher — starts vllm-mlx + Claude Code
  install.sh                # Setup — creates .venv, installs vllm-mlx, patches bugs, creates cclocal
  mcp-local.json            # Empty MCP config (strips plugins for local sessions)
  .venv/                    # Local Python venv with vllm-mlx (created by install.sh)
  .gitignore
  README.md

Citation

This project would not exist without vllm-mlx by Wayner Barrios — the native Apple Silicon MLX backend that makes real Anthropic tool-use blocks possible on local hardware. If you use vLLM-MLX in your research or project, please cite:

@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}
