ARCANADA
Blog April 7, 2026

6 Local LLMs as Claude Code Backend: Why 4 Out of 6 Fail

Why this matters now. Reliance on large cloud-hosted models is a single point of failure: API quotas change, prices rise, services go down, regions get restricted. If your agent workflow depends entirely on a remote API, you are one step away from a full stop. Testing local models is not a hobby — it is operational resilience.

On top of that, Ollama has just shipped a release specifically optimized for the Apple M4/M5 Neural Engine and unified memory architecture. The performance gap between local and cloud inference is shrinking fast — and we need hard numbers to know exactly where it stands today.

Every model writes working code. Every model passes tests. But when you plug them into Claude Code as an agent backend — 4 out of 6 break. One model diagnoses its own bug but can’t fix it. Another ignores the task entirely and starts configuring your memory system. A third invents tools that don’t exist.

Only 1 out of 6 is actually usable for daily work.

I tested 6 local LLMs via Ollama 0.20.2 on a MacBook Pro M4 Max with 48GB. One task, identical prompt, two modes: direct generation via API and Claude Code CLI agent backend. The results demolish the “bigger model = better agent” myth.

Setup: MacBook Pro M4 Max, 48GB Unified Memory. Ollama v0.20.2, Claude Code CLI v2.1.85. All models Q4_K_M quantization. Settings: num_ctx=32768, temperature=0, max_tokens=8192.

The task: Write a thread-safe LRU cache in Python using OrderedDict and threading.Lock, with type hints, docstrings, and 5 pytest tests. Same prompt for all models and both modes.
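For orientation, here is a minimal sketch of the kind of solution the prompt asks for. It is not the output of any model in this test; the class name LRUCache matches the one the models used, while the get/put method names and the capacity check are my assumptions about a reasonable answer.

```python
import threading
from collections import OrderedDict
from typing import Generic, Hashable, Optional, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class LRUCache(Generic[K, V]):
    """Thread-safe least-recently-used cache with a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._data: "OrderedDict[K, V]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: K) -> Optional[V]:
        """Return the cached value and mark it most recently used, or None."""
        with self._lock:
            if key not in self._data:
                return None
            self._data.move_to_end(key)
            return self._data[key]

    def put(self, key: K, value: V) -> None:
        """Insert or update a value, evicting the oldest entry when full."""
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)
            self._data[key] = value
            if len(self._data) > self._capacity:
                # last=False pops from the least recently used end.
                self._data.popitem(last=False)

    def __len__(self) -> int:
        with self._lock:
            return len(self._data)
```

A single lock around every operation is the simplest way to keep the OrderedDict consistent under concurrent access; the pytest tests the prompt asks for would exercise eviction order, updates, and concurrent writers.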

The 6 models:

| Model | Architecture | Total / Active Params | VRAM |
|---|---|---|---|
| gemma4:latest | Dense | 8B | 11.6 GB |
| gemma4:26b | MoE | 25.2B / 3.8B active | 21.4 GB |
| gemma4:31b | Dense | 30.7B | 29.8 GB |
| qwen3-coder:30b | Dense | 30.5B | 21.9 GB |
| qwen3.5:35b-a3b | MoE | 35B / 3B active | 26.9 GB |
| glm-4.7-flash | Dense | ~30B | 22.5 GB |

Raw Ollama: Everyone Wins

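| Model | Decode tok/s | TTFT | Total | Tests | VRAM |
|---|---|---|---|---|---|
| qwen3-coder:30b | 96.9 | 0.27s | 14s | 5/5 | 21.9 GB |
| gemma4:latest 8B | 84.1 | 4.5s | 27s | 6/6 | 11.6 GB |
| gemma4:26b MoE | 77.0 | 14.4s | 33s | 5/5 | 21.4 GB |
| glm-4.7-flash | 65.0 | 0.34s | 21s | 5/5 | 22.5 GB |
| qwen3.5:35b-a3b | 45.0 | 0.39s | 44s | 10/10 | 26.9 GB |
| gemma4:31b | 19.7 | 42.6s | 113s | 6/6 | 29.8 GB |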

All 6 models produce working code with passing tests in raw mode. The coding ability gap between 8B and 30B is negligible for this task.

Gotcha: qwen3.5 has thinking mode ON by default. Without "think": false in the API call, all tokens go to chain-of-thought reasoning and the response field comes back empty. Two wasted runs before I figured this out.

Claude Code Agent: The Massacre

| Model | Time | Code | Tests | Failure Mode |
|---|---|---|---|---|
| gemma4:26b MoE | 67s | OK | 4/5 | Minor typos (regex-fixable) |
| gemma4:31b Dense | 380s | OK | 7/7 | None (perfect but slow) |
| gemma4:latest 8B | 99s | FAIL | 0/0 | “I cannot perform web searches” |
| qwen3-coder:30b | 193s | FAIL | 0/0 | Calls write_file instead of Write |
| qwen3.5:35b-a3b | 225s | FAIL | 0/0 | Uses path instead of file_path |
| glm-4.7-flash | 669s | FAIL | 0/0 | Ignores task, configures memory system |

The bottleneck is NOT coding ability — it’s protocol compliance. Claude Code expects models to respond with structured tool_use blocks using exact tool names (Write, not write_file) and exact parameter names (file_path, not path).

What Claude Code expects from a backend model:

Claude Code CLI is an orchestrator. It sends the model a massive system prompt (thousands of tokens) describing available tools: Read, Write, Edit, Bash, Grep, Glob, and others. The model must respond not with text, but with a structured JSON tool call:

{
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "name": "Write",
      "input": {
        "file_path": "/tmp/lru_cache.py",
        "content": "import threading\nfrom collections import OrderedDict..."
      }
    }
  ]
}

Key requirements:

  • Tool name must be exact: Write, not write_file, not create_file
  • Parameter names must match: file_path, not path, not filename
  • The model must not try to execute actions itself — only request execution via tool_use
  • After receiving a tool_use result, the model must continue the conversation (multi-turn)
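The first two requirements can be checked mechanically. The sketch below is a hypothetical validator, not part of Claude Code: the EXPECTED_PARAMS table and the check_tool_use function are names I made up, and the table lists only the Write tool because that is the only schema shown above.

```python
from typing import Any

# Assumed subset of the tool schema. Only Write with file_path/content is
# confirmed by the example above; extend this from the real system prompt.
EXPECTED_PARAMS: dict[str, set[str]] = {
    "Write": {"file_path", "content"},
}


def check_tool_use(block: dict[str, Any]) -> list[str]:
    """Return protocol violations for one content block (empty list if none)."""
    if block.get("type") != "tool_use":
        return ["not a tool_use block"]
    name = block.get("name")
    if name not in EXPECTED_PARAMS:
        return [f"unknown tool name: {name!r}"]

    problems: list[str] = []
    supplied = set(block.get("input", {}))
    missing = EXPECTED_PARAMS[name] - supplied
    unexpected = supplied - EXPECTED_PARAMS[name]
    if missing:
        problems.append(f"missing parameters: {sorted(missing)}")
    if unexpected:
        problems.append(f"unexpected parameters: {sorted(unexpected)}")
    return problems
```

Run against the failures described below, a write_file call is reported as an unknown tool name, and a Write call that passes path is reported as missing file_path and carrying an unexpected path.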

Now — the autopsy of each failure. With actual quotes from the models.


gemma4:latest (8B) — “Too small to understand the room”

“I see that you are attempting to use a tool... I cannot perform web searches as I do not have access to that functionality.”

The 8B model couldn’t hold the massive Claude Code system prompt in context and started hallucinating. It decided it was being asked to do a web search. 82 tokens across 2 turns — and gave up.

Verdict: capacity failure. 8B is not enough to understand the agent protocol.


qwen3-coder:30b — “Right idea, wrong name”

“I apologize for the confusion. It seems there was an error with the tool I tried to use. The write_file tool is not available in the current tool set.”

The model understood it needed to use tools. But it called write_file instead of Write. Most likely qwen3-coder was trained on a different tool use schema (possibly OpenAI function calling) where tools have different names. Claude Code rejected the call, the model tried twice more — and gave up.

Verdict: tool name mismatch. Understands the concept, knows the wrong names.


qwen3.5:35b-a3b — “Self-diagnosis of its own bug”

This is the most remarkable case. The model responded in Russian (translated here):

“I can see an error from the previous tool call. It appears that an incorrect parameter was passed (path instead of file_path).”

The model understood what went wrong. It called Write (correct name!) with parameter path (wrong name — should be file_path). Claude Code rejected the call, and the model accurately diagnosed the cause. But it couldn’t fix itself — because the error is in its own generation, not in fixable code.

Verdict: parameter name mismatch. Closest to success among all failures. One mapping path → file_path in a proxy server — and it would work.


glm-4.7-flash — “Let me configure your memory instead”

“The memory file was successfully created. The system is now ready to store important information about you, your preferences, and your work context. You can save the following memory types: user, feedback, project, reference...”

The model read Claude Code’s system prompt (which includes a section about auto memory) and decided it was being asked to set up a memory system. Completely ignored the LRU cache task. 4 turns, 4089 tokens, 669 seconds — the most expensive and most useless run.

Verdict: task comprehension failure. Did not understand what was being asked.

The Two Survivors: gemma4:26b and gemma4:31b

Out of six models, only two passed through the Claude Code agent protocol. Both are Gemma 4. Not a coincidence: Google specifically trained Gemma 4 for agentic tasks.

gemma4:26b MoE — the practical choice

  • 67 seconds, 1 turn, 4 out of 5 tests
  • lru_cache.py code — correct, clean, with docstrings and type hints
  • Problem: the test file has LRUCCache instead of LRUCache on line 44, and LRLRCache on line 17. Both are typos in repeated letter sequences. I suspect this is an artifact of the MoE architecture, where routing between experts sometimes “stutters” on similar tokens, but a single run cannot confirm the cause
  • Main code (lru_cache.py) is flawless — errors only in tests
  • 21.4 GB VRAM — comfortable in 48 GB, room left for IDE and browser

gemma4:31b Dense — perfect but slow

  • 380 seconds (6.3 minutes!), 1 turn, 7 out of 7 tests
  • The only model that wrote a thread-safety test without being explicitly asked
  • Perfectly clean code with no artifacts
  • But 380 seconds is not agent work. It’s waiting. In a real scenario where you iterate through 20-30 requests per session, this turns into hours of downtime
  • TTFT of 42.6 seconds means you wait more than 40 seconds before seeing the first character

Key insight: 31b is the better coder. 26b is the better agent. Agent work is latency-sensitive. It doesn’t matter how perfect the code is if you’re waiting 6 minutes per request.

Post-Processing: Regex Instead of Retry

gemma4:26b’s typos are not a death sentence. One regex turns 4/5 into 5/5:

import re

def fix_moe_typos(code: str) -> str:
    """Fix common MoE routing artifacts in class names."""
    # LRUCCache, LRLRCache, LRUUCache → LRUCache
    return re.sub(r'LR[A-Z]*Cache', 'LRUCache', code)

Verified: after applying this regex, all 5 tests pass.
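A quick way to sanity-check the pattern is to run it over the typos named in the comment. The snippet below repeats the function so that it runs on its own; the sample strings are the ones quoted in this article plus one I added (LRCache) to show how broad the pattern is.

```python
import re


def fix_moe_typos(code: str) -> str:
    """Same regex as above: collapse LR<uppercase letters>Cache to LRUCache."""
    return re.sub(r'LR[A-Z]*Cache', 'LRUCache', code)


samples = ["LRUCCache", "LRLRCache", "LRUUCache", "LRUCache"]
fixed = [fix_moe_typos(s) for s in samples]
print(fixed)

# Caveat: the pattern rewrites ANY identifier of the form LR...Cache,
# including ones that were never typos.
print(fix_moe_typos("class LRCache:"))
```

All four samples come out as LRUCache, and the correct spelling is left unchanged. The last line shows the trade-off: a hypothetical class called LRCache would also be renamed, so this fix is only safe when the expected class name is known in advance, as it is in this benchmark.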

This leads to a broader philosophy: if the model gets 95% right, a deterministic post-processor is cheaper than re-inference. One run of gemma4:26b = 67 seconds and ~1160 output tokens. Regex = 0 seconds and 0 tokens. The choice is obvious.

The benchmark script (quick_bench_v2.py) already has a built-in post-processing pipeline:

  1. Get JSON response from Claude Code CLI
  2. Extract the result field (model’s text response)
  3. Regex to find all ```python``` blocks
  4. Classify: block with class LRU* → lru_cache.py, block with def test_ → test_lru_cache.py
  5. Write files
  6. Run pytest

import re

def extract_code_blocks(text: str) -> tuple[str, str]:
    """Extract lru_cache.py and test_lru_cache.py from markdown."""
    # Handle unclosed code blocks (model hit token limit)
    blocks = re.findall(r"""```(?:python|py)?\s*\n(.*?)(?:```|$)""", text, re.DOTALL)

    lru_code, test_code = "", ""
    for block in blocks:
        block = block.strip()
        # Implementation blocks are checked first; if several blocks match
        # the same category, the last one wins.
        if "class LRU" in block or "OrderedDict" in block:
            lru_code = block
        elif "def test_" in block or "import pytest" in block:
            test_code = block
    return lru_code, test_code

Important detail: (?:```|$) in the regex handles unclosed code blocks. If the model hits the token limit and doesn’t close the ```, the regex still extracts the code. Without this, qwen3.5 and gemma4:31b would lose all output on long generations.
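To see what the end-of-string alternative buys, compare it with a stricter pattern that requires a closing fence. This is a small illustration using made-up response strings, not data from the benchmark; FENCE, PATTERN and STRICT are names introduced here.

```python
import re

FENCE = "`" * 3  # built at runtime so this example stays easy to embed

# Same pattern as in extract_code_blocks: closing fence OR end of string.
PATTERN = re.compile(FENCE + r"(?:python|py)?\s*\n(.*?)(?:" + FENCE + r"|$)", re.DOTALL)
# Stricter variant for comparison: the closing fence is mandatory.
STRICT = re.compile(FENCE + r"(?:python|py)?\s*\n(.*?)" + FENCE, re.DOTALL)

closed = f"Here is the code:\n{FENCE}python\nx = 1\n{FENCE}\nDone."
truncated = f"Here is the code:\n{FENCE}python\nx = 1\ny = 2"  # cut off mid-block

closed_blocks = [b.strip() for b in PATTERN.findall(closed)]
truncated_blocks = [b.strip() for b in PATTERN.findall(truncated)]
strict_blocks = STRICT.findall(truncated)

print(closed_blocks)
print(truncated_blocks)
print(strict_blocks)
```

The tolerant pattern recovers the code from both strings, while the strict one returns an empty list for the truncated response. Because re.MULTILINE is not set, $ only matches at the end of the whole string, so a properly closed block is never cut short.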

Taxonomy of Failures: 6 Levels of Agent Compatibility

Analyzing all 6 runs, I see a clear hierarchy of failures. Each model “falls off” at a specific level — and it’s not a binary “works/doesn’t work” situation:

| Level | Model | What Happens | Fixable? |
|---|---|---|---|
| 1. Task comprehension failure | glm-4.7-flash | Ignores the task, does its own thing | No |
| 2. Tool hallucination | gemma4:latest (8B) | Invents non-existent tools | No |
| 3. Tool name mismatch | qwen3-coder:30b | Calls write_file instead of Write | Yes, proxy |
| 4. Parameter name mismatch | qwen3.5:35b-a3b | path instead of file_path | Yes, proxy |
| 5. Output corruption | gemma4:26b | Typos LRUCCache | Yes, regex |
| 6. Success | gemma4:31b | Everything correct | n/a |

Levels 1-2 are fatal. The model doesn’t understand what’s being asked, and no proxy can fix that.

Levels 3-4 are the most interesting. The models understand the concept of tool use but were trained on a different schema. A proxy server between Claude Code and Ollama that remaps write_file → Write and path → file_path could save both qwen3-coder and qwen3.5. That’s a feasible engineering task.

Level 5 is trivially fixed by a post-processor.

Level 6 — everything works, at the cost of speed.

Why Gemma 4 Specifically

Out of 6 models, only two handled the agent protocol — and both are Gemma 4. Not a coincidence.

Google specifically highlighted agentic capabilities when announcing Gemma 4: native tool calling support, training on agentic workflows, optimization for multi-turn interactions. Unlike qwen3-coder (optimized for code completion) or glm-4.7-flash (optimized for fast inference), Gemma 4 was trained on exactly the scenario that Claude Code tests.

Why MoE beats Dense:

gemma4:26b MoE activates only 3.8B parameters out of 25.2B on each forward pass. This gives:

  • Generation speed of 77 tok/s (vs 20 tok/s for 31b Dense with 30.7B active parameters)
  • Enough intelligence to understand the tool use schema
  • 21.4 GB VRAM — fits in 48 GB with room to spare

The paradox: a model with 3.8B active parameters works as an agent, while a model with 30.5B active parameters (qwen3-coder) does not. Because it’s not about the volume of parameters — it’s about what data they were trained on.

gemma4:26b is the Goldilocks model:

  • Fast enough (67 seconds per task) for iterative work
  • Smart enough (understands Claude Code protocol) for agent tasks
  • Compact enough (21.4 GB) for a daily-driver laptop

Practical Recommendations

Profile A: “I want a local Claude Code agent”

Model: gemma4:26b
Launch: ollama launch claude --model gemma4:26b
Post-processing: regex for MoE typos

What to expect: ~67 seconds per task (vs ~5 seconds with real Claude via API). Code works but sometimes needs manual test cleanup. Good for: private development, offline, learning, experiments. Not good for: production speed, complex multi-file refactoring.

Ollama configuration for benchmarks:

export OLLAMA_NUM_GPU=99           # All layers on GPU
export OLLAMA_KEEP_ALIVE=-1        # Don't unload model
export OLLAMA_NUM_PARALLEL=1       # Single request for stability
export OLLAMA_FLASH_ATTENTION=1    # Flash attention if supported

Profile B: “I just want fast code generation”

Model: qwen3-coder:30b (speed) or gemma4:latest 8B (compactness)
Mode: direct Ollama API, no Claude Code
Remember: "think": false for qwen3.5

For automation — Python + requests:

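import requests

# Non-streaming call to the local Ollama server
r = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3-coder:30b",
    "prompt": "Write a Python function...",
    "think": False,   # required for qwen3.5, where thinking is on by default
    "stream": False,  # return one JSON object instead of a stream of chunks
    "options": {"temperature": 0, "num_predict": 4096}
}, timeout=600)
r.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError below
print(r.json()["response"])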

Profile C: “I want to build middleware for other models”

The most interesting option. qwen3.5 and qwen3-coder almost work. A proxy between Claude Code and Ollama that:

  1. Intercepts tool_use from the model
  2. Remaps write_file → Write, read_file → Read
  3. Remaps path → file_path, filename → file_path
  4. Forwards the corrected request to Claude Code

This unlocks at least 2 additional models. qwen3.5 with its 10/10 tests in raw mode and 256K context is an especially tempting candidate.
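The remapping step itself is small. The sketch below covers only the rewrite of a single tool_use block, not the HTTP proxy around it. The alias tables and the remap_tool_use name are hypothetical: write_file → Write and path → file_path were observed in the runs above, while read_file and filename are untested guesses from the list. Parameter aliases are scoped per tool on the assumption that other tools may legitimately take a parameter called path.

```python
from typing import Any

# Assumed alias tables; only write_file and path were actually observed.
TOOL_ALIASES: dict[str, str] = {"write_file": "Write", "read_file": "Read"}
PARAM_ALIASES: dict[str, dict[str, str]] = {
    "Write": {"path": "file_path", "filename": "file_path"},
    "Read": {"path": "file_path", "filename": "file_path"},
}


def remap_tool_use(block: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a tool_use block with known aliases rewritten."""
    if block.get("type") != "tool_use":
        return block  # text and other block types pass through untouched

    name = TOOL_ALIASES.get(block.get("name"), block.get("name"))
    aliases = PARAM_ALIASES.get(name, {})
    raw = block.get("input", {})

    # Aliased keys first, then everything else, so a parameter that is
    # already spelled correctly wins over its alias.
    fixed = {aliases[k]: v for k, v in raw.items() if k in aliases}
    fixed.update({k: v for k, v in raw.items() if k not in aliases})
    return {**block, "name": name, "input": fixed}
```

With these tables, a write_file call carrying path and content comes out as a Write call with file_path and content, while a call to any tool without an entry in PARAM_ALIASES keeps its parameters as they were.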

What’s Next

The gap between “can write code” and “can be an agent” will narrow. Next-generation models (Gemma 5, Qwen 4) will almost certainly train on Anthropic’s tool use schema alongside OpenAI’s — the market is too big to ignore.

The ollama launch claude command makes the connection trivial; the bottleneck is now model quality, not infrastructure.

The middleware approach (remapping tool names and parameter names) is low-hanging fruit that could rescue 2-3 models right now. Worth building.

But the main question is speed. Even the “working” gemma4:26b returns results in 67 seconds. Real Claude via API — in 5. A 13x difference. For a single generation, that’s tolerable. For iterative agent work with 20-30 requests per session, it’s the difference between “productive day” and “day of waiting.”
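The session-level cost follows directly from the per-task timings. The calculation below is back-of-the-envelope only: it assumes every request in a session takes as long as the single benchmark task, which real sessions will not match exactly.

```python
# Per-request wall-clock times in seconds, taken from the results above.
SECONDS_PER_REQUEST = {
    "gemma4:26b": 67,
    "gemma4:31b": 380,
    "Claude via API": 5,
}


def session_minutes(seconds_per_request: float, requests: int) -> float:
    """Total waiting time in minutes for a session of equal-cost requests."""
    return seconds_per_request * requests / 60


for name, secs in SECONDS_PER_REQUEST.items():
    low = session_minutes(secs, 20)
    high = session_minutes(secs, 30)
    print(f"{name}: {low:.1f}-{high:.1f} min of waiting per session")

print(f"slowdown vs API: {67 / 5:.1f}x")
```

Under that assumption, a 20-30 request session costs roughly 22 to 34 minutes of waiting with gemma4:26b, about 2 to 3 hours with gemma4:31b, and under 3 minutes with the API.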

Local models as an agent backend are a working but niche story. For privacy, offline use, and experiments they are great. For production productivity, not yet.


Methodology: all data obtained on a real test bench. The benchmark script (quick_bench_v2.py) is published and reproducible. Every claim is verified by run results saved in JSON. Model response quotes are verbatim, from raw_response.txt files.