ARCANADA
Blog May 8, 2026

7.7× Cheaper and 4× Faster: Splitting Inference Across Claude, Kimi, and DeepSeek

I build Datarim, a framework for AI agent pipelines, and two days ago I added DeepSeek to the stack as the third model alongside Claude and Kimi K2.6. One thinks, one writes, one reads and summarizes.

Yes, I know: after the Anthropic-SpaceX merger raised Claude limits, hitting the weekly cap got harder. But token economics didn't go anywhere. The budget for reading long logs or generating boilerplate shouldn't burn at reasoning-model rates, even with generous limits.

In two days of live use through the Datarim pipeline, DeepSeek has taken over one-off reads and summarizations; Kimi K2.6 stays on long generation tasks. The first measurement on an identical read prompt gave DeepSeek $0.00083 per call and 1549 ms versus $0.00636 and 6217 ms for Kimi K2.6. That's 7.7× cheaper and 4× faster at comparable output quality. Mixing task classes in a single model isn't a matter of taste; it's an arithmetic error.

Task class determines model class

Three task classes differ fundamentally in token economics.

First class: reasoning and orchestration. Tradeoffs between architectural decisions, debugging race conditions, planning with dependencies. This needs a model that holds the decision-making context and produces a compact but precise result. Input token cost is secondary—there aren't many of them—but output quality is critical.

Second class: structured generation. Long artifacts with predictable form: specifications, code, documentation. Input tokens are few; output tokens reach several thousand with high consistency requirements. Output token price dominates the bill.

Third class: bulk I/O. Reading large corpora, summarization, data ingestion. Input is huge, output is compact. Input token cost is the main factor.

The stack in Arcanada agents follows this logic. Claude handles reasoning and orchestration. Kimi K2.6 takes structured generation at a premium output token rate. DeepSeek (deepseek-chat and v4-flash) handles bulk I/O and cost arbitrage with a 1M context window versus ~262K for Kimi K2.6.

FrugalGPT (Stanford, 2023) supports this logic: an LLM cascade that sends each query to the cheapest model able to handle it can cut cost substantially while matching the quality of the best single model. Rates are on the DeepSeek API Pricing page.

Key consequence: routing decisions are made by task class before the model call, not after an expensive bill.

In Datarim this is implemented through profiles. The code profile (one-off requests, reading, bulk summarization) and social profile (short texts) go to DeepSeek. The write profile (long structured artifacts) stays on Kimi K2.6, where premium output quality justifies the rate. Explicit provider selection always overrides the profile default.
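The profile rule above can be sketched in a few lines. This is a minimal illustration, not Datarim's actual code: the function name pick_provider, the PROFILE_DEFAULTS table, and the provider labels "deepseek" and "kimi" are all assumptions made for the example.

```python
from typing import Optional

# Profile -> default provider, following the mapping described in the text.
# The provider labels are illustrative strings, not real API identifiers.
PROFILE_DEFAULTS = {
    "code": "deepseek",    # one-off requests, reading, bulk summarization
    "social": "deepseek",  # short texts
    "write": "kimi",       # long structured artifacts
}


def pick_provider(profile: str, explicit_provider: Optional[str] = None) -> str:
    """Return the provider for a call; an explicit choice always wins."""
    if explicit_provider is not None:
        return explicit_provider
    try:
        return PROFILE_DEFAULTS[profile]
    except KeyError:
        # Refuse to guess a model for an unknown profile.
        raise ValueError(f"unknown profile: {profile!r}") from None
```

The routing decision happens before any model is called, so a wrong profile name surfaces as an error rather than as an unexpected bill.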

Token budgets with an explicit override hierarchy

A single global limit doesn't work when multiple models run in parallel: one ceiling on everything either throttles bulk reading or burns the generation budget where a short answer would suffice.

The working scheme: separate budgets with an explicit override hierarchy. A CLI flag overrides an environment variable, which overrides a sensible fallback. One flag lifts the limit for a specific experiment without changing default behavior.
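A sketch of that hierarchy, assuming one token budget per profile. The flag name --max-tokens, the PIPELINE_MAX_TOKENS_<PROFILE> environment variables, and the fallback numbers are hypothetical and chosen only for illustration.

```python
import argparse
import os
from typing import Mapping, Sequence

# Hypothetical per-profile fallbacks; the real values are not given in the text.
FALLBACK_BUDGETS = {"code": 1024, "social": 512, "write": 8192}


def resolve_budget(
    profile: str,
    argv: Sequence[str],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Resolve the output-token budget: CLI flag > env variable > fallback."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--max-tokens", type=int, default=None)
    args, _unknown = parser.parse_known_args(list(argv))

    if args.max_tokens is not None:          # 1. explicit flag for this run
        return args.max_tokens

    raw = env.get(f"PIPELINE_MAX_TOKENS_{profile.upper()}")
    if raw is not None:                      # 2. per-profile environment setting
        return int(raw)                      # a malformed value raises ValueError

    return FALLBACK_BUDGETS[profile]         # 3. built-in default
```

Passing the flag changes only the current invocation; the environment setting and the fallback stay as they were for every other run.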

Prompt caching, as Anthropic's documentation calls it, adds a cost-reduction tool for repeated reads: when a request repeats the same document or system prompt, the cached portion is billed at a lower rate. The rate difference on cached input is striking: DeepSeek charges $0.0028 per 1M tokens, Moonshot $0.16 per 1M, a gap of about 57×.

Default delegation: deny, not ask

Three modes exist for delegating bulk operations; two don't work in practice.

ask: the model requests permission before bulk reading. The flow breaks, and the operator becomes the bottleneck in nightly or autonomous scenarios.

allow: delegation is permitted but optional. In practice the model takes the path of least resistance and calls the reasoning model anyway.

deny with a reason: the only mode with deterministic behavior. The rule blocks the reasoning model call for bulk tasks and returns the refusal reason. The retry routes through the delegate, reinforcing the "this task class → this model" pattern in the working context.
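The deny-with-reason mode can be sketched as a rule check that runs before every call. Everything named here is hypothetical: the task-class labels, the "reasoning" and "delegate" targets, and the Decision type stand in for whatever the real rule engine uses.

```python
from dataclasses import dataclass
from typing import List, Tuple

BULK_TASK_CLASSES = {"read", "summarize", "ingest"}  # hypothetical labels


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str = ""
    route_to: str = ""


def check_call(task_class: str, target: str) -> Decision:
    """Deny bulk work aimed at the reasoning model and say where it belongs."""
    if task_class in BULK_TASK_CLASSES and target == "reasoning":
        return Decision(
            allowed=False,
            reason=f"{task_class} is bulk I/O; use the delegate model",
            route_to="delegate",
        )
    return Decision(allowed=True)


def dispatch(task_class: str, target: str) -> Tuple[str, List[str]]:
    """Return the model that finally runs the task plus any refusal reasons."""
    refusals: List[str] = []
    decision = check_call(task_class, target)
    while not decision.allowed:
        refusals.append(decision.reason)   # the reason is surfaced to the caller
        target = decision.route_to         # the retry goes through the delegate
        decision = check_call(task_class, target)
    return target, refusals
```

The rule never asks and never leaves the choice open: a denied call always comes back with a reason and a route, so the retry is deterministic.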

Rules must block, not ask. OpenRouter publishes rates across all models. Price differences between models with the same purpose reach an order of magnitude or more — reason enough to make routing mandatory.

Fail loud, not silent

Reasoning models with thinking mode (Anthropic extended thinking, DeepSeek R1) spend the response budget on the reasoning chain before producing visible output. With a low max_tokens, the model won't error; it returns an empty response. The budget went into the hidden chain; visible output: zero tokens.

Preliminary observation on DeepSeek: on generation tasks with a tight response budget, the model returned 2 output tokens — technically a successful call, practically empty. Fix: use flash variants without thinking mode, or budget 2-3× headroom over expected output.

Real latency on a two-day corpus of live calls differs from a one-off synthetic test. The test measurement gave 1.5 seconds, but the median on live calls is 49 seconds, and on large outputs (2000-5000 tokens) it reaches 78 seconds. That is still below Kimi K2.6, whose median on generation tasks is 160 seconds, although the Kimi figure reflects larger output volumes and is not a like-for-like comparison. Synthetic prompts give optimistic readings; live observation is what matters.

Any silent failure must fail loud: empty output, truncated response, input exceeding context, missing access key.
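A minimal sketch of such checks, written against a made-up response shape. ModelResponse, its fields, and the "length" finish reason are assumptions for illustration and do not describe any particular SDK; the headroom factor follows the 2-3× suggestion above.

```python
from dataclasses import dataclass
from typing import Optional


class SilentFailure(RuntimeError):
    """Raised when a call 'succeeds' but the result is unusable."""


@dataclass
class ModelResponse:          # hypothetical shape, not a real SDK type
    text: str
    output_tokens: int
    finish_reason: str        # assumed values: "stop" or "length"


def preflight(api_key: Optional[str], input_tokens: int, context_window: int) -> None:
    """Checks that can fail before any money is spent."""
    if not api_key:
        raise SilentFailure("missing access key")
    if input_tokens > context_window:
        raise SilentFailure(
            f"input of {input_tokens} tokens exceeds context window of {context_window}"
        )


def budget_with_headroom(expected_output_tokens: int, headroom: float = 3.0) -> int:
    """Leave room for hidden reasoning tokens on top of the visible output."""
    return int(expected_output_tokens * headroom)


def validate(response: ModelResponse) -> str:
    """Turn an empty or truncated response into an exception."""
    if not response.text.strip():
        raise SilentFailure(f"empty output ({response.output_tokens} output tokens)")
    if response.finish_reason == "length":
        raise SilentFailure("response truncated at the token limit")
    return response.text
```

Raising a dedicated exception keeps a technically successful but empty call from flowing further down the pipeline as if it were a result.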

Watch the output file, not the PID

Long-running generations (bulk summarization, large write tasks) execute in the background. Tracking by PID is a race: the parent crashes, the child lives on, the PID gets reused, and the liveness check returns a false answer.

Polling the output file is more reliable. A task is complete if and only if the file exists. The calling process can crash and restart without losing progress. Task lifecycle and process lifecycle are decoupled.
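A sketch of the file-based completion signal. It adds one assumption the text does not spell out: the writer publishes the result with an atomic rename, so a reader never sees a half-written file. The function names are hypothetical.

```python
import os
import time
from pathlib import Path


def write_result_atomically(path: Path, text: str) -> None:
    """Write to a temporary sibling, then rename it into place in one step."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def wait_for_result(path: Path, timeout_s: float, poll_s: float = 1.0) -> str:
    """Poll until the output file exists; raise instead of waiting forever."""
    deadline = time.monotonic() + timeout_s
    while True:
        if path.exists():
            return path.read_text(encoding="utf-8")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result at {path} after {timeout_s} s")
        time.sleep(poll_s)
```

A restarted caller can invoke wait_for_result on the same path and pick up where it left off, because nothing in the check depends on which process started the task.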

Claude Code demonstrates this in long-lived agent sessions: the completion signal via an artifact is more reliable than process state. The file either exists or it doesn't.

Token price as a stack parameter

Price per 1M tokens, context window size, cache rate: these are design parameters of the stack, just like RPS and p99 latency. Ignoring them isn't minimalism or a way to avoid "premature optimization"; it's engineering negligence. Matching each task class to its model isn't an optimization on top of a working system; it's an architectural requirement.

Bottom line

Measurement on an identical read prompt:

  • Call cost: 7.7× cheaper — $0.00083 on DeepSeek versus $0.00636 on Kimi K2.6.
  • Latency: 4× faster — 1549 ms versus 6217 ms.

Two days of live use through the Datarim pipeline (medians):

  • DeepSeek deepseek-chat — median latency 49 seconds on one-off requests with compact output; call cost is negligible.
  • Moonshot Kimi K2.6 — median latency 160 seconds on long generation tasks with large output.
  • DeepSeek is used when the code/social profile is explicitly selected; the pipeline's default write profile stays on Kimi K2.6.

Reference rates per 1M tokens:

  • DeepSeek v4-flash — $0.14 input / $0.28 output, cache $0.0028 per 1M, 1M context window.
  • Kimi K2.6 — $0.95 input / $4.00 output, cache $0.16 per 1M, ~262K context window.
  • Kimi K2.6 output tokens cost about 14× DeepSeek's output rate ($4.00 versus $0.28) and about 29× its input rate ($4.00 versus $0.14): sufficient reason not to send bulk reading there. DeepSeek cache is 57× cheaper, so the gap holds on any repeated reading scenario.
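The rate ratios can be reproduced from the reference rates listed above. The sketch below hard-codes those rates and assumes the cache rate applies to cached input tokens; the model keys and the token counts in the usage note are illustrative, not measurements.

```python
# USD per 1M tokens, copied from the reference rates in this post.
RATES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28, "cache": 0.0028},
    "kimi-k2.6": {"input": 0.95, "output": 4.00, "cache": 0.16},
}


def call_cost(model: str, input_tokens: int, output_tokens: int,
              cached_input_tokens: int = 0) -> float:
    """Estimate the cost of one call in USD.

    Assumes cached input tokens are billed at the cache rate and the
    remaining input tokens at the normal input rate.
    """
    rate = RATES[model]
    fresh_input = input_tokens - cached_input_tokens
    total = (fresh_input * rate["input"]
             + cached_input_tokens * rate["cache"]
             + output_tokens * rate["output"])
    return total / 1_000_000


def rate_ratio(kind: str) -> float:
    """How many times more Kimi K2.6 charges than DeepSeek for one rate kind."""
    return RATES["kimi-k2.6"][kind] / RATES["deepseek-v4-flash"][kind]


if __name__ == "__main__":
    for kind in ("input", "output", "cache"):
        print(f"{kind}: {rate_ratio(kind):.1f}x")
    # Kimi output rate against DeepSeek input rate, the cross-rate comparison.
    cross = RATES["kimi-k2.6"]["output"] / RATES["deepseek-v4-flash"]["input"]
    print(f"kimi output vs deepseek input: {cross:.1f}x")
```

For a hypothetical bulk read of 200,000 input tokens with a 500-token summary, call_cost gives roughly $0.028 on DeepSeek v4-flash and $0.192 on Kimi K2.6, a gap driven almost entirely by the input rate.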

Routing by task class turns these numbers into standard stack behavior. As profile-based routing shifts one-off reads to DeepSeek, its share by call count rises while its share by cost falls: one-off reads are cheap.