One morning, the Mac stopped responding entirely. Not slow — frozen: cursor immobile, SSH unreachable, hard reset unavoidable. The cause wasn't a memory leak or thermal throttling. An early RTK plugin version activated command substitution unconditionally across all environments, and on macOS under certain launch conditions the wrappers launched themselves in every new terminal window, creating an infinite cascade. Fifteen minutes lost, command history gone.
This is the chronicle of how Datarim and Coworker reached their current state: 883,563 processed commands, 68.1% adoption-weighted token savings, three runtimes at full parity. Achieved through real failures, three rewrites of the interception mechanism, and one bug that nearly turned git push into an infinite retry loop.
Where tokens go in agentic sessions
A token is roughly three to four characters of text. In agentic sessions, those tokens don't burn on answers — they burn on reading. The input-to-output ratio runs around 25:1: for every token the model spends on a response, it reads 25. Before editing a file, the agent reads it in full. Before committing, it reads git diff. During debugging, it reads logs. Before running a test, it reads the previous output.
Three cost sources are independently addressable. First: file reading — the agent reads the full content of every relevant file even when it needs one section. Second: boilerplate generation — specs, tests, documentation follow predictable patterns yet get generated by an expensive model. Third: shell command output — git log, find, pytest, docker logs return large volumes where 80% is irrelevant to the model's actual task.
Datarim: three runtimes, no hierarchy
Datarim is an iterative workflow framework for AI-assisted development. 18 agents, 55 skills, 24 commands. Full pipeline: init → prd → plan → design → do → qa → compliance → archive. Complexity-aware: lightweight tasks take the short L1 path (three steps), architectural features go through full L4. Reflection is built into the archive command — each cycle generates improvement proposals for the framework itself.
The architectural principle that shaped this solution: Datarim runs on three peer runtimes — Claude Code, Codex CLI, and Cursor. Not "primary plus two secondaries." Three equal execution environments, the same commands, the same outcomes.
Each runtime has genuine strengths. Claude Code handles interactive sessions and fine-grained context management. Codex CLI fits automation and scripting. Cursor integrates into the IDE and accelerates the work-in-editor cycle. Delivering consistent behavior across all three isn't a convenience feature — it's an architectural requirement for predictability.
Coworker: three independent tools for three cost sources
The principle: the expensive model reasons, the cheap one handles bulk reading. Coworker is a vendor-neutral CLI with five providers (Moonshot, DeepSeek, Groq, OpenRouter, OpenAI — all via OpenAI-compatible API), switchable with --provider.
Three independent tools, each targeting a different cost source:
coworker ask delegates file reading to a cheap model. The delegation rule fires above 15,000 est-tokens or three or more files per question. The agent doesn't read the file directly — it sends a specific question to the cheap model. The expensive model receives the answer, not the raw file.
coworker write generates a structural draft from a spec. The cheap model builds the skeleton; the expensive model does only the final judgment pass. The templated parts — PRD structure, plan skeleton, standard documentation sections — don't require Sonnet 4.6 to produce.
The RTK plugin (Rust Token Killer, opt-in) operates at a fundamentally different level: it compresses command output locally on the machine, before that output ever enters the model's context. No API call occurs — a Rust binary filters the output without touching any external service. The three tools are independent. Enable one, or all three.
Current pricing (verified May 30, 2026): Sonnet 4.6 output — $15/M tokens. DeepSeek V4 Flash output — $0.28/M. That's 53.6× cheaper. Flash is the savings anchor for coworker ask and coworker write. DeepSeek V4 Pro is an option for higher-quality generation needs.
Why "compress everything" is a trap
Seeing 90% savings on pytest and git log --stat, the natural instinct is to enable compression for all commands. That logic produced the git push bug.
On short commands, RTK works in reverse. git status on a clean repo: raw output — 59 bytes. After RTK processing: 123 bytes, which is +108%. The compression added metadata and structure that outweigh the original message. git log --oneline -50: 59 bytes → 4,144 bytes, +6,924%. RTK attempts to "nicely structure" the already-compact oneline format and expands it into a multi-line report.
But the problem isn't just size. It's reliability.
When RTK compressed git push output, it removed the confirmation line:
To github.com:Arcanada-one/coworker.git
bd39466..3314623 main -> main
For a human, that's information. For an agent, it's a critical signal: the push succeeded. When RTK classified that line as "noise," the agent received no confirmation. The agent concluded the push failed. The agent retried. The cycle ran approximately 15 minutes before manual intervention.
Second case: git status returning "nothing to commit, working tree clean." That's a signal — the working tree is clean, forward progress is safe. If compression removed or reformatted this message, the agent couldn't confirm repository state — every repo appeared clean, and the agent stalled.
The core distinction: bulk noise is high-volume listings, repeated lines, stack traces beyond three frames, progress bars, ANSI codes. Noise can be compressed. Signal is short but meaning-critical: push confirmed, commit created, test failed at this specific line. Signal must not be touched.
The passthrough list: 13 commands that bypass compression
The solution is a passthrough list (allowlist): commands whose output reaches the model raw and unmodified. Think of it as a dedicated lane in an airport — separate from the standard screening, for passengers whose profile is already known.
Thirteen git/gh commands are on passthrough by default: git push, git commit, git status, git checkout, git merge, git rebase, git pull, git fetch, git branch, git stash, gh pr create, gh pr merge, gh release create. These are precisely the commands whose output carries operation confirmations. RTK compresses only the high-volume noise: pytest, find, docker logs, ruff check.
After adding the passthrough list, the agent received correct signals from git operations. The 15-minute retry loop disappeared.
Three runtimes — three interception points
Same goal, different entry points per environment. From the outside this is invisible — savings activate with one command and behave identically everywhere. Under the hood, each runtime has its own implementation.
To picture it: one building has a security post right at the entrance — any package can be inspected the moment it arrives. Another building has no such post, so the only way to check is to intercept the package quietly at the door. Visitors don't notice a difference. Security works the same way in both buildings; it's just set up differently inside.
Claude Code is the building with the entrance post. It has a built-in mechanism for subscribing to any action before it executes. Coworker registers once in ~/.claude/settings.json and from that point is automatically active in every session, no further configuration needed.
Codex CLI has no such post. Instead, a folder of wrapper binaries is placed at the front of the PATH environment variable, ahead of the real commands. When the agent invokes a command, the system finds the wrapper first. The wrapper makes the compression decision and then passes control to the original binary. The wrappers are active only during an active Codex session; when the session ends, PATH returns to its original state.
Cursor was initially listed as "not supported" — no obvious interception point seemed to exist in the IDE. Deeper examination of the documentation surfaced a built-in shell interceptor. After implementing it, parity became complete: three runtimes, three mechanisms, one result.
Technical mechanism names for reference:
| Runtime | Interception mechanism | Registration point |
|---|---|---|
| Claude Code | PreToolUse hook | ~/.claude/settings.json |
| Cursor | beforeShellExecution | IDE native hook |
| Codex CLI | PATH-shim | Active only within session |
Development chronicle: from macOS lockup to 3/3 parity
Bug 1: macOS freeze. The early version activated command substitution unconditionally across all environments. On Linux, this worked. On macOS under certain launch conditions, the wrappers called themselves in every interactive shell and entered an infinite loop. Fix: the wrappers activate only within an active session of the target tool. The freeze stopped.
Bug 2: "lying status" from git. After RTK was connected, the agent stopped receiving push and commit confirmations — RTK had classified those lines as noise. The passthrough list of 13 commands was added. Git operation signals returned to correct handling.
Stage 3: Cursor. The first implementation for Cursor passed unit tests but not live sessions — Cursor rebuilt PATH after start, leaving the wrapper outside the active path. Cursor was written off as "partial support." Then the built-in shell interceptor was found in the documentation, implemented, and parity became complete.
Stage 4: diagnostics. The status command falsely reported Claude Code as "disconnected" on legacy RTK installations. The probe request logic was corrected — false negatives stopped occurring.
This path ran not on a test bench but on a live machine with six parallel agentic sessions per day. Dogfooding at production load.
Measured results as of May 30, 2026
| Metric | Value |
|---|---|
| Commands processed | 883,563 |
| Input tokens | 9.3M |
| Output tokens | 3.0M |
| Saved | 6.3M tokens — 68.1% (adoption-weighted) |
| Average processing time | 35 ms/command |
Individual commands produce extreme numbers: ps aux — 99.0%, Playwright test — 97.2%, rtk grep — 27.9%. The 99% figure on ps aux is a measurement of one high-volume command — not representative of the full stack. The honest number is the adoption-weighted 68.1%, which accounts for actual command frequency including zero savings on short signal commands.
Delegation cost via coworker ask and coworker write over the same period:
| Provider | Calls | Input tokens | Output tokens | Cost |
|---|---|---|---|---|
| DeepSeek | 1,701 | 21.7M | 3.9M | $4.16 |
| Moonshot | 426 | 4.7M | 2.9M | $15.99 |
DeepSeek V4 Flash output at $0.28/M is 53.6× cheaper than Sonnet 4.6 at $15/M. That gap is what makes coworker ask's economics work: tens of millions of tokens delegated through the cheap provider for single-digit dollar totals, while Sonnet receives only the structured answer.
Current state
Three independent tools, one install, three runtimes. RTK runs locally without API calls — 35 ms/command at 883K+ processed. coworker ask and write delegate bulk through cheap providers. The passthrough list protects signal commands from compression. All three runtimes get consistent behavior through mechanisms matched to their architecture.
One human life matters.