# Right Tool, Right Task: How We Cut Claude AI Costs by 70% Without Losing Quality

> Running Opus for everything is like hiring a senior partner to update a spreadsheet. Here's how intelligent model routing cut our AI costs by 70% in 3 months.

**Author:** david | **Date:** 2026-04-28 | **Category:** strategy

---

I've been running Claude Code as my chief of staff since January. Not just for work. Meeting follow-ups, CRM updates, blog posts, team coordination, code generation, research, daily planning, health tracking for my parents. It functions as a full-stack executive admin across everything I do.

By mid-March, I had a problem. I was burning through my Opus quota in 3 days and spending the rest of the week on a slower model. Opus is Anthropic's most capable (and most expensive) model, and I was using it for everything, including tasks that didn't need it.

## The Numbers Before

From January 6 to March 14, my setup was simple: Claude Code with Opus doing all the work. Every search, every draft, every file operation, every piece of thinking.

The daily average was 149,598 Opus output tokens. Opus accounted for 94.8% of all my AI output. Haiku (the cheaper model) was 5.2%, mostly from automatic background tasks I hadn't configured intentionally.

It worked. The quality was excellent. I was also paying for a Ferrari to drive to the corner store.

## What Changed

The fix came in two phases.

**Phase 1 (March 15): Haiku subagents.** I started routing exploration and search tasks to Haiku, Anthropic's smallest and cheapest model. When Claude Code needs to search my vault for meeting notes, scan a codebase, or look up a contact in my CRM, it doesn't need Opus-level reasoning for that. It needs to find information and bring it back.

I built a delegation table into my system configuration: research tasks, read-only queries, vault filing, bulk metadata updates all routed to Haiku. Within two weeks, Opus dropped to 24.1% of output.

**Phase 2 (March 31): Local LLM and Codex.** I added two more tiers. Qwen 3.5 (a 9-billion parameter model running locally on my MacBook via MLX) handles text generation: first drafts, structured output, frontmatter generation, template expansion. It costs nothing because it runs on my hardware. Codex (OpenAI's coding agent) handles mechanical coding tasks: fixing test failures, lint errors, bulk find-and-replace across files. It runs on an existing OpenAI subscription, no extra token cost per task.

Both phases together brought Opus down to 4.7% of output: 45,000 daily tokens, down from 149,598. A **70% reduction**, with no perceptible drop in quality.

## The Delegation Framework

Different tasks have different cognitive requirements, and you match the model to the requirement.

| Task Type | Route | Cost |
|-----------|-------|------|
| Research, vault search, codebase exploration | Haiku subagents | Low |
| Text generation, drafts, frontmatter, templates | Qwen 3.5-9B (local, runs on laptop) | Free |
| Mechanical coding: test fixes, lint, bulk edits | Codex (OpenAI) | Subscription |
| Read-only data queries (email, calendar, CRM) | Haiku subagents | Low |
| Domain operations, judgment calls, strategy | Sonnet or Opus | Mid/High |
| Content, creative writing, architecture decisions | Opus | High |

The expensive models only get involved when the task actually requires their reasoning ability. Strategy work, client-facing content, architecture decisions, anything requiring judgment or nuance. Everything else gets routed down.

## What Makes This Work

Three things matter more than the specific models.

**Explicit routing rules.** I maintain a delegation table that maps task patterns to models. When the system encounters a task, it checks the table before executing. Without this, the default behavior is to use whatever model is loaded, which is always the most expensive one.

**Cost awareness as a design constraint.** Most people treat AI model selection like picking a restaurant: go to the best one you can afford. The better frame is staffing. You don't assign a senior partner to update a spreadsheet. You don't assign an intern to negotiate a contract. Same logic applies to AI models.

**Local models for commodity work.** Running Qwen locally was the single biggest cost reduction. Text generation (drafts, structured output, filling in templates) is high-volume, low-complexity work. A 9B parameter model running on a laptop handles it fine. The quality is good enough for a first pass, and I review everything before it goes out.

## What Lower Opus Usage Actually Means

The real win isn't the token savings. It's what the budget now covers.

Before the optimization, I was burning through my weekly Opus quota by Wednesday. That meant three days of constrained output: slower work, manual fallbacks, tasks deferred. Token budget was the actual ceiling, not workload.

Routing changed the math. Heavy sessions still hit limits occasionally, but the weekly quota now runs the full week in normal usage. I've added automations that would have been cost-prohibitive before: background agents for CRM updates, meeting summaries, and daily briefings that run whether I'm in a session or not.

When Opus was handling everything, it was spending most of its time on retrieval and formatting: tasks that don't need it. Now it's almost entirely on strategy, client-facing content, and decisions that actually require judgment. The quality of that work is better because the model isn't context-switching between searching a CRM and writing a proposal.

## What I'd Tell a Team Doing This

If you're running AI tools for a team or a practice and you're hitting cost walls or quota limits, you probably don't need a bigger plan. You need better routing.

Start by auditing what your AI actually does in a week. Categorize every task by complexity. My bet is that 60–70% of your AI usage is retrieval, reformatting, or mechanical work that doesn't need your most capable model.

Then set up the tiers. You don't need my exact stack. The principle is the same whether you're using Claude, GPT, Gemini, or a mix. Put the expensive model where it earns its keep. Route everything else to something cheaper or free.

The 70% reduction isn't a one-time optimization. It compounds. Lower cost per task means you can run more tasks, automate more workflows, and build systems that would have been prohibitively expensive before.

We're not a tech company. We built this incrementally over 3 months while doing client work. If your business uses AI tools daily, you can do the same.

---

**Source:** https://teamsatori.asia/blog/right-tool-right-task/