The Haiku-Opus Strategy: Token-Efficient Model Routing
Why routing AI tasks by complexity cuts inference costs by up to 85% without sacrificing quality, and how your business can stop overpaying for intelligence it doesn’t need.
Introduction: The Default That Costs Too Much
When teams first deploy AI in production, they tend to reach for the most powerful model available. The logic seems airtight: use the best tool, get the best results, ship confidently. For many organizations in 2023 and 2024, that meant routing virtually every workload through GPT-4, Claude 3 Opus, or their nearest equivalent, regardless of whether the task actually required frontier-level reasoning.
This instinct is understandable. Early AI projects often justify themselves on quality alone, not cost. When the primary goal is to prove that AI can do the job at all, the price per query feels secondary. As usage scales from prototypes to production and from dozens of queries to millions, the cost of that default becomes impossible to ignore.
Anthropic designed its Claude model family from the start with explicit tiers: Opus for the highest-complexity tasks, Sonnet for balanced workloads, and Haiku for high-volume, latency-sensitive operations where near-instant responsiveness matters more than maximum reasoning depth. At launch in March 2024, the pricing gap between those tiers was striking. Claude 3 Haiku entered the market at $0.25 per million input tokens; Claude 3 Opus was $15 per million. [1] That is a 60-fold difference in per-token cost for models designed to coexist within the same application.
“Each successive model offers increasingly powerful performance, allowing users to select the optimal balance of intelligence, speed, and cost for their specific application.”
— Anthropic, Claude 3 family announcement (March 2024) [1]
Most teams treated that design as a menu and ordered the most expensive item every time. The consequences did not show up immediately. They showed up at the billing review.
The Research That Documented the Waste
The case for tiered model routing was not theoretical. By 2023, independent academic research had already demonstrated that intelligent model selection could produce dramatic cost reductions without meaningful quality trade-offs, and the evidence continued to strengthen through 2024.
Stanford University researchers Lingjiao Chen, Matei Zaharia, and James Zou published FrugalGPT in May 2023, introducing a cascade approach: instead of routing every query to a single frontier model, the system queries cheaper models first and escalates only when the response confidence falls below a threshold. The results were stark. FrugalGPT could match GPT-4’s performance with up to 98% cost reduction, or improve accuracy over GPT-4 by 4% at the same cost. [4] The core finding was that a large share of real-world queries simply do not require the reasoning depth of the most capable model.
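The cascade's control flow is easy to sketch. In the Python outline below, the model names, client call, and reliability scorer are all placeholder assumptions; FrugalGPT's actual contribution is a trained scoring model that decides when to escalate, which this sketch only stubs out.

```python
# Control-flow sketch of a FrugalGPT-style cascade. Model names and the
# injected call/score functions are illustrative stand-ins, not the
# paper's implementation.
from typing import Callable

CASCADE = ["cheap-model", "mid-model", "frontier-model"]  # cheapest first
CONFIDENCE_THRESHOLD = 0.8

def cascade_query(
    query: str,
    call_model: Callable[[str, str], str],  # (model_name, query) -> answer
    score: Callable[[str, str], float],     # (query, answer) -> confidence in [0, 1]
) -> str:
    """Try models cheapest-first; stop as soon as an answer looks reliable."""
    answer = ""
    for model in CASCADE:
        answer = call_model(model, query)
        if score(query, answer) >= CONFIDENCE_THRESHOLD:
            break  # good enough: no need to pay for a larger model
    return answer  # if nothing cleared the bar, the frontier answer stands
```

The scorer is the hard part: the paper's finding is that it can be trained well enough that most queries never leave the first tier.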
A year later, researchers at UC Berkeley and Anyscale published RouteLLM, an open-source routing framework trained on human preference data from 80,000 Chatbot Arena conversations. [2] The results on standard benchmarks were decisive. On MT Bench, RouteLLM achieved cost reductions of over 85% compared to routing every query through GPT-4, while maintaining 95% of GPT-4's performance quality. [2][3] Additional savings of 45% and 35% were demonstrated on MMLU and GSM8K respectively. The research team also noted that, at the time of publication, the smaller models a router could fall back to were dramatically cheaper than the frontier option, by a factor of more than 50 in the case of Claude 3 Haiku versus Claude 3 Opus. [2]
The RouteLLM team’s matrix factorization router was able to achieve 95% of GPT-4’s quality using only 26% of GPT-4 calls, meaning 74% of queries could be handled by the cheaper model without measurable degradation. [2] In its most efficient configuration, the router directed 86% of queries to the smaller model while maintaining near-frontier quality across the full query set.
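The relationship between escalation fraction and cost is plain arithmetic, and worth seeing concretely. The sketch below applies the Claude 3 launch input prices quoted earlier ($15 versus $0.25 per million tokens); RouteLLM's published savings were measured against GPT-4 and Mixtral pricing, so these numbers illustrate the mechanics rather than reproduce the benchmark.

```python
# Blended per-million-token cost as a function of how often the router
# escalates to the strong model. Prices are the Claude 3 launch input
# prices cited above [1]; the fractions mirror the routing splits in the text.

def blended_cost(frac_strong: float, price_strong: float, price_weak: float) -> float:
    """Expected $/M tokens when frac_strong of queries go to the strong model."""
    return frac_strong * price_strong + (1 - frac_strong) * price_weak

STRONG, WEAK = 15.00, 0.25             # $/M input tokens (Opus, Haiku)
for frac in (1.00, 0.26, 0.14):        # all-frontier; 26% and 14% escalation
    cost = blended_cost(frac, STRONG, WEAK)
    print(f"{frac:>4.0%} strong calls -> ${cost:5.2f}/M "
          f"({1 - cost / STRONG:.0%} cheaper than all-frontier)")
```

On these prices, the configuration that sends 86% of queries to the smaller model (14% escalation) lands at roughly 85% savings, in line with the MT Bench figure.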
What Teams Discovered When They Looked at Their Bills
The academic findings translated directly to production economics. Teams running enterprise AI workloads began auditing their usage and discovering a consistent pattern: the vast majority of their queries were simple enough for a smaller model, but routing had never been set up to take advantage of that. FAQ answering, form parsing, classification, basic summarization, and language translation were all being sent to frontier models that were priced for sustained multi-step reasoning. The per-query cost for those tasks was 5 to 60 times higher than necessary. [1][6]
The pattern was not unusual. It reflected how most organizations approach AI deployment: start with the best model available, optimize later. The problem is that “later” rarely arrives before the budget conversation does. By the time engineering teams are asked to reduce AI inference costs, those costs are already embedded in production systems that were never designed with routing in mind.
Signals from real-world deployments confirmed the research. Gamma, a presentation-software company and Anthropic API customer, found that Claude Haiku 4.5 outperformed their existing premium-tier model on instruction-following for slide text generation, achieving 65% accuracy against 44% from the more expensive option. [5] The result pointed directly at the over-provisioning problem: not only was the cheaper model sufficient for the task, it was measurably better.
“Claude Haiku 4.5 outperformed our current models on instruction-following for slide text generation, achieving 65% accuracy versus 44% from our premium tier model—that’s a game-changer for our unit economics.”
— Jon Noronha, Co-Founder, Gamma [5]
The discovery has a name: over-provisioning. Organizations that provision more computational power than their workloads require pay for unused capacity. In AI inference, over-provisioning means routing queries that use 10% of a frontier model’s capability through 100% of its pricing.
The Strategic Pivot: The Haiku-Opus Framework
The Haiku-Opus Strategy is not about using cheaper models universally. It is about matching model capability to task complexity at the routing layer, before the query reaches a model at all. Organizations implementing this framework typically work through three operational steps:
- Task complexity mapping. The first step is a structured audit of every AI-powered workflow in the application: what the task is, what inputs it receives, what quality bar its outputs must meet, and how often it runs. Classification tasks, template-based generation, and high-frequency retrieval operations are strong candidates for the Haiku tier. Multi-document synthesis, nuanced decision support, and context-heavy reasoning belong to Sonnet or Opus.
- Tiered routing architecture. Once tasks are classified, routing logic is implemented at the application layer. Each incoming query is inspected for complexity signals, including length, structure, content type, and prior context, then directed to the appropriate model tier. This can be built as simple rule-based branching, as in the sketch after this list, or as a trained router in the pattern of RouteLLM. The architectural principle is consistent: only escalate to a more capable model when the task genuinely requires it.
- Quality monitoring and calibration. Routing thresholds need ongoing adjustment. Initial classification will be imperfect; some queries landing in the Haiku tier will require escalation, and some Opus-level routing will prove unnecessary at scale. A quality monitoring layer, combining sampled output review, human escalation flags, and downstream task success rates, enables continuous calibration of routing decisions without disrupting the system.
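A minimal version of the rule-based branching in step two might look like the following Python sketch. The signals and thresholds are illustrative assumptions, not calibrated values; in a real deployment they come out of the task audit in step one, or the whole function is replaced by a trained router.

```python
# Rule-based tier selection from cheap complexity signals. Thresholds
# and signals here are illustrative assumptions, not calibrated values.

HAIKU_TIER, SONNET_TIER, OPUS_TIER = "haiku", "sonnet", "opus"

def route(query: str, context_docs: int = 0, needs_synthesis: bool = False) -> str:
    """Pick a model tier, escalating only when the signals demand it."""
    if needs_synthesis or context_docs > 3:
        return OPUS_TIER        # multi-document or open-ended reasoning
    if context_docs > 0 or len(query.split()) > 200:
        return SONNET_TIER      # context-heavy or long, but bounded
    return HAIKU_TIER           # short, self-contained, high-volume

# A short classification prompt stays on the cheapest tier.
assert route("Label this support ticket: 'My invoice is wrong.'") == HAIKU_TIER
```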
At current API pricing, Claude Haiku 4.5 is priced at $1 per million input tokens and $5 per million output tokens. Claude Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens. [6] That five-fold difference in base pricing compounds with volume: a system running ten million input tokens per day through Haiku instead of Opus saves $40 per day on input tokens alone, roughly $14,600 per year, and the same five-fold gap applies to every output token. With prompt caching enabled on Haiku, additional savings of up to 90% are available on repeated prompt structures. [5]
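The input-token arithmetic is simple enough to verify directly at the listed prices:

```python
# Input-token arithmetic for the paragraph above, at the listed prices. [6]
HAIKU_INPUT, OPUS_INPUT = 1.00, 5.00   # $ per million input tokens
daily_millions = 10.0                  # ten million input tokens per day

daily_saving = daily_millions * (OPUS_INPUT - HAIKU_INPUT)
print(f"Daily saving:  ${daily_saving:,.0f}")        # $40
print(f"Annual saving: ${daily_saving * 365:,.0f}")  # $14,600
```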
The speed advantage reinforces the cost advantage in practice. Matthew Isabel, Distinguished Product Manager at GitHub, described the result of integrating Haiku 4.5 into GitHub Copilot: “Our early testing shows that Claude Haiku 4.5 brings efficient code generation to GitHub Copilot with comparable quality to Sonnet 4 but at faster speed. Already we’re seeing it as an excellent choice for Copilot users who value speed and responsiveness in their AI-powered development workflows.” [5]
Key Lessons for Your Business
The Haiku-Opus strategy is available to any team running AI workloads through an API. Three lessons apply directly to SMBs that are evaluating AI adoption or already managing model costs.
Default-to-Frontier Is a Budget Pattern, Not a Quality Strategy
Routing all queries through the most capable available model is an organizational habit, not a deliberate choice. Auditing your current AI workloads and classifying them by required reasoning depth is the first optimization available to any team looking to bring inference costs under control.
Your Task Mix Is the Routing Map
The correct tier for any model call is determined by task complexity, not application category. A legal-tech product may route document summarization to Sonnet and clause extraction to Haiku. A content platform may route topic classification to Haiku and long-form generation to Sonnet. The same application often contains tasks at every tier.
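One lightweight way to express that routing map in code is a per-task configuration table, along these lines. The task names and tier labels are illustrative, not drawn from any real deployment; the point is that the tier is declared per task, not per application.

```python
# Hypothetical routing map produced by a task audit. Task names and
# tier assignments are illustrative examples only.
ROUTING_MAP = {
    "clause_extraction":    "haiku",   # template-bound, high-volume
    "topic_classification": "haiku",
    "document_summary":     "sonnet",  # bounded reasoning over one document
    "long_form_generation": "sonnet",
    "multi_doc_synthesis":  "opus",    # open-ended, context-heavy reasoning
}

def tier_for(task: str) -> str:
    """Unknown tasks default to the strongest tier until they are audited."""
    return ROUTING_MAP.get(task, "opus")
```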
Speed and Cost Scale Together
Claude Haiku 4.5 runs at four to five times the speed of Sonnet 4.5. [5] For high-volume, latency-sensitive workflows such as real-time customer interactions, code completion, and live data monitoring, this speed advantage compounds the cost advantage and directly improves user experience metrics that frontier models cannot optimize for at scale.
Conclusion: Allocation Over Uniformity
The Haiku-Opus Strategy reflects a broader maturation in how organizations deploy AI. The first generation of enterprise AI implementations optimized for capability: find the best model, integrate it, ship. The second generation optimizes for allocation: match the right model to the right task at the right time.
The evidence from academic research is not marginal. FrugalGPT’s 98% cost reduction ceiling [4] and RouteLLM’s 85% savings on MT Bench [2] demonstrate that the gap between frontier-only and intelligently routed deployments is structural, not incidental. Teams that do not implement routing are not paying a small premium for convenience. They are systematically overpaying for reasoning capability that most of their workload will never use.
The good news for SMBs is that this optimization does not require building a research-grade routing system from scratch. The architecture is well-documented, the tooling is open source, and the model pricing already reflects the tiers. The work is in the audit: understanding your own task mix well enough to make deliberate choices about where each query belongs.
“The most capable model is not always the right model. Matching intelligence to need is what separates a sustainable AI operation from an expensive proof of concept.”
Sources & References
[1] Anthropic. “Introducing the next generation of Claude.” March 4, 2024.
[2] Ong, Isaac, et al. “RouteLLM: Learning to Route LLMs with Preference Data.” UC Berkeley / Anyscale, June 2024.
[3] LMSYS. “RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing.” July 1, 2024.
[4] Chen, Lingjiao, Matei Zaharia, and James Zou. “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” Stanford University, May 2023.
[5] Anthropic. “Claude Haiku 4.5.” October 15, 2025.
[6] Anthropic. “Claude API Models Overview.” Anthropic Platform Documentation, 2026.
[7] Anthropic. “Claude 3 Haiku: our fastest model yet.” March 13, 2024.
[8] LMSYS. RouteLLM open-source repository. GitHub, 2024.
Stop Paying Frontier Prices for Routine Work.
We help SMBs audit their AI workloads, design tiered routing architectures, and build inference cost structures that scale without ballooning budgets, from initial audit through production deployment.
Schedule Your Free AI Readiness Assessment →