HuggingFace Chooses Arch (Router) for Omni Chat

Written by hacker4935681 | Published 2025/11/04
Tech Story Tags: huggingface | llms | ai | machine-learning | huggingface-chooses-arch | arch-router | hackernoon-top-story | open-source

TL;DR: HuggingFace's choice of Arch-Router highlights the shift from generic academic benchmarks to preference-aligned routing in the LLM ecosystem. This lightweight (1.5B-parameter) model lets developers define a routing policy based on a query's domain and action, so each request consistently reaches the best-performing LLM for their specific workflow. The router is remarkably fast (50ms median routing time) and accurate, and policies can be updated instantly without costly retraining, making it an ideal, cost-effective solution for the varied, subjective demands of real-world AI applications.

It was thrilling to wake up to this tweet: one of the largest and most influential AI companies betting on your technology. But what is Arch-Router? Why do they use it? And should you consider it for your own AI applications?



I'll skip the basics on what LLMs are; I'm assuming the audience here is already familiar with the core concepts of generative AI. What's less obvious, though, is how the blistering pace of new model releases overwhelms even the savviest users, and how distant that makes the dream of a magical system that accurately routes queries across this exploding ecosystem.


Why can't we have this magical system? Because performance is highly subjective, especially when you are trying to align an LLM to your business processes or workflows. Take an example from Digits, a fintech platform automating transaction queries: for routine balance checks, a lightweight model like Claude 4.5 Haiku delivers crisp, factual responses in under 50 words, perfect for high-volume, time-sensitive support tickets where brevity trumps elaboration. But for dispute escalations involving potential fraud, the same model falls flat (for them) on empathy, sounding robotic and detached. Here, routing to a more nuanced model like GPT-5, tuned for a warmer, reassuring tone (e.g., "I understand how frustrating this must be; let's walk through the details together"), aligns better with the workflow's need to de-escalate and build trust. Ultimately, the quality and effectiveness of an LLM response lies squarely in the hands of the developer, who must weigh these nuanced trade-offs to deliver responses that truly resonate with users. Sorry, there is no magic bullet, and there may never be one as workflows evolve and get even more interesting.


Until now, most LLM routing systems have optimized for performance on academic benchmarks, like MMLU or GPQA, which don't reflect the messy, subjective, task-specific judgments users and developers make in real-world applications. In the real world, it's less about benchmark scores and more about domain-specific accuracy, speed, and preference fit. That's why we built Arch-Router, a lightweight (1.5B-parameter) routing model that lets you capture your preferences and apply them to model routing decisions.

You define intuitive categories like "travel booking" or "image editing," and Arch-Router routes each query to the model you've found to work best, based on your own experience and evaluation. Unlike rigid benchmark-tuned approaches, Arch-Router is transparent, adaptable to new models, and fast, clocking in at just 50ms per routing decision, while outperforming even proprietary LLMs like Claude 3.7 Sonnet and GPT-4o in our evaluations on real conversational data.

What is Arch-Router?

As developers, only you truly know which LLM works best for your use case, learned through countless rounds of trial and error. Benchmarks won't reflect your real-world experience, specialized tasks, or unique expectations. Preference-aligned routing offers a new approach to LLM routing, focusing on practical, subjective preferences such as domain expertise (finance, coding, medical) or specific actions (summarization, image generation). The framework closes that gap by letting you encode your own notion of "best." You supply a routing policy that does two things (a minimal sketch follows the list):

  1. Breaks the query space into policies at the domain level (e.g., finance, medical) and, when needed, the finer-grained action level (e.g., “summarize,” “generate SQL”).
  2. Maps each policy to the exact model you trust for that slice of work.
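
To make that concrete, here is a minimal sketch of what such a routing policy could look like, expressed as plain Python data. The route names, descriptions, and model identifiers below are illustrative assumptions, not taken from the Arch documentation.

```python
# A hypothetical preference-aligned routing policy: each route pairs a
# domain (and, when needed, a finer-grained action) with the model the
# team trusts for that slice of work. All names here are illustrative.
ROUTING_POLICY = [
    {
        "name": "finance_summarize",
        "description": "Summarize financial statements or transaction history",
        "model": "claude-haiku-4.5",
    },
    {
        "name": "finance_dispute",
        "description": "Handle disputes or potential fraud with an empathetic tone",
        "model": "gpt-5",
    },
    {
        "name": "code_sql",
        "description": "Generate or explain SQL queries",
        "model": "gemini-2.5-pro",
    },
    {
        # Catch-all route for queries that match nothing above.
        "name": "other",
        "description": "Anything that does not fit the routes above",
        "model": "gpt-4o-mini",
    },
]
```

Because the policy is just data, swapping in a newly released model is a one-line change to this file rather than a retraining job.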

Arch-Router LLM is a 1.5-billion-parameter model built around this preference-aligned framework. Instead of hard-coding rules or relying on a black-box router, you hand Arch-Router your routing policy and it does the rest. Despite its compact size, the model outperforms larger proprietary LLMs from the GPT-4o, Claude, and Gemini families. It is also blazing fast, delivering end-to-end routing decisions in 50ms at the median and under 75ms at p99, while competing LLMs typically spend roughly one second just to pick a route (as shown in Figure 1). The result: state-of-the-art accuracy at a fraction of the latency and deployment cost.
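
Because Arch-Router-1.5B is published on Hugging Face, you can load it like any other causal LM. Below is a minimal sketch using transformers, assuming the katanemo/Arch-Router-1.5B repo id and a simple chat-style prompt; the exact prompt template the router expects is defined on its model card, so treat this as a simplification.

```python
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id as published on Hugging Face (assumption based on the release);
# ROUTING_POLICY is the Python list from the sketch above.
MODEL_ID = "katanemo/Arch-Router-1.5B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# Hand the router its policy plus the user query in one chat exchange.
# The model card defines the real prompt template; this is a simplification.
messages = [
    {"role": "system", "content": "Routes:\n" + json.dumps(ROUTING_POLICY, indent=2)},
    {"role": "user", "content": "Can you summarize last month's transactions?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The router only needs to emit a short route name, so generation is cheap.
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```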

How does it work?

Arch-Router introduces two key concepts:

  • Domain – the high-level thematic category or subject matter of a request (e.g., legal, healthcare, programming).
  • Action – the specific type of operation the user wants performed (e.g., summarization, code generation, booking appointment, translation).

Both domain and action policies are associated with preferred models or model variants. At inference time, Arch-Router analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues. It then applies the user-defined routing preferences to select the model best suited to handle the request as shown in Figure 2.
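
In application code, the router's output is just a route name that you map back to a model and dispatch. Here is a hedged sketch of that last step; route_query and call_llm are hypothetical helpers (a wrapper around the router call above and your provider client, respectively), and ROUTING_POLICY is the policy sketched earlier.

```python
def dispatch(user_query: str) -> str:
    """Route a query with Arch-Router, then forward it to the preferred model."""
    # route_query is a hypothetical wrapper around the generation call above,
    # returning the route name the router picked (e.g. "finance_summarize").
    route_name = route_query(user_query)

    # Map the chosen route back to a model id, falling back to the catch-all
    # route if the router returns something unexpected.
    routes = {r["name"]: r["model"] for r in ROUTING_POLICY}
    model_id = routes.get(route_name, routes["other"])

    # call_llm is a stand-in for your provider client (OpenAI, Anthropic, ...).
    return call_llm(model_id, user_query)
```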


Performance

Arch-Router is fast and accurate, choosing a model almost instantaneously (50ms) while scoring higher than the best proprietary LLMs on routing performance. It aligns with your preferences: different individuals or teams can craft their own routing policies so each query lands on the model they trust most. And it stays flexible and adaptable: see a new model you want to try, or want to add a task to your product? Simply update the routing policy file and use it; no costly retraining, no pipeline rebuilding. Here are some stats:

  • Speed: 50ms median routing time (75ms at p99)
  • Accuracy: 93.06% routing accuracy on the provided benchmark
  • Cost: $0.00132 per routing query
  • Comparison: proprietary routers average 1,000ms+ routing time, at up to $5 per routing query (GPT-4o)

Ready to dive deeper?

This blog post only scratches the surface of what Arch-Router is and how to use it; the full story lives in our open-source stack:

  • Research paper - detailed methodology, benchmarks, and ablation studies
  • Arch-Router collection - Arch-Router-1.5B on Hugging Face, including GGUF builds
  • Arch - a models-native proxy server for agents: move faster by offloading the plumbing work in AI and spend more time modeling business workflows, in any language or framework.

Visit our repository for implementation guides, to contribute improvements, or to report issues. We welcome community contributions to advance LLM-based agents. And hey, if you like what we've built, don't forget to ⭐️ the project.


Written by hacker4935681 | I am a product builder. I see technology as a means to improve the human experience.
Published by HackerNoon on 2025/11/04