As the diversity of AI components accelerates, orchestration transforms from an optimisation problem into a fundamental system requirement.


Overview
Modern AI systems rarely rely on a single large language model. Instead, they must decide which combination of models and tools should handle each request, often under tight constraints on cost, latency, and reliability. This article examines the orchestration challenge that emerges from this shift, focusing on how to select model–tool combinations jointly rather than in isolation.
Using empirical results and deployment-oriented insights, we show how coordinated orchestration allows smaller, specialised components to outperform larger monolithic models. The framework highlights practical strategies for balancing performance, efficiency, and generalisation in real-world, multi-domain AI systems.
Introduction: The Orchestration Challenge
The AI landscape has changed in a fundamental way. Production systems no longer rely on a single large language model for every task but must choose the right mix of models and tools for each query.
Different tasks place very different demands on orchestration: a multi-step chemistry calculation may need a reasoning-focused model paired with a calculator, while a simple factual query is best served by a lightweight model with retrieval.
As more specialised components are added, the decision space grows quickly. With five models and four tools, the system already faces 20 possible combinations per query. Add further domain-specific models or tools, and the search space expands rapidly.
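The growth of this decision space is easy to verify. The sketch below enumerates the pairings for five models and four tools; the names are purely illustrative placeholders, not components from ATLAS:

```python
from itertools import product

# Hypothetical component names, for illustration only.
models = ["small-lm", "large-lm", "reasoning-lm", "code-lm", "medical-lm"]
tools = ["none", "calculator", "retrieval", "web-search"]

# Every (model, tool) pairing is a candidate route for a query.
combos = list(product(models, tools))
print(len(combos))  # 5 models x 4 tools = 20 pairings
```

Each added model multiplies the space by the number of tools (and vice versa), which is why exhaustive per-query evaluation quickly becomes impractical.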
Poor routing decisions tend to surface quickly as inflated costs, added latency, and wrong or unreliable answers.
Choosing well allows a system built from smaller, cheaper components to outperform a single large, expensive model.
This creates a high-dimensional optimisation problem at the heart of modern AI deployment. Model choice and tool use can no longer be treated as isolated decisions. Performance depends on how well specific models and tools work together, and those pairings vary widely across domains.
Our work on ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation) (Wu et al., 2026) addresses this challenge through a practical orchestration framework that balances accuracy, efficiency, and generalisation. In the sections that follow, we describe the core problem, outline ATLAS’s dual-path solution, and highlight lessons that practitioners can apply when building multi-domain AI systems.
The Core Problem: Beyond Single-Model Routing
Most current systems treat model choice and tool use as separate decisions. A router might pick which LLM to call, while tool logic sits elsewhere with fixed rules. That split overlooks something important: some models work far better with certain tools, and those pairings change depending on the domain.
Take a chemistry question involving pH calculations. A general-purpose model can stumble on numerical accuracy, while a reasoning-focused model combined with a calculator tool may solve it cleanly. That same setup, though, would be unnecessary for a simple factual query that a lightweight model with retrieval handles just fine.
The key point is simple. Strong performance comes from choosing a model and a tool together, not from optimising each in isolation.
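The difference between joint and separate selection can be made concrete with a small routing table. The scores below are invented for illustration (not measurements from the paper); the point is that the best pair is not the best model plus the best tool chosen independently:

```python
# Hypothetical per-domain scores for (model, tool) pairs; numbers are illustrative only.
scores = {
    "chemistry": {
        ("general-lm", "none"): 0.41,
        ("general-lm", "calculator"): 0.48,
        ("reasoning-lm", "none"): 0.55,
        ("reasoning-lm", "calculator"): 0.82,  # synergy: the pair beats either change alone
    },
}

def route(domain):
    """Pick the (model, tool) pair jointly by its combined score,
    rather than optimising model choice and tool choice separately."""
    table = scores[domain]
    return max(table, key=table.get)

print(route("chemistry"))  # -> ('reasoning-lm', 'calculator')
```

A system that picked the best model in isolation (0.55 without a tool) and bolted on tool logic elsewhere would miss the 0.82 pairing entirely.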
Two Paths: Using What We Know and Adapting to What’s New
ATLAS is built around a trade-off that shows up in every production AI system: leaning on what you already know versus staying flexible when something new appears. That tension led us to a dual-path design.
Path 1: Cluster-Based Routing for Known Domains
If you have historical signals (previous queries, performance outcomes, cost data), you can turn them into practical priors. Our cluster-based method works in three steps: cluster historical queries by their embeddings, record how each model–tool combination performs (accuracy and cost) within every cluster, then route each new query to its nearest cluster and invoke that cluster’s best-performing combination.
This path needs no training and runs in constant time. In our tests, it reached 63.5% average accuracy across varied benchmarks when clusters aligned with evaluation domains. For well-scoped enterprise systems (customer support, internal QA, domain-specific assistants) it delivers immediate, reliable gains.
Practical Tip: Start here. If you have query logs, cluster-based routing gets you most of the value with minimal complexity. Simple K-means (we used K = 8) on sentence embeddings, plus basic accuracy and cost tracking per cluster, is often enough.
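The routing step above can be sketched in a few lines. This is a minimal, stdlib-only illustration assuming centroids and per-cluster best pairs have already been computed offline (in practice via K-means on sentence embeddings); the cluster names, centroids, and embedding values are all hypothetical:

```python
def nearest_cluster(embedding, centroids):
    """Route a query embedding to its nearest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((e - x) ** 2 for e, x in zip(embedding, centroids[c])),
    )

# Per-cluster priors: the best-performing (model, tool) pair from historical logs.
best_pair = {
    "cluster_math": ("reasoning-lm", "calculator"),
    "cluster_faq": ("small-lm", "retrieval"),
}
# Toy 2-d centroids; real systems would use full sentence-embedding dimensions.
centroids = {"cluster_math": [1.0, 0.0], "cluster_faq": [0.0, 1.0]}

query_embedding = [0.9, 0.1]  # stand-in for a real sentence embedding
cluster = nearest_cluster(query_embedding, centroids)
print(cluster, best_pair[cluster])
```

Because routing reduces to a nearest-centroid lookup plus a table read, it needs no training and runs in constant time per query, exactly the property claimed above.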
Path 2: RL-Based Routing for Open-Domain Generalisation
Cluster-based routing starts to break down when tasks move outside familiar territory, with accuracy dropping to 49.2% under domain shift. Real production systems cannot assume stable inputs. They have to cope with new task types, changing user behaviour, and question patterns they have never seen before. To handle this, we frame routing as a sequential decision problem and solve it using reinforcement learning.
This RL path maintained 59.4% accuracy on out-of-distribution tasks, outperforming clustering by 10.2 percentage points and supervised baselines by 13.1 percentage points. When we expanded the available model–tool pool at inference time by adding new medical and maths-focused models, the policy adapted without retraining, lifting accuracy from 59.4% to 61.7%.
Practical Tip: Use RL when you expect domain drift or broad task coverage. The upfront cost matters, but it pays back in generalisation. We trained for 250 steps across three datasets. A small policy model is enough (ours coordinated far larger models without becoming a bottleneck).
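To make the adaptive path concrete, here is a deliberately simplified bandit-style sketch, not ATLAS’s actual policy (the paper frames routing as a full sequential decision problem). It shows two properties discussed above: learning which model–tool arm pays off from observed rewards, and absorbing new arms at inference time without retraining from scratch. All arm names are hypothetical:

```python
import random

random.seed(0)

# Candidate (model, tool) "arms"; names are illustrative.
arms = [("small-lm", "retrieval"), ("reasoning-lm", "calculator"), ("general-lm", "none")]
counts = {a: 0 for a in arms}
values = {a: 0.0 for a in arms}  # running mean reward per arm

def select(eps=0.1):
    """Epsilon-greedy: explore a random arm occasionally, otherwise exploit the best-known arm."""
    if random.random() < eps:
        return random.choice(arms)
    return max(arms, key=lambda a: values[a])

def update(arm, reward):
    """Incremental-mean update of the chosen arm's value from an observed reward."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

def add_arm(arm):
    """New models or tools join the pool at inference time; the policy keeps its learned values."""
    arms.append(arm)
    counts[arm], values[arm] = 0, 0.0
```

In use, each routed query yields a reward (e.g. answer correctness minus a cost penalty) that feeds `update`; `add_arm` mirrors the pool-expansion experiment described above, where new medical and maths models were added without retraining.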
Final Thoughts on Orchestration as Infrastructure
The AI systems that matter going forward will not be single models. They will be collections of specialised components, stitched together through orchestration. The logic that decides what runs when becomes just as important as the models and tools themselves.
Our work with ATLAS shows that careful orchestration allows smaller, mixed model–tool setups to match, and often beat, large monolithic systems. For practitioners, the direction is straightforward. Use simple cluster-based routing where domain patterns are stable. Bring in RL-based policies when generalisation matters. Build systems that can absorb new models and tools without being redesigned from scratch.
References
Wu, J., Zhai, G., Jin, R., Yuan, J., Shen, Y., Zhang, S., Wen, Z., & Tao, J. (2026). ATLAS: Orchestrating heterogeneous models and tools for multi-domain complex reasoning. arXiv. https://arxiv.org/abs/2601.03872