The Missing Layer in Enterprise AI: Model–Tool Orchestration

As the diversity of AI components accelerates, orchestration transforms from an optimisation problem into a fundamental system requirement.

Agents · Agent Architecture

13 January 2026


Overview

Modern AI systems rarely rely on a single large language model. Instead, they must decide which combination of models and tools should handle each request, often under tight constraints on cost, latency, and reliability. This article examines the orchestration challenge that emerges from this shift, focusing on:

  • Why model and tool selection has become a high-dimensional optimisation problem
  • How poor routing choices increase cost, latency, and failure rates
  • Why treating model choice and tool use as separate decisions limits performance
  • How ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation) jointly optimises model–tool combinations in practice

Using empirical results and deployment-oriented insights, we show how coordinated orchestration allows smaller, specialised components to outperform larger monolithic models. The framework highlights practical strategies for balancing performance, efficiency, and generalisation in real-world, multi-domain AI systems.

Introduction: The Orchestration Challenge

The AI landscape has changed in a fundamental way. Production systems no longer rely on a single large language model for every task but must choose the right mix of models and tools for each query.

Different tasks place very different demands on orchestration:

  • Code generation may favour a specialised coding model paired with execution sandboxes
  • Mathematical reasoning often benefits from a general-purpose LLM augmented with symbolic computation tools
  • Multimodal queries may require vision–language models coordinated with OCR and visual reasoning modules

As more specialised components are added, the decision space grows quickly. With five models and four tools, the system already faces 20 possible combinations per query. Add further domain-specific models or tools, and the search space expands rapidly.
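The growth of the decision space can be sketched with a few lines of arithmetic. This toy function is illustrative only (it is not part of ATLAS), and it counts single model–tool pairs; allowing chained tool calls makes the space grow even faster.

```python
# Toy illustration of how the routing decision space grows as
# components are added. Counts (model, tool-sequence) choices only.

def decision_space(n_models: int, n_tools: int, chain_len: int = 1) -> int:
    """Number of distinct model + tool-sequence choices per query."""
    return n_models * (n_tools ** chain_len)

print(decision_space(5, 4))               # 5 models x 4 tools = 20
print(decision_space(8, 6))               # 48 single-step combinations
print(decision_space(5, 4, chain_len=2))  # 80 if two tool calls are chained
```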

Poor routing decisions tend to surface quickly as:

  • Wasted compute
  • Increased latency
  • Outright task failures

Choosing well allows a system built from smaller, cheaper components to outperform a single large, expensive model.

This creates a high-dimensional optimisation problem at the heart of modern AI deployment. Model choice and tool use can no longer be treated as isolated decisions. Performance depends on how well specific models and tools work together, and those pairings vary widely across domains.

Our work on ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation) (Wu, 2026 [1]) addresses this challenge through a practical orchestration framework that balances accuracy, efficiency, and generalisation. In the sections that follow, we describe the core problem, outline ATLAS’s dual-path solution, and highlight lessons that practitioners can apply when building multi-domain AI systems.

The Core Problem: Beyond Single-Model Routing

Most current systems treat model choice and tool use as separate decisions. A router might pick which LLM to call, while tool logic sits elsewhere with fixed rules. That split overlooks something important: some models work far better with certain tools, and those pairings change depending on the domain.

Take a chemistry question involving pH calculations. A general-purpose model can stumble on numerical accuracy, while a reasoning-focused model combined with a calculator tool may solve it cleanly. That same setup, though, would be unnecessary for a simple factual query that a lightweight model with retrieval handles just fine.

The key point is simple. Strong performance comes from choosing a model and a tool together, not from optimising each in isolation.

Two Paths: Using What We Know and Adapting to What’s New

ATLAS is built around a trade-off that shows up in every production AI system: leaning on what you already know versus staying flexible when something new appears. That tension led us to a dual-path design.

Path 1: Cluster-Based Routing for Known Domains

If you have historical signals (previous queries, performance outcomes, cost data), you can turn them into practical priors. Our cluster-based method works in three steps:

  • Semantic Clustering: Past queries are embedded into a shared space and grouped by meaning. Clear domain structure emerges on its own: maths questions cluster together, coding tasks form separate groups, and factual queries sit elsewhere.
  • Performance Profiling: For each cluster, we track which model–tool combinations worked, along with their operational costs, including tokens, latency, and API pricing.
  • Utility-Based Selection: At inference time, a new query is projected into this space, matched to the nearest cluster, and routed to the combination that best balances accuracy and cost.
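The three steps above can be sketched in a few dozen lines. Everything here is illustrative rather than taken from the ATLAS implementation: the cluster names, the per-cluster profiles, the toy 2-D embeddings, and the utility weight are all assumptions made for the sake of a runnable example.

```python
import math

# Per-cluster profile: for each (model, tool) pair, observed accuracy
# and a normalised cost (tokens, latency, and API pricing folded into
# one number). Values are invented for illustration.
PROFILES = {
    "maths": {("reasoning-llm", "calculator"): (0.82, 0.40),
              ("small-llm", "none"): (0.55, 0.10)},
    "factual": {("small-llm", "retrieval"): (0.78, 0.15),
                ("reasoning-llm", "calculator"): (0.74, 0.40)},
}
CENTROIDS = {"maths": [1.0, 0.0], "factual": [0.0, 1.0]}  # toy 2-D embeddings

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def route(query_embedding, cost_weight=0.5):
    # 1) Match the query to its nearest cluster centroid.
    cluster = max(CENTROIDS, key=lambda c: cosine(query_embedding, CENTROIDS[c]))
    # 2) Pick the pair maximising utility = accuracy - weight * cost.
    profile = PROFILES[cluster]
    best = max(profile, key=lambda mt: profile[mt][0] - cost_weight * profile[mt][1])
    return cluster, best

print(route([0.9, 0.1]))  # maths-like query -> reasoning model + calculator
```

In a real deployment the centroids would come from K-means over sentence embeddings of logged queries, and the profiles from accuracy and cost tracking per cluster; the constant-time lookup structure stays the same.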

This path needs no training and runs in constant time. In our tests, it reached 63.5% average accuracy across varied benchmarks when clusters aligned with evaluation domains. For well-scoped enterprise systems (customer support, internal QA, domain-specific assistants) it delivers immediate, reliable gains.

Practical Tip: Start here. If you have query logs, cluster-based routing gets you most of the value with minimal complexity. Simple K-means (we used K=8) on sentence embeddings, plus basic accuracy and cost tracking per cluster, is often enough.

Path 2: RL-Based Routing for Open-Domain Generalisation

Cluster-based routing starts to break down when tasks move outside familiar territory, with accuracy dropping to 49.2% under domain shift. Real production systems cannot assume stable inputs. They have to cope with new task types, changing user behaviour, and question patterns they have never seen before. To handle this, we frame routing as a sequential decision problem and solve it using reinforcement learning.

  • Multi-step interaction: The router behaves like an agent. It can switch between internal reasoning (“think”) and external actions (“route to model X with tool Y”). This supports multi-turn refinement: call a tool, check the result, then decide whether to invoke another combination.
  • Composite reward signal: The reward function has three parts:
      – Format reward, which penalises invalid syntax or malformed tool calls
      – Outcome reward, which directly reflects task correctness
      – Selection reward, which biases the policy towards efficient, domain-appropriate choices
  • PPO optimisation: A small policy model (3B parameters) is trained using Proximal Policy Optimisation. Over time, it picks up transferable behaviours, such as using symbolic tools for verification or deferring numerical reasoning to specialised models.
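The three reward components can be sketched as follows. The component roles follow the description above, but the syntax check, the preferred-pairing table, and the weights are invented for illustration; the actual shaping used in ATLAS is not reproduced here.

```python
# Hedged sketch of the three-part composite reward. All specifics
# (syntax check, preference table, weights) are illustrative.

def format_reward(response: str) -> float:
    """Penalise malformed tool calls; here, a toy syntax check."""
    return 0.0 if response.startswith("route(") and response.endswith(")") else -1.0

def outcome_reward(answer: str, gold: str) -> float:
    """Directly reflects task correctness."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def selection_reward(model: str, tool: str, domain: str) -> float:
    """Bias towards efficient, domain-appropriate pairings (toy table)."""
    preferred = {"maths": ("reasoning-llm", "calculator"),
                 "factual": ("small-llm", "retrieval")}
    return 0.2 if preferred.get(domain) == (model, tool) else 0.0

def total_reward(response, answer, gold, model, tool, domain,
                 weights=(1.0, 1.0, 1.0)) -> float:
    parts = (format_reward(response),
             outcome_reward(answer, gold),
             selection_reward(model, tool, domain))
    return sum(w * r for w, r in zip(weights, parts))

r = total_reward("route(reasoning-llm, calculator)", "42", "42",
                 "reasoning-llm", "calculator", "maths")
print(r)  # 0.0 format penalty + 1.0 correct + 0.2 preferred = 1.2
```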

This RL path maintained 59.4% accuracy on out-of-distribution tasks, outperforming clustering by 10.2% and supervised baselines by 13.1%. When we expanded the available model–tool pool at inference time by adding new medical and maths-focused models, the policy adapted without retraining, lifting accuracy from 59.4% to 61.7%.
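One mechanism that makes this kind of retraining-free expansion possible (an illustrative design, not necessarily the one ATLAS uses) is to have the policy score actions by their feature descriptions rather than by fixed action IDs, so newly added model–tool pairs can be ranked as soon as they are described.

```python
# Illustrative mechanism for a model-tool pool that can grow at
# inference time: actions are described by feature vectors, so a
# scorer over features can rank newly added pairs without retraining.
# Feature layout (toy): [maths-ness, medical-ness, has-calculator-tool]

def score(query_feats, action_feats):
    """Toy linear compatibility score between a query and an action."""
    return sum(q * a for q, a in zip(query_feats, action_feats))

POOL = {
    ("reasoning-llm", "calculator"): [0.9, 0.0, 1.0],
    ("small-llm", "retrieval"): [0.1, 0.2, 0.0],
}

def best_action(query_feats, pool):
    return max(pool, key=lambda a: score(query_feats, pool[a]))

print(best_action([1.0, 0.0, 0.5], POOL))  # maths query -> calculator pair

# Adding a new medical model at inference time needs no retraining:
POOL[("med-llm", "retrieval")] = [0.0, 0.95, 0.0]
print(best_action([0.0, 1.0, 0.0], POOL))  # medical query -> new model
```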

Practical Tip: Use RL when you expect domain drift or broad task coverage. The upfront cost matters, but it pays back in generalisation. We trained for 250 steps across three datasets. A small policy model is enough (ours coordinated far larger models without becoming a bottleneck).

Final Thoughts on Orchestration as Infrastructure

The AI systems that matter going forward will not be single models. They will be collections of specialised components, stitched together through orchestration. The logic that decides what runs when becomes just as important as the models and tools themselves.

Our work with ATLAS shows that careful orchestration allows smaller, mixed model–tool setups to match, and often beat, large monolithic systems. For practitioners, the direction is straightforward. Use simple cluster-based routing where domain patterns are stable. Bring in RL-based policies when generalisation matters. Build systems that can absorb new models and tools without being redesigned from scratch.

References

  1. Wu, J., Zhai, G., Jin, R., Yuan, J., Shen, Y., Zhang, S., Wen, Z., & Tao, J. (2026). ATLAS: Orchestrating heterogeneous models and tools for multi-domain complex reasoning. https://arxiv.org/abs/2601.03872