Teams building enterprise AI Agent platforms tend to run into the same problem. Major large language model (LLM) providers such as OpenAI, Anthropic, DeepSeek, Tongyi Qwen and Ollama each define their own API specification, request format and streaming response rules. If business code tells providers apart with if-else statements, the codebase grows with every model integrated, and iteration, operation, maintenance and model switching all become more expensive.
This article introduces a four-tier layered architecture. Because it is configuration-driven, developers can add or replace LLMs without modifying business code. It also addresses practical engineering issues such as inconsistent output formats across models and service failover, giving AI Agent projects a scalable underlying scheduling system.
1. Critical Drawbacks of Traditional Development Mode
In traditional architectures, the business layer connects directly to individual LLM service APIs. Every time a new model is integrated, developers have to write dedicated logic for request encapsulation, streaming parsing and token statistics. The code is filled with conditional branches judging model providers, as shown in the following Go code snippet:
if provider == "openai" {
	// Call OpenAI interface
} else if provider == "anthropic" {
	// Call Anthropic interface
} else if provider == "deepseek" {
	// Call DeepSeek interface
}
This hardcoding pattern brings three major flaws: low iteration efficiency, tight code coupling and poor failover capability. Once an LLM service goes offline or malfunctions, the system cannot quickly switch to backup nodes, which directly threatens the stability of online AI Agent services.
To tackle these pain points, layered scheduling architectures have gradually become mainstream in the industry. Many teams adopt TreeRouter as a unified access gateway for multi-model APIs, to centrally manage LLM calls, optimize traffic scheduling and improve service access efficiency.
2. Core Four-Tier Architecture: Profile + Router + Workflow + Adapter
The whole system is divided into four independent layers from top to bottom. The business code only depends on unified abstract interfaces and is completely isolated from underlying LLM providers. The standard call chain is as follows:
Business Code → Request with Profile Identifier → Router Distribution → Workflow Protocol Adaptation → Adapter Low-Level Invocation → LLM Provider API
2.1 Unified Provider Interface
As the core abstraction of the entire architecture, all LLMs implement this interface, which supports standard chat and streaming chat capabilities. The standard Go interface definition is listed below:
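type Provider interface {
	// Name returns the identifier of the underlying LLM integration.
	Name() string
	// Chat sends one request and waits for the complete response.
	Chat(ctx context.Context, req ChatRequest) (*ChatResponse, error)
	// Stream returns a receive-only channel of incremental chunks.
	Stream(ctx context.Context, req ChatRequest) (<-chan StreamChunk, error)
}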
The business layer only invokes methods from the Provider interface, achieving thorough decoupling from underlying LLM services.
2.2 Router Layer
The Router matches the corresponding LLM according to the Profile parameter carried in requests. It also supports transparent transmission of raw request payloads to adapt to special scenarios such as direct SDK calls. Its core interface is defined as:
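type Router interface {
	// Pick resolves a profile name to the Provider that serves it.
	Pick(ctx context.Context, profile string) (Provider, error)
	// RawCall forwards an unmodified request body to the profile's upstream.
	RawCall(ctx context.Context, profile string, body []byte, stream bool) (*RawResponse, error)
}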
2.3 Workflow Adaptation Layer
To bridge the gaps between different vendors’ API standards, we define five standard workflow kinds, which together cover the vast majority of mainstream models on the market:
// Kind identifies a workflow adaptation strategy (assumed here to be a string type).
type Kind string

const (
	AnthropicCompat    Kind = "anthropic-compat"
	OpenAICompat       Kind = "openai-compat"
	ClaudeSubscription Kind = "claude-subscription"
	CodexSubscription  Kind = "codex-subscription"
	GitHubCopilot      Kind = "github-copilot"
)
Among them, openai-compat serves as a universal adapter. Any model that exposes an OpenAI-compatible endpoint, including DeepSeek, Tongyi Qwen, Zhipu GLM and a local Ollama instance, can reuse it directly, with no need to develop separate adaptation logic.
2.4 Adapter Layer
The Adapter is responsible for initiating basic HTTP requests. It focuses purely on underlying network communication and maintains concise logic.
3. Configuration-Driven Management: Zero-Code Model Switching & Primary-Backup Failover
All LLM information and routing policies are managed via the configuration file configs/llm/profiles.yaml. Developers can add or switch models simply by editing configurations, instead of rewriting code. The sample configuration is as follows:
profiles:
  - name: deepseek-v3
    workflow: openai-compat
    base_url: https://api.deepseek.com
    model: deepseek-chat
    auth: ${DEEPSEEK_API_KEY}
  - name: claude-prod
    workflow: anthropic-compat
    base_url: https://api.anthropic.com
    model: claude-sonnet-4-6
    auth: ${ANTHROPIC_API_KEY}
routing:
  default:
    primary_profile: deepseek-v3
    fallback_profiles: [kimi-via-anthropic, glm-4]
    strategy: sticky
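The configuration names a primary profile and an ordered list of fallback profiles, so the Router can fail over automatically when the primary is unavailable. The sticky strategy keeps a given session on one LLM, which avoids splitting a conversation's context across models. Note that the fallback profiles referenced here (kimi-via-anthropic and glm-4) need their own entries under profiles, which this sample does not show.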
4. Quirks Layer: Unify Irregular Output Formats
Different LLMs deviate from the expected output format in different ways. For instance, DeepSeek may mix redundant URLs into responses, Claude often wraps JSON data inside code blocks, and lightweight models tend to leave trailing commas in JSON output.
We introduce the quirks.yaml file to fix these format issues in a data-driven manner, without modifying business logic:
quirks:
  - name: extract-from-codeblock
    phase: post_response
    transform: codeblock_unwrap
    reason: "Claude occasionally wraps JSON in ```json ... ``` fences"
The framework also reserves a universal Transform interface for developers to extend more custom formatting rules as required.
5. Request Validation & Framework Integration
The architecture enforces strict parameter validation before every request to block invalid inputs. The core validation logic is implemented below:
func (r ChatRequest) Validate() error {
	if r.Profile == "" { return ErrChatNoProfile }
	if r.Temperature < 0 || r.Temperature > 2 { return ErrChatTemperatureRange }
	// ErrChatMaxTokensRange is an assumed sentinel name, mirroring the two above.
	if r.MaxTokens < 0 || r.MaxTokens > 200_000 { return ErrChatMaxTokensRange }
	return nil
}
This architecture can be integrated with the Eino framework. Adapters convert message formats between the two systems: Eino takes charge of AI Agent workflow orchestration, while this four-tier system handles LLM scheduling. The two components operate independently, with a clear division of labor.
6. Conclusion
The core design philosophy of this four-tier architecture is decoupling + configuration-driven operation. It centralizes all differences in API protocols, model behaviors and output formats into underlying components. Developers can add new LLMs, replace service providers and configure primary-backup failover policies merely by editing YAML files, which truly realizes zero modification to business code.
Combined with TreeRouter for front-end traffic distribution, the system is intended to improve stability and scheduling efficiency when deployed on large-scale AI Agent clusters. For medium and large enterprise AI platforms and multi-agent systems, this architecture is a practical option that balances scalability, maintainability and operational stability.




