Config-Driven LLM Routing & Failover Solution

Teams building enterprise AI Agent platforms run into the same problem sooner or later. OpenAI, Anthropic, DeepSeek, Tongyi Qwen and Ollama each use their own API specification, request format and streaming response rules. If business code distinguishes between LLM types with if-else statements, the codebase bloats with every model that is integrated, and the cost of iteration, operation, maintenance and model switching rises with it.

This article introduces a four-tier layered architecture. Because it is configuration-driven, developers can add or replace LLMs without modifying business code. It also addresses practical engineering issues such as inconsistent output formats across models and service failover, giving AI Agent projects a scalable scheduling layer.

1. Critical Drawbacks of Traditional Development Mode

In the traditional approach, the business layer connects directly to each LLM service API. Every new model requires dedicated logic for request encapsulation, streaming parsing and token accounting, and the code fills up with conditional branches on the provider name, as in the following Go snippet:

if provider == "openai" {
    // Call OpenAI interface
} else if provider == "anthropic" {
    // Call Anthropic interface
} else if provider == "deepseek" {
    // Call DeepSeek interface
}

This hardcoded pattern has three major flaws: slow iteration, tight coupling and poor failover. When an LLM service goes offline or malfunctions, the system cannot quickly switch to a backup, which directly threatens the stability of online AI Agent services.

To tackle these pain points, layered scheduling architectures have gradually become mainstream in the industry. Many teams adopt TreeRouter as a unified access gateway for multi-model APIs, to centrally manage LLM calls, optimize traffic scheduling and improve service access efficiency.

2. Core Four-Tier Architecture: Profile + Router + Workflow + Adapter

The whole system is divided into four independent layers from top to bottom. The business code only depends on unified abstract interfaces and is completely isolated from underlying LLM providers. The standard call chain is as follows:

Business Code → Request with Profile Identifier → Router Distribution → Workflow Protocol Adaptation → Adapter Low-Level Invocation → LLM Provider API

2.1 Unified Provider Interface

The Provider interface is the core abstraction of the architecture. Every LLM integration implements it, and it covers both standard chat and streaming chat. The Go definition is:

type Provider interface {
    Name() string
    Chat(ctx context.Context, req ChatRequest) (*ChatResponse, error)
    Stream(ctx context.Context, req ChatRequest) (<-chan StreamChunk, error)
}

The business layer only invokes methods from the Provider interface, achieving thorough decoupling from underlying LLM services.

2.2 Router Layer

The Router selects an LLM based on the Profile parameter carried in the request. It also supports passing raw request payloads through unchanged, for special scenarios such as direct SDK calls. Its core interface is:

type Router interface {
    Pick(ctx context.Context, profile string) (Provider, error)
    RawCall(ctx context.Context, profile string, body []byte, stream bool) (*RawResponse, error)
}

2.3 Workflow Adaptation Layer

To bridge the gaps between vendors' API standards, we define five workflow kinds, which cover the vast majority of mainstream models:

// Kind identifies the workflow (wire protocol) a profile uses.
type Kind string

const (
    AnthropicCompat    Kind = "anthropic-compat"
    OpenAICompat       Kind = "openai-compat"
    ClaudeSubscription Kind = "claude-subscription"
    CodexSubscription  Kind = "codex-subscription"
    GitHubCopilot      Kind = "github-copilot"
)

Among them, openai-compat serves as a universal adapter. Chinese and open-source models such as DeepSeek, Tongyi Qwen and Zhipu GLM, as well as models served locally through Ollama, can reuse it directly, with no separate adaptation logic.
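
The reason one workflow can serve many vendors is that they accept the same request shape. As a rough illustration, the sketch below builds a request body with the model, messages and stream fields used by OpenAI-style chat completion endpoints. BuildOpenAICompatBody, Message and openAIChatBody are hypothetical names, and real requests usually carry more fields than shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message is one chat turn in the OpenAI-style request shape.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// openAIChatBody holds only the fields this sketch needs.
type openAIChatBody struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`
	Stream   bool      `json:"stream"`
}

// BuildOpenAICompatBody serializes a request for any openai-compat profile.
// Only the model name changes between vendors; the structure stays the same.
func BuildOpenAICompatBody(model string, msgs []Message, stream bool) ([]byte, error) {
	return json.Marshal(openAIChatBody{Model: model, Messages: msgs, Stream: stream})
}

func main() {
	msgs := []Message{{Role: "user", Content: "ping"}}
	body, err := BuildOpenAICompatBody("deepseek-chat", msgs, false)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(body))
}
```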

2.4 Adapter Layer

The Adapter issues the underlying HTTP requests. It deals only with network communication, which keeps its logic small.

3. Configuration-Driven Management: Zero-Code Model Switching & Primary-Backup Failover

All LLM definitions and routing policies live in the configuration file configs/llm/profiles.yaml. Developers add or switch models by editing this file rather than rewriting code. A sample configuration:

profiles:
  - name: deepseek-v3
    workflow: openai-compat
    base_url: https://api.deepseek.com
    model: deepseek-chat
    auth: ${DEEPSEEK_API_KEY}
  - name: claude-prod
    workflow: anthropic-compat
    base_url: https://api.anthropic.com
    model: claude-sonnet-4-6
    auth: ${ANTHROPIC_API_KEY}
routing:
  default:
    primary_profile: deepseek-v3
    fallback_profiles: [kimi-via-anthropic, glm-4]
    strategy: sticky

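The routing section names a primary profile and an ordered list of fallback profiles. The fallbacks kimi-via-anthropic and glm-4 are presumably defined elsewhere in the profiles list; their entries are omitted from this sample. With the sticky strategy, a session keeps using the same LLM so that its context stays consistent. When the primary is unavailable, requests fail over to the fallbacks automatically.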

4. Quirks Layer: Unify Irregular Output Formats

Different LLMs produce different kinds of non-standard output. For instance, DeepSeek may mix redundant URLs into responses, Claude often wraps JSON in code blocks, and lightweight models tend to leave trailing commas in JSON output.

We introduce the quirks.yaml file to fix these format issues in a data-driven manner, without modifying business logic:

quirks:
  - name: extract-from-codeblock
    phase: post_response
    transform: codeblock_unwrap
    reason: Claude occasionally wraps JSON in ```json ... ``` fences

The framework also reserves a universal Transform interface for developers to extend more custom formatting rules as required.
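
The article names a Transform interface without showing it, so the shape below is an assumption: a named text-to-text function that the framework could look up from the transform field in quirks.yaml. The codeblock_unwrap logic is a minimal sketch that extracts the contents of the first fenced block and returns the input unchanged when no complete fence is found.

```go
package main

import (
	"fmt"
	"strings"
)

// Transform is a hypothetical shape for the extension point.
type Transform interface {
	Name() string
	Apply(text string) string
}

// fence is the three-backtick code fence marker.
var fence = strings.Repeat("`", 3)

// codeblockUnwrap extracts the contents of the first fenced code block.
type codeblockUnwrap struct{}

func (codeblockUnwrap) Name() string { return "codeblock_unwrap" }

func (codeblockUnwrap) Apply(text string) string {
	start := strings.Index(text, fence)
	if start < 0 {
		return text // no fence: pass through unchanged
	}
	rest := text[start+len(fence):]
	// Skip the info string (for example "json") on the opening line.
	nl := strings.IndexByte(rest, '\n')
	if nl < 0 {
		return text
	}
	rest = rest[nl+1:]
	end := strings.Index(rest, fence)
	if end < 0 {
		return text // unterminated fence: leave the text alone
	}
	return strings.TrimSpace(rest[:end])
}

// ApplyAll runs transforms in order, as a post_response phase might.
func ApplyAll(text string, ts ...Transform) string {
	for _, t := range ts {
		text = t.Apply(text)
	}
	return text
}

func main() {
	raw := "Here is the result:\n" + fence + "json\n{\"ok\": true}\n" + fence + "\nLet me know if you need more."
	fmt.Println(ApplyAll(raw, codeblockUnwrap{}))
}
```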

5. Request Validation & Framework Integration

The architecture enforces strict parameter validation before every request to block invalid inputs. The core validation logic is implemented below:

func (r ChatRequest) Validate() error {
    if r.Profile == "" { return ErrChatNoProfile }
    if r.Temperature < 0 || r.Temperature > 2 { return ErrChatTemperatureRange }
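    // ErrChatMaxTokensRange is an assumed sentinel, named after the two errors above.
    if r.MaxTokens < 0 || r.MaxTokens > 200_000 { return ErrChatMaxTokensRange }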
    return nil
}

This architecture can be integrated with the Eino framework. Adapters convert message formats between the two systems: Eino handles AI Agent workflow orchestration, while the four-tier system handles LLM scheduling. The two components operate independently, with a clear division of labor.

6. Conclusion

The core design philosophy of this four-tier architecture is decoupling plus configuration-driven operation. It concentrates all differences in API protocols, model behaviors and output formats in the underlying components. Developers can add new LLMs, replace service providers and configure primary-backup failover policies by editing YAML files, without touching business code.

Combined with TreeRouter for front-end traffic distribution, the system is intended to deliver higher stability and scheduling efficiency on large-scale AI Agent clusters. For medium and large enterprise AI platforms and multi-agent systems, the architecture offers a practical balance of scalability, maintainability and operational stability.
