Routing Strategies

The routing service (RoutingService) resolves an ordered chain of providers for each request. It loads the routing configuration, then filters out providers that are disabled, not in the allowed list, or missing the requested model. Unhealthy providers are demoted to a last-resort pool rather than removed. It then applies one of ten strategies to determine the execution order.

Strategy Overview

Request arrives
|
v
Load routing config for (tenant, capability)
|
v
Load provider configs from routes
|
v
Filter: remove disabled, not-in-allowed-list, missing-model
(unhealthy providers are demoted to last resort, not removed)
|
v
Apply strategy --> ordered candidate list
|
v
Append fallback chain
|
v
Return resolved provider chain

1. Priority

Algorithm: Sort candidates by the priority field (ascending). Lower number = higher preference.

Behavior: The first healthy provider in priority order handles the request. If it fails, the next in order is tried. This is deterministic: the same provider handles every request for as long as it stays healthy.

When to use: When you have a clear primary provider and want predictable failover. Good for compliance scenarios where a specific provider must be preferred.

Configuration: Set priority: 1 on your primary provider, priority: 2 on your secondary, and so on.
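
Example (sketch): The ordering step can be illustrated in a few lines of TypeScript. The Candidate shape and the orderByPriority name are hypothetical; only the priority field comes from the description above.

```typescript
// Hypothetical candidate shape; only `priority` is taken from the text above.
interface Candidate {
  id: string;
  priority: number;
}

// Return a new array ordered by ascending priority (lower number = preferred).
// Array.prototype.sort is stable, so equal priorities keep their input order.
function orderByPriority(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort((a, b) => a.priority - b.priority);
}

const ordered = orderByPriority([
  { id: "secondary", priority: 2 },
  { id: "primary", priority: 1 },
  { id: "tertiary", priority: 3 },
]);
console.log(ordered.map((c) => c.id).join(",")); // primary,secondary,tertiary
```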

2. Round-Robin

Algorithm: Uses a Redis counter (INCR) scoped to (tenant, capability). The counter modulo the provider count determines the starting index. Candidates are first sorted by priority, then rotated from the starting index.

Behavior: Each request shifts to the next provider in order. Distribution is even over time. Falls back to priority strategy if Redis is unavailable.

When to use: When you want to spread load evenly across providers with similar capabilities and pricing. Useful for maximizing aggregate rate limits.

Details: The Redis counter key has a 1-hour TTL, so the counter resets automatically when the key expires.
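
Example (sketch): The rotation can be shown without a Redis connection. Here incr is an in-memory stand-in for Redis INCR (which returns the value after incrementing), and the key format and function names are assumptions made for illustration. Whether the real service offsets the counter before taking the modulo is not stated above, so this sketch uses the raw value.

```typescript
interface RrCandidate {
  id: string;
  priority: number;
}

// In-memory stand-in for Redis INCR: returns the value after incrementing.
const counters = new Map<string, number>();
function incr(key: string): number {
  const next = (counters.get(key) ?? 0) + 1;
  counters.set(key, next);
  return next;
}

// Sort by priority, then rotate so the list starts at counter % length.
function rotateByCounter(candidates: RrCandidate[], counter: number): RrCandidate[] {
  if (candidates.length === 0) return [];
  const sorted = [...candidates].sort((a, b) => a.priority - b.priority);
  const start = counter % sorted.length;
  return [...sorted.slice(start), ...sorted.slice(0, start)];
}

const rrPool: RrCandidate[] = [
  { id: "a", priority: 1 },
  { id: "b", priority: 2 },
  { id: "c", priority: 3 },
];
for (let i = 0; i < 3; i++) {
  const n = incr("rr:tenant-1:chat"); // hypothetical key format
  console.log(rotateByCounter(rrPool, n).map((c) => c.id).join(","));
}
// Prints: b,c,a then c,a,b then a,b,c
```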

3. Weighted

Algorithm: Weighted shuffle in the style of Fisher-Yates (weighted random sampling without replacement). Each candidate has a weight value. On each iteration, a random number is drawn in the range from 0 up to the total remaining weight; the candidate whose cumulative weight range contains that number is placed next in the ordered list and removed from the pool.

Behavior: Higher-weight providers are more likely to be selected first, and each provider's chance of being tried first is proportional to its weight. The result is probabilistic: distribution converges to the weight ratios over many requests.

When to use: When you want proportional traffic splitting. For example, send 70% of traffic to a primary provider and 30% to a secondary for gradual migration or A/B testing.

Configuration: Set weight: 70 on provider A and weight: 30 on provider B.
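
Example (sketch): A minimal version of the weighted ordering described above, assuming non-negative weights. The WeightedCandidate shape and the weightedOrder name are hypothetical, and the injectable rng parameter exists only to make the sketch testable.

```typescript
interface WeightedCandidate {
  id: string;
  weight: number; // assumed non-negative
}

// Weighted sampling without replacement: repeatedly pick one candidate with
// probability weight / (total remaining weight) and move it to the output.
function weightedOrder(
  candidates: WeightedCandidate[],
  rng: () => number = Math.random,
): WeightedCandidate[] {
  const pool = [...candidates];
  const ordered: WeightedCandidate[] = [];
  while (pool.length > 0) {
    const total = pool.reduce((sum, c) => sum + c.weight, 0);
    let r = rng() * total;
    let idx = pool.length - 1; // fallback for rounding or all-zero weights
    for (let i = 0; i < pool.length; i++) {
      r -= pool[i].weight;
      if (r < 0) {
        idx = i;
        break;
      }
    }
    ordered.push(pool.splice(idx, 1)[0]);
  }
  return ordered;
}

const split = [
  { id: "A", weight: 70 },
  { id: "B", weight: 30 },
];
console.log(weightedOrder(split).map((c) => c.id).join(",")); // "A,B" about 70% of the time
```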

4. Least-Cost

Algorithm: Batch-loads ModelConfig documents to retrieve pricing data (input + output cost per million tokens). Sorts candidates by total cost (ascending). Cheapest provider is tried first.

Behavior: Always prefers the cheapest available provider for the requested model. Providers without pricing data are sorted last (infinite cost).

When to use: When minimizing cost is the primary concern and you are willing to accept any provider that supports the model. Ideal for high-volume, cost-sensitive workloads.

Details: Cost is computed as inputPerMillionTokens + outputPerMillionTokens from the model’s pricing configuration. Ensure pricing is configured on your model entries for this strategy to work correctly.
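
Example (sketch): The cost ordering, including the "missing pricing sorts last" rule, might look like the following. The field names inputPerMillionTokens and outputPerMillionTokens come from the text above; the CostCandidate shape and function names are assumptions, and the batch load of ModelConfig documents is omitted.

```typescript
interface Pricing {
  inputPerMillionTokens: number;
  outputPerMillionTokens: number;
}

// Hypothetical candidate shape; pricing is optional because it may be unset.
interface CostCandidate {
  id: string;
  pricing?: Pricing;
}

// Total cost per million tokens; missing pricing is treated as infinite cost.
function totalCost(c: CostCandidate): number {
  return c.pricing
    ? c.pricing.inputPerMillionTokens + c.pricing.outputPerMillionTokens
    : Infinity;
}

// Cheapest first. An explicit comparison avoids Infinity - Infinity (NaN).
function orderByCost(candidates: CostCandidate[]): CostCandidate[] {
  return [...candidates].sort((a, b) => {
    const ca = totalCost(a);
    const cb = totalCost(b);
    return ca === cb ? 0 : ca < cb ? -1 : 1;
  });
}

const byCost = orderByCost([
  { id: "unpriced" },
  { id: "pricey", pricing: { inputPerMillionTokens: 3, outputPerMillionTokens: 15 } },
  { id: "cheap", pricing: { inputPerMillionTokens: 0.5, outputPerMillionTokens: 1.5 } },
]);
console.log(byCost.map((c) => c.id).join(",")); // cheap,pricey,unpriced
```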

5. Least-Latency

Algorithm: Sorts candidates by providerConfig.health.avgLatencyMs (ascending). Fastest provider is tried first. Providers without latency data are sorted last (infinite latency).

Behavior: Always prefers the provider with the lowest measured latency. Latency values come from health check measurements and are updated periodically.

When to use: When response speed is critical. Good for real-time applications, chatbots, and interactive use cases.

Details: The values reflect provider-side latency, not end-to-end latency including network transit.

6. Free-Tier-First

Algorithm: Separates candidates into two groups: providers with an active free tier and providers without. For each free-tier provider, checks whether the daily limits (requests and tokens) have been exhausted via the FreeTierTrackingService. Exhausted free-tier providers are demoted to the paid group. Each group is sorted by priority.

Behavior: Free-tier providers are always tried before paid providers, until their daily quotas are exhausted. This maximizes use of free allowances before incurring costs.

When to use: When you have providers offering free tiers (e.g., Gemini, Groq) and want to exhaust free capacity before paying. Particularly useful for development, testing, or cost-conscious workloads.

Details: Free-tier limits are tracked per (tenant, providerConfig, model) using the FreeTierTrackingService. Limits can be set at the provider level (freeTier.requestsPerDay, freeTier.tokensPerDay) or overridden per model (freeTierLimits).
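
Example (sketch): The grouping logic can be expressed as a partition followed by two priority sorts. The isExhausted callback is a hypothetical stand-in for the FreeTierTrackingService lookup, and the FtCandidate shape is assumed for illustration.

```typescript
interface FtCandidate {
  id: string;
  priority: number;
  hasFreeTier: boolean;
}

// Free-tier providers with remaining quota go first; exhausted free-tier
// providers are demoted into the paid group. Each group is sorted by priority.
function orderFreeTierFirst(
  candidates: FtCandidate[],
  isExhausted: (c: FtCandidate) => boolean, // stand-in for the tracking service
): FtCandidate[] {
  const free: FtCandidate[] = [];
  const paid: FtCandidate[] = [];
  for (const c of candidates) {
    if (c.hasFreeTier && !isExhausted(c)) {
      free.push(c);
    } else {
      paid.push(c);
    }
  }
  const byPriority = (a: FtCandidate, b: FtCandidate) => a.priority - b.priority;
  return [...free.sort(byPriority), ...paid.sort(byPriority)];
}

const ftOrdered = orderFreeTierFirst(
  [
    { id: "paid-a", priority: 1, hasFreeTier: false },
    { id: "free-b", priority: 2, hasFreeTier: true },
    { id: "free-c", priority: 3, hasFreeTier: true },
  ],
  (c) => c.id === "free-c", // pretend free-c has used up its daily quota
);
console.log(ftOrdered.map((c) => c.id).join(",")); // free-b,paid-a,free-c
```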

7. Task-Optimized

Algorithm: Delegates to the ModelIntelligenceService.applyTaskOptimizedRouting() method, which analyzes the request prompt and optional max-tokens to determine the task type (coding, creative writing, analysis, etc.) and selects the best model for that task.

Behavior: Routes requests to the provider/model combination that is best suited for the detected task type based on model intelligence data. This is the most sophisticated strategy and requires the model intelligence system to be populated with benchmark data.

When to use: When you have multiple providers with different model strengths and want the gateway to automatically select the best model for each request type.

Inputs: The prompt (last user message) and maxTokens from the request are passed to the intelligence service for task classification.

8. Cost-Optimized

Algorithm: An alias of Least-Cost. The strategy resolves to the exact same code path (strategyLeastCost) — candidates are sorted by inputPerMillionTokens + outputPerMillionTokens (ascending), cheapest first. Providers without pricing data are sorted last (infinite cost).

Behavior: Identical to Least-Cost. The separate name exists for configuration clarity and forward compatibility; it does not currently apply any additional request-size estimation or volume-discount logic.

When to use: Interchangeable with Least-Cost. Use whichever name reads better in your routing config.

Details: Ensure pricing is configured on your model entries for this strategy to order providers correctly.

9. Failover

Algorithm: Partitions candidates into two groups by health status — those that are healthy, and everything else (degraded, unhealthy, or unknown). Each group is sorted by priority (ascending), then the healthy group is concatenated ahead of the non-healthy group.

Behavior: Healthy providers are always tried first, in priority order. Any provider that is not currently healthy is pushed behind all healthy providers — but it is not removed, so it can still serve the request as a last resort if every healthy provider fails. Once health checks report the provider healthy again, it resumes its priority position.

When to use: When you want priority-based routing with automatic, temporary demotion of providers that are experiencing issues, while still keeping them available as a fallback.

Details: The split is determined by providerConfig.health.status === 'healthy'. Non-healthy providers (including fully unhealthy ones) are demoted, not filtered out. The demotion is transient and reverses when health checks recover.
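
Example (sketch): The partition-and-concatenate step could be written as below. The health.status check mirrors the condition quoted above; the FoCandidate shape, the set of status values, and the function name are assumptions.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy" | "unknown";

interface FoCandidate {
  id: string;
  priority: number;
  health?: { status: HealthStatus }; // missing health data counts as not healthy
}

// Healthy providers first, everything else after; both groups by priority.
function orderFailover(candidates: FoCandidate[]): FoCandidate[] {
  const byPriority = (a: FoCandidate, b: FoCandidate) => a.priority - b.priority;
  const healthy = candidates.filter((c) => c.health?.status === "healthy");
  const rest = candidates.filter((c) => c.health?.status !== "healthy");
  return [...healthy.sort(byPriority), ...rest.sort(byPriority)];
}

const foOrdered = orderFailover([
  { id: "a", priority: 1, health: { status: "unhealthy" } },
  { id: "b", priority: 2, health: { status: "healthy" } },
  { id: "c", priority: 3 },
  { id: "d", priority: 4, health: { status: "healthy" } },
]);
console.log(foOrdered.map((c) => c.id).join(",")); // b,d,a,c
```

Note that provider "a" keeps its place in the list even though it is unhealthy; it is only moved behind the healthy group.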

10. Random

Algorithm: Fisher-Yates shuffle with uniform random selection. All candidates receive equal probability regardless of priority or weight.

Behavior: Each request is routed to a randomly selected provider from the available pool. Over a large number of requests, traffic distributes uniformly across all candidates.

When to use: When you want simple, unbiased load distribution without any preference for specific providers. Useful for testing, benchmarking, or scenarios where all providers are equivalent and no ordering preference exists.

Details: Uses Math.random() for selection. The full candidate list is shuffled, so if the first randomly selected provider fails, the next in the shuffled order is tried.
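
Example (sketch): A standard Fisher-Yates shuffle matching the description above. The function name and the injectable rng parameter are illustrative; the default uses Math.random() as stated.

```typescript
// Uniform Fisher-Yates shuffle. Returns a new array; the input is not mutated.
function shuffle<T>(items: T[], rng: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    // Pick j uniformly from 0..i and swap it into position i.
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// The whole list is shuffled, so the result doubles as the retry order.
console.log(shuffle(["provider-a", "provider-b", "provider-c"]));
```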

Fallback Chain

After the strategy orders the main candidates, the routing service appends the fallback chain — a separately configured list of last-resort providers. The fallback chain is useful for providers that should only be used when all primary routes fail.

Additionally, a local fallback can be configured (e.g., an Ollama or vLLM instance) for complete offline resilience.

The routing service detects circular references in the fallback chain and breaks the cycle with a warning log.
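
Example (sketch): One way to break a cycle is to track visited entries while walking the chain. How fallback entries reference each other is not specified above, so this sketch assumes a hypothetical "next fallback" mapping from one provider id to another; the function name and log message are likewise illustrative.

```typescript
// Walk a chain of fallback references, stopping with a warning on a repeat.
function resolveFallbackChain(
  start: string | undefined,
  next: Map<string, string>, // hypothetical: provider id -> its fallback id
  warn: (msg: string) => void = console.warn,
): string[] {
  const chain: string[] = [];
  const seen = new Set<string>();
  let current = start;
  while (current !== undefined) {
    if (seen.has(current)) {
      warn(`Circular fallback reference at "${current}"; stopping`);
      break;
    }
    seen.add(current);
    chain.push(current);
    current = next.get(current);
  }
  return chain;
}

// a -> b -> c -> a is a cycle; the walk stops after visiting each id once.
const fallbackLinks = new Map([
  ["a", "b"],
  ["b", "c"],
  ["c", "a"],
]);
const strategyOrdered = ["primary-1", "primary-2"];
const resolved = [...strategyOrdered, ...resolveFallbackChain("a", fallbackLinks)];
console.log(resolved.join(",")); // primary-1,primary-2,a,b,c
```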

Caching

Resolved provider chains are cached in an in-memory LRU cache (max 1000 entries, 60-second TTL with 10% jitter). The cache key is (tenantId, capability, model). The cache is invalidated when routing configurations change.
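
Example (sketch): A small LRU cache with a jittered TTL can be built on a Map, which preserves insertion order. This is an illustration of the idea, not the service's actual cache. The class name, the key format, and the interpretation of "10% jitter" as plus or minus 10% of the TTL are all assumptions; the clock and rng parameters are injectable only for testing.

```typescript
class TtlLruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private maxEntries = 1000,
    private ttlMs = 60_000,
    private jitter = 0.1, // assumed to mean +/- 10% of the TTL
    private now: () => number = Date.now,
    private rng: () => number = Math.random,
  ) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // The first key in insertion order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    const spread = this.ttlMs * this.jitter * (this.rng() * 2 - 1);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs + spread });
  }

  // Call when routing configuration changes.
  clear(): void {
    this.entries.clear();
  }
}

// Hypothetical key format built from (tenantId, capability, model).
const chainCacheKey = (tenantId: string, capability: string, model: string) =>
  `${tenantId}:${capability}:${model}`;

const chainCache = new TtlLruCache<string[]>();
chainCache.set(chainCacheKey("tenant-1", "chat", "model-x"), ["provider-a", "provider-b"]);
console.log(chainCache.get(chainCacheKey("tenant-1", "chat", "model-x")));
```

The jitter spreads expirations out so that entries cached at the same moment do not all expire together.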

Next Steps