Central LLM Broker: A 2D Routing Matrix for Batch vs Sync Decisions
When your AI system makes LLM calls scattered across dozens of workflow nodes, batch optimization becomes a coordination nightmare. Each call site ends up with its own “if >= 5 requests, use batch” logic, creating inconsistent behavior and making it impossible for users to control speed/cost trade-offs globally.
This article presents a Central LLM Broker pattern that solves this through a 2D routing matrix combining user modes with call-site policies.
The Problem with Decentralized Batch Decisions
Consider a document processing workflow with 15 different LLM call sites:
- Chapter detection
- Section summarization
- Metadata extraction
- Relevance scoring
- Query translation
Each site originally made its own batching decision:
```python
# Scattered across the codebase
if len(documents) >= 5:
    results = await batch_processor.process(documents)
else:
    results = await asyncio.gather(*[llm.invoke(d) for d in documents])
```

This approach has several problems:
- Hardcoded thresholds: Why 5? Why not 3 or 10? Each site picks a number.
- No user control: Users cannot configure speed/cost preferences globally.
- Duplicate logic: Batch collection code repeated everywhere.
- Complex coordination: Manual semaphores and double-batching patterns emerge.
- No observability: Difficult to monitor or change batching behavior.
The Solution: A 2D Routing Matrix
The Central LLM Broker introduces two orthogonal concepts:
User Mode (global preference): Controls speed/cost trade-off for the entire workflow
- FAST: No batching, lowest latency
- BALANCED: Reasonable trade-off (default)
- ECONOMICAL: Aggressive batching, 50% cost savings
Call-site Policy (local intent): Declares what the code wants, not how to achieve it
- FORCE_BATCH: Always batch (bulk operations)
- PREFER_BALANCE: Batch when mode allows
- PREFER_SPEED: Batch only in economical mode
- REQUIRE_SYNC: Never batch (interactive features)
The routing matrix combines them:
| Policy ↓ / Mode → | FAST | BALANCED | ECONOMICAL |
|---|---|---|---|
| FORCE_BATCH | Batch | Batch | Batch |
| PREFER_BALANCE | Sync | Batch | Batch |
| PREFER_SPEED | Sync | Sync | Batch |
| REQUIRE_SYNC | Sync | Sync | Sync |
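
If you prefer data over branching, the same matrix can be written as a literal lookup table. The sketch below is one possible encoding; it uses the BatchPolicy and UserMode enums defined in the Implementation section that follows, and the article's implementation uses a cascade of checks instead, partly because the hard constraints (model, thinking budget) don't fit a pure table.

```python
# The matrix as data: one lookup keyed by (policy, mode); True means "batch".
# Assumes the BatchPolicy and UserMode enums defined below.
ROUTING_MATRIX: dict[tuple[BatchPolicy, UserMode], bool] = {
    (BatchPolicy.FORCE_BATCH, UserMode.FAST): True,
    (BatchPolicy.FORCE_BATCH, UserMode.BALANCED): True,
    (BatchPolicy.FORCE_BATCH, UserMode.ECONOMICAL): True,
    (BatchPolicy.PREFER_BALANCE, UserMode.FAST): False,
    (BatchPolicy.PREFER_BALANCE, UserMode.BALANCED): True,
    (BatchPolicy.PREFER_BALANCE, UserMode.ECONOMICAL): True,
    (BatchPolicy.PREFER_SPEED, UserMode.FAST): False,
    (BatchPolicy.PREFER_SPEED, UserMode.BALANCED): False,
    (BatchPolicy.PREFER_SPEED, UserMode.ECONOMICAL): True,
    (BatchPolicy.REQUIRE_SYNC, UserMode.FAST): False,
    (BatchPolicy.REQUIRE_SYNC, UserMode.BALANCED): False,
    (BatchPolicy.REQUIRE_SYNC, UserMode.ECONOMICAL): False,
}

# Example: PREFER_SPEED only batches when the user opts into ECONOMICAL.
assert ROUTING_MATRIX[(BatchPolicy.PREFER_SPEED, UserMode.ECONOMICAL)] is True
assert ROUTING_MATRIX[(BatchPolicy.PREFER_SPEED, UserMode.BALANCED)] is False
```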
Implementation
Define the Enums
```python
from enum import Enum


class BatchPolicy(Enum):
    """Call-site batch policy declaration."""
    FORCE_BATCH = "force_batch"        # Always batch.
    PREFER_BALANCE = "prefer_balance"  # Batch in Balanced/Economical.
    PREFER_SPEED = "prefer_speed"      # Batch only in Economical.
    REQUIRE_SYNC = "sync"              # Never batch.


class UserMode(Enum):
    """User-configurable processing mode."""
    FAST = "fast"                # No batching, lowest latency.
    BALANCED = "balanced"        # Default, reasonable trade-off.
    ECONOMICAL = "economical"    # Aggressive batching, 50% savings.
```

The Routing Decision
The core logic is a simple cascade of checks:
```python
def should_batch(
    policy: BatchPolicy,
    mode: UserMode,
    model: str,
    thinking_budget: int | None = None,
) -> bool:
    """2D routing matrix: policy + mode → sync or batch."""
    # Hard constraints (never batch).
    if "deepseek" in model.lower():  # DeepSeek has no batch API.
        return False
    if thinking_budget:  # Extended thinking incompatible.
        return False

    # Policy evaluation (absolute policies first).
    if policy == BatchPolicy.REQUIRE_SYNC:
        return False
    if policy == BatchPolicy.FORCE_BATCH:
        return True

    # Mode evaluation (preference policies).
    if mode == UserMode.FAST:
        return False
    if policy == BatchPolicy.PREFER_BALANCE:
        return mode in (UserMode.BALANCED, UserMode.ECONOMICAL)
    if policy == BatchPolicy.PREFER_SPEED:
        return mode == UserMode.ECONOMICAL
    return False
```

Broker Request API
The broker returns a future that resolves when the response is available, regardless of sync or batch execution:
```python
async def request(
    self,
    prompt: str,
    policy: BatchPolicy = BatchPolicy.PREFER_SPEED,
    model: str = "claude-sonnet-4-5-20250929",
    **kwargs,
) -> asyncio.Future:
    """Submit request with automatic sync/batch routing."""
    request_id = self._generate_id()
    future = asyncio.get_running_loop().create_future()
    self._pending_futures[request_id] = future

    if should_batch(policy, self._mode, model, kwargs.get("thinking_budget")):
        await self._queue_for_batch(request_id, prompt, model, **kwargs)
    else:
        self._spawn_sync_task(request_id, prompt, model, **kwargs)
    return future
```

Batch Group Context Manager
For explicit batch grouping:
```python
@asynccontextmanager
async def batch_group(self, mode: UserMode | None = None) -> AsyncIterator[BatchGroup]:
    """Group requests for batch submission on context exit."""
    group = BatchGroup(broker=self, mode=mode or self._mode)
    try:
        yield group
    finally:
        if group.request_ids:
            await self._flush_batch_group(group)
```

Usage Patterns
Pattern 1: Individual requests with different policies
```python
broker = LLMBroker(mode=UserMode.BALANCED)

async with broker:
    # Interactive query—always sync for responsiveness.
    future1 = await broker.request(
        prompt="What is 2+2?",
        policy=BatchPolicy.REQUIRE_SYNC,
    )
    # Bulk processing—let mode decide.
    future2 = await broker.request(
        prompt="Summarize this document...",
        policy=BatchPolicy.PREFER_BALANCE,
    )

    result1 = await future1  # Returns immediately.
    result2 = await future2  # May wait for batch completion.
```

Pattern 2: Explicit batch grouping
```python
async with broker.batch_group() as group:
    futures = []
    for doc in documents:
        f = await broker.request(
            prompt=f"Extract metadata: {doc}",
            policy=BatchPolicy.FORCE_BATCH,
        )
        futures.append(f)

# All requests submitted as single batch here.
results = await asyncio.gather(*futures)
```

Pattern 3: User-controlled mode
```python
import os

# Set via environment variable or config.
mode = UserMode(os.environ.get("LLM_MODE", "balanced"))
broker = LLMBroker(mode=mode)

# All call sites automatically respect the user's preference.
# No code changes needed when the user switches from BALANCED to ECONOMICAL.
```

Why This Approach Works
Separation of concerns
Call sites declare intent (PREFER_SPEED), not implementation (if count >= 5). The broker handles the complexity of batch collection, API submission, and response correlation.
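As one illustration of what the broker absorbs, here is a minimal sketch of the response-correlation step: batch results come back keyed by the same request ID that was stored in _pending_futures at submission time, so each result can be matched to its waiting future. The result field names (custom_id, error, text) are illustrative assumptions, not the exact shapes returned by the Batch API.

```python
# Hypothetical sketch: resolve pending futures when batch results arrive.
# Field names are illustrative; the mapping mirrors self._pending_futures
# as populated in request().
def _resolve_batch_results(self, results: list[dict]) -> None:
    for result in results:
        request_id = result["custom_id"]                 # correlate by ID
        future = self._pending_futures.pop(request_id, None)
        if future is None or future.done():
            continue                                     # cancelled or already resolved
        if result.get("error"):
            future.set_exception(RuntimeError(result["error"]))
        else:
            future.set_result(result["text"])
```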
Single configuration point
Users set one mode for the entire workflow. Switching from BALANCED to ECONOMICAL affects all eligible call sites without code changes.
Graceful degradation
The broker can fall back to sync execution on queue overflow, API errors, or when hard constraints apply (DeepSeek models, extended thinking).
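A minimal sketch of the overflow fallback, assuming the batch queue is a bounded asyncio.Queue filled with put_nowait so overflow surfaces as asyncio.QueueFull; the method names mirror those used in request():

```python
# Hypothetical sketch: degrade to sync instead of failing the request.
async def _queue_or_fallback(self, request_id: str, prompt: str, model: str, **kwargs) -> None:
    try:
        await self._queue_for_batch(request_id, prompt, model, **kwargs)
    except asyncio.QueueFull:
        # Queue overflow: run this one request synchronously instead.
        self._spawn_sync_task(request_id, prompt, model, **kwargs)
```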
Observability
All routing decisions flow through a single point, enabling centralized metrics, tracing, and debugging.
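A minimal sketch of what that single point makes easy, using an in-process Counter and standard logging as stand-ins for whatever metrics backend you already run (the wrapper name is illustrative):

```python
import logging
from collections import Counter

logger = logging.getLogger("llm_broker")
routing_counts: Counter[tuple[str, str, bool]] = Counter()

def route_with_metrics(policy: BatchPolicy, mode: UserMode, model: str,
                       thinking_budget: int | None = None) -> bool:
    # Every decision passes through here, so counting and tracing are one-liners.
    decision = should_batch(policy, mode, model, thinking_budget)
    routing_counts[(policy.value, mode.value, decision)] += 1
    logger.debug("route policy=%s mode=%s model=%s batch=%s",
                 policy.value, mode.value, model, decision)
    return decision
```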
Cost Impact
With Anthropic’s Batch API providing a 50% cost reduction on batched requests:
| Mode | Batch Usage | Typical Savings |
|---|---|---|
| FAST | 0% | 0% |
| BALANCED | 40-60% | 20-30% |
| ECONOMICAL | 80-95% | 40-48% |
For a workflow processing 1,000 documents with 15 LLM calls each, switching from FAST to ECONOMICAL can reduce API costs by nearly half.
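The savings column follows directly from the batch-usage column: batched calls cost half, so savings equal the batch share times 50%. A quick back-of-the-envelope check (the per-call cost is an illustrative placeholder):

```python
# Back-of-the-envelope: savings scale linearly with the share of batched calls.
calls = 1_000 * 15                 # 1,000 documents x 15 call sites
cost_per_call = 0.01               # illustrative placeholder, in dollars

def total_cost(batch_share: float) -> float:
    sync_cost = calls * (1 - batch_share) * cost_per_call
    batch_cost = calls * batch_share * cost_per_call * 0.5   # 50% batch discount
    return sync_cost + batch_cost

fast = total_cost(0.0)             # FAST: no batching
economical = total_cost(0.90)      # ECONOMICAL: ~90% of calls batched
print(f"savings: {1 - economical / fast:.0%}")   # -> savings: 45%
```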
Trade-offs
- Latency variance: Batched requests take minutes/hours vs. instant sync calls
- System complexity: Broker lifecycle, persistence, monitoring infrastructure
- Migration effort: Call sites must add a batch_policy parameter
- Feature flag recommended: Safe rollout with opt-in behavior (see the sketch below)
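
A minimal sketch of that feature flag, assuming an illustrative LLM_BROKER_BATCHING environment variable that forces every call site back to sync behavior until batching is explicitly enabled:

```python
import os

# Hypothetical sketch: gate the broker's batching behind an opt-in flag.
def effective_policy(requested: BatchPolicy) -> BatchPolicy:
    if os.environ.get("LLM_BROKER_BATCHING", "off") != "on":
        return BatchPolicy.REQUIRE_SYNC   # flag off: keep legacy sync behavior
    return requested
```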
Related Resources
- Complete implementation (GitHub Gist)
- Message Batches API (Anthropic)
Conclusion
The Central LLM Broker pattern transforms scattered batch decisions into a coherent routing system. By separating user preference (mode) from call-site intent (policy), it captures the Batch API's 50% discount wherever batching is acceptable while preserving the flexibility needed for latency-sensitive operations.
The 2D routing matrix is the key insight: neither user mode nor call-site policy alone determines execution strategy. Their intersection does.