Central LLM Broker: A 2D Routing Matrix for Batch vs Sync Decisions

When your AI system's LLM calls are scattered across dozens of workflow nodes, batch optimization becomes a coordination nightmare. Each call site ends up with its own “if >= 5 requests, use batch” logic, creating inconsistent behavior and leaving users with no way to control speed/cost trade-offs globally.

This article presents a Central LLM Broker pattern that solves this through a 2D routing matrix combining user modes with call-site policies.

The Problem with Decentralized Batch Decisions

Consider a document processing workflow with 15 different LLM call sites, including:

  • Chapter detection
  • Section summarization
  • Metadata extraction
  • Relevance scoring
  • Query translation

Each site originally made its own batching decision:

# Scattered across the codebase
if len(documents) >= 5:
    results = await batch_processor.process(documents)
else:
    results = await asyncio.gather(*[llm.invoke(d) for d in documents])

This approach has several problems:

  1. Hardcoded thresholds: Why 5? Why not 3 or 10? Each site picks a number.
  2. No user control: Users cannot configure speed/cost preferences globally.
  3. Duplicate logic: Batch collection code repeated everywhere.
  4. Complex coordination: Manual semaphores and double-batching patterns emerge.
  5. No observability: Difficult to monitor or change batching behavior.

The Solution: A 2D Routing Matrix

The Central LLM Broker introduces two orthogonal concepts:

User Mode (global preference): Controls speed/cost trade-off for the entire workflow

  • FAST: No batching, lowest latency
  • BALANCED: Reasonable trade-off (default)
  • ECONOMICAL: Aggressive batching, up to 50% cost savings

Call-site Policy (local intent): Declares what the code wants, not how to achieve it

  • FORCE_BATCH: Always batch (bulk operations)
  • PREFER_BALANCE: Batch in BALANCED or ECONOMICAL mode
  • PREFER_SPEED: Batch only in ECONOMICAL mode
  • REQUIRE_SYNC: Never batch (interactive features)

The routing matrix combines them:

Policy ↓ / Mode →   FAST    BALANCED    ECONOMICAL
FORCE_BATCH         Batch   Batch       Batch
PREFER_BALANCE      Sync    Batch       Batch
PREFER_SPEED        Sync    Sync        Batch
REQUIRE_SYNC        Sync    Sync        Sync

Implementation

Define the Enums

from enum import Enum
 
class BatchPolicy(Enum):
    """Call-site batch policy declaration."""
    FORCE_BATCH = "force_batch"       # Always batch.
    PREFER_BALANCE = "prefer_balance" # Batch in Balanced/Economical.
    PREFER_SPEED = "prefer_speed"     # Batch only in Economical.
    REQUIRE_SYNC = "sync"             # Never batch.
 
class UserMode(Enum):
    """User-configurable processing mode."""
    FAST = "fast"            # No batching, lowest latency.
    BALANCED = "balanced"    # Default, reasonable trade-off.
    ECONOMICAL = "economical"  # Aggressive batching, 50% savings.

The Routing Decision

The core logic is a simple cascade of checks:

def should_batch(
    policy: BatchPolicy,
    mode: UserMode,
    model: str,
    thinking_budget: int | None = None,
) -> bool:
    """2D routing matrix: policy + mode → sync or batch."""
 
    # Hard constraints (never batch)
    if "deepseek" in model.lower():  # DeepSeek has no batch API.
        return False
    if thinking_budget:  # Extended thinking incompatible.
        return False
 
    # Policy evaluation (absolute policies first).
    if policy == BatchPolicy.REQUIRE_SYNC:
        return False
    if policy == BatchPolicy.FORCE_BATCH:
        return True
 
    # Mode evaluation (preference policies).
    if mode == UserMode.FAST:
        return False
    if policy == BatchPolicy.PREFER_BALANCE:
        return mode in (UserMode.BALANCED, UserMode.ECONOMICAL)
    if policy == BatchPolicy.PREFER_SPEED:
        return mode == UserMode.ECONOMICAL
 
    return False
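
A few spot checks, written here to confirm the matrix above (the model strings merely stand in for any non-DeepSeek model):

# Preference policies defer to the mode; absolute policies and hard constraints do not.
assert should_batch(BatchPolicy.FORCE_BATCH, UserMode.FAST, "claude-sonnet-4-5-20250929")
assert not should_batch(BatchPolicy.PREFER_BALANCE, UserMode.FAST, "claude-sonnet-4-5-20250929")
assert not should_batch(BatchPolicy.PREFER_SPEED, UserMode.BALANCED, "claude-sonnet-4-5-20250929")
assert should_batch(BatchPolicy.PREFER_SPEED, UserMode.ECONOMICAL, "claude-sonnet-4-5-20250929")
assert not should_batch(BatchPolicy.REQUIRE_SYNC, UserMode.ECONOMICAL, "claude-sonnet-4-5-20250929")

# Hard constraints win even over FORCE_BATCH.
assert not should_batch(BatchPolicy.FORCE_BATCH, UserMode.ECONOMICAL, "deepseek-chat")
assert not should_batch(BatchPolicy.FORCE_BATCH, UserMode.ECONOMICAL, "claude-sonnet-4-5-20250929", thinking_budget=8192)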

Broker Request API

The broker returns a future that resolves when the response is available, regardless of sync or batch execution:

async def request(
    self,
    prompt: str,
    policy: BatchPolicy = BatchPolicy.PREFER_SPEED,
    model: str = "claude-sonnet-4-5-20250929",
    **kwargs,
) -> asyncio.Future:
    """Submit request with automatic sync/batch routing."""
 
    request_id = self._generate_id()
    future = asyncio.get_running_loop().create_future()
    self._pending_futures[request_id] = future
 
    if should_batch(policy, self._mode, model, kwargs.get("thinking_budget")):
        await self._queue_for_batch(request_id, prompt, model, **kwargs)
    else:
        self._spawn_sync_task(request_id, prompt, model, **kwargs)
 
    return future
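
How the pending future actually gets resolved isn't shown above. A minimal sketch of _spawn_sync_task, assuming a self._llm_call helper that wraps the provider SDK and a self._background_tasks set on the broker (both assumptions, not part of the original code), would look like this:

def _spawn_sync_task(self, request_id: str, prompt: str, model: str, **kwargs) -> None:
    """Run the request immediately and resolve its pending future."""

    async def _run() -> None:
        future = self._pending_futures.pop(request_id)
        try:
            # self._llm_call is an assumed helper wrapping the provider SDK call.
            result = await self._llm_call(prompt=prompt, model=model, **kwargs)
            future.set_result(result)
        except Exception as exc:
            future.set_exception(exc)

    # self._background_tasks (a set) keeps strong references so tasks aren't GC'd mid-flight.
    task = asyncio.create_task(_run())
    self._background_tasks.add(task)
    task.add_done_callback(self._background_tasks.discard)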

Batch Group Context Manager

For explicit batch grouping:

@asynccontextmanager
async def batch_group(self, mode: UserMode | None = None) -> AsyncIterator[BatchGroup]:
    """Group requests for batch submission on context exit."""
    group = BatchGroup(broker=self, mode=mode or self._mode)
 
    try:
        yield group
    finally:
        if group.request_ids:
            await self._flush_batch_group(group)
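
BatchGroup itself isn't shown here; a minimal sketch, assuming the broker appends each batch-queued request id to the currently open group so that _flush_batch_group can submit them as a single Batch API job:

from dataclasses import dataclass, field

@dataclass
class BatchGroup:
    """Collects the request ids queued while the group is open."""
    broker: "LLMBroker"
    mode: UserMode
    request_ids: list[str] = field(default_factory=list)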

Usage Patterns

Pattern 1: Individual requests with different policies

broker = LLMBroker(mode=UserMode.BALANCED)
 
async with broker:
    # Interactive query—always sync for responsiveness.
    future1 = await broker.request(
        prompt="What is 2+2?",
        policy=BatchPolicy.REQUIRE_SYNC,
    )
 
    # Bulk processing—let mode decide.
    future2 = await broker.request(
        prompt="Summarize this document...",
        policy=BatchPolicy.PREFER_BALANCE,
    )
 
    result1 = await future1  # Resolves as soon as the sync call completes.
    result2 = await future2  # May wait for batch completion.

Pattern 2: Explicit batch grouping

async with broker.batch_group() as group:
    futures = []
    for doc in documents:
        f = await broker.request(
            prompt=f"Extract metadata: {doc}",
            policy=BatchPolicy.FORCE_BATCH,
        )
        futures.append(f)
# All requests submitted as single batch here.
 
results = await asyncio.gather(*futures)

Pattern 3: User-controlled mode

import os

# Set via environment variable or config.
mode = UserMode(os.environ.get("LLM_MODE", "balanced"))
broker = LLMBroker(mode=mode)
 
# All call sites automatically respect the user's preference.
# No code changes needed when user switches from BALANCED to ECONOMICAL.

Why This Approach Works

Separation of concerns

Call sites declare intent (PREFER_SPEED), not implementation (if count >= 5). The broker handles the complexity of batch collection, API submission, and response correlation.

Single configuration point

Users set one mode for the entire workflow. Switching from BALANCED to ECONOMICAL affects all eligible call sites without code changes.

Graceful degradation

The broker can fall back to sync execution on queue overflow, API errors, or when hard constraints apply (DeepSeek models, extended thinking).
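
As a hedged sketch, the batch branch of request() shown earlier could degrade like this (QueueFullError is a hypothetical broker-level exception, not a real library type):

if should_batch(policy, self._mode, model, kwargs.get("thinking_budget")):
    try:
        await self._queue_for_batch(request_id, prompt, model, **kwargs)
    except QueueFullError:  # hypothetical exception raised on queue overflow
        # Degrade to a sync call instead of failing the request.
        self._spawn_sync_task(request_id, prompt, model, **kwargs)
else:
    self._spawn_sync_task(request_id, prompt, model, **kwargs)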

Observability

All routing decisions flow through a single point, enabling centralized metrics, tracing, and debugging.
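
For example, a thin wrapper around should_batch (hypothetical, not part of the original broker) can emit one structured log line per routing decision:

import logging

logger = logging.getLogger("llm_broker")

def _route(self, policy: BatchPolicy, model: str, thinking_budget: int | None = None) -> bool:
    """Decide sync vs. batch and record the decision for metrics and tracing."""
    batched = should_batch(policy, self._mode, model, thinking_budget)
    logger.info(
        "llm_route policy=%s mode=%s model=%s batched=%s",
        policy.value, self._mode.value, model, batched,
    )
    return batched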

Cost Impact

With Anthropic’s Batch API providing a 50% cost reduction on batched requests:

Mode          Batch Usage    Typical Savings
FAST          0%             0%
BALANCED      40-60%         20-30%
ECONOMICAL    80-95%         40-48%

For a workflow processing 1,000 documents with 15 LLM calls each, switching from FAST to ECONOMICAL can reduce API costs by nearly half.
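
A back-of-the-envelope check of that claim, assuming a flat illustrative per-call price and a 50% discount on every batched call:

calls = 1_000 * 15           # 1,000 documents x 15 LLM calls each
price_per_call = 0.01        # illustrative flat price in dollars

def total_cost(batch_fraction: float) -> float:
    """Total spend when batch_fraction of calls receive the 50% batch discount."""
    return calls * price_per_call * (1 - 0.5 * batch_fraction)

fast = total_cost(0.0)         # $150.00 with no batching
economical = total_cost(0.9)   # $82.50 with 90% of calls batched
print(f"savings: {1 - economical / fast:.0%}")  # savings: 45%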

Trade-offs

  • Latency variance: Batched requests can take minutes to hours to complete, versus seconds for sync calls
  • System complexity: Broker lifecycle, persistence, and monitoring infrastructure must be built and operated
  • Migration effort: Call sites must be updated to route through the broker and declare a policy
  • Rollout risk: A feature flag with opt-in behavior is recommended for safe migration (see the sketch below)
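
A sketch of that opt-in gate, using an assumed LLM_BROKER_ENABLED environment variable; the legacy branch is whatever direct call the site made before:

import os

USE_BROKER = os.environ.get("LLM_BROKER_ENABLED", "false").lower() == "true"

if USE_BROKER:
    future = await broker.request(prompt=prompt, policy=BatchPolicy.PREFER_BALANCE)
    result = await future
else:
    result = await llm.invoke(prompt)  # legacy direct call, unchanged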

Conclusion

The Central LLM Broker pattern transforms scattered batch decisions into a coherent routing system. By separating user preference (mode) from call-site intent (policy), it enables up to 50% cost savings while keeping latency-sensitive operations on the synchronous path.

The 2D routing matrix is the key insight: neither user mode nor call-site policy alone determines execution strategy. Their intersection does.