When generating visual content with LLMs, retry loops often produce inconsistent results. Each retry waits for the previous attempt to fail validation, and there's no guarantee that retries improve quality; they may just as easily oscillate. Here's a better approach: generate multiple candidates in parallel, then use a vision-capable LLM to compare them and select the best one.
The Problem With Retry Loops
Traditional retry-based generation has several issues:
```python
# Old approach: retry on failure
svg = await generate_svg(analysis)
for attempt in range(max_retries):
    overlap = check_overlaps(svg)
    if not overlap.has_overlaps:
        return svg
    # Retry with feedback about overlaps
    svg = await regenerate_with_feedback(svg, overlap)
# May still have overlaps after all retries
return svg
```
Problems:
- Sequential bottleneck: Each retry waits for the previous attempt to fail
- Inconsistent quality: Retries don’t guarantee improvement
- Heuristic selection: Programmatic checks miss visual quality nuances
- Wasted computation: Failed attempts provide no benefit
The Choose-Best Pattern
Instead of retrying until something works, generate N candidates simultaneously and let a vision model pick the winner:
```python
import asyncio


async def generate_candidates(
    analysis,
    config,
    generate_svg_fn,
    check_overlaps_fn,
    convert_to_png_fn,
) -> list[DiagramCandidate]:
    """Generate N SVG candidates in parallel with quality metrics."""
    # Generate all SVGs in parallel
    tasks = [generate_svg_fn(analysis, config) for _ in range(config.num_candidates)]
    svg_results = await asyncio.gather(*tasks, return_exceptions=True)

    candidates = []
    for i, svg_result in enumerate(svg_results):
        if isinstance(svg_result, Exception):
            continue  # Log and skip failed generations

        # Collect quality metrics for each candidate
        overlap = check_overlaps_fn(svg_result)
        png_bytes = convert_to_png_fn(svg_result, config.dpi)
        candidates.append(DiagramCandidate(
            svg_content=svg_result,
            png_bytes=png_bytes,
            overlap_check=overlap,
            candidate_id=i + 1,
        ))

    return candidates
```
The key insight: parallel generation costs N times the base cost but takes the same wall-clock time as a single generation. You're trading compute for quality.
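The snippets here pass around a DiagramCandidate container that the excerpt never defines. A minimal sketch, with field names and types assumed purely from how they are used above, could be:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DiagramCandidate:
    """One generated diagram plus the metrics used to judge it (assumed shape)."""
    svg_content: str      # Raw SVG markup returned by the generator
    png_bytes: bytes      # Rasterized rendering sent to the vision model
    overlap_check: Any    # Overlap-check result; exposes .overlap_pairs
    candidate_id: int     # 1-based index used when prompting the selector
```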
Vision-Based Selection
Once you have candidates, send all the rendered PNGs to a vision model in a single call:
```python
import base64


async def select_and_improve(
    candidates: list[DiagramCandidate],
    llm,  # Vision-capable model
) -> tuple[str, int, str]:
    """Two-phase selection: compare visually, then refine winner."""
    # Build multimodal content with all candidates
    content_parts = []
    for candidate in candidates:
        # Add overlap analysis as context
        content_parts.append({
            "type": "text",
            "text": f"**Candidate {candidate.candidate_id}**\n"
                    f"Overlaps: {len(candidate.overlap_check.overlap_pairs)}",
        })
        # Add the rendered image
        b64_png = base64.b64encode(candidate.png_bytes).decode("utf-8")
        content_parts.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": b64_png},
        })
    content_parts.append({
        "type": "text",
        "text": "Which candidate is best? State your choice and explain briefly.",
    })

    # Phase 1: Visual selection
    selection_response = await llm.ainvoke([
        {"role": "system", "content": SVG_SELECTION_SYSTEM},
        {"role": "user", "content": content_parts},
    ])

    # Parse selection, then Phase 2: Improve the winner
    # ...
```
The two-phase approach separates selection from improvement: the model first compares all options visually, then focuses on refining the chosen one.
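The parsing and improvement step is elided above. One way to finish it is sketched below, assuming the selector is instructed to answer with a line like "CHOICE: <n>", that SVG_IMPROVEMENT_SYSTEM is a separate (hypothetical) refinement prompt, and that the client is LangChain-style, with responses exposing .content:

```python
import re


async def improve_winner(candidates, llm, selection_text: str) -> tuple[str, int, str]:
    """Sketch of Phase 2: parse the selection, then refine only the winner."""
    # Pull "CHOICE: <n>" out of the selector's reply; default to candidate 1
    match = re.search(r"CHOICE:\s*(\d+)", selection_text)
    chosen_id = int(match.group(1)) if match else 1
    winner = next(
        (c for c in candidates if c.candidate_id == chosen_id), candidates[0]
    )

    # Phase 2: a text-only call focused on refining the single chosen SVG
    improvement = await llm.ainvoke([
        {"role": "system", "content": SVG_IMPROVEMENT_SYSTEM},  # hypothetical prompt
        {"role": "user", "content": f"Refine this SVG:\n{winner.svg_content}"},
    ])
    return improvement.content, chosen_id, selection_text
```

Here selection_text would be the text of selection_response from Phase 1.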
Graceful Degradation
Vision-based selection can fail. Always have a fallback:
```python
try:
    selection_result = await select_and_improve(candidates, llm)
except Exception as e:
    # Fallback: heuristic selection (fewest overlaps)
    best = min(candidates, key=lambda c: len(c.overlap_check.overlap_pairs))
    selection_result = (best.svg_content, best.candidate_id, f"Fallback: {e}")
```
Cost and Quality Tradeoffs
For three candidates:
- Generation: 3x base cost (but run in parallel, so latency matches a single attempt)
- Selection: 1x vision call (all images in one request)
- Improvement: 1x text call
- Total: roughly 4x the cost of a single generation
This is worth it when:
- Output quality is subjective and benefits from comparison
- Retry loops produce inconsistent improvements
- The content is important enough to justify the cost
Candidate count guidelines:
| Use Case | Candidates | Notes |
|---|---|---|
| Quick preview | 2 | Minimum for comparison |
| Standard | 3 | Good balance of quality vs. cost |
| High-quality | 5 | More options for critical output |
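The generation code earlier reads num_candidates and dpi off a config object that isn't shown. A minimal sketch of such a config, with the GenerationConfig name and default values assumed here rather than taken from the source:

```python
from dataclasses import dataclass


@dataclass
class GenerationConfig:
    """Hypothetical knobs consumed by generate_candidates."""
    num_candidates: int = 3   # "Standard" tier from the table above
    dpi: int = 150            # Assumed rasterization resolution for PNG previews
```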
When to Use This Pattern
Good fits:
- SVG/diagram generation where layout quality is subjective
- Content generation with multiple valid approaches
- Tasks where heuristic validation misses visual quality
- Outputs where users will see and judge quality
Poor fits:
- Deterministic outputs (only one correct answer)
- Latency-critical applications (vision adds a call)
- Low-value outputs where quality doesn’t matter
Complete Example
See the full implementation, which includes:
- candidate_schemas.py: Data structures for candidates and results
- parallel_generation.py: Parallel SVG generation with metrics
- vision_selection.py: Two-phase vision comparison and improvement
- choose_best_pipeline.py: Complete pipeline with dependency injection
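As a rough illustration of how these pieces could compose (a sketch under the same assumptions as above, not the contents of choose_best_pipeline.py; the deps object standing in for the injected generation, overlap-check, and PNG-conversion functions is hypothetical):

```python
async def choose_best_diagram(analysis, config, llm, deps) -> tuple[str, int, str]:
    """Sketch: parallel generation, vision selection, heuristic fallback."""
    candidates = await generate_candidates(
        analysis, config,
        deps.generate_svg, deps.check_overlaps, deps.convert_to_png,
    )
    if not candidates:
        raise RuntimeError("Every candidate generation failed")

    try:
        # Vision-based selection and improvement of the winner
        return await select_and_improve(candidates, llm)
    except Exception as e:
        # Same heuristic fallback shown earlier: fewest overlaps wins
        best = min(candidates, key=lambda c: len(c.overlap_check.overlap_pairs))
        return best.svg_content, best.candidate_id, f"Fallback: {e}"
```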
The pattern generalizes beyond diagrams to any LLM-generated content where quality is subjective and comparison helps: email drafts, UI mockups, or any output where “good enough” isn’t good enough.