When generating visual content with LLMs, retry loops often produce inconsistent results. Each retry waits for the previous attempt to fail validation, and there's no guarantee that retries improve quality; they may just as easily oscillate. Here's a better approach: generate multiple candidates in parallel, then use a vision-capable LLM to compare them and select the best one.
The Problem With Retry Loops
Traditional retry-based generation has several issues:
```python
# Old approach: retry on failure
svg = await generate_svg(analysis)
for attempt in range(max_retries):
    overlap = check_overlaps(svg)
    if not overlap.has_overlaps:
        return svg
    # Retry with feedback about overlaps
    svg = await regenerate_with_feedback(svg, overlap)
# May still have overlaps after all retries
return svg
```
Problems:
- Sequential bottleneck: Each retry waits for the previous attempt to fail
- Inconsistent quality: Retries don’t guarantee improvement
- Heuristic selection: Programmatic checks miss visual quality nuances
- Wasted computation: Failed attempts provide no benefit
The Choose-Best Pattern
Instead of retrying until something works, generate N candidates simultaneously and let a vision model pick the winner:
```python
import asyncio


async def generate_candidates(
    analysis,
    config,
    generate_svg_fn,
    check_overlaps_fn,
    convert_to_png_fn,
) -> list[DiagramCandidate]:
    """Generate N SVG candidates in parallel with quality metrics."""
    # Generate all SVGs in parallel
    tasks = [generate_svg_fn(analysis, config) for _ in range(config.num_candidates)]
    svg_results = await asyncio.gather(*tasks, return_exceptions=True)

    candidates = []
    for i, svg_result in enumerate(svg_results):
        if isinstance(svg_result, Exception):
            continue  # Log and skip failed generations

        # Collect quality metrics for each candidate
        overlap = check_overlaps_fn(svg_result)
        png_bytes = convert_to_png_fn(svg_result, config.dpi)
        candidates.append(DiagramCandidate(
            svg_content=svg_result,
            png_bytes=png_bytes,
            overlap_check=overlap,
            candidate_id=i + 1,
        ))

    return candidates
```
The key insight: parallel generation costs N times the base cost but takes the same wall-clock time as a single generation. You're trading compute for quality.
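The snippets here pass around a DiagramCandidate container that the excerpt never defines. A minimal sketch, with field names and types assumed purely from how they are used above, could be:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DiagramCandidate:
    """One generated diagram plus the metrics used to judge it (assumed shape)."""
    svg_content: str      # Raw SVG markup returned by the generator
    png_bytes: bytes      # Rasterized rendering sent to the vision model
    overlap_check: Any    # Overlap-check result; exposes .overlap_pairs
    candidate_id: int     # 1-based index used when prompting the selector
```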
Vision-Based Selection
Once you have candidates, send all the rendered PNGs to a vision model in a single call:
```python
import base64


async def select_and_improve(
    candidates: list[DiagramCandidate],
    llm,  # Vision-capable model
) -> tuple[str, int, str]:
    """Two-phase selection: compare visually, then refine winner."""
    # Build multimodal content with all candidates
    content_parts = []
    for candidate in candidates:
        # Add overlap analysis as context
        content_parts.append({
            "type": "text",
            "text": f"**Candidate {candidate.candidate_id}**\n"
                    f"Overlaps: {len(candidate.overlap_check.overlap_pairs)}",
        })
        # Add the rendered image
        b64_png = base64.b64encode(candidate.png_bytes).decode("utf-8")
        content_parts.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": b64_png},
        })
    content_parts.append({
        "type": "text",
        "text": "Which candidate is best? State your choice and explain briefly.",
    })

    # Phase 1: Visual selection
    selection_response = await llm.ainvoke([
        {"role": "system", "content": SVG_SELECTION_SYSTEM},
        {"role": "user", "content": content_parts},
    ])

    # Parse selection, then Phase 2: Improve the winner
    # ...
```
The two-phase approach separates selection from improvement: the model first compares all options visually, then focuses on refining the chosen one.
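The parsing and improvement step is elided above. One way to finish it is sketched below, assuming the selector is instructed to answer with a line like "CHOICE: <n>", that SVG_IMPROVEMENT_SYSTEM is a separate (hypothetical) refinement prompt, and that the client is LangChain-style, with responses exposing .content:

```python
import re


async def improve_winner(candidates, llm, selection_text: str) -> tuple[str, int, str]:
    """Sketch of Phase 2: parse the selection, then refine only the winner."""
    # Pull "CHOICE: <n>" out of the selector's reply; default to candidate 1
    match = re.search(r"CHOICE:\s*(\d+)", selection_text)
    chosen_id = int(match.group(1)) if match else 1
    winner = next(
        (c for c in candidates if c.candidate_id == chosen_id), candidates[0]
    )

    # Phase 2: a text-only call focused on refining the single chosen SVG
    improvement = await llm.ainvoke([
        {"role": "system", "content": SVG_IMPROVEMENT_SYSTEM},  # hypothetical prompt
        {"role": "user", "content": f"Refine this SVG:\n{winner.svg_content}"},
    ])
    return improvement.content, chosen_id, selection_text
```

Here selection_text would be the text of selection_response from Phase 1.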
Graceful Degradation
Vision-based selection can fail. Always have a fallback:
```python
try:
    selection_result = await select_and_improve(candidates, llm)
except Exception as e:
    # Fallback: heuristic selection (fewest overlaps)
    best = min(candidates, key=lambda c: len(c.overlap_check.overlap_pairs))
    selection_result = (best.svg_content, best.candidate_id, f"Fallback: {e}")
```
Cost and Quality Tradeoffs
For three candidates:
- Generation: 3x base cost (but run in parallel, so latency matches a single attempt)
- Selection: 1x vision call (all images in one request)
- Improvement: 1x text call
- Total: roughly 4x the cost of a single generation
This is worth it when:
- Output quality is subjective and benefits from comparison
- Retry loops produce inconsistent improvements
- The content is important enough to justify the cost
Candidate count guidelines:
| Use Case | Candidates | Notes |
|---|---|---|
| Quick preview | 2 | Minimum for comparison |
| Standard | 3 | Good balance of quality vs. cost |
| High-quality | 5 | More options for critical output |
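The generation code earlier reads num_candidates and dpi off a config object that isn't shown. A minimal sketch of such a config, with the GenerationConfig name and default values assumed here rather than taken from the source:

```python
from dataclasses import dataclass


@dataclass
class GenerationConfig:
    """Hypothetical knobs consumed by generate_candidates."""
    num_candidates: int = 3   # "Standard" tier from the table above
    dpi: int = 150            # Assumed rasterization resolution for PNG previews
```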
When to Use This Pattern
Good fits:
- SVG/diagram generation where layout quality is subjective
- Content generation with multiple valid approaches
- Tasks where heuristic validation misses visual quality
- Outputs where users will see and judge quality
Poor fits:
- Deterministic outputs (only one correct answer)
- Latency-critical applications (vision adds a call)
- Low-value outputs where quality doesn’t matter
Complete Example
See the full implementation, which includes:
- candidate_schemas.py: Data structures for candidates and results
- parallel_generation.py: Parallel SVG generation with metrics
- vision_selection.py: Two-phase vision comparison and improvement
- choose_best_pipeline.py: Complete pipeline with dependency injection
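As a rough illustration of how these pieces could compose (a sketch under the same assumptions as above, not the contents of choose_best_pipeline.py; the deps object standing in for the injected generation, overlap-check, and PNG-conversion functions is hypothetical):

```python
async def choose_best_diagram(analysis, config, llm, deps) -> tuple[str, int, str]:
    """Sketch: parallel generation, vision selection, heuristic fallback."""
    candidates = await generate_candidates(
        analysis, config,
        deps.generate_svg, deps.check_overlaps, deps.convert_to_png,
    )
    if not candidates:
        raise RuntimeError("Every candidate generation failed")

    try:
        # Vision-based selection and improvement of the winner
        return await select_and_improve(candidates, llm)
    except Exception as e:
        # Same heuristic fallback shown earlier: fewest overlaps wins
        best = min(candidates, key=lambda c: len(c.overlap_check.overlap_pairs))
        return best.svg_content, best.candidate_id, f"Fallback: {e}"
```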
The pattern generalizes beyond diagrams to any LLM-generated content where quality is subjective and comparison helps: email drafts, UI mockups, or any output where “good enough” isn’t good enough.