Per-location image selection answers the question “Which candidate is best for this spot?” But it cannot answer a harder question: “Do these images work together as a set?” After over-generating N+2 images for a document and selecting winners at each location, you still need someone to look at the full spread and decide which two to cut. That someone is a vision LLM making a single holistic call.
This article describes a pattern we call editorial curation—a single Sonnet vision call that receives all non-header images in document order, evaluates each on four dimensions, ranks them by overall contribution, and identifies the weakest for removal.
Why Per-Location Selection Is Not Enough
Our illustration pipeline generates two candidates per location and uses vision pair comparison to pick the winner at each spot. This produces good individual images, but the full set can still have problems that per-location selection cannot detect:
- Clustering: Three images in adjacent sections while a long stretch has none
- Redundancy: Two conceptual illustrations that look nearly identical despite being at different locations
- Style drift: A generated watercolor next to a photographic public domain image, breaking visual coherence
- Over-illustration: Some sections work better with text alone, but every planned location got an image
These are document-level concerns. You can only see them by looking at all the images together.
The Pattern: Over-Generate, Then Curate
The editorial review sits at the end of the illustration pipeline, after per-location pair selection has produced winners.
graph TD
    A[Generate two candidates per location] --> B[Per-location pair selection]
    B --> C[Assemble winning images]
    C --> D[Editorial review—single vision call]
    D --> E[Apply cuts and finalize]
The over-generation strategy intentionally produces N+2 images. Per-location selection picks the best candidate at each spot. Editorial review then evaluates the full set and cuts the two surplus images that contribute least to the whole.
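In outline, the flow looks roughly like this; generate_candidates, pick_winner, and editorial_review are hypothetical names standing in for the pipeline stages described above:
async def illustrate(document, locations, n_target: int) -> list[dict]:
    """Sketch only: over-generate, select locally, then curate the full set."""
    winners = []
    for loc in locations:  # N+2 planned locations
        a, b = await generate_candidates(loc)            # two candidates per spot
        winners.append(await pick_winner(loc, a, b))     # per-location pair selection
    review = await editorial_review(document, winners, n_target)  # single holistic call
    cut = set(review.cut_location_ids)
    return [img for img in winners if img["location_id"] not in cut]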
Structured Output for Editorial Decisions
The LLM returns a Pydantic schema with constrained fields. Four scoring dimensions act as chain-of-thought scaffolding, forcing the model to reason about different quality aspects before committing to cuts.
from pydantic import BaseModel, ConfigDict, Field
class EditorialImageEvaluation(BaseModel):
"""Evaluation of a single image in the document context."""
model_config = ConfigDict(extra="forbid")
location_id: str = Field(
description="The location_id label shown with each image",
)
contribution_rank: int = Field(
ge=1,
description="1=strongest contribution to the article, N=weakest",
)
# Scoring dimensions—LLM steering, not consumed programmatically
visual_coherence: int = Field(
ge=1, le=5,
description="1=clashes with surrounding images, "
"5=perfectly complements the visual identity",
)
pacing_contribution: int = Field(
ge=1, le=5,
description="1=creates clustering with nearby images, "
"5=well-spaced in document flow",
)
variety_contribution: int = Field(
ge=1, le=5,
description="1=redundant with other images, "
"5=adds unique visual perspective",
)
individual_quality: int = Field(
ge=1, le=5,
description="1=poor composition or artifacts, "
"5=publication-ready quality",
)
cut_reason: str | None = Field(
default=None,
description="Only for images marked for cutting",
)
class EditorialReviewResult(BaseModel):
"""Full editorial review output."""
model_config = ConfigDict(extra="forbid")
evaluations: list[EditorialImageEvaluation]
cut_location_ids: list[str] = Field(
description="The location_ids to remove",
)
editorial_summary: str = Field(
description="Overall assessment for logging",
    )

extra="forbid" catches hallucinated fields. The ge/le constraints on scores prevent out-of-range values. The four dimensions (coherence, pacing, variety, quality) are deliberately orthogonal: without them, the LLM tends to focus only on individual quality and ignore document-level concerns.
The scoring fields are never consumed programmatically. Only cut_location_ids and cut_reason drive behavior. But they improve the model’s reasoning by forcing structured evaluation before the cut decision, similar to chain-of-thought prompting through schema design.
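As a quick illustration of what the schema constraints catch, a hypothetical out-of-range score (the location_id is made up) is rejected at parse time, before it can influence any cut decision:
from pydantic import ValidationError

try:
    EditorialImageEvaluation(
        location_id="section_3",   # hypothetical ID
        contribution_rank=1,
        visual_coherence=7,        # violates le=5
        pacing_contribution=3,
        variety_contribution=3,
        individual_quality=4,
    )
except ValidationError:
    pass  # the out-of-range score never reaches the pipeline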
Building the Multimodal Message
The core technique is interleaving each base64-encoded image with a text label that identifies it.
import base64
MAX_IMAGE_SIZE = 20 * 1024 * 1024 # 20 MB
content_parts: list[dict] = [{"type": "text", "text": user_prompt}]
for img in non_header_images:
image_bytes = img["image_bytes"]
if not image_bytes or len(image_bytes) > MAX_IMAGE_SIZE:
continue
b64 = base64.b64encode(image_bytes).decode("utf-8")
content_parts.append({
"type": "image",
"source": {
"type": "base64",
"media_type": detect_media_type(image_bytes),
"data": b64,
},
})
# Label immediately after the image reduces ambiguity
content_parts.append({
"type": "text",
"text": f"Image above is '{img['location_id']}' "
f"({img['image_type']}, {img['purpose']})",
    })

The [image, label] interleaving pattern is important. When the LLM sees six images in a single message, it needs a way to reference each one unambiguously. Placing the label text immediately after its image, rather than listing all labels before or after all images, makes the association clear.
The 20 MB size guard prevents memory exhaustion during base64 encoding. In practice, generated images are typically under 1 MB, but public domain images can be larger.
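detect_media_type is not shown above; a minimal sketch based on magic-byte sniffing might look like the following (which formats the pipeline actually encounters is an assumption):
def detect_media_type(image_bytes: bytes) -> str:
    """Best-effort media type detection from magic bytes."""
    if image_bytes.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if image_bytes.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if image_bytes[:4] == b"RIFF" and image_bytes[8:12] == b"WEBP":
        return "image/webp"
    if image_bytes.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    return "image/png"  # reasonable default for generated images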
The Prompt
The prompt establishes the editorial art director persona and provides the article’s visual identity context from an earlier creative direction pass.
EDITORIAL_SYSTEM = (
"You are an editorial art director reviewing the full set of "
"illustrations for a long-form article. Your job is to ensure "
"the illustrations work as a cohesive set."
)
EDITORIAL_USER = """This article was intentionally illustrated with \
{cuts_count} more images than needed so you can select the strongest \
set. You will evaluate {n_images} non-header images and cut the \
{cuts_count} that contribute least.
Evaluate each image on:
1. Visual coherence—Does it match the style/identity of the other images?
2. Pacing contribution—Is it well-placed? Does it avoid clustering?
3. Variety contribution—Is it different from its neighbors?
4. Individual quality—Is it technically good and contextually relevant?
Visual identity for this article:
- Style: {primary_style}
- Palette: {color_palette}
- Mood: {mood}
Rank ALL {n_images} images from strongest to weakest contribution, \
then mark the bottom {cuts_count} for removal. For each cut, explain why."""

Two design choices are worth noting. First, the prompt explains that over-generation was intentional. Without this context, the model may be reluctant to cut images, assuming all of them are supposed to stay. Second, the “rank ALL images” instruction mitigates positional bias—the model must consider every image’s relative contribution, not evaluate the first few and skip the rest.
The visual identity parameters (style, palette, mood) come from a creative direction pass earlier in the pipeline. Including them enables coherence judgments that would otherwise be impossible.
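Tying the pieces together, a sketch of the final message assembly; the visual_identity dict and the role/content message format are assumptions about the surrounding pipeline, and cuts_count comes from the adaptive formula shown later:
user_prompt = EDITORIAL_USER.format(
    cuts_count=cuts_count,
    n_images=len(non_header_images),
    primary_style=visual_identity["primary_style"],
    color_palette=visual_identity["color_palette"],
    mood=visual_identity["mood"],
)
# content_parts starts with user_prompt and interleaves [image, label] pairs as shown earlier
messages = [
    {"role": "system", "content": EDITORIAL_SYSTEM},
    {"role": "user", "content": content_parts},
]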
Validating LLM Output
Vision LLMs can hallucinate location IDs, return duplicates, or recommend too many cuts. A three-layer validation chain handles all of these.
valid_ids = {img["location_id"] for img in non_header_images}
# 1. Filter to valid location IDs (prevents hallucinated IDs)
# 2. Deduplicate while preserving order
# 3. Cap at requested cuts_count (prevents over-cutting)
cut_ids = list(dict.fromkeys(
cid for cid in response.cut_location_ids if cid in valid_ids
))[:cuts_count]

dict.fromkeys() is the key technique here. It deduplicates while preserving insertion order, unlike set() which loses order. If the LLM returns ["section_2", "section_4", "section_2"], the result is ["section_2", "section_4"].
The [:cuts_count] slice at the end prevents the LLM from cutting more images than requested. Combined with the adaptive cut count (below), this ensures the image set never drops below target.
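Applying the validated cuts is then a simple order-preserving filter; in this sketch, all_images and the surrounding state handling are assumptions:
cut_set = set(cut_ids)
kept_images = [img for img in all_images if img["location_id"] not in cut_set]
for cid in cut_ids:
    logger.info("Editorial cut applied: %s", cid)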
Adaptive Cut Count
If over-generation produces fewer images than expected (some locations fail), cutting a fixed two would drop below the target. The adaptive formula prevents this.
def compute_cuts_count(n_images: int, n_target: int, max_cuts: int = 2) -> int:
"""Never cut below the target image count."""
surplus = n_images - n_target
    return max(0, min(max_cuts, surplus))

For a target of six non-header images with N+2 over-generation: eight images available means cut two. But if only five images succeeded, surplus = -1 and cuts_count = 0. No cuts.
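The same cases as executable checks:
assert compute_cuts_count(n_images=8, n_target=6) == 2  # full N+2 surplus
assert compute_cuts_count(n_images=7, n_target=6) == 1  # one location failed
assert compute_cuts_count(n_images=5, n_target=6) == 0  # already below target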
Fail-Open Error Handling
Editorial review is a quality enhancement, not a blocking requirement. If the vision call fails, the workflow continues with all images.
try:
llm = get_llm(tier=ModelTier.SONNET).with_structured_output(
EditorialReviewResult
)
response = await llm.ainvoke(messages)
# ... validation ...
except Exception:
logger.exception("Editorial review failed, keeping all images")
return EditorialReviewResult(
evaluations=[],
cut_location_ids=[],
editorial_summary="Editorial review failed. All images kept.",
    )

The fail-open approach means an API timeout or parsing error produces a document with two extra images rather than a crashed workflow. In practice, the extra images are acceptable because per-location selection already ensured each one meets a quality threshold.
Status Determination After Cuts
A subtle but important integration detail: the finalize node must account for editorial cuts when determining completion status. Without this, a workflow that successfully generated all needed images would incorrectly report “partial.”
def determine_status(planned_count, actual_count, cut_count=0):
"""Exclude editorial cuts from the expected count."""
expected = planned_count - cut_count
if expected <= 0 or actual_count <= 0:
return "failed"
    return "complete" if actual_count >= expected else "partial"

After cutting two images from a six-image plan, the expected count drops to four. If four images were successfully generated, the status is “complete,” not “partial.” This seems obvious in isolation, but we discovered the bug in production after editorial review was already deployed.
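The same behavior as executable checks:
assert determine_status(planned_count=6, actual_count=4, cut_count=2) == "complete"
assert determine_status(planned_count=6, actual_count=4, cut_count=0) == "partial"
assert determine_status(planned_count=6, actual_count=0, cut_count=2) == "failed"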
Cost Analysis
For six non-header images:
| Approach | API calls | Context |
|---|---|---|
| Per-image evaluation | Six vision calls | No awareness of other images |
| Single editorial call | One vision call | All images in context |
The single-call approach uses one-sixth as many vision calls and provides the document-level context necessary for coherence, pacing, and variety judgments. The trade-off is a memory spike from encoding all images simultaneously, mitigated by the size guard.
Comparison with Per-Location Selection
These two patterns serve complementary purposes in the pipeline:
| | Per-Location Pair Selection | Editorial Curation |
|---|---|---|
| Scope | Two candidates at one location | All images across all locations |
| Question | Which is better for this spot? | Which contribute least to the whole? |
| Criteria | Brief compliance, visual quality | Coherence, pacing, variety, quality |
| When | During generation | After all winners assembled |
Per-location selection is local optimization. Editorial curation is global optimization. You need both.
Related Work
The MLLM-as-a-Judge benchmark (ICML 2024) found that pair comparison achieves 80.6 percent alignment with human judgment, compared to 55.7 percent for scoring-based evaluation. This validates our two-phase approach—pair comparison per-location, then a single holistic call for editorial cuts.
The Text-to-ImageSet (T2IS) paper (“Why Settle for One?”, 2025) addresses set-level coherence for image generation, evaluating identity, style, and logic consistency. Our pattern solves the complementary problem—given an already-generated set, curate it by cutting the weakest members.
Gist
A self-contained implementation with schemas, message construction, validation, and adaptive cut logic.