Section Rewriting and Citation Validation for LangGraph

Structured edit operations such as find/replace and delete_paragraph fail consistently when an LLM must specify exact replacement text: identifying an issue is far easier than specifying precisely how to fix it. Citations introduced during editing are also unvalidated, which leaves references untraceable.

This pattern replaces structured edits with direct section rewriting and uses Zotero as the source of truth for citation validation.

The Problem

Edit Operation Failures

Structured edit specifications fail predictably:

  • Validation failures: LLMs cannot reliably specify exact replacement_text.
  • Retry complexity: Failed edits lead to retry loops with diminishing returns.
  • Specification gap: Identifying an issue is different from specifying the fix.

Citation Integrity Issues

Citations introduced during editing lack validation:

  • Synthetic keys: Metadata-only papers use generated keys that are not verifiable.
  • No source of truth: Corpus keys are not authoritative.
  • Silent failures: Invalid citations pass through undetected.

The Solution

LLMs Are Better at Rewriting Than Editing

Instead of having LLMs generate structured edit operations, have them directly rewrite affected sections:

Old (edit operations):
Phase A: Identify issues
Phase B: Generate StructuralEdit specs → Validate → Retry if fails → Apply

New (section rewriting):
Phase A: Identify issues (diagnosis only)
Phase B: Rewrite sections directly → Apply rewrites

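This eliminates the edit-spec validation/retry loop entirely. The only loop that remains is the post-rewrite verification gate in the LangGraph workflow.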

Section Rewriting Implementation

The rewriter extracts sections with surrounding context (three paragraphs on each side) and lets the LLM rewrite holistically:

async def rewrite_section_for_issue(
    issue: StructuralIssue,
    paragraph_mapping: dict[int, str],
    topic: str,
    llm_factory,
) -> SectionRewriteResult:
    """Rewrite a section to fix a structural issue."""
    # Extract section with surrounding context
    context_before, section, context_after, start, end = extract_section_with_context(
        paragraph_mapping,
        issue.affected_paragraphs,
        context_size=3,
    )
 
    # Build rewriting prompt
    prompt = SECTION_REWRITE_USER.format(
        issue_id=issue.issue_id,
        issue_type=issue.issue_type,
        description=issue.description,
        context_before=context_before,
        section_content=section,
        context_after=context_after,
    )
 
    # Use Sonnet for rewriting (good enough, faster than Opus)
    llm = llm_factory(ModelTier.SONNET, max_tokens=8000)
    rewritten = await llm.ainvoke([
        {"role": "system", "content": SECTION_REWRITE_SYSTEM},
        {"role": "user", "content": prompt},
    ])
 
    return SectionRewriteResult(
        issue_id=issue.issue_id,
        original_paragraphs=list(range(start, end + 1)),
        rewritten_content=rewritten.content,
        confidence=0.8,  # hard-coded; not derived from the model output
    )
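
The helper extract_section_with_context is called above but not shown. A minimal sketch of what it might look like follows, inferred only from the call site and the returned tuple. It assumes the affected paragraphs form one contiguous span and that paragraphs are joined with blank lines; both are assumptions, not confirmed behavior.

```python
def extract_section_with_context(
    paragraph_mapping: dict[int, str],
    affected_paragraphs: list[int],
    context_size: int = 3,
) -> tuple[str, str, str, int, int]:
    """Return (context_before, section, context_after, start, end)."""
    if not affected_paragraphs:
        raise ValueError("affected_paragraphs must not be empty")

    # Assumption: treat everything from the first to the last affected
    # paragraph as one contiguous section.
    start, end = min(affected_paragraphs), max(affected_paragraphs)

    def join(numbers: range) -> str:
        # Skip numbers missing from the mapping, e.g. before the first paragraph.
        return "\n\n".join(
            paragraph_mapping[n] for n in numbers if n in paragraph_mapping
        )

    context_before = join(range(start - context_size, start))
    section = join(range(start, end + 1))
    context_after = join(range(end + 1, end + 1 + context_size))
    return context_before, section, context_after, start, end
```

With this shape, an issue near the start or end of the document simply receives less context instead of raising an error.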

The system prompt enforces critical rules:

SECTION_REWRITE_SYSTEM = """You are an expert academic editor fixing a specific structural issue.
 
Critical Rules:
1. FIX the specific issue described - nothing more, nothing less
2. PRESERVE all citations in exact [@KEY] format
3. MAINTAIN document's voice and style
4. DO NOT add new factual claims - only restructure/consolidate/clarify
5. DO NOT rewrite context paragraphs - only the section between markers
6. KEEP similar length (±30% unless issue requires expansion/reduction)
 
Output: ONLY the rewritten section content - no explanations, no meta-commentary.
"""

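Prompt rules are instructions, not guarantees, so it can help to check the output mechanically. The sketch below covers rules 2 and 6 only. The function name check_rewrite, the one-key-per-bracket regex, and the word-count measure of length are all assumptions introduced for illustration.

```python
import re

# Matches the [@KEY] form required by rule 2; one key per bracket is assumed.
CITATION_RE = re.compile(r"\[@([^\]]+)\]")


def check_rewrite(original: str, rewritten: str, tolerance: float = 0.30) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems: list[str] = []

    # Rule 2: every citation key in the original must survive the rewrite.
    dropped = set(CITATION_RE.findall(original)) - set(CITATION_RE.findall(rewritten))
    if dropped:
        problems.append(f"dropped citations: {sorted(dropped)}")

    # Rule 6: word count should stay within the tolerance of the original.
    before, after = len(original.split()), len(rewritten.split())
    if before and abs(after - before) > tolerance * before:
        problems.append(f"length changed from {before} to {after} words")

    return problems
```

Because rule 6 allows exceptions when the issue itself requires expansion or reduction, the length finding is probably best treated as a warning rather than a hard failure.
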
Citation Validation with Zotero

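Zotero becomes the source of truth for all citations. In the function below, a key that is missing from Zotero is still accepted if it appears in corpus_keys, so the corpus acts as a fallback rather than an authority: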

async def validate_edit_citations_with_zotero(
    original_section: str,
    edited_section: str,
    corpus_keys: set[str],
    zotero_client,
    verify_all: bool = True,
) -> CitationValidationResult:
    """Validate citations in edited content against Zotero."""
    edited_citations = extract_citations(edited_section)
    verified_keys = set()
    invalid_citations = []
 
    if verify_all and edited_citations:
        verification_results = await verify_zotero_citations_batch(
            edited_citations, zotero_client
        )
 
        for key, exists in verification_results.items():
            if exists:
                verified_keys.add(key)
            elif key not in corpus_keys:
                # Not in Zotero AND not in corpus - invalid
                invalid_citations.append(f"{key} (not in Zotero)")
            else:
                # In corpus but not Zotero - trust corpus as fallback
                verified_keys.add(key)
 
    return CitationValidationResult(
        is_valid=len(invalid_citations) == 0,
        verified_keys=verified_keys,
        invalid_citations=invalid_citations,
    )
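
The function above relies on two helpers that are not shown, extract_citations and verify_zotero_citations_batch. The sketch below is one plausible shape for them, not the actual implementation. The InMemoryZoteroClient class and its exists method are hypothetical stand-ins so the example runs without network access; a real client would query the Zotero API instead.

```python
import asyncio
import re

# Assumes the [@KEY] citation format with one key per bracket.
CITATION_RE = re.compile(r"\[@([^\]]+)\]")


def extract_citations(text: str) -> set[str]:
    """Collect the distinct citation keys that appear in the text."""
    return set(CITATION_RE.findall(text))


class InMemoryZoteroClient:
    """Hypothetical stand-in for a Zotero client; only `exists` is modelled."""

    def __init__(self, keys: set[str]) -> None:
        self._keys = set(keys)

    async def exists(self, key: str) -> bool:
        return key in self._keys


async def verify_zotero_citations_batch(
    keys: set[str], zotero_client: InMemoryZoteroClient
) -> dict[str, bool]:
    """Check all keys concurrently and map each key to whether it exists."""
    ordered = sorted(keys)
    found = await asyncio.gather(*(zotero_client.exists(k) for k in ordered))
    return dict(zip(ordered, found))


if __name__ == "__main__":
    text = "One [@AAA111]. Two [@BBB222]."
    client = InMemoryZoteroClient({"AAA111"})
    print(asyncio.run(verify_zotero_citations_batch(extract_citations(text), client)))
```

Returning a key-to-boolean mapping matches how the validation function iterates over verification_results.items().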

For metadata-only papers, create Zotero stubs before synthesis:

async def create_zotero_stubs_for_papers(
    papers: list[dict],
    zotero_client,
) -> dict[str, str]:
    """Create real Zotero records for metadata-only papers."""
    zotero_keys = {}
 
    for paper in papers:
        doi = paper.get("doi")
        if not doi:
            continue
 
        item = ZoteroItemCreate(
            itemType="journalArticle",
            fields={
                # `or` also covers keys that are present but set to None
                "title": paper.get("title") or "Unknown",
                "DOI": doi,
                "abstractNote": (paper.get("abstract") or "")[:2000],
            },
            tags=[{"tag": "metadata-only", "type": 1}],
        )
 
        zotero_key = await zotero_client.add(item)
        zotero_keys[doi] = zotero_key
 
    return zotero_keys

LangGraph Workflow

The Loop 3 graph implements structural supervision:

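from langgraph.graph import END, START, StateGraph
from langgraph.graph.state import CompiledStateGraph


def create_loop3_graph() -> CompiledStateGraph:
    """Create and compile the Loop 3 graph for structural supervision."""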
    builder = StateGraph(Loop3State)
 
    builder.add_node("number_paragraphs", number_paragraphs)
    builder.add_node("phase_a_identify_issues", phase_a_identify_issues)
    builder.add_node("phase_b_rewrite", phase_b_rewrite_sections)
    builder.add_node("verify_architecture", verify_architecture)
    builder.add_node("pass_through", pass_through)
    builder.add_node("finalize", finalize)
 
    builder.add_edge(START, "number_paragraphs")
    builder.add_edge("number_paragraphs", "phase_a_identify_issues")
 
    # Conditional routing after Phase A
    builder.add_conditional_edges(
        "phase_a_identify_issues",
        route_after_phase_a,
        {
            "phase_b_rewrite": "phase_b_rewrite",
            "pass_through": "pass_through",
        }
    )
 
    builder.add_edge("phase_b_rewrite", "verify_architecture")
 
    # Conditional routing after verification
    builder.add_conditional_edges(
        "verify_architecture",
        route_after_verification,
        {
            "finalize": "finalize",
            "phase_b_rewrite": "phase_b_rewrite",  # Retry
        }
    )
 
    builder.add_edge("pass_through", "finalize")
    builder.add_edge("finalize", END)
 
    return builder.compile()

The workflow has three key design choices:

  1. Conditional routing after Phase A avoids unnecessary rewriting when no issues exist.
  2. Verification gate (coherence of at least 0.8) provides quality control.
  3. Retry loop with iteration limit handles persistent issues.
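
The two routing functions are referenced in the graph but not shown. A minimal sketch of how they might implement the three choices above follows. The state fields issues, coherence_score, and iteration, and the limit of three iterations, are assumptions for illustration; only the 0.8 threshold comes from the text.

```python
from typing import TypedDict

COHERENCE_THRESHOLD = 0.8  # from the verification gate described above
MAX_ITERATIONS = 3         # assumed limit; the actual value is not stated


class Loop3State(TypedDict, total=False):
    # Hypothetical subset of the real state, enough to drive routing.
    issues: list[dict]
    coherence_score: float
    iteration: int


def route_after_phase_a(state: Loop3State) -> str:
    """Skip rewriting entirely when diagnosis found no issues."""
    return "phase_b_rewrite" if state.get("issues") else "pass_through"


def route_after_verification(state: Loop3State) -> str:
    """Finalize on sufficient coherence or when the retry budget is spent."""
    if state.get("coherence_score", 0.0) >= COHERENCE_THRESHOLD:
        return "finalize"
    if state.get("iteration", 0) >= MAX_ITERATIONS:
        return "finalize"
    return "phase_b_rewrite"
```

The returned strings match the keys of the mapping passed to add_conditional_edges, which is how LangGraph resolves the next node.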

Model Tier Strategy

Different phases use different model tiers for cost optimization:

Phase                   Model    Rationale
Phase A (Diagnosis)     Opus     Full analysis, architectural reasoning
Phase B (Rewriting)     Sonnet   Good enough for text transformation
Change Summaries        Haiku    Simple summarization, minimal cost
Citation Verification   API      Zotero API calls, no LLM needed
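
One way to keep this mapping in a single place is a small lookup table, sketched below. ModelTier appears in the rewriting code, but its definition is not shown, so the enum values, the phase names, and the tier_for_phase helper are hypothetical.

```python
from enum import Enum
from typing import Optional


class ModelTier(Enum):
    # Values are illustrative labels, not real model identifiers.
    OPUS = "opus"
    SONNET = "sonnet"
    HAIKU = "haiku"


# None marks a phase that calls an API and needs no LLM.
PHASE_TIERS: dict[str, Optional[ModelTier]] = {
    "diagnosis": ModelTier.OPUS,
    "rewriting": ModelTier.SONNET,
    "change_summary": ModelTier.HAIKU,
    "citation_verification": None,
}


def tier_for_phase(phase: str) -> Optional[ModelTier]:
    """Look up the tier for a phase; unknown phases raise KeyError."""
    return PHASE_TIERS[phase]
```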

Issue Type Taxonomy

The rewriter handles these structural issue types:

Issue type          Description
content_sprawl      Same topic scattered across three or more sections
premature_detail    Technical content before foundations
orphaned_content    Disconnected paragraph lacking context
redundant_framing   Multiple introductions to same concept
logical_gap         Missing connecting tissue between ideas
consolidate         Content from three or more locations needs reorganization
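
Since Phase A output comes from an LLM, it can be worth rejecting unknown issue types before they reach the rewriter. The sketch below models StructuralIssue with only the four fields the rewriting code reads. The dataclass layout and the validation in __post_init__ are assumptions, not the actual definition.

```python
from dataclasses import dataclass

ISSUE_TYPES = frozenset({
    "content_sprawl",
    "premature_detail",
    "orphaned_content",
    "redundant_framing",
    "logical_gap",
    "consolidate",
})


@dataclass(frozen=True)
class StructuralIssue:
    issue_id: str
    issue_type: str
    description: str
    affected_paragraphs: tuple[int, ...]

    def __post_init__(self) -> None:
        # Reject issue types outside the taxonomy.
        if self.issue_type not in ISSUE_TYPES:
            raise ValueError(f"unknown issue type: {self.issue_type!r}")
        # A rewrite needs at least one paragraph to work on.
        if not self.affected_paragraphs:
            raise ValueError("affected_paragraphs must not be empty")
```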

Trade-offs

Benefits:

  • No edit-spec validation failures: the LLM produces prose rather than edit specs that must match exactly.
  • Simpler code path: no validate/retry loop for edit specs.
  • Better structural fixes through holistic rewriting.
  • Citation integrity: all references can be verified against Zotero.
  • Fail-fast behavior: missing Zotero keys are caught early.

Costs:

  • Pure move operations cannot be handled, because sections are rewritten in place.
  • Context paragraphs (three on each side) add to token usage.
  • Style drift is possible if the rewrite departs from the document's voice.
  • The Zotero dependency requires API access.