Document Processing

Converts documents (PDFs, web pages, EPUBs) into structured markdown with summaries, extracted metadata, and Zotero integration.

The workflow:

  1. Resolves input (URL, file path, or raw text) into processable content
  2. Creates a Zotero reference entry
  3. Detects language and generates summaries
  4. Extracts metadata (authors, dates, DOIs)
  5. For long documents, generates condensed chapter summaries

When to Use This

This workflow ingests documents into the knowledge base, turning raw files or URLs into searchable, summarized content with proper citation management.

For researching a topic, use Web Research or Literature Review. For enhancing existing reports, use Report Enhancement.

How It Works

flowchart TD
    subgraph input["1. Input Resolution"]
        A[URL / PDF / Text] --> B[Convert to Markdown]
        B --> C[Create Zotero Entry]
    end

    subgraph core["2. Core Processing"]
        C --> D[Detect Language]
        D --> E[Generate Summary]
        D --> F[Extract Metadata]
        E --> G[Save Summary]
        F --> G
    end

    subgraph validation["3. Validation"]
        G --> H{Content Valid?}
        H -->|Yes| I[Update Zotero]
        H -->|No| J[Finalize with Errors]
    end

    subgraph chapters["4. Long Document Handling"]
        I --> K{Long Document?}
        K -->|Yes| L[Detect Chapters]
        L --> M[Summarize Each Chapter]
        M --> N[Create 10:1 Summary]
        K -->|No| O[Finalize]
        N --> O
    end

    O --> P[Processed Document]
    J --> P

    style input fill:#e8f4f8
    style core fill:#f0f8e8
    style validation fill:#fff8e8
    style chapters fill:#f8e8f4
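
The two decision points in the flowchart can be read as plain routing logic. The sketch below is illustrative only: the function name `next_steps` and the step labels are taken from the diagram for readability and are not the names used in the codebase.

```python
def next_steps(content_valid: bool, is_long_document: bool) -> list[str]:
    """Return the steps that follow 'Save Summary', mirroring the flowchart."""
    if not content_valid:
        # Validation failed: skip the Zotero update and the chapter branch.
        return ["Finalize with Errors"]

    steps = ["Update Zotero"]
    if is_long_document:
        # Long documents take the chapter branch before finalizing.
        steps += ["Detect Chapters", "Summarize Each Chapter", "Create 10:1 Summary"]
    steps.append("Finalize")
    return steps
```

Both terminal paths end in a processed document; the difference is whether the result carries errors or the extra 10:1 summary.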

The Steps Explained

1. Input Resolution: accepts URLs, local file paths, or raw markdown text. PDFs are converted to markdown, with OCR applied when needed. EPUB content is extracted. Web pages are scraped and cleaned.
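
To make the three input kinds concrete, here is a minimal sketch of how a source string might be classified. The helper `classify_source` and its rules are assumptions for illustration; the actual logic in input_resolver.py may differ.

```python
import os
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Classify an input as 'url', 'file', or 'text' (hypothetical helper)."""
    candidate = source.strip()
    try:
        parsed = urlparse(candidate)
    except ValueError:
        # Malformed URL-like input is treated as plain text.
        return "text"
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # A single-line string that points at an existing file is a file path.
    if "\n" not in candidate and os.path.isfile(os.path.expanduser(candidate)):
        return "file"
    # Everything else is assumed to be raw markdown text.
    return "text"
```

Checking for a URL first avoids touching the filesystem for remote sources, and falling back to "text" means unrecognized input is still processed rather than rejected.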

2. Core Processing: runs two processes in parallel.

  • Summary generation: Creates a concise summary in the document’s original language (plus English translation if non-English)
  • Metadata extraction: Identifies authors, publication date, DOI, and other bibliographic information
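
The parallel step can be sketched with `asyncio.gather`. The two coroutines below are placeholders, not the real summary or metadata agents, and their names and return values are assumptions.

```python
import asyncio


async def generate_summary(text: str) -> str:
    """Placeholder for the summary step (a real version would call a model)."""
    await asyncio.sleep(0)
    return text[:40]


async def extract_metadata(text: str) -> dict:
    """Placeholder for the metadata step."""
    await asyncio.sleep(0)
    return {"word_count": len(text.split())}


async def core_processing(text: str) -> dict:
    # gather() runs both coroutines concurrently and waits for both results.
    summary, metadata = await asyncio.gather(
        generate_summary(text), extract_metadata(text)
    )
    return {"summary": summary, "metadata": metadata}


result = asyncio.run(core_processing("Attention mechanisms relate positions of a sequence."))
```

Because neither step depends on the other's output, running them concurrently shortens the wait when both are network-bound.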

3. Validation: checks that the content and the extracted metadata are consistent and valid. If they are, the Zotero entry is updated with the extracted metadata; if not, the document is finalized with errors.

4. Long Document Handling: for documents over ~3,000 words, detects chapter structure, summarizes each chapter, and generates a condensed “10:1 summary” (roughly one tenth of the original length) to make long documents more accessible.
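
As a rough illustration of the threshold and the 10:1 target, the sketch below counts whitespace-separated words. The counting method, the constant names, and the helper functions are assumptions; the workflow may measure length differently.

```python
LONG_DOCUMENT_WORDS = 3000  # approximate threshold stated above
COMPRESSION_RATIO = 10      # "10:1" means about one tenth of the original


def word_count(markdown: str) -> int:
    """Count whitespace-separated words."""
    return len(markdown.split())


def is_long_document(markdown: str) -> bool:
    return word_count(markdown) > LONG_DOCUMENT_WORDS


def target_summary_words(markdown: str) -> int:
    """Word budget for a 10:1 summary of the given text."""
    return max(1, word_count(markdown) // COMPRESSION_RATIO)
```

Under these assumptions, a 4,500-word document counts as long and gets a summary budget of about 450 words.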

Inputs

| Input | Description | Example |
| --- | --- | --- |
| Source | URL, file path, or markdown text | https://arxiv.org/pdf/2301.00001 |
| Title (optional) | Document title | “Attention Is All You Need” |
| Item type (optional) | Zotero item type | “journalArticle”, “book”, “webpage” |
| Languages (optional) | OCR languages to try | [“English”, “German”] |

Outputs

| Output | Description |
| --- | --- |
| Markdown | Full document content as markdown |
| Short summary | Concise summary (original language + English if applicable) |
| 10:1 summary | Condensed version for long documents |
| Metadata | Extracted bibliographic information |
| Zotero key | Citation key for reference management |

Batch Processing

Multiple documents can be processed together with process_documents_batch(). Benefits:

  • Concurrent execution (configurable parallelism)
  • 50% cost reduction via batch API
  • Graceful error handling (failures don’t stop the batch)
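
The concurrency and error-handling behavior can be sketched as follows. This is not the implementation of process_documents_batch(): `process_one`, `process_batch`, and `max_concurrent` are hypothetical names, and the batch API cost saving is not modeled here.

```python
import asyncio


async def process_one(source: str) -> dict:
    """Stand-in for processing a single document."""
    await asyncio.sleep(0)
    if not source:
        raise ValueError("empty source")
    return {"source": source, "status": "ok"}


async def process_batch(sources: list[str], max_concurrent: int = 3) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(source: str) -> dict:
        async with semaphore:  # at most max_concurrent documents in flight
            try:
                return await process_one(source)
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                return {"source": source, "status": "error", "error": str(exc)}

    # gather() preserves input order, so results line up with sources.
    return await asyncio.gather(*(guarded(s) for s in sources))


results = asyncio.run(process_batch(["a.pdf", "", "https://example.com/paper"]))
```

Catching exceptions per document is what keeps one bad input from stopping the rest; each caller can then inspect the `status` field of every result.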

Example

Input:
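
A call along these lines would match the output described next. This is a sketch under assumptions: the keyword names `source` and `title` mirror the Inputs table but are not confirmed parameter names, and the arXiv PDF URL for the paper is chosen for illustration.

```python
# Hypothetical keyword arguments; the real parameter names may differ.
request = {
    "source": "https://arxiv.org/pdf/1706.03762",
    "title": "Attention Is All You Need",
}

# Assumed usage, not executed here because it needs the workflow package:
# result = process_document(**request)  # may require `await` if it is async
```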

Typical output:

  • Markdown: ~15 pages of converted content
  • Short summary: 200-word summary of the transformer architecture paper
  • Metadata: Authors (Vaswani et al.), date (2017), arXiv ID
  • Zotero entry created with full citation information

Developer Reference

Entry point: workflows/document_processing/graph.py — exposes process_document() and process_documents_batch()

Graph construction: Same file — create_document_processing_graph()

State: workflows/document_processing/state.py (DocumentProcessingState)
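
For orientation, a state object for this kind of workflow might look like the sketch below. Every field name is hypothetical and derived from the Outputs table; the real DocumentProcessingState may be structured differently.

```python
from typing import Optional, TypedDict


class DocumentProcessingStateSketch(TypedDict, total=False):
    """Illustrative state shape; all field names are hypothetical."""
    source: str
    markdown: str
    language: str
    short_summary: str
    tenth_summary: Optional[str]  # only set for long documents
    metadata: dict
    zotero_key: str
    errors: list


# Nodes would fill in fields as the document moves through the graph.
state: DocumentProcessingStateSketch = {"source": "paper.pdf", "errors": []}
state["markdown"] = "# Title"
```

Declaring the fields with `total=False` lets each node add only the keys it produces.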

Nodes: workflows/document_processing/nodes/

  • input_resolver.py — URL/PDF/EPUB/text handling
  • zotero_stub.py — creates initial Zotero entry
  • language_detector.py — identifies document language
  • summary_agent.py — generates summaries
  • metadata_agent.py — extracts bibliographic data
  • content_metadata_validator.py — consistency checks
  • update_zotero.py — updates Zotero with metadata
  • chapter_detector.py — identifies document structure
  • save_short_summary.py / save_tenth_summary.py — persists summaries
  • finalizer.py — packages output

Subgraphs: workflows/document_processing/subgraphs/chapter_summarization/

  • graph.py — chapter-by-chapter summarization
  • chunking.py — intelligent text splitting
  • nodes.py — chapter processing workers

Prompts: workflows/document_processing/prompts.py — summarization and extraction prompts