Document Processing

Converts documents (PDFs, URLs, EPUBs) into structured markdown with summaries, metadata extraction, and Zotero integration.

The workflow:

Resolves input (URL, file path, or raw text) into processable content
Creates a Zotero reference entry
Detects language and generates summaries
Extracts metadata (authors, dates, DOIs)
For long documents, generates condensed chapter summaries

When to Use This

Use this workflow to ingest documents into the knowledge base: it turns raw files or URLs into searchable, summarized content with proper citation management.

For researching a topic, use Web Research or Literature Review. For enhancing existing reports, use Report Enhancement.

How It Works

flowchart TD
    subgraph input["1. Input Resolution"]
        A[URL / PDF / Text] --> B[Convert to Markdown]
        B --> C[Create Zotero Entry]
    end

    subgraph core["2. Core Processing"]
        C --> D[Detect Language]
        D --> E[Generate Summary]
        D --> F[Extract Metadata]
        E --> G[Save Summary]
        F --> G
    end

    subgraph validation["3. Validation"]
        G --> H{Content Valid?}
        H -->|Yes| I[Update Zotero]
        H -->|No| J[Finalize with Errors]
    end

    subgraph chapters["4. Long Document Handling"]
        I --> K{Long Document?}
        K -->|Yes| L[Detect Chapters]
        L --> M[Summarize Each Chapter]
        M --> N[Create 10:1 Summary]
        K -->|No| O[Finalize]
        N --> O
    end

    O --> P[Processed Document]
    J --> P

    style input fill:#e8f4f8
    style core fill:#f0f8e8
    style validation fill:#fff8e8
    style chapters fill:#f8e8f4
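
The branching in the diagram can be read as a small routing function. This is a minimal sketch, not the workflow's actual graph code: the function name `plan_steps` and its two boolean parameters are illustrative assumptions, and the step labels are copied from the diagram boxes.

```python
def plan_steps(content_valid: bool, long_document: bool) -> list[str]:
    """Return the diagram's step labels in the order one document visits them.

    Hypothetical helper for illustration only; the real graph is built in graph.py.
    """
    steps = [
        "Convert to Markdown",
        "Create Zotero Entry",
        "Detect Language",
        # Summary and metadata run in parallel in the diagram; listed in sequence here.
        "Generate Summary",
        "Extract Metadata",
        "Save Summary",
    ]
    if not content_valid:
        # Validation failed: skip the Zotero update and the chapter handling.
        return steps + ["Finalize with Errors"]
    steps.append("Update Zotero")
    if long_document:
        steps += ["Detect Chapters", "Summarize Each Chapter", "Create 10:1 Summary"]
    steps.append("Finalize")
    return steps
```

Both end states feed the same "Processed Document" output in the diagram, so a caller would presumably need to inspect the result for errors rather than rely on an exception.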

The Steps Explained

1. Input Resolution: Accepts URLs, local file paths, or raw markdown text. PDFs are converted, with OCR applied if needed. EPUBs are extracted. Web pages are scraped and cleaned.
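
One way to tell the three input kinds apart is sketched below. The helper name `classify_source`, the suffix list, and the heuristics are assumptions made for illustration; the actual logic in `input_resolver.py` may differ.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed set of file suffixes for this sketch; not taken from the workflow.
KNOWN_SUFFIXES = {".pdf", ".epub", ".md", ".txt"}


def classify_source(source: str) -> str:
    """Classify a source string as 'url', 'file', or 'text' (hypothetical helper)."""
    candidate = source.strip()
    parsed = urlparse(candidate)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # A single-line string ending in a known document suffix is treated as a path.
    if "\n" not in candidate and Path(candidate).suffix.lower() in KNOWN_SUFFIXES:
        return "file"
    # Everything else is handled as raw markdown text.
    return "text"
```

For example, `classify_source("papers/attention.pdf")` falls through the URL check and is classified by its suffix, while a multi-line markdown string is treated as text.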

2. Core Processing: Runs two processes in parallel:

Summary generation: Creates a concise summary in the document’s original language (plus an English translation if the document is not in English)
Metadata extraction: Identifies authors, publication date, DOI, and other bibliographic information
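
The parallel step can be illustrated with `asyncio.gather`. This is a sketch under assumptions: the two coroutines are trivial stand-ins for the summary and metadata agents, and whether the workflow itself uses asyncio for this step is not stated here.

```python
import asyncio


async def generate_summary(markdown: str) -> str:
    # Stand-in for the summary agent: returns the first sentence.
    await asyncio.sleep(0)
    return markdown.split(".")[0].strip() + "."


async def extract_metadata(markdown: str) -> dict:
    # Stand-in for the metadata agent: only counts words.
    await asyncio.sleep(0)
    return {"word_count": len(markdown.split())}


async def core_processing(markdown: str) -> dict:
    # Both coroutines are scheduled together; gather returns results in argument order.
    summary, metadata = await asyncio.gather(
        generate_summary(markdown),
        extract_metadata(markdown),
    )
    return {"summary": summary, "metadata": metadata}


result = asyncio.run(
    core_processing("Transformers rely on attention. They drop recurrence.")
)
```

Because neither step depends on the other's output, running them together shortens the wall-clock time to roughly that of the slower of the two.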

3. Validation: Checks that content and metadata are consistent and valid. If they are, updates the Zotero entry with the extracted metadata; if not, the document is finalized with errors.

4. Long Document Handling: For documents longer than roughly 3,000 words, detects chapter structure, summarizes each chapter, and generates a condensed “10:1 summary” (10x compression) to make long documents more accessible.
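
The threshold and the compression ratio amount to simple arithmetic, sketched here. The helper names and the whitespace-based word count are assumptions; the workflow may count words or tokens differently.

```python
LONG_DOCUMENT_WORDS = 3000  # approximate threshold from the description above
COMPRESSION_RATIO = 10      # "10:1": one summary word per ten source words


def is_long_document(markdown: str, threshold: int = LONG_DOCUMENT_WORDS) -> bool:
    """Compare a whitespace word count against the threshold (hypothetical helper)."""
    return len(markdown.split()) > threshold


def target_summary_words(markdown: str, ratio: int = COMPRESSION_RATIO) -> int:
    """Target length of the condensed summary, never less than one word."""
    return max(1, len(markdown.split()) // ratio)
```

Under these assumptions, a 12,000-word document qualifies as long and would get a summary of about 1,200 words.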

Inputs

| Input | Description | Example |
| --- | --- | --- |
| Source | URL, file path, or markdown text | "https://arxiv.org/pdf/2301.00001" |
| Title (optional) | Document title | "Attention Is All You Need" |
| Item type (optional) | Zotero item type | "journalArticle", "book", "webpage" |
| Languages (optional) | OCR languages to try | ["English", "German"] |
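
The inputs can be pictured as a small record. The `DocumentInput` class below is hypothetical and only mirrors the table; it is not the workflow's real type, and the actual parameter names of `process_document()` are not shown in this document.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentInput:
    """Hypothetical container mirroring the inputs table."""

    source: str                       # URL, file path, or markdown text
    title: Optional[str] = None       # optional document title
    item_type: Optional[str] = None   # optional Zotero item type
    languages: list[str] = field(default_factory=list)  # optional OCR languages

    def __post_init__(self) -> None:
        # Source is the only required input, so reject blank values early.
        if not self.source.strip():
            raise ValueError("source must not be empty")


request = DocumentInput(
    source="https://arxiv.org/pdf/1706.03762",
    title="Attention Is All You Need",
    item_type="journalArticle",
    languages=["English"],
)
```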

Outputs

| Output | Description |
| --- | --- |
| Markdown | Full document content as markdown |
| Short summary | Concise summary (original language + English if applicable) |
| 10:1 summary | Condensed version for long documents |
| Metadata | Extracted bibliographic information |
| Zotero key | Citation key for reference management |

Batch Processing

Multiple documents can be processed together with process_documents_batch(). Benefits:

Concurrent execution (configurable parallelism)
50% cost reduction via batch API
Graceful error handling (failures don’t stop the batch)
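
Bounded concurrency with per-document error capture can be sketched as follows. This is not the implementation of `process_documents_batch()`: the functions `process_one` and `process_batch`, the `max_concurrent` parameter, and the result shape are assumptions, and the batch API cost reduction is not modeled.

```python
import asyncio


async def process_one(source: str) -> dict:
    # Stand-in for a single-document run; fails on purpose for one kind of input.
    await asyncio.sleep(0)
    if source.endswith(".broken"):
        raise ValueError(f"could not convert {source}")
    return {"source": source, "status": "ok"}


async def process_batch(sources: list[str], max_concurrent: int = 2) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrent)  # caps in-flight documents

    async def guarded(source: str) -> dict:
        async with semaphore:
            try:
                return await process_one(source)
            except Exception as exc:
                # Record the failure instead of aborting the rest of the batch.
                return {"source": source, "status": "error", "error": str(exc)}

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(s) for s in sources))


results = asyncio.run(process_batch(["a.pdf", "b.broken", "c.epub"]))
```

Catching the exception inside each task keeps one bad document from cancelling its siblings, which matches the "failures don't stop the batch" behavior described above.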

Example

Input:

Source: “https://arxiv.org/pdf/1706.03762” (Attention Is All You Need)
Languages: [“English”]

Typical output:

Markdown: ~15 pages of converted content
Short summary: 200-word summary of the transformer architecture paper
Metadata: Authors (Vaswani et al.), date (2017), arXiv ID
Zotero entry created with full citation information

Developer Reference

Entry point: workflows/document_processing/graph.py — exposes process_document() and process_documents_batch()

Graph construction: Same file — create_document_processing_graph()

State: workflows/document_processing/state.py — DocumentProcessingState

Nodes: workflows/document_processing/nodes/

input_resolver.py — URL/PDF/EPUB/text handling
zotero_stub.py — creates initial Zotero entry
language_detector.py — identifies document language
summary_agent.py — generates summaries
metadata_agent.py — extracts bibliographic data
content_metadata_validator.py — consistency checks
update_zotero.py — updates Zotero with metadata
chapter_detector.py — identifies document structure
save_short_summary.py / save_tenth_summary.py — persists summaries
finalizer.py — packages output

Subgraphs: workflows/document_processing/subgraphs/chapter_summarization/

graph.py — chapter-by-chapter summarization
chunking.py — intelligent text splitting
nodes.py — chapter processing workers

Prompts: workflows/document_processing/prompts.py — summarization and extraction prompts
