Document Processing
Converts documents (PDFs, URLs, EPUBs) into structured markdown with summaries, metadata extraction, and Zotero integration.
The workflow:
- Resolves input (URL, file path, or raw text) into processable content
- Creates a Zotero reference entry
- Detects language and generates summaries
- Extracts metadata (authors, dates, DOIs)
- For long documents, generates condensed chapter summaries
When to Use This
Use this workflow to ingest documents into the knowledge base: it turns raw files or URLs into searchable, summarized content with proper citation management.
For researching a topic, use Web Research or Literature Review. For enhancing existing reports, use Report Enhancement.
How It Works
flowchart TD
    subgraph input["1. Input Resolution"]
        A[URL / PDF / Text] --> B[Convert to Markdown]
        B --> C[Create Zotero Entry]
    end
    subgraph core["2. Core Processing"]
        C --> D[Detect Language]
        D --> E[Generate Summary]
        D --> F[Extract Metadata]
        E --> G[Save Summary]
        F --> G
    end
    subgraph validation["3. Validation"]
        G --> H{Content Valid?}
        H -->|Yes| I[Update Zotero]
        H -->|No| J[Finalize with Errors]
    end
    subgraph chapters["4. Long Document Handling"]
        I --> K{Long Document?}
        K -->|Yes| L[Detect Chapters]
        L --> M[Summarize Each Chapter]
        M --> N[Create 10:1 Summary]
        K -->|No| O[Finalize]
        N --> O
    end
    O --> P[Processed Document]
    J --> P
    style input fill:#e8f4f8
    style core fill:#f0f8e8
    style validation fill:#fff8e8
    style chapters fill:#f8e8f4
The Steps Explained
1. Input Resolution. Accepts URLs, local file paths, or raw markdown text. PDFs are converted to markdown, with OCR applied if needed. EPUB text is extracted. Web pages are scraped and cleaned.
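As a rough illustration of the dispatch in this step, the sketch below classifies a source string before any conversion happens. The function name `classify_source` and its return labels are hypothetical and are not taken from input_resolver.py; the real resolver may apply different rules.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return 'url', 'pdf', 'epub', or 'text' for a source string.

    Hypothetical helper: it only shows the dispatch, not the conversion.
    """
    text = source.strip()

    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"

    # Multi-line input is assumed to be raw markdown, not a path.
    if "\n" in text:
        return "text"

    suffix = Path(text).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".epub":
        return "epub"

    # Anything else falls through to raw text in this sketch.
    return "text"
```

A URL that points at a PDF is still classified as a URL here; deciding how to convert the downloaded bytes would be a separate step.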
2. Core Processing. After language detection, runs two processes in parallel:
- Summary generation: Creates a concise summary in the document’s original language (plus English translation if non-English)
- Metadata extraction: Identifies authors, publication date, DOI, and other bibliographic information
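The two parallel branches above can be sketched with asyncio. This is a minimal illustration, not the workflow's actual graph wiring: `generate_summary`, `extract_metadata`, and `run_core_processing` are hypothetical stand-ins, and the placeholder bodies replace what would be model calls.

```python
import asyncio


async def generate_summary(markdown: str, language: str) -> str:
    """Placeholder for the summarization call (hypothetical)."""
    await asyncio.sleep(0)  # stands in for a slow model request
    return markdown[:80]


async def extract_metadata(markdown: str) -> dict:
    """Placeholder for the metadata extraction call (hypothetical)."""
    await asyncio.sleep(0)
    first_line = markdown.splitlines()[0] if markdown else ""
    return {"title": first_line.lstrip("# ").strip()}


async def run_core_processing(markdown: str, language: str) -> dict:
    # Neither branch depends on the other's result, so both can be
    # awaited together and merged once they finish.
    summary, metadata = await asyncio.gather(
        generate_summary(markdown, language),
        extract_metadata(markdown),
    )
    return {"short_summary": summary, "metadata": metadata}
```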
3. Validation. Checks that the content and the extracted metadata are consistent and valid. If they are, the Zotero entry is updated with the extracted metadata; if not, the document is finalized with the errors recorded.
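The branch after validation can be pictured as a small routing function. The sketch below is an assumption about the shape of that decision: the state key `validation_errors`, the returned labels, and the toy checks in `validate` are all hypothetical, and the real validator in content_metadata_validator.py may check far more.

```python
def validate(markdown: str, metadata: dict) -> list[str]:
    """Toy consistency checks (hypothetical, for illustration only)."""
    errors = []
    if not markdown.strip():
        errors.append("empty content")
    title = metadata.get("title", "")
    if title and title.lower() not in markdown.lower():
        errors.append("title not found in content")
    return errors


def route_after_validation(state: dict) -> str:
    """Choose the next step from the validation result."""
    errors = state.get("validation_errors", [])
    return "finalize_with_errors" if errors else "update_zotero"
```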
4. Long Document Handling. For documents over roughly 3,000 words, detects the chapter structure, summarizes each chapter, and combines the results into a condensed “10:1 summary” (about one tenth of the original length) to make long documents more accessible.
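The arithmetic behind this step is simple enough to sketch. The constants below restate the approximate 3,000-word threshold and the 10:1 ratio from the text; the function names, and the choice to count words by whitespace splitting, are assumptions made for illustration.

```python
LONG_DOCUMENT_WORDS = 3000  # approximate threshold stated in the text
COMPRESSION_RATIO = 10      # the "10:1" in "10:1 summary"


def word_count(markdown: str) -> int:
    # Whitespace splitting is a rough proxy for a word count.
    return len(markdown.split())


def is_long_document(markdown: str) -> bool:
    return word_count(markdown) > LONG_DOCUMENT_WORDS


def chapter_targets(chapters: dict[str, str]) -> dict[str, int]:
    """Target summary length per chapter, in words, at 10:1 compression."""
    return {
        title: max(1, word_count(body) // COMPRESSION_RATIO)
        for title, body in chapters.items()
    }
```

Under these assumptions, a 1,230-word chapter would be given a target of 123 words.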
Inputs
| Input | Description | Example |
|---|---|---|
| Source | URL, file path, or markdown text | "https://arxiv.org/pdf/2301.00001" |
| Title (optional) | Document title | "Attention Is All You Need" |
| Item type (optional) | Zotero item type | "journalArticle", "book", "webpage" |
| Languages (optional) | OCR languages to try | ["English", "German"] |
Outputs
| Output | Description |
|---|---|
| Markdown | Full document content as markdown |
| Short summary | Concise summary (original language + English if applicable) |
| 10:1 summary | Condensed version for long documents |
| Metadata | Extracted bibliographic information |
| Zotero key | Citation key for reference management |
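One way to picture the inputs and outputs listed in the tables above is as a pair of dataclasses. The class and field names here are hypothetical and do not mirror DocumentProcessingState; the `errors` field is an assumption based on the "Finalize with Errors" path in the diagram.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentInput:
    source: str                      # URL, file path, or markdown text
    title: Optional[str] = None
    item_type: Optional[str] = None  # e.g. "journalArticle", "book", "webpage"
    languages: list[str] = field(default_factory=list)  # OCR languages to try


@dataclass
class ProcessedDocument:
    markdown: str
    short_summary: str
    metadata: dict
    zotero_key: Optional[str] = None
    tenth_summary: Optional[str] = None  # only produced for long documents
    errors: list[str] = field(default_factory=list)  # assumed field
```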
Batch Processing
Multiple documents can be processed together with process_documents_batch(). Benefits:
- Concurrent execution (configurable parallelism)
- 50% cost reduction via batch API
- Graceful error handling (failures don’t stop the batch)
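The concurrency and error-handling behavior listed above can be sketched with asyncio. This is not the implementation of process_documents_batch(): `process_one`, `process_batch`, and the `max_concurrency` parameter are hypothetical, and the sketch does not model the batch API cost reduction.

```python
import asyncio


async def process_one(source: str) -> dict:
    """Stand-in for a single-document run (hypothetical)."""
    await asyncio.sleep(0)
    if not source:
        raise ValueError("empty source")
    return {"source": source, "status": "ok"}


async def process_batch(sources: list[str], max_concurrency: int = 4) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(source: str) -> dict:
        async with semaphore:  # cap the number of in-flight documents
            return await process_one(source)

    # return_exceptions=True keeps one failure from aborting the rest.
    results = await asyncio.gather(
        *(guarded(s) for s in sources), return_exceptions=True
    )
    return [
        {"source": s, "status": "error", "error": str(r)}
        if isinstance(r, Exception) else r
        for s, r in zip(sources, results)
    ]
```

Failed documents come back as error records in the same position as their source, so the caller can match results to inputs without losing the successful ones.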
Example
Input:
- Source: "https://arxiv.org/pdf/1706.03762" (Attention Is All You Need)
- Languages: ["English"]
Typical output:
- Markdown: ~15 pages of converted content
- Short summary: 200-word summary of the transformer architecture paper
- Metadata: Authors (Vaswani et al.), date (2017), arXiv ID
- Zotero entry created with full citation information
Developer Reference
Entry point: workflows/document_processing/graph.py — exposes process_document() and process_documents_batch()
Graph construction: Same file — create_document_processing_graph()
State: workflows/document_processing/state.py — DocumentProcessingState
Nodes: workflows/document_processing/nodes/
- input_resolver.py: URL/PDF/EPUB/text handling
- zotero_stub.py: creates initial Zotero entry
- language_detector.py: identifies document language
- summary_agent.py: generates summaries
- metadata_agent.py: extracts bibliographic data
- content_metadata_validator.py: consistency checks
- update_zotero.py: updates Zotero with metadata
- chapter_detector.py: identifies document structure
- save_short_summary.py / save_tenth_summary.py: persists summaries
- finalizer.py: packages output
Subgraphs: workflows/document_processing/subgraphs/chapter_summarization/
- graph.py: chapter-by-chapter summarization
- chunking.py: intelligent text splitting
- nodes.py: chapter processing workers
Prompts: workflows/document_processing/prompts.py — summarization and extraction prompts