Recency-Aware Paper Selection for Academic Literature Reviews

Citation-based filtering creates a fundamental recency bias in academic literature reviews. With a threshold of 10 citations, a 2024 paper with five citations gets filtered out, while a 2014 paper with 50 citations passes. This pattern addresses the problem by treating recent and older papers differently.

The problem

Citation counts are a cumulative metric. Papers need time to be read and cited, and those citations need time to propagate through citation databases:

Paper Age	Typical Citations	Filtering Outcome (threshold=10)
10+ years	100-1000+	Passes
5-10 years	20-100	Usually passes
2-5 years	5-30	Often fails
Less than two years	0-10	Almost always filtered out

A 2024 study found that NLP research shows a 12.8 percent decline in citation age, a trend termed “citation amnesia” and the strongest among the 20 fields studied. Using a single citation threshold systematically excludes cutting-edge developments.
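
To make the bias concrete, here is a minimal sketch that applies one threshold to a handful of made-up papers. The records and the `passes_threshold` helper are illustrative assumptions, not data from the study above.

```python
# Hypothetical records for illustration only.
papers = [
    {"title": "Foundational method", "year": 2014, "citations": 50},
    {"title": "Mid-career follow-up", "year": 2020, "citations": 12},
    {"title": "Recent breakthrough", "year": 2024, "citations": 5},
    {"title": "Very recent preprint", "year": 2025, "citations": 0},
]


def passes_threshold(paper: dict, min_citations: int = 10) -> bool:
    """Single-threshold filter: ignores how long the paper has existed."""
    return paper["citations"] >= min_citations


kept = [p["title"] for p in papers if passes_threshold(p)]
dropped = [p["title"] for p in papers if not passes_threshold(p)]
print("kept:", kept)
print("dropped:", dropped)
```

Both papers published in 2024 or later are dropped, regardless of how relevant they are.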

The solution

The pattern uses three components:

Two-phase search with different citation thresholds for recent vs. older papers
Overcollection (three times target) to ensure enough candidates for quota enforcement
Quota-based finalization guaranteeing a minimum percentage of recent papers

graph TD
    A[Quality Settings] --> B[Two-Phase Search]
    B --> C[Phase 1: Recent<br/>min_citations = 0]
    B --> D[Phase 2: Older<br/>min_citations = 10]
    C --> E[Overcollect three times target]
    D --> E
    E --> F[Quota-Based Finalization]
    F --> G[Final Corpus<br/>25 percent recent guaranteed]

Two-phase search

The key insight: Recent papers have not had time to accumulate citations, so applying a citation threshold to them is meaningless.

import asyncio

# openalex_search and deduplicate_by_doi are assumed to be defined elsewhere.


async def two_phase_keyword_search(
    queries: list[str],
    cutoff_year: int,
    min_citations: int,
) -> list[dict]:
    """Search with different thresholds for recent vs. older papers."""
 
    async def search_query(query: str) -> list[dict]:
        results = []
 
        # Phase 1: Recent papers with no citation threshold
        # A 2024 paper with five citations may be highly important;
        # it just has not had time to accumulate citations.
        recent_results = await openalex_search(
            query=query,
            from_year=cutoff_year,
            min_citations=0,  # No threshold for recent
        )
        results.extend(recent_results)
 
        # Phase 2: Older papers with normal citation threshold
        # A 2015 paper with only five citations after 10+ years
        # is likely not impactful.
        older_results = await openalex_search(
            query=query,
            to_year=cutoff_year - 1,
            min_citations=min_citations,  # Normal threshold
        )
        results.extend(older_results)
 
        return results
 
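    per_query = await asyncio.gather(*(search_query(q) for q in queries))
    # gather returns one list per query; flatten before deduplicating
    all_results = [paper for results in per_query for paper in results]
    return deduplicate_by_doi(all_results)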
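
The snippet above does not define `openalex_search` or `deduplicate_by_doi`. One way the pieces might look is sketched below. `build_openalex_filter` is a hypothetical helper that turns the year and citation bounds into a filter string in the style of the OpenAlex works endpoint (`from_publication_date`, `to_publication_date`, `cited_by_count:>N`). This version of `deduplicate_by_doi` expects a flat list of records and keeps the first record seen for each DOI. The record shape (a `doi` key) and the choice to keep records without a DOI are assumptions.

```python
from typing import Optional


def build_openalex_filter(
    from_year: Optional[int] = None,
    to_year: Optional[int] = None,
    min_citations: int = 0,
) -> str:
    """Build a comma-separated filter string for a works query (hypothetical helper)."""
    parts = []
    if from_year is not None:
        parts.append(f"from_publication_date:{from_year}-01-01")
    if to_year is not None:
        parts.append(f"to_publication_date:{to_year}-12-31")
    if min_citations > 0:
        # "at least N citations" expressed as "more than N - 1"
        parts.append(f"cited_by_count:>{min_citations - 1}")
    return ",".join(parts)


def deduplicate_by_doi(papers: list[dict]) -> list[dict]:
    """Keep the first record per DOI; records without a DOI are kept as-is."""
    seen: set[str] = set()
    unique = []
    for paper in papers:
        doi = (paper.get("doi") or "").lower()
        if doi and doi in seen:
            continue
        if doi:
            seen.add(doi)
        unique.append(paper)
    return unique


# Phase 1 and phase 2 filters for cutoff_year=2022, min_citations=10
print(build_openalex_filter(from_year=2022))
print(build_openalex_filter(to_year=2021, min_citations=10))
```

Because a query can match the same paper in several searches, deduplication belongs after all per-query results have been merged.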

The same principle applies to citation fetching. Forward citations (papers that cite your seed papers) should use two-phase thresholds. Backward citations don’t need this treatment since they are older by definition.
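
A minimal sketch of the two-phase rule applied to already-fetched forward citations follows. The function name `filter_forward_citations` and the `year` and `cited_by_count` keys are assumptions for illustration.

```python
def filter_forward_citations(
    citing_papers: list[dict],
    cutoff_year: int,
    min_citations: int,
) -> list[dict]:
    """Apply the two-phase rule to papers that cite a seed paper."""
    kept = []
    for paper in citing_papers:
        is_recent = paper.get("year", 0) >= cutoff_year
        # Recent papers skip the threshold; older ones must meet it.
        if is_recent or paper.get("cited_by_count", 0) >= min_citations:
            kept.append(paper)
    return kept


citing = [
    {"doi": "10.1/a", "year": 2024, "cited_by_count": 2},
    {"doi": "10.1/b", "year": 2018, "cited_by_count": 40},
    {"doi": "10.1/c", "year": 2017, "cited_by_count": 3},
]
kept = filter_forward_citations(citing, cutoff_year=2022, min_citations=10)
print([p["doi"] for p in kept])
```

The 2024 paper survives with two citations, while the 2017 paper with three citations is dropped.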

Overcollection

To enforce a quota effectively, you need enough candidates in each recency bucket. Collect three times the target corpus size before finalization.

# Why three times?
# - 25 percent recency quota means we need enough recent papers to fill that slot
# - Discovery yield is ~60-70 percent (not all candidates score as relevant)
# - Three times provides headroom while limiting API cost
OVERCOLLECTION_MULTIPLIER = 3
 
def get_collection_target(max_papers: int) -> int:
    return max_papers * OVERCOLLECTION_MULTIPLIER

For a target of 100 papers, collect 300 candidates. The larger pool makes it much more likely that finalization finds the 25 relevant recent papers needed for the 25 percent quota.
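
The arithmetic behind that headroom can be checked with a short sketch. The 60 percent yield is the lower bound from the code comment above. The share of recent papers among candidates (`recent_share_pct`) is a made-up illustrative value, since the real share depends on the topic, and `quota_headroom` is a hypothetical helper.

```python
def quota_headroom(
    max_papers: int,
    multiplier: int = 3,
    recency_quota: float = 0.25,
    discovery_yield_pct: int = 60,  # lower bound quoted above
    recent_share_pct: int = 30,     # illustrative assumption
) -> dict:
    """Estimate whether the candidate pool holds enough relevant recent papers."""
    candidates = max_papers * multiplier
    relevant = candidates * discovery_yield_pct // 100
    relevant_recent = relevant * recent_share_pct // 100
    needed_recent = int(max_papers * recency_quota)
    return {
        "candidates": candidates,
        "relevant": relevant,
        "relevant_recent": relevant_recent,
        "needed_recent": needed_recent,
        "quota_reachable": relevant_recent >= needed_recent,
    }


with_overcollection = quota_headroom(100)
without_overcollection = quota_headroom(100, multiplier=1)
print(with_overcollection)
print(without_overcollection)
```

Under these assumptions, collecting 300 candidates leaves 54 relevant recent papers for 25 slots, while collecting only 100 leaves 18 and misses the quota.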

Quota-based finalization

After collecting candidates and scoring for relevance, enforce the recency quota:

def finalize_with_recency_quota(
    paper_corpus: dict[str, dict],
    max_papers: int,
    recency_quota: float,
    cutoff_year: int,
) -> list[str]:
    """Select final corpus with guaranteed recency quota."""
 
    # Partition by recency
    recent = [(doi, p) for doi, p in paper_corpus.items()
              if p.get("year", 0) >= cutoff_year]
    older = [(doi, p) for doi, p in paper_corpus.items()
             if p.get("year", 0) < cutoff_year]
 
    # Sort each partition by relevance score
    recent.sort(key=lambda x: x[1].get("relevance_score", 0.5), reverse=True)
    older.sort(key=lambda x: x[1].get("relevance_score", 0.5), reverse=True)
 
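    # Select: top recent up to quota, then fill with older
    target_recent = int(max_papers * recency_quota)
    recent_selected = recent[:target_recent]
    slots_for_older = max_papers - len(recent_selected)
    older_selected = older[:slots_for_older]

    # Backfill: if older papers ran out, use the next-best recent papers
    remaining = max_papers - len(recent_selected) - len(older_selected)
    if remaining > 0:
        recent_selected += recent[target_recent:target_recent + remaining]

    return [doi for doi, _ in recent_selected + older_selected]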

The algorithm:

Partition papers into recent vs. older
Sort each partition by relevance score
Select top recent papers up to quota (for example, 25 of 100)
Fill remaining slots with top older papers
If older papers run out before the corpus is full, fill the remaining slots with the next-best recent papers
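
The quota can only be met when enough recent candidates exist, so it helps to report what the final corpus actually contains. A minimal sketch, assuming the same `paper_corpus` shape as above; `summarize_composition` is a hypothetical helper, not part of the original code.

```python
def summarize_composition(
    selected_dois: list[str],
    paper_corpus: dict[str, dict],
    cutoff_year: int,
) -> dict:
    """Count recent vs. older papers in the selected corpus."""
    recent = sum(
        1 for doi in selected_dois
        if paper_corpus[doi].get("year", 0) >= cutoff_year
    )
    total = len(selected_dois)
    return {
        "total": total,
        "recent": recent,
        "older": total - recent,
        "recent_fraction": recent / total if total else 0.0,
    }


# Toy corpus for illustration only.
corpus = {
    "10.1/a": {"year": 2024, "relevance_score": 0.9},
    "10.1/b": {"year": 2016, "relevance_score": 0.8},
    "10.1/c": {"year": 2012, "relevance_score": 0.7},
    "10.1/d": {"year": 2019, "relevance_score": 0.6},
}
summary = summarize_composition(list(corpus), corpus, cutoff_year=2022)
print(summary)
```

Logging this summary after finalization shows at a glance whether the corpus reached the intended recent share.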

Configuration

Encode recency settings in quality presets:

QUALITY_PRESETS = {
    "quick": {
        "max_papers": 30,
        "recency_years": 3,
        "recency_quota": 0.25,
        "min_citations_filter": 5,
    },
    "standard": {
        "max_papers": 50,
        "recency_years": 3,
        "recency_quota": 0.25,
        "min_citations_filter": 10,
    },
    "high_quality": {
        "max_papers": 200,
        "recency_years": 3,
        "recency_quota": 0.30,  # Higher for cutting-edge coverage
        "min_citations_filter": 10,
    },
}

Why these defaults:

recency_years: 3 aligns with typical citation lag (papers typically take three to five years to reach their citation peak)
recency_quota: 0.25 ensures 25 percent recent coverage without sacrificing foundational work
Adjust for your domain: Fast-moving fields (ML, biotech) may want 40 percent; historical analysis may want 10 percent
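
The presets store `recency_years`, while the search and finalization functions take `cutoff_year`. One plausible way to connect them is sketched below. The rule `cutoff_year = current_year - recency_years` and the helper name `resolve_settings` are assumptions, since the text does not say how the cutoff is derived.

```python
from datetime import date
from typing import Optional

# Trimmed copy of one preset from above so the sketch runs on its own.
QUALITY_PRESETS = {
    "standard": {
        "max_papers": 50,
        "recency_years": 3,
        "recency_quota": 0.25,
        "min_citations_filter": 10,
    },
}


def resolve_settings(preset_name: str, current_year: Optional[int] = None) -> dict:
    """Turn a preset into concrete parameters for search and finalization."""
    preset = QUALITY_PRESETS[preset_name]
    year = current_year if current_year is not None else date.today().year
    return {
        "max_papers": preset["max_papers"],
        "cutoff_year": year - preset["recency_years"],  # assumed rule
        "min_citations": preset["min_citations_filter"],
        "recency_quota": preset["recency_quota"],
        "target_recent": int(preset["max_papers"] * preset["recency_quota"]),
    }


settings = resolve_settings("standard", current_year=2025)
print(settings)
```

Because `int()` truncates, the 50-paper preset reserves 12 recent slots (24 percent) rather than 12.5. Rounding up instead would be a reasonable alternative if the quota should never fall below the configured share.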

When to use this pattern

Use when:

Literature review needs both foundational and emerging work
Citation filtering is used for quality control
Fast-moving fields where recent breakthroughs matter
Quality tiers need configurable recency balance

Don’t use when:

Only historical or archival research matters
No citation filtering is applied
Corpus is small enough to include all papers
Recency isn’t a meaningful quality signal

Trade-offs

Benefits:

Balanced coverage of both seminal and emerging work
Recent papers aren’t penalized for low citations
Configurable via quality presets
Transparent composition logging

Costs:

Up to three times as many API calls due to overcollection
Recent papers may have less rigorous peer review
Fixed quota may not suit all topics
Two-phase search adds implementation complexity
