Recency-Aware Paper Selection for Academic Literature Reviews

Citation-based filtering creates a fundamental recency bias in academic literature reviews. With a minimum of 10 citations, a 2024 paper with five citations gets filtered out, while a 2014 paper with 50 citations passes. This pattern addresses the problem by treating recent and older papers differently.

The problem

Citation counts are a cumulative metric. Papers need time to be read and cited, and those citations need time to propagate through citation indexes:

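Paper age          | Typical citations | Filtering outcome (threshold = 10)
-------------------|-------------------|-----------------------------------
10+ years          | 100-1000+         | Passes
5-10 years         | 20-100            | Usually passes
2-5 years          | 5-30              | Often fails
Less than 2 years  | 0-10              | Almost always filtered out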

A 2024 study found that NLP research shows a 12.8 percent decline in citation age, a trend termed “citation amnesia” and the strongest among the 20 fields studied. Using a single citation threshold therefore systematically excludes cutting-edge developments.
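To make the bias concrete, the sketch below applies one threshold to a handful of made-up papers. The records and the `filter_by_citations` helper are illustrative assumptions, not data from the study mentioned above.

```python
# Hypothetical records, invented purely for illustration
SAMPLE_PAPERS = [
    {"title": "Foundational method", "year": 2014, "citations": 50},
    {"title": "Mid-career follow-up", "year": 2019, "citations": 22},
    {"title": "Recent breakthrough", "year": 2024, "citations": 5},
    {"title": "Brand-new preprint", "year": 2025, "citations": 0},
]


def filter_by_citations(papers: list[dict], threshold: int) -> list[dict]:
    """Keep only papers at or above a single citation threshold."""
    return [p for p in papers if p["citations"] >= threshold]


kept = filter_by_citations(SAMPLE_PAPERS, threshold=10)
kept_years = sorted(p["year"] for p in kept)
print(kept_years)  # [2014, 2019]: both recent papers are dropped
```

Neither recent paper survives, regardless of how relevant it is to the review topic.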

The solution

The pattern uses three components:

  1. Two-phase search with different citation thresholds for recent vs. older papers
  2. Overcollection (three times target) to ensure enough candidates for quota enforcement
  3. Quota-based finalization guaranteeing a minimum percentage of recent papers

graph TD
    A[Quality Settings] --> B[Two-Phase Search]
    B --> C[Phase 1: Recent<br/>min_citations = 0]
    B --> D[Phase 2: Older<br/>min_citations = 10]
    C --> E[Overcollect three times target]
    D --> E
    E --> F[Quota-Based Finalization]
    F --> G[Final Corpus<br/>25 percent recent guaranteed]

The key insight: Recent papers have not had time to accumulate citations, so applying a citation threshold to them is meaningless.

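import asyncio

# openalex_search and deduplicate_by_doi are helpers assumed to be defined elsewhere.


async def two_phase_keyword_search(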
    queries: list[str],
    cutoff_year: int,
    min_citations: int,
) -> list[dict]:
    """Search with different thresholds for recent vs. older papers."""
 
    async def search_query(query: str) -> list[dict]:
        results = []
 
        # Phase 1: Recent papers with no citation threshold
        # A 2024 paper with five citations may be highly important;
        # it just has not had time to accumulate citations.
        recent_results = await openalex_search(
            query=query,
            from_year=cutoff_year,
            min_citations=0,  # No threshold for recent
        )
        results.extend(recent_results)
 
        # Phase 2: Older papers with normal citation threshold
        # A 2015 paper with only five citations after 10+ years
        # is likely not impactful.
        older_results = await openalex_search(
            query=query,
            to_year=cutoff_year - 1,
            min_citations=min_citations,  # Normal threshold
        )
        results.extend(older_results)
 
        return results
 
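    # gather() returns one list per query, so flatten before deduplicating
    per_query_results = await asyncio.gather(*[search_query(q) for q in queries])
    all_results = [paper for batch in per_query_results for paper in batch]
    return deduplicate_by_doi(all_results)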
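
The search function relies on a `deduplicate_by_doi` helper that is not shown. One possible sketch follows, assuming it receives a flat list of records with an optional `doi` key. Keeping records that have no DOI and stripping a `https://doi.org/` prefix are assumptions made for this example.

```python
def deduplicate_by_doi(papers: list[dict]) -> list[dict]:
    """Keep the first record seen for each DOI (hypothetical helper)."""
    seen: set[str] = set()
    unique: list[dict] = []
    for paper in papers:
        doi = paper.get("doi")
        if not doi:
            # Assumption: records without a DOI cannot be matched, so keep them
            unique.append(paper)
            continue
        # DOIs are case-insensitive; also strip a URL prefix if one is present
        key = doi.lower().removeprefix("https://doi.org/")
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique


sample = [
    {"doi": "10.1000/ABC", "title": "first copy"},
    {"doi": "https://doi.org/10.1000/abc", "title": "same paper, URL form"},
    {"doi": None, "title": "no DOI"},
    {"title": "also no DOI"},
]
deduped = deduplicate_by_doi(sample)
```

Because the same paper can match both the recent and the older phase of different queries, deduplication has to run after the per-query results are merged.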

The same principle applies to citation fetching. Forward citations (papers that cite your seed papers) should use two-phase thresholds. Backward citations (papers that your seed papers cite) don’t need this treatment since they are older by definition.
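A minimal sketch of the two-phase rule applied to forward citations that have already been fetched is shown here. The field names `year` and `cited_by_count`, and the choice to treat a missing year as "older", are assumptions for illustration.

```python
def passes_two_phase_threshold(
    paper: dict, cutoff_year: int, min_citations: int
) -> bool:
    """Recent papers always pass; older papers must meet the threshold."""
    year = paper.get("year") or 0  # Assumption: unknown year counts as older
    if year >= cutoff_year:
        return True
    return paper.get("cited_by_count", 0) >= min_citations


def filter_forward_citations(
    citing_papers: list[dict], cutoff_year: int, min_citations: int
) -> list[dict]:
    """Apply the two-phase rule to papers that cite a seed paper."""
    return [
        p for p in citing_papers
        if passes_two_phase_threshold(p, cutoff_year, min_citations)
    ]


citing = [
    {"doi": "a", "year": 2024, "cited_by_count": 1},   # recent: kept
    {"doi": "b", "year": 2016, "cited_by_count": 3},   # older, low count: dropped
    {"doi": "c", "year": 2016, "cited_by_count": 40},  # older, high count: kept
    {"doi": "d", "cited_by_count": 2},                 # no year: treated as older
]
kept_dois = [p["doi"] for p in filter_forward_citations(citing, 2022, 10)]
```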

Overcollection

To enforce a quota effectively, you need enough candidates in each recency bucket. Collect three times the target corpus size before finalization.

# Why three times?
# - 25 percent recency quota means we need enough recent papers to fill that slot
# - Discovery yield is ~60-70 percent (not all candidates score as relevant)
# - Three times provides headroom while limiting API cost
OVERCOLLECTION_MULTIPLIER = 3
 
def get_collection_target(max_papers: int) -> int:
    return max_papers * OVERCOLLECTION_MULTIPLIER

For a target of 100 papers, collect 300 candidates. This ensures the finalization step has enough recent papers to meet the 25 percent quota.
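The arithmetic behind that claim can be checked with a small estimate. This is a rough sketch: `discovery_yield` comes from the 60-70 percent figure above, while `recent_share` (the fraction of relevant candidates that are recent) is a hypothetical input you would measure for your own topic.

```python
OVERCOLLECTION_MULTIPLIER = 3


def estimate_pool(
    max_papers: int,
    recency_quota: float,
    discovery_yield: float,
    recent_share: float,
) -> dict:
    """Estimate whether overcollection leaves enough recent candidates."""
    candidates = max_papers * OVERCOLLECTION_MULTIPLIER
    relevant = round(candidates * discovery_yield)
    recent_available = round(relevant * recent_share)
    recent_needed = int(max_papers * recency_quota)
    return {
        "candidates": candidates,
        "relevant": relevant,
        "recent_available": recent_available,
        "recent_needed": recent_needed,
        "quota_satisfiable": recent_available >= recent_needed,
    }


# 100-paper target, 25 percent quota, 60 percent yield, 20 percent recent
estimate = estimate_pool(100, 0.25, 0.6, 0.2)
```

Under these assumed inputs, 300 candidates yield about 180 relevant papers, of which about 36 are recent, which covers the 25 needed. If recent papers were much rarer in the pool, the quota could fall short even with overcollection.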

Quota-based finalization

After collecting candidates and scoring for relevance, enforce the recency quota:

def finalize_with_recency_quota(
    paper_corpus: dict[str, dict],
    max_papers: int,
    recency_quota: float,
    cutoff_year: int,
) -> list[str]:
    """Select final corpus with guaranteed recency quota."""
 
    # Partition by recency
    recent = [(doi, p) for doi, p in paper_corpus.items()
              if p.get("year", 0) >= cutoff_year]
    older = [(doi, p) for doi, p in paper_corpus.items()
             if p.get("year", 0) < cutoff_year]
 
    # Sort each partition by relevance score
    recent.sort(key=lambda x: x[1].get("relevance_score", 0.5), reverse=True)
    older.sort(key=lambda x: x[1].get("relevance_score", 0.5), reverse=True)
 
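    # Select: top recent up to quota, then fill with older
    target_recent = int(max_papers * recency_quota)
    recent_selected = recent[:target_recent]
    slots_for_older = max_papers - len(recent_selected)
    older_selected = older[:slots_for_older]

    # If older papers could not fill their slots, backfill with the
    # next-best recent papers
    remaining_slots = max_papers - len(recent_selected) - len(older_selected)
    extra_recent = recent[target_recent:target_recent + remaining_slots]

    return [doi for doi, _ in recent_selected + older_selected + extra_recent]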

The algorithm:

  1. Partition papers into recent vs. older
  2. Sort each partition by relevance score
  3. Select top recent papers up to quota (for example, 25 of 100)
  4. Fill remaining slots with top older papers
  5. If slots remain because too few older papers qualified, fill them with the next-best recent papers
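The five steps can be traced on a tiny corpus. The compact `select_with_quota` function and the sample records below are hypothetical and written only to illustrate the steps, including the backfill in step 5 and a simple composition summary.

```python
def select_with_quota(
    papers: list[dict], max_papers: int, recency_quota: float, cutoff_year: int
) -> list[dict]:
    """Illustrative version of the five steps listed above."""
    def score(p: dict) -> float:
        return p["relevance_score"]

    # Steps 1-2: partition by recency, then sort each partition by relevance
    recent = sorted((p for p in papers if p["year"] >= cutoff_year),
                    key=score, reverse=True)
    older = sorted((p for p in papers if p["year"] < cutoff_year),
                   key=score, reverse=True)

    # Steps 3-4: take recent papers up to the quota, then fill with older
    target_recent = int(max_papers * recency_quota)
    chosen_recent = recent[:target_recent]
    chosen_older = older[:max_papers - len(chosen_recent)]

    # Step 5: backfill leftover slots with the next-best recent papers
    spare = max_papers - len(chosen_recent) - len(chosen_older)
    chosen_recent += recent[target_recent:target_recent + spare]
    return chosen_recent + chosen_older


corpus = [
    {"id": "r1", "year": 2024, "relevance_score": 0.9},
    {"id": "r2", "year": 2023, "relevance_score": 0.8},
    {"id": "r3", "year": 2023, "relevance_score": 0.4},
    {"id": "o1", "year": 2015, "relevance_score": 0.95},
    {"id": "o2", "year": 2018, "relevance_score": 0.7},
]
final = select_with_quota(corpus, max_papers=4, recency_quota=0.25, cutoff_year=2022)
final_ids = [p["id"] for p in final]
recent_count = sum(p["year"] >= 2022 for p in final)
print(f"{recent_count}/{len(final)} recent")  # composition summary
```

Here the quota reserves one slot for recent work, only two older papers exist, and the leftover slot goes to the second-best recent paper rather than staying empty.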

Configuration

Encode recency settings in quality presets:

QUALITY_PRESETS = {
    "quick": {
        "max_papers": 30,
        "recency_years": 3,
        "recency_quota": 0.25,
        "min_citations_filter": 5,
    },
    "standard": {
        "max_papers": 50,
        "recency_years": 3,
        "recency_quota": 0.25,
        "min_citations_filter": 10,
    },
    "high_quality": {
        "max_papers": 200,
        "recency_years": 3,
        "recency_quota": 0.30,  # Higher for cutting-edge coverage
        "min_citations_filter": 10,
    },
}

Why these defaults:

  • recency_years: 3 aligns with typical citation lag (papers take three to five years to reach their citation peak)
  • recency_quota: 0.25 ensures 25 percent recent coverage without sacrificing foundational work
  • Adjust for your domain: Fast-moving fields (ML, biotech) may want 40 percent; historical analysis may want 10 percent
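The presets do not state how `recency_years` becomes the `cutoff_year` used by the search and finalization functions. One plausible derivation is sketched below. The formula `current_year - recency_years` and the `resolve_search_params` helper are assumptions, not something the presets specify.

```python
from datetime import date
from typing import Optional

OVERCOLLECTION_MULTIPLIER = 3

# Copy of the "standard" preset values, repeated so the example is self-contained
STANDARD_PRESET = {
    "max_papers": 50,
    "recency_years": 3,
    "recency_quota": 0.25,
    "min_citations_filter": 10,
}


def resolve_search_params(preset: dict, today: Optional[date] = None) -> dict:
    """Turn a quality preset into concrete search parameters (hypothetical)."""
    current_year = (today or date.today()).year
    return {
        # Assumption: "recent" means published in or after this year
        "cutoff_year": current_year - preset["recency_years"],
        "min_citations": preset["min_citations_filter"],
        "recent_slots": int(preset["max_papers"] * preset["recency_quota"]),
        "collection_target": preset["max_papers"] * OVERCOLLECTION_MULTIPLIER,
    }


params = resolve_search_params(STANDARD_PRESET, today=date(2025, 6, 1))
```

Note that `int()` truncates, so the 25 percent quota on 50 papers reserves 12 slots rather than 13. Use rounding instead if the quota should be treated as a strict minimum.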

When to use this pattern

Use when:

  • Literature review needs both foundational and emerging work
  • Citation filtering is used for quality control
  • The field moves fast and recent breakthroughs matter
  • Quality tiers need configurable recency balance

Don’t use when:

  • Only historical or archival research matters
  • No citation filtering is applied
  • Corpus is small enough to include all papers
  • Recency isn’t a meaningful quality signal

Trade-offs

Benefits:

  • Balanced coverage of both seminal and emerging work
  • Recent papers aren’t penalized for low citations
  • Configurable via quality presets
  • Transparent composition logging

Costs:

  • Roughly three times as many API calls due to overcollection
  • Recent papers may have less rigorous peer review
  • Fixed quota may not suit all topics
  • Two-phase search adds implementation complexity