Recency-Aware Paper Selection for Academic Literature Reviews
Citation-based filtering creates a fundamental recency bias in academic literature reviews. With a minimum of 10 citations, a 2024 paper with five citations gets filtered out, while a 2014 paper with 50 citations passes. This pattern addresses the problem by treating recent and older papers differently.
The problem
Citation counts are a cumulative metric. Papers need time to be read and cited, and those citations need time to propagate through the system:
| Paper Age | Typical Citations | Filtering Outcome (threshold=10) |
|---|---|---|
| 10+ years | 100-1000+ | Passes |
| 5-10 years | 20-100 | Usually passes |
| 2-5 years | 5-30 | Often fails |
| Less than two years | 0-10 | Almost always filtered out |
A 2024 study found a 12.8 percent decline in citation age in NLP research, termed “citation amnesia,” the strongest among the 20 fields studied. In fields that lean this heavily on recent work, a single citation threshold systematically excludes cutting-edge developments.
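To make the bias concrete, here is a minimal sketch that applies a single threshold of 10 to a handful of invented papers. The records, the field names (`year`, `cited_by_count`), and the helper `passes_single_threshold` are illustrative assumptions, not data from the study above.

```python
# Illustrative records only; titles and counts are made up for this sketch.
papers = [
    {"title": "Foundational method", "year": 2014, "cited_by_count": 50},
    {"title": "Mid-era survey", "year": 2019, "cited_by_count": 22},
    {"title": "New benchmark", "year": 2023, "cited_by_count": 8},
    {"title": "Emerging technique", "year": 2024, "cited_by_count": 5},
]


def passes_single_threshold(paper: dict, min_citations: int = 10) -> bool:
    """Apply one citation threshold regardless of publication year."""
    return paper["cited_by_count"] >= min_citations


kept = [p for p in papers if passes_single_threshold(p)]
dropped = [p for p in papers if not passes_single_threshold(p)]

print("kept:", [p["year"] for p in kept])
print("dropped:", [p["year"] for p in dropped])
```

In this toy set, the papers that get dropped are exactly the two most recent ones, which is the pattern the table describes.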
The solution
The pattern uses three components:
- Two-phase search with different citation thresholds for recent vs. older papers
- Overcollection (three times target) to ensure enough candidates for quota enforcement
- Quota-based finalization guaranteeing a minimum percentage of recent papers
    graph TD
        A[Quality Settings] --> B[Two-Phase Search]
        B --> C[Phase 1: Recent<br/>min_citations = 0]
        B --> D[Phase 2: Older<br/>min_citations = 10]
        C --> E[Overcollect three times target]
        D --> E
        E --> F[Quota-Based Finalization]
        F --> G[Final Corpus<br/>25 percent recent guaranteed]
Two-phase search
The key insight: Recent papers have not had time to accumulate citations, so applying a citation threshold to them is meaningless.

    import asyncio

    # openalex_search and deduplicate_by_doi are assumed to be
    # project helpers defined elsewhere.

    async def two_phase_keyword_search(
        queries: list[str],
        cutoff_year: int,
        min_citations: int,
    ) -> list[dict]:
        """Search with different thresholds for recent vs. older papers."""

        async def search_query(query: str) -> list[dict]:
            results = []
            # Phase 1: Recent papers with no citation threshold
            # A 2024 paper with five citations may be highly important;
            # it just has not had time to accumulate citations.
            recent_results = await openalex_search(
                query=query,
                from_year=cutoff_year,
                min_citations=0,  # No threshold for recent
            )
            results.extend(recent_results)
            # Phase 2: Older papers with normal citation threshold
            # A 2015 paper with only five citations after 10+ years
            # is likely not impactful.
            older_results = await openalex_search(
                query=query,
                to_year=cutoff_year - 1,
                min_citations=min_citations,  # Normal threshold
            )
            results.extend(older_results)
            return results

        # gather() returns one list per query, so flatten before deduplicating
        per_query = await asyncio.gather(*[search_query(q) for q in queries])
        all_results = [paper for batch in per_query for paper in batch]
        return deduplicate_by_doi(all_results)

The same principle applies to citation fetching. Forward citations (papers that cite your seed papers) should use two-phase thresholds. Backward citations (the papers your seeds reference) don’t need this treatment, since they are older than the seeds by definition.
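A minimal sketch of the same two-phase rule applied to forward citations that have already been fetched. It assumes each citing paper is a dict with `year` and `cited_by_count` keys; both field names, the placeholder DOIs, and the helper `filter_forward_citations` are hypothetical and not part of the original code.

```python
def filter_forward_citations(
    citing_papers: list[dict],
    cutoff_year: int,
    min_citations: int,
) -> list[dict]:
    """Keep recent citing papers unconditionally; threshold the older ones."""
    kept = []
    for paper in citing_papers:
        year = paper.get("year") or 0  # a missing year is treated as older
        citations = paper.get("cited_by_count") or 0
        if year >= cutoff_year or citations >= min_citations:
            kept.append(paper)
    return kept


# Placeholder records for illustration only.
citing = [
    {"doi": "10.1000/a", "year": 2024, "cited_by_count": 2},
    {"doi": "10.1000/b", "year": 2016, "cited_by_count": 3},
    {"doi": "10.1000/c", "year": 2017, "cited_by_count": 40},
]
kept_citing = filter_forward_citations(citing, cutoff_year=2022, min_citations=10)
print([p["doi"] for p in kept_citing])
```

The recent paper survives despite its two citations, while the 2016 paper with three citations is dropped by the normal threshold.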
Overcollection
To enforce a quota effectively, you need enough candidates in each recency bucket. Collect three times the target corpus size before finalization.

    # Why three times?
    # - 25 percent recency quota means we need enough recent papers to fill that slot
    # - Discovery yield is ~60-70 percent (not all candidates score as relevant)
    # - Three times provides headroom while limiting API cost
    OVERCOLLECTION_MULTIPLIER = 3

    def get_collection_target(max_papers: int) -> int:
        return max_papers * OVERCOLLECTION_MULTIPLIER

For a target of 100 papers, collect 300 candidates. This ensures the finalization step has enough recent papers to meet the 25 percent quota.
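The arithmetic behind the multiplier can be checked with a rough sketch. The 65 percent yield is the midpoint of the range quoted in the comment above; the share of recent papers among candidates (`recent_share`) is an assumed input that will vary by topic, and `expected_recent_supply` is a hypothetical helper.

```python
OVERCOLLECTION_MULTIPLIER = 3


def expected_recent_supply(
    max_papers: int,
    recency_quota: float,
    relevance_yield: float,
    recent_share: float,
) -> dict:
    """Estimate whether overcollection leaves enough recent, relevant papers."""
    candidates = max_papers * OVERCOLLECTION_MULTIPLIER
    relevant = candidates * relevance_yield
    recent_relevant = round(relevant * recent_share)
    needed = int(max_papers * recency_quota)
    return {
        "candidates": candidates,
        "recent_relevant": recent_relevant,
        "needed": needed,
        "quota_met": recent_relevant >= needed,
    }


estimate = expected_recent_supply(
    max_papers=100,
    recency_quota=0.25,
    relevance_yield=0.65,  # midpoint of the ~60-70 percent range
    recent_share=0.20,     # assumption: one in five candidates is recent
)
print(estimate)
```

Under these assumed inputs, 300 candidates leave about 39 recent, relevant papers against the 25 needed. Collecting only 100 candidates would leave about 13, which is short of the quota.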
Quota-based finalization
After collecting candidates and scoring for relevance, enforce the recency quota:

    def finalize_with_recency_quota(
        paper_corpus: dict[str, dict],
        max_papers: int,
        recency_quota: float,
        cutoff_year: int,
    ) -> list[str]:
        """Select final corpus with guaranteed recency quota."""
        # Partition by recency
        recent = [(doi, p) for doi, p in paper_corpus.items()
                  if p.get("year", 0) >= cutoff_year]
        older = [(doi, p) for doi, p in paper_corpus.items()
                 if p.get("year", 0) < cutoff_year]
        # Sort each partition by relevance score
        recent.sort(key=lambda x: x[1].get("relevance_score", 0.5), reverse=True)
        older.sort(key=lambda x: x[1].get("relevance_score", 0.5), reverse=True)
        # Select: top recent up to quota, then fill with older
        target_recent = int(max_papers * recency_quota)
        recent_selected = recent[:target_recent]
        slots_for_older = max_papers - len(recent_selected)
        older_selected = older[:slots_for_older]
        # If older papers ran out, backfill remaining slots with more recent papers
        remaining = max_papers - len(recent_selected) - len(older_selected)
        if remaining > 0:
            recent_selected += recent[target_recent:target_recent + remaining]
        return [doi for doi, _ in recent_selected + older_selected]

The algorithm:
- Partition papers into recent vs. older
- Sort each partition by relevance score
- Select top recent papers up to quota (for example, 25 of 100)
- Fill remaining slots with top older papers
- If you still have slots and more recent papers exist, use them
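Since the selection mixes two pools, it helps to report what ended up in the final corpus. The following sketch summarizes the recent/older split for a list of selected DOIs. `summarize_composition` and its output keys are hypothetical, the DOIs are placeholders, and the sketch assumes the same `paper_corpus` shape used by `finalize_with_recency_quota`.

```python
def summarize_composition(
    selected_dois: list[str],
    paper_corpus: dict[str, dict],
    cutoff_year: int,
) -> dict:
    """Count recent vs. older papers in the final selection."""
    recent = sum(
        1 for doi in selected_dois
        if paper_corpus[doi].get("year", 0) >= cutoff_year
    )
    total = len(selected_dois)
    return {
        "total": total,
        "recent": recent,
        "older": total - recent,
        "recent_fraction": recent / total if total else 0.0,
    }


# Placeholder corpus for illustration only.
corpus = {
    "10.1000/a": {"year": 2024, "relevance_score": 0.9},
    "10.1000/b": {"year": 2023, "relevance_score": 0.7},
    "10.1000/c": {"year": 2015, "relevance_score": 0.8},
    "10.1000/d": {"year": 2010, "relevance_score": 0.6},
}
summary = summarize_composition(list(corpus), corpus, cutoff_year=2022)
print(summary)
```

Logging this summary after each run makes it easy to see whether the quota was met or whether the recent pool ran short.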
Configuration
Encode recency settings in quality presets:

    QUALITY_PRESETS = {
        "quick": {
            "max_papers": 30,
            "recency_years": 3,
            "recency_quota": 0.25,
            "min_citations_filter": 5,
        },
        "standard": {
            "max_papers": 50,
            "recency_years": 3,
            "recency_quota": 0.25,
            "min_citations_filter": 10,
        },
        "high_quality": {
            "max_papers": 200,
            "recency_years": 3,
            "recency_quota": 0.30,  # Higher for cutting-edge coverage
            "min_citations_filter": 10,
        },
    }

Why these defaults:
- recency_years: 3 aligns with typical citation lag (papers take three to five years to reach their citation peak)
- recency_quota: 0.25 ensures 25 percent recent coverage without sacrificing foundational work
- Adjust for your domain: Fast-moving fields (ML, biotech) may want 40 percent; historical analysis may want 10 percent
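One way to turn a preset into the concrete arguments used by the search and finalization functions is sketched below. The mapping from `recency_years` to `cutoff_year` (current year minus `recency_years`) is an assumption, since the text does not state how the cutoff is derived. `resolve_settings` is a hypothetical helper, and only the `standard` preset is reproduced so the block runs on its own.

```python
from datetime import date
from typing import Optional

OVERCOLLECTION_MULTIPLIER = 3

# Copy of the "standard" preset, repeated so the sketch is self-contained.
QUALITY_PRESETS = {
    "standard": {
        "max_papers": 50,
        "recency_years": 3,
        "recency_quota": 0.25,
        "min_citations_filter": 10,
    },
}


def resolve_settings(preset_name: str, current_year: Optional[int] = None) -> dict:
    """Derive search and finalization arguments from a quality preset."""
    preset = QUALITY_PRESETS[preset_name]
    year = current_year if current_year is not None else date.today().year
    return {
        # Assumption: "recent" means published within the last recency_years years.
        "cutoff_year": year - preset["recency_years"],
        "min_citations": preset["min_citations_filter"],
        "collection_target": preset["max_papers"] * OVERCOLLECTION_MULTIPLIER,
        "target_recent": int(preset["max_papers"] * preset["recency_quota"]),
    }


settings = resolve_settings("standard", current_year=2025)
print(settings)
```

With 50 papers and a 0.25 quota, `int()` truncates 12.5 to 12, so the reserved recent share is 24 percent rather than 25. Rounding up with `math.ceil` would be one option if the quota should act as a strict floor.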
When to use this pattern
Use when:
- Literature review needs both foundational and emerging work
- Citation filtering is used for quality control
- Fast-moving fields where recent breakthroughs matter
- Quality tiers need configurable recency balance
Don’t use when:
- Only historical or archival research matters
- No citation filtering is applied
- Corpus is small enough to include all papers
- Recency isn’t a meaningful quality signal
Trade-offs
Benefits:
- Balanced coverage of both seminal and emerging work
- Recent papers aren’t penalized for low citations
- Configurable via quality presets
- Transparent composition logging
Costs:
- Roughly three times as many API calls due to overcollection
- Recent papers may have less rigorous peer review
- Fixed quota may not suit all topics
- Two-phase search adds implementation complexity