The Problem: Research Takes Too Long
As a researcher and developer, I found myself spending hours manually searching academic databases, reading abstracts, and trying to synthesize findings across multiple sources.
For my work on circular economy and battery recycling, I needed to:
- Search 5 different academic databases
- Read through dozens of papers
- Extract key findings and methodologies
- Identify research gaps and trends
- Synthesize everything into a coherent report
This process took 4-6 hours per week. I knew there had to be a better way.
The Solution: Full-Stack Research Automation
I built an n8n workflow that does all of this automatically in under 5 minutes. Here's what it does:
✅ Queries 5 academic APIs simultaneously
✅ Uses AI to extract insights from each paper
✅ Scores papers for relevance and quality
✅ Stores everything in Google Sheets
✅ Generates a comprehensive synthesis report
✅ Emails me a beautiful HTML report
Result: I went from 4-6 hours of manual work to one click and 5 minutes of waiting.
Tech Stack
Here's what powers this automation:
| Tool | Purpose |
|---|---|
| n8n | Workflow automation platform (open-source) |
| Groq AI | Llama 3.3 70B model for content extraction & synthesis |
| Semantic Scholar | Computer science & general academic papers |
| OpenAlex | 200M+ open-access research papers |
| Crossref | DOI registry & journal metadata |
| arXiv | Preprints in physics, math, CS |
| PubMed | Biomedical & life sciences |
| Google Sheets | Data persistence |
| Gmail | Report delivery |
Architecture Overview
Here's the complete workflow in 7 stages:
```
Manual Trigger → Configuration → 5 Parallel API Calls →
Normalize & Deduplicate → AI Extraction → Score & Filter →
Google Sheets + AI Synthesis → Email Report
```
- **Total nodes:** 23
- **Execution time:** 2-5 minutes
- **Papers processed:** up to 50 (10 per API)
Stage 1: Configuration
Instead of hardcoding search parameters in every node, I created a central configuration node:
```json
{
  "keywords": "circular economy battery recycling remanufacturing",
  "min_year": "2020",
  "min_citations": "2",
  "max_results": "10",
  "relevance_threshold": "15"
}
```
This makes the workflow easy to customize. Want to research a different topic? Just change the `keywords` field.
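Downstream nodes then read these values by referencing the configuration node. A minimal sketch, assuming the node is named `Config` (adjust to whatever you named yours):

```javascript
// Inside any later Code node: pull the shared settings from the Config node.
const config = $('Config').first().json;

// Use them however the node needs (illustrative only)
return [{ json: { query: config.keywords, minYear: config.min_year } }];
```

In expression fields, the same lookup is `{{ $('Config').first().json.keywords }}`.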
Stage 2: Parallel API Collection
The workflow queries all 5 APIs simultaneously using n8n's parallel execution. Each API call is preceded by a "Rate Limit Delay" node to avoid 429 errors.
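The delay itself can be n8n's built-in Wait node, or a tiny Code node like this sketch (the 1-second pause is an assumed value — tune it to each API's rate limits):

```javascript
// Rate Limit Delay: pause briefly so parallel branches don't hammer the APIs.
await new Promise((resolve) => setTimeout(resolve, 1000)); // assumed 1s delay
return $input.all(); // pass all items through unchanged
```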
Example: Semantic Scholar API
```javascript
// HTTP Request Node Configuration
{
  "url": "https://api.semanticscholar.org/graph/v1/paper/search",
  "queryParameters": {
    "query": "{{ $json.keywords }}",
    "limit": "{{ $json.max_results }}",
    "fields": "title,abstract,year,authors,citationCount,venue,externalIds,url,openAccessPdf",
    "year": "{{ $json.min_year }}-"
  }
}
```
Key insight: All API nodes have `onError: "continueRegularOutput"` set. This means if one API fails, the workflow continues with whatever data it successfully retrieved.
Stage 3: Normalization & Deduplication
Each API returns data in a different format. This JavaScript code normalizes everything into a standard structure:
```javascript
// Simplified normalization logic
const results = [];
const seenDOIs = new Set();
const seenTitles = new Set();

for (const item of apiResponses) {
  // Skip failed API calls (errors or rate-limit responses)
  if (item.json.error || item.json.status === 429) {
    console.log('Skipping failed API');
    continue;
  }

  // Normalize Semantic Scholar
  if (item.json.data && Array.isArray(item.json.data)) {
    for (const paper of item.json.data) {
      const normalized = {
        title: paper.title || 'Unknown',
        abstract: paper.abstract || 'No abstract',
        year: paper.year || null,
        authors: (paper.authors || []).map(a => a.name).join(', '),
        citations: paper.citationCount || 0,
        doi: paper.externalIds?.DOI || null,
        source: 'Semantic Scholar'
      };

      // Deduplicate by DOI, falling back to title when no DOI exists
      const titleKey = normalized.title.toLowerCase();
      if (normalized.doi && !seenDOIs.has(normalized.doi)) {
        seenDOIs.add(normalized.doi);
        seenTitles.add(titleKey);
        results.push(normalized);
      } else if (!normalized.doi && !seenTitles.has(titleKey)) {
        seenTitles.add(titleKey);
        results.push(normalized);
      }
    }
  }

  // Similar logic for OpenAlex, Crossref, arXiv, PubMed...
}
```
Why this matters: Without normalization, the next stages wouldn't know how to read the data. Without deduplication, you'd get the same paper multiple times.
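To give a flavour of that "similar logic", here is a hedged sketch of the OpenAlex branch. One quirk: OpenAlex returns abstracts as an inverted index (`{ word: [positions] }`) rather than plain text, so the abstract has to be reconstructed. Field names below follow the public OpenAlex schema, but verify them against the live API:

```javascript
// Normalize OpenAlex (sketch — field names per the public OpenAlex schema)
if (item.json.results && Array.isArray(item.json.results)) {
  for (const work of item.json.results) {
    // Rebuild the abstract from OpenAlex's inverted index
    let abstract = 'No abstract';
    if (work.abstract_inverted_index) {
      const words = [];
      for (const [word, positions] of Object.entries(work.abstract_inverted_index)) {
        for (const pos of positions) words[pos] = word;
      }
      abstract = words.join(' ');
    }

    const record = {
      title: work.title || 'Unknown',
      abstract,
      year: work.publication_year || null,
      authors: (work.authorships || []).map(a => a.author.display_name).join(', '),
      citations: work.cited_by_count || 0,
      doi: work.doi ? work.doi.replace('https://doi.org/', '') : null, // OpenAlex DOIs are full URLs
      source: 'OpenAlex'
    };

    if (record.doi && !seenDOIs.has(record.doi)) {
      seenDOIs.add(record.doi);
      results.push(record);
    }
  }
}
```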
Stage 4: AI-Powered Content Extraction
Now comes the magic. For each paper, I use Groq's Llama 3.3 70B model to extract structured insights.
The Prompt
```javascript
const prompt = `You are an expert research analyst. Analyze this paper:

Title: ${paper.title}
Abstract: ${paper.abstract}
Year: ${paper.year}
Authors: ${paper.authors}

Extract:
1. research_question: Main research question
2. methodology: Research methods used
3. key_findings: Main findings (2-3 sentences)
4. conclusion: Key conclusions
5. themes: Array of themes (e.g., ["circular economy", "battery recycling"])

Return ONLY valid JSON with these exact keys.`;
```
The API Call
```javascript
// Groq API Request
{
  "model": "llama-3.3-70b-versatile",
  "messages": [
    {
      "role": "system",
      "content": "You are a research extraction assistant. Always return valid JSON only."
    },
    {
      "role": "user",
      "content": prompt
    }
  ],
  "temperature": 0.3, // Low temperature for factual extraction
  "max_tokens": 2000
}
```
Optimization: I use n8n's batching feature to process 5 papers at once, reducing API calls by 80%.
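That batching step is roughly this sketch (the chunk size of 5 matches the text; everything else is illustrative):

```javascript
// Group papers into batches of 5 so one Groq request analyzes several at once.
const BATCH_SIZE = 5;
const papers = $input.all().map(item => item.json);

const batches = [];
for (let i = 0; i < papers.length; i += BATCH_SIZE) {
  batches.push(papers.slice(i, i + BATCH_SIZE));
}

// One output item per batch; the downstream Groq node sends one prompt per item.
return batches.map(batch => ({ json: { batch } }));
```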
Stage 5: Scoring & Filtering
Not all papers are equally relevant. I built a custom scoring algorithm:
Relevance Score (max 70 points)
```javascript
const keywords = ['circular', 'economy', 'recycling', 'remanufacturing',
                  'sustainability', 'waste', 'battery', 'lithium'];

let score = 0;
for (const keyword of keywords) {
  if (title.includes(keyword)) score += 6;    // title match: strongest signal
  if (abstract.includes(keyword)) score += 4;
  if (themes.includes(keyword)) score += 3;
}
const relevanceScore = Math.min(score, 70); // cap at the 70-point maximum
```
Quality Score (max 30 points)
```javascript
const citationScore = Math.min(citations / 3, 20);                // max 20 points
const recencyScore = Math.max(0, 10 - (currentYear - paperYear)); // max 10 points
const qualityScore = citationScore + recencyScore;
```
Total Score
```javascript
const totalScore = relevanceScore + qualityScore; // max 100 points
```
Papers below the threshold (default: 15) are filtered out.
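The filter itself is a one-liner in a Code node. A sketch, again assuming the Stage 1 node is named `Config`:

```javascript
// Drop papers scoring below the configured relevance threshold.
const threshold = Number($('Config').first().json.relevance_threshold);
return $input.all().filter(item => item.json.totalScore >= threshold);
```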
Stage 6: Storage & Synthesis
Filtered papers are saved to Google Sheets for future reference. Then, all papers are aggregated into one massive prompt for AI synthesis:
```javascript
const prompt = `You are a research synthesis expert analyzing ${papersCount} papers.

[Paper summaries with all extracted data...]

Generate a comprehensive synthesis report with:
- Executive Summary
- Key Themes Identified
- Emerging Trends
- Research Gaps
- Methodological Approaches
- Key Findings Summary
- Future Research Directions`;
```
Groq's AI generates a multi-section markdown report analyzing all papers together.
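The `[Paper summaries...]` placeholder above is just a concatenation of the Stage 4 extractions. A sketch of how that string might be built (field names mirror the extraction keys):

```javascript
// Build one summary block per paper from the Stage 4 extraction fields.
const papers = $input.all().map(item => item.json);

const summaries = papers.map((p, i) => `
Paper ${i + 1}: ${p.title} (${p.year})
Methodology: ${p.methodology}
Key findings: ${p.key_findings}
Themes: ${(p.themes || []).join(', ')}
`).join('\n');
```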
Stage 7: Email Delivery
The final node converts the markdown synthesis into a beautiful HTML email:
```javascript
// Markdown to HTML conversion (simplified: headings, bold, list items)
let html = synthesisText
  .replace(/## (.*?)\n/g, '<h2 style="color: #2c3e50;">$1</h2>')
  .replace(/\*\*(.*?)\*\*/g, '<strong>$1</strong>')
  .replace(/^- (.*?)$/gm, '<li>$1</li>');
```
The email includes:
- Paper count and date
- Data source summary
- Full synthesis report with styling
- Link to Google Sheets
- Timestamp
Results & Impact
Before Automation
- ⏱️ 4-6 hours per week manually searching and reading
- 📊 10-15 papers reviewed per session
- 📝 Inconsistent note-taking and synthesis
After Automation
- ⏱️ 5 minutes (one click + wait time)
- 📊 50 papers processed automatically
- 📝 Structured AI-generated synthesis every time
- 📧 Professional reports delivered to inbox
Time saved: ~20 hours per month
Key Learnings
1. Error Handling is Critical
Setting `onError: "continueRegularOutput"` on all API nodes means one failed API doesn't crash the entire workflow.
2. Batching Saves API Costs
Processing 5 papers at once reduced my Groq API calls by 80%.
3. Prompt Engineering Matters
Specifying "Return ONLY valid JSON" in the system message dramatically improved parsing reliability.
4. Centralised Configuration is Essential
One configuration node makes the workflow easy to customize and maintain.
5. Parallel Execution is Powerful
Querying 5 APIs simultaneously reduced execution time from 15+ minutes to under 5 minutes.
How to Use This Workflow
Prerequisites
- n8n instance (self-hosted or cloud)
- Groq API key (free tier available)
- Google account (for Sheets & Gmail)
Setup Steps
1. Import the workflow into n8n
2. Configure credentials:
   - Groq API (OpenAI-compatible endpoint)
   - Google Sheets OAuth2
   - Gmail OAuth2
3. Update the configuration:
   - Change `keywords` to your research topic
   - Adjust `relevance_threshold` for filtering
   - Update the email recipient
4. Run the workflow and wait for your report!
Code Repository
The complete workflow JSON and setup guide are available on GitHub:
👉 https://github.com/chidoziemanagwu/Research-Automation-Workflow
Includes:
- `workflow.json` - the complete n8n workflow
- `README.md` - overview and features
- `SETUP.md` - step-by-step configuration guide
Future Enhancements
I'm planning to add:
- 📄 PDF full-text analysis for deeper insights
- 📊 Citation network visualisation
- ⏰ Scheduled execution (daily/weekly)
Conclusion
Building this automation taught me that the best code is the code you don't have to write. By combining n8n's visual workflow builder, Groq's powerful AI, and multiple academic APIs, I created a system that does in 5 minutes what used to take me hours.
If you're a researcher, student, or anyone who regularly reviews academic literature, I encourage you to build something similar. The tools are accessible, the APIs are (mostly) free, and the time savings are massive.
About the Author
Chidozie Managwu is a freeCodeCamp contributor, a software engineer, and an automation expert focused on using AI and workflow tools to solve real-world problems.
