How I Built an AI-Powered Research Automation System with n8n, Groq, and 5 Academic APIs

Written by chidoziemanagwu | Published 2025/12/02
Tech Story Tags: open-source-tools | n8n-tutorial | automation | n8n-workflow | academic-twitter | llm-applications | groq-ai | research-automation

TL;DR: This post guides you through architecting an AI-powered research automation system using a low-code approach. It shows how to integrate n8n for workflow orchestration, Groq for high-speed LLM inference, and five academic APIs to automatically query, analyze, and process research data, accelerating the entire R&D-to-publication cycle.

The Problem: Research Takes Too Long

As a researcher and developer, I found myself spending hours manually searching academic databases, reading abstracts, and trying to synthesize findings across multiple sources.

For my work on circular economy and battery recycling, I needed to:

  • Search 5 different academic databases
  • Read through dozens of papers
  • Extract key findings and methodologies
  • Identify research gaps and trends
  • Synthesize everything into a coherent report

This process took 4-6 hours per week. I knew there had to be a better way.


The Solution: Full-Stack Research Automation

I built an n8n workflow that does all of this automatically in under 5 minutes. Here's what it does:

✅ Queries 5 academic APIs simultaneously

✅ Uses AI to extract insights from each paper

✅ Scores papers for relevance and quality

✅ Stores everything in Google Sheets

✅ Generates a comprehensive synthesis report

✅ Emails me a beautiful HTML report

Result: I went from 4-6 hours of manual work to one click and 5 minutes of waiting.


Tech Stack

Here's what powers this automation:

  • n8n: Workflow automation platform (open-source)
  • Groq AI: Llama 3.3 70B model for content extraction & synthesis
  • Semantic Scholar: Computer science & general academic papers
  • OpenAlex: 200M+ open-access research papers
  • Crossref: DOI registry & journal metadata
  • arXiv: Preprints in physics, math, CS
  • PubMed: Biomedical & life sciences
  • Google Sheets: Data persistence
  • Gmail: Report delivery


Architecture Overview

Here's the complete workflow in 7 stages:

Manual Trigger → Configuration → 5 Parallel API Calls → 
Normalize & Deduplicate → AI Extraction → Score & Filter → 
Google Sheets + AI Synthesis → Email Report

Total nodes: 23

Execution time: 2-5 minutes

Papers processed: Up to 50 (10 per API)


Stage 1: Configuration

Instead of hardcoding search parameters in every node, I created a central configuration node:

{
  "keywords": "circular economy battery recycling remanufacturing",
  "min_year": "2020",
  "min_citations": "2",
  "max_results": "10",
  "relevance_threshold": "15"
}

This makes the workflow easy to customize. Want to research a different topic? Just change the keywords field.
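Downstream nodes read these values instead of hardcoding them. Here's a minimal sketch of how a later Code node might pull the shared settings (the node name "Configuration" is an assumption; use whatever you named your config node). In the HTTP Request nodes, the same values appear as expressions like {{ $json.keywords }}, as in the Semantic Scholar example below.

// Sketch: any downstream Code node can read the shared settings like this
// Assumes the config node is named "Configuration" -- adjust to your workflow
const config = $('Configuration').first().json;

const keywords = config.keywords;
const minYear = Number(config.min_year);
const threshold = Number(config.relevance_threshold);

return [{ json: { keywords, minYear, threshold } }];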


Stage 2: Parallel API Collection

The workflow queries all 5 APIs simultaneously using n8n's parallel execution. Each API call is preceded by a "Rate Limit Delay" node to avoid 429 errors.

Example: Semantic Scholar API

// HTTP Request Node Configuration
{
  "url": "https://api.semanticscholar.org/graph/v1/paper/search",
  "queryParameters": {
    "query": "{{ $json.keywords }}",
    "limit": "{{ $json.max_results }}",
    "fields": "title,abstract,year,authors,citationCount,venue,externalIds,url,openAccessPdf",
    "year": "{{ $json.min_year }}-"
  }
}

Key insight: All API nodes have onError: "continueRegularOutput" set. This means if one API fails, the workflow continues with whatever data it successfully retrieved.
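The other four APIs follow the same pattern with different parameter names. For comparison, an OpenAlex request might look roughly like this (an illustrative sketch; the parameter names follow the OpenAlex works endpoint, but double-check the filter syntax against its docs):

// HTTP Request Node Configuration (OpenAlex, illustrative sketch)
{
  "url": "https://api.openalex.org/works",
  "queryParameters": {
    "search": "{{ $json.keywords }}",
    "per-page": "{{ $json.max_results }}",
    "filter": "from_publication_date:{{ $json.min_year }}-01-01"
  }
}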


Stage 3: Normalization & Deduplication

Each API returns data in a different format. This JavaScript code normalizes everything into a standard structure:

// Simplified normalization logic (runs in an n8n Code node)
const results = [];
const seenDOIs = new Set();
const seenTitles = new Set();

// In n8n, the merged responses from the five API branches arrive as input items
const apiResponses = $input.all();

for (const item of apiResponses) {
  // Skip failed API calls (e.g. rate-limited requests)
  if (item.json.error || item.json.status === 429) {
    console.log('Skipping failed API');
    continue;
  }

  // Normalize Semantic Scholar
  if (item.json.data && Array.isArray(item.json.data)) {
    for (const paper of item.json.data) {
      const normalized = {
        title: paper.title || 'Unknown',
        abstract: paper.abstract || 'No abstract',
        year: paper.year || null,
        authors: (paper.authors || []).map(a => a.name).join(', '),
        citations: paper.citationCount || 0,
        doi: paper.externalIds?.DOI || null,
        source: 'Semantic Scholar'
      };

      // Deduplicate by DOI, falling back to a normalized title
      const titleKey = normalized.title.toLowerCase().trim();
      if (normalized.doi ? seenDOIs.has(normalized.doi) : seenTitles.has(titleKey)) continue;
      if (normalized.doi) seenDOIs.add(normalized.doi);
      seenTitles.add(titleKey);
      results.push(normalized);
    }
  }

  // Similar logic for OpenAlex, Crossref, arXiv, PubMed...
}

// Each returned object becomes one n8n item for the next node
return results.map(paper => ({ json: paper }));

Why this matters: Without normalization, the next stages wouldn't know how to read the data. Without deduplication, you'd get the same paper multiple times.
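For reference, here is roughly what the OpenAlex branch of that loop looks like. The field names follow the OpenAlex works schema (note that OpenAlex returns abstracts as an inverted index, so they have to be reassembled); treat this as a sketch rather than the exact code from the workflow:

// Sketch: OpenAlex branch of the normalization loop
function rebuildAbstract(inverted) {
  if (!inverted) return 'No abstract';
  const words = [];
  for (const [word, positions] of Object.entries(inverted)) {
    for (const pos of positions) words[pos] = word;
  }
  return words.join(' ');
}

if (item.json.results && Array.isArray(item.json.results)) {
  for (const work of item.json.results) {
    const normalized = {
      title: work.display_name || 'Unknown',
      abstract: rebuildAbstract(work.abstract_inverted_index),
      year: work.publication_year || null,
      authors: (work.authorships || []).map(a => a.author.display_name).join(', '),
      citations: work.cited_by_count || 0,
      doi: work.doi ? work.doi.replace('https://doi.org/', '') : null,
      source: 'OpenAlex'
    };
    // ...same dedupe-and-push logic as in the Semantic Scholar branch
  }
}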


Stage 4: AI-Powered Content Extraction

Now comes the magic. For each paper, I use Groq's Llama 3.3 70B model to extract structured insights.

The Prompt

const prompt = `You are an expert research analyst. Analyze this paper:

Title: ${paper.title}
Abstract: ${paper.abstract}
Year: ${paper.year}
Authors: ${paper.authors}

Extract:
1. research_question: Main research question
2. methodology: Research methods used
3. key_findings: Main findings (2-3 sentences)
4. conclusion: Key conclusions
5. themes: Array of themes (e.g., ["circular economy", "battery recycling"])

Return ONLY valid JSON with these exact keys.`;

The API Call

// Groq API Request
{
  "model": "llama-3.3-70b-versatile",
  "messages": [
    {
      "role": "system",
      "content": "You are a research extraction assistant. Always return valid JSON only."
    },
    {
      "role": "user",
      "content": prompt
    }
  ],
  "temperature": 0.3,  // Low temperature for factual extraction
  "max_tokens": 2000
}
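Because LLMs occasionally wrap JSON in extra text, the response is worth parsing defensively. Here's a minimal sketch of how the reply might be handled in a Code node, assuming the standard OpenAI-compatible response shape that Groq returns (choices[0].message.content):

// Sketch: pull the model's reply and parse it defensively
const reply = $json.choices?.[0]?.message?.content || '';

let extracted;
try {
  // Strip anything around the first {...} block before parsing
  const jsonText = reply.slice(reply.indexOf('{'), reply.lastIndexOf('}') + 1);
  extracted = JSON.parse(jsonText);
} catch (err) {
  // Fall back to a stub so one bad response doesn't break the run
  extracted = {
    research_question: null,
    methodology: null,
    key_findings: reply.slice(0, 300),
    conclusion: null,
    themes: []
  };
}

return [{ json: extracted }];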

Optimization: I use n8n's batching feature to process 5 papers at once, reducing API calls by 80%.
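One way to implement that kind of batching in a Code node is to chunk the papers before building prompts. This is a rough sketch under the assumption that each batch is analyzed in a single prompt and the model returns a JSON array; it's not necessarily the author's exact node setup:

// Sketch: group papers into batches of 5 so each Groq call analyzes several at once
const papers = $input.all().map(item => item.json);
const BATCH_SIZE = 5;

const batches = [];
for (let i = 0; i < papers.length; i += BATCH_SIZE) {
  const batch = papers.slice(i, i + BATCH_SIZE);
  const prompt = batch.map((p, idx) =>
    `Paper ${idx + 1}:\nTitle: ${p.title}\nAbstract: ${p.abstract}`
  ).join('\n\n');
  batches.push({ json: { prompt, paperCount: batch.length } });
}

return batches;

With 50 papers, that's 10 requests instead of 50, which lines up with the 80% reduction mentioned above.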


Stage 5: Scoring & Filtering

Not all papers are equally relevant. I built a custom scoring algorithm:

Relevance Score (max 70 points)

const keywords = ['circular', 'economy', 'recycling', 'remanufacturing', 
                  'sustainability', 'waste', 'battery', 'lithium'];

// title, abstract and themes are assumed to be lowercased upstream,
// so the keyword matching is case-insensitive
let relevanceScore = 0;
for (const keyword of keywords) {
  if (title.includes(keyword)) relevanceScore += 6;
  if (abstract.includes(keyword)) relevanceScore += 4;
  if (themes.includes(keyword)) relevanceScore += 3;
}
relevanceScore = Math.min(relevanceScore, 70);  // Cap at the 70-point maximum

Quality Score (max 30 points)

const citationScore = Math.min(citations / 3, 20);  // Max 20 points
const recencyScore = Math.max(0, 10 - (currentYear - paperYear));  // Max 10 points
const qualityScore = citationScore + recencyScore;

Total Score

const totalScore = relevanceScore + qualityScore;  // Max 100 points

Papers below the threshold (default: 15) are filtered out.
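Putting it together, the filter step is just a threshold check against the configuration value. A sketch, assuming each item carries the totalScore computed above and that the config node from Stage 1 is named "Configuration":

// Sketch: drop papers that fall below the relevance threshold from the config node
const threshold = Number($('Configuration').first().json.relevance_threshold) || 15;

return $input.all().filter(item => item.json.totalScore >= threshold);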


Stage 6: Storage & Synthesis

Filtered papers are saved to Google Sheets for future reference. Then, all papers are aggregated into one massive prompt for AI synthesis:

const prompt = `You are a research synthesis expert analyzing ${papersCount} papers.

[Paper summaries with all extracted data...]

Generate a comprehensive synthesis report with:
- Executive Summary
- Key Themes Identified
- Emerging Trends
- Research Gaps
- Methodological Approaches
- Key Findings Summary
- Future Research Directions`;

Groq's AI generates a multi-section markdown report analyzing all papers together.
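The aggregation itself is plain string building. Here's a rough sketch of how the paper summaries might be assembled before being dropped into that prompt (illustrative only; field names match the extraction keys from Stage 4):

// Sketch: collapse every extracted paper into a compact text block for the synthesis prompt
const papers = $input.all().map(item => item.json);

const summaries = papers.map((p, i) => [
  `--- Paper ${i + 1} ---`,
  `Title: ${p.title} (${p.year}, ${p.citations} citations)`,
  `Findings: ${p.key_findings}`,
  `Methodology: ${p.methodology}`,
  `Themes: ${(p.themes || []).join(', ')}`
].join('\n')).join('\n\n');

return [{ json: { papersCount: papers.length, summaries } }];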


Stage 7: Email Delivery

The final node converts the markdown synthesis into a beautiful HTML email:

// Markdown to HTML conversion
let html = synthesisText
  .replace(/## (.*?)\n/g, '<h2 style="color: #2c3e50;">$1</h2>')
  .replace(/\*\*(.*?)\*\*/g, '<strong>$1</strong>')
  .replace(/^- (.*?)$/gm, '<li>$1</li>');
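One detail worth handling: the snippet above emits bare <li> elements, so runs of consecutive items should be wrapped in a <ul> to render as a proper list. A small addition along these lines works (a sketch, not necessarily the exact conversion used in the workflow):

// Wrap runs of consecutive <li> lines in a single <ul> so lists render correctly
html = html.replace(/(?:<li>.*<\/li>\s*)+/g, match => `<ul>${match}</ul>`);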

The email includes:

  • Paper count and date
  • Data source summary
  • Full synthesis report with styling
  • Link to Google Sheets
  • Timestamp


Results & Impact

Before Automation

  • ⏱️ 4-6 hours per week manually searching and reading
  • 📊 10-15 papers reviewed per session
  • 📝 Inconsistent note-taking and synthesis

After Automation

  • ⏱️ 5 minutes (one click + wait time)
  • 📊 50 papers processed automatically
  • 📝 Structured AI-generated synthesis every time
  • 📧 Professional reports delivered to inbox

Time saved: ~20 hours per month


Key Learnings

1. Error Handling is Critical

Setting onError: "continueRegularOutput" on all API nodes means one failed API doesn't crash the entire workflow.

2. Batching Saves API Costs

Processing 5 papers at once reduced my Groq API calls by 80%.

3. Prompt Engineering Matters

Specifying "Return ONLY valid JSON" in the system message dramatically improved parsing reliability.

4. Centralised Configuration is Essential

One configuration node makes the workflow easy to customize and maintain.

5. Parallel Execution is Powerful

Querying 5 APIs simultaneously reduced execution time from 15+ minutes to under 5 minutes.


How to Use This Workflow

Prerequisites

  • n8n instance (self-hosted or cloud)
  • Groq API key (free tier available)
  • Google account (for Sheets & Gmail)

Setup Steps

  1. Import the workflow into n8n
  2. Configure credentials:
  • Groq API (OpenAI-compatible endpoint; see the quick key check after this list)
  • Google Sheets OAuth2
  • Gmail OAuth2
  3. Update configuration:
  • Change keywords to your research topic
  • Adjust relevance_threshold for filtering
  • Update email recipient
  4. Run the workflow and wait for your report!
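Before running the full workflow, it can save time to confirm the Groq key works outside n8n. A quick sketch using Groq's OpenAI-compatible chat endpoint (endpoint and model name as of writing; adjust if either has changed):

// Quick sanity check for a Groq API key
// Save as check-groq.mjs and run with: node check-groq.mjs (Node 18+, built-in fetch)
const res = await fetch('https://api.groq.com/openai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.GROQ_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'llama-3.3-70b-versatile',
    messages: [{ role: 'user', content: 'Reply with the single word: ok' }],
    max_tokens: 5
  })
});

const data = await res.json();
console.log(res.status, data.choices?.[0]?.message?.content);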

Code Repository

The complete workflow JSON and setup guide are available on GitHub:

👉 https://github.com/chidoziemanagwu/Research-Automation-Workflow

Includes:

  • workflow.json - The complete n8n workflow
  • README.md - Overview and features
  • SETUP.md - Step-by-step configuration guide


Future Enhancements

I'm planning to add:

  • 📄 PDF full-text analysis for deeper insights
  • 📊 Citation network visualisation
  • Scheduled execution (daily/weekly)


Conclusion

Building this automation taught me that the best code is the code you don't have to write. By combining n8n's visual workflow builder, Groq's powerful AI, and multiple academic APIs, I created a system that does in 5 minutes what used to take me hours.

If you're a researcher, student, or anyone who regularly reviews academic literature, I encourage you to build something similar. The tools are accessible, the APIs are (mostly) free, and the time savings are massive.


About the Author

Chidozie Managwu is a FreeCodeCamp contributor, software engineer, and automation expert focused on using AI and workflow tools to solve real-world problems.


Written by chidoziemanagwu | Here to share what I build and my thoughts
Published by HackerNoon on 2025/12/02