How I Built an AI-Powered Research Automation System with n8n, Groq, and 5 Academic APIs

Written by chidoziemanagwu | Published 2025/12/02
Tech Story Tags: open-source-tools | n8n-tutorial | automation | n8n-workflow | academic-twitter | llm-applications | groq-ai | research-automation

TL;DR: This post guides you through architecting an AI-powered research automation system using a low-code approach. It shows how to integrate n8n for workflow orchestration, Groq for high-speed LLM inference, and five academic APIs to automatically query, analyze, and process research data, accelerating the entire R&D-to-publication cycle.

The Problem: Research Takes Too Long

As a researcher and developer, I found myself spending hours manually searching academic databases, reading abstracts, and trying to synthesize findings across multiple sources.

For my work on circular economy and battery recycling, I needed to:

  • Search 5 different academic databases
  • Read through dozens of papers
  • Extract key findings and methodologies
  • Identify research gaps and trends
  • Synthesize everything into a coherent report

This process took 4-6 hours per week. I knew there had to be a better way.


The Solution: Full-Stack Research Automation

I built an n8n workflow that does all of this automatically in under 5 minutes. Here's what it does:

✅ Queries 5 academic APIs simultaneously

✅ Uses AI to extract insights from each paper

✅ Scores papers for relevance and quality

✅ Stores everything in Google Sheets

✅ Generates a comprehensive synthesis report

✅ Emails me a beautiful HTML report

Result: I went from 4-6 hours of manual work to one click and 5 minutes of waiting.


Tech Stack

Here's what powers this automation:

  • n8n: Workflow automation platform (open-source)
  • Groq AI: Llama 3.3 70B model for content extraction & synthesis
  • Semantic Scholar: Computer science & general academic papers
  • OpenAlex: 200M+ open-access research papers
  • Crossref: DOI registry & journal metadata
  • arXiv: Preprints in physics, math, CS
  • PubMed: Biomedical & life sciences
  • Google Sheets: Data persistence
  • Gmail: Report delivery


Architecture Overview

Here's the complete workflow in 7 stages:

Manual Trigger → Configuration → 5 Parallel API Calls → 
Normalize & Deduplicate → AI Extraction → Score & Filter → 
Google Sheets + AI Synthesis → Email Report

Total nodes: 23

Execution time: 2-5 minutes

Papers processed: Up to 50 (10 per API)


Stage 1: Configuration

Instead of hardcoding search parameters in every node, I created a central configuration node:

{
  "keywords": "circular economy battery recycling remanufacturing",
  "min_year": "2020",
  "min_citations": "2",
  "max_results": "10",
  "relevance_threshold": "15"
}

This makes the workflow easy to customize. Want to research a different topic? Just change the keywords field.
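Downstream nodes read these values instead of hardcoding them. Here's a minimal sketch of how a later Code node might pull the shared settings (the node name "Configuration" is an assumption; use whatever you named your config node). In the HTTP Request nodes, the same values appear as expressions like {{ $json.keywords }}, as in the Semantic Scholar example below.

// Sketch: any downstream Code node can read the shared settings like this
// Assumes the config node is named "Configuration" -- adjust to your workflow
const config = $('Configuration').first().json;

const keywords = config.keywords;
const minYear = Number(config.min_year);
const threshold = Number(config.relevance_threshold);

return [{ json: { keywords, minYear, threshold } }];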


Stage 2: Parallel API Collection

The workflow queries all 5 APIs simultaneously using n8n's parallel execution. Each API call is preceded by a "Rate Limit Delay" node to avoid 429 errors.

Example: Semantic Scholar API

// HTTP Request Node Configuration
{
  "url": "https://api.semanticscholar.org/graph/v1/paper/search",
  "queryParameters": {
    "query": "{{ $json.keywords }}",
    "limit": "{{ $json.max_results }}",
    "fields": "title,abstract,year,authors,citationCount,venue,externalIds,url,openAccessPdf",
    "year": "{{ $json.min_year }}-"
  }
}

Key insight: All API nodes have onError: "continueRegularOutput" set. This means if one API fails, the workflow continues with whatever data it successfully retrieved.
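The other four APIs follow the same pattern with different parameter names. For comparison, an OpenAlex request might look roughly like this (an illustrative sketch; the parameter names follow the OpenAlex works endpoint, but double-check the filter syntax against its docs):

// HTTP Request Node Configuration (OpenAlex, illustrative sketch)
{
  "url": "https://api.openalex.org/works",
  "queryParameters": {
    "search": "{{ $json.keywords }}",
    "per-page": "{{ $json.max_results }}",
    "filter": "from_publication_date:{{ $json.min_year }}-01-01"
  }
}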


Stage 3: Normalization & Deduplication

Each API returns data in a different format. This JavaScript code normalizes everything into a standard structure:

// Simplified normalization logic (runs in an n8n Code node)
const results = [];
const seenDOIs = new Set();
const seenTitles = new Set();

// In n8n, the merged responses from the five API branches arrive as input items
const apiResponses = $input.all();

for (const item of apiResponses) {
  // Skip failed API calls (e.g. rate-limited requests)
  if (item.json.error || item.json.status === 429) {
    console.log('Skipping failed API');
    continue;
  }

  // Normalize Semantic Scholar
  if (item.json.data && Array.isArray(item.json.data)) {
    for (const paper of item.json.data) {
      const normalized = {
        title: paper.title || 'Unknown',
        abstract: paper.abstract || 'No abstract',
        year: paper.year || null,
        authors: (paper.authors || []).map(a => a.name).join(', '),
        citations: paper.citationCount || 0,
        doi: paper.externalIds?.DOI || null,
        source: 'Semantic Scholar'
      };

      // Deduplicate by DOI, falling back to a normalized title
      const titleKey = normalized.title.toLowerCase().trim();
      if (normalized.doi ? seenDOIs.has(normalized.doi) : seenTitles.has(titleKey)) continue;
      if (normalized.doi) seenDOIs.add(normalized.doi);
      seenTitles.add(titleKey);
      results.push(normalized);
    }
  }

  // Similar logic for OpenAlex, Crossref, arXiv, PubMed...
}

// Each returned object becomes one n8n item for the next node
return results.map(paper => ({ json: paper }));

Why this matters: Without normalization, the next stages wouldn't know how to read the data. Without deduplication, you'd get the same paper multiple times.
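For reference, here is roughly what the OpenAlex branch of that loop looks like. The field names follow the OpenAlex works schema (note that OpenAlex returns abstracts as an inverted index, so they have to be reassembled); treat this as a sketch rather than the exact code from the workflow:

// Sketch: OpenAlex branch of the normalization loop
function rebuildAbstract(inverted) {
  if (!inverted) return 'No abstract';
  const words = [];
  for (const [word, positions] of Object.entries(inverted)) {
    for (const pos of positions) words[pos] = word;
  }
  return words.join(' ');
}

if (item.json.results && Array.isArray(item.json.results)) {
  for (const work of item.json.results) {
    const normalized = {
      title: work.display_name || 'Unknown',
      abstract: rebuildAbstract(work.abstract_inverted_index),
      year: work.publication_year || null,
      authors: (work.authorships || []).map(a => a.author.display_name).join(', '),
      citations: work.cited_by_count || 0,
      doi: work.doi ? work.doi.replace('https://doi.org/', '') : null,
      source: 'OpenAlex'
    };
    // ...same dedupe-and-push logic as in the Semantic Scholar branch
  }
}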


Stage 4: AI-Powered Content Extraction

Now comes the magic. For each paper, I use Groq's Llama 3.3 70B model to extract structured insights.

The Prompt

const prompt = `You are an expert research analyst. Analyze this paper:

Title: ${paper.title}
Abstract: ${paper.abstract}
Year: ${paper.year}
Authors: ${paper.authors}

Extract:
1. research_question: Main research question
2. methodology: Research methods used
3. key_findings: Main findings (2-3 sentences)
4. conclusion: Key conclusions
5. themes: Array of themes (e.g., ["circular economy", "battery recycling"])

Return ONLY valid JSON with these exact keys.`;

The API Call

// Groq API Request
{
  "model": "llama-3.3-70b-versatile",
  "messages": [
    {
      "role": "system",
      "content": "You are a research extraction assistant. Always return valid JSON only."
    },
    {
      "role": "user",
      "content": prompt
    }
  ],
  "temperature": 0.3,  // Low temperature for factual extraction
  "max_tokens": 2000
}
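Because LLMs occasionally wrap JSON in extra text, the response is worth parsing defensively. Here's a minimal sketch of how the reply might be handled in a Code node, assuming the standard OpenAI-compatible response shape that Groq returns (choices[0].message.content):

// Sketch: pull the model's reply and parse it defensively
const reply = $json.choices?.[0]?.message?.content || '';

let extracted;
try {
  // Strip anything around the first {...} block before parsing
  const jsonText = reply.slice(reply.indexOf('{'), reply.lastIndexOf('}') + 1);
  extracted = JSON.parse(jsonText);
} catch (err) {
  // Fall back to a stub so one bad response doesn't break the run
  extracted = {
    research_question: null,
    methodology: null,
    key_findings: reply.slice(0, 300),
    conclusion: null,
    themes: []
  };
}

return [{ json: extracted }];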

Optimization: I use n8n's batching feature to process 5 papers at once, reducing API calls by 80%.
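One way to implement that kind of batching in a Code node is to chunk the papers before building prompts. This is a rough sketch under the assumption that each batch is analyzed in a single prompt and the model returns a JSON array; it's not necessarily the author's exact node setup:

// Sketch: group papers into batches of 5 so each Groq call analyzes several at once
const papers = $input.all().map(item => item.json);
const BATCH_SIZE = 5;

const batches = [];
for (let i = 0; i < papers.length; i += BATCH_SIZE) {
  const batch = papers.slice(i, i + BATCH_SIZE);
  const prompt = batch.map((p, idx) =>
    `Paper ${idx + 1}:\nTitle: ${p.title}\nAbstract: ${p.abstract}`
  ).join('\n\n');
  batches.push({ json: { prompt, paperCount: batch.length } });
}

return batches;

With 50 papers, that's 10 requests instead of 50, which lines up with the 80% reduction mentioned above.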


Stage 5: Scoring & Filtering

Not all papers are equally relevant. I built a custom scoring algorithm:

Relevance Score (max 70 points)

const keywords = ['circular', 'economy', 'recycling', 'remanufacturing', 
                  'sustainability', 'waste', 'battery', 'lithium'];

// title, abstract and themes are assumed to be lowercased upstream,
// so the keyword matching is case-insensitive
let relevanceScore = 0;
for (const keyword of keywords) {
  if (title.includes(keyword)) relevanceScore += 6;
  if (abstract.includes(keyword)) relevanceScore += 4;
  if (themes.includes(keyword)) relevanceScore += 3;
}
relevanceScore = Math.min(relevanceScore, 70);  // Cap at the 70-point maximum

Quality Score (max 30 points)

const citationScore = Math.min(citations / 3, 20);  // Max 20 points
const recencyScore = Math.max(0, 10 - (currentYear - paperYear));  // Max 10 points
const qualityScore = citationScore + recencyScore;

Total Score

const totalScore = relevanceScore + qualityScore;  // Max 100 points

Papers below the threshold (default: 15) are filtered out.
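Putting it together, the filter step is just a threshold check against the configuration value. A sketch, assuming each item carries the totalScore computed above and that the config node from Stage 1 is named "Configuration":

// Sketch: drop papers that fall below the relevance threshold from the config node
const threshold = Number($('Configuration').first().json.relevance_threshold) || 15;

return $input.all().filter(item => item.json.totalScore >= threshold);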


Stage 6: Storage & Synthesis

Filtered papers are saved to Google Sheets for future reference. Then, all papers are aggregated into one massive prompt for AI synthesis:

const prompt = `You are a research synthesis expert analyzing ${papersCount} papers.

[Paper summaries with all extracted data...]

Generate a comprehensive synthesis report with:
- Executive Summary
- Key Themes Identified
- Emerging Trends
- Research Gaps
- Methodological Approaches
- Key Findings Summary
- Future Research Directions`;

Groq's AI generates a multi-section markdown report analyzing all papers together.
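The aggregation itself is plain string building. Here's a rough sketch of how the paper summaries might be assembled before being dropped into that prompt (illustrative only; field names match the extraction keys from Stage 4):

// Sketch: collapse every extracted paper into a compact text block for the synthesis prompt
const papers = $input.all().map(item => item.json);

const summaries = papers.map((p, i) => [
  `--- Paper ${i + 1} ---`,
  `Title: ${p.title} (${p.year}, ${p.citations} citations)`,
  `Findings: ${p.key_findings}`,
  `Methodology: ${p.methodology}`,
  `Themes: ${(p.themes || []).join(', ')}`
].join('\n')).join('\n\n');

return [{ json: { papersCount: papers.length, summaries } }];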


Stage 7: Email Delivery

The final node converts the markdown synthesis into a beautiful HTML email:

// Markdown to HTML conversion
let html = synthesisText
  .replace(/## (.*?)\n/g, '<h2 style="color: #2c3e50;">$1</h2>')
  .replace(/\*\*(.*?)\*\*/g, '<strong>$1</strong>')
  .replace(/^- (.*?)$/gm, '<li>$1</li>');
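One detail worth handling: the snippet above emits bare <li> elements, so runs of consecutive items should be wrapped in a <ul> to render as a proper list. A small addition along these lines works (a sketch, not necessarily the exact conversion used in the workflow):

// Wrap runs of consecutive <li> lines in a single <ul> so lists render correctly
html = html.replace(/(?:<li>.*<\/li>\s*)+/g, match => `<ul>${match}</ul>`);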

The email includes:

  • Paper count and date
  • Data source summary
  • Full synthesis report with styling
  • Link to Google Sheets
  • Timestamp


Results & Impact

Before Automation

  • ⏱️ 4-6 hours per week manually searching and reading
  • 📊 10-15 papers reviewed per session
  • 📝 Inconsistent note-taking and synthesis

After Automation

  • ⏱️ 5 minutes (one click + wait time)
  • 📊 50 papers processed automatically
  • 📝 Structured AI-generated synthesis every time
  • 📧 Professional reports delivered to inbox

Time saved: ~20 hours per month


Key Learnings

1. Error Handling is Critical

Setting onError: "continueRegularOutput" on all API nodes means one failed API doesn't crash the entire workflow.

2. Batching Saves API Costs

Processing 5 papers at once reduced my Groq API calls by 80%.

3. Prompt Engineering Matters

Specifying "Return ONLY valid JSON" in the system message dramatically improved parsing reliability.

4. Centralised Configuration is Essential

One configuration node makes the workflow easy to customize and maintain.

5. Parallel Execution is Powerful

Querying 5 APIs simultaneously reduced execution time from 15+ minutes to under 5 minutes.


How to Use This Workflow

Prerequisites

  • n8n instance (self-hosted or cloud)
  • Groq API key (free tier available)
  • Google account (for Sheets & Gmail)

Setup Steps

  1. Import the workflow into n8n
  2. Configure credentials:
  • Groq API (OpenAI-compatible endpoint; see the quick key check after this list)
  • Google Sheets OAuth2
  • Gmail OAuth2
  3. Update configuration:
  • Change keywords to your research topic
  • Adjust relevance_threshold for filtering
  • Update email recipient
  4. Run the workflow and wait for your report!
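Before running the full workflow, it can save time to confirm the Groq key works outside n8n. A quick sketch using Groq's OpenAI-compatible chat endpoint (endpoint and model name as of writing; adjust if either has changed):

// Quick sanity check for a Groq API key
// Save as check-groq.mjs and run with: node check-groq.mjs (Node 18+, built-in fetch)
const res = await fetch('https://api.groq.com/openai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.GROQ_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'llama-3.3-70b-versatile',
    messages: [{ role: 'user', content: 'Reply with the single word: ok' }],
    max_tokens: 5
  })
});

const data = await res.json();
console.log(res.status, data.choices?.[0]?.message?.content);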

Code Repository

The complete workflow JSON and setup guide are available on GitHub:

👉 https://github.com/chidoziemanagwu/Research-Automation-Workflow

Includes:

  • workflow.json - The complete n8n workflow
  • README.md - Overview and features
  • SETUP.md - Step-by-step configuration guide


Future Enhancements

I'm planning to add:

  • 📄 PDF full-text analysis for deeper insights
  • 📊 Citation network visualisation
  • Scheduled execution (daily/weekly)


Conclusion

Building this automation taught me that the best code is the code you don't have to write. By combining n8n's visual workflow builder, Groq's powerful AI, and multiple academic APIs, I created a system that does in 5 minutes what used to take me hours.

If you're a researcher, student, or anyone who regularly reviews academic literature, I encourage you to build something similar. The tools are accessible, the APIs are (mostly) free, and the time savings are massive.


About the Author

Chidozie Managwu is a FreeCodeCamp contributor, software engineer, and automation expert focused on using AI and workflow tools to solve real-world problems.


Written by chidoziemanagwu | Here to share what I build and my thoughts
Published by HackerNoon on 2025/12/02