A Practical Guide to Multi-Agent Swarms and Automated Evaluation for Content Analysis

Written by anilmalakar | Published 2026/02/26
Tech Story Tags: generative-ai | agentic-ai | swarm | ollama | strands-agents | multi-agent-swarms | ai-content-analysis | automated-evaluation

TL;DR: This guide builds a content analysis system using Strands multi-agent swarms (sentiment, toxicity, and analysis agents via Ollama Llama 3.1), plus automated evaluation with an LLM-as-judge scoring correctness and relevance.

Introduction

Modern public-facing AI applications increasingly require sophisticated content analysis capabilities that can handle multiple evaluation dimensions simultaneously. Traditional single-agent approaches often fall short when dealing with complex content that requires analysis across multiple domains, such as sentiment analysis, toxicity detection, and summarization. This article demonstrates how to build a robust content analysis system using multi-agent swarms and automated evaluation frameworks, leveraging the Strands Agent library to create scalable and reliable AI solutions.

Background

Multi-agent systems represent a paradigm shift from monolithic AI solutions to distributed, specialized intelligent networks. In content analysis scenarios, different aspects of text mandate different expertise. Sentiment analysis demands emotional intelligence, toxicity detection requires safety awareness, and summarization needs comprehension skills. By orchestrating multiple specialized agents through a swarm architecture, we can achieve more accurate and comprehensive analysis while maintaining system reliability through automated evaluation.

The Strands framework provides the foundation for building these systems, offering both individual agent capabilities and swarm orchestration features. Combined with the strands_evals evaluation framework, developers can ensure their multi-agent systems perform consistently and meet quality standards.

Prerequisites

Before implementing the solution, ensure you have:

  • Python 3.13+ environment
  • An LLM runtime (Ollama used in this example)
  • Strands libraries and the evaluation framework installed (via requirements.txt): strands-agents, strands-agents-tools, strands-agents-evals
  • Basic understanding of agent-based systems
  • Familiarity with Python type hints and programming concepts

Solution Design

In this section, we'll dive into the core architecture and implementation of our content analysis system. The design leverages multi-agent swarms for distributed analysis and automated evaluation for quality assurance. We'll break it down step by step, starting with an overview, then walking through the key components, code implementations, and integration. This approach ensures modularity, allowing you to extend the system (e.g., by adding more agents) while maintaining reliability through built-in testing.

Architecture Overview

The system is built around three interconnected components. Create your project structure with the three files described below, copying the code for each from the corresponding snippet.


1. ContentAnalysisSwarm: A multi-agent swarm that orchestrates specialized agents to analyze content across dimensions like sentiment and toxicity. An entry-point agent coordinates the process, handing off tasks and aggregating results.

2. ContentEvaluator: An automated evaluator that assesses the swarm's output for accuracy, completeness, and safety using another AI agent as a "judge." This creates a feedback loop to validate results.

3. Integration Layer: A pipeline that ties the swarm and evaluator together, running analyses on input content and generating evaluation reports. This layer uses test cases and experiments for reproducible testing.

The workflow is as follows:

  • Input content (e.g., a text message) enters the swarm.
  • Specialized agents process it.
  • The aggregated result is evaluated against the defined criteria.
  • Outputs include analysis details and a scored report.

This design draws from the Strands library for agent/swarm management and strands_evals for evaluation, ensuring scalability and debuggability.
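Before diving into the Strands implementation, the workflow above can be sketched with plain Python functions standing in for the agents. This is purely illustrative (the stand-in return values are hardcoded; the real agents are LLM-backed), but it makes the data flow concrete:

```python
# Illustrative sketch of the pipeline flow; each function is a hardcoded
# stand-in for an LLM-backed agent.
def sentiment_agent(text: str) -> str:
    # Real agent returns "positive", "negative", or "neutral".
    return "neutral"

def toxicity_agent(text: str) -> str:
    # Real agent returns "toxic" or "safe".
    return "safe"

def analyze_agent(text: str, sentiment: str, toxicity: str) -> str:
    # Stand-in for the coordinator that synthesizes the findings.
    return f"Sentiment is {sentiment} and content is {toxicity}."

def run_pipeline(content: str) -> dict:
    # Input enters, specialists process, the coordinator aggregates.
    sentiment = sentiment_agent(content)
    toxicity = toxicity_agent(content)
    summary = analyze_agent(content, sentiment, toxicity)
    return {"sentiment": sentiment, "toxicity": toxicity, "analysis": summary}

report = run_pipeline("You won $1 MILLION, CLICK this link!")
```

In the real system, Swarm replaces `run_pipeline` and decides the routing dynamically via handoffs rather than a fixed call order.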


Step 1: Defining the Multi-Agent Swarm

The foundation is a swarm of specialized agents, each focused on a narrow task to promote accuracy and efficiency. We use a shared LLM backend (Ollama in this case, which runs locally at no cost) to power all agents while allowing customization via system prompts.

Key principles for agent design:

  • Specialization: Each agent has one responsibility to avoid overload.
  • Constrained Outputs: Prompts enforce simple, structured responses (e.g., "positive" or "negative") for easy parsing and reliability.
  • Orchestration: The Swarm class handles handoffs, preventing infinite loops with limits on iterations and handoffs.
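The constrained-outputs principle pays off downstream: when a prompt restricts an agent to a closed vocabulary, the reply can be validated mechanically. Here is a small, hedged sketch (the helper and its name are my own, not part of Strands) that normalizes and checks a reply against the labels used in this article:

```python
# Validate that an agent's reply respects its constrained vocabulary.
# The label sets mirror the system prompts used in this article;
# parse_constrained itself is a hypothetical helper, not a Strands API.
ALLOWED = {
    "sentiment_agent": {"positive", "negative", "neutral"},
    "toxicity_agent": {"toxic", "safe"},
}

def parse_constrained(agent_name: str, reply: str) -> str:
    # Normalize whitespace, case, and a trailing period before checking.
    label = reply.strip().lower().rstrip(".")
    if label not in ALLOWED[agent_name]:
        raise ValueError(f"{agent_name} returned unexpected label: {reply!r}")
    return label

print(parse_constrained("sentiment_agent", " Negative. "))
```

A rejected reply is a useful signal that the prompt needs tightening or the model needs a lower temperature.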

Here's the implementation from 'content_swarms_analysis.py':

from strands import Agent
from strands.multiagent import Swarm

class ContentAnalysisSwarm:
    def __init__(self, content_model=None):
        # Coordinator: synthesizes the specialists' findings into one sentence.
        analyze_agent = Agent(
            model=content_model,
            name="analyze_agent",
            system_prompt="Analyze the findings from the sentiment_agent and toxicity_agent and provide the response in one sentence.",
        )
        # Specialist: constrained to a single sentiment label.
        sentiment_agent = Agent(
            model=content_model,
            name="sentiment_agent",
            system_prompt="Analyze sentiment. Return only: positive, negative, or neutral.",
        )
        # Specialist: constrained to a binary toxicity label.
        toxicity_agent = Agent(
            model=content_model,
            name="toxicity_agent",
            system_prompt="Check for toxic content. Return only: toxic or safe.",
        )

        self.swarm = Swarm(
            [analyze_agent, sentiment_agent, toxicity_agent],
            entry_point=analyze_agent,
            repetitive_handoff_detection_window=2,
            repetitive_handoff_min_unique_agents=2,
            max_handoffs=2,
            max_iterations=2,
            execution_timeout=180.0,
        )

    def analyze(self, content: str):
        return self.swarm(content)

Explanation:

  • The 'analyze_agent' acts as the coordinator and entry point, synthesizing outputs from the others into a final one-sentence response (e.g., identifying scams).
  • Handoffs occur automatically: The swarm routes the content to sentiment and toxicity agents, then back to analyze_agent.
  • Limits like 'max_handoffs=2' and 'max_iterations=2' ensure efficiency and prevent redundancy.
  • To extend, add more agents (e.g., a summary_agent) to the list and update the entry-point prompt to incorporate their outputs.

This setup transforms a single LLM into a collaborative network, improving analysis depth without custom fine-tuning.
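To see why the `max_handoffs` and `max_iterations` limits matter, here is a hedged, simplified sketch of the kind of loop guard a swarm orchestrator needs (this is my own illustration, not Strands internals; the real Swarm also does repetitive-handoff detection):

```python
# Hypothetical simplification of handoff limiting: each agent returns a
# (result, next_agent) pair, and the loop stops when limits are reached
# or no further handoff is requested.
def run_with_limits(agents, entry, content, max_handoffs=2, max_iterations=2):
    current, handoffs, iterations = entry, 0, 0
    result = None
    while iterations < max_iterations:
        iterations += 1
        result, next_agent = agents[current](content)
        if next_agent is None or handoffs >= max_handoffs:
            break  # done, or the handoff budget is exhausted
        handoffs += 1
        current = next_agent
    return result

# Toy agents: the coordinator hands off once, the specialist terminates.
agents = {
    "analyze": lambda c: ("routing", "sentiment"),
    "sentiment": lambda c: ("neutral", None),
}
print(run_with_limits(agents, "analyze", "hello"))
```

Without such caps, two agents that keep deferring to each other would loop until the execution timeout fires.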


Step 2: Implementing Automated Evaluation

Analysis alone isn't enough; production outputs must be validated to catch errors, biases, or regressions. We use an evaluator that employs another agent as an impartial "judge" to score results based on predefined criteria.

Why automated evaluation?

  • Consistency: Prevents subjective human reviews.
  • Scalability: Runs in CI/CD pipelines for ongoing testing.
  • Feedback Loop: Highlights issues such as incomplete analyses to enable iterative improvements.
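For the CI/CD use case above, the per-case judge scores need to be aggregated into a single number that can gate a build. A minimal sketch (the `JudgeResult` type and `mean_score` helper are hypothetical, not part of strands_evals):

```python
from dataclasses import dataclass

# Hypothetical aggregation helper so regressions surface as one number
# in a CI pipeline; not part of the strands_evals API.
@dataclass
class JudgeResult:
    case_name: str
    score: float  # 0.0-1.0, as produced by the judge agent
    reason: str

def mean_score(results: list[JudgeResult]) -> float:
    return sum(r.score for r in results) / len(results)

results = [
    JudgeResult("swarm_analysis", 1.0, "Correct scam detection."),
    JudgeResult("benign_message", 0.8, "Mostly relevant."),
]
assert abs(mean_score(results) - 0.9) < 1e-9
```

A CI job can then fail the build whenever the mean drops below an agreed threshold, turning the evaluation into a regression test.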

Implementation from ‘content_evaluator.py’:

from strands import Agent
from strands_evals.evaluators import Evaluator
from strands_evals.types import EvaluationData, EvaluationOutput
from typing_extensions import TypeVar

InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")

class ContentEvaluator(Evaluator[InputT, OutputT]):
    def __init__(self, model: str, expected_output: str):
        super().__init__()
        self.model = model
        self.expected_output = expected_output

    def evaluate(self, evaluation_case: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]:
        """Synchronous evaluation implementation."""
        # Judge agent: scores the swarm's output against the expected answer.
        judge = Agent(
            model=self.model,
            system_prompt=f"""
            Evaluate the response against the expected output: {self.expected_output}
            Score it on: 1. correctness: Is the actual answer correct? 2. relevance: Is the response relevant?""",
            callback_handler=None,
        )

        prompt = f"""
        Input: {evaluation_case.input}
        Response: {evaluation_case.actual_output}
        Evaluate the response and you MUST add detailed reasons to support your evaluation.
        """

        # structured_output coerces the judge's reply into the EvaluationOutput schema.
        result = judge.structured_output(EvaluationOutput, prompt)
        return [result]

Explanation:

  • The evaluator receives the expected output at construction time rather than hardcoding it in the class, making it reusable across cases.
  • The judge agent uses a dynamic prompt incorporating the actual output for context-aware scoring.
  • Criteria (correctness, relevance) are explicit; customize them for your needs.
  • Output is structured (via ‘structured_output’), including a score (0.0–1.0) and detailed reasons for transparency.

This "LLM-as-judge" pattern is efficient because it reuses the same LLM backend for evaluation, though you can swap in a different model for the judge.
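Because `structured_output` ultimately coerces the judge's free-form reply into a typed score-plus-reason shape, it is worth seeing what that validation involves. The sketch below is a stdlib stand-in (my own `parse_judgment` helper, not the strands_evals implementation), validating a raw JSON reply into the 0.0-1.0 score range this article relies on:

```python
import json

# Stdlib stand-in for schema-coerced judge output; parse_judgment is a
# hypothetical helper, not part of strands_evals.
def parse_judgment(raw: str) -> dict:
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reason": str(data.get("reason", ""))}

raw = '{"score": 1.0, "reason": "The response correctly flags the scam."}'
judgment = parse_judgment(raw)
```

Rejecting out-of-range scores at parse time keeps a hallucinated "score: 7" from silently passing a quality gate.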


Step 3: Integrating Analysis and Evaluation in a Pipeline

Now, we combine the swarm and evaluator into a runnable pipeline. This uses test cases and experiments from strands_evals to simulate real-world inputs, run analyses, evaluate outputs, and display reports.

Implementation from 'analyze.py' (main entry point):

from content_swarms_analysis import ContentAnalysisSwarm
from content_evaluator import ContentEvaluator
from strands_evals import Case, Experiment
from strands.models.ollama import OllamaModel

# Ollama backend shared by the swarm and the judge.
ollama_model = OllamaModel(
    host="http://localhost:11434",  # Ollama server address
    model_id="llama3.1:8b",         # which local model to use
    temperature=0.2,                # low temperature for more deterministic output
    keep_alive="2m",
    stop_sequences=["###", "END"],
    options={"top_k": 10},
)

test_content = "You won $1 MILLION, CLICK this link http://1Million.com!!! and share your bank account details to transfer the funds."

test_case = Case[str, str](
    name="swarm_analysis",
    input=test_content,
    metadata={"source": "swarm_evaluation"},
)

swarm = ContentAnalysisSwarm(content_model=ollama_model)

def analyze_and_evaluate(content_data: str) -> str:
    try:
        return swarm.analyze(content_data)
    except (AttributeError, KeyError, TypeError) as e:
        print(f"Error accessing results: {e}")

def get_swarm_response(case: Case) -> str:
    swarm_result = swarm.analyze(case.input)
    return str(swarm_result)

if __name__ == "__main__":
    result = analyze_and_evaluate(test_content)

    # Run the evaluation and display the report.
    evaluator = ContentEvaluator(
        model=ollama_model,
        expected_output="The user request contains suspicious language and may be a scam.",
    )
    experiment = Experiment[str, str](cases=[test_case], evaluators=[evaluator])
    reports = experiment.run_evaluations(get_swarm_response)
    reports[0].run_display(include_actual_output=False, include_expected_interactions=False)

Explanation:

  • Model Setup: Configures Ollama as the backend with parameters for consistency (low temperature for deterministic outputs).
  • Test Case: Defines input, expected output (for benchmarking), and metadata.
  • Pipeline Flow: Run swarm -> Extract result -> Define response getter -> Create experiment -> Run evaluations -> Display report.
  • Error Handling: Catches common issues like missing keys in results.
  • Running It: For the test content (a scam message), typical swarm outputs might include:

a) Sentiment: positive (due to exciting language)

b) Toxicity: safe (no hate speech)

c) Analysis: "The user request contains suspicious language and may be a scam." (from analyze_agent synthesizing findings)

  • The evaluation report scores this and provides reasons, e.g., "Score: 1.0 – The user request contains suspicious language and may be a scam. The sentiment analysis result is 0.5, indicating that the text has a neutral sentiment, but the toxicity analysis tool identifies it as a phishing scam."

To scale, add multiple cases to the experiment for batch testing.
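Scaling to multiple cases follows directly from the pipeline flow above. The sketch below simulates batch testing with a keyword-based stand-in for `swarm.analyze()` (the `fake_analyze` function and case names are my own illustration, not the real swarm):

```python
# Sketch of batch evaluation: run every case through the pipeline and
# collect per-case results. fake_analyze is a keyword-based stand-in for
# the real swarm; the case names here are hypothetical.
cases = [
    {"name": "scam_message", "input": "You won $1 MILLION, CLICK this link!"},
    {"name": "friendly_note", "input": "Thanks for the great meetup!"},
]

def fake_analyze(text: str) -> str:
    # Stand-in for swarm.analyze(); flags obvious scam keywords.
    return "scam" if "CLICK" in text or "$1 MILLION" in text else "safe"

def run_batch(cases):
    return {c["name"]: fake_analyze(c["input"]) for c in cases}

report = run_batch(cases)
```

With the real system, the equivalent is simply passing a longer `cases=[...]` list to `Experiment`, which then evaluates each case and produces one report per run.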

Key Design Principles

  1. Specialization: Agents handle one domain each for focused expertise.
  2. Orchestration: Swarm automates coordination, reducing manual coding.
  3. Evaluation Integration: Built-in checks ensure outputs meet standards.
  4. Modularity: Swap models, add agents, or tweak prompts without full rewrites.

This step-by-step design creates a robust, extensible system ready for production content analysis.

Test the solution

Once you have your solution ready with the described files, test it by running the following command in the terminal. You can see the handoffs between agents working, followed by the Evaluation Report.

> python .\analyze.py

Tool #3: handoff_to_agent Response: This is a scam, do not click on the link or share your bank account details. The sentiment agent found that the message has a negative sentiment, indicating that it's trying to deceive the user. The toxicity agent found that the message is highly toxic and contains language that is intended to manipulate the user into giving away their personal information.

Conclusion

Multi-agent swarms combined with automated evaluation represent a powerful approach to building robust content analysis systems. By leveraging specialized agents orchestrated through swarm intelligence and validated through systematic evaluation, developers can create AI solutions that are both sophisticated and reliable.

The Strands framework provides the necessary tools to implement these patterns effectively, enabling rapid development of production-ready multi-agent systems. As AI applications become more complex, this architectural approach offers a path to managing that complexity while maintaining system quality and performance.

The integration of swarm intelligence with automated evaluation creates a feedback loop that continuously improves system performance, making it an ideal foundation for enterprise-grade AI applications requiring high reliability and consistent output quality.

If you’re building enterprise-grade AI applications, swarm-based design with evaluation baked in should be part of your toolbox.


Test the solution and enhance it to learn more. Questions? Drop a comment below. Happy learning!


Written by anilmalakar | Technical advisor with over 22 years in AWS/Azure cloud ecosystems, web/windows applications, smart client apps, Agentic AI and GenAI core systems.
Published by HackerNoon on 2026/02/26