What tools do you need to build an AI agent?
It goes well beyond just the LLM. While the AI takes center stage, the tooling to create robust agents is more about infrastructure than intelligence. You need databases that can keep up with agent workloads, orchestration to handle multi-step processes, ways to verify that agents aren’t making mistakes, and systems to monitor everything in production.
This stack can contain dozens of tools, each with its own trade-offs and integration requirements. Here, we'll walk through each category in terms of how it fits into your agent.
The AI Agent framework
We could just list every tool you might need to build an AI agent, but that wouldn't be much help if you want to develop agents (or agent platforms) that meet a specific use case or production standard. Instead, we need to understand how these tools fit together into a coherent architecture.
AI agents operate in a loop: they gather context, take action based on that context, and verify the results.
This pattern repeats until the agent completes its task or determines it needs human intervention.
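In code, that loop looks roughly like the sketch below; the three helpers are stand-ins for whatever your context, action, and verification layers actually provide.

```python
# A minimal sketch of the agent loop: gather context, act, verify, repeat.
# The three helpers are stubs standing in for your retrieval, LLM, and checking layers.

MAX_STEPS = 10

def gather_context(state: dict) -> str:
    # Context layer: retrieve whatever data the task needs (stubbed here).
    return f"context for: {state['task']}"

def act(context: str, state: dict) -> str:
    # Action layer: call the model, run tools, produce a result (stubbed here).
    return f"draft answer using {context}"

def verify(result: str, state: dict) -> tuple[bool, str]:
    # Verification layer: check the work; here anything non-empty passes.
    return bool(result), "looks fine" if result else "empty result"

def run_agent(task: str) -> str | None:
    state = {"task": task, "history": []}
    for _ in range(MAX_STEPS):
        context = gather_context(state)
        result = act(context, state)
        ok, feedback = verify(result, state)
        state["history"].append((result, feedback))
        if ok:
            return result          # task complete
    return None                    # escalate to a human after too many attempts

print(run_agent("summarize yesterday's error logs"))
```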
Context layer
The context phase is where your agent gathers information. It needs to retrieve relevant data, load it efficiently into the model's context window, and decide what's important enough to keep. This shouldn’t be just passive data loading, but an active search process where the agent determines what information matters for the task at hand.
Action layer
The action phase is where your agent executes. It takes the context it's gathered and does something with it: calling APIs, writing code, transforming data, or triggering workflows. The key is giving your agent the right capabilities to solve problems flexibly rather than just following rigid scripts.
Verification layer
The verification phase closes the loop. Your agent checks its own work, catches errors, and decides whether to iterate or move forward. Agents that can self-correct are fundamentally more reliable because they catch mistakes before they compound.
This is all wrapped within an infrastructure that handles orchestration, monitoring, and scaling, ensuring your agent runs reliably in production.
Now, let's break down what you need at each layer.
Context layer tooling
The tools for the context layer are all about storing and retrieving information. The right tools depend on what kind of data you're working with and how your agent accesses it.
Databases for transactional data
Agents need OLTP databases that can:

- Provision instantly when spinning up new projects or user sessions
- Handle unpredictable, bursty workloads that idle most of the time and then spike suddenly
- Provide isolated environments for testing queries or schema changes
- Support multi-agent architectures where each agent or domain gets its own database
Full-stack agent platforms like Neon are built around these patterns, offering instant Postgres provisioning, scale-to-zero for bursty workloads, and database branching for isolated environments.
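As an illustration of the provisioning pattern, the sketch below spins up an isolated database per agent session through a hypothetical provisioning API; the endpoint, payload, and response shape are placeholders, not any specific vendor's API.

```python
# Sketch: provision an isolated database per agent session via a provisioning API.
# The URL, payload, and response fields are hypothetical placeholders; adapt them
# to whatever platform you actually use.
import os
import requests

def provision_session_db(session_id: str) -> str:
    resp = requests.post(
        "https://api.example-db-platform.com/v1/databases",   # placeholder endpoint
        headers={"Authorization": f"Bearer {os.environ['DB_PLATFORM_API_KEY']}"},
        json={"name": f"agent-session-{session_id}", "ttl_hours": 24},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["connection_string"]   # hand this to the agent's tools

# conn_str = provision_session_db("b41c-session-id")
```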
Vector databases
If your agent requires specialized vector workloads at massive scale with optimized indexing, you can lean on dedicated databases like:
- Pinecone: Managed vector database with high performance at scale
- Weaviate: Open-source vector search engine with GraphQL API and hybrid search
- Qdrant: Rust-based vector database with advanced filtering capabilities
- Chroma: Lightweight embedding database optimized for development and prototyping
These make sense when you're working with billions of vectors or need advanced filtering. For most agents, keeping everything in a single database system will reduce complexity and latency. A popular option is to skip a specialized vector database altogether and handle vector search alongside transactional data through the pgvector extension for Postgres.
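A minimal sketch of that single-database approach with psycopg and pgvector follows; the table and column names are illustrative, and the extension must already be installed.

```python
# Sketch: vector search next to transactional data in Postgres with pgvector.
# Assumes a `documents` table with an `embedding vector(1536)` column and the
# pgvector extension installed; names and the tenant filter are illustrative.
import os
import psycopg

query_embedding = [0.1] * 1536  # in practice, produced by your embedding model
vector_literal = "[" + ",".join(map(str, query_embedding)) + "]"

with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    rows = conn.execute(
        """
        SELECT id, title, created_at
        FROM documents
        WHERE tenant_id = %s                  -- ordinary relational filter
        ORDER BY embedding <=> %s::vector     -- cosine distance via pgvector
        LIMIT 5
        """,
        ("tenant-123", vector_literal),
    ).fetchall()

for row in rows:
    print(row)
```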
Blob storage
If your agent needs access to large files, documents, logs, or media (reading PDFs for context, processing images, analyzing logs, or storing generated reports and visualizations), you might need to integrate with object storage.
- AWS S3: Industry-standard with broad integration support
- Google Cloud Storage: Multi-region with strong consistency
- Azure Blob Storage: Enterprise storage with lifecycle management
- Cloudflare R2: S3-compatible with zero egress fees
Your agent accesses these through APIs, retrieving files on demand and storing outputs. The challenge is managing permissions and costs, especially egress charges.
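For example, here is a short sketch of reading an input document and writing an output with boto3 against S3; bucket and key names are illustrative, and credentials come from the environment.

```python
# Sketch: fetch a document from S3 for the agent to read, then store its output.
# Bucket and key names are illustrative.
import boto3

s3 = boto3.client("s3")

# Pull a PDF into memory for the context layer
obj = s3.get_object(Bucket="agent-documents", Key="reports/q3-summary.pdf")
pdf_bytes = obj["Body"].read()

# Store the agent's generated report back to object storage
s3.put_object(
    Bucket="agent-outputs",
    Key="summaries/q3-summary.md",
    Body=b"# Q3 summary\n...",
    ContentType="text/markdown",
)
```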
MCP servers
You'll also need a standardized way to connect your agent to external data without writing custom integrations for every service. The Model Context Protocol (MCP) provides that standard: MCP servers expose tools and data sources through a common interface that any MCP-capable agent can use. Common examples include:
- Filesystem MCP: Read and search local files
- Google Drive MCP: Search and retrieve cloud documents
- GitHub MCP: Access issues, PRs, and code
- Neon MCP: Query Neon databases through a standardized interface
The advantage is standardization. Once your agent knows MCP, it works with any MCP server without needing to learn service-specific APIs. Authentication and calls happen automatically.
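As a concrete example, here is a minimal sketch of an MCP server built with the FastMCP helper from the official Python SDK; the tool name and its logic are illustrative, and running over stdio is an assumption about the default transport.

```python
# Sketch: a tiny MCP server exposing one tool, using the FastMCP helper from
# the Python MCP SDK (`pip install mcp`). Tool name and logic are illustrative;
# any MCP-aware agent can discover and call it through the same protocol.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticket-lookup")

@mcp.tool()
def get_ticket_status(ticket_id: str) -> str:
    """Return the current status of a support ticket."""
    # Replace with a real lookup against your ticketing system.
    return f"Ticket {ticket_id}: open, awaiting customer reply"

if __name__ == "__main__":
    mcp.run()   # serves the tool over stdio by default
```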
For services without MCP servers, you fall back to direct REST APIs, webhooks, or custom connectors. More work, but complete control.
Action layer tooling
The action layer is where your agent executes tasks. Once it has context, it needs to reason about what to do and actually do it. This requires models for intelligence, frameworks for orchestration, and infrastructure for safe execution.
LLM providers
You need a language model as the reasoning engine for your agent. The model interprets context, decides what actions to take, generates responses, and calls tools. Your choice of model determines your agent's capabilities, cost, and latency.
Top-tier models for production agents:
- Claude (Anthropic): Strong reasoning, long context windows, excellent at following complex instructions
- GPT-5 (OpenAI): Powerful general-purpose model with broad capabilities
- Gemini (Google): Multimodal model with strong performance on complex reasoning and long context
The advantage is capability. These models handle complex reasoning, understand nuanced instructions, and generate high-quality outputs. They support function calling for tool use and maintain coherence across long conversations.
The cost comes from API pricing and latency. Every agent decision requires a model call. High-volume agents can rack up significant token costs. Latency matters for real-time interactions, and these models typically have response times of 1-5 seconds.
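To make tool use concrete, here is a minimal sketch with the Anthropic Python SDK; the tool definition and model name are illustrative, not a recommendation.

```python
# Sketch: function calling ("tool use") with the Anthropic Python SDK.
# The tool definition and model name are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "lookup_order",
    "description": "Look up an order by its ID and return its status.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # pick whichever model tier fits the task
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order 8812?"}],
)

# The model asks to call a tool rather than answering directly
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)   # e.g. lookup_order {'order_id': '8812'}
```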
For specific use cases, consider:
- Open-source models (Llama, Mixtral, Qwen): Self-hosted for data privacy or cost control at scale, but require GPU infrastructure
- Specialized models (Codex for coding, embedding models for retrieval): Optimized for specific tasks
- Smaller models (GPT-3.5, Claude Haiku): Faster and cheaper for simpler tasks
Most production agents use a mix. Primary reasoning with top-tier models, routine tasks with faster models, and embeddings from specialized models.
Agent frameworks
You need orchestration to manage multi-step workflows, tool calling, and memory. Agent frameworks handle the loop of observing context, deciding on actions, executing tools, and updating state.
Popular frameworks for production:
- Claude Agent SDK: Built on Claude Code, provides computer access, file operations, and tool execution with feedback loops
- LangChain: Mature ecosystem with chains, agents, and memory abstractions
- LangGraph: Built on LangChain, adds stateful workflows with cycles and control flow for complex agent logic
- AutoGen (Microsoft): Multi-agent systems with conversation-based orchestration
The advantage is speed. These frameworks handle the boilerplate of prompt construction, tool calling, error handling, and state management. They provide pre-built patterns for common agent workflows.
The problem is abstraction overhead. Frameworks can obscure what's actually happening, making debugging harder. They also introduce dependencies and API changes. Some teams find heavy frameworks too rigid for custom agent logic.
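As one example of what a framework buys you, here is a minimal sketch of a LangGraph graph that wires an act-then-verify loop into a stateful workflow; the node logic is stubbed and the state fields are illustrative.

```python
# Sketch: an act -> verify loop expressed as a LangGraph state machine.
# Node logic is stubbed; state fields and node names are illustrative.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    task: str
    draft: str
    approved: bool

def act(state: AgentState) -> dict:
    # Call your model / tools here; we just fake a draft answer.
    return {"draft": f"answer for {state['task']}"}

def verify(state: AgentState) -> dict:
    # Check the draft; here anything non-empty passes.
    return {"approved": bool(state["draft"])}

def route(state: AgentState) -> str:
    return "done" if state["approved"] else "retry"

graph = StateGraph(AgentState)
graph.add_node("act", act)
graph.add_node("verify", verify)
graph.set_entry_point("act")
graph.add_edge("act", "verify")
graph.add_conditional_edges("verify", route, {"retry": "act", "done": END})

app = graph.compile()
print(app.invoke({"task": "draft a status update", "draft": "", "approved": False}))
```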
For complex multi-agent systems, consider
Workflow orchestration
You need durable execution for multi-step processes. Agents often run tasks that span minutes or hours, call multiple external services, and must handle failures gracefully without losing progress.
Vercel Workflows or
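Whichever engine you choose, the underlying idea is checkpointing: persist each step's result so a crash or redeploy resumes from the last completed step instead of starting over. Here is a bare-bones sketch of that idea in plain Python; the step names and checkpoint file are illustrative, and a real workflow engine handles this far more robustly.

```python
# Sketch: durable execution by checkpointing each step's result to disk, so a
# restart resumes where it left off. Step names and the checkpoint path are
# illustrative; dedicated workflow engines implement this with much more rigor.
import json
from pathlib import Path

CHECKPOINT = Path("workflow_state.json")

def load_state() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def run_step(state: dict, name: str, fn) -> dict:
    if name in state:               # already completed on a previous run
        return state
    state[name] = fn(state)         # may call external services, take minutes
    CHECKPOINT.write_text(json.dumps(state))
    return state

def fetch_data(state):   return {"rows": 128}
def enrich(state):       return {"enriched": state["fetch_data"]["rows"]}
def publish(state):      return {"published": True}

state = load_state()
for name, fn in [("fetch_data", fetch_data), ("enrich", enrich), ("publish", publish)]:
    state = run_step(state, name, fn)
print(state)
```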
Code execution and sandboxing
Agents that write and run code need safe execution environments. You can't let agent-generated code access your production systems or run indefinitely. Sandboxing isolates execution and limits damage from bugs or malicious code.
Primary approaches:
- Docker containers: Isolated environments with resource limits, network restrictions, and filesystem boundaries
- E2B: Managed sandboxes built explicitly for AI code execution with language runtimes pre-configured
- Modal: Serverless Python execution with built-in sandboxing, GPU access, and container orchestration
- Firejail: Lightweight Linux sandboxing for process isolation
- Isolated VMs: Full virtualization for maximum isolation but higher overhead
Docker is the standard. You run agent code in ephemeral containers that have no access to the host system, enforce CPU and memory limits, and tear down after execution. Set timeouts to prevent infinite loops. Use read-only filesystems where possible.
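A minimal sketch of that pattern, invoking Docker from Python, looks like this; the image, limits, and paths are illustrative.

```python
# Sketch: run agent-generated Python in an ephemeral, locked-down Docker
# container. Image, resource limits, and paths are illustrative.
import subprocess
import tempfile
from pathlib import Path

def run_untrusted(code: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "script.py").write_text(code)
        return subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",          # no network access
                "--memory", "256m",           # memory cap
                "--cpus", "1",                # CPU cap
                "--pids-limit", "64",         # stop fork bombs
                "--read-only",                # read-only root filesystem
                "-v", f"{workdir}:/code:ro",  # mount the script read-only
                "python:3.12-slim", "python", "/code/script.py",
            ],
            capture_output=True, text=True, timeout=timeout_s,
        )

result = run_untrusted("print(sum(range(10)))")
print(result.stdout, result.returncode)
```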
Modal and E2B provide managed sandboxing that removes infrastructure overhead. Modal excels at compute-intensive tasks with its serverless GPU access, while E2B focuses specifically on AI agent code execution with pre-configured runtimes.
For serverless environments,
Verification layer tooling
The verification layer ensures your agent isn't making mistakes. Agents can hallucinate, generate broken code, or make poor decisions. You need tools to monitor behavior, test outputs, and catch errors before they reach users.
Observability and evaluation
You need to monitor what your agent is doing and evaluate whether it's doing it well. This means tracing each decision, logging tool calls, measuring output quality, and detecting when performance degrades.
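Even before adopting a dedicated tracing product, simple instrumentation goes a long way: log each tool call with its arguments, latency, and outcome. A minimal sketch of that kind of tracing, using only the standard library:

```python
# Sketch: log every tool call with its arguments, duration, and outcome so you
# can trace what the agent actually did. Swap the logging calls for your
# observability backend in production.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.trace")

def traced(tool):
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = tool(*args, **kwargs)
            log.info("tool=%s args=%s kwargs=%s ok=True ms=%.1f",
                     tool.__name__, args, kwargs,
                     (time.perf_counter() - start) * 1000)
            return result
        except Exception:
            log.exception("tool=%s failed after %.1f ms",
                          tool.__name__, (time.perf_counter() - start) * 1000)
            raise
    return wrapper

@traced
def search_orders(customer_id: str) -> list[str]:
    return ["order-8812"]   # stand-in for a real tool

search_orders("cust-42")
```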
Testing frameworks
You need automated testing to verify agent behavior before deployment. This includes testing that agents handle expected inputs correctly, fail gracefully on edge cases, and maintain consistent quality across prompt or model changes.
Standard testing frameworks work for agents with some adaptation:
- Pytest (Python): Write test cases that call your agent with sample inputs and assert on outputs
- Jest (JavaScript): Test agent responses with expect assertions on content and format
- Agent-specific assertions: Check not just the final output but intermediate steps, like which tools were called
The approach is similar to traditional software testing. Create a test suite with representative queries, run your agent against them, and assert that outputs meet quality criteria. The difference is that agent outputs aren't deterministic, so tests often check for semantic correctness rather than exact matches.
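For instance, a small pytest suite might look like the sketch below; `run_agent` and its result object are hypothetical placeholders for your agent's entry point.

```python
# Sketch: pytest checks that assert on semantic properties and on which tools
# were called, rather than on exact output strings. `my_agent.run_agent` is a
# hypothetical entry point returning an answer plus a trace of tool calls.
from my_agent import run_agent  # hypothetical module under test

def test_refund_question_mentions_policy():
    result = run_agent("What is your refund policy?")
    assert "30 days" in result.answer.lower()          # semantic check, not exact match

def test_order_lookup_uses_the_right_tool():
    result = run_agent("Where is order 8812?")
    assert "lookup_order" in [call.tool for call in result.tool_calls]

def test_gibberish_fails_gracefully():
    result = run_agent("asdf qwerty ???")
    assert result.answer                               # still returns something
    assert not result.error
```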
Custom eval harnesses provide more sophistication. These run large test sets, use LLM-as-judge to score outputs on fuzzy criteria like helpfulness or tone, and track performance over time. Human review loops add a layer where people evaluate agent outputs, especially for subjective quality or edge cases that automated tests miss.
Linting and code quality
Agents that generate code need validation to catch syntax errors, security issues, and style problems. Running linters on agent-generated code provides immediate feedback that the agent can use to fix mistakes. Language-specific linters catch different issues:
- ESLint (JavaScript): Detects syntax errors, undefined variables, and code style violations
- Ruff (Python): Fast Python linter that catches common bugs and enforces conventions
- Pylint/Flake8 (Python): More comprehensive checking with configurable rules
- RuboCop (Ruby): Style and correctness checking for Ruby code

Type checkers add another layer of validation. TypeScript for JavaScript, mypy for Python, and similar tools catch type errors that linters miss. Code formatters like Prettier or Black ensure consistent style. Together, these tools give agents concrete feedback about code quality, turning subjective "is this good code?" into objective "does this pass these checks?"
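For example, an agent can run Ruff on its own generated code and use the diagnostics as feedback before returning a result. A minimal sketch, assuming the `ruff` CLI is installed on PATH:

```python
# Sketch: lint agent-generated code with Ruff and hand the diagnostics back to
# the agent as feedback.
import subprocess
import tempfile
from pathlib import Path

def lint_feedback(code: str) -> str:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp, "generated.py")
        path.write_text(code)
        proc = subprocess.run(
            ["ruff", "check", str(path)],
            capture_output=True, text=True,
        )
        # Empty output with exit code 0 means the code passed the checks
        return proc.stdout.strip() or "No lint issues found."

print(lint_feedback("import os\nprint('hello')\n"))   # unused import -> F401 finding
```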
Deployment platforms
You need somewhere to run your agent code that scales with demand, handles failures gracefully, and doesn't require constant babysitting. The platform choice depends on your agent's runtime requirements and usage patterns.
Container orchestration platforms provide the most flexibility:
- Kubernetes: Industry standard for container orchestration with autoscaling, service discovery, and self-healing
- Docker Swarm: Simpler alternative for smaller deployments with basic orchestration
- AWS ECS/EKS: Managed container services integrated with the AWS ecosystem
- Google Kubernetes Engine (GKE): Managed Kubernetes with Google Cloud integration
Serverless platforms remove infrastructure management:
- AWS Lambda: Run code without managing servers, pay per execution
- Vercel: Deploy functions with automatic scaling and edge distribution
- Google Cloud Functions: Event-driven serverless execution
- Azure Functions: Serverless compute integrated with Azure services
Cloud VMs remain an option for agents that need long-running processes or specific system configurations. Services like
API gateways
You need a front door for your agent that handles authentication, rate limiting, request routing, and monitoring. API gateways sit between users and your agent, managing all incoming traffic.
Common gateway solutions:
- AWS API Gateway: Managed service with built-in auth, throttling, and CloudWatch integration
- Kong: Open-source gateway with plugins for auth, logging, and transformation
- Nginx: Lightweight reverse proxy with flexible configuration
- Traefik: Modern proxy with automatic service discovery and Let's Encrypt support

Cloudflare Workers provides an edge-based approach, running gateway logic close to users for lower latency. This works well for global agents that need fast response times regardless of user location.
Secrets management
You need secure storage for API keys, database credentials, and other sensitive data. Hardcoding secrets in code or environment variables creates security risks. Secrets management systems provide encrypted storage, access control, and audit logging.
Standard solutions:
- AWS Secrets Manager: Managed service with automatic rotation and IAM integration
- HashiCorp Vault: Open-source secrets management with dynamic credentials and encryption as a service
- Azure Key Vault: Managed vault integrated with Azure services and Active Directory
For development, environment variables work but aren't suitable for production. Tools like
The key is never committing secrets to version control and rotating them regularly. Secrets management systems enforce these practices through technical controls rather than relying on developer discipline.
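As a concrete example, here is a minimal sketch of reading a credential from AWS Secrets Manager with boto3 at runtime; the secret name is illustrative, and the sketch assumes the secret is stored as a JSON string.

```python
# Sketch: read a database credential from AWS Secrets Manager at runtime instead
# of baking it into code or plain environment variables. Secret name is illustrative.
import json
import boto3

def get_db_credentials(secret_name: str = "prod/agent/db") -> dict:
    client = boto3.client("secretsmanager")
    resp = client.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])   # e.g. {"username": ..., "password": ...}

creds = get_db_credentials()
```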
Building your agent stack
Building production agents is less about finding the perfect tool and more about understanding what your agent actually needs at each layer. Start with the basics: a database that provisions fast, an LLM that can reason through your use cases, and observability so you know what's happening. Add complexity only when you need it.
Not every agent needs workflow orchestration or dedicated vector databases. A simple agent might just need Neon for data, Claude for reasoning, and basic logging. A complex multi-agent platform needs the whole stack with queues, sandboxing, and sophisticated monitoring.
The common thread is infrastructure that matches agent behavior. Traditional tools built for steady workloads break down when agents create unpredictable spikes, need instant provisioning, or operate across multiple isolated environments. Start simple, measure what matters, and add tools as your agent's requirements become clear.
