What tools do you need to build an AI agent? It goes well beyond just the LLM. While the AI takes center stage, the tooling to create robust agents is more about infrastructure than intelligence. You need databases that can keep up with agent workloads, orchestration to handle multi-step processes, ways to verify that agents aren't making mistakes, and systems to monitor everything in production. This stack can contain dozens of tools, each with its own trade-offs and integration requirements. Here, we go through each of them in the context of how they work within your agent.

The AI Agent framework

We could just list out all the tools you might need to build an AI agent, but that wouldn't be as helpful if you want to develop agents (or agent platforms) that meet a specific use case or production standard. Instead, we need to understand how these tools fit together into a coherent architecture.

AI agents operate in a continuous feedback loop with three distinct phases: gather context, take action, and verify work. This pattern repeats until the agent completes its task or determines it needs human intervention.

Context layer

The context phase is where your agent gathers information. It needs to retrieve relevant data, load it efficiently into the model's context window, and decide what's important enough to keep. This shouldn't be just passive data loading, but an active search process where the agent determines what information matters for the task at hand.

Action layer

The action phase is where your agent executes. It takes the context it's gathered and does something with it: calling APIs, writing code, transforming data, or triggering workflows. The key is giving your agent the right capabilities to solve problems flexibly rather than just following rigid scripts.

Verification layer

The verification phase closes the loop. Your agent checks its own work, catches errors, and decides whether to iterate or move forward. Agents that can self-correct are fundamentally more reliable because they catch mistakes before they compound.

This is all wrapped within an infrastructure that handles orchestration, monitoring, and scaling, ensuring your agent runs reliably in production. Now, let's break down what you need at each layer.

Context layer tooling

The tools for the context layer are all about storing and retrieving information. The right tools depend on what kind of data you're working with and how your agent accesses it.

Databases for transactional data

Agents need OLTP databases that can provision instantly when spinning up new projects or user sessions, handle unpredictable bursty workloads that idle most of the time and then spike suddenly, provide isolated environments for testing queries or schema changes, and support multi-agent architectures where each agent or domain gets its own database. Full-stack agent platforms like Replit or Lovable run their backends on Neon or Supabase. For simpler platforms, SQLite might work well for embedded, single-user agents.
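As a minimal sketch of that last, embedded option, here is one way to give each agent session its own throwaway SQLite database for conversation state, using only the Python standard library. The table layout and helper name are illustrative, not from any particular framework; you could swap the same interface onto a serverless Postgres instance when you need concurrent or server-side access.

```python
import sqlite3
from pathlib import Path

def open_session_db(session_id: str, base_dir: str = "./agent_sessions") -> sqlite3.Connection:
    """Create (or reopen) an isolated SQLite database for one agent session."""
    Path(base_dir).mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(Path(base_dir) / f"{session_id}.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               role TEXT NOT NULL,          -- 'user', 'assistant', or 'tool'
               content TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

# Each session gets its own file, so experiments and schema changes stay isolated.
db = open_session_db("session-42")
db.execute("INSERT INTO messages (role, content) VALUES (?, ?)",
           ("user", "Summarize yesterday's logs"))
db.commit()
print(db.execute("SELECT role, content FROM messages").fetchall())
```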
Vector databases

If your agent requires specialized vector workloads at massive scale with optimized indexing, you can lean on dedicated databases like:

Pinecone: Managed vector database with high performance at scale
Weaviate: Open-source vector search engine with GraphQL API and hybrid search
Qdrant: Rust-based vector database with advanced filtering capabilities
Chroma: Lightweight embedding database optimized for development and prototyping

These make sense when you're working with billions of vectors or need advanced filtering. For most agents, keeping everything in a single database system will reduce complexity and latency. A popular option is to skip the specialized vector database altogether and instead handle vector search alongside transactional data through the pgvector Postgres extension. Vendors like Neon support it as part of their pre-installed catalog.

Blob storage

If your agent needs access to large files, documents, logs, and media - e.g. reading PDFs for context, processing images, analyzing logs, or storing generated reports and visualizations - you might need to integrate with object storage:

AWS S3: Industry-standard with broad integration support
Google Cloud Storage: Multi-region with strong consistency
Azure Blob Storage: Enterprise storage with lifecycle management
Cloudflare R2: S3-compatible with zero egress fees

Your agent accesses these through APIs, retrieving files on demand and storing outputs. The challenge is managing permissions and costs, especially egress charges.

MCP servers

You'll also need a standardized way to connect your agent to external data without writing custom integrations for every service. The Model Context Protocol does this through MCP servers that expose tools and resources in a consistent format. MCP servers connect to various sources, e.g.:

Filesystem MCP: Read and search local files
Google Drive MCP: Search and retrieve cloud documents
GitHub MCP: Access issues, PRs, and code
Neon MCP: Query Neon databases through a standardized interface

The advantage is standardization. Once your agent knows MCP, it works with any MCP server without needing to learn service-specific APIs. Authentication and calls happen automatically. For services without MCP servers, you fall back to direct REST APIs, webhooks, or custom connectors. More work, but complete control.
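To make that concrete, here is a rough sketch of the agent side of an MCP connection using the official Python SDK's stdio transport. The reference filesystem server package and the `read_file` tool name are assumptions; verify the exact tool names and SDK signatures against the MCP documentation for your version.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed: the reference filesystem MCP server, launched over stdio via npx.
server = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "/data/reports"],
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover what the server exposes; the agent can hand this list to the LLM.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Call a tool by name; 'read_file' is an assumed tool name for this server.
            result = await session.call_tool("read_file", arguments={"path": "summary.txt"})
            print(result.content)

asyncio.run(main())
```

The point of the sketch is the shape of the protocol: list tools, pass them to the model, call whichever one it picks. The same three calls work against any MCP server.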
Action layer tooling

The action layer is where your agent executes tasks. Once it has context, it needs to reason about what to do and actually do it. This requires models for intelligence, frameworks for orchestration, and infrastructure for safe execution.

LLM providers

You need a language model as the reasoning engine for your agent. The model interprets context, decides what actions to take, generates responses, and calls tools. Your choice of model determines your agent's capabilities, cost, and latency. Top-tier models for production agents:

Claude (Anthropic): Strong reasoning, long context windows, excellent at following complex instructions
GPT-5 (OpenAI): Powerful general-purpose model with broad capabilities
Gemini (Google): Multimodal model with strong performance on complex reasoning and long context

The advantage is capability. These models handle complex reasoning, understand nuanced instructions, and generate high-quality outputs. They support function calling for tool use and maintain coherence across long conversations.

The cost comes from API pricing and latency. Every agent decision requires a model call. High-volume agents can rack up significant token costs. Latency matters for real-time interactions, and these models typically have response times of 1-5 seconds.

For specific use cases, consider:

Open-source models (Llama, Mixtral, Qwen): Self-hosted for data privacy or cost control at scale, but require GPU infrastructure
Specialized models (Codex for coding, embedding models for retrieval): Optimized for specific tasks
Smaller models (GPT-3.5, Claude Haiku): Faster and cheaper for simpler tasks

Most production agents use a mix: primary reasoning with top-tier models, routine tasks with faster models, and embeddings from specialized models.
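To show what function calling looks like in practice, here is a hedged sketch of a single tool-use round trip with Anthropic's Messages API. The tool definition, the stubbed `execute_sql` helper, and the model name are placeholders for illustration, not part of any product mentioned above.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder tool the model can call; wire execute_sql() to a real database.
tools = [{
    "name": "run_sql",
    "description": "Run a read-only SQL query against the agent's database.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def execute_sql(query: str) -> str:
    return "orders_yesterday: 42"  # stub result for the sketch

messages = [{"role": "user", "content": "How many orders came in yesterday?"}]
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; pick the model tier that fits the task
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

# If the model chose to call the tool, run it and return the result for a final answer.
tool_calls = [b for b in response.content if b.type == "tool_use"]
if tool_calls:
    call = tool_calls[0]
    messages += [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": call.id,
            "content": execute_sql(call.input["query"]),
        }]},
    ]
    final = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024,
                                   tools=tools, messages=messages)
    print(final.content[0].text)
```

Every loop of this shape costs at least two model calls, which is where the token and latency budget goes.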
Agent frameworks

You need orchestration to manage multi-step workflows, tool calling, and memory. Agent frameworks handle the loop of observing context, deciding on actions, executing tools, and updating state. Popular frameworks for production:

Claude Agent SDK: Built on Claude Code, provides computer access, file operations, and tool execution with feedback loops
LangChain: Mature ecosystem with chains, agents, and memory abstractions
LangGraph: Built on LangChain, adds stateful workflows with cycles and control flow for complex agent logic
AutoGen (Microsoft): Multi-agent systems with conversation-based orchestration

The advantage is speed. These frameworks handle the boilerplate of prompt construction, tool calling, error handling, and state management. They provide pre-built patterns for common agent workflows.

The problem is abstraction overhead. Frameworks can obscure what's actually happening, making debugging harder. They also add dependencies and expose you to API churn. Some teams find heavy frameworks too rigid for custom agent logic.

For complex multi-agent systems, consider CrewAI for role-based agent teams or Semantic Kernel for .NET environments. For full control, build custom orchestration using direct model APIs with your own state management.

Workflow orchestration

You need durable execution for multi-step processes. Agents often run tasks that span minutes or hours, call multiple external services, and must handle failures gracefully without losing progress.

Vercel Workflows and Inngest provide event-driven workflow orchestration built for agents. For similar capabilities, Temporal offers workflow-as-code with strong consistency guarantees. AWS Step Functions provides managed state machines integrated with AWS services, while Prefect focuses on data pipeline orchestration with scheduling and monitoring.

Code execution and sandboxing

Agents that write and run code need safe execution environments. You can't let agent-generated code access your production systems or run indefinitely. Sandboxing isolates execution and limits damage from bugs or malicious code. Primary approaches:

Docker containers: Isolated environments with resource limits, network restrictions, and filesystem boundaries
E2B: Managed sandboxes built explicitly for AI code execution with language runtimes pre-configured
Modal: Serverless Python execution with built-in sandboxing, GPU access, and container orchestration
Firejail: Lightweight Linux sandboxing for process isolation
Isolated VMs: Full virtualization for maximum isolation but higher overhead

Docker is the standard. You run agent code in ephemeral containers that have no access to the host system, enforce CPU and memory limits, and tear down after execution. Set timeouts to prevent infinite loops. Use read-only filesystems where possible (a sketch of this pattern follows below).

Modal and E2B provide managed sandboxing that removes infrastructure overhead. Modal excels at compute-intensive tasks with its serverless GPU access, while E2B focuses specifically on AI agent code execution with pre-configured runtimes.

For serverless environments, AWS Lambda and similar platforms provide built-in sandboxing with execution time limits. These work well for short-lived agent tasks but have constraints on runtime and available resources.
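Here is a rough sketch of the ephemeral-container pattern, driving the Docker CLI from Python: no network, capped CPU and memory, a read-only root filesystem, and a hard timeout. The base image and mount path are placeholders; adjust them to whatever runtime your agent targets.

```python
import subprocess
import tempfile
from pathlib import Path

def run_untrusted(code: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Execute agent-generated Python in a throwaway, locked-down container."""
    workdir = Path(tempfile.mkdtemp())
    (workdir / "snippet.py").write_text(code)
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",          # no outbound access
            "--memory", "256m",           # memory cap
            "--cpus", "0.5",              # CPU cap
            "--read-only",                # immutable root filesystem
            "-v", f"{workdir}:/work:ro",  # mount the snippet read-only
            "python:3.12-slim",           # placeholder base image
            "python", "/work/snippet.py",
        ],
        capture_output=True,
        text=True,
        timeout=timeout_s,                # hard stop for runaway code
    )

result = run_untrusted("print(sum(range(10)))")
print(result.stdout, result.returncode)
```

Managed sandboxes like Modal or E2B give you the same guarantees without owning the container plumbing.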
Verification layer tooling

The verification layer ensures your agent isn't making mistakes. Agents can hallucinate, generate broken code, or make poor decisions. You need tools to monitor behavior, test outputs, and catch errors before they reach users.

Observability and evaluation

You need to monitor what your agent is doing and evaluate whether it's doing it well. This means tracing each decision, logging tool calls, measuring output quality, and detecting when performance degrades.

Braintrust provides AI observability and evaluation built for agents. For similar capabilities, Langfuse offers open-source LLM tracing with prompt management. Arize Phoenix provides observability focused on detecting model drift and data quality issues. Custom logging to the ELK stack or Datadog works for teams that need to build their own evaluation logic.

Testing frameworks

You need automated testing to verify agent behavior before deployment. This includes testing that agents handle expected inputs correctly, fail gracefully on edge cases, and maintain consistent quality across prompt or model changes. Standard testing frameworks work for agents with some adaptation:

Pytest (Python): Write test cases that call your agent with sample inputs and assert on outputs
Jest (JavaScript): Test agent responses with expect assertions on content and format
Agent-specific assertions: Check not just the final output, but intermediate steps, like which tools were called

The approach is similar to traditional software testing. Create a test suite with representative queries, run your agent against them, and assert that outputs meet quality criteria. The difference is that agent outputs aren't deterministic, so tests often check for semantic correctness rather than exact matches.

Custom eval harnesses provide more sophistication. These run large test sets, use LLM-as-judge to score outputs on fuzzy criteria like helpfulness or tone, and track performance over time. Human review loops add a layer where people evaluate agent outputs, especially for subjective quality or edge cases that automated tests miss.

Linting and code quality

Agents that generate code need validation to catch syntax errors, security issues, and style problems. Running linters on agent-generated code provides immediate feedback that the agent can use to fix mistakes (see the sketch after this section). Language-specific linters catch different issues:

ESLint (JavaScript): Detects syntax errors, undefined variables, and code style violations
Ruff (Python): Fast Python linter that catches common bugs and enforces conventions
Pylint/Flake8 (Python): More comprehensive checking with configurable rules
RuboCop (Ruby): Style and correctness checking for Ruby code

Type checkers add another layer of validation. TypeScript for JavaScript, mypy for Python, and similar tools catch type errors that linters miss. Code formatters like Prettier or Black ensure consistent style. Together, these tools give agents concrete feedback about code quality, turning subjective "is this good code?" into objective "does this pass these checks?"
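As an example of that feedback loop, here is a small sketch that runs Ruff over a file the agent just wrote and hands any diagnostics back as text the model can act on. It assumes `ruff` is installed and on PATH; the retry loop around it is left out.

```python
import subprocess
from pathlib import Path

def lint_feedback(path: str) -> str | None:
    """Run Ruff on agent-generated code; return diagnostics, or None if clean."""
    check = subprocess.run(
        ["ruff", "check", path],  # non-zero exit code means violations were found
        capture_output=True,
        text=True,
    )
    if check.returncode == 0:
        return None
    return check.stdout  # plain-text diagnostics the agent can read and fix

generated = Path("agent_output.py")
generated.write_text("import os\n\nprint('hello')\n")  # unused import: Ruff flags F401

issues = lint_feedback(str(generated))
if issues:
    # Feed this back into the model's next turn, e.g. "Fix these lint errors: ..."
    print(issues)
```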
Deployment platforms

You need somewhere to run your agent code that scales with demand, handles failures gracefully, and doesn't require constant babysitting. The platform choice depends on your agent's runtime requirements and usage patterns.

Container orchestration platforms provide the most flexibility:

Kubernetes: Industry standard for container orchestration with autoscaling, service discovery, and self-healing
Docker Swarm: Simpler alternative for smaller deployments with basic orchestration
AWS ECS/EKS: Managed container services integrated with the AWS ecosystem
Google Kubernetes Engine (GKE): Managed Kubernetes with Google Cloud integration

Serverless platforms remove infrastructure management:

AWS Lambda: Run code without managing servers, pay per execution
Vercel: Deploy functions with automatic scaling and edge distribution
Google Cloud Functions: Event-driven serverless execution
Azure Functions: Serverless compute integrated with Azure services

Cloud VMs remain an option for agents that need long-running processes or specific system configurations. Services like AWS EC2, Google Compute Engine, or Azure VMs give you full machine access but require more operational work for scaling and availability.
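For the serverless route, a minimal sketch of what the entry point might look like on AWS Lambda: a handler that receives a task, runs one turn of the agent, and returns the result. `run_agent_turn` is a hypothetical stand-in for whatever orchestration you use.

```python
import json

def run_agent_turn(task: str) -> str:
    """Hypothetical stand-in for your agent loop (model call + tools + verification)."""
    return f"Handled task: {task}"

def handler(event, context):
    """AWS Lambda entry point; API Gateway proxy integrations pass the request body in 'body'."""
    body = json.loads(event.get("body") or "{}")
    result = run_agent_turn(body.get("task", ""))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"result": result}),
    }
```

Keep the execution-time limits mentioned above in mind: long multi-step runs belong on a workflow engine or container platform instead.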
API gateways

You need a front door for your agent that handles authentication, rate limiting, request routing, and monitoring. API gateways sit between users and your agent, managing all incoming traffic. Common gateway solutions:

AWS API Gateway: Managed service with built-in auth, throttling, and CloudWatch integration
Kong: Open-source gateway with plugins for auth, logging, and transformation
Nginx: Lightweight reverse proxy with flexible configuration
Traefik: Modern proxy with automatic service discovery and Let's Encrypt support

Cloudflare Workers provides an edge-based approach, running gateway logic close to users for lower latency. This works well for global agents that need fast response times regardless of user location.

Secrets management

You need secure storage for API keys, database credentials, and other sensitive data. Hardcoding secrets in code or environment variables creates security risks. Secrets management systems provide encrypted storage, access control, and audit logging. Standard solutions:

AWS Secrets Manager: Managed service with automatic rotation and IAM integration
HashiCorp Vault: Open-source secrets management with dynamic credentials and encryption as a service
Azure Key Vault: Managed vault integrated with Azure services and Active Directory

For development, environment variables work but aren't suitable for production. Tools like dotenv manage local secrets but lack the encryption and audit capabilities needed for production systems.

The key is never committing secrets to version control and rotating them regularly. Secrets management systems enforce these practices through technical controls rather than relying on developer discipline.
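To ground that, here is a hedged sketch of an agent process pulling its LLM API key from AWS Secrets Manager at startup via boto3 instead of a hardcoded value. The secret name and the key inside it are placeholders, and the call assumes the runtime's IAM role already has permission to read that secret.

```python
import json
import boto3

def load_agent_secrets(secret_id: str = "prod/agent/api-keys") -> dict:
    """Fetch a JSON secret at startup; never bake keys into the image or repo."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)  # access governed by IAM
    return json.loads(response["SecretString"])

secrets = load_agent_secrets()
anthropic_key = secrets["ANTHROPIC_API_KEY"]  # placeholder key name inside the secret
```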
Building your agent stack

Building production agents is less about finding the perfect tool and more about understanding what your agent actually needs at each layer.

Start with the basics: a database that provisions fast, an LLM that can reason through your use cases, and observability so you know what's happening. Add complexity only when you need it. Not every agent needs workflow orchestration or dedicated vector databases. A simple agent might just need Neon for data, Claude for reasoning, and basic logging. A complex multi-agent platform needs the whole stack with queues, sandboxing, and sophisticated monitoring.

The common thread is infrastructure that matches agent behavior. Traditional tools built for steady workloads break down when agents create unpredictable spikes, need instant provisioning, or operate across multiple isolated environments. Start simple, measure what matters, and add tools as your agent's requirements become clear.