“Want streaming? I’ll parse the SSE myself, how hard can it be?”
Hard. Unreasonably hard. I spent a weekend writing a streaming parser that handled OpenAI’s format, then Anthropic changed theirs, and suddenly my “universal” parser was universal for exactly one provider.
That was the moment I stopped writing custom API wrappers. Six months and a lot of deleted code later, here are the 7 libraries that replaced my boilerplate. Each one follows the same pattern: a problem I kept solving manually, a library that solved it in a few lines, and the honest limitation nobody mentions in the README.
1. LiteLLM: One Interface for Every LLM Provider
The problem: You want to test GPT-4 vs Claude vs Llama. So you write three separate client configurations, three response parsers, three error handlers.
from litellm import completion
# Same interface. Any provider.
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain RAG in one paragraph"}]
)
# Switch to Claude? Change one string.
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Explain RAG in one paragraph"}]
)
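It also takes care of the streaming problem from the intro. A minimal sketch, using the same import as above; LiteLLM yields OpenAI-style chunks regardless of which provider sits behind the model string:
stream = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Explain RAG in one paragraph"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content  # chunks follow the OpenAI delta format
    if delta:
        print(delta, end="", flush=True)
No hand-rolled SSE parsing, and the loop doesn't change when you swap providers.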
I ran a cost comparison last quarter: same workload across GPT-4, Claude Sonnet, and Llama 3 70B. The spread between the cheapest and most expensive option was $847/month. Without LiteLLM, testing that would have required a week of refactoring.
Replaced: 3 separate SDK integrations and a cost tracking spreadsheet.
The honest limitation: When a provider changes their API, you’re waiting on LiteLLM to update. For mission-critical production, some teams still prefer direct SDKs.
2. Instructor: Structured Outputs That Actually Work
The problem: You ask the LLM for JSON. You get JSON wrapped in markdown backticks. Or JSON with an extra field. Or a string that looks like JSON but has a trailing comma.
import instructor
from pydantic import BaseModel, Field
from openai import OpenAI
client = instructor.from_openai(OpenAI())
class ToolRecommendation(BaseModel):
    name: str
    category: str = Field(description="e.g. 'orchestration', 'observability'")
    maturity: str = Field(description="'production-ready' or 'experimental'")
    best_for: list[str]
    avoid_when: list[str]

tools = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Recommend 3 Python tools for building RAG pipelines"
    }],
    response_model=list[ToolRecommendation]
)
# Validated list of structured objects. No parsing. No edge cases.
Before Instructor, I had a parse_llm_json() function with twelve edge cases. It still failed on production data about once a week. The nested extraction above (validated list of complex objects) is where Instructor really pulls ahead. Try doing that with regex and json.loads().
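When the model's output fails validation, Instructor re-asks with the error message attached. You can cap how many times it tries; a small sketch, adding max_retries to the same call as above:
tools = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Recommend 3 Python tools for building RAG pipelines"}],
    response_model=list[ToolRecommendation],
    max_retries=2  # stop re-asking after two failed validation attempts
)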
Replaced: 150-line JSON parsing function with manual validation.
The honest limitation: Adds latency. Instructor sometimes needs a retry cycle when the model’s first attempt doesn’t validate. For high-throughput pipelines, measure the overhead.
3. MarkItDown: Turn Any Document Into LLM-Ready Text
The problem: Your RAG pipeline needs to ingest PDFs, Word docs, PowerPoints, and Excel files. So you install pdfminer for PDFs, python-docx for Word, python-pptx for slides, and pandas for spreadsheets. Four libraries. Four parsing pipelines. Four sets of edge cases.
from markitdown import MarkItDown
md = MarkItDown()
# PDF? Done.
result = md.convert("quarterly_report.pdf")
# PowerPoint? Also done.
result = md.convert("client_deck.pptx")
# Excel? You get the idea.
result = md.convert("budget_2025.xlsx")
print(result.text_content) # Clean Markdown, every time
Microsoft built this (86K GitHub stars). It preserves headings, tables, and lists as Markdown. When I was building a document Q&A system for internal knowledge bases, the ingestion pipeline was 40% of the codebase. MarkItDown would have cut that to a few lines.
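For that kind of pipeline, the ingestion step collapses into a loop. A rough sketch; the knowledge_base folder and the output dict are placeholders for whatever your chunking and indexing code expects:
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
documents = []
for path in Path("knowledge_base").glob("*"):
    if path.suffix.lower() in {".pdf", ".docx", ".pptx", ".xlsx"}:
        result = md.convert(str(path))
        documents.append({"source": path.name, "text": result.text_content})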
Replaced: Custom extraction pipeline with 4 separate parsing libraries.
The honest limitation: PDF extraction is text-layer only. Scanned images without OCR return nothing. Table extraction from PDFs is also inconsistent, so verify anything financial or heavily tabular.
4. FastMCP: Build MCP Servers in Minutes, Not Days
The problem: Model Context Protocol is becoming the standard for giving LLMs access to external tools and data. But building a compliant MCP server from scratch means hundreds of lines of transport handling, schema definition, and protocol boilerplate. I wrote about why MCP’s security model needs scrutiny, but even setting security aside, the implementation burden is real.
from fastmcp import FastMCP
mcp = FastMCP("My Data Tools")
@mcp.tool()
def search_database(query: str, limit: int = 10) -> list[dict]:
    """Search the product database."""
    return db.search(query, limit=limit)

@mcp.tool()
def get_user_profile(user_id: str) -> dict:
    """Fetch user profile by ID."""
    return db.get_user(user_id)

mcp.run()
That’s a complete, working MCP server. Two tools, ready for Claude Desktop or any MCP client. The decorator pattern will feel familiar if you’ve used FastAPI (not a coincidence). I prototyped a tool server for an internal search index in about 20 minutes. The raw MCP version I’d started was at 200 lines and still didn’t handle transport.
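Tools aren't the only primitive, either. FastMCP can also expose read-only data as resources; a hedged sketch reusing the mcp object above, where the URI and the db call are placeholders for your own data:
@mcp.resource("schema://products")
def product_schema() -> str:
    """Return the product table schema so the client can write better queries."""
    return db.get_schema("products")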
Replaced: 200+ lines of protocol boilerplate.
The honest limitation: FastMCP is still young (22K stars and growing). If your stack doesn’t speak MCP, this doesn’t help. And you still need to think carefully about what you’re exposing.
5. PydanticAI: Type-Safe Agents Without the Framework Tax
The problem: You want structured, type-safe interactions with LLMs. But full agent frameworks bring dependency bloat, opaque abstractions, and that sinking feeling when you need to debug something three layers deep in framework internals. (If you’ve felt that pain, you’re not alone.)
from pydantic_ai import Agent
from pydantic import BaseModel
class CodeReview(BaseModel):
    issues: list[str]
    severity: str
    suggested_fix: str
    confidence: float

reviewer = Agent(
    "openai:gpt-4",
    result_type=CodeReview,
    system_prompt="You review Python code for bugs and anti-patterns."
)

result = reviewer.run_sync("def connect(url): return requests.get(url, verify=False)")
print(result.data.issues)
# ['SSL verification disabled', 'No timeout specified', 'No error handling']
PydanticAI sits in the sweet spot between raw API calls and heavyweight frameworks. Typed inputs, typed outputs, tool definitions, dependency injection. Built by the Pydantic team (the same people whose validation library powers the OpenAI SDK, LangChain, and basically everything else). I’ve started using it for any structured task that’s too complex for a single Instructor call but doesn’t need a full LangGraph workflow.
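Tool definitions follow the same decorator style. A hedged sketch of a tool the reviewer agent could call; fetch_style_rule is a hypothetical lookup, not part of PydanticAI:
@reviewer.tool_plain
def get_style_rule(rule_id: str) -> str:
    """Look up an internal style rule the agent can cite in a review."""
    return fetch_style_rule(rule_id)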
Replaced: Untyped prompt chains and manual response validation.
The honest limitation: The “agent” in PydanticAI is more like “structured LLM interaction” than “autonomous multi-step reasoning.” For complex multi-agent workflows with persistent state, LangGraph or CrewAI still own that space.
6. tiktoken: Know Your Costs Before You Hit Send
The problem: You’re building prompts by concatenating context, and you have no idea if you’re about to blow past the context window or send a $2 API call when a $0.02 call would have worked.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
prompt = "Your very long prompt with lots of context..."
tokens = enc.encode(prompt)
print(f"Token count: {len(tokens)}")
# Estimate cost before calling
input_cost = len(tokens) * 0.00003 # GPT-4 input pricing
print(f"Estimated input cost: ${input_cost:.4f}")
I started using tiktoken after a recursive context-building function created a 45,000-token prompt. The API call cost $1.35. For a single request. Queried 200 times per hour. Now I add a token check before every LLM call that includes dynamic context.
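The check itself is a few lines. A sketch of the guard, with tiktoken imported as above; MAX_PROMPT_TOKENS is an arbitrary budget you'd tune to your model and your wallet:
MAX_PROMPT_TOKENS = 8000  # placeholder budget, not an official limit

def check_token_budget(prompt: str, model: str = "gpt-4") -> int:
    enc = tiktoken.encoding_for_model(model)
    count = len(enc.encode(prompt))
    if count > MAX_PROMPT_TOKENS:
        raise ValueError(f"Prompt is {count} tokens; budget is {MAX_PROMPT_TOKENS}")
    return count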
Replaced: Blind API calls and surprise invoices.
The honest limitation: tiktoken is OpenAI-specific. For Anthropic, you need their token counting API. LiteLLM’s token_counter() wraps multiple tokenizers but accuracy varies. And hardcoding rates (like I did above) is a shortcut that will eventually lie to you.
7. Pydantic Evals: Stop Vibes-Testing Your Prompts
The problem: You changed your system prompt. Did it get better? You run three examples manually, they look fine, you ship it. Two days later, users complain about a regression you didn’t test for.
from pydantic_evals import Case, Dataset
dataset = Dataset(
    cases=[
        Case(
            name="basic_extraction",
            inputs="Extract the company name: Apple announced new MacBooks",
            expected_output="Apple",
        ),
        Case(
            name="ambiguous_input",
            inputs="The orange company released a new phone",
            expected_output="Orange",
        ),
    ]
)

report = dataset.evaluate_sync(your_extraction_function)
report.print()
I resisted eval frameworks for months. “My prompts work fine, I test them manually.” Then I changed a system prompt to improve one edge case, and it silently broke three others I wasn’t checking. Pydantic Evals is the simplest eval framework I’ve found. Define cases, run them, get a pass/fail report. Simple enough that you’ll actually use it (which is the entire point).
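The your_extraction_function above is whatever you're testing: any callable that takes a case's inputs and returns an output to compare. A hypothetical stand-in, reusing LiteLLM from earlier:
from litellm import completion

def your_extraction_function(text: str) -> str:
    # Placeholder task under test; any callable with this shape works
    response = completion(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract the company name. Reply with the name only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()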
Replaced: “Looks good to me” testing and post-deploy surprises.
The honest limitation: It’s new and the API is still evolving. For production evaluation pipelines at scale, Braintrust or Langfuse offer deeper integrations. But for “do my prompts still work after this change?” it’s exactly right.
What I Learned the Hard Way
A year ago I would have told you the best stack is the one you build yourself. Full control. No dependencies.
I was wrong. I was spending more time maintaining my API wrappers than building the actual product. These 7 libraries share one philosophy: do one thing, do it well, get out of the way. None of them require buying into a framework. All of them delete code you shouldn’t have been writing.
The best AI engineering stack isn’t the most sophisticated. It’s the one that ships. That’s the whole lesson.
Follow me for more no-BS breakdowns of tools that actually work.
Building agents? Start with the complete LangGraph guide. 115K readers found it useful.
