I spend most of my time breaking into things for a living. For the last year or so, a growing chunk of that work has been pointed at LLMs.
Not the models themselves, exactly. The deployments. The API gateways with a language model behind them. The customer-facing chatbots. The internal tools that got "an AI feature" bolted on in Q3 because someone's VP saw a demo and said "we need this." The RAG pipelines connected to document stores full of sensitive data.
These things are everywhere now. And almost none of them have been adversarially tested.
I don't mean "does the model refuse if you ask it something bad." That's safety training. Safety training is important. But safety training and security testing are fundamentally different disciplines, and the industry is conflating them in ways that are going to cause real problems.
Safety training teaches a model to refuse. Security testing asks whether that refusal actually holds up when someone is actively trying to break it. The answer, overwhelmingly, is no.
The numbers are bad
The OWASP Top 10 for LLM Applications ranks prompt injection (LLM01) as the number one security risk. That ranking is earned.
FlipAttack, a technique that simply reorders characters in prompts, achieves a 98% bypass rate against GPT-4o. DeepSeek R1 showed a 100% bypass rate against 50 HarmBench jailbreak prompts in testing by Cisco and the University of Pennsylvania. A study of 36 production LLM-integrated applications found that 86% were vulnerable to prompt injection. PoisonedRAG demonstrated that just five malicious documents in a corpus of millions can manipulate AI outputs 90% of the time.
These aren't theoretical attacks against research models. These are attacks against production systems that real organizations are running right now.
So I built a scanner
Augustus is an open-source LLM vulnerability scanner. You point it at a model endpoint and it throws 210+ adversarial probes at it across 47 attack categories. It tells you what's vulnerable and what's not.
go install github.com/praetorian-inc/augustus/cmd/augustus@latest
augustus scan openai.OpenAI \
--all \
--verbose
It ships as a single Go binary. No Python. No npm. No runtime dependencies. One install command and you're scanning.
I built it in Go because I needed something that fits into penetration testing workflows without requiring me to set up a Python environment on every engagement. go install, run, done. The concurrency model also matters: goroutine pools running probes in parallel across the target, not bottlenecked by Python's GIL.
It's inspired by garak, NVIDIA's Python-based LLM vulnerability scanner. garak is excellent and has a longer research pedigree with a published paper. Augustus is the same concept reimplemented for a different set of trade-offs: portability, speed, and zero-dependency distribution. Different tools for different workflows.
What it actually tests
Here's where it gets interesting. When most people think about LLM attacks, they think about jailbreaks. "Pretend you're DAN." "My grandmother used to tell me how to..." Those matter, and Augustus tests all of them (DAN variants through v11.0, AIM, AntiGPT, Grandma exploits, ArtPrompts). But jailbreaks are just the surface layer.
Encoding bypasses are where things start to get ugly. Augustus tests across Base64, ROT13, Morse code, hex, Braille, Klingon, leet speak, and about 12 other encoding schemes. The question each probe asks: if you wrap a harmful instruction in Base64, will the model decode it and follow it even though the plain-text version would be blocked?
In a lot of cases, yes. The gap between what input filters see (encoded text that looks benign) and what the model understands (the decoded malicious intent) is consistently exploitable. This is one of the most reliable attack vectors I've seen in production.
FlipAttack (16 variants) reverses or reorders characters to evade input filters. The research showed 98% bypass on GPT-4o. Augustus implements all the published variants.
Tag smuggling embeds instructions inside XML or HTML tags. Models that are trained to process structured input will sometimes follow instructions embedded in tags that look like formatting rather than commands.
Data extraction is where things get operationally dangerous. Augustus probes whether the model can be tricked into leaking API keys or credentials from its context window. It tests for PII extraction. It checks for training data regurgitation.
The package hallucination probes are one of my favorites. These cover Python, JavaScript, Ruby, Rust, Dart, Perl, and Raku. They ask the model to recommend packages for various tasks and then check whether any of the recommended packages don't actually exist. This matters because it's a real supply chain attack vector: adversaries monitor for hallucinated package names, register them, and wait for developers to pip install or npm install the fake package. The model becomes an unwitting accomplice in a supply chain attack.
RAG poisoning probes test whether an attacker can inject malicious content into the retrieval pipeline, both through document content and metadata injection. If your RAG system pulls from a corpus that an attacker can influence (and most can be influenced more easily than you'd think), the model's outputs can be manipulated.
Agent attacks are the newest category and arguably the most concerning. As LLMs gain tool access (browsing, code execution, database queries, API calls), the attack surface expands dramatically. Augustus tests multi-agent manipulation (can one agent influence another's behavior?), browsing exploits (can adversarial web content hijack a model with web access?), and latent injection (can instructions embedded in documents that a RAG-enabled agent processes cause it to take unintended actions?).
Format exploits target structured output. If a model generates markdown, can an attacker inject malicious links that render as legitimate? If it produces HTML, are XSS payloads possible? If downstream systems parse YAML or JSON from model output, can that parsing be exploited? These are real risks when LLM output gets rendered in browsers or consumed by other systems.
Evasion techniques test the model's ability to recognize adversarial intent regardless of how it's presented. ObscurePrompt uses an LLM to rewrite known jailbreaks into harder-to-detect forms. Character substitution probes use homoglyphs (characters that look identical but have different Unicode codepoints), zero-width characters, and bidirectional text markers. These are inputs that look completely benign to text-based filters but are interpreted differently by the model.
Safety benchmarks round it out. DoNotAnswer (941 questions across 5 risk areas), RealToxicityPrompts, Snowball (plausible-sounding but factually wrong outputs), and LMRC harmful content probes.
In total: 210+ probes across 47 attack categories.
The buff system is where it gets real
Here's the thing about adversarial testing: real attackers don't send attacks in plain text. They encode, translate, rephrase, and obfuscate. A DAN prompt that gets caught by every filter in the world might sail right through when it's been paraphrased, translated into Zulu, and reformatted as a haiku.
Augustus has a buff system that applies transformations to any probe before it's sent. Seven transformations across five categories:
Encoding buffs wrap prompts in Base64 or character codes. Testing the gap between what filters see and what models understand.
Paraphrase buffs use a Pegasus model to rephrase prompts while preserving adversarial intent. Same meaning, different surface form. This tests whether safety training generalizes beyond the specific patterns it was trained on, or whether it's essentially pattern matching on known bad inputs.
Poetry buffs reformat prompts as haiku, sonnets, limericks, free verse, or rhyming couplets. I know this sounds absurd. But models that robustly block a direct harmful request will sometimes comply when the same request arrives as verse. I've seen it happen repeatedly. Something about the stylistic framing seems to shift how the model processes the intent.
Low-resource language translation exploits the fact that safety training is overwhelmingly concentrated on English. A request that's blocked in English may succeed in Zulu, Hmong, or Scots Gaelic. Augustus translates probes via DeepL to test this.
Case transforms simply lowercase everything. Some input filters and keyword blocklists are case-sensitive. It's dumb. It works.
You can chain these. Encode a probe in Base64, then paraphrase it, then translate it to a low-resource language. Layered evasion that tests whether defenses hold up against inputs that don't match any expected pattern.
augustus scan openai.OpenAI \
--probe dan.Dan \
--buff encoding.Base64
augustus scan ollama.OllamaChat \
--probe dan.Dan \
--buffs-glob "paraphrase.*,lrl.*" \
--config '{"model":"llama3.2:3b"}'
28 providers, one interface
Augustus connects to OpenAI (including o1/o3 reasoning models), Anthropic (Claude 3/3.5/4), Azure OpenAI, AWS Bedrock, Google Vertex AI, Cohere, Replicate, HuggingFace, Together AI, Groq, Mistral, Fireworks, DeepInfra, NVIDIA NIM, Ollama, LiteLLM, and more.
For anything else, there's a REST connector:
augustus scan rest.Rest \
--probe dan.Dan \
--config '{
"uri": "https://your-api.example.com/v1/chat/completions",
"headers": {"Authorization": "Bearer YOUR_KEY"},
"req_template_json_object": {
"model": "your-model",
"messages": [{"role": "user", "content": "$INPUT"}]
},
"response_json": true,
"response_json_field": "$.choices[0].message.content"
}'
Custom request templates with $INPUT placeholders, JSONPath response extraction, SSE streaming, and proxy routing. If your endpoint speaks HTTP, Augustus can test it.
Detection isn't just pattern matching
On the detection side, Augustus has 90+ detectors. Pattern matching catches known jailbreak indicators. LLM-as-a-judge uses a second model to evaluate whether the response is harmful. HarmJudge (based on arXiv:2511.15304) provides semantic harm assessment aligned with the MLCommons AILuminate taxonomy. The Perspective API measures toxicity.
For iterative attacks like PAIR and TAP, a dedicated attack engine handles multi-turn conversations, candidate pruning, and judge-based scoring. These aren't single-shot tests. They're adaptive attacks that refine their approach across multiple attempts, mimicking how a real attacker operates. They're computationally expensive (many LLM calls per test) but they represent the current state of the art in automated red-teaming.
What I've learned from building this
A few things became clear over the course of building Augustus and running it against production systems:
Safety training is not security. I keep coming back to this because it's the fundamental misconception driving the gap. Safety training is a behavioral overlay. It teaches the model patterns for refusal. Security testing asks whether those patterns hold up under adversarial conditions. They almost never do, at least not comprehensively.
Encoding bypasses are embarrassingly effective. The fact that wrapping a harmful request in Base64 still works against many production deployments in 2026 is wild. Input filters and the model itself are operating on different representations of the same input, and that gap is exploitable.
Low-resource languages are an underappreciated attack vector. Safety training is concentrated on English. The drop-off in refusal quality for low-resource languages is significant and consistent.
Agent-level attacks are going to be the next big thing. As models gain tool access, every tool becomes part of the attack surface. A model with browsing access can be manipulated by adversarial web content. A model with database access can be tricked into exfiltrating data. A model that processes documents can follow latent instructions embedded in those documents. We're in the very early innings of understanding this attack surface.
The tooling gap is real and it's getting wider. Organizations are deploying LLMs faster than they're testing them. The models ship fast; in most organizations, the security testing never happens. Something has to close that gap, and it needs to be accessible enough that it doesn't require a specialized AI red team to run.
Get it
Augustus is Apache 2.0 licensed and available now.
Repo: https://github.com/praetorian-inc/augustus
go install github.com/praetorian-inc/augustus/cmd/augustus@latest
augustus scan ollama.OllamaChat \
--all \
--config '{"model":"llama3.2:3b"}'
It's the second tool in a 12-tool open-source series I'm releasing over 12 weeks. One tool per week, each doing one thing well. The first was Julius, which handles LLM fingerprinting (identifying what model is running behind an endpoint). The rest of the series will continue building out the offensive security toolkit for AI systems.
If you run it against your models and find something interesting, I'd like to hear about it. And if you want to contribute probes for attack vectors we haven't covered yet, the repo has a CONTRIBUTING.md that explains the probe definition format and development workflow.
The models are shipping. The testing needs to catch up.
