Anthropic has lifted the lid on Claude Opus 4.5, pitched as a marked upgrade on last spring’s Opus 4 release, and designed to handle real software projects more reliably, especially the long, multi-step tasks that tend to expose a model’s limits. lifted the lid on Claude Opus 4.5 last spring’s Opus 4 release Rather than focusing on "bigger" or "faster", Anthropic presents Opus 4.5 as an attempt to shore up the weak spots that emerge when a model is asked to operate inside actual codebases -- work that goes beyond talking about code, to working more effectively within it. Opus 4.5: Tools of the trade for agent workflows Alongside the new model, Anthropic also unveiled a handful of features on the Claude Developer Platform, bundled under what it calls advanced tool use. advanced tool use The update introduces three components: Tool Search Tool for discovering capabilities without flooding the context window; Tool Use Examples to standardise how models learn to invoke tools correctly; and, perhaps most notably, Programmatic Tool Calling, which lets the model write a small script and hand it off to the runtime instead of issuing dozens of incremental tool calls. The effect is that multi-step operations happen in one controlled execution rather than as a token-hungry drip-feed of actions — exactly the difference between a model walking through a task step by step, and one that delegates the whole thing to the runtime in a single run. As this demo shows, the contrast becomes obvious even in something as simple as a puzzle-style task that requires trying multiple combinations. A traditional agent drives the tool one attempt at a time, burning tokens and context with each incremental step; the programmatic approach lets the model generate a short loop that runs entirely inside the tool layer, collapsing the whole sequence into a single execution. It’s a small example, but it captures the underlying shift: from chattering through a workflow to actually carrying it out. Early feedback suggests that the upgrade lands squarely with developers building agent-heavy systems. For example, HubSpot CTO Dharmesh Shah said these changes address the deeper structural issues that shape whether agents can scale beyond simple demos and into real workflows. Dharmesh Shah “As agent architectures get more complex, the bottleneck isn’t the model – it's the orchestration,” Shah wrote. “These features move us closer to agents that can reason, retrieve, call tools, and coordinate real work at scale.” Shah wrote Shah also specifically highlighted the new Tool Search Tool, which shifts a lot of the overhead out of the model’s context window. “Fewer tokens. Faster responses. Less clutter. More joy,” he wrote. Benchmarking Opus 4.5 Across Anthropic’s benchmark suite, Opus 4.5 posts clear gains over its predecessors, particularly in agent-style tasks. It hits 80.9% on SWE-bench Verified for agentic coding (up from 77.2% in Sonnet 4.5 and 74.5% in Opus 4.1), 59.3% on Terminal-bench 2.0, and 88.9% / 98.2% on retail and telecom variants of t2-bench for agentic tool use. The model also improves on scaled tool use (62.3% on MCP Atlas) and computer-use tests (66.3% on OSWorld), with moderate bumps in reasoning, visual understanding, and multilingual Q&A. It is worth noting that figures come from Anthropic’s internal evaluations and, like most agentic and tool-use benchmarks, will need independent verification as public submissions open up. Even so, it’s clear that Opus 4.5 is striving to close the gap between code-generation prowess and consistent performance on long, tool-heavy tasks — a gap that has historically separated impressive demos from systems that can actually ship work. Performance is only part of the picture, though. Anthropic also released new robustness data, focusing on how the model holds up under prompt-injection and other adversarial pressure. Evaluations run by Gray Swan, an external red-team group, show Opus 4.5 with lower susceptibility to prompt-injection attempts than the other models Anthropic tested, with attack success rates of 4.7% at one query, 33.6% at ten, and 63.0% at one hundred. These are iterative, adversarial probes that mimic how failures typically emerge in long, tool-driven workflows, not one-shot jailbreaks. Gray Swan Anthropic also claims more stable behaviour across multi-file and multi-context operations, the kinds of conditions that matter once a model is embedded inside a real system. The accompanying system card lays out the evaluations behind those claims, along with the limitations that remain for practical use. system card Claude Code descends on the desktop Beyond the model itself, Anthropic pushed a handful of additional updates across the Claude product line. Claude Code — which first lived in the terminal before expanding to the browser and mobile apps in October — is now available inside the Claude desktop app tool, giving developers the same project-aware environment across every client. expanding to the browser and mobile apps in October Claude desktop app It also picks up a more deliberate Plan Mode: Claude now asks clarifying questions up front and drafts a plan.md file before executing, making the workflow more predictable and easier to steer. Plan Mode plan.md The main Claude apps received some smaller but useful upgrades. Long conversations now summarise earlier context automatically, so threads don’t break when they get lengthy. Claude for Chrome is opening up to all Max users, giving the model access across browser tabs, while the Excel integration announced in October is expanding to Max, Team and Enterprise plans. Anthropic has also raised usage limits for Opus 4.5 and removed the model-specific caps that previously constrained heavier workflows. Community cost-counting: Opus 4.5’s price drop Sentiment in the immediate aftermath of the launch skewed positive, particularly among developers who have been waiting for a high-end model that’s affordable enough to use in day-to-day systems rather than reserved for exceptional tasks. A big part of that reaction comes down to cost: Opus 4.5 drops to $5 per million input tokens and $25 per million output tokens, roughly a third of the pricing for Opus 4. That shift changes the economics of where a flagship model can be deployed, making it more viable for production workloads that previously defaulted to cheaper variants. roughly a third of the pricing for Opus 4 Indeed, one Hacker News member said the pricing aspect had been “buried in the lede,” meaning the scale of the price cut was significant enough to reshape how the model might be used, yet easy to miss amid the broader brouhaha surrounding the model launch. one Hacker News member “The burying of the lede here is insane – $5/$25 per MTok is a 3x price drop from Opus 4,” they exclaimed. “At that price point, Opus stops being ‘the model you use for important things’ and becomes actually viable for production workloads.” Another commenter pointed out that the changes don’t stop at pricing, highlighting another potentially hidden tidbit that Anthropic has lifted the model-specific caps that previously constrained Claude Code and raised overall usage limits for paid tiers, making the new model far more practical for sustained daily work. pointed out model-specific caps that previously constrained This very same thread surfaced another concern: the risk that the current burst of price cuts and lifted caps is temporary, and that the market could drift toward what one commenter called a “cartel equilibrium.” In their view, the intense competition between model providers is what’s keeping prices low and capabilities moving quickly, but that dynamic could fade if a few dominant labs eventually settle into stable positions and stop undercutting one another. what one commenter called “I like that for this brief moment we actually have a competitive market working in favor of consumers,” they wrote. “I ditched my Claude subscription in favor of Gemini just last week. It won't be great when we enter the cartel equilibrium.” How to access Claude Opus 4.5 Claude Opus 4.5 is available today through the Claude API under the claude-opus-4.5-20251101 model ID, and it’s already live across Anthropic’s own products, including the web, mobile, and desktop apps. Developers using Claude Code will get the upgrade automatically, and enterprise teams can enable it through their existing Claude workspaces. claude-opus-4.5-20251101 On the industry side, most of the major platforms added support at launch, though in different ways depending on how they integrate models. GitHub Copilot, Microsoft Foundry, Amazon Bedrock have made Opus 4.5 available as a selectable model within their ecosystems, while the likes of Vercel AI Gateway, Amp, Warp, and Databricks have introduced support through their respective orchestration and inference layers. GitHub Copilot Microsoft Foundry Amazon Bedrock Vercel AI Gateway Amp Warp Databricks

Amazon

Claude Opus 4.5 Targets Long, Tool-Heavy Tasks in Real Codebases

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

7 Major Learnings from The AI Engineering SF World Fair 2025

The Noonification: How Often Do NFTs Pass The Howey Test? (1/13/2023)

Darwin's Hybrid Intelligence to Align AI & Human Goals for Startups & VCs

The Noonification: White Man (11/26/2022)

The Noonification: The Metaverse is a Sh*tshow (11/2/2022)

100 Days of AI Day 1: From Newsletter to Podcast, Leveraging AI for Audio Transformation

7 Major Learnings from The AI Engineering SF World Fair 2025

The Noonification: How Often Do NFTs Pass The Howey Test? (1/13/2023)

Darwin's Hybrid Intelligence to Align AI & Human Goals for Startups & VCs

The Noonification: White Man (11/26/2022)

The Noonification: The Metaverse is a Sh*tshow (11/2/2022)

100 Days of AI Day 1: From Newsletter to Podcast, Leveraging AI for Audio Transformation

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps