Anthropic says its new AI model “maintained focus” for 30 hours on multistep tasks

https://arstechnica.com/ai/2025/09/anthropic-says-its-new-ai-model-maintained-focus-for-30-hours-on-multistep-tasks/

Benj Edwards Sep 29, 2025

On Monday, Anthropic released Claude Sonnet 4.5, a new AI language model the company calls its "most capable model to date," with improved coding and computer use capabilities. The company also revealed Claude Code 2.0, a command-line AI agent for developers, and the Claude Agent SDK, which is a tool developers can use to build their own AI coding agents.

Anthropic says it has witnessed Sonnet 4.5 working continuously on the same project "for more than 30 hours on complex, multi-step tasks," though the company did not provide specific details about the tasks. Agentic models have typically lost coherence over long stretches of time as errors accumulate and context windows (a type of short-term memory for the model) fill up. Anthropic has previously mentioned that Claude 4.0 models played Pokémon for over 24 hours or refactored code for seven hours.

To understand why Sonnet exists, you need to know a bit about how AI language models work. Traditionally, Anthropic has produced three differently sized AI models in the Claude family: Haiku (the smallest), Sonnet (mid-range), and Opus (the largest). Anthropic last updated Haiku in November 2024 (to 3.5), Sonnet this past May (to 4.0), and Opus in August (to 4.1). Model size in parameters, which are values stored in a model's neural network, is roughly proportional to overall contextual depth (the number of multidimensional connections between concepts, which you might call "knowledge") and problem-solving capability, but larger models are also slower and more expensive to run. So AI companies always seek a sweet spot in the middle with reasonable performance-cost trade-offs. Claude Sonnet has filled that role for Anthropic quite well for several years now.

Claude is popular with some software developers thanks to Claude Code, and Anthropic is confident about the latest version of Sonnet's coding capability: "Claude Sonnet 4.5 is the best coding model in the world," the company boasts on its website. "It's the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math."

Anthropic backs up those claims with strong benchmark performance. Sonnet 4.5 achieved a reported 77.2 percent score on SWE-bench Verified, a benchmark that attempts to measure real-world software coding abilities, beating OpenAI's GPT-5 Codex (74.5 percent) and Google's Gemini 2.5 Pro (67.2 percent). The model also currently leads the OSWorld benchmark, which tests AI models on real-world computer tasks, at 61.4 percent.

In other testing, Claude Sonnet 4.5 showed gains across multiple other evaluations, such as AIME 2024, a mathematics competition benchmark, and MMMLU, which tests subject knowledge across 14 non-English languages. On finance-specific tasks, Sonnet 4.5 scored 92 percent on Vals AI's Finance Agent, a relatively new benchmark that "tests the ability of agents to perform tasks expected of an entry-level financial analyst."

Sonnet 4.5 also reportedly demonstrated improved computer use capabilities compared to its predecessor in testing. Four months ago, Claude Sonnet 4 scored 42.2 percent on OSWorld. The new version increases that score to 61.4 percent. Anthropic uses these capabilities in its Claude for Chrome extension. Similar to OpenAI's ChatGPT Agent, Claude's extension can navigate websites, fill spreadsheets, and complete other browser-based tasks with varying degrees of success.

As always, it's worth noting that AI benchmarks can be easily gamed, poorly designed, or affected by dataset contamination (a scenario where the model is inadvertently trained on the benchmark's answers). So always take any benchmarks with a grain of salt until they are independently verified. Even with a skeptical eye on the self-reported numbers, Sonnet 4.5 appears to represent a solid step up from 4.0, and given Anthropic's history of delivering more capable models over time, we have no particular reason to doubt that.

Simon Willison, a veteran software developer and frequent source of independent expert perspective on AI models for Ars Technica, wrote about Sonnet 4.5 on his blog today. He seems generally impressed: "Anthropic gave me access to a preview version of a 'new model' over the weekend which turned out to be Sonnet 4.5," he wrote. "My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since it launched a few weeks ago. This space moves so fast—Gemini 3 is rumored to land soon so who knows how long Sonnet 4.5 will continue to hold the 'best coding model' crown."

Claude Sonnet 4.5 is available everywhere today. Through the API, the model maintains the same pricing as Claude Sonnet 4: $3 per million input tokens and $15 per million output tokens. Developers can access it through the Claude API using "claude-sonnet-4-5" as the model identifier.
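That pricing translates into per-request costs in a straightforward way. As a back-of-the-envelope sketch (the dollar figures and model identifier come from Anthropic's announcement; the function and constant names here are our own, not part of any SDK):

```python
# Assumed pricing for Claude Sonnet 4.5 per Anthropic's announcement:
# $3 per million input tokens, $15 per million output tokens.
MODEL_ID = "claude-sonnet-4-5"
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the US-dollar cost of a single API call."""
    return (input_tokens / 1_000_000) * PRICE_PER_M_INPUT + \
           (output_tokens / 1_000_000) * PRICE_PER_M_OUTPUT

# A request with 10,000 input tokens and 2,000 output tokens
# costs roughly six cents at these rates.
print(f"{estimate_cost(10_000, 2_000):.4f}")
```

Note the 5x gap between input and output rates: for agentic workloads that generate large amounts of code, output tokens dominate the bill.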

Other new features

Some ancillary features of the Claude family got some upgrades today, too. For example, Anthropic added code execution and file creation directly within conversations for users of Claude's web interface and dedicated apps. Along those lines, users can now generate spreadsheets, slides, and documents without leaving the chat interface.

The company also released a five-day research preview called "Imagine with Claude" for Max subscribers, which demonstrates the model generating software in real time. Anthropic describes it as "a fun demonstration showing what Claude Sonnet 4.5 can do" when combined with appropriate infrastructure.

As mentioned above, the command-line development tool Claude Code also received several updates today, alongside the new model. The company added checkpoints that save progress and allow users to roll back to previous states, refreshed the terminal interface, and shipped a native VS Code extension. The Claude API also gains a new context editing feature and memory tool for handling longer-running agent tasks.

Right now, AI companies are leaning particularly hard on software development benchmarks as proof of AI assistant capability because progress in other fields is difficult to measure objectively, and coding is a domain where LLMs have arguably shown high utility, while confabulations undermine them elsewhere. But people still use AI chatbots like Claude as general assistants. And given the recent news about troubles with some users going down fantasy rabbit holes with AI chatbots, it's perhaps more notable than usual that Anthropic claims Claude Sonnet 4.5 shows reduced "sycophancy, deception, power-seeking, and the tendency to encourage delusional thinking" compared to previous models. Sycophancy, in particular, is the tendency of an AI model to praise the user's ideas even when they are wrong or potentially dangerous.

We could quibble with how Anthropic frames some of those AI output behaviors through a decidedly anthropomorphic lens, as we have in the past, but overall, attempts to reduce sycophancy are welcome news in a world that has been increasingly turning to chatbots for far more than just coding assistance.