A new open-weights AI coding model is closing in on proprietary options

https://arstechnica.com/ai/2025/12/mistral-bets-big-on-vibe-coding-with-new-autonomous-software-engineering-agent/

Benj Edwards · Dec 10, 2025

On Tuesday, French AI startup Mistral AI released Devstral 2, a 123 billion parameter open-weights coding model designed to work as part of an autonomous software engineering agent. The model achieves a 72.2 percent score on SWE-bench Verified, a benchmark that attempts to test whether AI systems can solve real GitHub issues, putting it among the top-performing open-weights models.

Perhaps more notably, Mistral didn’t just release an AI model; it also released a new development app called Mistral Vibe. It’s a command line interface (CLI), similar to Claude Code, OpenAI Codex, and Gemini CLI, that lets developers interact with the Devstral models directly in their terminal. The tool can scan file structures and Git status to maintain context across an entire project, make changes across multiple files, and execute shell commands autonomously. Mistral released the CLI under the Apache 2.0 license.

It’s always wise to take AI benchmarks with a large grain of salt, but we’ve heard from employees of the big AI companies that they pay very close attention to how well models do on SWE-bench Verified. The benchmark presents AI models with 500 real software engineering problems pulled from GitHub issues in popular Python repositories; the AI must read the issue description, navigate the codebase, and generate a working patch that passes unit tests. While some AI researchers have noted that around 90 percent of the benchmark’s tasks involve relatively simple bug fixes that experienced engineers could complete in under an hour, it remains one of the few standardized ways to compare coding models.
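For a concrete sense of the pass/fail mechanics, here’s a schematic sketch of what scoring one task involves. This is our own simplification, not SWE-bench’s actual harness; the real benchmark also pins dependency versions, distinguishes fail-to-pass from pass-to-pass tests, and runs everything in isolated containers:

```python
import subprocess

def evaluate_task(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch to a clean checkout, then run the tests."""
    # Apply the model's proposed patch; "git apply -" reads it from stdin.
    apply = subprocess.run(
        ["git", "apply", "-"],
        input=model_patch.encode(),
        cwd=repo_dir,
        capture_output=True,
    )
    if apply.returncode != 0:
        return False  # The patch didn't even apply cleanly.

    # A zero exit code from the repository's test suite counts as a solved task.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```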

Alongside the larger model, Mistral also released Devstral Small 2, a 24 billion parameter version that scores 68 percent on the same benchmark and can run locally on consumer hardware, such as a laptop, with no Internet connection required. Both models support a 256,000 token context window, allowing them to process moderately large codebases (though what counts as “large” depends heavily on a project’s overall complexity). The company released Devstral 2 under a modified MIT license and Devstral Small 2 under the more permissive Apache 2.0 license.
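As a back-of-the-envelope check on what 256,000 tokens buys you, here’s a quick estimate using the common (and tokenizer-dependent) heuristic of roughly four characters per token; the heuristic and the file extensions scanned are our own assumptions:

```python
import os

CHARS_PER_TOKEN = 4        # rough heuristic; the actual ratio varies by tokenizer
CONTEXT_WINDOW = 256_000   # tokens, per Mistral's stated limit

def estimate_tokens(root: str) -> int:
    """Walk a source tree and roughly estimate its total token count."""
    total_chars = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith((".py", ".js", ".ts", ".go", ".rs")):
                try:
                    with open(os.path.join(dirpath, name),
                              encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    continue  # Skip unreadable files.
    return total_chars // CHARS_PER_TOKEN

tokens = estimate_tokens(".")
verdict = "fits in" if tokens <= CONTEXT_WINDOW else "exceeds"
print(f"~{tokens:,} tokens; {verdict} a {CONTEXT_WINDOW:,}-token window")
```

At that ratio, 256,000 tokens works out to roughly 1 MB of source text, which is why “moderately large” is the honest qualifier.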

Devstral 2 is currently free to use through Mistral’s API. After the free period ends, pricing will be $0.40 per million input tokens and $2.00 per million output tokens; Devstral Small 2 will cost $0.10 per million input tokens and $0.30 per million output tokens. Mistral says the larger model is about “7x more cost-efficient than Claude Sonnet at real-world tasks.” For comparison, Anthropic’s Sonnet 4.5 costs $3 per million input tokens and $15 per million output tokens through its API, with higher rates kicking in beyond a certain prompt length.
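To make that comparison concrete, here’s the list-price arithmetic for a hypothetical agent run (the token counts are invented for illustration, and Anthropic’s long-prompt surcharges are ignored):

```python
# Per-million-token list prices quoted above (USD).
DEVSTRAL_2 = {"input": 0.40, "output": 2.00}
SONNET_4_5 = {"input": 3.00, "output": 15.00}  # base rate, before long-prompt increases

def job_cost(rates: dict, input_tokens: int, output_tokens: int) -> float:
    """Total cost of one job at the given per-million-token rates."""
    return (rates["input"] * input_tokens
            + rates["output"] * output_tokens) / 1_000_000

# Hypothetical run: 2M input tokens of code context, 400K output tokens of patches.
inp, out = 2_000_000, 400_000
devstral = job_cost(DEVSTRAL_2, inp, out)   # $0.80 + $0.80 = $1.60
sonnet = job_cost(SONNET_4_5, inp, out)     # $6.00 + $6.00 = $12.00
print(f"Devstral 2: ${devstral:.2f}  Sonnet 4.5: ${sonnet:.2f}  "
      f"ratio: {sonnet / devstral:.1f}x")
```

On list prices alone, both input and output come out 7.5 times cheaper, roughly in line with Mistral’s “7x” figure (which the company frames in terms of real-world tasks rather than raw rates).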

The vibe-coding connection

The name “Mistral Vibe” references “vibe coding,” a term that AI researcher Andrej Karpathy coined in February 2025 to describe a style of programming where developers describe what they want in natural language and accept AI-generated code without reviewing it closely. As Karpathy describes it, you can “fully giv[e] in to the vibes, embrace exponentials, and forget that the code even exists.” Collins Dictionary named it Word of the Year for 2025.

The vibe coding approach has drawn both enthusiasm and concern. In an interview with Ars Technica in March, developer Simon Willison said, “I really enjoy vibe coding. It’s a fun way to try out an idea and prove if it can work.” But he also warned that “vibe coding your way to a production codebase is clearly risky. Most of the work we do as software engineers involves evolving existing systems, where the quality and understandability of the underlying code is crucial.”

Mistral is betting that Devstral 2 can maintain coherence across entire projects, detect failures, and retry with corrections, and that those claimed abilities will make it suitable for more serious work than simple prototypes and in-house tools. The company says the model can track framework dependencies and handle tasks like bug fixing and modernizing legacy systems at repository scale. We have not experimented with it yet, but you might see an Ars Technica head-to-head test of several AI coding tools soon.