News

Gemini 3.1 Pro: Google's New Flagship Doubles Abstract Reasoning and Leads on 13 of 16 Benchmarks

Aan Team · March 23, 2026 · 3 min read

Google released Gemini 3.1 Pro on February 19, 2026, replacing Gemini 3 Pro as the default model across AI Studio, Vertex AI, Gemini CLI, and Jules (Google's coding agent). The headline improvement is on ARC-AGI-2, a benchmark for abstract reasoning — the score jumped from 31.1% to 77.1%, a 148% increase. Google claims the model leads on 13 of 16 tracked benchmarks.

The model accepts text, code, images, audio, video, and PDF inputs with a one-million-token context window and outputs up to 65,536 tokens. Pricing stayed the same as Gemini 3 Pro — $2 per million input tokens and $12 per million output tokens — making it competitively positioned against Claude Opus 4.6 at $15/$75 and GPT-5 at $10/$30.
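A quick sketch of what those per-token rates mean in practice. The prices are the ones quoted above (USD per million tokens, input/output); the request sizes are made-up examples for illustration.

```python
# Per-million-token prices (input, output) as quoted in this article.
PRICES = {
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Claude Opus 4.6": (15.00, 75.00),
    "GPT-5": (10.00, 30.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request at the listed rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 50k-token input producing a 2k-token response.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 50_000, 2_000):.4f}")
```

For that sample request, Gemini 3.1 Pro comes out to about $0.124 versus $0.90 for Opus 4.6, which is the cost gap the pricing comparison in this article is pointing at.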

Where the benchmarks improved most

The abstract reasoning jump is the standout number. ARC-AGI-2 tests whether a model can identify patterns and apply rules to novel visual puzzles — something that requires genuine generalization, not memorization. Going from 31% to 77% in one generation suggests a meaningful change in how the model approaches novel problems, not just incremental training improvements.

On coding, SWE-Bench Verified reached 80.6%, up from 76.2%. LiveCodeBench Pro Elo jumped by 448 points to 2887. Terminal-Bench 2.0, which measures agentic tool use in terminal environments, improved from 56.9% to 68.5%. On scientific knowledge, GPQA Diamond reached 94.3%. These are not marginal gains — they represent consistent improvement across every category Google tracks.

Configurable reasoning effort

Gemini 3.1 Pro introduces a new reasoning parameter with three effort levels: high, medium, and low. High gives the model more compute time for complex problems. Medium is the default for balanced performance. Low trades reasoning depth for speed. This is broadly similar to Claude's extended thinking, but exposed as a named effort level rather than a raw token budget.
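The three-level parameter described above might look something like the following request config. This is an illustrative sketch only: the field name `reasoning_effort` and the model id are our placeholders, not confirmed API names, so check Google's Gemini API reference for the exact parameter.

```python
# Hypothetical request config for the three effort levels described above.
# The "reasoning_effort" field name and model id are illustrative placeholders.
EFFORT_LEVELS = ("low", "medium", "high")

def build_config(effort: str = "medium") -> dict:
    """Return a request config dict with the chosen reasoning effort."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return {
        "model": "gemini-3.1-pro",   # placeholder model id
        "reasoning_effort": effort,  # low = fast/cheap, high = more compute
    }

# Medium is the default for balanced performance.
print(build_config())
print(build_config("high"))
```

The appeal of a named level over a token budget is that callers pick an intent ("fast" vs. "thorough") instead of tuning a number per task.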

The practical effect is meaningful. On tasks that do not require deep reasoning — classification, summarization, simple extraction — low effort mode reduces latency and cost. On complex coding or math problems, high effort mode can improve accuracy. Google reports up to 15% efficiency improvement over the best Gemini 3 Pro runs, with fewer output tokens needed for equivalent quality.

How it compares to Claude and GPT

On SWE-Bench Verified, Gemini 3.1 Pro scores 80.6% — the highest reported score among major models. Claude Opus 4.6 and Sonnet 4.6 are competitive but Google claims the lead. On LiveCodeBench Pro, the 2887 Elo puts it ahead of Claude Sonnet and GPT-5.3 Codex. On GPQA Diamond at 94.3%, it outperforms all listed competitors.

Where Gemini 3.1 Pro has a structural advantage is multimodal input. It processes images, audio, video, and PDFs natively within a 1M token context — up to 3,000 images, 45 minutes of video, or 8.4 hours of audio per prompt. Neither Claude nor GPT matches this combination of context length and input modality breadth at the same price point.
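The per-prompt limits above can be sketched as a simple pre-flight check. The helper name and the assumption that each media type is capped independently are ours for illustration; only the limit values come from the article.

```python
# Pre-flight check against the per-prompt limits quoted above:
# up to 3,000 images, 45 minutes of video, or 8.4 hours of audio.
LIMITS = {
    "images": 3_000,            # count
    "video_minutes": 45.0,      # minutes
    "audio_minutes": 8.4 * 60,  # 8.4 hours expressed in minutes
}

def within_limits(images: int = 0, video_minutes: float = 0.0,
                  audio_minutes: float = 0.0) -> bool:
    """True if every media type stays within its quoted per-prompt limit."""
    usage = {"images": images, "video_minutes": video_minutes,
             "audio_minutes": audio_minutes}
    return all(usage[key] <= LIMITS[key] for key in LIMITS)

print(within_limits(images=2_500, video_minutes=30))  # fits
print(within_limits(audio_minutes=9 * 60))            # 9h exceeds 8.4h
```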

Developer integration

The model is available across Google's full developer stack: AI Studio for prototyping, Vertex AI for enterprise deployment, Gemini CLI for command-line access, and as the default in Jules and Android Studio. It also powers the full-stack coding features in Google AI Studio that we covered in the Stitch and AI Studio article.

For developers already in the Google ecosystem, the upgrade is automatic — Jules switched its default on March 9. For those evaluating models, the key differentiator is the combination of benchmark performance, multimodal input support, and competitive pricing. At $2/$12 per million tokens with a 1M context window, it undercuts Claude significantly while matching or exceeding performance on most benchmarks.