Verify critical details (pricing, licensing, availability) with the model's source before business decisions.

Models · Google

Gemma 4 31B

Model family: gemma-4

Size
mid (31.0B params)
Context
262,144 tokens
Released
2026-04-01
Openness
open-weight
License
Apache License 2.0 · commercial: yes
Cost tier
mixed
Rating
4.5 — Frontier-adjacent capability — strong math and coding, native multimodality, 140 languages — under clean Apache 2.0 and runnable on a single consumer GPU. One of the best open models for businesses that want US-jurisdiction weights.
Modalities
audio-input, image-input, text
Capabilities
chat, coding, function-calling, instruction-following, long-context, math, multilingual, reasoning, tool-use, vision
Access
api-first-party, api-third-party, local-runtime-llama-cpp, local-runtime-ollama, local-runtime-vllm, weights-download-hf

Quick Take

Google's open flagship: a 31B multimodal model under clean Apache 2.0 that beats far larger models on math and coding, and runs on a single consumer GPU.

Plain-English Description

Gemma 4 31B, released in April 2026, is the largest model in Google's open Gemma 4 family and the headline of a genuinely important release: it's the first Gemma built directly from Gemini 3 research and the first to ship under Apache 2.0 rather than Google's older custom license. In plain terms, Google took the techniques behind its closed frontier models and put a slice of them into something you can download and use commercially with no strings.

The standout trait is efficiency, meaning capability per parameter. The 31B model posts scores you'd expect from something much larger: around 89.2% on the AIME 2026 math exam (where many models struggle to clear 60%) and about 80% on LiveCodeBench v6, and it landed near the top of public leaderboards at launch despite being a fraction of the size of the models around it. That comes from architectural work (a hybrid attention design and per-layer embeddings) aimed squarely at getting more out of fewer parameters. It's also natively multimodal (text, images, and audio in, text out) and covers 140 languages.

For a business, the combination is the appeal: near-frontier quality, an unrestricted Apache 2.0 license, US-jurisdiction weights, and a size small enough to self-host on a single high-end GPU. It's Google's answer to the open models from Meta, Alibaba, and DeepSeek, and on licensing it's now among the cleanest of the bunch.

Best For

  • Businesses that want a capable, self-hostable open model with clean Apache 2.0 licensing and US-based provenance.
  • Multimodal work (image and audio understanding) in an open model you control.
  • Multilingual products — 140-language coverage.
  • Math, coding, and reasoning tasks where the efficiency-per-parameter pays off on modest hardware.
  • The "cloud + edge" pattern: Gemma 4 self-hosted for private/local work, Gemini API for the heaviest cloud jobs.
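
The "cloud + edge" pattern in the last bullet can be sketched as a small routing rule. This is only an illustration, not a recommended policy: the `Job` fields, the backend labels, and the decision order are hypothetical, and the only figure taken from this page is the 262,144-token context window.

```python
from dataclasses import dataclass

# Context window listed for Gemma 4 31B on this page.
LOCAL_CONTEXT_TOKENS = 262_144


@dataclass
class Job:
    """Hypothetical description of one request (illustrative fields only)."""
    prompt_tokens: int
    contains_private_data: bool
    needs_peak_capability: bool = False


def choose_backend(job: Job) -> str:
    """Return "self-hosted" or "cloud-api" for a job.

    Assumed policy: private data never leaves your own hardware;
    everything else stays local unless it is too large or too demanding.
    """
    too_large = job.prompt_tokens > LOCAL_CONTEXT_TOKENS
    if job.contains_private_data:
        if too_large:
            # Private work that does not fit locally must be split up
            # rather than sent to a cloud API.
            raise ValueError("private job exceeds the local context window")
        return "self-hosted"
    if job.needs_peak_capability or too_large:
        return "cloud-api"
    return "self-hosted"


print(choose_backend(Job(prompt_tokens=8_000, contains_private_data=True)))
print(choose_backend(Job(prompt_tokens=500_000, contains_private_data=False)))
```

In practice the routing signal could be anything you already track (data classification, latency budget, cost ceiling). The point is only that the private path is decided first.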

Not For

  • Absolute maximum capability — the closed Gemini 3.5 Flash and the larger open models from other labs go higher on some tasks.
  • The lightest edge/on-device targets: for phones and IoT, drop to Gemma 4 E2B.
  • Workloads needing a mixture-of-experts model's throughput profile — see Gemma 4 26B-A4B.
  • Teams that specifically need the very longest context windows (multi-million token) for single-pass retrieval.

License — Plain-English Summary

Apache 2.0 — and that's news worth noting, because earlier Gemma versions (2 and 3) used Google's custom "Gemma Terms of Use," which allowed commercial use but layered on a prohibited-use policy and wasn't true open source. Gemma 4 drops all that: unrestricted commercial use, modification, fine-tuning, and redistribution, no royalties, no carve-outs — just keep the notices and flag significant changes. If you're downloading Gemma, confirm you're getting version 4 (Apache 2.0) rather than an older one (custom terms), because the license genuinely differs between generations.

How It Compares

Against the smaller Gemma 4 sizes, the 31B is the most capable but heaviest — Gemma 4 26B-A4B trades a little quality for mixture-of-experts efficiency, and Gemma 4 E2B goes all the way down to phone-class. Against Google's own closed Gemini 3.5 Flash, Gemma 4 31B is the open, self-hostable option — less peak capability, full ownership. Against the other open flagships — Meta's Llama, Alibaba's Qwen, and DeepSeek — Gemma 4 competes on efficiency-per-parameter and now matches or beats them on license cleanliness (Apache 2.0), with US jurisdiction as a point in its favor for data-governance-sensitive buyers.

Cost

Self-hosted cost
$0.00 beyond compute
Notes
Free to self-host under Apache 2.0. Also hosted on Google AI Studio / Vertex AI and third-party providers for per-token pricing. 140-language support.

Hardware requirements

Min VRAM
20 GB
Recommended VRAM
32 GB
Runs on laptop
Yes
Notes
Runs quantized on a single 24GB consumer GPU; high-end laptops can manage it. Far more accessible than its benchmark scores would suggest.
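
A rough back-of-the-envelope check shows why quantization matters for a 24 GB card. The sketch below counts weight memory as parameters times bits per weight, then adds a flat overhead for activations and the KV cache. The 20% overhead is an assumption chosen for illustration, not a figure from this page, and real usage varies with context length and runtime.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_fraction: float = 0.2) -> float:
    """Weights plus an assumed flat overhead for activations and KV cache."""
    return weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)


def fits_on_gpu(params_billion: float, bits_per_weight: float,
                gpu_vram_gb: float) -> bool:
    """True if the rough estimate fits within the given VRAM budget."""
    return estimate_vram_gb(params_billion, bits_per_weight) <= gpu_vram_gb


PARAMS_B = 31.0  # parameter count listed on this page

for bits in (16, 8, 4):
    weights = weight_memory_gb(PARAMS_B, bits)
    total = estimate_vram_gb(PARAMS_B, bits)
    verdict = "fits" if fits_on_gpu(PARAMS_B, bits, 24) else "does not fit"
    print(f"{bits:>2}-bit: weights {weights:.1f} GB, "
          f"estimate {total:.1f} GB, {verdict} in 24 GB")
```

Under these assumptions only the 4-bit variant lands under 24 GB, which is consistent with the "runs quantized" note above. Long contexts enlarge the KV cache well beyond a flat 20%, so treat the estimate as a lower bound.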
