Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →

Catalog entry last reviewed 92 days ago.

Mistral Small 4

Model family: mistral-small

Size

large (119.0B params)

Context

262,144 tokens

Released

2026-03-15

Openness

open-weight

License

Apache 2.0 · commercial: yes

Cost tier

mixed

Rating

4.5 ★ — Genuinely excellent price-to-capability ratio plus the configurable-reasoning feature is the kind of thoughtful abstraction that makes building on top of it actually productive. The unified model replacing three previous specialists (Magistral, Pixtral, Devstral) simplifies deployment materially.

Modalities

image-input, text

Capabilities

chat, coding, instruction-following, long-context, multilingual, reasoning, tool-use, vision

Access

api-first-party, api-third-party, local-runtime-llama-cpp, local-runtime-vllm, weights-download-hf

llm
open-weight
commercial-friendly
mid
long-context
multimodal
multilingual
eu-based
moe
apache-licensed
reasoning
coding

Quick Take

Mistral's unified mid-tier workhorse — 119B MoE with 6B active, configurable reasoning depth, multimodal, Apache 2.0, and among the cheapest capable API options available.

Plain-English Description

Mistral Small 4 is the most interesting Mistral release of 2026 because it's the first model from any lab to genuinely merge four previously separate product lines into one checkpoint. Before Small 4, if you wanted reasoning you reached for Magistral; if you wanted vision you used Pixtral; if you wanted agentic code generation you loaded Devstral. Small 4 absorbs all three specialists plus the standard Mistral Small instruct model into a single deployment. One API endpoint, one pricing line, one model to profile. For teams that were previously routing between three or four Mistral models based on query type, the operational simplification is substantial.

The killer feature is configurable reasoning. Mistral exposed a reasoning_effort parameter that controls how much the model "thinks" before responding. Set it to "none" and you get fast, lightweight chat equivalent to the old Mistral Small 3.2. Set it to "high" and you get deep step-by-step reasoning that matches what Magistral did as a standalone model. This is the kind of abstraction that's genuinely useful in production — most applications have some fast-chat queries and some hard-reasoning queries, and historically you'd either pay reasoning-model prices for everything or build routing logic to switch between models. Small 4 lets you switch on a per-request basis without swapping models.

The architecture is sparse mixture-of-experts: 119 billion total parameters but only 6 billion active per token. That's dramatically smaller activation than Mistral Large 3's 41B active, which is how Small 4 manages to be roughly 5× cheaper to run despite having a similar knowledge capacity. The 256K-token context window matches Large 3. The model is natively multimodal, handling both text and image input, and supports tool use, function calling, and structured outputs. Released under Apache 2.0 with no conditional clauses — standard Mistral permissive licensing.

Best For

Teams that were previously running multiple Mistral models. If you've got routing logic selecting between Mistral Small, Magistral, and Pixtral based on query type, Small 4 lets you collapse that to one endpoint and use reasoning_effort to control compute cost per request.
Cost-sensitive production workloads. At $0.15/M input and $0.60/M output, Small 4 is among the cheapest capable multimodal models available anywhere. For high-volume applications where token cost matters, this is the economics story.
Applications with mixed workload profiles. Chat, summarization, vision-Q&A, code generation, and reasoning all in one deployment. Configurable reasoning depth means you don't pay reasoning-model rates for casual chat turns.
Self-hosted mid-tier deployment on enterprise GPUs. Runs on a single H100 in FP8, 8×H100 node in BF16. Well within reach of any team that has serious GPU infrastructure but isn't ready for Mistral Large 3's memory requirements.
European compliance workloads. Apache 2.0 + French creator + EU jurisdiction is a procurement profile that Claude, GPT, and Gemini can't match without additional bolt-on agreements.

Not For

Running on a laptop or consumer GPU. Even the FP8 checkpoint needs 80GB of VRAM. For on-device use, reach for Ministral 3 (3B/8B/14B) instead.
Absolute top-tier reasoning on the hardest benchmarks. Small 4 with reasoning_effort: high is very good, but the specialist frontier reasoning models (DeepSeek-v3.2, Kimi K2-Thinking, GPT-5.1 with extended thinking) still edge it out on the toughest tasks.
Workflows where model provenance transparency matters deeply. Mistral publishes the weights and architecture but has not published the pretraining dataset composition. If your compliance posture requires a fully documented training data supply chain, this gap will matter.
Voice or audio workloads. Small 4 handles text and images but not audio. For speech applications, route to Voxtral models.

License — Plain-English Summary

Apache 2.0, same as Mistral Large 3. You can use it commercially, modify it, fine-tune it, redistribute modified versions, and build products on top of it without restrictions or revenue thresholds. Include the license file with redistribution and don't use Mistral trademarks without permission. Fully permissive.

How It Compares

vs. Mistral Large 3 Instruct — Large 3 is the capability ceiling; Small 4 is the cost-effective workhorse. Large 3 needs 8×H100 to self-host; Small 4 runs on a single H100 in FP8. Same license, same context window. Unless you've verified Small 4 isn't enough for your task, Small 4 is the default.
vs. Meta Llama 4 Scout — Scout has the 10M context window (versus Small 4's 256K) and a very different activation profile (17B active vs 6B). Llama 4's 700M MAU clause restricts some deployments where Mistral's Apache 2.0 is unconditional. Scout is better for extreme-context work; Small 4 is better for general-purpose deployment.
vs. Ministral 3 14B Instruct — The Ministral is dense, runs on consumer hardware, and is genuinely laptop-feasible. Small 4 is meaningfully more capable but categorically not a local-device model. If "runs on my machine" matters, Ministral; if capability matters more, Small 4.

Under the Hood

Mistral Small 4 is a granular sparse mixture-of-experts transformer with 119 billion total parameters and 6 billion active per forward pass. The 256K-token context window uses the same sparse-attention and kernel-level optimizations as Mistral Large 3. Training data has a cutoff that Mistral hasn't publicly disclosed in specifics, though the model's knowledge appears broadly consistent with other early-2026 releases.

The reasoning_effort parameter internally switches the model's decoding behavior — at "high" settings the model produces an extended internal chain-of-thought before its final response, similar to how GPT-5 reasoning and Claude's extended-thinking modes work. At "none" it behaves like a standard instruct model. The research paper notes that "high" mode is the behavior previously shipped as the standalone Magistral model, now folded in.

On benchmarks at launch, Mistral claimed Small 4 outperforms GPT-OSS 120B on LiveCodeBench while generating 20% fewer output tokens. Independent benchmarks generally corroborate that Small 4 is the strongest Apache 2.0-licensed model in the 100B-class. Artificial Analysis places it competitive with Gemini Flash and GPT-4.1 Mini on composite benchmarks at a fraction of the cost.

Mistral released several companion checkpoints alongside the main BF16 release: an FP8 quantized version for single-H100 deployment, an NVFP4 version for Blackwell-class hardware, and an Eagle speculative-decoding head for throughput acceleration. These are cataloged as listings under the Mistral Small 4 family.

Cost

Self-hosted cost: $0.00 beyond compute
API input (per 1M tokens): $0.15
API output (per 1M tokens): $0.60
API providers: mistral, openrouter, fireworks, together
Notes: Among the cheapest multimodal reasoning models available at launch — 5× cheaper than GPT-5.4 Mini on input and 7.5× cheaper on output. Self-hosting requires ~240GB VRAM for BF16, or less with the FP8 / NVFP4 quantized checkpoints.

Pricing data is 92 days old. Verify with the source before relying on it.

Hardware requirements

Min VRAM: 80 GB
Recommended VRAM: 240 GB
Runs on laptop: No
Notes: FP8 checkpoint runs on a single H100 or A100 80GB. BF16 full precision needs an 8×H100 node. Not laptop-feasible at any useful quantization level.