Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →
Mistral Small 4
Model family: mistral-small
- llm
- open-weight
- commercial-friendly
- mid
- long-context
- multimodal
- multilingual
- eu-based
- moe
- apache-licensed
- reasoning
- coding
Quick Take
Mistral's unified mid-tier workhorse — 119B MoEA model architecture that splits the model into many smaller specialized "expert" networks, only activating a handful per input rather than running the whole model every time. The practical effect: you get the knowledge capacity of a big model with the compute cost of a much smaller one. Mistral Large 3 and Mistral Small 4 are both MoE models. with 6B active, configurable reasoning depth, multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default., Apache 2.0, and among the cheapest capable API options available.
Plain-English Description
Mistral Small 4 is the most interesting Mistral release of 2026 because it's the first model from any lab to genuinely mergeA model created by mathematically combining the weights of two or more existing models. Merges don't require training — just algebra — and can produce models that inherit strengths from each parent. Common in the open-weight community on Hugging Face. Quality varies widely; the best merges genuinely improve on their parents, while careless ones produce worse outputs than either. four previously separate product lines into one checkpointA specific saved version of a model at a particular point in training. When a creator releases "Llama 3.1 8B Instruct," they're releasing a checkpoint — a frozen snapshot of the model as it existed at the end of training. Most models ship only a single public checkpoint; some creators release multiple (base, instruct, reasoning variants of the same underlying model).. Before Small 4, if you wanted reasoning you reached for Magistral; if you wanted vision you used Pixtral; if you wanted agentic code generation you loaded Devstral. Small 4 absorbs all three specialists plus the standard Mistral Small instruct model into a single deployment. One API endpoint, one pricing line, one model to profile. For teams that were previously routing between three or four Mistral models based on query type, the operational simplification is substantial.
The killer feature is configurable reasoning. Mistral exposed a reasoning_effort parameter that controls how much the model "thinks" before responding. Set it to "none" and you get fast, lightweight chat equivalent to the old Mistral Small 3.2. Set it to "high" and you get deep step-by-step reasoning that matches what Magistral did as a standalone model. This is the kind of abstraction that's genuinely useful in production — most applications have some fast-chat queries and some hard-reasoning queries, and historically you'd either pay reasoning-model prices for everything or build routing logic to switch between models. Small 4 lets you switch on a per-request basis without swapping models.
The architecture is sparse mixture-of-experts: 119 billion total parameters but only 6 billion active per tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words.. That's dramatically smaller activation than Mistral Large 3's 41B active, which is how Small 4 manages to be roughly 5× cheaper to run despite having a similar knowledge capacity. The 256K-token context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. matches Large 3. The model is natively multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default., handling both text and image input, and supports tool use, function calling, and structured outputs. Released under Apache 2.0 with no conditional clauses — standard Mistral permissive licensing.
Best For
- Teams that were previously running multiple Mistral models. If you've got routing logic selecting between Mistral Small, Magistral, and Pixtral based on query type, Small 4 lets you collapse that to one endpoint and use
reasoning_effortto control compute cost per request. - Cost-sensitive production workloads. At $0.15/M input and $0.60/M output, Small 4 is among the cheapest capable multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. models available anywhere. For high-volume applications where tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. cost matters, this is the economics story.
- Applications with mixed workload profiles. Chat, summarization, vision-Q&A, code generation, and reasoning all in one deployment. Configurable reasoning depth means you don't pay reasoning-model rates for casual chat turns.
- Self-hostedRunning a model on hardware you control — your own servers, your own cloud instance, or your own laptop — rather than paying to access it through someone else's API. Self-hosting gives you full control over data and predictable costs, but requires the hardware and operational effort to run the model. Only possible with open-weight models. mid-tier deployment on enterprise GPUs. Runs on a single H100 in FP8, 8×H100 node in BF16. Well within reach of any team that has serious GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models. infrastructure but isn't ready for Mistral Large 3's memory requirements.
- European compliance workloads. Apache 2.0 + French creator + EU jurisdiction is a procurement profile that Claude, GPT, and Gemini can't match without additional bolt-on agreements.
Not For
- Running on a laptop or consumer GPUA GPU designed for desktop PCs and gaming — typically Nvidia RTX 3090, 4090, 5090 or similar. Consumer GPUs have 8-32GB of VRAM and cost a few thousand dollars each. Capable of running small and medium models, especially when quantized. The boundary between "runs on a consumer GPU" and "needs a datacenter GPU" roughly separates small from large models in the catalog.. Even the FP8 checkpointA specific saved version of a model at a particular point in training. When a creator releases "Llama 3.1 8B Instruct," they're releasing a checkpoint — a frozen snapshot of the model as it existed at the end of training. Most models ship only a single public checkpoint; some creators release multiple (base, instruct, reasoning variants of the same underlying model). needs 80GB of VRAMThe memory built into a GPU. VRAM size determines what models you can load and run — a model's weights must fit in VRAM (or be cleverly swapped in and out). A 7B model in 4-bit quantization needs about 6GB of VRAM; a 70B model in 4-bit needs about 40GB; full-precision frontier models need multiple high-end GPUs. When people talk about a model "fitting" on a GPU, they mean VRAM.. For on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. use, reach for Ministral 3 (3B/8B/14B) instead.
- Absolute top-tier reasoning on the hardest benchmarks. Small 4 with
reasoning_effort: highis very good, but the specialist frontier reasoning models (DeepSeek-v3.2, Kimi K2-Thinking, GPT-5.1 with extended thinking) still edge it out on the toughest tasks. - Workflows where model provenance transparency matters deeply. Mistral publishes the weightsThe numerical values inside a trained model that encode everything it has learned. A model is, functionally, a giant list of weights — tens of billions of numbers for a mid-sized model, hundreds of billions for a frontier model. "Open-weight" means those numbers are published. "Downloading the weights" means getting the actual file you'd need to run the model yourself. and architecture but has not published the pretrainingThe first and most expensive phase of training a model, where it learns general language and knowledge from enormous datasets — typically trillions of tokens of text scraped from the internet, books, code, and other sources. Pretraining produces a base model. Major labs spend millions to hundreds of millions of dollars on a single pretraining run. dataset composition. If your compliance posture requires a fully documented training data supply chain, this gap will matter.
- Voice or audio workloads. Small 4 handles text and images but not audio. For speech applications, route to Voxtral models.
License — Plain-English Summary
Apache 2.0, same as Mistral Large 3. You can use it commercially, modify it, fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. it, redistribute modified versions, and build products on top of it without restrictions or revenue thresholds. Include the license file with redistribution and don't use Mistral trademarks without permission. Fully permissive.
How It Compares
- vs. Mistral Large 3 Instruct — Large 3 is the capability ceiling; Small 4 is the cost-effective workhorse. Large 3 needs 8×H100 to self-host; Small 4 runs on a single H100 in FP8. Same license, same context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run.. Unless you've verified Small 4 isn't enough for your task, Small 4 is the default.
- vs. Meta Llama 4 Scout — Scout has the 10M context window (versus Small 4's 256K) and a very different activation profile (17B active vs 6B). Llama 4's 700M MAU clause restricts some deployments where Mistral's Apache 2.0 is unconditional. Scout is better for extreme-context work; Small 4 is better for general-purpose deployment.
- vs. Ministral 3 14B Instruct — The Ministral is dense, runs on consumer hardware, and is genuinely laptop-feasible. Small 4 is meaningfully more capable but categorically not a local-device model. If "runs on my machine" matters, Ministral; if capability matters more, Small 4.
Under the Hood
Mistral Small 4 is a granular sparse mixture-of-experts transformerThe core model architecture that powers nearly every modern AI language model. Introduced by Google researchers in 2017, it uses a mechanism called attention to process text by looking at every word in context with every other word simultaneously, rather than one at a time. "Transformer" is the T in GPT, BERT, and most other model names. with 119 billion total parameters and 6 billion active per forward pass. The 256K-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. uses the same sparse-attentionThe mechanism inside a Transformer that lets the model weigh which parts of the input matter most when processing each word. When you read "the cat sat on the mat," attention is how the model knows that "it" in a later sentence refers back to the cat, not the mat. Attention is what made modern language models possible. and kernel-level optimizations as Mistral Large 3. Training data has a cutoff that Mistral hasn't publicly disclosed in specifics, though the model's knowledge appears broadly consistent with other early-2026 releases.
The reasoning_effort parameter internally switches the model's decoding behavior — at "high" settings the model produces an extended internal chain-of-thought before its final response, similar to how GPT-5 reasoning and Claude's extended-thinking modes work. At "none" it behaves like a standard instruct model. The research paper notes that "high" mode is the behavior previously shipped as the standalone Magistral model, now folded in.
On benchmarks at launch, Mistral claimed Small 4 outperforms GPT-OSS 120B on LiveCodeBench while generating 20% fewer output tokens. Independent benchmarks generally corroborate that Small 4 is the strongest Apache 2.0-licensed model in the 100B-class. Artificial AnalysisAn independent benchmarking site that runs standardized tests across commercial and open-weight models and publishes comparable results on capability, speed, and cost. Widely cited for API provider comparisons — if you want to know whether Llama 3.3 70B is faster on Groq or Together, Artificial Analysis is the reference. places it competitive with Gemini Flash and GPT-4.1 Mini on composite benchmarks at a fraction of the cost.
Mistral released several companion checkpoints alongside the main BF16 release: an FP8 quantized version for single-H100 deployment, an NVFP4 version for Blackwell-class hardware, and an Eagle speculative-decoding head for throughput acceleration. These are cataloged as listings under the Mistral Small 4 family.
Cost
- Self-hosted cost
- $0.00 beyond compute
- API input (per 1M tokens)
- $0.15
- API output (per 1M tokens)
- $0.60
- API providers
- mistral, openrouter, fireworks, together
- Notes
- Among the cheapest multimodal reasoning models available at launch — 5× cheaper than GPT-5.4 Mini on input and 7.5× cheaper on output. Self-hosting requires ~240GB VRAM for BF16, or less with the FP8 / NVFP4 quantized checkpoints.
Hardware requirements
- Min VRAM
- 80 GB
- Recommended VRAM
- 240 GB
- Runs on laptop
- No
- Notes
- FP8 checkpoint runs on a single H100 or A100 80GB. BF16 full precision needs an 8×H100 node. Not laptop-feasible at any useful quantization level.