
Verify critical details (pricing, licensing, availability) with the model's source before making business decisions.

Models · Mistral AI

Ministral 3 14B Instruct

Model family: ministral-3

Size
mid (14.0B params)
Context
262,144 tokens
Released
2025-12-01
Openness
open-weight
License
Apache 2.0 · commercial: yes
Cost tier
mixed
Rating
4.5 — Best balance in the Mistral lineup of capability, hardware accessibility, and license clarity. Multimodal, long context, Apache 2.0, and genuinely runs on consumer hardware — this is the Mistral model most small businesses should start with.
Modalities
image-input, text
Capabilities
chat, function-calling, instruction-following, long-context, multilingual, tool-use, vision
Access
api-first-party, api-third-party, local-runtime-llama-cpp, local-runtime-lm-studio, local-runtime-ollama, local-runtime-vllm, weights-download-direct, weights-download-hf

Quick Take

Mistral's biggest edge-class model: 14B parameters, vision-capable, 256K context, runs on a single consumer GPU, and performs like a 24B model. Apache 2.0.

Plain-English Description

Ministral 3 is Mistral's family of edge-deployable models, released December 2, 2025 as part of the broader Mistral 3 generation launch. The family ships in three parameter sizes (3B, 8B, 14B), and each size comes in three variants (Base for fine-tuning, Instruct for chat, Reasoning for step-by-step problem solving). Every Ministral 3 model is multimodal (text plus image input), which is notable for models in this size class; most competing edge models (Llama 3.2 1B/3B, Gemma 2B/7B) don't have vision capability built in.

The 14B Instruct variant is the largest and most capable of the family, and the one most teams will actually deploy. It's a dense model (every parameter activates on every token), which makes it straightforward to reason about for self-hosting: there's no expert-routing complexity like you'd find in Mistral's MoE models. The 256K-token context window matches what Mistral Large 3 and Mistral Small 4 offer. At FP8 quantization the model fits comfortably in 24GB of VRAM (a single RTX 4090 or 3090), and Q4 GGUF quantization brings the requirement down to 16GB with modest quality loss. On Apple Silicon, a 32GB MacBook Pro runs it via llama.cpp or LM Studio at useful speeds.
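The memory figures above follow from simple arithmetic on weight storage. The sketch below is a rough estimate only: it assumes weights dominate memory and ignores the KV cache, activations, and runtime overhead, which is why real VRAM needs run higher than these numbers. The function name `weight_memory_gb` and the 4.5 bits-per-weight figure for Q4 GGUF are illustrative assumptions, not values published by Mistral.

```python
PARAMS_BILLIONS = 14.0  # parameter count from the listing above


def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


# Bits per weight for each format. The Q4 figure is an assumed average,
# since GGUF Q4 variants store scaling factors alongside the 4-bit values.
FORMATS = {"BF16": 16, "FP8": 8, "Q4 GGUF (assumed ~4.5 bits)": 4.5}

for name, bits in FORMATS.items():
    gb = weight_memory_gb(PARAMS_BILLIONS, bits)
    print(f"{name}: ~{gb:.1f} GB of weights")
```

Under these assumptions the weights alone come to roughly 28 GB at BF16, 14 GB at FP8, and under 8 GB at Q4. The gap between those numbers and the 24GB and 16GB tiers is the headroom left for context and runtime overhead.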

The performance story is interesting. Mistral positions Ministral 3 14B as offering "frontier capabilities and performance comparable to its larger Mistral Small 3.2 24B counterpart", which is a real claim backed by benchmarks. The Ministral 3 14B Reasoning variant hits 85% on AIME 2025, a math-reasoning benchmark result that's genuinely state-of-the-art for a model this size. The Instruct variant doesn't benchmark quite that high but remains the strongest open-weight 14B vision-capable model available. For most non-reasoning-heavy workloads, you'd reach for the Instruct variant; for reasoning-heavy workloads where the extra latency of a chain-of-thought is acceptable, the Reasoning variant is a separate listing.

Best For

  • Small-business-class local deployments. This is the Mistral model that fits the "one consumer GPU, private inference, real capability" profile most cleanly. For a law firm, a consultancy, or an internal business-tools team, Ministral 3 14B on a single 4090 is a practical starting point.
  • Edge and on-device vision+text applications. Document understanding, image-Q&A, visual inspection workflows. The built-in vision capability means one model handles both modalities without routing.
  • Multilingual applications in small-to-mid markets. Broad multilingual coverage (13+ languages including European, Asian, and Arabic-script) with a model light enough to run in many environments.
  • Fine-tuning for narrow domains. The 14B parameter count is large enough to hold serious domain knowledge and small enough to fine-tune affordably. Good starting point for a domain-specific deployment.
  • Teams who want the best local-deployment Mistral. If you've decided on Mistral and you can't commit to the 8×H100 footprint of Mistral Large 3 or the single-H100 of Mistral Small 4, this is where you land.

Not For

  • Frontier-capability workloads. Ministral 3 14B is the best in its class, but it's still a 14B model. For maximum capability, Mistral Large 3 or Mistral Small 4 are meaningfully more capable.
  • Reasoning-heavy tasks. The Reasoning variant exists for a reason; Instruct's outputs are faster but lack the extended chain-of-thought. If AIME-class reasoning matters, use the Reasoning variant (separate listing).
  • Agentic coding as a primary use case. Ministral 3 14B can code but it isn't post-trained for agentic coding the way Devstral Small 2 is. For coding-agent workloads, Devstral Small 2 is the better specialist.
  • Very constrained hardware (<16GB VRAM). At aggressive quantization the 14B can technically squeeze into 12GB, but quality degrades noticeably. For truly small hardware targets, Ministral 3 8B or 3B are purpose-built.

License — Plain-English Summary

Apache 2.0. Commercial use unrestricted, modifications and redistribution allowed, fine-tuning allowed without special terms. Include the license file. This is the standard Mistral open-weight posture: no user caps, no revenue thresholds.

How It Compares

  • vs. Mistral Small 4: Small 4 is more capable but requires an H100 to self-host; Ministral 3 14B runs on a single consumer GPU. If you can afford the infrastructure for Small 4, take it; if you can't, Ministral 3 14B is the next best Mistral.
  • vs. Devstral Small 2 24B — Devstral is a coding specialist; Ministral 3 14B is a generalist with vision. Both are Apache 2.0 and both laptop-class. For mixed workloads, Ministral; for agentic coding specifically, Devstral.
  • vs. Ministral 3 14B Reasoning: Same base, different post-training. The Reasoning variant hits 85% on AIME 2025 but produces longer responses through extended chain-of-thought. Instruct is faster; Reasoning is more accurate on math/logic. Separate listing.
  • vs. Meta Llama 3.2 11B Vision Instruct: Similar size tier with vision capability. Ministral 3 14B has a longer context window (256K vs 128K), higher general benchmarks, and cleaner licensing (Apache 2.0 vs Llama Community License's 700M MAU clause). Llama 3.2 has a more mature ecosystem and larger community for tooling.

Under the Hood

Ministral 3 14B is a dense decoder-only transformer with a vision encoder fused for native multimodal input. The architecture includes RoPE scaling (inspired by Llama 4) and scalable-softmax attention mechanisms to support the 256K context window efficiently. It supports function calling in Mistral's native format, structured outputs (JSON), and tool-use orchestration out of the box.
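To make the function-calling flow concrete, here is a minimal sketch of the application side: a tool described in the JSON-schema style that OpenAI-compatible chat endpoints commonly accept, plus a dispatcher that runs whatever call the model requests. Whether a particular runtime takes exactly this shape is an assumption, and the tool `get_invoice_total`, its fields, and the simulated model output are all hypothetical. No model or network call is made.

```python
import json


# Hypothetical tool: the name, fields, and lookup table are illustrative only.
def get_invoice_total(invoice_id: str) -> dict:
    totals = {"INV-001": 1250.00}
    return {"invoice_id": invoice_id, "total_usd": totals.get(invoice_id)}


# Tool description sent alongside the chat messages (assumed request shape).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_invoice_total",
        "description": "Look up the total for an invoice.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_invoice_total": get_invoice_total}


def dispatch(tool_call: dict) -> str:
    """Run the function a model asked for and return a JSON string result."""
    func = REGISTRY[tool_call["name"]]
    args = json.loads(tool_call["arguments"])  # arguments arrive as JSON text
    return json.dumps(func(**args))


# Simulated model output; a real call would be read from the chat response.
simulated_call = {
    "name": "get_invoice_total",
    "arguments": '{"invoice_id": "INV-001"}',
}
print(dispatch(simulated_call))
```

In a real deployment the string returned by `dispatch` would be sent back to the model as a tool-result message so it can compose the final answer.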

The default Hugging Face release is FP8-quantized for efficient deployment. Mistral also releases a companion BF16 ("no-loss FP8") version, a GGUF quantization family (Q2 / Q4 / Q5 / Q8 variants), and an ONNX export for certain deployment scenarios. The Ministral 3 - Additional Checkpoints collection on Hugging Face catalogs all of these.

Reported benchmarks at launch: comparable to Mistral Small 3.2 24B on most non-reasoning tasks. The Reasoning variant of the same 14B base achieves 85% on AIME 2025, beating Qwen3-14B's 73.7% — notable for a reasoning specialist in the small-model class. Ministral 3 14B Instruct (this model) focuses on fast, single-pass instruction following rather than extended chain-of-thought.

Cost

Self-hosted cost
$0.00 beyond compute
API input (per 1M tokens)
$0.10
API output (per 1M tokens)
$0.30
API providers
mistral, openrouter, fireworks, together
Notes
Self-hosting is free beyond compute. FP8 weights fit on a single consumer GPU with 24GB VRAM; GGUF quantizations run on 16GB or less. This is the "big model on a single prosumer GPU" tier.
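For budgeting against the API prices listed above, a minimal sketch of the per-token arithmetic. The prices come from this listing and may change; the function name `api_cost_usd` and the workload numbers are hypothetical.

```python
INPUT_USD_PER_MILLION = 0.10   # API input price from the listing
OUTPUT_USD_PER_MILLION = 0.30  # API output price from the listing


def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated API spend in USD for a given token volume."""
    return (input_tokens * INPUT_USD_PER_MILLION
            + output_tokens * OUTPUT_USD_PER_MILLION) / 1_000_000


# Hypothetical workload: 2,000 requests a day,
# each with 1,500 input tokens and 400 output tokens.
daily = api_cost_usd(2_000 * 1_500, 2_000 * 400)
print(f"~${daily:.2f} per day, ~${daily * 30:.2f} per 30 days")
```

At that assumed volume the API costs well under a dollar a day, which is a useful baseline when weighing it against the purchase and power costs of a self-hosted GPU.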

Hardware requirements

Min VRAM
16 GB
Recommended VRAM
24 GB
Runs on laptop
Yes
Notes
Q4-quantized GGUF runs on 16GB VRAM (for example, an RTX 4080). FP8 native fits in 24GB (RTX 4090 / RTX 3090). Full BF16 precision needs ~32GB. Practical laptop deployment via Apple Silicon Macs with 32GB+ unified memory.
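The thresholds in these notes can be expressed as a small lookup. This is a sketch that encodes only the figures stated in this listing (16GB, 24GB, ~32GB); the function name `suggest_variant` is hypothetical, and actual headroom depends on context length and runtime.

```python
def suggest_variant(vram_gb: float) -> str:
    """Map available VRAM to the deployment option this listing describes."""
    if vram_gb >= 32:
        return "BF16 (full precision)"
    if vram_gb >= 24:
        return "FP8 (default release)"
    if vram_gb >= 16:
        return "Q4 GGUF"
    # Below the listed minimum, the smaller family members are the better fit.
    return "consider Ministral 3 8B or 3B"


for gb in (12, 16, 24, 32):
    print(f"{gb} GB -> {suggest_variant(gb)}")
```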

Comparable models

Sources