← Back to hard AIs

Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →

Models · Mistral AI

Mistral Large 3 Instruct

Model family: mistral-large-3

Size
frontier (675.0B params)
Context
262,144 tokens
Released
2025-12-01
Openness
open-weight
License
Apache 2.0 · commercial: yes
Cost tier
mixed
Rating
4.5 — Frontier-tier open-weight capability with a genuinely permissive license; loses half a star only because the hardware requirements put self-hosting out of reach for anyone without serious GPU infrastructure.
Modalities
image-input, text
Capabilities
chat, long-context, multilingual, reasoning, tool-use, vision
Access
api-first-party, api-third-party, local-runtime-vllm, weights-download-direct, weights-download-hf

Quick Take

Mistral's open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. frontier model — 675B total parameters, Apache 2.0, 256K context, multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. — the strongest permissive license you'll find on a model this capable.

Plain-English Description

Mistral Large 3 is the company's top-of-the-line general-purpose model and the first Mistral flagship to use a mixture-of-experts (MoEA model architecture that splits the model into many smaller specialized "expert" networks, only activating a handful per input rather than running the whole model every time. The practical effect: you get the knowledge capacity of a big model with the compute cost of a much smaller one. Mistral Large 3 and Mistral Small 4 are both MoE models.) architecture since the original Mixtral series from 2024. The MoE design is what makes the math work: the model has 675 billion parameters total, but for any given tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. only 41 billion of them activate. You get the knowledge capacity of a very large model with inferenceRunning a model to get outputs — as opposed to training it. When you send a prompt to ChatGPT, that's inference. Inference is much cheaper than training per operation but adds up quickly at scale. Pricing pages almost always refer to inference costs (per million tokens, per request, etc.), not training costs. costs closer to a mid-sized dense modelA model where every parameter is used for every input — the entire model runs on every token. Contrast with sparse or Mixture of Experts models, which activate only a fraction of the model per input. Dense models are simpler and more predictable; MoE models are more efficient at scale.. The catch is memory — you still have to load all 675B parameters into GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models. memory to serve the model, which is why self-hosting requires an 8-GPU node at minimum.

The model was announced December 2, 2025, released both as a base (pretrained) variant and the instruction-tuned variant cataloged here. It supports a 256,000-token context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run., natively handles text and image input, and covers a broad swath of languages including European and major Asian languages. Mistral published it under the Apache 2.0 license — the same license that covers Linux, Kubernetes, and most of the modern open-sourceA stricter standard than open-weight: the weights, the training code, and the training data are all released publicly. Very few large language models meet the full open-source bar — most "open" models in the AI world are actually open-weight. When in doubt, check the license file and the creator's documentation. infrastructure stack — which means no user-count caps, no revenue thresholds, no industry restrictions beyond what the license itself specifies. For a model of this capability class, that level of license permissiveness is genuinely unusual; Meta's Llama 4 comes with a 700M monthly-active-user clause, and most other frontier open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. releases carry custom licenses with at least some business-use caveats. Mistral Large 3 has none.

On capability, independent benchmarks place Mistral Large 3 in the top tier of open-weight models without quite matching the frontier proprietary tier. On the Artificial AnalysisAn independent benchmarking site that runs standardized tests across commercial and open-weight models and publishes comparable results on capability, speed, and cost. Widely cited for API provider comparisons — if you want to know whether Llama 3.3 70B is faster on Groq or Together, Artificial Analysis is the reference. Intelligence Index it sits below DeepSeek-v3.2, Kimi K2-Thinking, and GLM-4.6 but above Meta's Llama 4 Maverick, OLMo 3, and most other Western open-weight releases. It's the current top open-source coding model on the LMArena leaderboard. Mistral has announced that a reasoning variant is coming soon but as of this writing it hasn't shipped — which is why we have "Mistral Large 3 Reasoning" on the Watchlist rather than as a separate catalog entry.

Best For

  • Enterprise deployments that need frontier capability and permissive licensing. If you're building a product that embeds a frontier-class LLM and the license terms matter (regulated industry, redistribution, international deployment), Mistral Large 3 is the strongest Apache 2.0 option available.
  • Long-context work on proprietary data. 256K tokens is ~200K words — enough for whole-repository code analysis, multi-document legal review, or book-length content synthesis.
  • European data-sovereignty deployments. Mistral is a French company subject to GDPR and the EU AI Act by default. For regulated EU workloads, this posture is often the deciding factor.
  • Research and experimentation that requires model weightsThe numerical values inside a trained model that encode everything it has learned. A model is, functionally, a giant list of weights — tens of billions of numbers for a mid-sized model, hundreds of billions for a frontier model. "Open-weight" means those numbers are published. "Downloading the weights" means getting the actual file you'd need to run the model yourself.. The full weights are on Hugging Face. You can probe, fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch., distill, or modify without negotiating a license.
  • Multilingual applications. Strong performance across European, major Asian, and Arabic-script languages.

Not For

  • Running on your own machine. Full precision requires ~1.35TB of memory across GPUs. Even the heavily quantized NVFP4 checkpointA specific saved version of a model at a particular point in training. When a creator releases "Llama 3.1 8B Instruct," they're releasing a checkpoint — a frozen snapshot of the model as it existed at the end of training. Most models ship only a single public checkpoint; some creators release multiple (base, instruct, reasoning variants of the same underlying model). needs a Blackwell NVL72 or 8×H100 node. If you want an open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. Mistral you can actually self-host on consumer hardware, Mistral Small 4, Devstral Small 2, or the Ministral 3 family are your options.
  • Cost-optimized workloads where Claude or GPT-class reasoning is overkill. For everyday chat, summarization, and instruction-following, Mistral Small 4 at $0.15/M input tokens delivers much of the capability at a fraction of the cost.
  • The absolute bleeding-edge of reasoning benchmarks. Closed proprietary frontier models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro) still hold an edge on the hardest reasoning tasks, and the open-weight reasoning-specialist models (DeepSeek-v3.2, Kimi K2-Thinking) outscore Large 3 on the Intelligence Index.
  • Use cases where the reasoning variant would matter. Mistral has announced a reasoning variant as coming but it hasn't shipped. If reasoning is your bottleneck, the Ministral 3 14B Reasoning variant or Mistral Small 4 (with reasoning_effort: high) are available today.

License — Plain-English Summary

Mistral Large 3 is Apache 2.0 — the same license that covers Linux, Kubernetes, and most of the modern open-sourceA stricter standard than open-weight: the weights, the training code, and the training data are all released publicly. Very few large language models meet the full open-source bar — most "open" models in the AI world are actually open-weight. When in doubt, check the license file and the creator's documentation. stack. You can use it commercially without restrictions, modify it, fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. it, redistribute modified versions, and build products on top of it. The only obligations are: include the license file with redistribution, and don't use Mistral trademarks without permission. No user-count thresholds, no revenue caps, no industry bans. For a frontier-tier model, this is about as permissive as licensing gets.

How It Compares

  • vs. Meta Llama 4 Scout — Similar capability tier; Scout has a larger 10M context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. but Llama 4's license restricts EU multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. use and has a 700M MAU clause. Mistral Large 3 wins on licensing, Scout wins on extreme context.
  • vs. Meta Llama 3.3 70B Instruct — Llama 3.3 is dense (easier to run), smaller, and has the same 700M MAU conditional license. For teams without H100-class infrastructure, Llama 3.3 is more practical; for teams who want frontier capability and fully permissive licensing, Large 3 is the pick.
  • vs. Mistral Small 4 — Small 4 is 119B MoEA model architecture that splits the model into many smaller specialized "expert" networks, only activating a handful per input rather than running the whole model every time. The practical effect: you get the knowledge capacity of a big model with the compute cost of a much smaller one. Mistral Large 3 and Mistral Small 4 are both MoE models. with 6B active; dramatically cheaper and more deployable, with configurable reasoning that covers most real-world workloads. Use Large 3 when Small 4's quality ceiling becomes a bottleneck.

Under the Hood

Mistral Large 3 is a sparse mixture-of-experts model with granular expert routing — 41B parameters activate per tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. out of 675B total. Training was done on 3,000 NVIDIA H200 GPUs from scratch (not a continued-pretrainingThe first and most expensive phase of training a model, where it learns general language and knowledge from enormous datasets — typically trillions of tokens of text scraped from the internet, books, code, and other sources. Pretraining produces a base model. Major labs spend millions to hundreds of millions of dollars on a single pretraining run. variant of an earlier Mistral model). The base modelA model straight out of pretraining, before any fine-tuning for chat or specific tasks. Base models predict the next token but don't follow instructions well — they'll continue your prompt rather than respond to it. Most people never use base models directly; they use the instruct-tuned or chat versions built on top. Useful mostly for researchers and people doing their own fine-tuning. was released alongside the Instruct variant; both are Apache 2.0. An NVFP4-quantized checkpointA specific saved version of a model at a particular point in training. When a creator releases "Llama 3.1 8B Instruct," they're releasing a checkpoint — a frozen snapshot of the model as it existed at the end of training. Most models ship only a single public checkpoint; some creators release multiple (base, instruct, reasoning variants of the same underlying model). is co-released for Blackwell NVL72 systems, and an FP8 checkpoint fits on a single 8×H100 node using vLLM.

Architecturally, Mistral Large 3 uses a mix of sparse attentionThe mechanism inside a Transformer that lets the model weigh which parts of the input matter most when processing each word. When you read "the cat sat on the mat," attention is how the model knows that "it" in a later sentence refers back to the cat, not the mat. Attention is what made modern language models possible. and implementation-level optimizations to sustain the 256K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. without quadratic compute blow-up. It supports function calling, structured outputs, multi-tool orchestration, fill-in-the-middle code editing, and both OCR and audio transcription endpoints through Mistral's API — though the audio and OCR pipeline integrations ship as separate closed-APIA model that's only accessible through the creator's own API or product — you can't download it, run it yourself, or inspect its weights. GPT-4, Claude, and Gemini Pro are closed-API models. The tradeoff is convenience and often capability (closed-API models are frequently the strongest) versus loss of control over data, pricing, and availability. products (Voxtral, Mistral OCR) rather than being native Large 3 capabilities.

On independent benchmarks, Mistral Large 3 scores 73.11% on MMLUA broad knowledge test covering 57 subjects from law and medicine to mathematics and history. Scores are reported as percentage correct. A score around 85% is strong for a frontier model; above 90% is state-of-the-art. MMLU is probably the most-cited benchmark in AI model comparisons, though it has known weaknesses — models can memorize the questions, and the test reflects a specific cultural and academic context.-Pro and 93.60% on MATH-500, placing it in the top tier of open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. models. It sits below DeepSeek-v3.2, Kimi K2-Thinking, and GLM-4.6 on the Artificial AnalysisAn independent benchmarking site that runs standardized tests across commercial and open-weight models and publishes comparable results on capability, speed, and cost. Widely cited for API provider comparisons — if you want to know whether Llama 3.3 70B is faster on Groq or Together, Artificial Analysis is the reference. Intelligence Index (a composite benchmark) but above Llama 4 Maverick and OLMo 3. On the LMArena leaderboard, it's the current top open-sourceA stricter standard than open-weight: the weights, the training code, and the training data are all released publicly. Very few large language models meet the full open-source bar — most "open" models in the AI world are actually open-weight. When in doubt, check the license file and the creator's documentation. model in the non-reasoning category and the top open-source coding model.

Cost

Self-hosted cost
$0.00 beyond compute
API input (per 1M tokens)
$0.50
API output (per 1M tokens)
$1.50
API providers
mistral, openrouter, fireworks, together, bedrock, azure
Notes
Self-hosted deployment requires an 8×A100 or 8×H100 node minimum, or a Blackwell NVL72 system for the NVFP4-quantized checkpoint. Mistral's own API is the reference implementation; third-party providers host the open weights.

Hardware requirements

Min VRAM
320 GB
Recommended VRAM
640 GB
Runs on laptop
No
Notes
BF16 weights require roughly 1.35TB (you can't actually run this on a laptop or a single workstation). NVFP4-quantized checkpoint fits on a Blackwell NVL72; the FP8 checkpoint fits on a single 8×H100 node.

Comparable models

Sources