
Verify critical details (pricing, licensing, availability) with the model's source before business decisions.


Llama 4 Scout Instruct

Model family: llama-4

Size: large (109.0B params)
Context: 10,000,000 tokens
Released: 2025-04-04
Openness: open-weight
License: Llama 4 Community License · commercial: conditional
Cost tier: mixed
Rating: 4.0. Genuinely impressive architecture and a 10M context window that changes what's practical, but the EU restriction is a real catch for a significant chunk of businesses, and the model is still new enough that independent evaluations are catching up to Meta's benchmark claims.
Modalities: image-input, text
Capabilities: chat, instruction-following, long-context, multilingual, reasoning, tool-use, vision
Access: api-third-party, local-runtime-ollama, local-runtime-vllm, weights-download-direct, weights-download-hf

Quick Take

Meta's first mixture-of-experts and first natively multimodal open-weight model, with a 10-million-token context window and an EU restriction business owners need to know about.

Plain-English Description

Llama 4 Scout launched alongside its larger sibling Llama 4 Maverick in April 2025, and it represents two architectural firsts for the Llama family. It's Meta's first model using a mixture-of-experts (MoE) architecture, and it's the first Llama model natively trained to understand images as well as text. Both of those choices have real implications for how the model performs and what it costs to run.

Mixture-of-experts means the model has 109 billion total parameters divided across 16 "experts," but only about 17 billion of them activate for any given token. Think of it like a hospital: you have specialists for cardiology, orthopedics, neurology, and so on, but any one patient typically only needs a couple of them. The MoE architecture lets the model hold a lot more knowledge without requiring you to activate all of it at once, so the per-token compute cost is closer to that of a 17B model even though the capability is closer to that of a larger dense model. It also means the hardware math is different: you need enough memory to hold all the experts ready to activate (which is a lot), but the actual computation per response is faster than you'd expect from a model this size.
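To make the "only some experts run per token" idea concrete, here is a minimal sketch of generic top-k expert routing. It is not Meta's router: the expert count of 16 comes from the description above, but the `TOP_K` value, the toy experts, and every function name are assumptions made purely for illustration.

```python
import math
import random

NUM_EXPERTS = 16  # expert count quoted in the description above
TOP_K = 1         # assumption for illustration; the article does not state Scout's routing fan-out


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts only
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token_value, router_scores, experts, top_k=TOP_K):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](token_value) for i, w in route(router_scores, top_k))


# Toy experts (hypothetical): each one just scales its input by a different factor.
experts = [lambda x, k=k: x * (k + 1) for k in range(NUM_EXPERTS)]

rng = random.Random(0)
scores = [rng.uniform(-1, 1) for _ in range(NUM_EXPERTS)]
selected = route(scores)
output = moe_layer(2.0, scores, experts)
print(f"experts used: {[i for i, _ in selected]} of {NUM_EXPERTS}; output = {output}")
```

All 16 experts have to exist in memory because any of them may be picked for the next token, but each call to `moe_layer` only executes the ones the router selects. That is the memory-versus-compute split the paragraph describes.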

The other headline feature is the context window: 10 million tokens. That's around 7.5 million words, enough to fit multiple novels, an entire codebase, or years of email history in a single prompt. Most business uses won't stretch anywhere near that ceiling, but for document-heavy workflows (legal research, long-form content analysis, multi-document synthesis) this genuinely changes what's possible. And because it's natively multimodal, the model understands images as part of that context: you can put screenshots, diagrams, charts, and photos into the same conversation as text.
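The tokens-to-words figure above is simple arithmetic, sketched below. The 0.75 words-per-token ratio is the rough English average implied by the article's numbers, not a property of Scout's tokenizer, and the helper names are hypothetical. Real token counts depend on the tokenizer and the language of the text.

```python
import math

CONTEXT_TOKENS = 10_000_000  # context window quoted in the article
WORDS_PER_TOKEN = 0.75       # assumption: rough English average implied by the article's figures


def tokens_to_words(tokens, words_per_token=WORDS_PER_TOKEN):
    """Estimate how many English words a token budget covers."""
    return int(tokens * words_per_token)


def fits_in_context(word_count, context_tokens=CONTEXT_TOKENS,
                    words_per_token=WORDS_PER_TOKEN):
    """Estimate whether a document of word_count words fits in the window."""
    estimated_tokens = math.ceil(word_count / words_per_token)
    return estimated_tokens <= context_tokens


print(tokens_to_words(CONTEXT_TOKENS))  # estimated word capacity of the full window
print(fits_in_context(2_000_000))       # a 2-million-word archive
```

For anything close to the limit, count tokens with the model's actual tokenizer rather than relying on this estimate.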

Best For

  • Document-heavy analysis workflows where the context window is the bottleneck: contracts, legal briefs, research synthesis, codebase analysis
  • Multimodal use cases: image understanding, visual QA, diagram interpretation, screenshot analysis
  • Applications that need strong multilingual support — Llama 4 adds Arabic, Indonesian, Tagalog, Thai, and Vietnamese to the existing Llama language list
  • Self-hosted or managed-API deployments where you want frontier-class capability without OpenAI or Anthropic pricing
  • Teams experimenting with MoE architecture before it becomes the default for open-weight models (which it appears to be heading toward)

Not For

  • EU-based businesses or those serving EU customers without verifying current legal status: the Llama 4 Acceptable Use Policy restricts EU use, particularly of the multimodal features. This is a hard stop for many businesses and should be resolved before building.
  • Consumer-hardware deployment — this is an H100-class model at minimum
  • Simple chatbot use cases where Llama 3.1 8B or 3.3 70B would do the job at much lower cost
  • Businesses that need a stable, well-understood model — Scout is new enough that community tooling, fine-tuning recipes, and independent evaluation are still catching up
  • Organizations above 700M monthly active users without a separate Meta license

License — Plain-English Summary

Same 700M MAU threshold as the rest of the Llama family, which affects almost no one in practice. The meaningful new restriction is the EU carve-out: Llama 4's Acceptable Use Policy restricts use in the European Union, particularly of multimodal capabilities, apparently in response to EU AI Act compliance concerns. If your business is EU-based or serves EU customers, do not build on Llama 4 without verifying current legal status directly with Meta or competent legal counsel. For US, UK, and most non-EU markets, the license otherwise permits commercial use, subject to the standard Llama attribution requirements.

How It Compares

  • Llama 3.3 70B Instruct (Meta's previous-generation flagship; text-only, no EU restriction, a far smaller 128K-token context window, simpler dense architecture; the safer choice for EU-exposed businesses)
  • Llama 4 Maverick (Scout's larger sibling in the same family — 400B total parameters with 128 experts, positioned as the enterprise workhorse; same license terms, including the EU restriction)
  • GPT-4o (closed-source from OpenAI: comparable multimodal capability, API-only, no weight access, different license profile entirely; the closed-model alternative when open-weight economics don't matter)

Under the Hood

Llama 4 Scout uses a mixture-of-experts transformer architecture with 16 experts, 17B active parameters per token, and 109B total parameters. Multimodality is implemented through what Meta calls "early fusion": text and image tokens are integrated into a unified model rather than using separate encoders and decoders bolted together. The 10M-token context window was achieved through an interleaved attention architecture without positional embeddings, combined with inference-time temperature scaling. Scout was trained from scratch (unlike Maverick, which was co-distilled from the in-training Llama 4 Behemoth model). Training data exceeded 30 trillion tokens, roughly double Llama 3's training set. Native language support includes Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. The model supports tool use, function calling, and JSON-mode structured output.

Cost

Self-hosted cost: $0.00 beyond compute
API input (per 1M tokens): $0.15
API output (per 1M tokens): $0.60
API providers: together, fireworks, openrouter, aws-bedrock
Notes: API pricing is surprisingly favorable given the model's scale: the MoE architecture means providers bill closer to 17B-model economics. Self-hosted is free beyond compute, but compute is significant (H100-class GPU at minimum for practical use).
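As a rough illustration of what the listed API prices mean per request, the sketch below multiplies token counts by the per-million rates quoted above. The prices come from this page and may differ by provider or change over time; the function name and the example token counts are hypothetical.

```python
from decimal import Decimal

# Prices quoted on this page, in USD per 1M tokens; verify with your provider.
INPUT_PRICE_PER_M = Decimal("0.15")
OUTPUT_PRICE_PER_M = Decimal("0.60")
TOKENS_PER_M = Decimal(1_000_000)


def request_cost(input_tokens, output_tokens):
    """Estimated USD cost of one request at the listed per-million-token prices."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_M * INPUT_PRICE_PER_M
    output_cost = Decimal(output_tokens) / TOKENS_PER_M * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical example: a prompt that fills the whole 10M-token window
# and receives a 2,000-token answer.
print(request_cost(10_000_000, 2_000))
```

`Decimal` is used instead of floats so that small per-request amounts add up without rounding drift when summed across many calls.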

Hardware requirements

Min VRAM: 80 GB
Recommended VRAM: 80 GB
Runs on laptop: No
Notes: Meta positions Scout as "fits on a single H100" at Int4 quantization. That's still a roughly $25K GPU, or about $2-4/hour to rent. Not a consumer-hardware model. The MoE architecture means memory requirements are driven by the full expert count, not just active parameters.
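The "fits on a single H100 at Int4" claim can be sanity-checked with back-of-the-envelope arithmetic on weight storage alone. The sketch below assumes 2 bytes per parameter for bf16, 1 for int8, and 0.5 for int4, and counts 1 GB as 10^9 bytes. It ignores the KV cache, activations, and runtime overhead, so real requirements are higher, especially with long contexts. The function names are hypothetical.

```python
TOTAL_PARAMS = 109e9   # total parameters quoted in the article
ACTIVE_PARAMS = 17e9   # active parameters per token quoted in the article
GPU_VRAM_GB = 80       # VRAM figure quoted in the article

# Assumed storage cost per parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params, precision):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def weights_fit(params, precision, vram_gb=GPU_VRAM_GB):
    """True if the weights alone fit in vram_gb; leaves no allowance for the KV cache."""
    return weight_memory_gb(params, precision) <= vram_gb


for precision in BYTES_PER_PARAM:
    needed = weight_memory_gb(TOTAL_PARAMS, precision)
    print(f"{precision}: {needed:.1f} GB of weights, fits in {GPU_VRAM_GB} GB: "
          f"{weights_fit(TOTAL_PARAMS, precision)}")

print(f"active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The memory check uses `TOTAL_PARAMS` rather than `ACTIVE_PARAMS` because every expert must be resident in memory even though only a fraction of the parameters run for each token.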

Commercial-use conditions

Free for commercial use unless your product had more than 700 million monthly active users on the Llama 4 release date (April 5, 2025). ADDITIONAL restriction: the Llama 4 Acceptable Use Policy restricts use of Llama 4 models, and multimodal features specifically, within the European Union. EU-based businesses or those serving EU customers should verify current status with Meta before building on Llama 4's multimodal capabilities.
