Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →

Models · Meta

Catalog entry last reviewed 94 days ago.

Llama 4 Scout Instruct

Model family: llama-4

Size

large (109.0B params)

Context

10,000,000 tokens

Released

2025-04-04

Openness

open-weight

License

Llama 4 Community License · commercial: conditional

Cost tier

mixed

Rating

4.0 ★ — Genuinely impressive architecture and a 10M context window that changes what's practical — but the EU restriction is a real catch for a significant chunk of businesses, and the model is still new enough that independent evaluations are catching up to Meta's benchmark claims.

Modalities

image-input, text

Capabilities

chat, instruction-following, long-context, multilingual, reasoning, tool-use, vision

Access

api-third-party, local-runtime-ollama, local-runtime-vllm, weights-download-direct, weights-download-hf

llm
open-weight
commercial-friendly
large
long-context
multilingual
us-based
tool-use
vision
multimodal
mixture-of-experts

Quick Take

Meta's first mixture-of-experts and first natively multimodal open-weight model, with a 10-million-token context window and an EU restriction business owners need to know about.

Plain-English Description

Llama 4 Scout launched alongside its larger sibling Llama 4 Maverick in April 2025, and it represents two architectural firsts for the Llama family. It's Meta's first model using mixture-of-experts (MoE) architecture, and it's the first Llama model natively trained to understand images as well as text. Both of those choices have real implications for how the model performs and what it costs to run.

Mixture-of-experts means the model has 109 billion total parameters divided across 16 "experts," but only about 17 billion of them activate for any given token. Think of it like a hospital: you have specialists for cardiology, orthopedics, neurology, and so on, but any one patient typically only needs a couple of them. The MoE architecture lets the model hold a lot more knowledge without requiring you to activate all of it at once — so the per-token compute cost is closer to a 17B model even though the capability is closer to a larger dense model. It also means the hardware math is different: you need enough memory to hold all the experts ready to activate (which is a lot), but the actual computation per response is faster than you'd expect from a model this size.

The other headline feature is the context window: 10 million tokens. That's around 7.5 million words — enough to fit multiple novels, an entire codebase, or years of email history in a single prompt. Most business uses won't stretch anywhere near that ceiling, but for document-heavy workflows — legal research, long-form content analysis, multi-document synthesis — this genuinely changes what's possible. And because it's natively multimodal, the model understands images as part of that context: you can put screenshots, diagrams, charts, and photos into the same conversation as text.

Best For

Document-heavy analysis workflows where context window is the bottleneck — contracts, legal briefs, research synthesis, codebase analysis
Multimodal use cases — image understanding, visual QA, diagram interpretation, screenshot analysis
Applications that need strong multilingual support — Llama 4 adds Arabic, Indonesian, Tagalog, Thai, and Vietnamese to the existing Llama language list
Self-hosted or managed-API deployments where you want frontier-class capability without OpenAI or Anthropic pricing
Teams experimenting with MoE architecture before it becomes the default for open-weight models (which it appears to be heading toward)

Not For

EU-based businesses or those serving EU customers without verifying current legal status — the Llama 4 Acceptable Use Policy restricts EU use, particularly of the multimodal features. This is a hard stop for many businesses and should be resolved before building.
Consumer-hardware deployment — this is an H100-class model at minimum
Simple chatbot use cases where Llama 3.1 8B or 3.3 70B would do the job at much lower cost
Businesses that need a stable, well-understood model — Scout is new enough that community tooling, fine-tuning recipes, and independent evaluation are still catching up
Organizations above 700M monthly active users without a separate Meta license

License — Plain-English Summary

Same 700M MAU threshold as the rest of the Llama family — which, again, affects almost no one in practice. The meaningful new restriction is the EU carve-out: Llama 4's Acceptable Use Policy restricts use in the European Union, particularly of multimodal capabilities, apparently in response to EU AI Act compliance concerns. If your business is EU-based or serves EU customers, do not build on Llama 4 without verifying current legal status directly with Meta or competent legal counsel. For US, UK, and most non-EU markets, the license is otherwise permissive commercial use with the standard Llama attribution requirements.

How It Compares

Llama 3.3 70B Instruct (see Llama 3.3 70B Instruct — Meta's previous-generation flagship; text-only, no EU restriction, slightly smaller context window, simpler architecture; the safer choice for EU-exposed businesses)
Llama 4 Maverick (Scout's larger sibling in the same family — 400B total parameters with 128 experts, positioned as the enterprise workhorse; same license terms, including the EU restriction)
GPT-4o (closed-source from OpenAI — comparable multimodal capability, API-only, no weight access, different license profile entirely; the closed-model alternative when open-weight economics don't matter)

Under the Hood

Llama 4 Scout uses a mixture-of-experts transformer architecture with 16 experts, 17B active parameters per token, and 109B total parameters. Multimodality is implemented through what Meta calls "early fusion" — text and image tokens are integrated into a unified model rather than using separate encoders and decoders bolted together. The 10M-token context window was achieved through an interleaved attention architecture without positional embeddings and inference-time temperature scaling. Scout was trained from scratch (unlike Maverick, which was co-distilled from the in-training Llama 4 Behemoth model). Training data exceeded 30 trillion tokens, roughly double Llama 3's training set. Native language support includes Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. The model supports tool use, function calling, and JSON-mode structured output.

Cost

Self-hosted cost: $0.00 beyond compute
API input (per 1M tokens): $0.15
API output (per 1M tokens): $0.60
API providers: together, fireworks, openrouter, aws-bedrock
Notes: API pricing is surprisingly favorable given the model's scale — the MoE architecture means providers bill closer to 17B-model economics. Self-hosted is free beyond compute, but compute is significant (H100-class GPU at minimum for practical use).

Pricing data is 94 days old. Verify with the source before relying on it.

Hardware requirements

Min VRAM: 80 GB
Recommended VRAM: 80 GB
Runs on laptop: No
Notes: Meta positions Scout as "fits on a single H100" at Int4 quantization. That's still an ~$25K GPU or roughly $2-4/hour to rent. Not a consumer-hardware model. The MoE architecture means memory requirements are driven by the full expert count, not just active parameters.

Comparable models

Commercial-use conditions

Free for commercial use unless your product had more than 700 million monthly active users on the Llama 4 release date (April 5, 2025). ADDITIONAL restriction: the Llama 4 Acceptable Use Policy restricts use of Llama 4 models, and multimodal features specifically, within the European Union. EU-based businesses or those serving EU customers should verify current status with Meta before building on Llama 4's multimodal capabilities.