← Back to hard AIs

Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →

Models

Meta

4.5 ★ — Massive ecosystem impact, genuinely permissive license for the vast majority of businesses, and a steady release cadence — but the 700M MAU clause and the EU restrictions on multimodal models are real catches worth knowing.

Type
big-tech-lab
Country
US
Founded
2004
License posture
predominantly-open-weight
Website

Quick Take

Meta is the biggest force in open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. AI — their Llama family is the most-used foundation for fine-tunes and derivatives on the planet, with a license that works for most businesses.

Who They Are

Meta is the social-media giant behind Facebook, Instagram, and WhatsApp — and, less obviously to the general public, one of the two or three most influential AI research organizations in the world. Their AI work runs through two connected groups: the Fundamental AI Research lab (FAIR), which dates back to 2013 and publishes academic research, and the product-focused Meta AI group that ships the Llama models and the Meta AI assistant embedded across their apps.

When it comes to AI models specifically, Meta has done something that almost none of their peers have: they release the actual trained weightsThe numerical values inside a trained model that encode everything it has learned. A model is, functionally, a giant list of weights — tens of billions of numbers for a mid-sized model, hundreds of billions for a frontier model. "Open-weight" means those numbers are published. "Downloading the weights" means getting the actual file you'd need to run the model yourself. of their foundation models to the public, for free, under a license that allows most commercial use. That single decision has shaped the entire open-sourceA stricter standard than open-weight: the weights, the training code, and the training data are all released publicly. Very few large language models meet the full open-source bar — most "open" models in the AI world are actually open-weight. When in doubt, check the license file and the creator's documentation. AI landscape. The majority of open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. fine-tunes, specialized models, and locally-runnable chatbots you'll encounter anywhere — on Hugging Face, in Ollama, in LM Studio — are descended from Meta's Llama family. When a research lab, a startup, or an individual developer wants to build their own AI without paying per-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. fees to OpenAI or Anthropic, Llama is almost always where they start.

Meta's AI strategy is different from their peers in a way worth understanding. Google has Gemini (closed API) and Gemma (open-weight) as two tracks. OpenAI and Anthropic are primarily closed API shops. Meta has bet almost everything on open-weight — they ship the closed-model experience through their consumer products (Meta AI in WhatsApp, Instagram, etc.) but the models themselves go out to the public.

Model Philosophy

Meta's own framing is that opening up the weightsThe numerical values inside a trained model that encode everything it has learned. A model is, functionally, a giant list of weights — tens of billions of numbers for a mid-sized model, hundreds of billions for a frontier model. "Open-weight" means those numbers are published. "Downloading the weights" means getting the actual file you'd need to run the model yourself. accelerates AI progress broadly and builds an ecosystem Meta benefits from even when they're not the one charging for inferenceRunning a model to get outputs — as opposed to training it. When you send a prompt to ChatGPT, that's inference. Inference is much cheaper than training per operation but adds up quickly at scale. Pricing pages almost always refer to inference costs (per million tokens, per request, etc.), not training costs.. The more developers, researchers, and businesses standardize on Llama as their foundation, the more Meta's technical decisions become industry defaults. That's a sound long-term bet and it's why Llama keeps shipping at a steady cadence even when the economics of giving away billion-dollar training runs look strange from the outside.

What you'll find in practice: Llama models are released quickly after announcement, with weights on Hugging Face and Meta's own download site. Licenses are permissive but not unconditional (see below). Each major version introduces real improvements rather than tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. refreshes — 3.1 added the 128K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. and tool use, 3.2 added vision and small on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. models, 3.3 brought 405B-class performance into a 70B model, and Llama 4 brought mixture-of-experts architecture and native multimodality. Behemoth, the largest Llama 4 model, remains in training as of this writing.

What To Know Before You Commit

Three things matter if you're thinking about building on Llama for a business.

The 700-million-user clause. The Llama Community License lets you use Llama commercially for free — unless, on the day the specific Llama version released, your product had more than 700 million monthly active users. In that case, you need a separate license from Meta. For virtually every small business, agency, startup, and mid-sized enterprise, this clause is irrelevant. For Apple, Google, Amazon, ByteDance, Tencent, and a handful of others, it matters. If you might be close to that threshold, get a lawyer. For everyone else, treat it as permissive commercial use.

The "trained on Llama outputs" clause. You cannot use Llama or its outputs to train another foundation model that isn't a Llama derivative. Meaning: you can fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. Llama, you can distill from Llama, you can generate synthetic data with Llama to train other Llama models, but you cannot use Llama to train a from-scratch competitor. Again, this affects essentially nobody outside of other AI labs.

The EU multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. restriction (Llama 4 specifically). Llama 4's Acceptable Use Policy restricts use of its multimodal capabilities in the European Union due to EU AI Act compliance concerns. Text-only use appears unaffected. If your business is EU-based or serves EU customers and you need the multimodal features, verify current status with Meta before building on Llama 4 — this one is still evolving.

Beyond those three, Llama is among the most business-friendly open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. licenses in existence. You can modify the models, redistribute them, fine-tune them for specific domains, and build commercial products on top of them, provided you include the license file and display "Built with Llama" in your product.

Original Models

Safeguards

Meta's multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. safety classifier for LLM applications. Screens prompts and responses (text and images) against the MLCommons hazards taxonomy. Replaces Llama Guard 3 8B and Llama Guard 3 11B-Vision with a single model.

Llama Prompt Guard

Meta's smallest prompt-injection detector — 22M parameters, sub-millisecond CPU inferenceRunning a model to get outputs — as opposed to training it. When you send a prompt to ChatGPT, that's inference. Inference is much cheaper than training per operation but adds up quickly at scale. Pricing pages almost always refer to inference costs (per million tokens, per request, etc.), not training costs., English-only. Targeted at high-throughput pipelines where the 86M variant's latency or cost is prohibitive.

Meta's prompt-injection detector — 86M parameter multilingual classifier built on mDeBERTa (not Llama architecture). Labels incoming prompts as benign or malicious to catch jailbreak attempts before they reach your primary LLM. Runs on CPU.

Llama 4

Base pretrained variant of Meta's Llama 4 Maverick. Frontier-class MoEA model architecture that splits the model into many smaller specialized "expert" networks, only activating a handful per input rather than running the whole model every time. The practical effect: you get the knowledge capacity of a big model with the compute cost of a much smaller one. Mistral Large 3 and Mistral Small 4 are both MoE models. with 17B active parametersIn a Mixture of Experts model, the number of parameters that actually run for any given input, as opposed to the total parameter count that's stored. Mistral Large 3, for example, has 675B total parameters but only 41B active per query — meaning it runs at roughly the cost of a 41B dense model while drawing on 675B worth of knowledge. out of 400B total, 128 experts, natively multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default., 1M context — designed for fine-tuning at scale rather than direct chat use.

Instruction-tuned variant of Meta's Llama 4 Maverick — Meta's largest open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. release. 17B active / 400B total MoEA model architecture that splits the model into many smaller specialized "expert" networks, only activating a handful per input rather than running the whole model every time. The practical effect: you get the knowledge capacity of a big model with the compute cost of a much smaller one. Mistral Large 3 and Mistral Small 4 are both MoE models. with 128 experts, natively multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default., 1M context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run.. Scores 80.5 on MMLUA broad knowledge test covering 57 subjects from law and medicine to mathematics and history. Scores are reported as percentage correct. A score around 85% is strong for a frontier model; above 90% is state-of-the-art. MMLU is probably the most-cited benchmark in AI model comparisons, though it has known weaknesses — models can memorize the questions, and the test reflects a specific cultural and academic context. Pro.

Base pretrained variant of Meta's Llama 4 Scout. Mixture-of-Experts architecture with 17B active parametersIn a Mixture of Experts model, the number of parameters that actually run for any given input, as opposed to the total parameter count that's stored. Mistral Large 3, for example, has 675B total parameters but only 41B active per query — meaning it runs at roughly the cost of a 41B dense model while drawing on 675B worth of knowledge. out of 109B total, natively multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default., designed for fine-tuning rather than direct chat use.

Meta's first mixture-of-experts and first natively multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. model, with a 10-million-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. and an EU restriction business owners need to know about.

Llama 3 3

Meta's efficiency milestone — a 70-billion-parameter model that Meta claims matches their much larger 405B model on most tasks, at a fraction of the cost to run.

Llama 3 2 Vision

Meta's 11B Llama 3.2 Vision base modelA model straight out of pretraining, before any fine-tuning for chat or specific tasks. Base models predict the next token but don't follow instructions well — they'll continue your prompt rather than respond to it. Most people never use base models directly; they use the instruct-tuned or chat versions built on top. Useful mostly for researchers and people doing their own fine-tuning.multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. (text + image input) for research and fine-tuning. EU-domiciled entities are excluded from multimodal license rights. Most business use cases want the Instruct variant.

Meta's 11B Llama 3.2 Vision chat modelShorthand for an instruct-tuned model specifically designed for back-and-forth conversation rather than single-shot tasks. Chat models remember earlier turns in the conversation (within the context window) and respond in a conversational register. GPT-4, Claude, and most Llama Instruct variants are chat models. In practice, "chat model" and "instruct-tuned model" often mean the same thing.multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default., handles text and image input for visual reasoning, document analysis, and chart reading. EU-domiciled entities are excluded from multimodal license rights.

Meta's 90B Llama 3.2 Vision base modelA model straight out of pretraining, before any fine-tuning for chat or specific tasks. Base models predict the next token but don't follow instructions well — they'll continue your prompt rather than respond to it. Most people never use base models directly; they use the instruct-tuned or chat versions built on top. Useful mostly for researchers and people doing their own fine-tuning.multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. flagship of the 3.2 generation, for research and fine-tuning. EU-domiciled entities are excluded from multimodal license rights. Most business use cases want the Instruct variant.

Meta's 90B Llama 3.2 Vision chat modelShorthand for an instruct-tuned model specifically designed for back-and-forth conversation rather than single-shot tasks. Chat models remember earlier turns in the conversation (within the context window) and respond in a conversational register. GPT-4, Claude, and most Llama Instruct variants are chat models. In practice, "chat model" and "instruct-tuned model" often mean the same thing.multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. flagship of the 3.2 generation, strong at complex visual reasoning and document analysis. EU-domiciled entities are excluded from multimodal license rights.

Llama 3 2

Base pretrained variant of Meta's smallest Llama 3.2 model. Text-only, 1.23B parameters, designed for on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. and edge deployment. For fine-tuning rather than direct chat use.

Meta's smallest instruction-tuned Llama 3.2 model. 1.23B parameters, text-only, designed for on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. chat, summarization, and prompt rewriting on phones or edge hardware. Commercial use allowed below the 700M MAU threshold.

Base pretrained variant of Meta's Llama 3.2 3B. Text-only, 3.21B parameters, designed for edge deployment and as a starting point for domain-specific fine-tuning.

Meta's Llama 3.2 3B Instruct — small instruction-tuned text model designed for mobile and edge AI assistants, with summarization, tool use, and multilingual dialogue capabilities. Commercial use allowed below the 700M MAU threshold.

Llama 3 1

Meta's Llama 3.1 405B base modelA model straight out of pretraining, before any fine-tuning for chat or specific tasks. Base models predict the next token but don't follow instructions well — they'll continue your prompt rather than respond to it. Most people never use base models directly; they use the instruct-tuned or chat versions built on top. Useful mostly for researchers and people doing their own fine-tuning. — the largest open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. dense modelA model where every parameter is used for every input — the entire model runs on every token. Contrast with sparse or Mixture of Experts models, which activate only a fraction of the model per input. Dense models are simpler and more predictable; MoE models are more efficient at scale. ever released at the time. Base variant for research and large-scale fine-tuning; production deployments use the Instruct variant via hosted APIs.

Meta's Llama 3.1 405B chat modelShorthand for an instruct-tuned model specifically designed for back-and-forth conversation rather than single-shot tasks. Chat models remember earlier turns in the conversation (within the context window) and respond in a conversational register. GPT-4, Claude, and most Llama Instruct variants are chat models. In practice, "chat model" and "instruct-tuned model" often mean the same thing. — frontier-scale dense modelA model where every parameter is used for every input — the entire model runs on every token. Contrast with sparse or Mixture of Experts models, which activate only a fraction of the model per input. Dense models are simpler and more predictable; MoE models are more efficient at scale. with 128K context and strong multilingual reasoning. Practical access is through hosted APIAccessing a model by sending requests to the creator's (or a provider's) servers, typically pay-per-use. Hosted APIs handle all the operational work — scaling, hardware, uptime — in exchange for a per-token or per-request fee. Every closed-API model is hosted; many open-weight models are also available via hosted APIs from providers like Together, Fireworks, or Groq. providers; self-hosting requires a multi-GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models. server cluster.

Llama 3.1 70B — the 70B Llama 3.1 base, widely used as a fine-tuning foundation (e.g. for Hermes).

Meta's Llama 3.1 8B base modelA model straight out of pretraining, before any fine-tuning for chat or specific tasks. Base models predict the next token but don't follow instructions well — they'll continue your prompt rather than respond to it. Most people never use base models directly; they use the instruct-tuned or chat versions built on top. Useful mostly for researchers and people doing their own fine-tuning. — 8B parameters, 128K context, multilingual. Base variant for fine-tuning; the Instruct variant has a full catalog entry.

Meta's small-but-serious open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. model — fast, multilingual, and runs on a decent laptop with quantizationCompressing a model by reducing the numerical precision of its stored weights — for example, from 16-bit numbers to 4-bit numbers. The compressed model uses roughly a quarter of the memory and runs faster on most hardware, at the cost of slight accuracy loss. Quantization is what makes big models runnable on laptops — a 70B model in 4-bit quantization can fit on hardware that couldn't load the full-precision version., with a commercial license that works for almost any business.

Sources