Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →
DeepSeek-V4-Flash
Model family: deepseek-v4
- llm
- open-weight
- commercial-friendly
- frontier
- long-context
- reasoning
- coding
- china-based
- mixture-of-experts
- self-hostable
Quick Take
The practical V4: most of Pro's smarts at a fraction of the cost, small enough that a mid-size team can actually self-host it, and MIT-licensed.
Plain-English Description
DeepSeek-V4-Flash is the cost-optimized sibling of DeepSeek-V4-Pro. It shares the same underlying architecture and the same one-million-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run., but at 284 billion total parameters (with about 13 billion active per token) it is roughly a fifth the size of Pro. The result is a model that trades a little benchmark headroom for dramatically lower latency, dramatically lower cost, and — crucially — the ability to run it on hardware that doesn't require a data center.
For most teams evaluating DeepSeek, Flash is the model that actually matters. On the API it costs $0.14 per million input tokens and $0.28 per million output, which is roughly two orders of magnitude cheaper than the equivalent closed models, and prompt caching pushes the effective cost lower for repeated-context work. If you'd rather not use the hosted APIAccessing a model by sending requests to the creator's (or a provider's) servers, typically pay-per-use. Hosted APIs handle all the operational work — scaling, hardware, uptime — in exchange for a per-token or per-request fee. Every closed-API model is hosted; many open-weight models are also available via hosted APIs from providers like Together, Fireworks, or Groq. at all, community quantized builds let you run Flash on a two-GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models. workstation, keeping all your data in-house.
It's worth knowing that Flash is now the engine behind DeepSeek's legacy API names: the old deepseek-chat and deepseek-reasoner endpoints are compatibility aliases pointing at Flash's non-thinking and thinking modes, and those names retire in July 2026. If you have older DeepSeek integrations, they're already running on Flash whether you realized it or not.
Best For
- Production workloads that need strong, cheap, high-volume text generation — chat, summarization, drafting, classification.
- Agentic and coding tasks where Pro is overkill and the cost difference is what matters.
- Privacy-sensitive teams that want to self-host a capable model without buying a GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models. cluster.
- Long-document processing that benefits from the full million-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. window at a low price.
- Anyone migrating off the legacy
deepseek-chat/deepseek-reasonerendpoints before the July 2026 retirement.
Not For
- Squeezing out the absolute best coding/reasoning scores — that's DeepSeek-V4-Pro's job.
- Laptop or single-consumer-GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models. deployment — even quantized, 284B parameters need workstation-class hardware.
- Image, audio, or video tasks — Flash is text-only.
- Teams that can't route data to China and aren't ready to self-host or use a Western third-party host.
License — Plain-English Summary
V4 Flash is MIT-licensed, the same permissive terms as the rest of the V4 family: commercial use, modification, fine-tuning, and redistribution are all allowed, with the only requirement being to keep the copyright notice. As with every DeepSeek model, the license isn't the constraint — data routing is. Because Flash is genuinely self-hostable on attainable hardware, it's the easiest model in DeepSeek's lineup to use while keeping data entirely within your own environment.
How It Compares
Against DeepSeek-V4-Pro, Flash is the value pick: a noticeable step down in peak capability for a large step down in cost and hardware demand. Against DeepSeek-V3.2, the prior-generation workhorse, Flash offers a much larger context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. and the unified V4 architecture, and it's where DeepSeek's own API has migrated. Against Western cost-optimized models like the smaller GPT and Claude tiers, Flash is far cheaper and openly downloadable, with the familiar trade-off: you take on the data-governance question in exchange for the price and the open weightsThe numerical values inside a trained model that encode everything it has learned. A model is, functionally, a giant list of weights — tens of billions of numbers for a mid-sized model, hundreds of billions for a frontier model. "Open-weight" means those numbers are published. "Downloading the weights" means getting the actual file you'd need to run the model yourself.. Among open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. peers, it competes with the mid-size Qwen and Llama releases, generally winning on cost-per-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. while carrying the China-jurisdiction caveat those don't.
Cost
- Self-hosted cost
- $0.00 beyond compute
- API input (per 1M tokens)
- $0.14
- API output (per 1M tokens)
- $0.28
- API providers
- deepseek, openrouter, together, fireworks
- Notes
- First-party API cache-miss input is $0.14 and output $0.28 per million tokens; cache hits drop input to about $0.0028. The legacy deepseek-chat and deepseek-reasoner API names are compatibility aliases that now point to V4 Flash's non-thinking and thinking modes respectively, and retire on 2026-07-24. Self-hosting is free beyond compute.
Hardware requirements
- Min VRAM
- 48 GB
- Recommended VRAM
- 96 GB
- Runs on laptop
- No
- Notes
- Community GGUF builds run on a dual-48GB workstation, and aggressive quantization can fit it on a single 80GB data-center card. Not a laptop model, but far more attainable than V4 Pro.