Verify critical details (pricing, licensing, availability) with the model's source before making business decisions.

DeepSeek-V4-Flash

Model family: deepseek-v4

Size

frontier (284.0B params)

Context

1,048,576 tokens

Released

2026-04-23

Openness

open-weight

License

MIT License · commercial: yes

Cost tier

mixed

Rating

4.5 ★ — The accessibility sweet spot of the V4 family — most of the capability, a price that undercuts everyone, an MIT license, and small enough to actually self-host. Held off 5 only by preview status and the China-hosting question on the first-party API.

Modalities

text

Capabilities

chat, coding, function-calling, instruction-following, long-context, reasoning, tool-use

Access

api-first-party, api-third-party, local-runtime-llama-cpp, local-runtime-vllm, weights-download-hf

llm
open-weight
commercial-friendly
frontier
long-context
reasoning
coding
china-based
mixture-of-experts
self-hostable

Quick Take

The practical V4: most of Pro's smarts at a fraction of the cost, small enough that a mid-size team can actually self-host it, and MIT-licensed.

Plain-English Description

DeepSeek-V4-Flash is the cost-optimized sibling of DeepSeek-V4-Pro. It shares the same underlying architecture and the same one-million-token context window, but at 284 billion total parameters (with about 13 billion active per token) it is roughly a fifth the size of Pro. The result is a model that trades a little benchmark headroom for much lower latency, much lower cost, and, crucially, the ability to run on hardware that doesn't require a data center.

For most teams evaluating DeepSeek, Flash is the model that actually matters. On the API it costs $0.14 per million input tokens and $0.28 per million output, which is roughly two orders of magnitude cheaper than the equivalent closed models, and prompt caching pushes the effective cost lower for repeated-context work. If you'd rather not use the hosted API at all, community quantized builds let you run Flash on a two-GPU workstation, keeping all your data in-house.

It's worth knowing that Flash is now the engine behind DeepSeek's legacy API names: the old deepseek-chat and deepseek-reasoner endpoints are compatibility aliases pointing at Flash's non-thinking and thinking modes, and those names retire in July 2026. If you have older DeepSeek integrations, they're already running on Flash whether you realized it or not.
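
For teams auditing older integrations, the alias behavior described above can be sketched as a small lookup. This is a minimal illustration, not DeepSeek's API: the target identifier `deepseek-v4-flash`, the `thinking` flag, and the `resolve_model` helper are hypothetical names chosen for this example, and the retirement date is the 2026-07-24 figure given in the Cost notes on this page. Check the provider's documentation for the real model identifiers.

```python
from datetime import date

# Retirement date for the legacy names, as stated in the Cost notes on this page.
LEGACY_RETIREMENT = date(2026, 7, 24)

# Hypothetical identifier for illustration only; the real model ID may differ.
FLASH_MODEL = "deepseek-v4-flash"

# Legacy name -> the Flash mode it is described as pointing at.
LEGACY_ALIASES = {
    "deepseek-chat": {"model": FLASH_MODEL, "thinking": False},
    "deepseek-reasoner": {"model": FLASH_MODEL, "thinking": True},
}


def resolve_model(name, today=None):
    """Report what a configured model name resolves to and how long it has left."""
    today = today or date.today()
    target = LEGACY_ALIASES.get(name)
    if target is None:
        # Not a legacy alias: pass the name through unchanged.
        return {"model": name, "thinking": None, "legacy": False, "days_left": None}
    days_left = (LEGACY_RETIREMENT - today).days
    return {**target, "legacy": True, "days_left": days_left}


if __name__ == "__main__":
    for configured in ("deepseek-chat", "deepseek-reasoner", "some-other-model"):
        print(configured, "->", resolve_model(configured))
```

Running a check like this over your configuration files is one way to list which services still depend on the legacy names before they retire.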

Best For

Production workloads that need strong, cheap, high-volume text generation — chat, summarization, drafting, classification.
Agentic and coding tasks where Pro is overkill and the cost difference is what matters.
Privacy-sensitive teams that want to self-host a capable model without buying a GPU cluster.
Long-document processing that benefits from the full million-token window at a low price.
Anyone migrating off the legacy deepseek-chat/deepseek-reasoner endpoints before the July 2026 retirement.

Not For

Squeezing out the absolute best coding/reasoning scores — that's DeepSeek-V4-Pro's job.
Laptop or single-consumer-GPU deployment — even quantized, 284B parameters need workstation-class hardware.
Image, audio, or video tasks — Flash is text-only.
Teams that can't route data to China and aren't ready to self-host or use a Western third-party host.

License — Plain-English Summary

V4 Flash is MIT-licensed, the same permissive terms as the rest of the V4 family: commercial use, modification, fine-tuning, and redistribution are all allowed, with the only requirement being to keep the copyright notice. As with every DeepSeek model, the license isn't the constraint — data routing is. Because Flash is genuinely self-hostable on attainable hardware, it's the easiest model in DeepSeek's lineup to use while keeping data entirely within your own environment.

How It Compares

Against DeepSeek-V4-Pro, Flash is the value pick: a noticeable step down in peak capability for a large step down in cost and hardware demand. Against DeepSeek-V3.2, the prior-generation workhorse, Flash offers a much larger context window and the unified V4 architecture, and it's where DeepSeek's own API has migrated. Against Western cost-optimized models like the smaller GPT and Claude tiers, Flash is far cheaper and openly downloadable, with the familiar trade-off: you take on the data-governance question in exchange for the price and the open weights. Among open-weight peers, it competes with the mid-size Qwen and Llama releases, generally winning on cost-per-token while carrying the China-jurisdiction caveat those don't.

Cost

Self-hosted cost: $0.00 beyond compute
API input (per 1M tokens): $0.14
API output (per 1M tokens): $0.28
API providers: deepseek, openrouter, together, fireworks
Notes: First-party API cache-miss input is $0.14 and output $0.28 per million tokens; cache hits drop input to about $0.0028. The legacy deepseek-chat and deepseek-reasoner API names are compatibility aliases that now point to V4 Flash's non-thinking and thinking modes respectively, and retire on 2026-07-24. Self-hosting is free beyond compute.
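
To see how the cache-hit price changes a bill, here is a rough estimator built only from the per-million-token prices listed above. The function name and the `cache_hit_ratio` parameter are assumptions made for this sketch; real invoices depend on how the provider counts cached tokens, so treat the result as an approximation.

```python
# Prices in USD per 1M tokens, copied from the Cost section on this page.
INPUT_CACHE_MISS = 0.14
INPUT_CACHE_HIT = 0.0028
OUTPUT = 0.28


def estimate_cost(input_tokens, output_tokens, cache_hit_ratio=0.0):
    """Estimate first-party API cost in USD.

    cache_hit_ratio is the assumed fraction of input tokens served from
    the prompt cache (0.0 = no caching, 1.0 = everything cached).
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (
        hit_tokens * INPUT_CACHE_HIT
        + miss_tokens * INPUT_CACHE_MISS
        + output_tokens * OUTPUT
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Example workload: 50M input tokens and 5M output tokens per month.
    for ratio in (0.0, 0.5, 0.9):
        cost = estimate_cost(50_000_000, 5_000_000, ratio)
        print(f"cache hit ratio {ratio:.0%}: ${cost:.2f}")
```

Because output tokens are not discounted by caching, the savings are largest for workloads with long, repeated prompts and short responses.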

Hardware requirements

Min VRAM: 48 GB
Recommended VRAM: 96 GB
Runs on laptop: No
Notes: Community GGUF builds run on a dual-48GB workstation, and aggressive quantization can fit it on a single 80GB data-center card. Not a laptop model, but far more attainable than V4 Pro.
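
As a sanity check on these figures, the sketch below estimates how much memory the weights alone occupy at a given quantization level, using the 284B total parameter count from this page. It is a lower bound under simplifying assumptions: it ignores the KV cache, activations, and runtime overhead, uses 1 GB = 1e9 bytes, and treats bits per weight as a single average. The function names are illustrative, and real GGUF builds mix precisions across layers.

```python
TOTAL_PARAMS_B = 284.0  # total parameters in billions, from this page


def weights_gb(params_b, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_b * bits_per_weight / 8.0


def max_bits_per_weight(vram_gb, params_b=TOTAL_PARAMS_B):
    """Largest average bits per weight whose weights alone fit in vram_gb."""
    return vram_gb * 8.0 / params_b


if __name__ == "__main__":
    for bits in (8, 4, 2):
        size = weights_gb(TOTAL_PARAMS_B, bits)
        print(f"{bits}-bit weights: ~{size:.0f} GB")
    for vram in (80, 96):
        bits = max_bits_per_weight(vram)
        print(f"{vram} GB budget: at most ~{bits:.2f} bits per weight")
```

By this arithmetic, 4-bit weights come to about 142 GB, so fitting the model into 96 GB or 80 GB implies an average well below 4 bits per weight, or moving part of the model to system memory. Measure on your own hardware before committing to a configuration.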