Verify critical details (pricing, licensing, availability) with the model's source before making business decisions.
DeepSeek-R1-0528-Qwen3-8B
distillation derivative of Qwen3-8B by DeepSeek
Distilled from the upgraded DeepSeek-R1-0528 by continuing post-training on the Qwen3-8B base using chain-of-thought generated by R1-0528 — transferring the newer R1's stronger reasoning into an 8B model. This is the second-wave R1 distill (May 2025), distinct from the January 2025 Qwen2.5- and Llama-based set.
- llm
- open-weight
- commercial-friendly
- small
- reasoning
- math
- on-device
- distillation
- china-based
- apache-2-0
Quick Take
A second-wave R1 distill: the upgraded R1-0528's reasoning compressed into an 8B Qwen3 model that posts benchmark scores rivaling models many times its size, under a clean MIT-over-Apache license.
Plain-English Description
When DeepSeek upgraded R1 to R1-0528 in May 2025, it released just one distillation alongside it: this 8B model, built by teaching Qwen3-8B the chain-of-thought of the much larger upgraded R1. It's the successor to the January 2025 distill family, and notably DeepSeek dropped the Llama-based variants this time, distilling only onto Qwen3 (whose clean Apache 2.0 license is friendlier than Llama's terms).
On paper it's remarkable: it improves on standard Qwen3-8B by about 10 points on the AIME 2024 math exam and reportedly matches Qwen3's 235-billion-parameter "thinking" model on that benchmark — extraordinary for a model you can run on a laptop. That makes it one of the most capable small reasoning models on benchmarks.
The honest caveat is that real-world reception has been more mixed than the headline numbers suggest. Like many reasoning distills, it generates very long chains of thought, and some users find its general usefulness narrower than the benchmarks imply. It's a reasoning specialist: strong on structured math and logic problems, less so as an all-purpose assistant.
Best For
- A small, self-hostable reasoning model for math and logic where benchmark-grade chain-of-thought matters.
- On-device or laptop reasoning at no per-token cost (runs in ~20GB, less when quantized).
- Research and experimentation with distilled reasoning at small scale.
- Fine-tuning a cleanly-licensed small reasoning model on your own data.
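The ~20GB figure in the list above can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates memory for the weights alone at a few precisions. It assumes a round 8 billion parameters and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name and the precision choices are illustrative and are not taken from DeepSeek's materials.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes.

    Excludes KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_param / 8 / 1e9


# Assumption: "8B" is treated as a round 8 billion parameters.
N_PARAMS = 8e9

# Illustrative precisions: 16-bit weights and two common quantized widths.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB for weights")
```

Under these assumptions the weights alone need roughly 16 GB at 16-bit precision and roughly 4 GB at 4-bit. That is consistent with a total footprint near 20GB unquantized once cache and overhead are added, and with a much smaller footprint for quantized builds.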
Not For
- General-purpose chat or writing — it's a narrow reasoning specialist that tends to over-explain.
- Workloads sensitive to long, verbose reasoning traces (and the tokens they consume).
- The strongest open reasoning overall — the larger distills and the full DeepSeek-R1 go further.
- Multimodal tasks (the model is text-only).
License — Plain-English Summary
Clean and permissive. DeepSeek released the distilled weights under MIT, and the Qwen3-8B base is Apache 2.0. Both layers allow commercial use, modification, fine-tuning, and redistribution with no royalties or user-count carve-outs; keep the respective notices. This is exactly the licensing advantage DeepSeek leaned into by distilling onto Qwen3 rather than Llama for this release: no community-license strings, just two permissive layers.
How It Compares
Against the January-wave DeepSeek-R1-Distill-Qwen-32B, this 8B model is far smaller and lighter, and on some math benchmarks it comes surprisingly close, but the 32B is the better-rounded performer. Against its base Qwen3-8B, it trades general versatility for a large jump in structured reasoning. Against the full DeepSeek-R1, it's the extreme-accessibility distill: a fraction of the size, runnable on consumer hardware, carrying a slice of R1-0528's reasoning.
Cost
- Self-hosted cost: $0.00 beyond compute
- Notes: Free to self-host; the distilled weights are MIT-licensed and the Qwen3-8B base is Apache 2.0. Widely available in GGUF/MLX quants for local runtimes. Also served by third-party hosts.
Commercial-use conditions
DeepSeek released the distilled weights under MIT; the base model (Qwen3-8B) is Apache 2.0. Both layers are permissive and allow commercial use, with no carve-outs — a contrast with the January 2025 Llama-based R1 distills.