Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →

Voxtral Small 24B

Model family: voxtral

Size

mid (24.0B params)

Context

32,768 tokens

Released

2025-07-14

Openness

open-weight

License

Apache 2.0 · commercial: yes

Cost tier

mixed

Rating

4.0 ★ — Strong speech-understanding capability with a unified audio+language backbone that lets it answer questions directly from audio rather than piping transcription through a separate LLM. Apache 2.0 licensing is the clear win over closed-API speech services. Half-star haircut reflects that most commercial speech workloads are better served by the smaller, transcription- optimized Voxtral Mini variants — Voxtral Small is for teams who specifically need the language-understanding capability on audio.

Modalities

audio-input, text

Capabilities

multilingual, speech-to-text, summarization, translation

Access

api-first-party, local-runtime-vllm, weights-download-direct, weights-download-hf

audio
speech-to-text
speech-understanding
audio-qa
multilingual
open-weight
commercial-friendly
apache-licensed
eu-based

Quick Take

Mistral's speech-understanding flagship — a 24B audio-text model that transcribes, translates, and directly answers questions from audio input. Apache 2.0.

Plain-English Description

Voxtral Small 24B is the original Voxtral model, released July 2025 as Mistral's entry into the speech-model space. What distinguishes it from standard speech-to-text systems like Whisper is that Voxtral isn't just a transcription engine — it's an audio-text language model. You can pipe raw audio in and get text out, but the "text out" can be a transcription, a translation, a summary, an answer to a question about the audio content, or any combination. The model combines a Whisper-derived audio encoder with a language-model decoder based on Mistral Small 3.1, meaning it retains general text-model capabilities and can be used as a drop-in replacement for Mistral Small 3.1 if you want a text-only chat model.

The practical implication of that unified architecture is that Voxtral Small can skip the two-step pipeline most speech-to-language workflows require. Traditional architecture: transcribe audio with Whisper, pipe the transcript into a separate LLM for understanding. With Voxtral Small, you feed the audio directly and the model answers questions about it, summarizes it, or translates it without the intermediate transcript step. This is useful for applications where the audio contains nuance — tone, speaker characteristics, or overlapping speech — that a transcript would lose.

Voxtral Small is positioned as the production-scale option; its smaller sibling Voxtral Mini 3B is the edge-deployment variant. On Mistral's API, transcription queries are routed to a transcribe-optimized version called Voxtral Mini Transcribe (a separate model, cataloged as a listing). Voxtral Small is primarily interesting for teams self-hosting because they need the larger language-model backbone for downstream audio reasoning. Apache 2.0 licensing means no conditions on commercial deployment.

Best For

Speech-understanding workflows that go beyond transcription. Summarizing meetings, answering questions about audio content, extracting structured information from recordings. The unified audio-text backbone is purpose-built for this.
Multilingual speech translation. Voxtral achieves state-of-the-art on FLEURS-Translation benchmark. Accepts audio in one language and produces text output in another.
Teams who need open weights for audio understanding. Most speech-understanding pipelines today are closed (OpenAI Whisper is open-weight but limited to transcription; audio-capable closed models like GPT-4o Audio and Gemini aren't open-weight). Voxtral Small is a rare open-weight full-stack audio-understanding option.
Self-hosted deployments in regulated industries. Healthcare, legal, and financial services where audio data sensitivity rules out closed-API routing. Voxtral Small on an H100 in a private cloud is a clean deployment path.

Not For

Cost-optimized transcription-only workflows. For straightforward speech-to-text without deeper understanding, Voxtral Mini Transcribe V2 is smaller, faster, and cheaper through Mistral's API at $0.003/min. Using Voxtral Small for pure transcription is overkill.
Edge and on-device deployments. The 24B parameter count means this isn't laptop-deployable. For edge audio, Voxtral Mini 4B Realtime is purpose-built.
Real-time streaming transcription. Voxtral Small is a batch-mode model; it processes complete audio segments rather than streaming. For live transcription, Voxtral Mini 4B Realtime's streaming architecture is the right choice.
Text-to-speech workflows. Voxtral Small does speech-to-text only. For TTS, see Voxtral TTS (separate model, different license).

License — Plain-English Summary

Apache 2.0. Commercial use unrestricted, modifications allowed, redistribution allowed. Include the license file. No conditions. This is the standard Mistral open-weight posture — Voxtral Small predates Voxtral TTS and ships under clean Apache licensing, not the CC BY-NC 4.0 that the TTS model uses.

How It Compares

vs. OpenAI Whisper large-v3 — Voxtral Small comprehensively outperforms Whisper on Mistral's own benchmarks, beating it across short-form and long-form English as well as multilingual tasks. Voxtral also handles speech understanding (not just transcription), which Whisper doesn't do. Whisper is smaller and more ecosystem-integrated; Voxtral Small is more capable on downstream audio reasoning.
vs. Voxtral Mini Transcribe V2 — Mini Transcribe is transcription-optimized and smaller; Voxtral Small has the larger language backbone and can answer questions about audio. Use Mini Transcribe when you need fast, cheap transcription; use Voxtral Small when you need the model to reason over audio content.
vs. GPT-4o Audio / Gemini Audio — Similar tier of capability (audio-understanding multimodal models) but Voxtral Small is open-weight under Apache 2.0 while the alternatives are closed-API. For self-hosted or private deployments, Voxtral Small is the practical choice.
vs. ElevenLabs Scribe — Mistral reports Voxtral Small matches ElevenLabs Scribe's transcription performance for less than half the cost via API. Scribe has deeper voice-cloning integration; Voxtral has broader understanding capabilities.

Under the Hood

Voxtral Small's architecture combines a Whisper-derived audio encoder with a Ministral-derived decoder at 24B parameters total. The audio encoder retains Whisper's fundamental structure but the decoder is replaced with Mistral's own language model stack, which is what enables the understanding capabilities — the decoder inherits text-model reasoning skills that Whisper's original decoder didn't have. A consequence of the Whisper-derived encoder is a 30-second input-chunk requirement inherited from Whisper; audio shorter than 30 seconds must be padded to the full length.

Voxtral Small and Voxtral Mini (the 3B sibling) share the fundamental architecture and training approach. Both support function calling, structured output, and drop-in substitution for their corresponding text-only Mistral base models.

The model was updated December 20, 2025 with the release of the broader Mistral 3 generation; the current Hugging Face checkpoint is mistralai/Voxtral-Small-24B-2507 with December 2025 updates applied in-place to the same checkpoint.

Cost

Self-hosted cost: $0.00 beyond compute
Notes: Mistral's hosted API routes to the smaller Voxtral Mini Transcribe variants for cost efficiency; Voxtral Small 24B is primarily intended for self-hosted deployment when you need its larger language-model backbone for downstream audio reasoning. Pricing through the API follows standard Mistral audio-endpoint rates.

Pricing data is 88 days old. Verify with the source before relying on it.

Hardware requirements

Min VRAM: 48 GB
Recommended VRAM: 80 GB
Runs on laptop: No
Notes: BF16 precision needs ~48GB VRAM (single RTX 6000 Ada or 2×RTX 4090). Recommended deployment: single H100 or A100 80GB for comfortable throughput. Not laptop-feasible at useful quantization.