← Back to hard AIs

Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →

Models · Mistral AI

Voxtral Small 24B

Model family: voxtral

Size
mid (24.0B params)
Context
32,768 tokens
Released
2025-07-14
Openness
open-weight
License
Apache 2.0 · commercial: yes
Cost tier
mixed
Rating
4.0 — Strong speech-understanding capability with a unified audio+language backbone that lets it answer questions directly from audio rather than piping transcription through a separate LLM. Apache 2.0 licensing is the clear win over closed-API speech services. Half-star haircut reflects that most commercial speech workloads are better served by the smaller, transcription- optimized Voxtral Mini variants — Voxtral Small is for teams who specifically need the language-understanding capability on audio.
Modalities
audio-input, text
Capabilities
multilingual, speech-to-text, summarization, translation
Access
api-first-party, local-runtime-vllm, weights-download-direct, weights-download-hf

Quick Take

Mistral's speech-understanding flagship — a 24B audio-text model that transcribes, translates, and directly answers questions from audio input. Apache 2.0.

Plain-English Description

Voxtral Small 24B is the original Voxtral model, released July 2025 as Mistral's entry into the speech-model space. What distinguishes it from standard speech-to-text systems like Whisper is that Voxtral isn't just a transcription engine — it's an audio-text language model. You can pipe raw audio in and get text out, but the "text out" can be a transcription, a translation, a summary, an answer to a question about the audio content, or any combination. The model combines a Whisper-derived audio encoderThe part of a model that reads input and converts it into an internal numerical representation the model can work with. In a translation model, the encoder reads the English sentence; the decoder produces the French. Modern chat models like GPT and Llama are "decoder-only" — they skip the separate encoder step. with a language-model decoderThe part of a model that generates output, one token at a time, from an internal representation. Chat models are almost all decoder-only architectures — they take your prompt, process it, and stream out a response token by token. "Decoder-only" is the technical name for the family most people just call "chatbots." based on Mistral Small 3.1, meaning it retains general text-model capabilities and can be used as a drop-in replacement for Mistral Small 3.1 if you want a text-only chat modelShorthand for an instruct-tuned model specifically designed for back-and-forth conversation rather than single-shot tasks. Chat models remember earlier turns in the conversation (within the context window) and respond in a conversational register. GPT-4, Claude, and most Llama Instruct variants are chat models. In practice, "chat model" and "instruct-tuned model" often mean the same thing..

The practical implication of that unified architecture is that Voxtral Small can skip the two-step pipeline most speech-to-language workflows require. Traditional architecture: transcribe audio with Whisper, pipe the transcript into a separate LLM for understanding. With Voxtral Small, you feed the audio directly and the model answers questions about it, summarizes it, or translates it without the intermediate transcript step. This is useful for applications where the audio contains nuance — tone, speaker characteristics, or overlapping speech — that a transcript would lose.

Voxtral Small is positioned as the production-scale option; its smaller sibling Voxtral Mini 3B is the edge-deployment variant. On Mistral's API, transcription queries are routed to a transcribe-optimized version called Voxtral Mini Transcribe (a separate model, cataloged as a listing). Voxtral Small is primarily interesting for teams self-hosting because they need the larger language-model backbone for downstream audio reasoning. Apache 2.0 licensing means no conditions on commercial deployment.

Best For

  • Speech-understanding workflows that go beyond transcription. Summarizing meetings, answering questions about audio content, extracting structured information from recordings. The unified audio-text backbone is purpose-built for this.
  • Multilingual speech translation. Voxtral achieves state-of-the-art on FLEURS-Translation benchmark. Accepts audio in one language and produces text output in another.
  • Teams who need open weightsThe numerical values inside a trained model that encode everything it has learned. A model is, functionally, a giant list of weights — tens of billions of numbers for a mid-sized model, hundreds of billions for a frontier model. "Open-weight" means those numbers are published. "Downloading the weights" means getting the actual file you'd need to run the model yourself. for audio understanding. Most speech-understanding pipelines today are closed (OpenAI Whisper is open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. but limited to transcription; audio-capable closed models like GPT-4o Audio and Gemini aren't open-weight). Voxtral Small is a rare open-weight full-stack audio-understanding option.
  • Self-hostedRunning a model on hardware you control — your own servers, your own cloud instance, or your own laptop — rather than paying to access it through someone else's API. Self-hosting gives you full control over data and predictable costs, but requires the hardware and operational effort to run the model. Only possible with open-weight models. deployments in regulated industries. Healthcare, legal, and financial services where audio data sensitivity rules out closed-APIA model that's only accessible through the creator's own API or product — you can't download it, run it yourself, or inspect its weights. GPT-4, Claude, and Gemini Pro are closed-API models. The tradeoff is convenience and often capability (closed-API models are frequently the strongest) versus loss of control over data, pricing, and availability. routing. Voxtral Small on an H100 in a private cloud is a clean deployment path.

Not For

  • Cost-optimized transcription-only workflows. For straightforward speech-to-text without deeper understanding, Voxtral Mini Transcribe V2 is smaller, faster, and cheaper through Mistral's API at $0.003/min. Using Voxtral Small for pure transcription is overkill.
  • Edge and on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. deployments. The 24B parameter countA rough measure of a model's size. More parameters generally mean more capability but also more compute and memory to run. Small models are under 10 billion parameters; frontier models can exceed 500 billion. Often written as "7B" (7 billion) or "70B" (70 billion). For MoE models, look for both total parameters and active parameters — they measure different things. means this isn't laptop-deployable. For edge audio, Voxtral Mini 4B Realtime is purpose-built.
  • Real-time streaming transcription. Voxtral Small is a batch-mode model; it processes complete audio segments rather than streaming. For live transcription, Voxtral Mini 4B Realtime's streaming architecture is the right choice.
  • Text-to-speech workflows. Voxtral Small does speech-to-text only. For TTS, see Voxtral TTS (separate model, different license).

License — Plain-English Summary

Apache 2.0. Commercial use unrestricted, modifications allowed, redistribution allowed. Include the license file. No conditions. This is the standard Mistral open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. posture — Voxtral Small predates Voxtral TTS and ships under clean Apache licensing, not the CC BY-NC 4.0 that the TTS model uses.

How It Compares

  • vs. OpenAI Whisper large-v3 — Voxtral Small comprehensively outperforms Whisper on Mistral's own benchmarks, beating it across short-form and long-form English as well as multilingual tasks. Voxtral also handles speech understanding (not just transcription), which Whisper doesn't do. Whisper is smaller and more ecosystem-integrated; Voxtral Small is more capable on downstream audio reasoning.
  • vs. Voxtral Mini Transcribe V2 — Mini Transcribe is transcription-optimized and smaller; Voxtral Small has the larger language backbone and can answer questions about audio. Use Mini Transcribe when you need fast, cheap transcription; use Voxtral Small when you need the model to reason over audio content.
  • vs. GPT-4o Audio / Gemini Audio — Similar tier of capability (audio-understanding multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. models) but Voxtral Small is open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. under Apache 2.0 while the alternatives are closed-APIA model that's only accessible through the creator's own API or product — you can't download it, run it yourself, or inspect its weights. GPT-4, Claude, and Gemini Pro are closed-API models. The tradeoff is convenience and often capability (closed-API models are frequently the strongest) versus loss of control over data, pricing, and availability.. For self-hostedRunning a model on hardware you control — your own servers, your own cloud instance, or your own laptop — rather than paying to access it through someone else's API. Self-hosting gives you full control over data and predictable costs, but requires the hardware and operational effort to run the model. Only possible with open-weight models. or private deployments, Voxtral Small is the practical choice.
  • vs. ElevenLabs Scribe — Mistral reports Voxtral Small matches ElevenLabs Scribe's transcription performance for less than half the cost via API. Scribe has deeper voice-cloning integration; Voxtral has broader understanding capabilities.

Under the Hood

Voxtral Small's architecture combines a Whisper-derived audio encoderThe part of a model that reads input and converts it into an internal numerical representation the model can work with. In a translation model, the encoder reads the English sentence; the decoder produces the French. Modern chat models like GPT and Llama are "decoder-only" — they skip the separate encoder step. with a Ministral-derived decoderThe part of a model that generates output, one token at a time, from an internal representation. Chat models are almost all decoder-only architectures — they take your prompt, process it, and stream out a response token by token. "Decoder-only" is the technical name for the family most people just call "chatbots." at 24B parameters total. The audio encoder retains Whisper's fundamental structure but the decoder is replaced with Mistral's own language model stack, which is what enables the understanding capabilities — the decoder inherits text-model reasoning skills that Whisper's original decoder didn't have. A consequence of the Whisper-derived encoder is a 30-second input-chunk requirement inherited from Whisper; audio shorter than 30 seconds must be padded to the full length.

Voxtral Small and Voxtral Mini (the 3B sibling) share the fundamental architecture and training approach. Both support function calling, structured output, and drop-in substitution for their corresponding text-only Mistral base models.

The model was updated December 20, 2025 with the release of the broader Mistral 3 generation; the current Hugging Face checkpointA specific saved version of a model at a particular point in training. When a creator releases "Llama 3.1 8B Instruct," they're releasing a checkpoint — a frozen snapshot of the model as it existed at the end of training. Most models ship only a single public checkpoint; some creators release multiple (base, instruct, reasoning variants of the same underlying model). is mistralai/Voxtral-Small-24B-2507 with December 2025 updates applied in-place to the same checkpoint.

Cost

Self-hosted cost
$0.00 beyond compute
Notes
Mistral's hosted API routes to the smaller Voxtral Mini Transcribe variants for cost efficiency; Voxtral Small 24B is primarily intended for self-hosted deployment when you need its larger language-model backbone for downstream audio reasoning. Pricing through the API follows standard Mistral audio-endpoint rates.

Hardware requirements

Min VRAM
48 GB
Recommended VRAM
80 GB
Runs on laptop
No
Notes
BF16 precision needs ~48GB VRAM (single RTX 6000 Ada or 2×RTX 4090). Recommended deployment: single H100 or A100 80GB for comfortable throughput. Not laptop-feasible at useful quantization.

Comparable models

Sources