ElevenLabs, the highly valued AI voice cloning and generation startup from Palantir alumni, today launched Scribe v1, a new speech-to-text model that reportedly achieves the highest accuracy across multiple languages. Users can try it on the ElevenLabs site.
According to the company’s benchmarks, it outperforms Google’s Gemini 2.0 Flash, OpenAI’s Whisper v3, and Deepgram Nova-3 at accurately converting speech into text, achieving new record-low error rates.
The company claims that Scribe delivers state-of-the-art transcription accuracy in 99 languages, including improved performance in previously underserved languages such as Serbian, Cantonese, and Malayalam.
As Flavio Schneider, ElevenLabs’ Lead Researcher, wrote on X, Scribe is the “smartest audio understanding model” released by ElevenLabs yet.
“Scribe doesn’t just transcribe — it understands audio,” Schneider continued in a threaded reply. “It can detect non-verbal events (like laughter, sound effects, music, and background noise) and analyze long audio contexts for accurate diarization, even in the most challenging environments.”
“Diarization” is the process of separating the speakers on a recording by their vocal qualities, so that each stretch of speech can be attributed to the person who said it.
In fact, ElevenLabs’ documentation states Scribe can distinguish and isolate up to 32 different speakers in the same audio file.
While ElevenLabs cautions that Scribe is “best used for when high-accuracy transcription is required rather than real-time transcription,” the company also plans to introduce a low-latency version soon, expanding its use for real-time applications.
Lowest word error rates (WER)
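Scribe is designed to handle real-world audio challenges with precision. According to benchmark results from FLEURS and Common Voice, it records the lowest word error rates (WER) for many languages. The figures cited for Italian (98.7%) and English (96.7%) are accuracy rates rather than error rates; if accuracy is taken as 100% minus WER, they correspond to WERs of roughly 1.3% and 3.3%.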
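For readers unfamiliar with the metric, WER is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not ElevenLabs’ benchmark code: the function name is illustrative, and it skips the text normalization (casing, punctuation) that real benchmarks such as FLEURS evaluations typically apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.1%}, accuracy: {1 - wer:.1%}")
```

Under this definition, one wrong word in a six-word reference gives a WER of one sixth, and a WER of 1.3% means roughly 13 word errors per 1,000 reference words.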
Key features include:
- Speaker diarization to differentiate speakers in multi-speaker recordings
- Word-level timestamps for precise alignment of the transcript with the audio
- Detection of non-speech events, such as laughter and background noises
- Structured transcript output for seamless integration via API
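To illustrate how these features could fit together, here is a rough sketch of a structured transcript in which every token carries a timestamp, a speaker label, and a type, and consecutive tokens are merged into speaker turns. The `Token` class, its field names, and the `speaker_turns` helper are hypothetical and chosen for illustration only; they are not ElevenLabs’ actual response schema, which is described in the company’s API documentation.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Token:
    """One transcript item (hypothetical schema, not the real API response)."""
    text: str
    start: float          # seconds from the beginning of the audio
    end: float
    speaker: str          # diarization label, e.g. "speaker_0"
    kind: str = "word"    # "word" or "audio_event" (laughter, music, ...)


def speaker_turns(tokens: list[Token]) -> list[dict]:
    """Merge consecutive tokens from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(tokens, key=lambda t: t.speaker):
        items = list(group)
        turns.append({
            "speaker": speaker,
            "start": items[0].start,
            "end": items[-1].end,
            "text": " ".join(t.text for t in items),
        })
    return turns


if __name__ == "__main__":
    tokens = [
        Token("Welcome", 0.0, 0.4, "speaker_0"),
        Token("everyone.", 0.4, 0.9, "speaker_0"),
        Token("(laughter)", 1.0, 1.8, "speaker_1", kind="audio_event"),
        Token("Thanks!", 1.9, 2.3, "speaker_1"),
    ]
    for turn in speaker_turns(tokens):
        print(f'[{turn["start"]:.1f}-{turn["end"]:.1f}] {turn["speaker"]}: {turn["text"]}')
```

Keeping non-speech events in the same token stream as words, as assumed here, lets a downstream tool decide whether to show or drop them without re-aligning timestamps.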
Pricing and availability
Scribe is available now through the ElevenLabs website and API.
Pricing is set at $0.40 per hour of input audio, with a 50% discount for the next six weeks. A low-latency version for real-time applications is also in development.
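As a back-of-the-envelope illustration of what that rate means at volume, the sketch below estimates the cost of a batch of audio. It assumes that $0.40 per hour is the list price, that the 50% launch discount is applied on top of it, and that there are no volume tiers, minimum charges, or taxes; the article does not confirm any of these, and the function and constant names are illustrative.

```python
from decimal import Decimal

LIST_PRICE_PER_HOUR = Decimal("0.40")  # USD per hour of input audio (from the article)
LAUNCH_DISCOUNT = Decimal("0.50")      # 50% promotional discount (assumed to apply to list price)


def transcription_cost(hours: float, discounted: bool = False) -> Decimal:
    """Estimated cost in USD for `hours` of audio, rounded to cents."""
    price = LIST_PRICE_PER_HOUR
    if discounted:
        price = price * (1 - LAUNCH_DISCOUNT)
    return (Decimal(str(hours)) * price).quantize(Decimal("0.01"))


if __name__ == "__main__":
    for hours in (10, 1_000, 25_000):
        full = transcription_cost(hours)
        promo = transcription_cost(hours, discounted=True)
        print(f"{hours:>6} h: ${full} list, ${promo} with launch discount")
```

Under these assumptions, 1,000 hours of audio would cost $400 at the list price and $200 during the promotional period.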
What it means for enterprises
For enterprise decision-makers, Scribe presents a tool for scalable, high-accuracy transcription, making it useful for industries relying on automated documentation, meeting transcription, and content accessibility.
The model’s ability to handle diverse languages with high precision also benefits multinational businesses, media companies, and customer support applications.
Scribe’s pricing structure makes it competitive for businesses that require high-volume transcription services, and its API-based integration allows for seamless adoption in enterprise workflows.
Additionally, the upcoming low-latency version could position Scribe as a viable option for real-time communication tools.
Launching the same day as rival Hume AI’s text-to-speech model Octave
Timing is everything, and ElevenLabs chose to launch Scribe the same day as rival Hume AI unveiled Octave, an LLM-powered text-to-speech model that allows users to customize AI-generated voices with adjustable emotions.
It is designed for content creation, including audiobooks, podcasts, and video game voiceovers. Unlike standard TTS systems, Octave considers context beyond individual sentences, adjusting tone, rhythm, and cadence dynamically to sound more natural.
Hume AI positions Octave as a direct competitor to ElevenLabs’ text-to-speech offerings, highlighting that Octave’s pricing is about half the cost of ElevenLabs’ current AI voice services.
While Scribe and Octave serve different functions, their development reflects the growing competition in AI-driven audio models.
ElevenLabs is prioritizing precise, multi-language speech recognition, while Hume AI is advancing expressive AI-generated speech.
For enterprises, this means more specialized solutions for both transcription and synthetic voice applications, enabling more efficient content production, customer engagement, and accessibility tools.
Scribe is now live, and ElevenLabs is hosting a virtual event next week with the team behind its development. More details, benchmarks, and API documentation are available in the official blog post.