The world of voice AI, with Mati Staniszewski of ElevenLabs
- 01 Voice AI Is 2-3 Years Behind Text AI, But the Gap Is Closing Fast
- 02 The Cascaded Architecture vs. Speech-to-Speech Tradeoff Will Define the Next Wave of Voice Products
- 03 Voice Interaction Fundamentally Changes How Humans Communicate With Businesses
1. Key Themes
Voice AI Is 2-3 Years Behind Text AI, But the Gap Is Closing Fast
The lived experience of consumers using voice is roughly a decade behind the leading edge of what's technically possible, but this gap is narrowing rapidly. The key bottleneck hasn't been the models themselves but the orchestration layer — turn-taking, context persistence, and real-time tool calls.
"The quality of voice models, for them to actually sound good, like this is only like last three years thing... Two years ago, you can start seeing the real-time version of that. And not really, like it's, I think the real break was like a year ago, where you can start seeing that in production." 00:14:37
"I would agree that that's just getting, getting, getting there. And we'll hopefully see that we... Our goal is to like pass the voice Turing test in all those cases or the Turing test for all conversational agents outside of voice too. And I hope we will all be there in the next year or so." 00:20:41
The Cascaded Architecture vs. Speech-to-Speech Tradeoff Will Define the Next Wave of Voice Products
ElevenLabs is betting heavily on the cascaded approach (speech-to-text → LLM → text-to-speech) over end-to-end speech-to-speech models, primarily because enterprise customers need visibility, reliability, and integration capability. Speech-to-speech wins only in latency-sensitive companion-type apps where hallucinations are tolerable or even desired.
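The cascaded flow described above can be sketched in a few lines. This is a minimal illustration, not ElevenLabs' implementation: the class `CascadedPipeline`, the `TurnTrace` record, and the three stand-in stage functions are all hypothetical names invented for this example. The point it tries to show is why a cascade gives visibility: the transcript and the reply text exist as inspectable intermediate values on every turn.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TurnTrace:
    """Intermediate artifacts of one turn, kept so the turn can be audited."""
    transcript: str
    reply_text: str
    audio_out: bytes


@dataclass
class CascadedPipeline:
    # Each stage is a plain callable, so it can be swapped or tested on its own.
    speech_to_text: Callable[[bytes], str]
    llm: Callable[[str], str]
    text_to_speech: Callable[[str], bytes]
    traces: List[TurnTrace] = field(default_factory=list)

    def run_turn(self, audio_in: bytes) -> bytes:
        transcript = self.speech_to_text(audio_in)   # stage 1: audio -> text
        reply_text = self.llm(transcript)            # stage 2: text -> text
        audio_out = self.text_to_speech(reply_text)  # stage 3: text -> audio
        # Record every intermediate value; a fused speech-to-speech model
        # would not expose these as separate, readable steps.
        self.traces.append(TurnTrace(transcript, reply_text, audio_out))
        return audio_out


# Hypothetical stand-in stages: real systems would call actual models here.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(fake_stt, fake_llm, fake_tts)
audio_reply = pipeline.run_turn(b"what is my balance")
```

After the call, `pipeline.traces[0]` holds the transcript and the reply text for that turn, which is the kind of checkpoint an enterprise team could log, review, or route through other systems.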
"In a setup where you have a cascaded approach, you probably will not see like dramatic size changes. You inherently want the models to be quick and reliable... In a fused approach, probably that will get into like tens, hundreds, billion-parameter models, because you kind of combine, of course, the LLM side and the voice side." 00:39:52
"Maybe hallucinations aren't as important, but the latency is a little bit more. And maybe hallucinations are even a feature." 00:29:32
Voice Interaction Fundamentally Changes How Humans Communicate With Businesses
Voice doesn't just replicate text interactions in audio form — it changes the quality and depth of information exchanged. People share more, convert better, and engage more openly when interacting via voice agents compared to forms or chatbots.
"People were actually much more keen to leave the forms through speaking with the agent... But second, they would be a lot more open-ended in terms of what the use case are. So they would start giving us information about the wider set of use cases, the complexity of the use case... people were just more at ease and could, like, trust the system while doing that." 00:30:51
2. Contrarian Perspectives
The Newest Model Should Be Given Away at Cost, Not Priced at a Premium
Counterintuitively, ElevenLabs prices its newest, most expensive-to-run models at attractive economics rather than extracting margin from early adopters. The logic is that reliability is lower early on, and broad distribution generates the feedback loops needed to improve the model faster.
"When we have a new model, we try to give it at cost to a lot of the customers so they can experience the best... We try to keep the prices still competitive to that. Exactly. And over time, we might do some tricks to optimize it. But like we want the customers to like experience. Because of research, the big thing that we've seen is the reliability of the model in the early days might not be there." 00:38:19
Hallucinations Are a Feature, Not Just a Bug, in Certain Voice Applications
In companion or entertainment voice applications, the unpredictability that constitutes a "hallucination" in enterprise settings may actually be desirable — creating a more organic, engaging experience.
"Maybe hallucinations aren't as important, but the latency is a little bit more. And maybe hallucinations are even a feature." 00:29:32
Voice AI Won't Pass the Turing Test Universally — It Will Pass It Domain by Domain
Rather than a single breakthrough moment, the voice Turing test will be passed incrementally in specific domains (customer support first, gaming much later), which means there is a long runway of domain-specific opportunities before the technology becomes generalized.
"I feel like it's going to work in like specific domains. Like in customer support call — passes the voice Turing test. Works well... An interactive gaming experience, like a truly interactive as you would have with another human in that game... we haven't passed it yet there." 00:19:47
Britishness as an Emergent Property: Hard-Coding Accent Parameters Is the Wrong Approach
Rather than encoding accent and emotion as discrete, labeled parameters (the Bell Labs approach), the correct architectural move is to let the model discover these characteristics itself — meaning the richness of human speech is not an engineering specification problem but a learning problem.
"It's not going to be British, Polish, Spanish, English speaker. But the model will deduce them themselves... So it's encoding and decoding of how you create the voice. Super hard problem before and figured out too." 00:04:21
"You're saying kind of Britishness is an emergent property in your voice models. Exactly." 00:04:21
Every AI Product Needs Pay-As-You-Go — Subscription-Only Models Are Self-Defeating
Limiting AI products to subscription tiers without pay-as-you-go overages actively destroys value for both the customer and the company. The correct model is subscription as a base with unlimited metered overage.
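As a rough illustration of "subscription as a base with metered overage", the arithmetic can be written out as below. All figures (the base fee, the included allowance, the overage rate) and the function name `monthly_bill` are made up for this sketch and do not reflect ElevenLabs' actual pricing.

```python
from decimal import Decimal


def monthly_bill(units_used: int,
                 base_fee: Decimal,
                 included_units: int,
                 overage_rate: Decimal) -> Decimal:
    """Base subscription fee plus a per-unit charge for usage above the allowance."""
    overage_units = max(0, units_used - included_units)
    return base_fee + overage_units * overage_rate


# Hypothetical plan: 20.00 per month, 1000 units included, 0.03 per extra unit.
BASE_FEE = Decimal("20.00")
INCLUDED = 1000
RATE = Decimal("0.03")

light_month = monthly_bill(800, BASE_FEE, INCLUDED, RATE)   # under the allowance
heavy_month = monthly_bill(1500, BASE_FEE, INCLUDED, RATE)  # 500 units of overage
```

Under these assumed numbers a light month costs the base fee only, while a heavy month adds 500 × 0.03 = 15.00 on top of it, so the customer is never blocked from using more and the company is paid for the extra usage.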
"It is kind of very funny as a consumer to not have the option to pay more, to use the product more... I think every AI product will need... some all you can eat, most of what you can eat is subscription with limits. And then the ability to pay for overages." 00:51:26
3. Companies Identified
ElevenLabs AI audio and voice model company, founded 2022, valued at $11 billion. Why mentioned: The subject of the podcast; highlighted for its rapid ARR growth ($350M at the end of 2025, with $100M of net new ARR in a single quarter), enterprise client roster, and technical leadership in voice models.
"Most recently announced was 350 at the end of 2025... This quarter was kind of one of the best for enterprise growth where we had the first quarter hit 100 million in an additional ARR growth, which is crazy." 00:44:16
Deutsche Telekom / T-Mobile Major global telecommunications company. Why mentioned: Named as an enterprise customer of ElevenLabs; their engagement expanded from marketing (Magenta campaign, podcast generation) to customer support to full network-wide voice agent deployment.
"Our work with Deutsche Telekom started with marketing side. So we did Magenta work and podcast generation. Then it kind of expanded to customer support. And then it expanded to us working in agent across the entirety of the network so people can call in and have the agent." 00:46:53
Revolut, Klarna, Meta, IBM Major global fintech, e-commerce, and technology enterprises. Why mentioned: Named as ElevenLabs enterprise clients, validating the platform's enterprise-grade reliability and scale.
"Recently we were announced our work with Deutsche Telekom and T-Mobile, with Revolut, with Klarna, with Meta, with IBM, a wide set of use cases." 00:44:45
Intercom (Fin) Customer support AI company. Why mentioned: Used as a reference point for the evolution of conversational agents from support tools to generic UI layers for entire products.
"We had Des Traynor from Intercom on here and they have Fin, their agent... he described a very similar phenomenon that you described, which is you start maybe thinking, oh, this will help me answer customer support queries. But it becomes like a generic UI for the website." 00:42:30
Stripe / Metronome Payments infrastructure and usage-based billing platform. Why mentioned: ElevenLabs uses Stripe; Maciek (ElevenLabs' finance lead) reportedly recommended that Stripe acquire Metronome the day before the acquisition was announced. ElevenLabs is launching pay-as-you-go billing using the Metronome infrastructure.
"He said like, you guys should buy Metronome. And then the next day, Metronome acquisition was announced. So now you have it. So that was my most common feedback. And we'll be launching... We'll be launching user-based billing to everyone." 00:50:27
Diia (Ukraine Government App) Ukraine's centralized citizen services digital platform. Why mentioned: Cited as one of the most technically advanced government digital transformation efforts in the world; partnered with ElevenLabs to add voice capability. Each ministry has embedded technical resources building agentic versions of its services, a model ElevenLabs found validating for its own org design.
"They actually have the same in every of the ministries. So every ministry had technical resources working on creating that agentic version of their work... I thought was brilliant." 00:57:53
Neuralink Brain-computer interface company. Why mentioned: Referenced in context of ElevenLabs restoring voice to a patient — a Neuralink patient used ElevenLabs technology to speak with their own voice again.
"Just recently, there was an example of a patient that had Neuralink. We worked with them to bring the voice that that person could speak with their own voice back to the family around." 00:33:40
4. People Identified
Piotr (ElevenLabs Co-founder) Co-founder and research lead at ElevenLabs. Why mentioned: Credited with the core architectural innovations that made ElevenLabs' voice models superior, specifically abstracting the mel-spectrogram encoding/decoding steps and introducing contextual prediction into voice generation.
"Here, credit to my co-founder, Piotr, who effectively came with that new idea of how you can now create voice models which are both reliable, high-quality, quick, where you would bring a lot of the ideas from transformer models, from diffusion models, into the speech space." 00:01:28
"Probably the most proud thing that Piotr and I are is as we scale the ElevenLabs, the people that are at ElevenLabs, it's been like just the culture and seeing the expansion of the culture." 00:59:28
Maciek (ElevenLabs Finance Lead) Finance lead at ElevenLabs. Why mentioned: Reportedly recommended that Stripe acquire Metronome the day before the acquisition was publicly announced — a notable display of strategic foresight.
"One of our finance lead, Maciek... He said like, you guys should buy Metronome. And then the next day, Metronome acquisition was announced." 00:50:27
5. Operating Insights
Embed a Technical Resource in Every Non-Technical Team
ElevenLabs has a "tech lead" embedded in operations, talent, and other non-engineering functions. This person both automates work and uplevels the rest of the team — Ukraine's government independently arrived at the same model across every ministry, which Mati found validating.
"We will have a person in ops or in talent that will, we have effectively a tech lead for that team. That helps them automate a lot of that work and helps up level the rest of the team too." 00:54:38
Use Voice Agents at the Top of the Funnel to Capture Richer Lead Data
Replacing or supplementing forms with voice agents at the point of lead capture results in higher completion rates and significantly richer qualitative data from prospects. This is an immediately actionable go-to-market tactic.
"We would go through the form a lot easier. But second, they would be a lot more open-ended in terms of what the use case are. So they would start giving us information about the wider set of use cases, the complexity of the use case... the writing out was tedious and tricky." 00:30:51
Use a Culture Agent to Pre-Screen and Prep Candidates
ElevenLabs built an internal voice agent that lets candidates explore company culture and prepare for interviews, reducing friction in recruiting while reinforcing culture before a candidate ever speaks to a human.
"We wanted for people to explore the culture at ElevenLabs. So we created a voice agent that people can speak with and see what's the culture, but also get prepped for the interviews." 00:57:11
Land at Attractive Economics, Then Expand Across Departments
ElevenLabs deliberately prices entry into accounts at attractive terms because the expansion motion — both usage growth within a department and cross-department expansion — is where the value compounds.
"We try to make it very easy for our customers. Maybe that kind of against ourselves where we give the technology a pretty attractive economics because we so much believe in the technology providing value... And then of course, cross-department pollination is there, too." 00:46:24
6. Overlooked Insights
ElevenLabs' Speech-to-Text Model Was Built Accidentally for Internal Data Labeling — And Became a Product
This is a deeply non-obvious insight about how frontier AI companies build moats. ElevenLabs needed better annotation tools for training data, found the market offerings insufficient, built their own speech-to-text model, and then realized it was good enough to productize. Their product portfolio is partly a byproduct of the internal tooling they built to create better training data — a compounding research flywheel that competitors who outsource labeling cannot replicate.
"Speech-to-text model initially was a model we did for ourselves because the models on the market just weren't good to annotate that data. And then another brilliant researcher on our team was kind of being able to construct it so we could span it out as a model that we brought to the customers." 00:07:16
This implies that ElevenLabs has accumulated proprietary, richly annotated voice datasets — with emotional context, speaker identity, and acoustic characteristics — that are structurally impossible for new entrants to replicate quickly. The data moat is the real moat.
Person-Specific Transcription Is Shipping Within Weeks — And Healthcare Is the Killer App
Mati briefly mentions that speaker-specific fine-tuned transcription (feed in an hour of someone's voice, get dramatically better transcription of that person) is "hopefully in the next month." This is underplayed. In healthcare, this capability is mission-critical — surgeons issuing OR commands, doctors dictating notes — and represents a high-value, defensible vertical where accuracy is a regulatory and safety requirement, not just a quality preference.
"Like we think we can roll it out in one of the next versions, which is like hopefully in the next month... In healthcare setup, such an important part. You're in an operating room, you're a doctor, you want to say a command, then you want to really be able to listen to that one person-specific piece." 00:24:19
The combination of person-specific transcription + custom medical vocabulary (keyword detection) + compliance infrastructure positions ElevenLabs to move aggressively into healthcare voice infrastructure — a massive market that has historically been severely underserved by generic voice models.
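To make the "custom medical vocabulary (keyword detection)" piece concrete, here is a deliberately simple sketch that scans a finished transcript for terms from a custom word list. It is an assumption-laden illustration only: the function `detect_keywords`, the sample vocabulary, and the sample transcript are invented here, and the episode does not describe how ElevenLabs actually implements keyword detection (a production system would more likely bias the recognizer itself rather than post-process text).

```python
import re
from typing import Dict, List


def detect_keywords(transcript: str, vocabulary: List[str]) -> Dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each vocabulary term."""
    counts: Dict[str, int] = {}
    for term in vocabulary:
        # \b anchors keep "clamp" from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, transcript, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts


# Hypothetical operating-room vocabulary and transcript.
vocab = ["suction", "clamp", "scalpel", "retractor"]
found = detect_keywords("Scalpel. Clamp, then suction. Second clamp.", vocab)
```

Terms that never occur (here, "retractor") are simply absent from the result, so downstream logic can treat the returned dictionary as the set of commands actually heard.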