State of Voice AI and Video in Gaming, AI Agents Market and the Evolution of Software Engineering

Raise the stakes.

Bartek Pucek

Jan 19, 2025


Good morning,

In today's edition, among other things:

State of Voice AI
The State of Video Gaming in 2025
AI Agents Market Landscape - Ecosystem
European Founders Compensation
The Evolution of Software Engineering
Is IT the new HR Department?
The EU regulations to look out for in 2025

Onwards!

State of Voice AI

In 2024, voice AI shattered expectations (ElevenLabs being a favorite example), stepping into mainstream industries and everyday life. Breakthroughs in Speech-to-Speech (S2S) systems, enterprise-scale projects, and tailored applications have transformed how businesses and individuals interact with voice and voice integrations in their products.

Several critical technological breakthroughs converged in 2024 to make “enterprise” voice AI viable:

1. Evolution of Model Architectures

Unified Conversational Pipelines: The standard conversational pipeline (“STT → LLM → TTS”) matured, and new approaches emerged alongside it.
- OpenAI’s Voice Mode integrates Speech-to-Text (STT) and Text-to-Speech (TTS) with Large Language Models (LLMs).
- Moshi’s Fully Duplex S2S Systems: Real-time models enable natural interruptions, streaming inputs, and simultaneous speech processing.
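For readers who want to picture the cascaded “STT → LLM → TTS” flow, here is a minimal Python sketch. The three stage functions are hypothetical stand-ins of my own (no real STT, LLM, or TTS vendor API is called), so treat it as an illustration of the shape of the pipeline rather than how any product named above is built.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; a real system would wrap vendor SDKs here.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class CascadedVoicePipeline:
    """Chains STT -> LLM -> TTS for one conversational turn."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)     # audio -> text
        reply_text = self.llm(transcript)   # text -> text
        return self.tts(reply_text)         # text -> audio


# Toy stand-ins so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedVoicePipeline(fake_stt, fake_llm, fake_tts)
reply = pipeline.respond(b"hello")
```

Each stage waits for the previous one to finish, which is exactly the hand-off that fully duplex speech-to-speech systems try to remove.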

2. Enterprise-Grade Voice AI APIs

Speech-to-Text Enhancements:
- Deepgram’s Nova-2 achieved a 30% Word Error Rate (WER) reduction, critical for domain-specific accuracy.
- Far-field transcription and niche jargon remain challenging but solvable with customization tools.
Text-to-Speech Innovations:
- Improved prosody and naturalness in synthetic voices now rival human intonation.
- Advanced training on diverse data expanded TTS engines’ ability to handle nuanced expressions like acronyms or emotional speech.
LLM Integration:
- Faster reasoning models (e.g., GPT-4o, Llama 3.2) with input streaming for seamless real-time interaction.
- Costs dropped dramatically: from $45 per million tokens for GPT-4 to $2.75 per million tokens for Llama 3.1.
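To put the token pricing above in perspective, a quick back-of-the-envelope calculation helps. The helper name `price_drop` is mine, and the two prices are simply the figures quoted above; note that they compare two different models, not one model over time.

```python
def price_drop(old_price: float, new_price: float) -> tuple[float, float]:
    """Return (percent reduction, times cheaper) for two per-million-token prices."""
    percent_reduction = (old_price - new_price) / old_price * 100
    times_cheaper = old_price / new_price
    return percent_reduction, times_cheaper


# Prices in USD per million tokens, as quoted in the text.
percent, factor = price_drop(45.00, 2.75)
print(f"{percent:.1f}% cheaper, roughly {factor:.1f}x")  # 93.9% cheaper, roughly 16.4x
```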

3. Compact, On-Device Models

On-device models allow processing without internet connectivity, improving latency and privacy.
- Edge AI frameworks like TensorFlow Lite and specialized chips are enabling scalable local voice processing.

Progress on both the tech and product sides led to fast adoption across verticals.

Via Cartesia:

Vertical voice agent startups saw explosive growth, with Y Combinator reflecting this trend as the number of voice-native companies grew by 70% between the Winter and Fall cohorts. Initial adoption focused on expanding capacity for previously understaffed services like 24/7 customer operations and seasonal volume surges.
Loan Servicing: Salient and Kastle’s agents help service loans, manage payoffs, and handle outreach for reactivating dormant accounts or cross-selling other financial products—all while maintaining high compliance standards for handling sensitive data like PII.
Insurance: Liberate and Skit’s agents handle 24/7 claims processing, policy renewals, and provide clear explanations of coverage options.
Healthcare: Abridge first brought transcription to healthcare in response to the high demand for medical scribes in 2019. Now, clinics worldwide are adopting AI assistants for scheduling appointments, providing medication reminders, and answering billing queries thanks to companies like Hello Patient, Hippocratic, Assort Health, and Superdial—all while safeguarding patient information.
Logistics: Freight brokers, third-party logistics providers (3PLs), and carriers utilize Happy Robot and Fleetworks to manage check calls, load updates, payment statuses, and appointment scheduling.
Hospitality: Use cases range from Host AI’s omnichannel AI assistant for hotels to Nowadays’ AI event planner. Elise AI’s AI assistant and CRM work hand in hand to handle everything from leasing inquiries to maintenance and renewals.
SMBs: Goodcall allows smaller franchises to easily set up AI agents to handle all their inbound calls seamlessly, as owners currently miss 60% of phone calls due to capacity constraints. Slang has purpose-built solutions for restaurants and Numa has integrated with car dealership CRMs to leverage past customer interaction data to drive retention. Avoca powers 24/7 AI call centers for HVAC, plumbing, and other field services.

And there’s more coming this year. The most important things happening now:

1. Mainstreaming of Speech-to-Speech Models

Latency Reduction: Voice systems are narrowing response latency to 160ms, which compares well with human conversational speeds (~230ms).
Context Preservation: Unified models capture nuances of tone, emotion, and prosody, enhancing interaction quality.
Challenges: Overlapping speech handling and model cost efficiency remain bottlenecks for enterprise adoption.
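One way to see why latency matters so much: in a cascaded pipeline the per-stage delays add up, while a single speech-to-speech model has only one delay to pay. The sketch below checks a latency budget against the ~230ms human figure cited above. The per-stage numbers are placeholders I made up for illustration, not measurements of any real system.

```python
HUMAN_TURN_GAP_MS = 230  # approximate human conversational speed cited above


def total_latency_ms(stage_latencies_ms: dict[str, float]) -> float:
    """Stages run one after another, so their latencies add up."""
    return sum(stage_latencies_ms.values())


def within_budget(stage_latencies_ms: dict[str, float],
                  budget_ms: float = HUMAN_TURN_GAP_MS) -> bool:
    return total_latency_ms(stage_latencies_ms) <= budget_ms


# Placeholder numbers for illustration only (assumptions, not benchmarks).
cascaded = {"stt": 120, "llm_first_token": 200, "tts_first_audio": 100}
# The 160ms figure quoted above, modeled as a single stage.
speech_to_speech = {"s2s_model": 160}

cascaded_ok = within_budget(cascaded)          # 420ms total, over budget
s2s_ok = within_budget(speech_to_speech)       # 160ms total, under budget
```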

2. Deeper Integration into Workflows

Voice AI is increasingly trusted to handle multi-step tasks like:

End-to-end rebooking in travel (leveraging RAG for real-time policy lookups).
Automating onboarding processes with personalized, contextual guidance.
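The RAG-style policy lookup mentioned for travel rebooking can be sketched in a few lines. This is a deliberately naive version: it ranks hypothetical policy snippets by keyword overlap, where a production system would more likely use embeddings and a vector store. The policy texts, function names, and prompt layout are all my own assumptions.

```python
import re

# Hypothetical policy snippets; a real deployment would load these from a policy store.
POLICIES = [
    "Flights cancelled by the airline can be rebooked free of charge within 14 days.",
    "Voluntary changes on basic economy fares incur a change fee.",
    "Checked baggage allowance is one bag on standard fares.",
]


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many query words they share (naive retrieval)."""
    query_terms = tokenize(query)
    ranked = sorted(documents,
                    key=lambda doc: len(query_terms & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, documents: list[str]) -> str:
    """Put the retrieved policy text in front of the caller's request for the LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Policy context:\n{context}\n\nCaller request: {query}"


prompt = build_prompt("My flight was cancelled, can I get rebooked?", POLICIES)
```

The point is that the agent answers from the retrieved policy text instead of from whatever the model happens to remember.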

3. Greater Adoption of Fine-Grained Controls

SSML improvements allow voice output to align with visual expressions in multimodal settings.
Use cases in education (e.g., AI tutors) and elder care (e.g., always-on companions) are set to proliferate.
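For a feel of what fine-grained control looks like in practice, here is a small sketch that assembles an SSML snippet with pacing, a pause, and a spelled-out acronym. The helper names are mine. The `prosody`, `break`, and `say-as` elements come from the W3C SSML specification, but the accepted attribute values (including `interpret-as="characters"`) vary by TTS engine, so check your provider's documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def text(plain: str) -> str:
    """Escape plain text so it is safe inside SSML."""
    return escape(plain)


def spell_out(acronym: str) -> str:
    """Ask the engine to read the letters one by one (engine support varies)."""
    return f'<say-as interpret-as="characters">{escape(acronym)}</say-as>'


def with_prosody(plain: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap text in a prosody element to control speaking rate and pitch."""
    return (f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
            f"{escape(plain)}</prosody>")


def pause(milliseconds: int) -> str:
    return f'<break time="{milliseconds}ms"/>'


def speak(*fragments: str) -> str:
    """Join already-escaped fragments under a single <speak> root."""
    return "<speak>" + "".join(fragments) + "</speak>"


ssml = speak(
    with_prosody("Your claim was approved.", rate="slow"),
    pause(300),
    text("We never share your "),
    spell_out("PII"),
)
```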

With the (obvious?) key challenges:

Ethical Use and Trust:
- Biased responses and security vulnerabilities pose risks for regulatory scrutiny.
Scalability:
- Supporting high concurrency with low latency in real-time systems is resource-intensive.
Data Complexity:
- Domain-specific transcription and synthesis still face limitations in capturing rare terminologies or regional dialects.

But the #1 trend, I think, will be real-time, hyper-customized voice interactions that deepen user engagement.

European Founders Compensation
