×
Sesame’s CTO reveals how they’re building real-time voice AI that talks like humans
Written by
Published on
Join our daily newsletter for breaking news, product launches and deals, research breakdowns, and other industry-leading AI coverage
Join Now

Andreessen Horowitz’s latest episode of AI + a16z features Sesame’s CTO Ankit Kumar delving into the technical foundations of their voice technology with a16z partner Anjney Midha. This conversation offers a rare glimpse into the engineering complexities behind real-time conversational AI, exploring how voice interfaces might fundamentally change human-computer interaction as the technology continues to evolve from research labs into everyday applications.

The big picture: Sesame’s voice technology represents a significant advancement in AI-powered conversational interfaces, with the company taking the unusual step of open-sourcing key components of their underlying models.

  • Kumar and Midha explore the technical challenges involved in creating voice AI that can maintain natural conversation flow while balancing personality expression with computational efficiency.
  • The discussion highlights how multimodal AI systems must integrate speech recognition, natural language processing, and speech synthesis in real-time to create convincing voice interactions.

Key technical challenges: Developing real-time voice AI requires overcoming several complex engineering hurdles that balance performance with computational constraints.

  • Full-duplex conversation modeling, which allows the AI to both listen and speak simultaneously like humans do, represents a particularly difficult problem that Sesame has addressed in their technology.
  • The team has implemented specific computational optimizations to achieve the low-latency interactions necessary for natural-feeling conversations without requiring excessive processing power.

Why open-sourcing matters: Sesame’s decision to release key components of their model architecture reflects a strategic approach to advancing voice AI technology within the broader ecosystem.

  • Open-sourcing creates opportunities for community contributions while potentially accelerating adoption of their underlying technical approach.
  • The move suggests Sesame believes their competitive advantage lies in implementation and product experience rather than solely in proprietary model architecture.

In plain English: Sesame is building AI that can talk with people naturally in real-time, and they’re sharing some of their technical blueprints with the broader developer community rather than keeping everything proprietary.

Technical deep dives: The conversation explores advanced concepts in speech AI that explain how modern voice interfaces are evolving beyond simple command-response patterns.

  • Kumar breaks down how multimodal AI systems must integrate different types of intelligence – processing audio input, understanding language context, and generating natural-sounding speech – all while maintaining conversation flow.
  • The discussion addresses scaling laws in speech synthesis, examining how larger models affect voice quality and expressiveness compared to more optimized smaller models.

Where voice interfaces are heading: The conversation positions natural language as potentially the most intuitive user interface, capable of redefining how humans interact with technology.

  • Voice AI’s evolution toward more contextual understanding and human-like conversational abilities could make technology more accessible to people regardless of technical literacy.
  • The discussion suggests voice interfaces may eventually become the primary way people interact with digital systems, supplementing or replacing screen-based interfaces in many contexts.
Building the Next Generation of Conversational AI

Recent News

Musk-backed DOGE project targets federal workforce with AI automation

DOGE recruitment effort targets 300 standardized roles affecting 70,000 federal employees, sparking debate over AI readiness for government work.

AI tools are changing workflows more than they are cutting jobs

Counterintuitively, the Danish study found that ChatGPT and similar AI tools created new job tasks for workers and saved only about three hours of labor monthly.

Disney abandons Slack after hacker steals terabytes of confidential data using fake AI tool

A Disney employee fell victim to malware disguised as an AI art tool, enabling the hacker to steal 1.1 terabytes of confidential data and forcing the company to abandon Slack entirely.