AI Center - Research Seminar Series - Edouard Grave
Event details
Date | 16.12.2024
Time | 14:00 – 15:00
Speaker | Edouard Grave
Location | Online
Category | Conferences - Seminars
Event Language | English
The talk is preceded by a coffee session at 13:15 in the adjacent space (BC 430).
For on-site logistics, please register via the registration form.
Hosting professor: Prof. Nicolas Flammarion
Title
Moshi: a foundation model for conversational speech
Abstract
In this talk, I will present Moshi, a joint speech-text foundation model and full-duplex spoken dialogue system. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning—such as emotion or non-speech sounds—is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections.
Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model, Moshi generates speech as tokens from the quantizer of a neural audio codec, and separately models its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We extend the hierarchical semantic-to-acoustic token generation of previous work, by predicting time-aligned text tokens as a prefix to audio tokens. Our resulting model is the first real-time full-duplex spoken large language model, with a latency of around 200 ms in practice.
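As a rough illustration of the parallel-stream idea in the abstract, here is a minimal Python sketch. Every name, the frame structure, and the token counts are hypothetical assumptions, not Kyutai's implementation; it only shows the data layout described above: a time-aligned text token emitted as a prefix to the audio tokens of the same step, with the model's own speech and the user's speech kept in separate streams.

# Hypothetical sketch of Moshi-style parallel token streams (not Kyutai's
# code): every name, shape, and token count here is an assumption.
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    """Tokens for one time step of the dialogue."""
    text: int               # time-aligned text token, acting as a prefix
    model_audio: List[int]  # codec tokens for the model's own speech
    user_audio: List[int]   # codec tokens for the incoming user speech

def flatten(frame: Frame) -> List[int]:
    """Order the streams so the text token precedes (and can condition)
    the audio tokens of the same step, mirroring the prefix scheme."""
    return [frame.text, *frame.model_audio, *frame.user_audio]

# Example: one step with a single text token and 8 codec tokens per stream.
step = Frame(text=421, model_audio=list(range(8)), user_audio=list(range(8)))
print(flatten(step))  # -> [421, 0, 1, ..., 7, 0, 1, ..., 7]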
Bio
Edouard Grave is a researcher and a member of the founding team at Kyutai, where he works on artificial intelligence, natural language processing and large language models (LLMs). Before joining Kyutai, he spent eight years in industry, first at Facebook AI Research and then at Apple MLR. Edouard also completed a postdoc at Columbia University, where he worked with Noémie Elhadad and Chris Wiggins, and at UC Berkeley, where he worked with Laurent El Ghaoui. He received his PhD in computer science from Université Paris VI and graduated from École Polytechnique with an M.Sc. in machine learning and computer vision.
Practical information
- General public
- Free
- This event is internal