Fast Masked Diffusion Models For Large-Scale Reasoning

Event details
Date | 18.06.2025
Hour | 09:00 – 11:00
Speaker | Justin Deschenaux |
Category | Conferences - Seminars |
EDIC candidacy exam
Exam president: Prof. Volkan Cevher
Thesis advisor: Prof. Caglar Gulcehre
Co-examiner: Prof. Nicolas Flammarion
Abstract
Autoregressive (AR) large language models (LLMs) currently dominate generative sequence modeling, demonstrating remarkable success across natural language processing tasks, including complex domains like mathematics and coding.
However, the AR decomposition imposes certain constraints, for example on the neural network architecture, which requires causal masking in transformer decoders. This constraint can introduce artifacts, such as the "reversal curse" in reasoning tasks. Moreover, sequential token generation from LLMs is notably slow. While techniques such as speculative decoding can alleviate this issue, they add complexity and remain somewhat ad hoc within the causal generative modeling paradigm.
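As a rough illustration of the causal masking constraint mentioned above (a minimal NumPy sketch; the helper name and toy scores are illustrative, not taken from the thesis): position i may only attend to positions up to i, so scores for "future" positions are suppressed before the softmax.

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular boolean mask: position i may attend only to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Toy attention scores for a length-4 sequence; future positions are
# disabled by setting their scores to -inf before the softmax.
scores = np.random.randn(4, 4)
masked_scores = np.where(causal_mask(4), scores, -np.inf)
print(masked_scores)
```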
Recently, alternative sequence modeling paradigms, particularly masked diffusion models (MDMs), have demonstrated performance that rivals AR models, so these emerging approaches have the potential to shape the future of generative sequence modeling. Notably, MDMs inherently provide a flexible trade-off between generation speed and quality. Because strong MDM performance is a recent development, however, their capabilities and scalability remain less thoroughly explored than those of AR models.
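To make the speed/quality trade-off concrete, here is a toy sketch of masked-diffusion-style parallel decoding (confidence-based unmasking is one common choice; the function names, dummy model, and schedule are illustrative assumptions, not the speaker's method): fewer refinement steps unmask more tokens per step and decode faster, while more steps approach one-token-at-a-time quality.

```python
import numpy as np

MASK = -1  # illustrative id for the special [MASK] token

def mdm_decode(model, seq_len, num_steps):
    # Start from an all-masked sequence and unmask a slice of positions per step.
    seq = np.full(seq_len, MASK, dtype=int)
    for step in range(num_steps):
        still_masked = np.flatnonzero(seq == MASK)
        if still_masked.size == 0:
            break
        probs = model(seq)  # (seq_len, vocab) per-position token probabilities
        # Spread the remaining masked positions evenly over the remaining steps.
        k = int(np.ceil(still_masked.size / (num_steps - step)))
        confidence = probs[still_masked].max(axis=1)
        chosen = still_masked[np.argsort(-confidence)[:k]]  # most confident positions first
        seq[chosen] = probs[chosen].argmax(axis=1)
    return seq

# Dummy "model": random probabilities, only to show the control flow.
rng = np.random.default_rng(0)
dummy_model = lambda seq: rng.dirichlet(np.ones(50), size=seq.shape[0])
print(mdm_decode(dummy_model, seq_len=12, num_steps=3))   # 3 steps: ~4 tokens unmasked per step
print(mdm_decode(dummy_model, seq_len=12, num_steps=12))  # 12 steps: one token per step, AR-like pace
```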
This thesis investigates the trade-offs of non-AR sequence models, specifically focusing on masked diffusion models. Our primary objective is to enhance the discrete diffusion framework to challenge the dominance of AR models in reasoning tasks. We focus on improving the decoding latency, optimizing neural network architectures, and developing novel decoding algorithms, all with an emphasis on reasoning capabilities.
Selected papers
coming soon
Practical information
- General public
- Free
Contact
- edic@epfl.ch