Emergent Capabilities in Modern Sequence Models: Phase Transitions, Memory in Shallow Transformers, and Bidirectional State-Space Architectures.

Event details
Date | 17.06.2025
Hour | 14:00 – 16:00
Speaker | Fabrizio Boncoraglio
Location |
Category | Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Michael Gastpar
Thesis advisor: Prof. Lenka Zdeborova
Co-examiner: Prof. Matthieu Wyart
Abstract
Recent theory has begun to expose the statistical principles that underlie modern sequence models. Neural sequence modeling has progressed along three complementary axes: (i) high-dimensional analyses provide a solvable statistical-physics framework for attention models; the work of Cui et al. reveals an abrupt phase transition between positional and semantic learning in low-rank dot-product attention, with closed-form generalization predictions as the amount of data varies; (ii) associative-memory studies prove that shallow Transformers can store a number of facts on the order of their parameter count; Nichani et al. show that shallow Transformers achieve near-optimal factual storage by allocating capacity between the self-attention and MLP blocks; and (iii) unified matrix-mixer perspectives frame most sequence mixers, including attention, as structured matrix transformations of the input sequence; Hwang et al. introduce Hydra, a bidirectional state-space model (SSM) built on quasiseparable matrix mixers, which matches Transformer accuracy in certain settings at linear inference cost. The objective of this write-up is to knit these axes into a single statistical framework. Building on these insights and on our own work in Boncoraglio et al., we propose an integrated framework that (a) explains when and why models transition from positional heuristics to semantic abstraction, (b) leverages dual associative memories to guarantee linear-in-parameters factual recall, (c) extends quasiseparable mixers to support data-dependent parameterization with provable memory efficiency, and (d) integrates these results and highlights further perspectives.
The resulting roadmap connects phase-transition theory, memory capacity and structured mixers, charting a principled path toward scalable, memory-rich sequence models.
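To make axis (i) concrete, the sketch below contrasts a purely positional mixer, whose weights depend only on token positions, with a content-dependent low-rank dot-product attention head. It is a minimal illustration of the two mechanisms whose competition Cui et al. analyze, not an implementation of their solvable model; the dimensions, the positional pattern, and the NumPy setup are illustrative choices.

```python
# Minimal sketch (illustrative, not Cui et al.'s solvable model): a positional
# mixer ignores token content, while low-rank dot-product attention depends on it.
import numpy as np

rng = np.random.default_rng(0)
L, d, r = 8, 16, 2                        # sequence length, embedding dim, attention rank

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

X = rng.normal(size=(L, d))               # token embeddings (content)
Q = rng.normal(size=(d, r)) / np.sqrt(d)  # tied low-rank query/key matrix

# Positional mechanism: scores depend only on positions
# (here, a soft "attend to the previous token" pattern).
pos_scores = -np.abs(np.arange(L)[:, None] - (np.arange(L)[None, :] + 1.0))
A_pos = softmax(pos_scores)

# Semantic mechanism: scores depend on token content through the dot product.
A_sem = softmax((X @ Q) @ (X @ Q).T / np.sqrt(r))

# Shuffling token contents leaves the positional mixer unchanged but changes
# the semantic one -- the two regimes separated by the phase transition.
Xp = X[rng.permutation(L)]
A_sem_shuffled = softmax((Xp @ Q) @ (Xp @ Q).T / np.sqrt(r))
print("semantic attention changes with content:", not np.allclose(A_sem, A_sem_shuffled))
print("positional mixer never sees content, so it cannot change.")
Y_pos, Y_sem = A_pos @ X, A_sem @ X       # the two candidate layer outputs
```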
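Axis (ii) can be illustrated with a textbook linear associative memory: facts are stored as a sum of outer products between random key and value embeddings and recalled by a nearest-neighbour readout over the value dictionary. This is a deliberate simplification of the constructions analyzed by Nichani et al. (no attention/MLP split); the dimensions, fact counts, and readout below are illustrative assumptions.

```python
# Minimal sketch of a linear associative memory (a simplification, not
# Nichani et al.'s exact construction): facts are stored in one d x d matrix
# as a sum of outer products and recalled by nearest-neighbour readout.
import numpy as np

rng = np.random.default_rng(0)
d, n_facts, n_values = 512, 2000, 100

def unit_rows(n, dim):
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

K = unit_rows(n_facts, d)            # one random key embedding per fact
V_dict = unit_rows(n_values, d)      # dictionary of value (answer) embeddings
labels = rng.integers(n_values, size=n_facts)
V = V_dict[labels]                   # value assigned to each fact

W = V.T @ K                          # memory matrix: sum_i v_i k_i^T

# Recall: read out W k_i and pick the closest value embedding.
pred = np.argmax((K @ W.T) @ V_dict.T, axis=1)
print(f"stored {n_facts} facts in a {d}x{d} matrix, "
      f"recall accuracy = {(pred == labels).mean():.3f}")
```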
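For axis (iii), the sketch below builds a scalar-channel quasiseparable mixing matrix from a forward scan, a backward scan, and a diagonal, and checks that the two O(L) recurrences reproduce the explicit O(L^2) matrix multiply. It conveys the matrix-mixer view behind Hydra but omits the multi-channel, data-dependent parameterization of the actual architecture; all symbols here are illustrative.

```python
# Minimal sketch (scalar channel, illustrative only) of a quasiseparable mixer:
# lower-semiseparable part (forward scan) + upper-semiseparable part (backward
# scan) + diagonal. The linear-time scans match the dense matrix multiply.
import numpy as np

rng = np.random.default_rng(0)
L = 64
a = rng.uniform(0.5, 0.99, size=L)   # state-transition (decay) coefficients
b = rng.normal(size=L)               # input projections
c = rng.normal(size=L)               # output projections
delta = rng.normal(size=L)           # diagonal (skip) term
x = rng.normal(size=L)               # input sequence

# Explicit O(L^2) quasiseparable matrix.
M = np.zeros((L, L))
for i in range(L):
    for j in range(L):
        if i == j:
            M[i, j] = delta[i]
        elif i > j:                  # forward (causal) part
            M[i, j] = c[i] * np.prod(a[j + 1:i + 1]) * b[j]
        else:                        # backward (anticausal) part, mirrored roles
            M[i, j] = c[j] * np.prod(a[i + 1:j + 1]) * b[i]
y_dense = M @ x

# Same output from two O(L) recurrences plus the diagonal term.
y_fwd, y_bwd = np.zeros(L), np.zeros(L)
h = 0.0
for t in range(L):                   # forward scan
    if t > 0:
        h = a[t] * (h + b[t - 1] * x[t - 1])
    y_fwd[t] = c[t] * h
g = 0.0
for t in range(L - 1, -1, -1):       # backward scan
    if t < L - 1:
        g = a[t + 1] * (c[t + 1] * x[t + 1] + g)
    y_bwd[t] = b[t] * g
y_scan = delta * x + y_fwd + y_bwd

print("linear-time scans match dense mixer:", np.allclose(y_dense, y_scan))
```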
Selected papers
- A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention: https://arxiv.org/pdf/2402.03902
- Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers: https://arxiv.org/pdf/2407.09941
- Understanding Factual Recall in Transformers via Associative Memories: https://arxiv.org/pdf/2412.06538
Practical information
- General public
- Free