Decoder-only Autoregressive Multimodal Modeling

Event details
Date: 04.08.2025
Hour: 08:30 – 10:30
Speaker: Mingqiao Ye
Category: Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Antoine Bosselut
Thesis advisor: Prof. Amir Zamir
Co-examiner: Prof. Maria Brbic
Abstract
This direction explores extending the next-token prediction framework of LLM pretraining to sequences that interleave text, images, and other modalities. Decoder-only multimodal models unify understanding and generation tasks under a single causal transformer, offering benefits such as simple training pipelines, strong zero-shot performance, and efficient inference.
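To make that framing concrete, below is a minimal, illustrative PyTorch sketch (not taken from any of the papers below) of the unified next-token objective: text tokens and discrete image codes from a VQ tokenizer share one vocabulary, a single causal transformer is trained with cross-entropy on the interleaved sequence, and the same model then generates an image by sampling image tokens autoregressively. The model name `TinyCausalLM`, all sizes, and the random stand-in data are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared discrete vocabulary: text tokens first, VQ image codes offset after.
TEXT_VOCAB = 32000   # assumed text vocabulary size
IMAGE_CODES = 8192   # assumed VQ codebook size
VOCAB = TEXT_VOCAB + IMAGE_CODES


class TinyCausalLM(nn.Module):
    """A deliberately small causal transformer over the shared vocabulary."""

    def __init__(self, vocab=VOCAB, d_model=256, n_head=4, n_layer=2, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, 4 * d_model, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        T = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        return self.head(self.blocks(x, mask=mask))


model = TinyCausalLM()

# Training: one interleaved sequence (text prompt, then image tokens from a
# hypothetical VQ tokenizer; random ids stand in for real tokenized data).
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(0, IMAGE_CODES, (1, 64)) + TEXT_VOCAB
seq = torch.cat([text_ids, image_ids], dim=1)

logits = model(seq[:, :-1])  # predict token t+1 from tokens up to t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss: {loss.item():.3f}")

# Inference: the same model generates an image by sampling image tokens
# autoregressively after the text prompt; a VQ decoder (not shown) would
# then map the 64 codes back to pixels.
generated = text_ids.clone()
with torch.no_grad():
    for _ in range(64):
        next_logits = model(generated)[:, -1, TEXT_VOCAB:]  # restrict to image codes
        next_id = next_logits.argmax(dim=-1, keepdim=True) + TEXT_VOCAB
        generated = torch.cat([generated, next_id], dim=1)
print("generated image token ids:", generated[0, 16:21].tolist(), "...")
```

The single shared vocabulary is what lets one causal transformer cover both understanding (text after images) and generation (images after text); the offsetting of image codes past the text range is a common early-fusion convention, used here only as a simplifying assumption.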
A few background papers I will be discussing include:
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (HKU)
https://peizesun.github.io/llamagen/
A scalable decoder-only image generation model that challenges diffusion-based approaches.
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta)
https://arxiv.org/pdf/2405.09818
A unified decoder-only model trained on trillions of tokens for both understanding and generation.
BAGEL: The Open-Source Unified Multimodal Model (ByteDance)
https://bagel-ai.org/
An open-source state-of-the-art model that supports image generation and editing with a unified architecture.
Selected papers
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (HKU)
  https://peizesun.github.io/llamagen/
  A scalable decoder-only image generation model that challenges diffusion-based approaches.
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta)
  https://arxiv.org/pdf/2405.09818
  A unified decoder-only model trained on trillions of tokens for both understanding and generation.
- BAGEL: The Open-Source Unified Multimodal Model (ByteDance)
  https://bagel-ai.org/
  An open-source state-of-the-art model that supports image generation and editing with a unified architecture.
Practical information
- General public
- Free
Contact
- edic@epfl.ch