Decoder-only Autoregressive Multimodal Modeling

Event details
Date: 04.08.2025
Hour: 08:30 – 10:30
Speaker: Mingqiao Ye
Category: Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Antoine Bosselut
Thesis advisor: Prof. Amir Zamir
Co-examiner: Prof. Maria Brbic
Abstract
This direction explores extending the next-token prediction framework of LLM pretraining to sequences that interleave text, images, and other modalities. Decoder-only multimodal models unify understanding and generation tasks under a single causal transformer, offering benefits such as simple training pipelines, strong zero-shot performance, and efficient inference.
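To make that framing concrete, below is a minimal, illustrative PyTorch sketch (not taken from any of the papers below) of the unified next-token objective: text tokens and discrete image codes from a VQ tokenizer share one vocabulary, a single causal transformer is trained with cross-entropy on the interleaved sequence, and the same model then generates an image by sampling image tokens autoregressively. The model name `TinyCausalLM`, all sizes, and the random stand-in data are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared discrete vocabulary: text tokens first, VQ image codes offset after.
TEXT_VOCAB = 32000   # assumed text vocabulary size
IMAGE_CODES = 8192   # assumed VQ codebook size
VOCAB = TEXT_VOCAB + IMAGE_CODES


class TinyCausalLM(nn.Module):
    """A deliberately small causal transformer over the shared vocabulary."""

    def __init__(self, vocab=VOCAB, d_model=256, n_head=4, n_layer=2, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, 4 * d_model, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        T = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        return self.head(self.blocks(x, mask=mask))


model = TinyCausalLM()

# Training: one interleaved sequence (text prompt, then image tokens from a
# hypothetical VQ tokenizer; random ids stand in for real tokenized data).
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(0, IMAGE_CODES, (1, 64)) + TEXT_VOCAB
seq = torch.cat([text_ids, image_ids], dim=1)

logits = model(seq[:, :-1])  # predict token t+1 from tokens up to t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss: {loss.item():.3f}")

# Inference: the same model generates an image by sampling image tokens
# autoregressively after the text prompt; a VQ decoder (not shown) would
# then map the 64 codes back to pixels.
generated = text_ids.clone()
with torch.no_grad():
    for _ in range(64):
        next_logits = model(generated)[:, -1, TEXT_VOCAB:]  # restrict to image codes
        next_id = next_logits.argmax(dim=-1, keepdim=True) + TEXT_VOCAB
        generated = torch.cat([generated, next_id], dim=1)
print("generated image token ids:", generated[0, 16:21].tolist(), "...")
```

The single shared vocabulary is what lets one causal transformer cover both understanding (text after images) and generation (images after text); the offsetting of image codes past the text range is a common early-fusion convention, used here only as a simplifying assumption.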
A few background papers I will be discussing include:
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (HKU)
https://peizesun.github.io/llamagen/
A scalable decoder-only image generation model that challenges diffusion-based approaches.
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta)
https://arxiv.org/pdf/2405.09818
A unified decoder-only model trained on trillions of tokens for both understanding and generation.
BAGEL: The Open-Source Unified Multimodal Model (ByteDance)
https://bagel-ai.org/
An open-source state-of-the-art model that supports image generation and editing with a unified architecture.
Selected papers
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (HKU)
  https://peizesun.github.io/llamagen/
  A scalable decoder-only image generation model that challenges diffusion-based approaches.
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta)
  https://arxiv.org/pdf/2405.09818
  A unified decoder-only model trained on trillions of tokens for both understanding and generation.
- BAGEL: The Open-Source Unified Multimodal Model (ByteDance)
  https://bagel-ai.org/
  An open-source state-of-the-art model that supports image generation and editing with a unified architecture.
Practical information
- General public
- Free
Contact
- edic@epfl.ch