Decoder-only Autoregressive Multimodal Modeling

Event details

Date 04.08.2025
Hour 08:30 - 10:30
Speaker Mingqiao Ye
Location
Category Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Antoine Bosselut
Thesis advisor: Prof. Amir Zamir
Co-examiner: Prof. Maria Brbic

Abstract
This direction explores extending the next-token prediction framework of LLM pretraining to sequences that interleave text, images, and other modalities. Decoder-only multimodal models unify understanding and generation tasks under a single causal transformer, offering benefits such as simple training pipelines, strong zero-shot performance, and efficient inference.
A few background papers I will discuss are listed below under Selected papers.
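To make the framing concrete, here is a minimal sketch (not taken from the talk or any of the listed papers) of next-token prediction over an interleaved text-and-image token sequence with a single causal transformer. The vocabulary sizes, the assumption of a VQ-style image tokenizer whose codes share one vocabulary with text, and all names such as TinyCausalLM are illustrative assumptions, not any specific model's implementation.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  TEXT_VOCAB = 32_000    # hypothetical text vocabulary size
  IMAGE_VOCAB = 8_192    # hypothetical VQ image-codebook size
  VOCAB = TEXT_VOCAB + IMAGE_VOCAB   # shared vocabulary: image codes occupy their own ID range

  class TinyCausalLM(nn.Module):
      # A decoder-only (causal) transformer over the shared token vocabulary.
      def __init__(self, d_model=256, n_heads=4, n_layers=4):
          super().__init__()
          self.embed = nn.Embedding(VOCAB, d_model)
          layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
          self.blocks = nn.TransformerEncoder(layer, n_layers)
          self.lm_head = nn.Linear(d_model, VOCAB)

      def forward(self, tokens):  # tokens: (batch, seq_len)
          T = tokens.size(1)
          causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
          h = self.blocks(self.embed(tokens), mask=causal_mask)
          return self.lm_head(h)  # (batch, seq_len, VOCAB) next-token logits

  # Toy interleaved sequence: a text prompt followed by image tokens from a
  # (hypothetical) VQ tokenizer. One cross-entropy loss covers both modalities,
  # which is what unifies understanding and generation in this setup.
  model = TinyCausalLM()
  text_tokens = torch.randint(0, TEXT_VOCAB, (2, 16))
  image_tokens = torch.randint(TEXT_VOCAB, VOCAB, (2, 64))
  seq = torch.cat([text_tokens, image_tokens], dim=1)
  logits = model(seq[:, :-1])
  loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))

Generation then amounts to sampling tokens one at a time from the same model, decoding image-range tokens back to pixels with the image tokenizer's decoder.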

Selected papers
  • Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (HKU)
    https://peizesun.github.io/llamagen/
    A scalable decoder-only image generation model that challenges diffusion-based approaches.
  • Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta)
    https://arxiv.org/pdf/2405.09818
    A unified decoder-only model trained on trillions of tokens for both understanding and generation.
  • BAGEL: The Open-Source Unified Multimodal Model (ByteDance)
    https://bagel-ai.org/
    An open-source SOTA model that supports generation and editing with a unified architecture.

Practical information

  • General public
  • Free

Contact

  • edic@epfl.ch

Tags

EDIC candidacy exam
