Learning 3D Multi-Modal Visual Representations for Scene-Level Understanding

Event details
Date | 20.08.2024 |
Hour | 15:00 - 17:00 |
Speaker | Jason Toskov |
Location | |
Category | Conferences - Seminars |
EDIC candidacy exam
Exam president: Prof. Mathieu Salzmann
Thesis advisor: Prof. Amir Zamir
Co-examiner: Prof. Alexandre Alahi
Abstract
Humans understand 3D space far better than computational systems, even though computational models can operate directly on explicit 3D representations while humans have access only to two 2D projections of space. Improving the ability of computational models to learn scene-level representations by exploiting this advantage may unlock more human-like abilities for computational agents.
In this proposal we discuss current methods of learning with 3D models, how multi-modal and multi-task learning may improve scene representations, and which modalities are potentially useful for such learning. Finally, we propose a path toward more informed and useful scene-level representations by applying multi-modal and multi-task learning at a much larger scale.
Background papers
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, https://arxiv.org/abs/1612.00593
SUGAR: Pre-training 3D Visual Representations for Robotics, https://arxiv.org/abs/2404.01491
LERF: Language Embedded Radiance Fields, https://arxiv.org/abs/2303.09553
Practical information
- General public
- Free