Multi-Task Scene Representations

Event details
Date | 24.08.2021
Hour | 13:00 – 15:00
Speaker | Roman Bachmann |
Category | Conferences - Seminars |
EDIC candidacy exam
exam president: Prof. Nicolas Boumal
thesis advisor: Prof. Amir Zamir
co-examiner: Prof. Mackenzie Mathis
Abstract
Current supervised and self-supervised representation learning literature focuses heavily on using large-scale classification datasets to train a network to produce image-level features for transfer learning. This raises two questions: does training on classification tasks and datasets really produce the best representations for learning diverse downstream tasks, and why do we transfer from independent image-level features rather than from scene-level representations that aggregate information over time and space? Indeed, there is evidence that no single pre-training task is the best choice for all other visual downstream tasks. We propose to learn scene-level representations by merging image-level representations of multiple diverse tasks over the spatial and temporal dimensions, with the goal of creating powerful visual priors for downstream learning. Using such multi-task priors should improve coverage of the space of features that are useful for visual tasks. Furthermore, scene representations can allow for global and out-of-sight reasoning.
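To make the proposed idea concrete, the sketch below shows one possible way to merge per-frame, multi-task image features into a single scene-level representation. It is only an illustrative assumption of the general scheme described in the abstract (concatenate task features per frame, project them, then pool over time); the module names, shapes, and the simple mean pooling are not the speaker's actual method.

```python
# Illustrative sketch only: aggregate image-level features from several
# task-specific encoders over time into one scene-level representation.
# All names, dimensions, and the mean-pooling choice are assumptions.
import torch
import torch.nn as nn


class SceneAggregator(nn.Module):
    """Merge per-frame features from multiple tasks into a scene representation."""

    def __init__(self, num_tasks: int, feat_dim: int, scene_dim: int):
        super().__init__()
        # Project the concatenated multi-task features of each frame into a shared space.
        self.frame_proj = nn.Linear(num_tasks * feat_dim, scene_dim)

    def forward(self, task_feats: torch.Tensor) -> torch.Tensor:
        # task_feats: (batch, time, num_tasks, feat_dim) - image-level features per task.
        b, t, k, d = task_feats.shape
        frames = self.frame_proj(task_feats.reshape(b, t, k * d))  # (batch, time, scene_dim)
        # Temporal aggregation: mean pooling stands in for any more elaborate fusion.
        return frames.mean(dim=1)  # (batch, scene_dim)


if __name__ == "__main__":
    # Toy example: 2 scenes, 8 frames each, features from 3 tasks
    # (e.g. depth, segmentation, classification), 256-dim each.
    feats = torch.randn(2, 8, 3, 256)
    scene_repr = SceneAggregator(num_tasks=3, feat_dim=256, scene_dim=512)(feats)
    print(scene_repr.shape)  # torch.Size([2, 512])
```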
Background papers
1) Big Transfer (BiT): General Visual Representation Learning. Kolesnikov et al. 2019. https://arxiv.org/abs/1912.11370
2) Neural scene representation and rendering. Eslami et al. 2018. https://storage.googleapis.com/deepmind-media/papers/Neural_Scene_Representation_and_Rendering_preprint.pdf
3) On the Theory of Transfer Learning: The Importance of Task Diversity. Tripuraneni et al. 2020. https://arxiv.org/abs/2006.11650
Practical information
- General public
- Free
Contact
- edic@epfl.ch