RCP Workshop: Advanced Deep Learning with PyTorch and NVIDIA Ecosystem
Event details
| Date | 29.09.2025 – 30.09.2025 |
| Hour | 13:00 – 17:00 |
| Category | Internal trainings |
| Event Language | English |
This event is organized by the Research Computing Platform (RCP), which provides campus-wide IT infrastructure for the research community.
This workshop, held over two half-day sessions, is designed for practitioners and researchers eager to deepen their understanding of advanced deep learning concepts, tools, and best practices using PyTorch, NVIDIA NeMo, CUDA, and other state-of-the-art frameworks. The agenda blends lectures, demonstrations, and collaborative discussions into a thorough exploration of generative AI, parallelism, checkpointing, resiliency, and model deployment across multi-GPU and multi-node environments.
! Registration is required to attend this workshop. Access is restricted to EPFL participants. !
Registration form (restricted to EPFL email addresses): https://forms.office.com/e/g0sPMJzm2h
The two half-day sessions cover different content, and we strongly encourage participants to attend both sessions for the full learning experience.
AGENDA
Day 1: Sept 29 Afternoon (Half Day) - Room SV 1717
13:00 – 13:15 | Registration and Welcome
- Participant check-in
- Workshop objectives and introductions
13:30 – 14:00 | Fundamentals of GPU Architecture and Accelerated Computing
- Introduction to modern GPU architectures
- Understanding memory hierarchies, bandwidth, and compute units
- Overview of CUDA C/C++ and CUDA Python
- An Introduction to Deep Learning
- How a Neural Network Trains
- Data Augmentation
- Pre-Trained Models
- Generative AI
- An introduction to our project on large-scale self-supervised pre-training with graphs for histopathology.
- Concepts of data parallelism and distributed computing
- Implementing data-parallel strategies with PyTorch's DistributedDataParallel (DDP) – a minimal sketch follows the Day 1 agenda
- Scaling and parallelizing large neural networks
- Techniques for managing large-model memory footprints and optimizing training performance
- Introduction to model parallelism frameworks
- Example: a transformer model using model parallelism
- Recap and open Q&A
- Preview of the Day 2 session
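For participants who want a concrete reference point before the data-parallelism material, here is a minimal DistributedDataParallel training sketch. It is a generic illustration, not the workshop's actual exercise: the toy linear model, random data, and hyperparameters are placeholders.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Toy model: every process holds a full replica of the parameters
    model = DDP(nn.Linear(10, 1).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(100):
        # Each rank trains on its own shard of the data; DDP averages
        # the gradients across ranks automatically during backward()
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=<num_gpus> <script>.py`, torchrun starts one process per GPU and sets the rank environment variables the script reads.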
Day 2: Sept 30 Afternoon (Half Day) - Room BC 420
13:00 – 13:15 | Welcome Back and Review
- Summary of Day 1 key learnings
- Outline of Day 2 agenda
13:15 – 13:30 | Curating legally compliant and transparent training data at scale: insights from SwissAI's Apertus data collection and preparation - by Sven Najem-Meyer (EPFL PhD)
This presentation explores how SwissAI addresses legal compliance and data transparency in training the Apertus LLM. It outlines the challenges encountered and the methodologies applied to curate regulation-aligned datasets, as well as the tools used to enable efficient parallel data preprocessing.
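As generic background for this talk, parallel data preprocessing typically means mapping an independent cleaning function over documents across worker processes. The sketch below is a hypothetical illustration (`clean_document` is a placeholder, not SwissAI's actual pipeline, which also addresses deduplication, licensing, and PII filtering):

```python
from multiprocessing import Pool

def clean_document(text: str) -> str:
    # Hypothetical per-document step: normalize whitespace
    return " ".join(text.split())

if __name__ == "__main__":
    docs = ["raw  text  one", "raw\ttext two", "  raw text three "]
    with Pool(processes=4) as pool:
        cleaned = pool.map(clean_document, docs)  # one document per worker task
    print(cleaned)
```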
13:30 – 14:00 | Generative AI with Diffusion Models
- Introduction to generative AI concepts and applications
- Exploring diffusion models and their significance – a forward-diffusion sketch follows the Day 2 agenda
- Overview of NVIDIA’s generative AI tools
- Advanced NLP techniques
- Best practices for training and fine-tuning large language models
- Best practices for optimization: speed and memory
- Snacks and networking
- How GPU-accelerated, large-scale protein design can help us learn more about biology.
- Deep dive into checkpointing and resiliency for model recovery and efficient training
- Overview of checkpointing and resiliency tools in PyTorch and NVIDIA frameworks
- Example of robust checkpointing with NeMo and PyTorch – a minimal sketch follows the Day 2 agenda
- Multi-GPU and Multi-Node Programming: Frameworks and Libraries
- Considerations for scaling across GPU clusters
- Profiling and performance optimization strategies
- Open discussion and feedback
- Meet the experts. Talk 1:1 or 1:N about your projects and challenges.
- Resources for continued learning – DLI, teacher kit, ambassador program (Cristel)
- Certificate distribution and farewell
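As a reference point for the diffusion-models session, the sketch below implements only the forward (noising) process of a DDPM-style model in PyTorch; the schedule and shapes are illustrative, and the denoising network that would learn to predict the noise is omitted.

```python
import torch

# Linear beta schedule, a common simple choice (DDPM uses T = 1000 steps)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    # Forward diffusion in closed form:
    # q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

x0 = torch.randn(8, 3, 32, 32)   # toy batch standing in for images
t = torch.randint(0, T, (8,))    # a random timestep per sample
xt, eps = q_sample(x0, t)        # a denoising network would learn to predict eps
```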
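And ahead of the checkpointing and resiliency session, a minimal sketch of plain PyTorch checkpointing; NeMo and torch.distributed.checkpoint build distributed, resilient variants on the same idea, and the function names here are illustrative.

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist everything needed to resume training exactly where it stopped
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"] + 1  # resume from the step after the saved one
```

Saving the optimizer state alongside the model weights is what makes exact resumption after a failure possible.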
Practical information
- Expert
- Registration required
- This event is internal