RCP Workshop: Advanced Deep Learning with PyTorch and NVIDIA Ecosystem


Event details

Date 29.09.2025 – 30.09.2025
Hour 13:00 – 17:00
Category Internal trainings
Event Language English

This event is organized by the Research Computing Platform (RCP), which provides campus-wide IT infrastructure for the research community.

This workshop, held over two half-days, is designed for practitioners and researchers eager to deepen their understanding of advanced deep learning concepts, tools, and best practices using PyTorch, NVIDIA NeMo, CUDA, and other state-of-the-art frameworks. The agenda blends lectures, demonstrations, and collaborative discussions to provide a thorough exploration of generative AI, parallelism, checkpointing, resiliency, and model deployment across multi-GPU and multi-node environments.

! Registration is required to attend this workshop. Access is restricted to EPFL participants. !

Registration form (restricted to EPFL email addresses): https://forms.office.com/e/g0sPMJzm2h

The two half-day sessions cover different content, and we strongly encourage participants to attend both sessions for the full learning experience.


AGENDA

Day 1: Sept 29 Afternoon (Half Day) - Room SV 1717

13:00 – 13:15 | Registration and Welcome

  • Participant check-in
  • Workshop objectives and introductions
13:15 – 13:30 | Welcome talk, by Prof. Martin Jaggi (EPFL)

13:30 – 14:00 | Fundamentals of GPU Architecture and Accelerated Computing 
  • Introduction to modern GPU architectures
  • Understanding memory hierarchies, bandwidth, and compute units
  • Overview of CUDA C/C++ and CUDA Python (see the sketch after this list)
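
As a small taste of the CUDA Python topic above, here is a minimal vector-addition kernel written with Numba. Numba is our assumption for illustration; the session may use other CUDA Python tooling, and the sizes below are arbitrary.

```python
# Minimal CUDA Python sketch using Numba (illustrative assumption, not workshop material).
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global thread index
    if i < out.size:          # guard against threads past the end of the arrays
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)   # Numba moves host arrays to and from the GPU
```
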
14:00 – 15:00 | Fundamentals of Deep Learning 
  • An Introduction to Deep Learning
  • How a Neural Network Trains
  • Data Augmentation 
  • Pre-Trained Models
  • Generative AI
15:00 – 15:15 | Towards a Graph Foundation Model for Digital Pathology – by Sevda Ögüt (EPFL LTS4 PhD Student)
  • An introduction to our project on large-scale self-supervised pre-training with graphs for histopathology.
15:30 – 16:30 | Data Parallelism: Training Deep Learning Models on Multiple GPUs 
  • Concepts of data parallelism and distributed computing
  • Implementing data-parallel strategies with PyTorch’s DDP (Distributed Data Parallel); see the sketch after this list
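
To give a feel for what the DDP topic covers, below is a minimal single-node sketch; the toy model, data, and `torchrun` launch are placeholders of our own, not the workshop's actual exercise.

```python
# Minimal PyTorch DDP sketch (launch with: torchrun --nproc_per_node=<num_gpus> ddp_demo.py).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-2)

    for step in range(10):                             # stand-in for a real dataloader loop
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()                                # gradients are all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
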
16:30 – 17:30 | Model Parallelism and Large Model Deployment
  • Scaling and parallelizing large neural networks
  • Techniques for managing large-model memory footprints and optimizing training performance
  • Introduction to model parallelism frameworks
  • Example: training a transformer model with model parallelism (a minimal placement sketch follows below)
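
As a rough illustration of the model-parallel idea (not the frameworks the session introduces), the sketch below splits a small network across two GPUs by hand; the layer sizes and the two-GPU setup are assumptions for illustration.

```python
# Naive model-parallel sketch: two halves of a network on two GPUs (illustrative only).
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half lives on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))    # activations move between devices

model = TwoGPUModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(16, 1024)
y = torch.randint(0, 10, (16,), device="cuda:1")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                              # autograd tracks the cross-device graph
optimizer.step()
```
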
17:30 | End of Day 1 (Half Day)
  • Recap and open Q&A
  • Preview of the Day 2 session

Day 2: Sept 30 Afternoon (Half Day) - Room BC 420

13:00 – 13:15 | Welcome Back and Review
  • Summary of Day 1 key learnings
  • Outline of Day 2 agenda

13:15 – 13:30 | Curating legally compliant and transparent training data at scale: insights from SwissAI's Apertus data collection and preparation - by Sven Najem-Meyer (EPFL PhD)

This presentation explores how SwissAI addresses legal compliance and data transparency in training the Apertus LLM. It outlines the challenges encountered and the methodologies applied to curate regulation-aligned datasets, as well as the tools used to enable efficient parallel data preprocessing.

13:30 – 14:00 | Generative AI with Diffusion Models
  • Introduction to generative AI concepts and applications
  • Exploring diffusion models and their significance
  • Overview of NVIDIA’s generative AI tools
14:00 – 14:45 | Building Transformer-Based Natural Language Processing Pipelines 
  • Advanced NLP techniques
  • Best practices for training and fine-tuning large language models
  • Best practices for optimization: speed and memory (a mixed-precision sketch follows below)
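
One concrete speed-and-memory optimization in this area is automatic mixed precision; the sketch below shows the standard PyTorch AMP pattern on a placeholder model (our own stand-in, not workshop material).

```python
# Automatic mixed precision (AMP) sketch: lower-precision math for speed and memory savings.
import torch

model = torch.nn.Linear(512, 512).cuda()        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()            # rescales the loss to avoid fp16 underflow

for step in range(10):                          # stand-in training loop
    x = torch.randn(64, 512, device="cuda")
    target = torch.randn(64, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # run the forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
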
14:45 – 15:00 | Coffee Break
  • Snacks and networking
15:00 – 15:15 | Protein Design on RCP – by Julius Wenckstern (EPFL PhD Student)
  • How GPU-accelerated, large-scale protein design can help us learn more about biology.
15:15 – 16:00 | Checkpointing & Resiliency: Concepts, Strategies, and Frameworks
  • Deep dive into checkpointing and resiliency for model recovery and efficient training
  • Overview of checkpointing and resiliency tools in PyTorch and NVIDIA frameworks
  • Example of robust checkpointing based on NeMo and PyTorch (a minimal PyTorch sketch follows below)
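
For flavor, here is what basic checkpointing looks like in plain PyTorch (the NeMo-based example in the session goes further); the file name and saved fields are illustrative assumptions.

```python
# Minimal checkpoint save/resume sketch in plain PyTorch (illustrative only).
import torch

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def save_checkpoint(path, step):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(path):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]                      # step to resume training from

save_checkpoint("ckpt.pt", step=100)         # e.g. every N steps during training
resume_step = load_checkpoint("ckpt.pt")     # on restart, continue from the saved step
```
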
16:00 – 16:30 | Scaling CUDA Applications to Multiple Nodes
  • Multi-GPU and Multi-Node Programming: Frameworks and Libraries
  • Considerations for scaling across GPU clusters
  • Profiling and performance optimization strategies
16:30 – 17:00 | Closing Remarks, Q&A, and Next Steps
  • Open discussion and feedback
  • Meet the experts: talk 1:1 or 1:N about your projects and challenges
  • Resources for continued learning – DLI, teacher kit, ambassador program (Cristel)
  • Certificate distribution and farewell

! Registration is required to attend this workshop. Access is restricted to EPFL participants. !

Practical information

  • Expert
  • Registration required
  • This event is internal

Tags

Deep Learning PyTorch NVIDIA Generative AI Distributed Training
