RCP Workshop: Advanced Deep Learning with PyTorch and NVIDIA Ecosystem


Event details

Date 29.09.2025 – 30.09.2025
Hour 13:00 – 17:00
Category Internal trainings
Event Language English

This event is organized by the Research Computing Platform (RCP), which provides campus-wide IT infrastructure for the research community.

This workshop, held over two half-days, is designed for practitioners and researchers eager to deepen their understanding of advanced deep learning concepts, tools, and best practices using PyTorch, NVIDIA NeMo, CUDA, and other state-of-the-art frameworks. The agenda blends lectures, demonstrations, and collaborative discussions to provide a thorough exploration of generative AI, parallelism, checkpointing, resiliency, and model deployment across multi-GPU and multi-node environments.

! Registration is required to attend this workshop. Access is restricted to EPFL participants. !

Registration form (restricted to EPFL email addresses): https://forms.office.com/e/g0sPMJzm2h

The two half-day sessions cover different content, and we strongly encourage participants to attend both sessions for the full learning experience.


AGENDA

Day 1: Sept 29 Afternoon (Half Day) - Room SV 1717

13:00 – 13:15 | Registration and Welcome

  • Participant check-in
  • Workshop objectives and introductions
13:15 – 13:30 | Welcome talk, by Prof. Martin Jaggi (EPFL)

13:30 – 14:00 | Fundamentals of GPU Architecture and Accelerated Computing 
  • Introduction to modern GPU architectures
  • Understanding memory hierarchies, bandwidth, and compute units
  • Overview of CUDA C/C++ and CUDA Python (see the sketch after this list)
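
As a small taste of the CUDA Python topic above, here is a minimal vector-addition kernel written with Numba. Numba is our assumption for illustration; the session may use other CUDA Python tooling, and the sizes below are arbitrary.

```python
# Minimal CUDA Python sketch using Numba (illustrative assumption, not workshop material).
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global thread index
    if i < out.size:          # guard against threads past the end of the arrays
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)   # Numba moves host arrays to and from the GPU
```
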
14:00 – 15:00 | Fundamentals of Deep Learning 
  • An Introduction to Deep Learning
  • How a Neural Network Trains
  • Data Augmentation 
  • Pre-Trained Models
  • Generative AI
15:00 – 15:15 | Towards a Graph Foundation Model for Digital Pathology – by Sevda Ögüt (EPFL LTS4 PhD Student)
  • An introduction to our project on large-scale self-supervised pre-training with graphs for histopathology.
15:30 – 16:30 | Data Parallelism: Training Deep Learning Models on Multiple GPUs 
  • Concepts of data parallelism and distributed computing
  • Implementing data-parallel strategies with PyTorch’s DDP (Distributed Data Parallel); see the sketch after this list
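
To give a feel for what the DDP topic covers, below is a minimal single-node sketch; the toy model, data, and `torchrun` launch are placeholders of our own, not the workshop's actual exercise.

```python
# Minimal PyTorch DDP sketch (launch with: torchrun --nproc_per_node=<num_gpus> ddp_demo.py).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-2)

    for step in range(10):                             # stand-in for a real dataloader loop
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()                                # gradients are all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
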
16:30 – 17:30 | Model Parallelism and Large Model Deployment
  • Scaling and parallelizing large neural networks
  • Techniques for managing large-model memory footprints and optimizing training performance
  • Introduction to model parallelism frameworks
  • Example: training a transformer model with model parallelism (a minimal placement sketch follows below)
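
As a rough illustration of the model-parallel idea (not the frameworks the session introduces), the sketch below splits a small network across two GPUs by hand; the layer sizes and the two-GPU setup are assumptions for illustration.

```python
# Naive model-parallel sketch: two halves of a network on two GPUs (illustrative only).
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half lives on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))    # activations move between devices

model = TwoGPUModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(16, 1024)
y = torch.randint(0, 10, (16,), device="cuda:1")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                              # autograd tracks the cross-device graph
optimizer.step()
```
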
17:30 | End of Day 1 (Half Day)
  • Recap and open Q&A
  • Preview of the Day 2 session

Day 2: Sept 30 Afternoon (Half Day) - Room BC 420

13:00 – 13:15 | Welcome Back and Review
  • Summary of Day 1 key learnings
  • Outline of Day 2 agenda

13:15 – 13:30 | Curating legally compliant and transparent training data at scale: insights from SwissAI's Apertus data collection and preparation - by Sven Najem-Meyer (EPFL PhD)

This presentation explores how SwissAI addresses legal compliance and data transparency in training the Apertus LLM. It outlines the challenges encountered and the methodologies applied to curate regulation-aligned datasets, as well as the tools used to enable efficient parallel data preprocessing.

13:30 – 14:00 | Generative AI with Diffusion Models
  • Introduction to generative AI concepts and applications
  • Exploring diffusion models and their significance
  • Overview of NVIDIA’s generative AI tools
14:00 – 14:45 | Building Transformer-Based Natural Language Processing Pipelines 
  • Advanced NLP techniques
  • Best practices for training and fine-tuning large language models
  • Best practices for optimization: speed and memory (a mixed-precision sketch follows below)
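
One concrete speed-and-memory optimization in this area is automatic mixed precision; the sketch below shows the standard PyTorch AMP pattern on a placeholder model (our own stand-in, not workshop material).

```python
# Automatic mixed precision (AMP) sketch: lower-precision math for speed and memory savings.
import torch

model = torch.nn.Linear(512, 512).cuda()        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()            # rescales the loss to avoid fp16 underflow

for step in range(10):                          # stand-in training loop
    x = torch.randn(64, 512, device="cuda")
    target = torch.randn(64, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # run the forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
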
14:45 – 15:00 | Coffee Break
  • Snacks and networking
15:00 – 15:15 | Protein Design on RCP – by Julius Wenckstern (EPFL PhD Student)
  • How GPU-accelerated, large-scale protein design can help us learn more about biology.
15:15 – 16:00 | Checkpointing & Resiliency: Concepts, Strategies, and Frameworks
  • Deep dive into checkpointing and resiliency for model recovery and efficient training
  • Overview of checkpointing and resiliency tools in PyTorch and NVIDIA frameworks
  • Example of robust checkpointing based on NeMo and PyTorch (a minimal PyTorch sketch follows below)
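
For flavor, here is what basic checkpointing looks like in plain PyTorch (the NeMo-based example in the session goes further); the file name and saved fields are illustrative assumptions.

```python
# Minimal checkpoint save/resume sketch in plain PyTorch (illustrative only).
import torch

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def save_checkpoint(path, step):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(path):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]                      # step to resume training from

save_checkpoint("ckpt.pt", step=100)         # e.g. every N steps during training
resume_step = load_checkpoint("ckpt.pt")     # on restart, continue from the saved step
```
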
16:00 – 16:30 | Scaling CUDA Applications to Multiple Nodes
  • Multi-GPU and Multi-Node Programming: Frameworks and Libraries
  • Considerations for scaling across GPU clusters
  • Profiling and performance optimization strategies
16:30 – 17:00 | Closing Remarks, Q&A, and Next Steps
  • Open discussion and feedback
  • Meet the experts: talk 1:1 or 1:N about your projects and challenges
  • Resources for continued learning – DLI, teacher kit, ambassador program (Cristel)
  • Certificate distribution and farewell

! Registration is required to attend this workshop. Access is restricted to EPFL participants. !

Practical information

  • Expert
  • Registration required
  • This event is internal

Tags

Deep Learning PyTorch NVIDIA Generative AI Distributed Training
