Efficient gradient coding for mitigating stragglers within distributed machine learning

Event details

Date 24.09.2025
Hour 16:15 – 17:15
Speaker Prof. Aditya Ramamoorthy - Iowa State University
Location
Category Conferences - Seminars
Event Language English

Large-scale distributed learning is the workhorse of modern-day machine learning algorithms. A typical scenario consists of minimizing a loss function (depending on the dataset) with respect to a high-dimensional parameter vector. Workers typically compute gradients on their assigned dataset chunks and send them to the parameter server (PS), which aggregates them to compute either an exact or approximate version of the overall gradient of the relevant loss function. However, in large-scale clusters, many workers are prone to straggling, i.e., they run slower than their promised speed or fail outright. A gradient coding solution introduces redundancy within the assignment of chunks to the workers and uses coding-theoretic ideas to allow the PS to recover the overall gradient (exactly or approximately), even in the presence of stragglers. Unfortunately, most existing gradient coding protocols are inefficient from a computation perspective, as they coarsely classify workers as operational or failed; the potentially valuable work performed by slow workers (partial stragglers) is ignored.
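To make the gradient coding idea concrete, the following is a minimal Python sketch (with hypothetical function and parameter names) of one classical baseline, the fractional-repetition scheme: data chunks are replicated across groups of workers so that, with at most s stragglers, at least one group responds in full and the PS recovers the exact overall gradient by summing that group's messages. This only illustrates the redundancy-plus-decoding principle described above; it is not the more refined protocols that exploit the partial work of slow workers, which are the subject of the talk.

# Minimal sketch of a fractional-repetition gradient code.
# Assumptions: n_workers workers, tolerance of any s stragglers,
# (s + 1) divides n_workers, and the data is split into n_workers chunks.
import numpy as np

def assign_chunks(n_workers: int, s: int):
    """Return the list of chunk indices stored by each worker.

    Workers are split into (s + 1) groups; each group jointly holds all
    n_workers chunks, so every chunk is replicated (s + 1) times.
    """
    assert n_workers % (s + 1) == 0
    group_size = n_workers // (s + 1)
    chunks_per_worker = s + 1
    assignment = []
    for w in range(n_workers):
        pos = w % group_size                 # position of worker within its group
        start = pos * chunks_per_worker
        assignment.append(list(range(start, start + chunks_per_worker)))
    return assignment

def worker_message(worker_chunks, partial_grads):
    """Each worker sends the sum of the gradients of its assigned chunks."""
    return sum(partial_grads[c] for c in worker_chunks)

def decode(n_workers, s, received):
    """received: dict worker_id -> message (missing ids are stragglers).

    With at most s stragglers, at least one of the (s + 1) groups is complete
    (pigeonhole); summing that group's messages yields the full gradient.
    """
    group_size = n_workers // (s + 1)
    for g in range(s + 1):
        members = range(g * group_size, (g + 1) * group_size)
        if all(w in received for w in members):
            return sum(received[w] for w in members)
    raise RuntimeError("more than s stragglers; exact recovery not possible")

# Toy run: n = 6 workers, tolerate s = 2 stragglers, 6 data chunks.
n, s, dim = 6, 2, 4
rng = np.random.default_rng(0)
partial_grads = [rng.standard_normal(dim) for _ in range(n)]   # one gradient per chunk
assignment = assign_chunks(n, s)
received = {w: worker_message(assignment[w], partial_grads)
            for w in range(n) if w not in (1, 5)}              # workers 1 and 5 straggle
full_grad = decode(n, s, received)
assert np.allclose(full_grad, sum(partial_grads))

In this baseline every chunk is computed s + 1 times, which is exactly the kind of computational overhead that more efficient schemes aim to reduce, for instance by also using the partial work completed by slow workers.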

Practical information

  • Informed public
  • Free

Organizer

  • IPG Seminar (Michael Gastpar)