BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Memento EPFL//
BEGIN:VEVENT
SUMMARY:Scalable microsecond recovery for microsecond RDMA applications
DTSTART:20210614T130000
DTEND:20210614T150000
DTSTAMP:20260407T064156Z
UID:bfbd60caaeb76d0e8c7bc28e3c1e7ca9eee96cad50c3d3ab1800b073
CATEGORIES:Conferences - Seminars
DESCRIPTION:Antoine Murat\nEDIC candidacy exam\nexam president: Prof. Edou
 ard Bugnion\nthesis advisor: Prof. Rachid Guerraoui\nco-examiner: Prof. Br
 yan Ford\n\nAbstract\nRemote Direct Memory Access (RDMA) is a network tech
 nology that allows user space programs to access the memory of a remote ma
 chine without involving the distant CPU. Coupled with a high performance f
 abric such as InfiniBand\, it allows machines to communicate at the micros
 econd scale by bypassing the kernel\, moving the network stack to the hard
 ware\, and directly modifying the remote L3 cache. RDMA gained a lot of tr
 action over the past decade and is becoming prevalent within the data cent
 er space.\nRecent works have demonstrated how to build moderate-scale syst
 ems that leverage RDMA to achieve orders of magnitude improvements in both
  latency and throughput over systems relying on traditional networking sta
 cks. Nevertheless\, while the common failure-free path has been vastly imp
 roved\, recovery is often overlooked and barely takes advantage of new har
 dware capabilities.\nAs RDMA deployments continue to scale on both the ser
 ver and client sides\, failures are expected to become the common case and
  cannot be
  neglected anymore.\nThis thesis will revisit the design of state-of-the-a
 rt RDMA systems to bring down recovery time to the microsecond scale\, and
  thus achieve increased availability and shortened tail latency.\nTo provi
 de fast failover\, all components on the recovery path will have to be rew
 orked from the ground up for the microsecond scale\, ranging from failure
  dete
 ction to re-replication\, including leases.\nThose rethought abstractions 
 will take full advantage of RDMA features such as the M&M paradigm (i.e.\,
  using both message passing and shared memory)\, permissions\, hardware mu
 lticast\, different levels of reliability\, etc.\nThis work will study the
  impact of those microsecond scale components on the fast path of existing
  systems as well as on their overall performance\, and establish the trade
 -offs a system should make as a function of its availability target
 .\nHopefully\, this research will demonstrate how higher availability can 
 be achieved without compromising performance or weakening abstractions by
  fully leveraging RDMA hardware.\n\nBackground papers\n- Design guidelines
  for high performance RDMA systems https://dl.acm.org/doi/10.5555/3026959.
 3027000\n- Microsecond Consensus for Microsecond Applications https://www.
 usenix.org/conference/osdi20/presentation/aguilera\n- Hermes: A Fast\, Fau
 lt-Tolerant and Linearizable Replication Protocol https://dl.acm.org/doi/a
 bs/10.1145/3373376.3378496
LOCATION:
STATUS:CONFIRMED
END:VEVENT
END:VCALENDAR