BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Memento EPFL//
BEGIN:VEVENT
SUMMARY:Machine Learning Training and Deployment in Disaggregated Architec
 tures
DTSTART:20230228T140000
DTEND:20230228T160000
DTSTAMP:20260407T182245Z
UID:f4ba3f25e398dfd1ef5c4e244d30ebfac1f54252f41e16b3ddbc7c88
CATEGORIES:Conferences - Seminars
DESCRIPTION:Diana Andreea Petrescu\nEDIC candidacy exam\nExam president: P
 rof. Jean-Yves Le Boudec\nThesis advisor: Prof. Rachid Guerraoui\nThesis c
 o-advisor: Prof. Anne-Marie Kermarrec\nCo-examiner: Prof. Boi Faltings\n\n
 Abstract\nCloud computing plays an important role in reducing infrastructu
 re costs by replacing on-premises data centers. Cloud providers achieve th
 is cost reduction through economies of scale\, which are attainable with m
 ulti-tenancy and proper resource utilization. One w
 ay of achieving the latter is through disaggregation\, which consists of s
 eparating servers into their constituent resources (computing\, memory and
  storage) and interconnecting them over a network. This way\, each resourc
 e c
 an be allotted as required and independently scaled\, suiting the needs of
  distinct workloads that make disproportionate use of such resources. Thi
 s potentially prevents overwhelming under-provisioned machines or wasting r
 esources on over-provisioned ones\, in addition to reducing providers' tota
 l cost of ownership (TCO). As a consequence\, how
 ever\, extra pressure is put on the network layer\, since it interconnects
  all disaggregated resources.\n\nIn order to leverage the improved resou
 rce utilization brought by disaggregation while preserving or improving ap
 plication performance in comparison to monolithic servers\, one has to min
 imize data movement. Typically\, this is achieved by improving data locali
 ty\, i.e.\, by keeping compute units near the data they process. Notably\,
  the most common approaches for enhancing data locality consist of manipul
 ating either data (e.g.\, prefetching and caching) or code (e.g.\, near-da
 ta processing (NDP) or pushdown). This\, however\, calls for both some co
 mpute capability in the storage tier (e.g.\, GPU along with an array of di
 sks) and some storage capability in the compute tier (e.g.\, disks along w
 i
 th an array of GPUs). Clearly\, applications running on disaggregated clou
 d providers can greatly benefit from these locality-enforcing techniques.\
 n\nMachine learning (ML) processing is a natural fit for cloud deploymen
 t. The reason is that it requires large amounts of both data (hence storag
 e) and computing power. In the case of disaggregation\, i.e.\, when storag
 e is
  decoupled from computing tiers\, one has to decide what pieces of computa
 tion should run where. To tackle this problem\, we have to consider that t
 he internal storage bandwidth (i.e.\, between durable storage and CPU) is 
 far larger than the network bandwidth that connects the storage and comput
 e tiers. A naive solution would therefore push down all computations to th
 e clo
 ud object storage (COS) and send only the result back to the requesting cl
 ient. The problem is that this defeats the very benefits of disaggregating
  servers in the first place. Such an approach would quickly saturate the co
 mputing resources of the storage tier\, which are not optimized for large 
 processing jobs\, and hence hamper both the application processing time an
 d the experience of other COS users in a multi-tenant environment. At the 
 other end\, i.e.\, the computing tier\, one could think of prefetching dat
 a that is about to be used and caching it in case it is likely to be reuse
 d in the near future. Again\, the converse problem arises: as computing ma
 chines are not optimized for storing large amounts of data\, one would rap
 idly exhaust their storage capacity.\n\nThus\, there is a need for smart
  solutions that decide how to split an ML computation between the COS and 
 the compute tier\, taking into account the generality of ML tasks and the 
 concurrency and privacy aspects of the system.\n\nBackground papers\n\n	Yi
 ping Kang\, Johann Hauswald\, Cao Gao\, Austin Rovinski\, Trevor Mudge\, J
 ason Mars\, and Lingjia Tang. Neurosurgeon: Collaborative intelligence bet
 ween the cloud and mobile edge. ACM SIGARCH Computer Architecture News\, 4
 5(1):615–629\, 2017.\n	https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R244
 _2019_2020/papers/kang_asplos_2017.pdf \n	Yifei Yang\, Matt Youill\, Matt
 hew Woicik\, Yizhou Liu\, Xiangyao Yu\, Marco Serafini\, Ashraf Aboulnaga\
 , and Michael Stonebraker. FlexPushdownDB: Hybrid pushdown and caching in
  a cloud DBMS. Proceedings of the VLDB Endowment\, 2021.\n	https://ashraf.a
 boulnaga.me/pubs/pvldb21flexpushdowndb.pdf \n	Changho Hwang\, Taehyun Kim
 \, Sunghyun Kim\, Jinwoo Shin\, and KyoungSoo Park. Elastic resource shari
 ng for distributed deep learning. In 18th USENIX Symposium on Networked Sy
 stems Design and Implementation (NSDI 21)\, pages 721–739. USENIX Associ
 ation\, April 2021. https://www.usenix.org/system/files/nsdi21-hwang.pdf 
 \n
LOCATION:
STATUS:CONFIRMED
END:VEVENT
END:VCALENDAR
