Enabling Efficient Communication in Large Heterogeneous Processors


Event details

Date and time 10.03.2015 16:15-17:30
Place and room
Speaker Brad Beckmann, member of AMD Research in Bellevue, WA, USA.
Category Conferences - Seminars
Graphics processing units (GPUs) provide tremendous throughput with outstanding performance-to-power ratios when executing data-parallel code. Meanwhile, CPUs remain the best at executing sequential control code, so most current designs integrate both types of devices into the same processor. To allow programmers to fully leverage their diverse computational power, these integrated CPU/GPU designs must be architected in a cohesive and synergistic manner. In that vein, our research builds upon the recently published Heterogeneous System Architecture (HSA) specification, which provides (among other things) a system architecture where all devices within a node (e.g., CPU, GPU, and other accelerators) share a single, unified virtual memory space. This allows applications to be written so that CPU and GPU code can freely exchange pointers without expensive memory transfers over PCIe, marshalling of data structures, or complicated device-specific memory allocation.

This talk will discuss our research that enables efficient communication across large heterogeneous systems. In particular, I will describe a set of solutions that localize communication and synchronization within an HSA-compatible heterogeneous processor. These solutions include a novel hardware mechanism, called QuickRelease, that enables GPU memory systems to efficiently support fine-grain load-acquire/store-release synchronization between GPU threads without sacrificing throughput. The solutions also include a set of memory consistency models, called Heterogeneous-Race-Free (HRF) memory models, that provide programmers with a well-defined framework to reason about large on-chip memory systems. Finally, I will introduce a new synchronization primitive, called remote scope promotion, that allows programmers to more frequently use lower-latency localized synchronization rather than longer-latency global synchronization.

Practical information

  • Informed public
  • Free


  • Babak Falsafi


  • Séphanie Baillargues
