Memory Processing Units
3D die-stacking of logic and DRAM provides a unique opportunity to revisit the idea of in-memory processing and to eliminate decades of ``inefficient glue'' that we have built to bridge memory and processing, such as multi-level cache hierarchies, out-of-order (OOO) execution, deep pipelining, and speculative execution. Compared to conventional DRAM, 3D die-stacked DRAM (embodied by standards like HMC and HBM) offers almost an order-of-magnitude improvement in bandwidth and latency between logic and memory, along with significant power reductions. In this talk I will cover our work on a new architecture called Memory Processing Units (MPU), which is built on two key ideas. On the programming and execution model side, we propose memory remote procedure calls to offload entire pieces of computation to a memory+processing unit. On the hardware side, we argue that non-speculative, low-frequency, ultra-short-pipeline processing cores with small, energy-efficient caches, integrated closely with memory, provide efficient processing. Across a wide range of workloads spanning SQL database processing, networking, and internet search, we show that the MPU model handily outperforms conventional processors and emerging low-power ARM servers. Performance improvements range from 1.9X to 2.7X, with energy savings ranging from 6.5X to 18X.