Challenges in Address Translation for Next-Generation Heterogeneous Manycore Systems
As systems run workloads with ever-increasing memory footprints and incorporate large amounts of on-die heterogeneity, it becomes crucial to maintain programmer productivity while lowering memory access overheads. In particular, the hardware and software stack of the virtual memory system becomes a critical bottleneck because: (a) growing memory footprints and larger last-level caches place heavy pressure on CPU memory management units (TLBs, MMU caches, and page table walkers); and (b) on-chip accelerators require MMU support to enable a programming model with unified address spaces, at the risk of degraded performance. In response, this talk will focus on hardware/software techniques that leverage operating system page-allocator patterns to increase TLB and MMU cache reach with modest hardware changes. We will show that intelligently tracking OS allocation patterns makes it possible to exploit "intermediate contiguity" between the baseline page size and large pages. We will then design a first-cut MMU for GPUs, the most mature acceleration technology available today. The overall lessons from this work will show how to design next-generation MMUs for heterogeneous chips running workloads with large, multidimensional datasets.
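To make the "intermediate contiguity" idea concrete, the toy model below sketches how runs of base pages that happen to be contiguous in both virtual and physical address space can be covered by a single coalesced translation entry, even when the run is far smaller than a full large page. The function name and representation here are illustrative assumptions for exposition, not the hardware design presented in the talk.

```python
def coalesce(mappings):
    """Group (vpn, pfn) base-page mappings into maximal contiguous runs.

    Returns a list of (base_vpn, base_pfn, num_pages) tuples; each tuple
    models one coalesced TLB entry whose reach is num_pages base pages.
    """
    runs = []
    for vpn, pfn in sorted(mappings):
        # Extend the current run only if this page is contiguous in
        # both the virtual and the physical address space.
        if runs and vpn == runs[-1][0] + runs[-1][2] and pfn == runs[-1][1] + runs[-1][2]:
            base_vpn, base_pfn, count = runs[-1]
            runs[-1] = (base_vpn, base_pfn, count + 1)
        else:
            runs.append((vpn, pfn, 1))
    return runs

# Five base-page mappings collapse into two coalesced entries:
# pages 0-2 are contiguous in both spaces, as are pages 5-6.
print(coalesce([(0, 10), (1, 11), (2, 12), (5, 30), (6, 31)]))
```

In this model, five TLB entries shrink to two, which is the reach gain the talk attributes to tracking OS allocation patterns: buddy-style allocators often hand out physically contiguous runs even when no 2 MiB large page is formed.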