Rethinking GPU Execution Model
Graphics processing units (GPUs) have become the architecture of choice for achieving high throughput in general-purpose computing. Thread-level parallelism (TLP) on GPUs is realized by concurrently executing a large number of threads. However, GPUs often cannot reach their theoretical peak performance. I found that the critical performance bottlenecks on GPUs are 1) limited memory system performance and 2) limited thread scheduling resources and register file capacity. In this talk, I will first describe the GPU execution model and examine these two bottlenecks in detail. Then, I will introduce two solutions that address these challenges. First, I will present a new GPU architecture, called Adaptive PREfetching and Scheduling (APRES), which overcomes the limited memory system performance by improving cache efficiency on GPUs. Second, I will present another work, called FineReg, which enables thread scheduling beyond the limits of the scheduling resources and register file on GPUs.