Practical and Efficient Near-Data Processing for In-Memory Analytics

Event details
Date | 10.03.2016 |
Hour | 13:30 - 15:00 |
Speaker |
Mingyu Gao, Ph.D. candidate in the Department of Electrical Engineering, Stanford University Bio : Mingyu Gao is a Ph.D. candidate in the Department of Electrical Engineering at Stanford University. His research interests are computer architecture and systems. He currently works with Professor Christos Kozyrakis in the Multi-scale Architecture & Systems Team (MAST), investigating energy-efficient memory systems and accelerators for analytics applications and datacenter services. Specifically, his work focuses on efficient and practical near-data processing for DRAM-based memory systems, low-power high-density reconfigurable acceleration fabrics, and system integration of non-volatile memory technologies. Mingyu received his Master of Science degree in Electrical Engineering from Stanford University in June 2014. Before coming to Stanford, he received his Bachelor of Science degree in Microelectronics from Tsinghua University, Beijing, China, in June 2012. |
Location | |
Category | Conferences - Seminars |
Abstract :
The end of Dennard scaling has made all systems energy-constrained. For data-intensive applications with limited temporal locality, the best way to optimize energy is to place processing near the data in main memory, an approach known as near-data processing (NDP). In this talk, we develop the hardware and software support for a practical NDP architecture based on 3D integration. First, focusing on general-purpose cores, we develop simple but scalable hardware support for coherence, communication, and synchronization, and a runtime system that is sufficient to support analytics, graph processing, and deep neural network frameworks with complex data patterns while hiding all the details of the NDP hardware. We also investigate the balance between processing and memory throughput, scalability, and the importance of software optimization for spatial locality. This NDP architecture provides up to 16x performance and energy advantage over conventional approaches, and 2.5x over recently proposed NDP systems. Next, we focus on the processing elements in the NDP stack. Processing elements based on reconfigurable logic have been proposed as a compromise between the efficiency of custom engines and the flexibility of programmable cores. Unfortunately, conventional FPGAs and CGRAs incur significant area and power overheads, respectively. We develop Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that combines coarse-grained and fine-grained logic blocks and separates routing networks for data and control signals. HRL has the power efficiency of an FPGA and the area efficiency of a CGRA. It improves performance per Watt by 2.2x over FPGA and 1.7x over CGRA, and achieves 92% of the peak performance of an NDP system based on custom accelerators.
Refreshments will be available before the talk, from 1:15 pm.
Practical information
- Informed public
- Free
- This event is internal
Organizer
- EcoCloud
Contact
- Ousmane Diallo