DB Seminar: Platforms and Applications for “Big and Fast” Data Analytics

Event details
Date | 07.11.2014 |
Hour | 12:30 - 13:30 |
Speaker | Prof. Yanlei Diao http://people.cs.umass.edu/~yanlei/ |
Location | |
Category | Conferences - Seminars |
Recently there has been significant interest in building big data systems that can handle not only “big data” but also “fast data” for analytics. Our work is strongly motivated by recent real-world case studies that point to the need for a general, unified data processing framework to support analytical queries with different latency requirements. Towards this goal, our project is designed to transform the popular MapReduce computation model, originally proposed for batch processing, into a model for distributed (near) real-time processing.
In this talk, I start by examining the widely used Hadoop system and presenting a thorough analysis of the causes of high latency in Hadoop. I then present a number of necessary architectural changes, as well as new resource configuration and optimization techniques, to meet user-specified latency requirements while maximizing throughput. Experiments using typical workloads in click stream analysis and Twitter feed analysis show that our techniques reduce the latency from tens or hundreds of seconds in Hadoop to sub-second in our system, with a 2x-7x increase in throughput. Our system also outperforms state-of-the-art distributed stream systems, Twitter Storm and Spark Streaming, by a wide margin. Finally, I show some initial results and challenges of supporting big and fast data analytics in the emerging domain of genomics.
Practical information
- Informed public
- Free
Organizer
- Prof. Anastasia Ailamaki
- Prof. Christoph Koch
Contact
- Dimitra Tsaoussis