Data-Partitioning for Stream Processing Systems

Event details
Date | 17.06.2021
Hour | 10:00 - 12:00
Speaker | Eleni Zapridou
Category | Conferences - Seminars
EDIC candidacy exam
exam president: Prof. Karl Aberer
thesis advisor: Prof. Anastasia Ailamaki
co-examiner: Prof. Anne-Marie Kermarrec
Abstract
Streaming applications have two, often conflicting, requirements: low latency and high throughput. The tuple-at-a-time architecture prioritizes minimizing latency, while the micro-batch model optimizes for throughput. To address both requirements, parallel stream processing engines have been developed. However, the data distribution in streaming workloads changes in real time and can be highly skewed. Naive data partitioning in this setting leaves one of the parallel workers overloaded, and that worker then determines the system's execution time. To express the performance of different partitioning algorithms, prior work has formalized the optimization objectives. We consider more complex tasks that cannot be expressed with the existing models.
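The effect of skew on naive partitioning can be illustrated with a minimal sketch. It is not taken from the talk or the background papers: the stream contents, the worker count, the use of CRC32 as the hash function, and the max-over-mean imbalance metric are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter

def key_partition(key, num_workers):
    """Route a tuple by hashing its key, so every tuple of a key reaches one worker."""
    # CRC32 is an illustrative, deterministic stand-in for the engine's hash function.
    return zlib.crc32(key.encode("utf-8")) % num_workers

def worker_loads(assignments, num_workers):
    """Count how many tuples each worker receives."""
    counts = Counter(assignments)
    return [counts.get(w, 0) for w in range(num_workers)]

def imbalance(loads):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))

NUM_WORKERS = 4  # hypothetical degree of parallelism

# Hypothetical skewed stream: one hot key carries 700 of 1000 tuples.
stream = ["hot"] * 700 + [f"k{i}" for i in range(300)]

# Key-hash partitioning: the hot key lands entirely on a single worker.
keyed = worker_loads([key_partition(k, NUM_WORKERS) for k in stream], NUM_WORKERS)

# Round-robin (shuffle) partitioning: ignores keys, spreads tuples evenly.
shuffled = worker_loads([i % NUM_WORKERS for i in range(len(stream))], NUM_WORKERS)

print("key-hash loads:   ", keyed, "imbalance:", round(imbalance(keyed), 2))
print("round-robin loads:", shuffled, "imbalance:", round(imbalance(shuffled), 2))
```

In this sketch the worker that receives the hot key holds at least 70% of the tuples, while round-robin balances the load perfectly. Round-robin gives up key locality, however, so a keyed operator would need an extra aggregation step to combine partial results.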
Background papers
- Apache Flink: Stream Analytics at Scale https://www.researchgate.net/publication/305869785_Apache_Flink_Stream_Analytics_at_Scale
- A Holistic View of Stream Partitioning Costs http://www.vldb.org/pvldb/vol10/p1286-katsipoulakis.pdf
- Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems https://www.cs.purdue.edu/homes/aref/papers/sigmod2020.pdf
Practical information
- General public
- Free
Organizer
- EDIC