Scaling database systems to high-performance computers

Event details
Date | 23.04.2018 |
Hour | 14:00 › 15:00 |
Speaker | Spyros Blanas |
Location | |
Category | Conferences - Seminars |
Processing massive datasets quickly requires warehouse-scale computers. Furthermore, many massive datasets are multi-dimensional arrays stored in formats such as HDF5 and NetCDF, which cannot be queried directly using SQL. Parallel array database systems like SciDB cannot scale in this environment, which offers fast networking but very limited I/O bandwidth to shared, cold storage: merely loading multi-TB array datasets into SciDB would take days, an unacceptably long time for many applications.
In this talk, we will present ArrayBridge, a common interoperability layer for array file formats. ArrayBridge allows scientists to use SciDB, TensorFlow and HDF5-based code in the same file-centric analysis pipeline without converting between file formats. Under the hood, ArrayBridge manages I/O to leverage the massive concurrency of warehouse-scale parallel file systems without modifying the HDF5 API or breaking backwards compatibility with legacy applications. Once the data has been loaded into memory, the bottleneck in many array-centric queries becomes the speed of data repartitioning between nodes. We will present an RDMA-aware data shuffling abstraction that communicates directly with the network adapter using InfiniBand verbs and can repartition data up to 4x faster than MPI. We will conclude by highlighting research challenges that need to be overcome for data processing to scale to warehouse-scale computers.
Practical information
- General public
- Free
Organizer
- Prof. Anastasia Ailamaki
Contact
- Dimitra Tsaoussis-Melissargos