Transparent Fault Tolerance for Scalable Functional Computation

Event details

Date 26.07.2016
Hour 12:00 - 13:00
Speaker Rob Stewart (Heriot-Watt University, Edinburgh)
Location
Category Conferences - Seminars
Abstract:
Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. We investigate scalable transparent fault tolerance with automatic supervision and recovery of tasks with HdpH-RS, a DSL for fork/join parallelism on HPC architectures. Stateless functions are key to proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. To eliminate elusive concurrency bugs, HdpH-RS's work-stealing protocol has been validated using the SPIN model checker.

HdpH-RS has been benchmarked on conventional clusters and an HPC platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1400 cores on the HPC platform; and reliability and recovery overheads are consistently low even at scale.

Bio:
Rob Stewart is a postdoc at Heriot-Watt University in Edinburgh. His research interests cover parallel functional programming language design and implementation, program transformations, and embedded systems. He has previously developed Haskell libraries for fault-tolerant distributed computing, including six months using CloudHaskell at a start-up company. He is currently developing a parallel image processing DSL for FPGAs, along with an IDE-based transformation toolkit for refactoring dataflow abstractions of FPGA circuits to increase throughput.

Practical information

  • Informed public
  • Free

Contact

  • Host: Martin Odersky

Tags

High-performance Computing, Fault Tolerance, Programming languages
