IC Talk: Sparse Matrices and High Performance Computing Meet Biology

Event details

Date 18.08.2021
Hour 10:15 - 11:15
Category Conferences - Seminars
Event Language English
By: Giulia Guidi - UC Berkeley

Abstract
The benefit of high-performance computing (HPC) for science has recently grown beyond traditional simulations to data analysis, for example in genomics. Given the vast amount of data and computation involved, such applications can require the full computational power and memory of institutional or agency-wide HPC systems.
One of the most data- and compute-intensive challenges in genomics is de novo genome assembly, i.e., reconstructing an unknown genome from redundant, erroneous genomic sequences. Here we introduce ELBA, the first distributed-memory assembler for long-read sequencing data. ELBA introduces sparse matrices as the main abstraction in this context and makes extensive use of sparse linear algebra computation and probabilistic modeling. ELBA is up to 2x faster on CPU than an algorithm based on distributed hash tables, which are harder to parallelize. ELBA integrates GPU support in the most compute-intensive stages of the pipeline to take advantage of today's heterogeneous HPC hardware.
To ensure that the genomics research community, and others more generally, can benefit from HPC, the development of distributed algorithms such as ELBA must be coupled with efforts to make distributed computing more accessible, as traditional HPC resources are typically reserved for specific research communities and access to them is limited. To this end, we conducted a performance study to investigate the gap between traditional and cloud-based HPC. Until 2018, cloud-based HPC was not an option for most computational sciences due to the lack of a low-latency network. Our results show that this is changing and that cloud-based HPC is proving to be competitive with traditional supercomputing thanks to faster hardware procurement cycles and a significant improvement in network performance.

Bio
Giulia is a PhD candidate in Computer Science at UC Berkeley and a graduate research assistant at the Computational Research Division of Lawrence Berkeley National Laboratory, advised by Aydın Buluç and Kathy Yelick. Giulia is a 2020 SIGHPC Computational & Data Science Fellow and a member of the PASSION Lab, the BeBOp Group, and the Performance and Algorithms Research (PAR) Group. She received her M.Sc. and B.Sc. in Biomedical Engineering from Politecnico di Milano. Giulia’s research focuses on developing a novel algorithm for de novo assembly of genomes in distributed memory using long-read sequencing data, as part of the ExaBiome project, and on making cloud computing more accessible for high-performance scientific computing. Giulia is interested in the intersection of High-Performance Computing (HPC), Computer Systems, and Computational Biology as enabling technologies for faster, high-quality bioinformatics and biomedical research.

Practical information

  • General public
  • Free
  • This event is internal

Contact

  • Host: Jim Larus
