ECE Seminar (ECE 2002A/ECE 8002A)
Wednesday, January 31, 2018
11:15am - 12:05pm
Speaker: Dr. Patrick Widener -- Sandia National Labs
Speaker's Title: Principal Member of the Technical Staff
Seminar Title: Understanding the Performance Effects of Resilience Mechanisms in High-Performance Computing Applications
Fault tolerance poses a major challenge for future large-scale high-performance computing (HPC) systems and the important applications running on them. Alarming projections of high failure rates, driven by the increasing scale and complexity of HPC systems, have over the past few years motivated significant research into methods and techniques for providing resilience while maintaining scalability in such systems. Our group at Sandia National Laboratories has worked to develop insights into the selection and tuning of these methods and techniques. In this talk, I will describe our simulation-based framework for analyzing the performance effects of resilience activity. I will also present some recent research results obtained using our framework and discuss how those results have contributed to our understanding of the performance implications of resilience strategies for HPC applications.
Patrick Widener is a Principal Member of Technical Staff in the Center for Computing Research at Sandia. Dr. Widener’s research interests include the design and development of system software to support large-scale data-centric computational science, tools for examining performance interference caused by in-situ analytics, and software architectures for describing and exchanging data in computational science workflows. He is also a Research Associate Professor in the Computer Science Department at the University of New Mexico and, prior to joining Sandia, was a research faculty member in the Department of Biomedical Informatics at Emory University. He holds a Ph.D. in Computer Science from the Georgia Institute of Technology.
Last revised May 23, 2018