Fault-Tolerance Techniques for High-Performance Computing

Nonfiction, Computers, Computer Hardware, Input-Output Equipment, General Computing
Author:
ISBN: 9783319209432
Publisher: Springer International Publishing
Publication: July 1, 2015
Imprint: Springer
Language: English

This timely text presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). It opens with a detailed introduction to checkpoint protocols and scheduling algorithms, prediction, replication, and silent error detection and correction, together with application-specific techniques such as algorithm-based fault tolerance (ABFT), with an emphasis on analytical performance models. A review of general-purpose techniques follows, including several checkpoint and rollback-recovery protocols. Relevant execution scenarios are evaluated and compared through quantitative models.

Features: provides a survey of resilience methods and performance models; examines the various sources of errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the energy-consumption challenge that fault-tolerance methods pose in extreme-scale systems.
