An Architecture for Fast and General Data Processing on Large Clusters

Nonfiction, Computers, Advanced Computing, Parallel Processing, Engineering, Computer Architecture, General Computing
Author: Matei Zaharia
ISBN: 9781970001587
Publisher: Association for Computing Machinery and Morgan & Claypool Publishers
Publication: May 1, 2016
Imprint: ACM Books
Language: English

The past few years have seen a major change in computing systems, as growing data volumes and stalling processor speeds require more and more applications to scale out to clusters. Today, myriad data sources, from the Internet to business operations to scientific instruments, produce large and valuable data streams. However, the processing capabilities of single machines have not kept up with the size of data. As a result, organizations increasingly need to scale out their computations over clusters.

At the same time, the speed and sophistication required of data processing have grown. In addition to simple queries, complex algorithms like machine learning and graph analysis are becoming common. And in addition to batch processing, streaming analysis of real-time data is required to let organizations take timely action. Future computing platforms will need to not only scale out traditional workloads, but support these new applications too.

This book, a revised version of the 2014 ACM Dissertation Award-winning dissertation, proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also enables streaming and interactive queries, while keeping MapReduce's scalability and fault tolerance. And whereas most deployed systems only support simple one-pass computations (e.g., SQL queries), ours also extends to the multi-pass algorithms required for complex analytics like machine learning. Finally, unlike the specialized systems proposed for some of these workloads, our architecture allows these computations to be combined, enabling rich new applications that intermix, for example, streaming and batch processing.

We achieve these results through a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs). We show that this is enough to capture a wide range of workloads. We implement RDDs in the open-source Spark system, which we evaluate using synthetic and real workloads. Spark matches or exceeds the performance of specialized systems in many domains, while offering stronger fault tolerance properties and allowing these workloads to be combined. Finally, we examine the generality of RDDs from both a theoretical modeling perspective and a systems perspective.

This version of the dissertation makes corrections throughout the text and adds a new section on the evolution of Apache Spark in industry since 2014. It has also been edited and reformatted, and links have been added to the references.
