[Editor's note: This is a guest post from Nolan Grace, a software developer and consultant at BP3 who has a passion for data science. This post is shared with permission.]
Distributed computing is definitely the cool kid in the tech world right now. From Amazon shopping recommendations, to Pandora deciding what music you may like, to AlphaGo mastering one of the most complicated board games of all time, distributed computing is at the heart of each solution. It took a lot of blood, sweat, and tears to get big data processing to the point we are at right now, and it is going to take a lot more to get where we are going, but I am as excited as can be for the ride. To fully appreciate how and why the big data ecosystem looks the way it does, you have to sit down and understand the major breakthroughs that were made, and how each step forward introduced a group of challenges just as difficult as the last. In this post I would like to introduce you to two big papers that are important to read and understand if you are interested in getting into big data.
The need for distributed computing arose from the realization that one supercomputer would never be enough and more will always be better. A group of basic computers could outperform the world's most powerful supercomputers at a fraction of the cost, and multiple supercomputers would perform better than one. A solution to this problem, "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, was published by Google researchers in 2004. In this paper the MapReduce design pattern was documented and popularized. MapReduce involves breaking down any complicated operation into a series of simple Map and Reduce tasks. Map is the act of turning each input record into intermediate key/value pairs, which the framework then groups by key and routes to the correct machine in the cluster. Reduce is the act of combining all of the values for a given key on an individual machine. Breaking something complicated down into simple pieces makes it much easier to assign small, generic pieces of work to individual machines across a cluster. This idea can be compared to the time-tested concept of divide and conquer. How did the Romans perform a census? I can tell you they didn't ask one person to count everyone. They split the empire into smaller pieces, and those pieces were split into even smaller chunks, allowing individuals to count all the people in a reasonable area and then come together in the end to sum up each piece. Taking something like a national census and simplifying it into small, simple tasks makes it easy for everyday people to assist in a monumental operation.
"How do you eat an elephant?" *
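The census analogy above can be sketched in a few lines of Python. This is a minimal, single-process illustration of the map, group-by-key, and reduce steps, not how Google's MapReduce or Hadoop is implemented. The function names (`map_phase`, `shuffle`, `reduce_phase`, `count_household`, `sum_counts`) and the sample records are all hypothetical and exist only for this example.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group values by key. A real framework would also route each key
    to a particular machine; here everything stays in one process."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Combine the list of values for each key into a single result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_household(record):
    """Map step: one local census taker reports (province, head count)."""
    province, household_size = record
    yield (province, household_size)


def sum_counts(province, sizes):
    """Reduce step: add up every head count reported for one province."""
    return sum(sizes)


# Made-up sample data: (province, people counted in one household).
census_records = [
    ("Gallia", 4),
    ("Hispania", 3),
    ("Gallia", 6),
    ("Aegyptus", 5),
    ("Hispania", 2),
]

totals = reduce_phase(shuffle(map_phase(census_records, count_household)), sum_counts)
print(totals)
```

Because each call to `count_household` looks at a single record and each call to `sum_counts` looks at a single province, both steps could in principle be handed to different machines without changing the result.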
Apache Hadoop, an open-source project that began in 2006 and was developed largely by engineers at Yahoo, reached its 1.0 release in 2011. Hadoop may not have been the first application to leverage the MapReduce design pattern, but it was by far the most successful. Hadoop allowed developers to build large-scale MapReduce jobs that could be executed across massive clusters of commodity machines in a way that was extremely resilient and reliable. Thus far, Hadoop is still the default in massive and reliable computing. However, despite its stability and consistency, Hadoop is still lacking in usability. Hadoop MapReduce jobs are fairly unwieldy, and problems with job optimization, scheduling, and writability are commonplace. In addition, most real-world applications of data processing require a large number of MapReduce jobs, executed in sequence on multiple distinct sets of data, in order to make useful discoveries. Pipelines like these are very difficult to build by hand on raw MapReduce, and that difficulty is what motivated FlumeJava.
FlumeJava was introduced in 2010 in a paper called "FlumeJava: Easy, Efficient Data-Parallel Pipelines," also written by a group of researchers at Google. FlumeJava introduced a framework for organizing, executing, and debugging a large-scale pipeline of MapReduce jobs. This allowed developers to write code which would be used to build an execution plan for a series of MapReduce jobs. Think of it as a SQL optimizer for MapReduce. FlumeJava was able to take the MapReduce tasks that needed to be executed and build them into an "internal execution plan graph structure" that could be evaluated as each task was needed. This allowed the same code, which would normally need to run in a large-scale cluster, to be debugged piece by piece on a local machine and transplanted directly to production. The optimized execution plans also significantly decreased the execution time of MapReduce pipelines by reducing the amount of rework and making it easy for failed jobs to roll back stages, rather than restart from the beginning.
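The idea of building an execution plan first and running it later can be sketched in Python. This is a toy illustration under loud assumptions: the `LazyCollection` class and its `map`, `filter`, `plan`, and `run` methods are hypothetical names invented for this example and are not FlumeJava's API. It only shows the general pattern of recording operations, inspecting the plan, and then executing all recorded steps in a single pass over the data.

```python
class LazyCollection:
    """Hypothetical deferred collection: records operations instead of running them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Nothing is computed here; the step is only appended to the plan.
        return LazyCollection(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyCollection(self._source, self._steps + (("filter", predicate),))

    def plan(self):
        """Return the recorded step kinds so the plan can be inspected before running."""
        return [kind for kind, _ in self._steps]

    def run(self):
        """Execute every recorded step in one pass over the source,
        rather than materializing an intermediate list per step."""
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []  # records each time parse actually runs


def parse(line):
    calls.append(line)
    return int(line)


pipeline = (
    LazyCollection(["1", "2", "3", "4"])
    .map(parse)
    .filter(lambda n: n % 2 == 0)
    .map(lambda n: n * 10)
)

calls_before_run = len(calls)  # still 0: building the pipeline ran nothing
result = pipeline.run()
print(pipeline.plan(), result)
```

Because the pipeline is just a description until `run` is called, the same description could be executed on a small local sample for debugging or handed to a cluster, which is the property the paragraph above attributes to FlumeJava.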
This innovation led to the creation of Apache Oozie for Hadoop and the popularization of the Directed Acyclic Graph (DAG) for large-scale data processing. Many of the ideas from the FlumeJava paper have been implemented in Apache Spark and enable many of the features that have made Apache Spark a world-class data processing platform.
* denotes a Hadoop joke