Slog - Sohrab's weblog
Distributed Web scale technologies
Ever wondered how a Google or Yahoo search would "instantly" give back more responses than you would desire? Better yet, that it would rank the responses to allow the most relevant material to be displayed at the top? Think about all those web crawlers that reach out into the world to capture data from every public website, and then index every word on every web page so as to get our search done quickly. The sheer volume of data and the amount of computational power required to analyze it on an on going basis is simply astounding.
Google processed 14 petabytes of data
requiring about
370 machine years of processing capability on a daily basis, in the month of Sept '07. Incredible as it may seem one has to then imagine a database capable of storing and retrieve such massive amounts of preprocessed data. The short time required to process this volume of data & the desire to keep costs low in solving for data analytics at this scale, presented a problem in search of an eloquent solution like Google's Map Reduce and BigTable. Though there are various open source implementations that attempt to address the problem of web scale data analytics, I have been paying careful attention to Hadoop, hbase in particular. I will not dwell on the implementation of Hadoop since there is ample material available to explain it in detail (http://hadoop.apache.org/). It would suffice to say that the strength of Hadoop lies in its ability to leverage the distributed storage capabilities in a cluster by breaking up very large data into smaller pieces, storing them locally and then processing that data independently, on the cluster nodes. This base execution environment in Hadoop is also extended to provide a sparsely-populated column-oriented datastore store called hBase (http://hadoop.apache.org/hbase/) with various query languages like Pig (http://hadoop.apache.org/pig/) that provide a platform for analyzing large data sets. Though a typical Hadoop cluster can be around 20 nodes, some MapReduce/Hadoop clusters run from 1,000's to 10,000's cores. The scale and size of large Hadoop clusters make them sound ominous and conjecture images of complexity and doom. It would however be important to bear in mind some governing attributes behind
MapReduce / Hadoop
1) Simplicity: Keep the programming environment simple and flexible
2) Simplicity again: Keep the execution environment simple and flexible
3) Failures: Accommodate for failures
4) Cost: Keep the system costs low
My fascination with this subject is not that its revolutionary, but really in what it has been able to show. In the past, the adoption of massively scaled distributed systems has been generally constrained by the enterprise applications not being "distributed systems centric". However the web scale requirements of today and the resulting web applications (being network aware) are much more apt to leveraging the capabilities of distributed systems. High performance computing (HPC) has bridged the gap in some area's between the academic and enterprise worlds in the use of distributed computing. It's probably appropriate to point out at this juncture that while there are other differences, there is at least one major difference between HPC and Map Reduce. Though they both attempt to solve very very large problems by leveraging the computational capabilities of a cluster, HPC tends to support a stronger cooperation between its applications (a more parallel approach), while Map Reduce applications tend to run more independently within a cluster (a more distributed approach). This fundamental difference tends to structure slightly different DNA in the 2 environments. As an example, Map Reduce creates a simpler execution & programming environment, but then puts constraints on the problem set that it can address.
Looking at what we have with Hadoop and having spent an ample amount of time in different forms of clustering (availability, load balancing, hpc), systems management & automation,... I cannot help but feel a sense of déjà vu. Today, Hadoop is relatively simple to understand (if you are geek inclined), but will require some tinkering to create your special home brew solution. Though Hadoop is relatively new, and still maturing, the simplicity of the model and the interesting problem that it attempts to solve is attracting such a strong following, that it really gives meaning to the phrase "simplicity scales". With so much enthusiasm to "fix" this simple solution, I am fairly confident (rather cynically) that it will be totally incomprehensible in a few years.
If you are new to Hadoop you will quickly realize that Hadoop is often an overloaded term. Though commonly referred to in the context of the programming model: Map & Reduce, it is also used to represent the execution environment. As a first step for the more curious, may I suggest that you take Hadoop for a test spin. George & Kevin tried to make it easy by putting together an OpenSollaris LveCD that you can boot off your laptop (http://opensolaris.org/os/project/livehadoop/). The intent of the liveCD is that it would not disrupt your current installation. The OpenSolaris image simulates a multinode environment through the use of zones, with Hadoop pre-configured (the use of zones allows us to keep the memory footprint small while representing a 3 node cluster). After having brought down and burned the LiveCD image "in-preview2-OSLH.iso" from http://opensolaris.org/os/project/livehadoop/Downloads/ you could then boot OpenSolaris on your (x86) machine from the LiveCD and run the very simple word count example in a cluster as documented in http://opensolaris.org/os/project/livehadoop/OSLHQuickStart.pdf This LiveCD is also being used to run the practicals in a Hadoop class by Chris K Wensal & Stefan Groschupf ( http://www.scaleunlimited.com/hadoop-bootcamp.html).
Welcome to Hadoop! More to come....
Posted at 03:57PM Nov 10, 2008 by sohrab in Sun | Comments[1]