Slog - Sohrab's weblog
Distributed Web scale technologies
Ever wondered how a Google or Yahoo search would "instantly" give back more responses than you would desire? Better yet, that it would rank the responses to allow the most relevant material to be displayed at the top? Think about all those web crawlers that reach out into the world to capture data from every public website, and then index every word on every web page so as to get our search done quickly. The sheer volume of data and the amount of computational power required to analyze it on an on going basis is simply astounding.
Google processed 14 petabytes of data
requiring about
370 machine years of processing capability on a daily basis, in the month of Sept '07. Incredible as it may seem one has to then imagine a database capable of storing and retrieve such massive amounts of preprocessed data. The short time required to process this volume of data & the desire to keep costs low in solving for data analytics at this scale, presented a problem in search of an eloquent solution like Google's Map Reduce and BigTable. Though there are various open source implementations that attempt to address the problem of web scale data analytics, I have been paying careful attention to Hadoop, hbase in particular. I will not dwell on the implementation of Hadoop since there is ample material available to explain it in detail (http://hadoop.apache.org/). It would suffice to say that the strength of Hadoop lies in its ability to leverage the distributed storage capabilities in a cluster by breaking up very large data into smaller pieces, storing them locally and then processing that data independently, on the cluster nodes. This base execution environment in Hadoop is also extended to provide a sparsely-populated column-oriented datastore store called hBase (http://hadoop.apache.org/hbase/) with various query languages like Pig (http://hadoop.apache.org/pig/) that provide a platform for analyzing large data sets. Though a typical Hadoop cluster can be around 20 nodes, some MapReduce/Hadoop clusters run from 1,000's to 10,000's cores. The scale and size of large Hadoop clusters make them sound ominous and conjecture images of complexity and doom. It would however be important to bear in mind some governing attributes behind
MapReduce / Hadoop
1) Simplicity: Keep the programming environment simple and flexible
2) Simplicity again: Keep the execution environment simple and flexible
3) Failures: Accommodate for failures
4) Cost: Keep the system costs low
My fascination with this subject is not that its revolutionary, but really in what it has been able to show. In the past, the adoption of massively scaled distributed systems has been generally constrained by the enterprise applications not being "distributed systems centric". However the web scale requirements of today and the resulting web applications (being network aware) are much more apt to leveraging the capabilities of distributed systems. High performance computing (HPC) has bridged the gap in some area's between the academic and enterprise worlds in the use of distributed computing. It's probably appropriate to point out at this juncture that while there are other differences, there is at least one major difference between HPC and Map Reduce. Though they both attempt to solve very very large problems by leveraging the computational capabilities of a cluster, HPC tends to support a stronger cooperation between its applications (a more parallel approach), while Map Reduce applications tend to run more independently within a cluster (a more distributed approach). This fundamental difference tends to structure slightly different DNA in the 2 environments. As an example, Map Reduce creates a simpler execution & programming environment, but then puts constraints on the problem set that it can address.
Looking at what we have with Hadoop and having spent an ample amount of time in different forms of clustering (availability, load balancing, hpc), systems management & automation,... I cannot help but feel a sense of déjà vu. Today, Hadoop is relatively simple to understand (if you are geek inclined), but will require some tinkering to create your special home brew solution. Though Hadoop is relatively new, and still maturing, the simplicity of the model and the interesting problem that it attempts to solve is attracting such a strong following, that it really gives meaning to the phrase "simplicity scales". With so much enthusiasm to "fix" this simple solution, I am fairly confident (rather cynically) that it will be totally incomprehensible in a few years.
If you are new to Hadoop you will quickly realize that Hadoop is often an overloaded term. Though commonly referred to in the context of the programming model: Map & Reduce, it is also used to represent the execution environment. As a first step for the more curious, may I suggest that you take Hadoop for a test spin. George & Kevin tried to make it easy by putting together an OpenSollaris LveCD that you can boot off your laptop (http://opensolaris.org/os/project/livehadoop/). The intent of the liveCD is that it would not disrupt your current installation. The OpenSolaris image simulates a multinode environment through the use of zones, with Hadoop pre-configured (the use of zones allows us to keep the memory footprint small while representing a 3 node cluster). After having brought down and burned the LiveCD image "in-preview2-OSLH.iso" from http://opensolaris.org/os/project/livehadoop/Downloads/ you could then boot OpenSolaris on your (x86) machine from the LiveCD and run the very simple word count example in a cluster as documented in http://opensolaris.org/os/project/livehadoop/OSLHQuickStart.pdf This LiveCD is also being used to run the practicals in a Hadoop class by Chris K Wensal & Stefan Groschupf ( http://www.scaleunlimited.com/hadoop-bootcamp.html).
Welcome to Hadoop! More to come....
Posted at 03:57PM Nov 10, 2008 by sohrab in Sun | Comments[1]
A Solaris Virtual Network Router
Sun’s Virtual Network Machines project http://opensolaris.org/os/project/vnm/ is investigating ways to process network traffic on general-purpose servers, using a general-purpose operating system. This week a LiveCD distribution of the core components needed to configure a software-based router device on OpenSolaris was made available at http://opensolaris.org/os/project/vnm/VirtualRouter/VNRP/
One can freely download these components and quickly configure and run a virtual router on any device that runs Solaris.
The virtual router runs on Project Indiana, the open source binary distribution of OpenSolaris. When the LiveCD distribution is booted on a system that supports the Solaris operating system it allows configuration of a software-based router.

This virtual router project comes with two pre-configured zones (in addition to the global zone). The zones are labeled as Internet and Intranet zones with the ability to add adapters and choose some of the routing protocols to be run in each zone (this is done through a Web interface and is seen when the Indiana based router boots up). This allows for network isolation between zones. One may also choose to deploy the router functionality without zones. The intent here is to jumpstart some thinking into alternative and creative ways to use some of the more unique Solaris functionality. If one is satisfied with the router configuration, and functionality provided by the liveCD the process to lay it out to disk can be started by a single click. A driver will also be available that developers can use to communicate either between zones or between xVM guest images.
Support for Virtualization
A general-purpose OS could take advantage of advances in server virtualization technology. Most enterprise IT departments have already experienced the benefits of server virtualization, which pools the resources of multiple physical machines and allows them to be allocated dynamically as needed. Similarly, storage virtualization pools data storage capabilities so that storage can be managed on the fly. Why not virtualize network machines? Using a general-purpose OS, multiple network devices could reside cooperatively on the same physical hardware machine.

To that end, the virtual router project also provides scripts and procedures that will allow the same ISO image to be brought up as a DomU instance on a system running Solaris xVM.
Posted at 05:40AM Jan 18, 2008 by sohrab in Sun | Comments[3]
Moore forwarding (all pun intended)
My previous blog talked about 2 possibilities.
1. That today’s general purpose computing chips with multi core architectures could process network packets fast enough to create a shift away from expensive specialized hardware back to processing network traffic on general purpose servers.
2. General purpose computing chips with multi core architectures may provide "enough" performance to consider using a general purpose Operating system in certain deployments.
Armed with my theory I set about to investigate performance related questions (point 2) with the help of some great talent in the Solars networking group (a lot of the credit goes to Garrett D, Sangeeta M, and a few other folks who volunteered their time to this effort). The question I set out to answer is, whether CMT helps forwarding performance (in particular on general purpose OS's)? If so, then by how much ?
The tests that we ran, were experimental in nature (and should be treated as such), done more to prove a theory rather than provide data in a sterile, sanitized "benchmark". At the same time the people running the tests are highly skilled engineers and have some degree of exposure to network performance.
We 1st tried to get some base line numbers on a Spirent system, generating small (64 byte UDP packets), against a x64 Sun server running Solaris 10 with 2 Sockets (2 freestanding cpu's) and each socket having 2 cores. We then compared these numbers using a Sun prototype machine with a single UltraSPARC T2 (8 core) processor. After a bit of tinkering ( a lot more than the x64 case since we had to understand the system), the initial results are quite outstanding. In a Bidirectional forwarding test the single chip UltraSPARC T2 pumped close to 300% more packets through the system than the 2 chip (2core) x64 system. It is interesting to note that the UltraSPARC T2 when using 1.5K byte packets, saturated the 10G line (single stream at 18% CPU utilization). We also ran a sanity test against Linux and got numbers where it appears that Solaris was at least 30% better on an apples to apples comparison.
The interrupt driven context of the network traffic demonstrates predictable behavior by the threads of the UltraSPARC T2 by exhibiting some uneven CPU utilization across them. This leads us to belief that we can significantly improve the throughput by creating a much more even distribution across all the threads (today there are some threads that have a high utilization and end up creating starvation to a majority of the threads). All this has generated a great level of excitement and the Crossbow project on open Solaris has an
architecture to parallelize the network stack without any overheads
so all the Niagara threads can do work in parallel. This is currently planed to be made available as part of project crossbow (http://opensolaris.org/os/project/crossbow/ ), (http://blogs.sun.com/sunay), so stay tuned for future posts around this effort.
Bear in mind though that there is also an on going effort to examine the network performance of the UltraSPARC T2 chip much closer to the metal (point 1). There, the initial data on the UltraSPARC T2 shows a multi factor improvement in throughput over the above mentioned test. More information on this can be obtained from the following paper:
http://www.sun.com/products-n-solutions/docs/SUNP_wp.pdf
Posted at 07:15AM Jan 11, 2008 by sohrab in Sun | Comments[1]
What's Moore's law got to do with the network?
Traditional wisdom is that today's network devices (routers, load balancers, firewalls,...) greatly benefit from specialized dedicated hardware to process network traffic. This is an artifact of the explosive growth in network demand in the late 80's, and early 90's when the ability of processing such high volumes of traffic out paced the capability of general purpose computing servers. There is however a disruptive effect brewing to this well established industry norm through the cumulative effect of Moore's law over this 16 year stretch, coupled with new multi-core chip architectures of today.
If the CPU's used in today’s servers are now fast enough to process tomorrow's traffic needs, then cost factors (driven by economies of scale) would suggest a shift away from expensive specialized hardware back to processing network traffic on general purpose servers. For example, building on the multi core / multi threads of the UltraSPARC T1, the Niagara 2 with its 64 threads of execution, dual 10G Ethernet on chip and advanced cryptographic support at wire speed has many of the capabilities necessary to build a network processing device from general purpose server components. Imagine the effect this has on a software router now capable of forwarding data in 64 parallel threads of execution and benefiting from the low latency of having dual 10G NIC's on a CPU. All this begs the question: Has Moore's law brought us to a point where we can replace most of the expensive, dedicated, network devices by a general purpose server?
There is a serendipitous consequence of running
these devices on a general purpose server. It opens up an important opportunity for us to run these network devices on a general purpose Operating System.
Since there is an overhead to running such performance sensitive network devices on a general purpose Operating System like Solaris, there is an argument in favor of using a special purpose Network Operating system over that of a general purpose one. We should however consider the trade-offs to any resultant loss in performance against the flexible environment offered by a general purpose Operating system. Common belief is that a general purpose operating system promotes a faster pace of innovation by offering a more accommodative environment for developers to build and debug network based applications/services. It also offers IT deployment a way of managing costs more effectively through a rich, standardized and flexible format for command and control. As an example consider the case where a general purpose Operating System would make it easier to not only increase utilization of the hardware by consolidating multiple "network devices", on the same hardware platform, but also reduce costs by providing an easier and more standardized way to manage the life-cycle (patching, upgrading,...) of these services.
I feel there is one other important redeeming quality in using a general purpose Operating System that merits mentioning. Advances in virtualization technologies on modern general purpose Operating Systems provide game changing flexibility to general purpose OS's. Multiple network devices (routers, web servers, firewalls, load balancers,...) can reside either cooperatively, or, through virtualization, as isolated identities on the same physical hardware machine. Stateful migration capabilities being built into virtualization technologies, can further optimize the resource management & utilization of physical hardware by these software network devices. Undoubtedly network devices will greatly benefit from such capabilities in a general purpose Operating System.
Eventually this comes down to the realization that an environment like networking can greatly benefit from the economies of volume and scale of a general purpose platform. All this is predicated on the belief that the cumulative effect of Moore's law and performance improvements that are made to general purpose OS's will provide a reasonable rate at which we can process network traffic.
Finally, it comes down to the question of which Operating System? Of the open source Operating Systems, FreeBSD and NetBSD are strong contenders due to their robust quality and the work put into improving network performance. Linux derivatives are also considered due to their familiarity and wide distribution/deployments.
For me, Solaris has some unique characteristics besides the numerous positive attributes like robustness, supportability,... that I feel position it well as a dominant player in this space.
* With a majority of the chip vendors moving toward multicore / multithread architectures Solaris with years of having fine tuned Symmetric Multi-Processing Capabilities is well positioned to leverage these new emerging muticore architectures.
* Project crossbow on Open Solaris, is now capable of providing a holistic approach to virtualization, by extending it to support the network. By, providing a class of service (amongst other attributes), critical performance can be tied to a "lane" with more resources.http://opensolaris.org/os/project/crossbow/
* Solaris Containers can be used with IP instances, and the above "virtual" NICs, to create virtual network machines.
http://opensolaris.org/os/community/zones/
I have little doubt in my mind that all this coupled with a strong community around open source will foster a cottage industry of innovation that will very likely change the face of networking as we know it. I have helped initiate an Open Solaris project http://opensolaris.org/os/project/vnm, in an effort to help discuss and foster such a change. There is a virtual router sub-project that leverages a lot of the thoughts discussed already and I expect to have more sub-projects mushroom subsequently. It's really disruptive times like these when I feel most alive and appreciate the industry that we are a part of.
Posted at 03:53PM Jun 13, 2007 by sohrab in Sun | Comments[0]