Peter Jenkins - Grid Things

iSCSI enabling diskless general purpose servers? (Part one)

Thursday Dec 21, 2006

I've been a fan of diskless boot in Solaris for a long while. Building a large grid (read: datacenter) with hundreds or thousands of servers, each with one or more local disks, seems to me to be fundamentally broken. The cost of the disks themselves isn't my main issue. Nor is the heat and power they use, although it does add up: a Sun X2200 server (2 CPUs, 4 x 1GB DIMMs) draws 248 Watts without disks and 278 Watts with them. Nor is it the wasted storage on each system that isn't usable (how much of that 500GB disk are you using per system, and what about mirroring?).
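To put a rough number on "it does add up", here is a minimal sketch of the arithmetic using the two X2200 figures above. The fleet sizes and the always-on assumption (24 hours a day, 365 days a year) are mine, and the sketch ignores the power drawn by whatever central storage replaces the local disks, so treat the result as an upper bound on the saving rather than a measurement.

```python
# Rough fleet-level power arithmetic. Only the two wattage figures come
# from the post; fleet sizes and the always-on assumption are illustrative.
WATTS_WITHOUT_DISKS = 248  # Sun X2200 without local disks
WATTS_WITH_DISKS = 278     # same server with local disks


def disk_power_overhead_kw(servers: int) -> float:
    """Extra power drawn by local disks across `servers` machines, in kW."""
    per_server_watts = WATTS_WITH_DISKS - WATTS_WITHOUT_DISKS
    return servers * per_server_watts / 1000.0


def annual_energy_kwh(servers: int, hours_per_year: int = 24 * 365) -> float:
    """Energy used by those disks over a year, assuming servers never power down."""
    return disk_power_overhead_kw(servers) * hours_per_year


if __name__ == "__main__":
    for fleet in (100, 1000):  # hypothetical grid sizes
        print(fleet, "servers:",
              disk_power_overhead_kw(fleet), "kW,",
              annual_energy_kwh(fleet), "kWh per year")
```

At 30 Watts per server, a thousand-node grid spends about 30 kW just spinning local disks, before you count the cooling needed to remove that heat.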

No, my concern is about having data and configurations strewn throughout your network. How are you going to achieve a flexible datacenter fit for the future if you have to provision and manage the operating systems installed locally on all your servers?

With diskless booting you can centralise all of your storage and make all your servers stateless.

Diskless boot is possible in a number of ways: you can use Fibre Channel, Ethernet or InfiniBand, and you can boot using storage and/or network protocols. In my mind the general purpose datacenter of the future doesn't include Fibre Channel. Perhaps that's narrow minded, but if you object and want me to explain why, leave a comment ;-) That leaves us with the increasingly common Ethernet vs. InfiniBand debate. Without going down that rat hole too much, I'd like to compare the two in terms of their ability to do network consolidation, that is, to reduce the number of network technologies in the datacenter.

Until recently I would have told you InfiniBand (IB) is king for network consolidation and Ethernet doesn't really play in this space. Recently I'm starting to think differently.

You see, the advantages of IB are great on paper, and for many HPC clusters too, but in real commercial datacenters the requirements are different. In the commercial space, the potential performance improvements from the faster, lower-latency, RDMA-capable interconnect are hard to make use of in practice.

I'll leave speed and latency for another time and take a look at RDMA. The appeal of RDMA is to take CPU cycles away from processing network traffic and put them to use for the application you are running. This is a worthy goal, but today very few commercial applications are written to take advantage of RDMA (Oracle 10g being a very notable exception), and changing customer code to make use of RDMA is very expensive.

Although utilizing RDMA is hard, emerging standards such as NFS/RDMA and Oracle's support for IB make it interesting to keep an eye on. More promising is the not-RDMA-but-not-TCP/IP-either set of protocols which can run over IB. Here SDP (Sockets Direct Protocol) is the best example, as it has the potential to transparently provide a sockets interface to applications while actually bypassing the host Operating System TCP/IP stack. Other examples of interest are RDS (Reliable Datagram Sockets), SRP (SCSI RDMA Protocol) and iSER (iSCSI Extensions for RDMA). The last two are competing protocols for running SCSI over IB, and they provide an option to replace Fibre Channel.

To date the uptake of either SRP or iSER has been fairly slow. There are many reasons for this, but the main ones are the lack of a single standard and the lack of Operating System support. Today Solaris, Linux and Windows support SRP, but Linux and Windows require commercial drivers from Cisco or QLogic (SilverStorm's new owner). iSER is supported in Linux and Windows but only by Voltaire (Solaris support is planned with support from Sun). I should also note that the OpenFabrics stack purports to support both SRP and iSER but is widely regarded as not ready for production (read: buggy). I'm not sure if this is true or simply vendors spreading FUD to make their own stack more appealing (I suspect it's a bit of both).

In order to boot over IB you need special firmware on your HBA, or do you? ...

I'll post the second part to this once I've written it. ;-)

 

Comments:

Hear, hear. Disks in servers are stupid. Even more so for 100 or even 1,000 identical compute nodes. I had a customer with a 128-node HPC cluster. Each node had two disks. Guess what? The disk firmware had a problem. 256 individual disks had to be flashed. Then there was a driver problem. 128 drivers needed to be reinstalled. Then a patch to an OS. 128 patches.

Fibre Channel boot is increasingly commonplace in blade servers, but much less so in rack servers. The customers I know who do boot over SAN with their blades love it.

I don't really see the need today for RDMA for Ethernet-based boot storage. The OS and app are generally only accessed once, at startup, and that access is not performance sensitive. Obviously, swap space is the exception, but that can be addressed by increasing the RAM on each node, by right-sizing the memory, and by using tools to monitor memory leaks.

As for InfiniBand, yes, you need special firmware. Realize some boot over IB solutions are really boot via NAS over IP over IB. Boot over IB makes sense when one already plans to invest in IB. I do not understand why HPC IB cluster customers don't boot over IB.

Overall, I think for servers you should consider diskless solutions. Your boot solution should be based on your primary data network. So if your server is a FC attached database server, boot over SAN makes sense. If it is a web server, boot over NAS or iSCSI makes sense. Likewise, if it is an Ethernet based HPC cluster, boot over NAS or iSCSI makes sense. If it is an IB based HPC cluster, boot over IB makes sense, but boot over NAS or iSCSI may make sense if you are using Ethernet for your cluster file system.

With DDR InfiniBand now the norm, I believe data will consolidate with IPC on the IB network, and a dedicated Ethernet for a cluster filesystem will fade. This is a good opportunity for boot over IB.

Ultimately, I think RDMA based Ethernet will make boot over network the norm, probably via RDMA NAS, not RDMA iSCSI. DAFS was ahead of its time in this regard. I also think RDMA is not as popular in commercial computing because of a chicken and egg situation, but will be heavily used once it is ubiquitous (when low-latency computing "goes to free"). So much of clustered software code is dedicated to latency hiding (i.e., state replication). Look at Yousef Khalidi's original Solaris MC concepts. That was an idea of the OS providing the underlying application clustering functionality via a low-latency interconnect.

More of my thoughts on this are at the linked URL I provided.

Posted by Mark on March 12, 2007 at 10:56 PM GMT #

Mark, thanks for your comments. I read the article you linked to as well. I agree with you about the overall direction. It seems inevitable that 10G Ethernet with TOE and RDMA will become standard in the next few years, and you are right that it will enable more (all?) applications to operate on general purpose servers. I hope these new capabilities are exploited by new companies challenging the status quo. Cheers, Peter.

Posted by Peter Jenkins on April 20, 2007 at 05:06 PM BST #
