Sunday December 02, 2007
The Interconnect between the MDS and DSs

In a traditional server, reliability is achieved by removing single points of failure. There is more than 1 network connection, there is more than 1 interconnect to the drives, etc. And, there is more than 1 machine:

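[Figure not shown: two partner machines, each connected to all of the disk drives, with a green line linking the partners]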

Both partners in this simple cluster are connected to all of the disk drives. And each connection actually represents two different loops to the drives.

The green line represents the interconnect link between the partners. At a bare minimum it carries a heartbeat by which each machine can determine if its partner is down and if it should take over the disks. Note that we haven't stated which disks physically belong to which system.

The easiest problem to spot with this system is what happens if all that breaks is the interconnect? Both machines are up and ready to serve data. If we have more than 2 boxes, we could use a quorum. Since we don't, we can in effect reserve an area on the drives to act as a secondary heartbeat. I'm not interested in these mechanics, just that the problem exists.
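To make the split-brain problem concrete, here is a minimal sketch of the two decisions described above. It is only an illustration: the function names and the two boolean inputs are hypothetical, and it assumes a quorum means a strict majority of the cluster, counting the node itself.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if this node can see a strict majority of the cluster.

    reachable_nodes counts this node itself. (Hypothetical helper.)
    """
    return reachable_nodes > cluster_size // 2


def should_take_over(interconnect_alive: bool, disk_heartbeat_alive: bool) -> bool:
    """Two-node case: only take over the partner's disks when BOTH the
    interconnect heartbeat and the on-disk (secondary) heartbeat are silent.

    If just the interconnect is broken, the partner is still writing its
    on-disk heartbeat, so it is up and we must leave its disks alone.
    """
    return not interconnect_alive and not disk_heartbeat_alive


if __name__ == "__main__":
    # With 2 boxes, a node that can only see itself never has a majority.
    print(has_quorum(reachable_nodes=1, cluster_size=2))
    # With 3 boxes, two nodes that can still see each other do.
    print(has_quorum(reachable_nodes=2, cluster_size=3))
    # Only the interconnect broke: do not take over.
    print(should_take_over(interconnect_alive=False, disk_heartbeat_alive=True))
```

The first call shows why a quorum does not help with only two boxes: after a break, neither side can see more than half of the cluster, which is why the secondary heartbeat on the drives is needed.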

The deep, dark secret of pnfs is that it is in effect a cluster. I'm not stating that a standby MDS is in place or that a DS can have a partner; I should have really said that pnfs is a distributed file system. :-) And what I really mean is that to be effective, the MDS needs to communicate with the DSs. But, as designed, the DSs do not need to communicate with each other.

The question arises though as to how the MDS communicates with the DSs. Do we have an interconnect between it and each DS? Are they strung out in a chain? The answer is no: such approaches get unwieldy, complicated, and cause the ship date to slip.

But if we think about it, we already have a physical interface in place to communicate between arbitrary nodes in the distributed file system: the network. The problem then becomes that we need a protocol to allow the MDS to talk to the DSs. We need heartbeats - "Are you still alive?". We need to know how bogged down a DS is - "Thank you sir, may I have another!".
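As a rough sketch of what those two needs amount to on the MDS side, the following keeps a table of when each DS last checked in and how busy it said it was. Everything here is an assumption for illustration: the class and method names, the 5 second timeout, and the 0.0 to 1.0 load scale are all hypothetical, and none of it is taken from the actual Solaris code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataServerStatus:
    last_seen: float  # time of the most recent heartbeat, in seconds
    load: float       # hypothetical scale: 0.0 is idle, 1.0 is saturated


@dataclass
class MetadataServerView:
    """What a (hypothetical) MDS remembers about its DSs."""
    timeout: float = 5.0  # assumed: a DS silent for longer than this is presumed down
    servers: Dict[str, DataServerStatus] = field(default_factory=dict)

    def record_heartbeat(self, ds_name: str, now: float, load: float) -> None:
        """A DS answered "Are you still alive?" and reported its load."""
        self.servers[ds_name] = DataServerStatus(last_seen=now, load=load)

    def alive(self, now: float) -> List[str]:
        """Names of the DSs heard from within the timeout, sorted."""
        return sorted(
            name for name, status in self.servers.items()
            if now - status.last_seen <= self.timeout
        )

    def least_loaded(self, now: float) -> Optional[str]:
        """The live DS best able to take more work, or None if all are silent."""
        live = self.alive(now)
        if not live:
            return None
        return min(live, key=lambda name: self.servers[name].load)


if __name__ == "__main__":
    mds = MetadataServerView(timeout=5.0)
    mds.record_heartbeat("ds1", now=0.0, load=0.9)
    mds.record_heartbeat("ds2", now=3.0, load=0.2)
    print(mds.alive(now=4.0))
    print(mds.least_loaded(now=4.0))
    print(mds.alive(now=7.0))
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch easy to reason about; a real implementation would be driven by messages arriving over the network.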

But if you look at the NFSv4.1 RFC (which is the authoritative source for all things pnfs), you'll realize that the communication between the MDS and DSs is an implementation detail. If you look closely enough, you'll realize nothing states that the MDS and DS(s) have to be on different physical machines.

The reason why this approach was taken was to allow each vendor to leverage their existing technologies. (For example, an OS might already have existing protocols to migrate or replicate data between boxes.)

The other reason here is that there was no desire to mix implementations in the DSs. I.e., NetApp and EMC do not have to worry about both being a DS for a Sun MDS. An interesting twist to this is that nothing prevents vendors from doing this on their own. With the OpenSolaris code available for download and the Linux reference available for download, it would be easy for a 3rd party to learn to talk the talk.

The protocol used to communicate is called control, although do not hold me to that name - it could change before final putback. I'll quote from Spencer's "NFSv4.1's pNFS for Solaris" (which is not as quick and dirty of an overview as my "A quick and dirty overview of pnfs"):

Coordination of the pNFS Community

To this point, we have a Solaris pNFS client interacting with a pNFS server over a flexible network configuration. The meta-data server is using ZFS as the underlying filesystem for the "regular" filesystem information (names, directories, attributes). The data server is using the ZFS pool to organize the data for the various layouts the meta-data server is handing out to the clients. What is coordinating all of the pNFS community members?

pNFS Control Protocol

The control protocol is the piece of the pNFS server solution that is left to the various implementations to define. Since the Solaris pNFS solution is taking a fairly straightforward approach to the construction of the pNFS community, this allows for the use of ZFS' special combination of features to organize the attached storage devices. This will allow for the control protocol to focus on higher level control of the pNFS community members.


Some of the highlights of the control protocol are:

  • Meta-data and data server reboot / network partition indication
  • Filehandle, file state, and layout validation
  • Reporting of data server resources
  • Inter data server data movement
  • Meta-data server proxy I/O
  • Data server state invalidation

I'm building up a background for understanding what I am currently working on...


Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily