In a traditional server, reliability is achieved by removing single points of failure. There is more than one network connection, more than one interconnect to the drives, and so on. And there is more than one machine:
Both partners in this simple cluster are connected to all of the disk drives. And each connection actually represents two different loops to the drives.
The green line represents the interconnect link between the partners. At a bare minimum it carries a heartbeat by which each machine can determine whether its partner is down and whether it should take over the disks. Note that we haven't stated which disks physically belong to which system.
The easiest problem to spot with this system is what happens if all that breaks is the interconnect? Both machines are up and ready to serve data. If we have more than 2 boxes, we could use a quorum. Since we don't, we can in effect reserve an area on the drives to act as a secondary heartbeat. I'm not interested in these mechanics, just that the problem exists.
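To make the problem concrete, here is a minimal sketch of the takeover decision in a two-node cluster that has both the interconnect heartbeat and a secondary heartbeat written to a reserved area on the drives. The function name, the timestamps, and the 5 second timeout are all my own assumptions for illustration; this is not how any particular cluster product does it.

```python
def partner_is_down(now, last_interconnect_beat, last_disk_beat, timeout=5.0):
    """Decide whether the partner should be treated as dead.

    All arguments are times in seconds. The names and the default
    timeout are hypothetical, chosen only for this sketch.
    """
    # A heartbeat channel is "silent" if nothing arrived within the timeout.
    interconnect_silent = (now - last_interconnect_beat) > timeout
    disk_silent = (now - last_disk_beat) > timeout

    # Only take over the disks when BOTH channels are silent. If just the
    # interconnect broke, the disk heartbeat shows the partner is still
    # alive, and taking over would leave two machines serving the same disks.
    return interconnect_silent and disk_silent
```

With only the interconnect broken, `partner_is_down(100.0, 90.0, 99.0)` is false, so neither machine grabs the other's disks. Only when both heartbeats go stale does a takeover make sense.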
The deep, dark secret of pnfs is that it is in effect a cluster. I'm not saying that a standby metadata server (MDS) is in place or that a data server (DS) can have a partner; I really should have said that pnfs is a distributed file system. :-) What I really mean is that to be effective, the MDS needs to communicate with the DSs. But, as designed, the DSs do not need to communicate with each other.
The question arises though as to how the MDS communicates with the DSs. Do we have an interconnect between it and each DS? Are they strung out in a chain? The answer is that no, such approaches get unwieldy, complicated, and cause the ship date to slip.
But if we think about it, we already have a physical interface in place for communicating between arbitrary nodes in the distributed file system: the network. The problem then becomes that we need a protocol to allow the MDS to talk to the DSs. We need heartbeats - "Are you still alive?". We need to know how bogged down a DS is - "Thank you sir, may I have another!".
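As a rough sketch of those two needs, here is what the MDS side of the bookkeeping might look like: each DS periodically reports that it is alive along with a load figure, and the MDS uses that to decide who is up and who can take more work. Everything here (the class names, the 0.0 to 1.0 load scale, the 10 second timeout) is a hypothetical illustration, not the actual protocol.

```python
from dataclasses import dataclass, field


@dataclass
class DataServerStatus:
    last_seen: float  # time of the most recent heartbeat, in seconds
    load: float       # assumed scale: 0.0 is idle, 1.0 is saturated


@dataclass
class MetadataServer:
    timeout: float = 10.0  # hypothetical: silence longer than this means "down"
    servers: dict = field(default_factory=dict)

    def record_heartbeat(self, ds_name, now, load):
        # "Are you still alive?" answered, with a load report attached.
        self.servers[ds_name] = DataServerStatus(last_seen=now, load=load)

    def alive(self, now):
        # A DS counts as alive if it reported within the timeout window.
        return sorted(
            name
            for name, status in self.servers.items()
            if now - status.last_seen <= self.timeout
        )

    def least_loaded(self, now):
        # "May I have another?" Pick the live DS with the lowest load.
        live = self.alive(now)
        if not live:
            return None
        return min(live, key=lambda name: self.servers[name].load)
```

For example, if `ds1` last reported at time 0 and `ds2` at time 5 with a lighter load, then at time 12 only `ds2` is considered alive, and it is the one `least_loaded` hands back.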
But if you look at the NFSv4.1 RFC (which is the authoritative source for all things pnfs), you'll realize that the communication between the MDS and DSs is an implementation detail. If you look closely enough, you'll realize nothing states that the MDS and DS(s) have to be on different physical machines.
The reason why this approach was taken was to allow each vendor to leverage their existing technologies. (For example, an OS might already have existing protocols to migrate or replicate data between boxes.)
The other reason here is that there was no desire to mix implementations in the DSs. I.e., NetApp and EMC do not have to worry about both being a DS for a Sun MDS. An interesting twist to this is that nothing prevents vendors from doing this on their own. With the OpenSolaris code available for download and the Linux reference available for download, it would be easy for a 3rd party to learn to talk the talk.
The protocol used to communicate is called the control protocol, although do not hold me to that name - it could change before final putback. I'll quote from Spencer's "NFSv4.1's pNFS for Solaris" (which is not as quick and dirty an overview as my "A quick and dirty overview of pnfs"):
To this point, we have a Solaris pNFS client interacting with a pNFS
server over a flexible network configuration. The meta-data server is
using ZFS as the underlying filesystem for the "regular" filesystem
information (names, directories, attributes). The data server is
using the ZFS pool to organize the data for the various layouts the
meta-data server is handing out to the clients. What is coordinating
all of the pNFS community members?
The control protocol is the piece of the pNFS server solution that is
left to the various implementations to define. Since the Solaris pNFS
solution is taking a fairly straightforward approach to the
construction of the pNFS community, this allows for the use of ZFS'
special combination of features to organize the attached storage
devices. This will allow for the control protocol to focus on higher
level control of the pNFS community members.
Some of the highlights of the control protocol are:
Coordination of the pNFS Community
pNFS Control Protocol
I'm building up a background for understanding what I am currently working on...