I've started working on the pnfs (parallel NFS) project, which can be found as NFS version 4.1 pNFS. The intent is to parallelize NFSv4 traffic to a "server" - which is basically N+1 different storage units. One of the first things I am working on is the simple policy engine - which is used to decide how to parallelize the access. And the thing I am working on right now is gathering statistics on how the pnfs server is working.
We can consider the classical NFS client and server relationship in this diagram:
The client connects to a server via a network and the server has one or more disks either inside or attached to it. To get the most out of the box, we would add RAM, NVRAM, solid state disk, faster drives, etc. The reason that we don't have just one drive in there is that we can stripe the data access across multiple drives. Using RAID technology and volumes (i.e., making several physical disks appear as one logical unit), we can parallelize data access to a set of disks.
What this means is that conceptually as one disk is busy doing I/O for us, we can be sending more work to the next disk. A very simplistic view would be to incorporate all of the disks into one large volume. You could then create project directories underneath the mount point. But, how do you enforce how much space a project gets? How do you keep people from looking where they should not? How do you get a very intensive application tied to the fastest disks?
These questions are ones of policy - how do we want to divide the resources? In a traditional server, we would create volumes dedicated to a policy. Below, we see we have /home, /builds, and /engineering. And we assume that the system (/, /usr, etc) is 'internal' to the server:
Perhaps the disks for /builds are faster. Perhaps the disks for /home are larger. All that matters here is that we made a static policy decision and changing it can be very difficult.
It is easy to see that eventually the limiting factor for data transfer is the network connection. We went from 10 Mb/s to 100 Mb/s to 1 Gb/s networks (and are working on making 10 Gb/s a reality) in part because the server subsystems were faster than the network. But workloads have increased and client farms have scaled way out of proportion. Replace that single client with 1000 of them all trying to access the /home volume at once.
A solution is to parallelize the server by adding more boxes to it. A very simplistic approach would be the following:
We want the server to be advertised with a single IP address, so we put a box in front to essentially route all traffic to the correct storage box. Think of the problems this presents:
We want to leave the routing to network hardware. What does our router do? It decides which of the data servers to send the data to. Why does that decision need to be repeated for every packet? What if the very first time we started file access we decided which machine the file was going to go to?
The problem with this is that traditional NFSv3 and NFSv4 clients expect that the file will be on the same machine that was queried for the open. (This is not entirely accurate: NFSv4 could handle this kind of migration of a file.) And if we wanted to split that file across multiple machines, well, that is definitely outside the scope of these protocols.
The solution that pnfs takes is to provide a router, which is called an MDS (metadata server), and push the actual routing of the data back to the client:
Just as with RAID, we want to stripe access to the I/O. So, the client queries the MDS about the layout of the file. The layout is basically a list of DS (data servers) and the stripe size.
At a simplistic level, the client will access the first stripe size chunk on the first DS, the second chunk on the second machine, and so on. Once it has done that for all machines, it will wrap back to the first. If the client decides to start at a position other than the start of the file, it can do some simple calculations to determine which DS has the data.
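The "simple calculations" mentioned above can be sketched in a few lines of Python. This is only an illustration of plain round-robin striping as described in this entry, not the actual pnfs layout code; the names `locate` and `split_io` and the list of DS names are made up for the example.

```python
def locate(offset, stripe_size, data_servers):
    """Map a byte offset to (DS, stripe index, offset within the stripe).

    Assumes simple round-robin striping: chunk 0 on the first DS,
    chunk 1 on the second, wrapping back to the first after the last.
    """
    if offset < 0 or stripe_size <= 0 or not data_servers:
        raise ValueError("need offset >= 0, stripe_size > 0 and at least one DS")
    stripe_index = offset // stripe_size
    ds = data_servers[stripe_index % len(data_servers)]
    return ds, stripe_index, offset % stripe_size


def split_io(offset, length, stripe_size, data_servers):
    """Split a byte range into (DS, file offset, byte count) chunks."""
    chunks = []
    end = offset + length
    while offset < end:
        ds, _, within = locate(offset, stripe_size, data_servers)
        # Stop at the end of the current stripe or the end of the request.
        count = min(stripe_size - within, end - offset)
        chunks.append((ds, offset, count))
        offset += count
    return chunks
```

For example, with three DSs and a stripe size of 4 bytes, `split_io(2, 8, 4, ["ds0", "ds1", "ds2"])` breaks an 8 byte request starting at offset 2 into one piece per DS, and an offset of 13 lands on the first DS again because the fourth chunk wraps around.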
A write operation can be visualized in the following:
To summarize, the pnfs client (for all intents and purposes an NFSv4.1 client) talks to the MDS to get metadata information (attributes) and the set of DSs which have the file. The client then talks directly to each of them for file access.
It is then the job of the spe (the simple policy engine) to determine which set of machines should be used. Instead of the static decision made when the disks were attached to a single server, we now make a dynamic decision. If the file access is for reading, we don't need to consult the spe, because the policy is already in place. For writes, the spe will need to look at the specified rules (policies) and the current state of the DSs to determine the best place to stripe the file. Note that nothing we have said (nor the above picture) dictates that the DSs have adjacent IPs.
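To make the idea of "rules plus current DS state" concrete, here is a purely hypothetical sketch in Python. None of this comes from the real spe: the `Policy` and `DataServer` fields, the path-prefix rule matching, and the "least loaded, then most free space" ranking are all assumptions chosen only to illustrate a dynamic placement decision.

```python
from dataclasses import dataclass


@dataclass
class DataServer:
    name: str
    free_bytes: int   # hypothetical: space left on this DS
    load: float       # hypothetical: 0.0 (idle) to 1.0 (saturated)


@dataclass
class Policy:
    path_prefix: str  # hypothetical rule key: which files the rule covers
    stripe_count: int # how many DSs to stripe across
    stripe_size: int  # bytes per stripe chunk


def choose_layout(path, policies, servers):
    """Return (list of DS names, stripe size) for a file about to be written."""
    # First matching rule wins; put a catch-all such as "/" last.
    policy = next((p for p in policies if path.startswith(p.path_prefix)), None)
    if policy is None:
        raise LookupError("no policy matches " + path)
    # Prefer the least loaded servers, breaking ties by most free space.
    ranked = sorted(servers, key=lambda s: (s.load, -s.free_bytes))
    chosen = ranked[:policy.stripe_count]
    return [s.name for s in chosen], policy.stripe_size
```

With a rule that stripes /builds across two servers, `choose_layout("/builds/x.o", policies, servers)` would return the two least busy DSs at that moment, so the same rule can yield different layouts as the state of the DSs changes.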
I'll talk more about the spe in a future entry.
Hi Tom,
Great write up! Since the mds is a single point of failure, does the protocol provide a mechanism to fail over to another mds when the primary mds fails? If not, how is HA achieved?
Thanks,
- Ryan
Posted by Matty on December 01, 2007 at 08:35 PM CST #
Ryan,
The goal of the first pnfs implementation is to get a simple server out there. As such, if you look at the spec, you will see that we are basically pushing such issues out until a later time.
Every time we add someone new to the internal development group, they ask the type of questions you are asking. But it is a big project and we want to stay focused on delivering it in Nevada.
You are asking the right questions; we are just not exploring them yet as a community. And by community here, I mean the NFSv4 working group in the IETF.
I may be wrong, but I believe the target "customer" for the first servers is HPC - people who in my mind are willing to sacrifice almost anything else for speed.
Thanks,
Tom
Posted by Tom Haynes on December 01, 2007 at 10:16 PM CST #