Alok Aggarwal's Weblog


Tuesday May 24, 2005

 RPC Versioning for rpc.metad

One of my earliest contributions to Solaris 10 was to version the rpc.metad daemon. In this post I'll talk about the problem I was trying to solve and why it needed solving, as a precursor to how I actually solved it.

What's this rpc.metad?

rpc.metad is one of the Solaris Volume Manager (SVM) RPC daemons, and its entire purpose in life is to let multiple hosts share common storage that is going to be used for shared SVM disksets*. The daemon communicates between the hosts in question while the volumes are being configured and while changes are being made to the configuration. As an example, while adding disks to a shared diskset with two hosts A and B, it probes each of the hosts with the question, "Hey, I see a disk c1t0d0 with these characteristics. Do you see the same disk on your side? Okay, both of us are seeing the same disk; now I'm going to add this disk to a diskset called oracle, you okay with that?" "Yeah, go ahead and use it, I'm not using it."

* Shared SVM disksets, here, refer to disksets that one and only one host can access at any given point in time. This configuration is mostly used in high availability (HA) environments or in conjunction with clustering software.

Why did it need to be versioned?

Early on in the Solaris 10 development cycle, support was added to SVM so you could create multi-terabyte volumes as well as leverage multi-terabyte LUNs. As part of this effort, changes were made to a lot of the structures used internally by SVM (not the on-disk structures). When these changes were made, care was taken to make sure that nothing broke if you wanted to upgrade or downgrade the machines.

However, during the external code review process (I think that's what it was; anyhow, you get the idea), one of the reviewers pointed out that there was a case where backward compatibility was broken. I'll expound on this soon, but let me take this opportunity to explain a little bit about the code review process here in the Solaris organization.

Solaris code review process - A detour

Every change to the Solaris source, i.e. every bug fix (no matter how trivial) and every project, needs to be reviewed by someone other than the engineer making the change. Typically, a bug fix gets reviewed by one or two engineers. A bigger change, i.e. a project or an RFE, needs to be reviewed not only by engineers inside the same technology group but also by engineers in different (but hopefully related) technology groups (external code review). The motivation is to get multiple sets of eyes looking at the change so as to provide healthy criticism of it. This helps make sure that (a) the right fix is being made and it won't cause future bugs, (b) other areas of the code are not being overlooked, and (c) the change is not going to break other Solaris functionality. Code reviews are just one of the process-related tasks we in the Solaris organization undertake to make sure the Solaris code is always high quality.

Back to the real topic

One of the external code reviewers mentioned that since rpc.metad uses the changed structures, it would likely not be able to talk to other rpc.metad processes that have an "older" view of those structures. This was particularly going to be a problem when the clustering software is used in a rolling upgrade scenario, where the cluster nodes are upgraded one after another, i.e. some of the nodes can be running an older version of Solaris (say Solaris 8) whereas other nodes can be running an upgraded version of Solaris (say Solaris 10). Each of these nodes needs to be able to communicate with the others in such a scenario. The problem this presents is that you've got a Solaris 8 rpc.metad that knows about the Solaris 8 version of the structures and a Solaris 10 rpc.metad that knows about the Solaris 10 version of the structures. The result: the two rpc.metad processes can't communicate with each other!
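To make the mismatch concrete, here is a minimal sketch in C. It is not the real SVM code: the structure names, the fields, and the idea that a block count was widened from 32 to 64 bits are all assumptions made up for illustration. It only shows how a receiver that knows the older layout ends up rejecting a message encoded from the newer one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "old" and "new" views of the same record. These are NOT the
 * real SVM structures; the 32-bit to 64-bit widening is an assumption. */
struct unit_old { uint32_t id; uint32_t nblocks; };
struct unit_new { uint32_t id; uint64_t nblocks; };

#define OLD_WIRE_LEN 8u   /* two 4-byte fields */
#define NEW_WIRE_LEN 12u  /* one 4-byte field plus one 8-byte field */

/* Write and read a 32-bit value in big-endian (network) byte order. */
static void put_u32(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

static uint32_t get_u32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Encode the old layout; returns the number of bytes written. */
size_t encode_old(const struct unit_old *u, unsigned char *buf) {
    assert(u != NULL && buf != NULL);
    put_u32(buf, u->id);
    put_u32(buf + 4, u->nblocks);
    return OLD_WIRE_LEN;
}

/* Encode the new layout; the 64-bit count goes out as two 32-bit halves. */
size_t encode_new(const struct unit_new *u, unsigned char *buf) {
    assert(u != NULL && buf != NULL);
    put_u32(buf, u->id);
    put_u32(buf + 4, (uint32_t)(u->nblocks >> 32));
    put_u32(buf + 8, (uint32_t)(u->nblocks & 0xffffffffu));
    return NEW_WIRE_LEN;
}

/* A receiver that only knows the old layout: returns 0 on success and -1
 * when the message length is not what the old layout expects. */
int decode_old(const unsigned char *buf, size_t len, struct unit_old *out) {
    assert(buf != NULL && out != NULL);
    if (len != OLD_WIRE_LEN)
        return -1;
    out->id = get_u32(buf);
    out->nblocks = get_u32(buf + 4);
    return 0;
}
```

Passing the output of encode_new to decode_old fails the length check, which is the toy equivalent of the two daemons not being able to talk to each other.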

The solution to this problem was to version the Solaris 10 rpc.metad, i.e. make it understand the older structure definitions as well as the newer structure definitions by leveraging the versioning capabilities of the RPC framework. Simple as that.
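As a rough illustration of the idea (a sketch under assumptions, not the actual rpc.metad implementation), the C fragment below shows one server routine that accepts either argument layout, selected by the version number carried in the call. The version numbers, the return codes, and the names handle_set_size and pick_version are all hypothetical. In a real ONC RPC service the version would come from the call header, and the per-version stubs would normally be generated from the program definition rather than written by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version numbers and result codes; the post does not give
 * the real rpc.metad values. */
enum { VERS_OLD = 1, VERS_NEW = 2 };
enum { REQ_OK = 0, REQ_BAD_ARGS = -1, REQ_VERS_MISMATCH = -2 };

/* Read a 32-bit big-endian value. */
static uint32_t rd_u32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Server side: one daemon, two argument layouts. The old layout carries a
 * 32-bit block count, the new one a 64-bit count (an assumed difference).
 * Either way the value lands in the same 64-bit internal form. */
int handle_set_size(uint32_t vers, const unsigned char *args, size_t len,
                    uint64_t *nblocks) {
    assert(args != NULL && nblocks != NULL);
    switch (vers) {
    case VERS_OLD:
        if (len != 4)
            return REQ_BAD_ARGS;
        *nblocks = rd_u32(args);            /* widen old 32-bit value */
        return REQ_OK;
    case VERS_NEW:
        if (len != 8)
            return REQ_BAD_ARGS;
        *nblocks = ((uint64_t)rd_u32(args) << 32) | rd_u32(args + 4);
        return REQ_OK;
    default:
        return REQ_VERS_MISMATCH;           /* version we do not serve */
    }
}

/* Caller side: pick the highest version both ends support, or 0 if the
 * two supported ranges do not overlap. */
uint32_t pick_version(uint32_t my_low, uint32_t my_high,
                      uint32_t peer_low, uint32_t peer_high) {
    uint32_t low = my_low > peer_low ? my_low : peer_low;
    uint32_t high = my_high < peer_high ? my_high : peer_high;
    return low <= high ? high : 0;
}
```

With this shape, a newer node that supports versions 1 and 2 would talk version 1 to an older peer that only supports version 1, and version 2 to another upgraded node, which is what a rolling upgrade needs.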

In a future post I'll go over the details of how I implemented the versioning changes. So long!

( Jun 10 2005, 01:27:30 PM EDT / May 24 2005, 04:04:52 PM EDT )