20050614 Tuesday June 14, 2005

Multi-Owner Disksets in SVM

I was part of the team that enhanced the Solaris Volume Manager (SVM) to allow multiple nodes to access volumes simultaneously. A new type of diskset, the multi-owner diskset, was created for this feature. The multi-owner diskset is enabled in OpenSolaris, but only a single node can be added to the diskset. Multiple nodes are allowed in a multi-owner diskset only when running on our clustering software, but the enhancements made during this project have ramifications throughout the SVM code base, with changes to the SVM commands, daemons, library and kernel modules. This post gives a brief overview of how the multi-owner diskset features have affected the SVM commands. The remaining multi-owner diskset changes, most notably to the SVM md_mirror driver, will be covered in future postings.

The multi-node simultaneous access feature is officially called a multi-owner diskset, but the comments, defines and declarations in the code contain MN or mn, which stands for multi-node. Early in the project there were many discussions about whether to call the feature multi-owner or multi-node, and we chose multi-node. As the project progressed, it became clear that multi-owner was really the better choice. That final decision was made late in the project, so only the messages seen by the user were changed to multi-owner.

The multi-owner name does reflect the true nature of our changes. With the traditional (aka named) diskset in SVM, multiple nodes can be part of a diskset, but only one node can be the owner of the diskset. The owner node in the diskset can access the volumes on the shared storage. No other node can access the shared volumes until the owner node releases the diskset. In the multi-owner diskset several nodes can be owners of the diskset and can access the shared volumes simultaneously. Any owner node can also make configuration changes such as creating new volumes, adding new disks to the diskset, deleting volumes, fixing failed disks, etc.
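The difference between the two ownership models can be sketched as a small toy model. This is only an illustration: the class and method names (`TraditionalDiskset`, `MultiOwnerDiskset`, `take`, `release`, `join`) are hypothetical and say nothing about how SVM represents disksets internally.

```python
# Toy model only; these classes and method names are assumptions for
# illustration and are not part of SVM.

class TraditionalDiskset:
    """Many member nodes, but at most one owner at any time."""

    def __init__(self, members):
        self.members = set(members)
        self.owner = None

    def take(self, node):
        if node not in self.members:
            raise ValueError(f"{node} is not a member of the diskset")
        if self.owner not in (None, node):
            raise PermissionError(f"diskset is currently owned by {self.owner}")
        self.owner = node

    def release(self, node):
        if self.owner == node:
            self.owner = None

    def can_access(self, node):
        # Only the single owner may touch the shared volumes.
        return self.owner == node


class MultiOwnerDiskset:
    """Several member nodes may be owners at the same time."""

    def __init__(self, members):
        self.members = set(members)
        self.owners = set()

    def join(self, node):
        if node not in self.members:
            raise ValueError(f"{node} is not a member of the diskset")
        self.owners.add(node)

    def can_access(self, node):
        # Every owner may access the shared volumes simultaneously.
        return node in self.owners
```

In the traditional model a second node has to wait for `release` before `take` succeeds; in the multi-owner model both nodes simply join.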

In the multi-owner diskset, the nodes are kept in sync with each other by having all nodes execute the same SVM configuration state changes in the same order. A configuration state change can be initiated by the user, such as the creation of a new volume, or by the system, such as a hotspare action triggered by a disk failure. One node in the multi-owner diskset is designated as the master of the diskset. The master node is the point at which state changes are serialized across the nodes in the diskset: it handles the state change first and then sends it to the remaining nodes sequentially, in a fixed order. A new multi-threaded SVM daemon, rpc.mdcommd, was written to handle the communication between the nodes in the multi-owner diskset.
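The serialization idea can be sketched in a few lines. This is a minimal model under stated assumptions, not the rpc.mdcommd implementation: `ClusterNode`, `MasterNode` and `submit` are hypothetical names, and message passing between nodes is replaced by direct method calls.

```python
# Minimal sketch; names are hypothetical and network messages are
# replaced by plain method calls.

class ClusterNode:
    def __init__(self, name):
        self.name = name
        self.applied = []  # state changes, in the order this node executed them

    def apply(self, change):
        self.applied.append(change)


class MasterNode(ClusterNode):
    """Single point at which state changes are put into one order."""

    def __init__(self, name, others):
        super().__init__(name)
        self.others = list(others)  # the remaining nodes, in a fixed order

    def submit(self, change):
        # The master handles the change first ...
        self.apply(change)
        # ... and then passes it to the remaining nodes one after another.
        for node in self.others:
            node.apply(change)


node_b = ClusterNode("b")
node_c = ClusterNode("c")
master = MasterNode("a", [node_b, node_c])

# One change started by a user, one started by the system.
master.submit("create volume d10")
master.submit("hotspare replaces failed disk")
```

Because every change passes through `submit`, all three `applied` lists end up identical, which is the property the real design depends on.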

When an SVM command is run on the local set or on a traditional/named diskset, the command is executed only on the local node. The command may use the existing SVM daemon, rpc.metad, to notify any other nodes in the diskset about a configuration change.

In the multi-owner diskset, most commands are executed on all nodes. The command first sends a dryrun version of itself to the master node, which then sends the dryrun command to the remaining nodes. The dryrun is sent first to detect whether the command would fail on any node. (An example of a failure would be one node having a mounted file system on a disk slice while the user was attempting to create a shared volume on that same slice.) If the dryrun succeeds, the command then sends the actual command to the master node. The master node executes the command and sends it to the remaining nodes.
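The dryrun-then-execute sequence can be modelled roughly as follows. This is a sketch under assumptions: `SetNode`, `run_on_all_nodes` and `DryrunFailed` are hypothetical names, the only failure modelled is the mounted-slice example above, and the slice names are made up.

```python
# Rough model of the two passes; all names here are hypothetical.

class DryrunFailed(Exception):
    pass


class SetNode:
    def __init__(self, name, busy_slices=()):
        self.name = name
        self.busy_slices = set(busy_slices)  # e.g. slices with a mounted file system
        self.volumes = {}                    # volume name -> slice

    def handle(self, volume, disk_slice, dryrun):
        """Return True on success; a dryrun checks but changes nothing."""
        if disk_slice in self.busy_slices:
            return False
        if not dryrun:
            self.volumes[volume] = disk_slice
        return True


def run_on_all_nodes(master, others, volume, disk_slice):
    ordered = [master] + list(others)  # master first, then the remaining nodes

    # Pass 1: dryrun everywhere, so a failure is found before any node changes.
    for node in ordered:
        if not node.handle(volume, disk_slice, dryrun=True):
            raise DryrunFailed(f"{node.name} cannot use {disk_slice}")

    # Pass 2: the actual command, again master first.
    for node in ordered:
        node.handle(volume, disk_slice, dryrun=False)
```

If any node rejects the dryrun, no node has been modified yet, so nothing has to be undone.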

In the case of clearing all metadevices within a multi-owner diskset (metaclear -a), the metaclear command is broken into a series of individual smaller commands. This allows the multi-owner code to correctly select the timeout allowed for this command to complete. The smaller commands are each sent to the master node without a dryrun version of the command being sent first. The master node then sends each smaller command to the remaining nodes.
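As a rough illustration of the splitting step, the sketch below turns one "clear everything" request into one request per metadevice and sends each to the master with no dryrun pass. The function names, the tuple format and the `send_to_master` callback are all assumptions; the real message format and the order in which metadevices are cleared are not described here.

```python
# Illustrative only; function names and request format are assumptions.

def split_clear_all(metadevices):
    """Break one 'clear all' request into one request per metadevice."""
    return [("clear", name) for name in metadevices]


def send_without_dryrun(requests, send_to_master):
    # Each small request goes straight to the master, which is then
    # responsible for passing it on to the remaining nodes.
    for request in requests:
        send_to_master(request, dryrun=False)


sent = []
send_without_dryrun(
    split_clear_all(["d10", "d11", "d12"]),
    lambda request, dryrun: sent.append((request, dryrun)),
)
```

One plausible reading of the timeout remark is that each small request can be bounded on its own, instead of a single timeout having to cover however many metadevices the set happens to contain.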

There are several diskset commands that are not run on all nodes: metastat, medstat, metaset, metadb, metadevadm, metaimport and metarecover.

The metastat and medstat commands are run locally on the node where the command was executed. Metastat reports on the configuration of the multi-owner diskset. Medstat reports specifically on the mediators of the multi-owner diskset. In both cases no communication with the other nodes is needed. While the nodes are executing a configuration state change, the metastat output may differ from node to node. Once the state change has completed, all nodes should show the same metastat output.

The metaset and metadb commands can change the list of owner nodes in the multi-owner diskset. These commands may stop, start or re-initialize the SVM communication daemon, rpc.mdcommd, when the list of owner nodes has been changed. These commands use the existing SVM daemon, rpc.metad, that is running on all nodes to communicate with the other nodes in the diskset.

The metadevadm and metaimport commands can only run on disksets that support device ids. Device ids are not supported in the multi-owner diskset at this time, so these commands fail on the local node.

The metarecover command can only be run on the master node. If the metarecover command is recovering from the watermarks on disk, the master node sends a message to the remaining nodes describing the recovered soft partitions so that the nodes are kept in sync with each other.
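The handling of each command in a multi-owner diskset, as described above, can be summarized as a small lookup. The command names come from this post; the `handling` function and its return labels are hypothetical and exist only to summarize the rules.

```python
# Summary of the rules above; the function and its labels are assumptions.

LOCAL_ONLY = {"metastat", "medstat"}            # no communication needed
VIA_RPC_METAD = {"metaset", "metadb"}           # use the existing rpc.metad
NEEDS_DEVICE_IDS = {"metadevadm", "metaimport"} # not supported in this diskset type
MASTER_ONLY = {"metarecover"}


def handling(command, on_master=False):
    """Describe how a command is treated in a multi-owner diskset."""
    if command in LOCAL_ONLY:
        return "local"
    if command in VIA_RPC_METAD:
        return "rpc.metad"
    if command in NEEDS_DEVICE_IDS:
        return "fail"
    if command in MASTER_ONLY:
        return "master" if on_master else "fail"
    # Most other commands are sent to the master and run on all nodes.
    return "all-nodes"
```

For example, `handling("metarecover")` gives `"fail"` on a non-master node, while `handling("metaclear")` falls through to `"all-nodes"`.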

( Jun 14 2005, 08:49:11 AM PDT )
