
Tuesday, June 14, 2005
Multi-Owner Diskset in SVM
I was part of the team that enhanced the Solaris Volume Manager (SVM)
to allow multiple nodes to simultaneously access volumes. A new
type of diskset, a multi-owner diskset, was created for this new
feature. The multi-owner diskset is enabled in OpenSolaris, but
only a single node can be added to the diskset.
Multiple nodes are only allowed in the multi-owner diskset when
running on our clustering software, but the enhancements made during
this project have ramifications throughout the SVM code base with
changes to the SVM commands, daemons, library and kernel modules.
This post gives a brief overview of how the multi-owner diskset features
have affected the SVM commands. The remaining multi-owner
diskset changes, most notably to the SVM md_mirror driver, will
be handled in future postings.
The multi-node simultaneous access feature is officially called a
multi-owner diskset, but the comments, defines and declarations in the
code contain MN or mn, which stands for multi-node. There were many
discussions about multi-owner vs multi-node early in the project, and
we initially chose multi-node. As the project progressed, it became
clear that multi-owner was the better choice. The final decision was
made late in the project, so only the messages seen by the user were
changed to multi-owner.
The multi-owner name reflects the true nature of our changes.
With the traditional (aka named) diskset in SVM, multiple nodes can be
part of a diskset, but only one node can be the owner of the diskset.
The owner node in the diskset can access the volumes on the shared storage.
No other node can access the shared volumes until the owner node releases
the diskset.
In the multi-owner diskset several nodes can be owners of the diskset
and can access the shared volumes simultaneously. Any owner node can also
make configuration changes such as creating new volumes, adding new
disks to the diskset, deleting volumes, fixing failed disks, etc.
In the multi-owner diskset, the nodes are kept in sync with each
other by having all nodes execute the same SVM configuration state
changes in the same order. A configuration state change can be
initiated by the user, such as the creation of a new volume, or by the
system, such as a hotspare action after a disk failure. One node in the
multi-owner diskset is designated as the master of the diskset.
The master node is the point at which the state changes are serialized
across the nodes in the diskset.
The master node handles the state change first and then sends it to the
remaining nodes in a fixed sequential manner. A new multi-threaded
SVM daemon, rpc.mdcommd, was written to handle the
communication between the nodes in the multi-owner diskset.
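The ordering guarantee described above can be illustrated with a small model. This is only a sketch: the `Node` and `Diskset` classes and the change strings are hypothetical names made up for the example, not identifiers from the SVM code, and the real communication goes through rpc.mdcommd rather than direct method calls.

```python
# Hypothetical model of master-serialized state changes.
# None of these names come from the SVM source.
class Node:
    def __init__(self, name):
        self.name = name
        self.applied = []  # state changes in the order this node applied them

    def apply(self, change):
        self.applied.append(change)


class Diskset:
    def __init__(self, master, others):
        self.master = master
        self.others = list(others)  # fixed delivery order for the other nodes

    def submit(self, change):
        # The master handles the change first, then forwards it to the
        # remaining nodes one at a time, always in the same order.
        self.master.apply(change)
        for node in self.others:
            node.apply(change)


nodes = [Node("n1"), Node("n2"), Node("n3")]
ds = Diskset(master=nodes[0], others=nodes[1:])
ds.submit("create volume")      # user-initiated change
ds.submit("hotspare action")    # system-initiated change
```

Because every change passes through the master before reaching anyone else, each node ends up with the same list of changes in the same order.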
When an SVM command is
run on the local set or on a traditional/named diskset, the
command is executed only on the local node.
The command may use the existing SVM daemon, rpc.metad, to notify
any other nodes in the diskset about a configuration change.
In the multi-owner diskset most of the commands will be executed
on all nodes. The command will first send a dryrun
version of the command to the master node. The master node will then
send the dryrun command to the remaining nodes. A dryrun command is
sent first to detect if the command would have failed on any node.
(An example of a failure would be if one node had a mounted file system
on a disk slice when the user was attempting to create a shared volume
on the same disk slice.) If the dryrun succeeds,
the command then sends the actual command to the master node.
The master node will then execute the command and send this command
to the remaining nodes.
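The dryrun-then-execute flow can be sketched roughly as follows. All names here (`Node`, `broadcast`, `run_command`, the slice names) are hypothetical, and stopping at the first failing node is an assumption of this sketch; the post only says the dryrun detects whether the command would fail on any node.

```python
# Hypothetical two-pass model: dryrun everywhere, then the real command.
class Node:
    def __init__(self, name, busy_slices=()):
        self.name = name
        self.busy_slices = set(busy_slices)  # e.g. slices with a mounted file system
        self.volumes = {}

    def handle(self, volume, disk_slice, dryrun):
        if disk_slice in self.busy_slices:
            return False               # the command would fail on this node
        if not dryrun:
            self.volumes[volume] = disk_slice
        return True


def broadcast(master, others, volume, disk_slice, dryrun):
    # Master first, then the remaining nodes in a fixed order.
    # Assumption: give up as soon as one node reports a failure.
    for node in [master, *others]:
        if not node.handle(volume, disk_slice, dryrun):
            return False
    return True


def run_command(master, others, volume, disk_slice):
    if not broadcast(master, others, volume, disk_slice, dryrun=True):
        return False                   # nothing was changed on any node
    return broadcast(master, others, volume, disk_slice, dryrun=False)


n1 = Node("n1")
n2 = Node("n2", busy_slices={"c1t0d0s0"})
blocked = run_command(n1, [n2], "d10", "c1t0d0s0")
allowed = run_command(n1, [n2], "d11", "c1t1d0s0")
```

In this model the first command is rejected during the dryrun pass because n2 has the slice in use, so n1 never creates the volume either. The second command passes the dryrun and is then applied on both nodes.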
In the case of clearing all metadevices within a multi-owner diskset
(metaclear -a), the metaclear command is broken into a series
of individual smaller commands. This allows the multi-owner code
to correctly select the timeout allowed for this command to complete.
The smaller commands are each sent to the master node
without a dryrun version of the command being sent first.
The master node then sends each smaller command to the remaining nodes.
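As a rough illustration of the split, the sketch below turns one "clear everything" request into one small request per metadevice and hands each to the master with no dryrun pass. The function names, the tuple message format and the order in which metadevices are cleared are all assumptions made for the example.

```python
# Hypothetical sketch of splitting "metaclear -a" into smaller commands.
def split_clear_all(metadevices):
    """Return one small clear request per metadevice."""
    return [("clear", name) for name in metadevices]


def clear_all(metadevices, send_to_master):
    # Each small command goes straight to the master; no dryrun is sent first.
    for command in split_clear_all(metadevices):
        send_to_master(command)


sent = []                       # stands in for the messages the master receives
clear_all(["d10", "d11", "d12"], sent.append)
```

Each small command can then be given a timeout sized for clearing a single metadevice, rather than one timeout that has to cover the whole diskset.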
There are several diskset commands that are not run on all nodes:
metastat, medstat, metaset, metadb, metadevadm,
metaimport and metarecover.
The metastat and medstat commands are run locally on the node
where the command was executed. Metastat reports on the
configuration of the multi-owner diskset. Medstat reports
specifically on the mediators of the multi-owner diskset.
In both cases no communication is needed with the other nodes.
As the nodes execute a configuration state change, the metastat
output on different nodes may be different. At the conclusion
of the state change, all nodes should show the same metastat output.
The metaset and metadb commands can change the list of owner
nodes in the multi-owner diskset.
These commands may stop, start or re-initialize the SVM
communication daemon, rpc.mdcommd, when the list of owner
nodes has been changed.
These commands use the
existing SVM daemon, rpc.metad, that is running on all nodes to
communicate with the other nodes in the diskset.
The metadevadm and metaimport commands can only run
on disksets that support device ids. Device ids are
not supported in the multi-owner diskset at this time, so
these commands fail on the local node where they are run.
The metarecover command can only be run on the master node.
If the metarecover command is recovering from the watermarks
on disk, the master node sends a message to the remaining
nodes describing the recovered soft partitions so that the
nodes are kept synchronous with each other.
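The handling of the different commands in a multi-owner diskset can be summarized as a small lookup. This is only a restatement of the rules described in this post; the `route` function and its return strings are hypothetical, and treating metainit as one of the "most commands" that run on all nodes is an assumption.

```python
# Hypothetical summary of how commands are handled in a multi-owner diskset.
LOCAL_ONLY = {"metastat", "medstat"}            # no communication with other nodes
VIA_RPC_METAD = {"metaset", "metadb"}           # talk to other nodes through rpc.metad
NEEDS_DEVICE_IDS = {"metadevadm", "metaimport"} # device ids not supported here
MASTER_ONLY = {"metarecover"}


def route(command, is_master):
    if command in LOCAL_ONLY:
        return "run locally"
    if command in VIA_RPC_METAD:
        return "run locally, notify others via rpc.metad"
    if command in NEEDS_DEVICE_IDS:
        return "fail locally"
    if command in MASTER_ONLY:
        return "run on master" if is_master else "reject: master only"
    # Everything else: dryrun first, then the real command, on all nodes.
    return "dryrun, then execute on all nodes"


examples = {
    "metastat": route("metastat", is_master=False),
    "metaimport": route("metaimport", is_master=True),
    "metarecover": route("metarecover", is_master=False),
    "metainit": route("metainit", is_master=False),  # assumed to be a "most commands" case
}
```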
( Jun 14 2005, 08:49:11 AM PDT )