SMS, Oban and multi-threading ioctls in SVM
Now that Solaris
is open sourced, it's time to share a bit of inside information.
All Volume Manager projects were named after various single malt scotches (SMS). Over
the last couple of years we have gone through (yes, ingested) Lagavulin,
Laphroaig, Springbank, Ardbeg and, most recently, Oban. Our lab
machines are named after either defunct distilleries
or the islands around Islay. For some of us, names like
Ronaldsay, Eday, Stronsay and Shapinsay easily roll off our tongues.
Often there are applications, such as Oracle RAC or SAN file systems,
where multiple nodes need to access the shared storage simultaneously.
Cluster volume managers typically provide this functionality.
OpenSolaris supports this functionality with Solaris Cluster Volume
Manager (SCVM). Oban was its code name. SCVM supports striping,
mirroring and soft partitions. In addition, the mirror and
soft partition drivers were enhanced to support Application Based
Recovery (ABR) ioctls for block and character devices, to improve
cluster I/O performance. Oracle, for example, uses this
functionality to speed up recovery after a crash.
Solaris Volume
Manager can manage storage grouped in named disksets. For instance, one
could group all the storage for home directories in a diskset. This
diskset could be moved from one node to another as required. Data
stored on the disks in this type of diskset can also be remotely
replicated, thereby greatly improving disaster recovery. Ardbeg, a
single malt that tastes like burnt rubber, delivered the capability to move
disksets and to support remote replication of disksets.
A diskset always
has a host associated with it. If multiple nodes share the same set of
disks then those nodes can participate in the same diskset. Disksets
have single owner and multi-owner attributes. The single owner
attribute is akin to EVMS's private
container; that is, only one node at any given time
can access the disks in the diskset. Multi-owner disksets allow
multiple nodes to access the disksets simultaneously. These
disksets support cluster volume management functionality. Multi-owner
disksets are
similar to EVMS's shared containers.
Internally we called them Oban disksets. The metadevices (or
volumes) in these multi-owner disksets can be managed from any node in
a cluster.
While other
blogs on OpenSolaris have focused on describing complex code, I felt it
would be interesting to describe complex problems that were addressed
with a simple solution. Normally one multi-threads sections of
code for performance. That was not the case here; multi-threading
was required for correctness. To appreciate the
issues, one needs to understand a bit about the SVM and SCVM
functionality and the daemon rpc.mdcommd. The detailed workings
of SCVM are beyond the scope of this blog, but here are a few salient
features.
Ioctls in SVM
are used to either get configuration information or change/maintain a
metadevice. Since configuration changes are infrequent and reporting the
status of metadevices is not on a critical path, multi-threading SVM ioctls
was never high on our agenda. SCVM changed that. Our goal
was to make cluster
volume management functionality a seamless extension of Solaris Volume
Manager. This meant that local, single owner and multi-owner
disksets all had to co-exist.
For all the
nodes to have a consistent view of the state of metadevices, state
changes on one node must be propagated to all the nodes in the
cluster. Similarly when a configuration change is made on any
node, it should appear on all the nodes. The daemon rpc.mdcommd
is used to transmit and marshal SCVM meta data across a
cluster. When meta data from one node needs to be propagated to
all the nodes in the cluster, the following events occur:
Messages
typically contain configuration change information. The contents of
the message are passed to the kernel through an ioctl. Since
each diskset can have a
different master node, a single node may be master for one diskset and
slave for another. If ioctls are single threaded, then critical
messages sent to the master node for one diskset could be blocked on
that node by a state change update to another diskset. Some
of the messages generate a sub-message which also needs to be propagated to
all the nodes. Any attempt to send the sub-message will
immediately cause a deadlock, since
it is made from the context of the first ioctl. There are
other messages that need information from the kernel and therefore need
to issue ioctl calls. These again will hang. Since
local disksets, named disksets and single owner
disksets can co-exist, operations on a single owner diskset can block
operations on a multi-owner diskset. For all these reasons it was
decided to multi-thread the ioctls.
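The sub-message case is a self-deadlock: the context that already holds the one ioctl lock asks for it again. Here is a minimal user-space sketch of that situation, not SVM code. It assumes a POSIX error-checking mutex as a stand-in for the single ioctl lock, and the names ioctl_lock, nested_ioctl and outer_ioctl are made up for illustration. An error-checking mutex reports EDEADLK on the second attempt, where a default mutex would simply hang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the single global ioctl lock. */
static pthread_mutex_t ioctl_lock;

/* Second "ioctl", issued while the first one still holds the lock. */
static int
nested_ioctl(void)
{
	int err = pthread_mutex_lock(&ioctl_lock);

	if (err == 0)
		(void) pthread_mutex_unlock(&ioctl_lock);
	return (err);
}

/*
 * First "ioctl": takes the lock, then tries to send a sub-message that
 * needs the same lock. Returns what the nested attempt reported.
 */
int
outer_ioctl(void)
{
	pthread_mutexattr_t attr;
	int err;

	(void) pthread_mutexattr_init(&attr);
	(void) pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	(void) pthread_mutex_init(&ioctl_lock, &attr);
	(void) pthread_mutexattr_destroy(&attr);

	err = pthread_mutex_lock(&ioctl_lock);
	assert(err == 0);
	err = nested_ioctl();	/* a default mutex would hang here */
	(void) pthread_mutex_unlock(&ioctl_lock);
	(void) pthread_mutex_destroy(&ioctl_lock);
	return (err);
}
```

Calling outer_ioctl() returns EDEADLK, which is the user-space equivalent of the hang described above.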
While multi-threading the ioctls itself is simple, the impact of this
change to the rest of the SVM code is potentially significant. Large
chunks of the code path must be looked at to avoid race
conditions. Therefore at the time of the project, it was decided to
multi-thread only critical ioctls for multi-owner disksets.
The obvious question is how should single threaded and
multi-threaded ioctls interact?
Most of the multi-threaded ioctls are directly related to cluster
interaction. For example, when a cluster membership is
forming, rpc.mdcommd must be suspended until a new cluster
membership
list is available. At that point the daemon must update its
notion of the active nodes in the cluster and send messages only to
those nodes. As a result, the ioctls that handle cluster
related operations were deemed to have a higher priority than the
ones that changed the state of the mddb. These ioctls also did not
interact with md structures. Based on this analysis, we decided
that while a single threaded ioctl was in progress, multi-threaded
ioctls must be allowed. The next issue was:
Should a single threaded ioctl be allowed when a
multi-threaded ioctl is in progress?
Recall that the traditional ioctls (i.e. the single threaded ones) can
change and update the mddb. These changes can result in
messages
being sent across the cluster. If the state of the cluster is
changing, the mddb
state change needs to be held back until the cluster is stable.
Hence it was decided to block single threaded ioctls if multi-threaded
ioctls were in progress. This also means
that we risk starving single threaded ioctls if multi-threaded calls
keep occurring. We deemed this risk to be minimal since
only a few ioctls were multi-threaded, and if a large number of these
were occurring, it indicated a problem with the cluster. In such
a
situation sacrificing a node to enable the availability of a cluster is
reasonable.
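The two decisions above boil down to a small admission rule. Here it is written as plain predicates over a snapshot of ioctl activity. This is only an illustration of the policy; the struct and function names are made up for this example and do not appear in the SVM source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of ioctl activity on one node. */
struct ioctl_state {
	bool st_active;		/* a single threaded ioctl is running */
	unsigned mt_count;	/* multi-threaded ioctls in progress */
};

/* Multi-threaded (cluster related) ioctls are never held back. */
bool
mt_may_enter(const struct ioctl_state *s)
{
	(void) s;
	return (true);
}

/*
 * A single threaded ioctl waits until every multi-threaded ioctl has
 * drained and any other single threaded ioctl has finished.
 */
bool
st_may_enter(const struct ioctl_state *s)
{
	return (s->mt_count == 0 && !s->st_active);
}

/* Usage: walk through the three interesting cases. */
void
policy_self_check(void)
{
	struct ioctl_state idle = { false, 0 };
	struct ioctl_state st_busy = { true, 0 };
	struct ioctl_state mt_busy = { false, 2 };

	assert(st_may_enter(&idle));	 /* nothing running: go ahead */
	assert(mt_may_enter(&st_busy));	 /* MT allowed during ST */
	assert(!st_may_enter(&mt_busy)); /* ST held back during MT */
}
```

The starvation risk mentioned above is visible in st_may_enter(): as long as mt_count never reaches zero, a single threaded caller is never admitted.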
The implementation of multi-threading ioctls in this manner turns out
to be quite simple.
The code snippets are from the function mdioctl
in usr/src/uts/common/io/lvm/md/md.c:
	if (!is_mt_ioctl(cmd) && md_ioctl_lock_enter() == EINTR) {
		return (EINTR);
	}

	/*
	 * initialize lock tracker
	 */
	IOLOCK_INIT(&lock);

	/* Flag to indicate that MD_GBL_IOCTL_LOCK is not acquired */
	if (is_mt_ioctl(cmd)) {
		/* increment the md_mtioctl_cnt */
		mutex_enter(&md_mx);
		md_mtioctl_cnt++;
		mutex_exit(&md_mx);
		lock.l_flags |= MD_MT_IOCTL;
	}
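The snippet relies on is_mt_ioctl(), which is not shown here. One plausible shape for such a classifier is a switch over the small set of cluster related commands. The sketch below is a guess at that shape only: the command names are hypothetical placeholders, not the real md ioctl codes, and the function is named demo_is_mt_ioctl to make clear it is not the one in md.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical command codes; the real md ioctl numbers differ. */
enum demo_cmd {
	DEMO_GET_STATUS,	/* reads metadevice state (single threaded) */
	DEMO_SET_CONFIG,	/* changes the mddb (single threaded) */
	DEMO_CLUSTER_SUSPEND,	/* cluster membership change (multi-threaded) */
	DEMO_CLUSTER_RESUME	/* cluster membership change (multi-threaded) */
};

/* Only the few cluster related commands bypass the global ioctl lock. */
bool
demo_is_mt_ioctl(int cmd)
{
	switch (cmd) {
	case DEMO_CLUSTER_SUSPEND:
	case DEMO_CLUSTER_RESUME:
		return (true);
	default:
		return (false);
	}
}

/* Usage: everything not explicitly listed stays single threaded. */
void
demo_classifier_check(void)
{
	assert(demo_is_mt_ioctl(DEMO_CLUSTER_SUSPEND));
	assert(!demo_is_mt_ioctl(DEMO_SET_CONFIG));
}
```

Defaulting to single threaded keeps the change conservative, which matches the decision to multi-thread only the critical ioctls.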
md_ioctl_lock_enter()
calls md_global_lock_enter() with ~MD_GBL_IOCTL_LOCK.
md_global_lock_enter() (only the relevant code shown):
	if (!(global_locks_owned_mask & MD_GBL_IOCTL_LOCK)) {
		while ((md_mtioctl_cnt != 0) ||
		    (md_status & MD_GBL_IOCTL_LOCK)) {
			if (cv_wait_sig_swap(&md_cv, &md_mx) == 0) {
				mutex_exit(&md_mx);
				return (EINTR);
			}
		}
		md_status |= MD_GBL_IOCTL_LOCK;
		md_ioctl_cnt++;
	}
The if (!(global_locks_owned_mask & MD_GBL_IOCTL_LOCK)) test is always true in the above call sequence, because the mask passed in, ~MD_GBL_IOCTL_LOCK, has that bit clear. We therefore achieve the logic we wanted: while a multi-threaded ioctl or another single threaded ioctl is in progress, subsequent single threaded ioctls wait.
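The snippets show only the entry side. For a waiter to ever leave that while loop, the exit paths have to drop the count or the lock bit and signal the condition variable. Below is a user-space model of both sides using POSIX threads. It is a sketch under assumptions, not the md.c code: the names are invented, the wake-up logic is my guess at what the exit path must do, and the signal handling of cv_wait_sig_swap() (the EINTR return) is left out.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;	/* plays md_mx */
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;	/* plays md_cv */
static int mt_cnt;	/* plays md_mtioctl_cnt */
static int st_held;	/* plays the MD_GBL_IOCTL_LOCK bit in md_status */

/* Multi-threaded ioctl entry: count it, never wait. */
void
mt_enter(void)
{
	pthread_mutex_lock(&mx);
	mt_cnt++;
	pthread_mutex_unlock(&mx);
}

/* Multi-threaded ioctl exit: the last one out wakes any waiters. */
void
mt_exit(void)
{
	pthread_mutex_lock(&mx);
	assert(mt_cnt > 0);
	if (--mt_cnt == 0)
		pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&mx);
}

/* Single threaded ioctl entry: same wait condition as the snippet above. */
void
st_enter(void)
{
	pthread_mutex_lock(&mx);
	while (mt_cnt != 0 || st_held)
		pthread_cond_wait(&cv, &mx);
	st_held = 1;
	pthread_mutex_unlock(&mx);
}

/* Single threaded ioctl exit: release the lock bit and wake waiters. */
void
st_exit(void)
{
	pthread_mutex_lock(&mx);
	st_held = 0;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&mx);
}

/* Example-only probe: would st_enter() have to wait right now? */
int
st_would_wait(void)
{
	int wait;

	pthread_mutex_lock(&mx);
	wait = (mt_cnt != 0 || st_held);
	pthread_mutex_unlock(&mx);
	return (wait);
}
```

A broadcast rather than a single wake-up is used because several single threaded callers may be waiting, and each rechecks the condition under the mutex before proceeding.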
...and that is the story behind multi-threading SVM ioctls.