Thursday Oct 20, 2005
Sunday Sep 25, 2005
Network configurator for laptop
I have followed
the trend of many nascent bloggers by being silent for a long
time. So what have I been doing? Aside from doing my
normal work, I have been trying to get things organized so that we can
start putting out tools and utilities that make a Solaris laptop a bit more
real. Internally we have been using a number of tools
delivered by frkit. Now the plan
is to make frkit available on the laptop community
for OpenSolaris. The community is now open and a number of people
are contributing.
Inetmenu
is one of the tools used internally to configure network
interfaces. Out of the box, Solaris and OpenSolaris don't deal
well with nomadic systems like laptops. One typically has to
install Solaris as a standalone system. Then after booting up,
the user has to manually plumb the interface, start DHCP
and then fix up the files required for DNS or
NIS. Inetmenu is a script that does all that. It
handles wired, wireless and dial-up interfaces. One
can also define profiles for use. All of this makes life a bit
nicer for laptop users.
However, Inetmenu
is just one of the network configurators. There is yet
another one used internally called netprof, and hopefully we
should be able to make that available in the not too distant future.
Technorati Tag: OpenSolaris
Tuesday Jun 14, 2005
Optimized Resyncs in Solaris Volume Manager
Over the last
couple of months, a number of people have wanted to know about optimized
resyncs. People familiar with VxVM might know this as DRL (Dirty
Region Logging). In Solaris Volume Manager this functionality is
called optimized resyncs. Optimized resync in Solaris Volume Manager is
implemented using resync regions (RR).
The function of DRL or RR is to ensure that a mirror is consistent in
the event of a crash. Consistency does not mean that the mirror
will contain up-to-date information. What is guaranteed is
that a read request for the same block from any side of a mirror
returns the same data. For
example, if block 10 is read from a 2-sided mirror, the data
returned must be identical
whether it is supplied from side 1 or side 2 of that mirror.
When parallel writes to a mirror are enabled, a window exists where a
system may die before writes to all sides of the mirror are
completed. To ensure the mirror is consistent in the event of a
crash, a simple
implementation might be to choose one side of a mirror and copy its
contents to all the other
sides when the system boots up. This obviously is not very
efficient. A
smarter approach is to track regions in which writes occurred and
resync only those regions. SVM uses this
technique. SVM divides a mirror into at most 1001
regions. The set of dirty regions is maintained as an incore bitmap and in the mddb.
When a write request
arrives at the mirror strategy routine, it has the block number and
length. From this information the impacted regions are computed. Prior
to issuing a write, the incore bitmap
is checked to see if the region has already been marked
dirty. If not, the incore bitmap is updated. An
asynchronous resync
kernel daemon thread monitors this bitmap every few
seconds and writes it out to the mddb if required. After the RR
bitmap is flushed to the mddb, the bitmap is reset. On boot up svc:/system/metainit:default
starts the resync kernel threads. There is one thread per
mirror. The resync thread scans the
mddb and only the regions that are marked dirty are resynced.
When a machine is shut down cleanly, the bitmap is zeroed out
and no resync occurs when starting up.
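To make the region bookkeeping concrete, here is a minimal user-space sketch of the idea in C. It is not the SVM code: the names rr_bitmap_t, rr_init, rr_is_dirty and rr_mark_dirty are made up for illustration, and the region size is simply assumed to be the mirror size divided by 1001 and rounded up. The real SVM calculation is not described in this post and may well differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RR_MAX_REGIONS 1001	/* maximum number of regions, per the text */

/* Hypothetical in-core resync region bitmap; not the real SVM structure. */
typedef struct {
	uint64_t region_blocks;				/* blocks per region */
	uint8_t  dirty[(RR_MAX_REGIONS + 7) / 8];	/* one bit per region */
} rr_bitmap_t;

/* Assumption: region size = ceil(mirror size / RR_MAX_REGIONS). */
static void
rr_init(rr_bitmap_t *rr, uint64_t mirror_blocks)
{
	assert(mirror_blocks > 0);
	rr->region_blocks =
	    (mirror_blocks + RR_MAX_REGIONS - 1) / RR_MAX_REGIONS;
	memset(rr->dirty, 0, sizeof (rr->dirty));
}

/* Return 1 if the given region is marked dirty, 0 otherwise. */
static int
rr_is_dirty(const rr_bitmap_t *rr, uint64_t region)
{
	return ((rr->dirty[region / 8] >> (region % 8)) & 1);
}

/*
 * Mark every region touched by a write of len blocks starting at blkno.
 * Returns the number of regions that were clean before this call, i.e.
 * how many bits actually had to be changed in the bitmap.
 */
static int
rr_mark_dirty(rr_bitmap_t *rr, uint64_t blkno, uint64_t len)
{
	uint64_t first, last, r;
	int newly_dirty = 0;

	assert(len > 0 && rr->region_blocks > 0);
	first = blkno / rr->region_blocks;
	last = (blkno + len - 1) / rr->region_blocks;
	for (r = first; r <= last && r < RR_MAX_REGIONS; r++) {
		if (!rr_is_dirty(rr, r)) {
			rr->dirty[r / 8] |= (uint8_t)(1u << (r % 8));
			newly_dirty++;
		}
	}
	return (newly_dirty);
}
```

A write that stays inside an already dirty region changes nothing in the bitmap, which is why repeated writes to a hot area cost no extra bookkeeping. A write that straddles a region boundary dirties both regions.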
In the mddb, the resync
bitmap is called the resync record. Every mirror has two resync
records associated with it. To reduce hot spots, the
resync records are spread across multiple mddbs. That is,
if one has 2 mirrors and 4 mddbs, then the resync record for one mirror
will be on mddb1 and mddb2. For the second mirror the resync record
will be on mddb3 and mddb4. The actual algorithm for resync
record placement is a bit more sophisticated.
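The 2 mirrors and 4 mddbs example corresponds to a plain round-robin placement, which can be sketched as follows. The function rr_record_replica is hypothetical and deliberately naive; as noted, the real placement algorithm is more sophisticated than this.

```c
#include <assert.h>

#define RR_RECORDS_PER_MIRROR 2	/* every mirror has two resync records */

/*
 * Hypothetical round-robin placement: return the 0-based index of the
 * mddb replica that holds resync record rec (0 or 1) of mirror number
 * mirror (0-based), given nreplicas available replicas.
 */
static unsigned
rr_record_replica(unsigned mirror, unsigned rec, unsigned nreplicas)
{
	assert(nreplicas > 0 && rec < RR_RECORDS_PER_MIRROR);
	return ((mirror * RR_RECORDS_PER_MIRROR + rec) % nreplicas);
}
```

With 4 replicas, mirror 0 lands on replicas 0 and 1 and mirror 1 on replicas 2 and 3; a third mirror wraps around to replicas 0 and 1 again.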
One can get metastat to display the location of the resync regions
for the mirror.
# metadb
        flags           first blk       block count
     a        u         16              8192            /dev/dsk/c1t1d0s7
     a        u         16              8192            /dev/dsk/c1t0d0s7
     a        u         16              8192            /dev/dsk/c1t2d0s0
# export MD_DEBUG=STAT
# metastat d10
d10: Mirror
    Submirror 0: d0
      State: Okay         Wed Jun  1 19:53:10 2005
    Submirror 1: d1
      State: Okay         Wed Jun  1 19:53:10 2005
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 67094528 blocks (31 GB)
    Regions which are dirty: 34% (blksize 67094 num 1001)
    Resync record[0]: 0 (/dev/dsk/c1t1d0s7 16 8192)
    Resync record[1]: 1 (/dev/dsk/c1t0d0s7 16 8192)

d0: Submirror of d10
    State: Okay         Wed Jun  1 19:53:10 2005
    Size: 67094528 blocks (31 GB)
    Stripe 0:
        Device                              Start  Dbase  State  Reloc  Hot Spare  Time
        /dev/dsk/c3t50020F23000100F7d9s0    0      No     Okay   Yes               Wed Jun  1 19:52:53 2005

d1: Submirror of d10
    State: Okay         Wed Jun  1 19:53:10 2005
    Size: 67094528 blocks (31 GB)
    Stripe 0:
        Device                              Start  Dbase  State  Reloc  Hot Spare  Time
        /dev/dsk/c3t50020F23000100F7d10s0   0      No     Okay   Yes               Wed Jun  1 19:53:04 2005

Device Relocation Information:
Device                             Reloc  Device ID
/dev/dsk/c3t50020F23000100F7d9     Yes    id1,ssd@n60020f20000100f740336c7b00023087
/dev/dsk/c3t50020F23000100F7d10    Yes    id1,ssd@n60020f20000100f740336ca20001241b
In the output
above, notice that the resync regions are spread across 2 mddbs.
I was running newfs on the mirror and therefore it shows that 34%
of the regions are dirty. The blksize refers to the size of the
resync region. If you were monitoring the iostat output
for an active mirror, you would notice that the disks that contain the
mddbs are being written to. These writes are due to the periodic
updates of resync region bitmaps to the mddb.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
SMS, Oban and multi-threading ioctls in SVM
Now that Solaris
is open sourced, it's time to share a bit of inside information.
All Volume Manager projects were named after various single malt Scotches (SMS). Over
the last couple of years we have gone through (yes, ingested) Lagavulin,
Laphroaig, Springbank, Ardbeg and the most recent one was Oban. Our
lab machines are named after either distilleries
that are defunct or the islands around Islay. For some of us, names
like
Ronaldsay, Eday, Stronsay and Shapinsay easily roll off our tongues.
Often there are applications, such as Oracle RAC or SAN file systems,
where multiple nodes need to access the shared storage simultaneously.
Cluster volume managers typically provide this functionality.
OpenSolaris supports this functionality with Solaris Cluster Volume
Manager (SCVM). Oban was its code name. SCVM supports striping,
mirroring and soft partitions. In addition, the mirror and
soft partition drivers were enhanced to support Application Based
Recovery (ABR) ioctls for block and character devices, to improve
cluster I/O performance. Oracle, for example, uses this
functionality to speed up recovery after a crash.
Solaris Volume
Manager can manage storage grouped in named disksets. For instance, one
could group all the storage for home directories in a diskset. This
diskset could be moved from one node to another as required. Data
stored on the disks in this type of diskset can also be remotely
replicated, thereby greatly improving disaster recovery. Ardbeg, a
single malt that tastes like burnt rubber, delivered the capability to move
disksets and support remote replication of disksets.
A diskset always
has a host associated with it. If multiple nodes share the same set of
disks then those nodes can participate in the same diskset. Disksets
have single owner and multi-owner attributes. The single owner
attribute is akin to EVMS's private
container; that is, only one node at any given time
can access the disks in the diskset. Multi-owner disksets allow
multiple nodes to access the disksets simultaneously. These
disksets support cluster volume management functionality. Multi-owner
disksets are
similar to EVMS's shared containers.
Internally we called them Oban disksets. The metadevices (or
volumes) in these multi-owner disksets can be managed from any node in
a cluster.
While other
blogs on OpenSolaris have focused on describing complex code, I felt it
would be interesting to describe complex problems that were addressed
with a simple solution. Normally one multi-threads sections of
code for performance. That was not the case here; multi-threading
was required for
correctness. To appreciate the
issues, one needs to understand a bit about the SVM and SCVM
functionality and the daemon rpc.mdcommd. The detailed workings
of SCVM are beyond the scope of this blog, but here are a few salient
features.
Ioctls in SVM
are used to either get configuration information or change/maintain a
metadevice. Since configuration changes are infrequent and status
of the metadevices is not a critical path, multi threading SVM ioctls
was never high on our agenda. SCVM changed that. Our goal
was to
make cluster
volume management functionality a seamless extension of Solaris Volume
Manager. This meant that local, single owner and multi-owner
disksets all had to co-exist.
For all the
nodes to have a consistent view of the state of metadevices, state
changes on one node must be propagated to all the nodes in the
cluster. Similarly when a configuration change is made on any
node, it should appear on all the nodes. The daemon rpc.mdcommd
is used to transmit and marshal SCVM meta data across a
cluster. When meta data from one node needs to be propagated to
all the nodes in the cluster, the following events occur:
Messages
typically contain configuration change information. The
contents of the message are passed to the kernel through an ioctl. Since
each diskset can have a
different master node, a single node may be master for one diskset and
slave for another. If ioctls are single threaded, then critical
messages sent to the master node for one diskset could be blocked on
that node by a state change update to another diskset. Some
of
the messages generate a sub-message which needs to be propagated to
all the nodes too. Any attempt to send the sub-message will
immediately cause a deadlock since
it is called from the context of the first ioctl. There are
other messages that need information from the kernel and therefore need
to
issue ioctl calls. These again will hang. Since
local, single owner and multi-owner
disksets can co-exist, operations on a single owner diskset can block
operations on a multi-owner diskset. For all these reasons it was
decided to multi-thread the ioctls.
While multi-threading the ioctls is itself simple, the impact of this
change on the rest of the SVM code is potentially significant. Large
chunks of the code path must be looked at to avoid race
conditions. Therefore, at the time of the project, it was decided to
multi-thread only critical ioctls for multi-owner disksets.
The obvious question is how should single threaded and
multi-threaded ioctls interact?
Most of the multi-threaded ioctls are directly related to cluster
interaction. For example, when a cluster membership is
forming, rpc.mdcommd must be suspended until a new cluster
membership
list is available. At that point the daemon must update its
notion of the active nodes in the cluster and send messages only to
those nodes. As a result, the ioctls that handle cluster
related operations were deemed to have a higher priority than the
ones that changed the state of the mddb. These ioctls also did not
interact with md structures. Based on this analysis, we decided
that while a single threaded ioctl was in progress, multi-threaded
ioctls must be allowed. The next issue was:
Should a single threaded ioctl be allowed when a
multi-threaded ioctl was in progress?
Recall that the traditional ioctls (i.e. single threaded ones) can
change
and update the mddb. These changes can result in
messages
being sent across the cluster. If the state of the cluster is
changing, the mddb
state change needs to be held back until the cluster is stable.
Hence it was decided to block single threaded ioctls if multi-threaded
ioctls were in progress. This also means
that we risk starving single threaded ioctls if multi-threaded calls
keep occurring. We deemed this risk to be minimal since
only a few ioctls were multi-threaded and if a large number of these
were occurring, it indicated a problem with the cluster. In such
a
situation sacrificing a node to enable the availability of a cluster is
reasonable.
The implementation of multi threading ioctls in this manner turns out
to be quite simple.
The code snippets are from the function mdioctl
in usr/src/uts/common/io/lvm/md/md.c
    if (!is_mt_ioctl(cmd) && md_ioctl_lock_enter() == EINTR) {
        return (EINTR);
    }

    /*
     * initialize lock tracker
     */
    IOLOCK_INIT(&lock);

    /* Flag to indicate that MD_GBL_IOCTL_LOCK is not acquired */
    if (is_mt_ioctl(cmd)) {
        /* increment the md_mtioctl_cnt */
        mutex_enter(&md_mx);
        md_mtioctl_cnt++;
        mutex_exit(&md_mx);
        lock.l_flags |= MD_MT_IOCTL;
    }
md_ioctl_lock_enter()
calls md_global_lock_enter() with ~MD_GBL_IOCTL_LOCK.
md_global_lock_enter() (only the relevant code is shown):
    if (!(global_locks_owned_mask & MD_GBL_IOCTL_LOCK)) {
        while ((md_mtioctl_cnt != 0) ||
            (md_status & MD_GBL_IOCTL_LOCK)) {
            if (cv_wait_sig_swap(&md_cv, &md_mx) == 0) {
                mutex_exit(&md_mx);
                return (EINTR);
            }
        }
        md_status |= MD_GBL_IOCTL_LOCK;
        md_ioctl_cnt++;
    }
The if (!(global_locks_owned_mask ... check will always be true in the above call sequence. We therefore get the desired behavior: while a multi-threaded ioctl is in progress, or while another single threaded ioctl is in progress, subsequent single threaded ioctls will wait.
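For readers without the kernel source at hand, the same policy can be sketched in user space with POSIX threads. This is only an analogue under stated assumptions, not the md driver code: ioctl_gate_t, mt_enter, mt_exit, st_enter and st_exit are hypothetical names, the wakeup in mt_exit is my guess at what the exit path has to do, and the signal handling (the EINTR return) of the kernel version is left out.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space analogue of the md ioctl gate. */
typedef struct {
	pthread_mutex_t mx;
	pthread_cond_t  cv;
	int mt_cnt;	/* multi-threaded ioctls in progress */
	int st_busy;	/* 1 while a single threaded ioctl holds the gate */
} ioctl_gate_t;

#define IOCTL_GATE_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

/* Multi-threaded ioctls never wait; they only bump the counter. */
static void
mt_enter(ioctl_gate_t *g)
{
	pthread_mutex_lock(&g->mx);
	g->mt_cnt++;
	pthread_mutex_unlock(&g->mx);
}

static void
mt_exit(ioctl_gate_t *g)
{
	pthread_mutex_lock(&g->mx);
	assert(g->mt_cnt > 0);
	if (--g->mt_cnt == 0)
		pthread_cond_broadcast(&g->cv);	/* wake waiting ST ioctls */
	pthread_mutex_unlock(&g->mx);
}

/*
 * Single threaded ioctls wait while any MT ioctl is in progress or
 * another ST ioctl holds the gate. Returns 1 once the gate is held.
 * With wait == 0 the call returns 0 instead of sleeping, which is
 * convenient for trying the logic from a single thread.
 */
static int
st_enter(ioctl_gate_t *g, int wait)
{
	pthread_mutex_lock(&g->mx);
	while (g->mt_cnt != 0 || g->st_busy) {
		if (!wait) {
			pthread_mutex_unlock(&g->mx);
			return (0);
		}
		pthread_cond_wait(&g->cv, &g->mx);
	}
	g->st_busy = 1;
	pthread_mutex_unlock(&g->mx);
	return (1);
}

static void
st_exit(ioctl_gate_t *g)
{
	pthread_mutex_lock(&g->mx);
	g->st_busy = 0;
	pthread_cond_broadcast(&g->cv);	/* let the next ST ioctl in */
	pthread_mutex_unlock(&g->mx);
}
```

Note the asymmetry: mt_enter succeeds even while a single threaded ioctl holds the gate, whereas st_enter has to wait for both conditions to clear. That is exactly the starvation trade-off discussed earlier.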
...and that is the story behind multi-threading SVM ioctls.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Friday May 20, 2005
People who are
familiar with LVM, LVM2 (from Linux) might find the
Solaris Volume Manager terms a bit strange. So here's a
cheat sheet. It is by no means complete. But it should get you started.
Volume Group = diskset
PE = soft partition on a slice
LE = soft partition on a metadevice
mirror = RAID 1 (it is really RAID 0+1 but works as 1+0) (/dev/md?)
raid = RAID 5
hotspare = could not find that concept in Linux
All Solaris Volume Manager commands begin with meta. This naming convention came from a device view perspective. RAID 0, RAID 1 (0+1 and 1+0), RAID 5 and soft partitions are all different types of metadevices. Solaris Volume Manager can operate either on slices or on an entire disk. To quickly get started you can look up the documentation. I like this layout since it is organized as tasks.
My quick comparison with LVM/LVM2 indicates that the Solaris Volume Manager is quite a bit more sophisticated. It is more in line with the functionality provided by EVMS or VxVM. To get started, it might be useful to look at the overview here
Solaris Volume
Manager supports 32 disksets. In each diskset, 8192
metadevices can be created. Prior to Solaris 10, /kernel/drv/md.conf
had variables nmd and
md_nsets. These variables controlled the number of metadevices and
disksets on
the system. The default value for nmd was 128 and for md_nsets was 4. If
you
wanted to create more than 128 metadevices, you would have to change
nmd
and reboot!! Gawd..you blew your 5 nines right there!
In S10, we eliminated the need for these variables. Now one can go to
the
maximum without any tuning.
Storage is grouped in disksets. The default diskset is referred to as the local set (i.e. diskset 0). All other disksets are called named disksets. When multiple nodes are connected to common storage, it must be managed in a named diskset. These named disksets can have either a single owner or multiple owners. The single owner diskset is equivalent to the private container concept in EVMS. Multi-owner disksets provide cluster volume management functionality. This is similar to the shared container concept in EVMS. Unlike EVMS, Solaris Volume Manager supports mirroring in a shared environment. A more detailed overview of the disksets is available here
Monday May 16, 2005
There are some who jump onto things quickly. I am not one of them. I still run Solaris 8 (x86) on my Pentium III at home, overclocked to 660 MHz. My other machine is a ferrari running Solaris 10. (Whew..I was wondering how I could stick in that bumper sticker line.) Blogging by my standards is new and I could have waited for another couple of years before I jumped onto this bandwagon. However, recently I have been answering questions on the opensolaris discussion group that I thought would be useful to a larger audience. What better way than to blog, and so with much trepidation, I have decided to blog. I work on the Solaris Volume Manager and have been at it for a while. Since the first blog is like those awkward dance steps, I thought I might talk about something that I am very comfortable with: the history of Sun's Volume Manager.
It started out as OnlineDiskSuite or ODS (thank marketing for naming a product that sounds like odious). It was later renamed to Solstice DiskSuite or SDS. SDS was enhanced to include a newer meta database and RAID-5. Disksets for clustering were also introduced. SDS lived through versions 4.0 through 4.2.1. The last version is supported on Solaris 8.
With the growing popularity of Solaris we felt that the volume manager should be an integral part of the operating environment. For the user this would provide seamless upgrades (and live upgrades). No need to tear down and rebuild the volume configurations anymore. It also avoided the annoyance of figuring out how to obtain a volume manager. When you loaded Solaris it was there..just like the file system. There are advantages for us too. The integrated volume manager can automatically benefit from other enhancements in the kernel. For all these reasons the volume manager became part of Solaris 9. To clearly delineate this phase, the volume manager was renamed to Solaris Volume Manager or SVM for short. SVM does not have separate versions...just like the file system (UFS) does not have separate versions.
The big change between SDS and the integrated Solaris Volume Manager
is the use of device IDs for tracking disks. The term device ID, I
believe, is a Solaris term, but its basis is in the WWN. All modern disks have a
unique identifier. SCSI and FC drives use WWNs (World Wide Names) as defined in
the SCSI spec. IDE/ATA disks use a different scheme but they too are
unique. By using devids to track disks, we addressed a big customer issue.
Volume configurations would no longer be lost even when disks get renumbered. Previously,
if drives were renumbered, the SDS configuration would be lost. This was
because SDS used the major/minor numbers to track the configuration. The
minor numbers could change for a variety of reasons. If one moved a
controller from one slot to another, the device enumeration would change,
which in turn would change the minor number. Now one can really shuffle the
drives and SVM will not only find them and construct the metadevices
correctly, but it will also update the drive names so that the user has the
correct information. Since Solaris 9, new functionality has been steadily
delivered in various Solaris 9 updates while also incorporating these
features in Solaris 10. So what does SVM deliver now?
- Support for Multiterabyte volumes
- Improved integration with Dynamic Reconfiguration
- Metassist - a high level command that assists users in creating volumes
- Support for CIM/WBEM
- Diskset mobility - the ability to import disksets
- Support for cluster volume manager
These are just the highlights that encompass a number of smaller features.
Now that I have jumped headlong into blogging..there is no turning back!!
This blog copyright 2009 by nadkarni
