Optimized Resyncs in Solaris Volume Manager
Over the last couple of months, a number of people have asked about optimized resyncs. People familiar with VxVM might know this feature as DRL (Dirty Region Logging). In Solaris Volume Manager (SVM) the same functionality is called optimized resync, and it is implemented using resync regions (RR).
The function of DRL or RR is to ensure that a mirror is consistent in the event of a crash. Consistency does not mean that the mirror will contain up-to-date information. What is guaranteed is that a read of the same block from any side of the mirror returns the same data. For example, if block 10 is read from a two-sided mirror, the data returned must be identical whether it is supplied from side 1 or side 2 of that mirror.
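
To make the distinction between "consistent" and "up to date" concrete, here is a minimal sketch. It is a toy model, not SVM code: the sides of the mirror are plain Python lists, and the names `is_consistent` and `resync_from` are made up for illustration.

```python
def is_consistent(sides):
    """True if every block reads the same from every side of the mirror."""
    # zip(*sides) groups the copies of block 0, block 1, ... together.
    return all(len(set(copies)) == 1 for copies in zip(*sides))


def resync_from(sides, source):
    """Copy the chosen side over all sides (hypothetical full resync)."""
    return [list(sides[source]) for _ in sides]


# A crash hit after side 1 received the write to block 2 but before side 2 did.
side1 = ["a", "b", "NEW"]
side2 = ["a", "b", "old"]

print(is_consistent([side1, side2]))        # the two sides disagree on block 2
repaired = resync_from([side1, side2], source=1)
print(is_consistent(repaired), repaired[0][2])
```

After the resync from side 2 the mirror is consistent again, even though block 2 now holds the old data on both sides. That is exactly the guarantee described above: identical reads from every side, not the newest data.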
When parallel writes to a mirror are enabled, a window exists in which the system may crash before the writes to all sides of the mirror have completed. To ensure the mirror is consistent after a crash, a simple implementation might choose one side of the mirror and copy its contents to all the other sides when the system boots up. This is obviously not very efficient. A smarter approach is to track the regions in which writes occurred and resync only those regions. SVM uses this technique. SVM divides a mirror into at most 1001 regions. The dirty state of these regions is maintained as an incore bitmap and in the mddb.
When a write request arrives at the mirror strategy routine, it carries the block number and length. From this information the impacted regions are computed. Prior to issuing the write, the incore bitmap is checked to see whether the region has already been marked dirty. If not, the incore bitmap is updated. An asynchronous resync kernel daemon thread checks this bitmap every few seconds and writes it out to the mddb if required. After the RR bitmap is flushed to the mddb, the bitmap is reset.
On boot up, svc:/system/metainit:default starts the resync kernel threads. There is one thread per mirror. The resync thread scans the mddb, and only the regions that are marked dirty are resynced. When a machine is shut down cleanly, the bitmap is zeroed out and no resync occurs on the next startup.
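
The write path and the boot-time behavior described above can be sketched roughly as follows. This is a toy model and not the SVM implementation: the class and method names are hypothetical, the mddb copy is just a second Python list, and the rule for when a flush is "required" (here: whenever the two copies differ) is my simplifying assumption, since the details are not covered in this post.

```python
class DirtyRegionTracker:
    """Toy model of resync-region tracking (all names are illustrative)."""

    def __init__(self, total_blocks, region_blocks):
        self.region_blocks = region_blocks
        # Ceiling division: the last region may be shorter than the others.
        num_regions = -(-total_blocks // region_blocks)
        self.incore = [False] * num_regions   # stands in for the incore bitmap
        self.on_disk = [False] * num_regions  # stands in for the copy in the mddb

    def regions_for(self, blkno, length):
        """Regions touched by a write of `length` blocks starting at `blkno`."""
        first = blkno // self.region_blocks
        last = (blkno + length - 1) // self.region_blocks
        return range(first, last + 1)

    def write(self, blkno, length):
        """Mark the impacted regions dirty before the write is issued."""
        for region in self.regions_for(blkno, length):
            if not self.incore[region]:
                self.incore[region] = True
        # The write to the submirrors would be issued here.

    def flush(self):
        """What the periodic daemon thread does in this model."""
        if self.incore != self.on_disk:       # assumed meaning of "if required"
            self.on_disk = list(self.incore)
        self.incore = [False] * len(self.incore)  # reset after the flush

    def regions_to_resync(self):
        """What a boot-time resync thread would find marked dirty."""
        return [r for r, dirty in enumerate(self.on_disk) if dirty]

    def clean_shutdown(self):
        """Zero both copies so that no resync happens on the next boot."""
        self.incore = [False] * len(self.incore)
        self.on_disk = [False] * len(self.on_disk)


tracker = DirtyRegionTracker(total_blocks=1000, region_blocks=100)
tracker.write(blkno=95, length=10)   # blocks 95..104 straddle regions 0 and 1
tracker.flush()
print(tracker.regions_to_resync())
```

A write that straddles a region boundary dirties both regions, which is why the first and last block of the request are both mapped to a region number.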
In the mddb, the resync bitmap is called the resync record. Every mirror has two resync records associated with it. To reduce hot spots, the resync records are spread across multiple mddbs. For example, with 2 mirrors and 4 mddbs, the resync records for the first mirror will be on mddb1 and mddb2, and those for the second mirror will be on mddb3 and mddb4. The actual algorithm for resync record placement is a bit more sophisticated.
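
As a rough illustration of the idea of spreading the records, here is a naive round-robin placement. It is explicitly not the algorithm SVM uses (which, as noted, is more sophisticated); the function name `place_resync_records` and the 0-based replica indices are assumptions made for this sketch.

```python
def place_resync_records(num_mirrors, num_replicas, records_per_mirror=2):
    """Assign each mirror's resync records to replicas in round-robin order.

    Returns a dict mapping mirror index -> list of 0-based replica indices.
    """
    placement = {}
    slot = 0
    for mirror in range(num_mirrors):
        placement[mirror] = [
            (slot + i) % num_replicas for i in range(records_per_mirror)
        ]
        slot += records_per_mirror
    return placement


# 2 mirrors and 4 mddbs: mirror 0 lands on replicas 0 and 1 (mddb1, mddb2),
# mirror 1 on replicas 2 and 3 (mddb3, mddb4).
print(place_resync_records(num_mirrors=2, num_replicas=4))
```

With more mirrors than replica pairs the indices simply wrap around, so the load stays roughly even across the mddbs.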
Setting MD_DEBUG=STAT in the environment makes metastat display the location of the resync records for a mirror.
# metadb
        flags           first blk       block count
     a        u         16              8192            /dev/dsk/c1t1d0s7
     a        u         16              8192            /dev/dsk/c1t0d0s7
     a        u         16              8192            /dev/dsk/c1t2d0s0
# export MD_DEBUG=STAT
# metastat d10
d10: Mirror
    Submirror 0: d0
      State: Okay         Wed Jun 1 19:53:10 2005
    Submirror 1: d1
      State: Okay         Wed Jun 1 19:53:10 2005
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 67094528 blocks (31 GB)
    Regions which are dirty: 34% (blksize 67094 num 1001)
    Resync record[0]: 0 (/dev/dsk/c1t1d0s7 16 8192)
    Resync record[1]: 1 (/dev/dsk/c1t0d0s7 16 8192)

d0: Submirror of d10
    State: Okay         Wed Jun 1 19:53:10 2005
    Size: 67094528 blocks (31 GB)
    Stripe 0:
        Device                             Start  Dbase  State  Reloc  Hot Spare  Time
        /dev/dsk/c3t50020F23000100F7d9s0   0      No     Okay   Yes               Wed Jun 1 19:52:53 2005

d1: Submirror of d10
    State: Okay         Wed Jun 1 19:53:10 2005
    Size: 67094528 blocks (31 GB)
    Stripe 0:
        Device                             Start  Dbase  State  Reloc  Hot Spare  Time
        /dev/dsk/c3t50020F23000100F7d10s0  0      No     Okay   Yes               Wed Jun 1 19:53:04 2005

Device Relocation Information:
Device                           Reloc  Device ID
/dev/dsk/c3t50020F23000100F7d9   Yes    id1,ssd@n60020f20000100f740336c7b00023087
/dev/dsk/c3t50020F23000100F7d10  Yes    id1,ssd@n60020f20000100f740336ca20001241b
In the output above, notice that the two resync records are placed on 2 different mddbs. I was running newfs on the mirror at the time, which is why 34% of the regions are shown as dirty. The blksize value is the size of each resync region in blocks, and num is the number of regions. If you monitor the iostat output for an active mirror, you will notice that the disks that contain the mddbs are being written to. These writes are due to the periodic updates of the resync region bitmaps in the mddb.
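
The blksize and num values in the metastat output can be reproduced with a little arithmetic. The formula below is only my reading of the numbers shown (size divided by 1000, with one extra region covering the remainder); it is an assumption that happens to match this output, not something taken from the SVM source, and `region_geometry` is a made-up name.

```python
def region_geometry(total_blocks, target_regions=1000):
    """Guess the region size and count from the mirror size in blocks.

    Assumption: the region size is total_blocks // 1000 and any leftover
    blocks form one additional, shorter region.
    """
    blksize = total_blocks // target_regions
    num = -(-total_blocks // blksize)  # ceiling division
    return blksize, num


blksize, num = region_geometry(67094528)
print(blksize, num)

# Rough amount of data a resync would copy if 34% of the regions are dirty.
dirty_regions = int(0.34 * num)
print(dirty_regions, dirty_regions * blksize)
```

For the mirror above this gives a region size of 67094 blocks and 1001 regions, matching the "blksize 67094 num 1001" line. With 34% of the regions dirty, a resync would copy roughly a third of the mirror instead of all 67094528 blocks.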
