One of the most anticipated features of the 2009.Q3 release of the
fishworks OS is a complete rewrite of the iSCSI target implementation
known as Common Multiprotocol SCSI Target or COMSTAR. The new target
code is an in-kernel implementation that replaces what was previously
known as the iSCSI target daemon, a user-level implementation of iSCSI.
Should we expect huge performance gains from this change?
You bet!
But as with most performance questions, the answer is often: it
depends. The measured performance of a given test is gated by the
weakest link it exercises. iSCSI is just one component among many
that can end up gating performance. If the daemon was not the limiting
factor in your test, then that's your answer.
The target daemon was a userland implementation of iSCSI: daemon
threads would read data from a storage pool and write it to a socket,
or vice versa. Moving to a kernel implementation opens up options to
bypass at least one of the copies, and that is being considered as a
future option. But extra copies, while undesirable, do not necessarily
contribute much to small-packet latency or large-request throughput.
For small requests, the copy is small change compared to the
request handling. For large-request throughput, the important thing is
that the data path establishes a pipelined flow that keeps every
component busy at all times.
But the way threads interact with one another can be a much greater
factor in delivered performance, and there lies the problem. The old
target daemon suffered from one major flaw: each and every iSCSI
request required multiple trips through a single queue (shared
between all LUNs), and that queue was read and written by 2 specific
threads. Those 2 threads would end up fighting for the same locks.
This was compounded by the fact that user-level threads can be put to
sleep when they fail to acquire a mutex, and that going to sleep is a
costly operation for a user-level thread, implying a system call and
all the accounting that goes with it.
So while the iSCSI target daemon gave reasonable service for large
requests, it was much less scalable in terms of the number of IOPS it
could serve and the CPU efficiency with which it could do so. IOPS
are of course a critical metric for block protocols.
As an illustration, with 10 client initiators and 10 threads per
initiator (so 100 outstanding requests) doing 8K cache-hit reads,
we observed:
| Old Target Daemon | Comstar | Improvement |
| 31K IOPS | 85K IOPS | 2.7X |
Moreover, the target daemon was consuming 7.6 CPUs to service those
31K IOPS, while COMSTAR could handle 2.7X as many IOPS consuming only
10 CPUs, a 2X improvement in IOPS per CPU.
On the write side, with a disk pool that had 2 striped write-optimised
SSDs, COMSTAR gave us 50% more throughput (130 MB/sec vs 88 MB/sec)
and 60% more CPU efficiency.
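The ratios quoted above can be re-derived from the raw figures. The short sketch below does that arithmetic and, as an extra, uses Little's law to estimate the mean request latency each target would deliver, assuming all 100 requests stay in flight for the whole run. The latency figures are an inference from that assumption, not something measured in the test.

```python
# Figures quoted in the post: 8K cache-hit reads, 10 initiators x 10 threads.
OUTSTANDING = 10 * 10

OLD = {"iops": 31_000, "cpus": 7.6}   # old target daemon
NEW = {"iops": 85_000, "cpus": 10.0}  # COMSTAR


def iops_per_cpu(run):
    """CPU efficiency: IOPS delivered per busy CPU."""
    return run["iops"] / run["cpus"]


def mean_latency_ms(outstanding, iops):
    """Little's law: mean latency = requests in flight / throughput.

    Assumes the requests are outstanding at all times (closed loop).
    """
    return outstanding / iops * 1000.0


iops_gain = NEW["iops"] / OLD["iops"]
efficiency_gain = iops_per_cpu(NEW) / iops_per_cpu(OLD)
write_gain = 130 / 88  # MB/sec, COMSTAR vs old daemon

print(f"IOPS gain:            {iops_gain:.2f}x")
print(f"IOPS per CPU gain:    {efficiency_gain:.2f}x")
print(f"Write throughput:     {write_gain:.2f}x")
print(f"Est. latency (old):   {mean_latency_ms(OUTSTANDING, OLD['iops']):.2f} ms")
print(f"Est. latency (new):   {mean_latency_ms(OUTSTANDING, NEW['iops']):.2f} ms")
```

The 60% CPU efficiency gain on the write side cannot be re-derived this way, since the CPU counts for that test are not given.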
Immediatedata
During our testing we noted a few interesting contributors to delivered
performance. The first is the setting of the iSCSI immediatedata
parameter (see iscsiadm(1M)). On the write path, that parameter causes
the iSCSI initiator to send up to 8K of data along with the initial
request packet. While this is generally a good idea, we found that for
certain write sizes it would trigger a condition in the ZIL that
caused ZFS to issue more data than necessary through the logzillas.
The problem is well understood and remediation is underway; we expect
to get to a situation in which keeping the default value of
immediatedata=yes is best. But as of today, for those attempting
world-record data transfer speeds through logzillas, setting
immediatedata=no and using a 32K or 64K write size might yield
positive results, depending on your client OS.
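To make the effect of the parameter concrete, here is a minimal model of how a single write is split between the initial command packet and the data sent afterwards. The 8K limit is the one quoted above; the function name and the simple two-way split are illustrative assumptions, and the sketch says nothing about how the ZIL reacts to each case.

```python
IMMEDIATE_LIMIT = 8 * 1024  # up to 8K travels with the request, per the post


def split_write(write_size, immediatedata=True, limit=IMMEDIATE_LIMIT):
    """Return (bytes sent with the command, bytes sent afterwards).

    Hypothetical helper: with immediatedata=yes the first `limit` bytes
    ride along with the initial request; with immediatedata=no the whole
    payload follows in later data packets.
    """
    immediate = min(write_size, limit) if immediatedata else 0
    return immediate, write_size - immediate


for size_kb in (4, 8, 32, 64):
    for flag in (True, False):
        with_cmd, later = split_write(size_kb * 1024, immediatedata=flag)
        setting = "yes" if flag else "no"
        print(f"{size_kb:>2}K write, immediatedata={setting:<3}: "
              f"{with_cmd:>5} bytes with command, {later:>5} bytes after")
```

With immediatedata=yes, a 32K or 64K write arrives as an 8K piece plus a remainder, whereas with immediatedata=no it arrives as one payload after the command.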
Interrupt Blanking
Interested in low-latency request response? Interestingly, a chunk of
that response time is lost in the obscure settings of network card
drivers. Network cards will often delay pending interrupts in the hope
of coalescing more packets into a single interrupt. The extra
efficiency often results in more throughput at high data rates, at the
expense of small-packet latency. For 8K requests we managed to get 15%
more single-threaded IOPS by tweaking one such client-side
parameter. Historically such tuning has always been hidden in the
bowels of each driver and specific to every client OS, so that's too
broad a topic to cover here. But for Solaris clients, the Crossbow
framework aims, among other things, to make the latency vs throughput
decision much more adaptive to operating conditions, relaxing the need
for per-workload tuning.
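For a single-threaded client, IOPS is simply the inverse of the per-request round-trip time, so the 15% gain can be translated into a latency saving. The sketch below does that conversion; the 230 and 30 microsecond figures in the second call are made-up values chosen only to show the relationship, not measurements.

```python
def latency_reduction(iops_gain):
    """Fractional drop in per-request latency implied by an IOPS gain.

    With one request in flight, IOPS = 1 / latency, so
    new_latency / old_latency = 1 / (1 + iops_gain).
    """
    return 1.0 - 1.0 / (1.0 + iops_gain)


def iops_gain_from_saved_delay(base_latency_us, saved_us):
    """IOPS gain when `saved_us` of interrupt delay is removed per request."""
    return base_latency_us / (base_latency_us - saved_us) - 1.0


# The 15% gain reported above corresponds to roughly 13% less latency.
print(f"latency reduction: {latency_reduction(0.15):.1%}")

# Hypothetical numbers: shaving 30us off a 230us round trip gives 15%.
print(f"IOPS gain: {iops_gain_from_saved_delay(230, 30):.1%}")
```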
WCE Settings
Another important parameter to consider for COMSTAR is the 'write
cache enable' bit. By default every write request to an iSCSI LUN needs
to be committed to stable storage, as this is what most consumers of
block storage expect. That means that each individual write
request to a disk-based storage pool will take at minimum a disk
rotation, or 5ms to 8ms, to commit. This is also why a write-optimised
SSD is quite critical to many iSCSI workloads, often yielding 10X
performance improvements. Without such an SSD, iSCSI performance will
appear quite lackluster, particularly for lightly threaded workloads,
which are more affected by latency characteristics.
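The 5ms to 8ms commit time puts a hard ceiling on synchronous write IOPS for lightly threaded workloads. A rough sketch of that ceiling follows, assuming each thread issues one write at a time and waits for the commit; the RPM values are common disk speeds used for illustration and are not taken from the test setup.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def sync_write_iops(commit_latency_ms, threads=1):
    """Upper bound on IOPS when each thread waits for every commit."""
    return threads * 1000.0 / commit_latency_ms


for rpm in (7200, 10_000, 15_000):
    print(f"{rpm:>6} RPM: one rotation = {rotation_ms(rpm):.2f} ms")

for latency in (5.0, 8.0):
    print(f"{latency} ms commit: at most "
          f"{sync_write_iops(latency):.0f} IOPS per thread")
```

Under this model a single thread tops out at 125 to 200 writes per second on disks alone, which is why cutting the commit latency matters so much more than raw bandwidth for such workloads.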
One could then feel justified in setting the write cache enable bit on
some LUNs in order to recoup some spunk in their engine. One piece of
good news here is that in the new 2009.Q3 release of fishworks the
setting is now persistent across reboots and reconnection events,
fixing a nasty condition of 2009.Q2. However, one should be very
careful with this setting, as the end consumer of block storage
(Exchange, NTFS, Oracle, ...) is then quite probably operating under an
unexpected set of conditions. This setting can lead to application
corruption in case of an outage (there is no risk to the storage's
internal state).
There is one exception to this caveat, and it is ZFS itself. ZFS is
designed to operate safely and correctly on top of devices that have
their write cache enabled. That is because ZFS will flush write
caches whenever application semantics or its own internal consistency
require it. So a zpool created on top of iSCSI LUNs would be well
justified in setting WCE on the LUNs to boost performance.
Synchronous write bias
Finally, as described in my blog entry about Synchronous write bias,
we now have the option to bypass the write-optimised SSDs for a LUN if
the workload it receives is less sensitive to latency. This would be
the case for a highly threaded workload doing large data
transfers. Experimenting with this new property is warranted at this
point.
Is this iSCSI target compatible with the Cluster feature in Microsoft Hyper-V Server 2008 R2?
Posted by af on September 17, 2009 at 03:17 PM MEST
Yes it is. I have done initial testing with Hyper-V and Microsoft Server 2008 R2. The 2009.Q3 release fixed a SCSI-3 PGR problem that prevented this in the 2009.Q2 release. I have set up iSCSI LUNs from a 7000 as cluster shared volumes and have done live migrations.
Posted by Ryan Pratt on September 17, 2009 at 11:00 PM MEST
Thank you.
Posted by af on September 17, 2009 at 11:57 PM MEST
> a complete rewrite of the iSCSI target implementation
> known as Common Multiprotocol SCSI Target or COMSTAR
Just to clarify this specific misunderstanding in the otherwise excellent post: COMSTAR is a separate kernel software layer, not specific to iSCSI. The new target rewrite relies on COMSTAR to provide SCSI-related functionality, but is hardly COMSTAR by itself. Further info on the COMSTAR design and its relation to iSCSI and other transport providers can be found at http://opensolaris.org/os/project/comstar.
Posted by Andrey Kuzmin on September 22, 2009 at 05:25 PM MEST
Can you please explain how many and which interfaces you used to reach the 85K IOPS on the iSCSI target side? And what type of machine was the iSCSI target running on?
Posted by Christian Schmidt on September 23, 2009 at 05:05 PM MEST
Christian, this was over a single 10GbE link. This was with the older version of the 7410 with 16 cores (10 of which were busy).
Posted by Roch Bourbonnais on September 24, 2009 at 09:52 AM MEST
Hi,
We have been facing serious performance issues on the S7110 with iSCSI LUNs. The new Q3 firmware has been released by Sun, with major changes in the iSCSI stack. I wanted to know the impact on old iSCSI LUNs and the overall impact on the S7110 after upgrading to Q3. Also, if somebody could send the steps to reconfigure old LUNs after the upgrade, that would be a great help.
Thanks
Reeshi Agrawal
Posted by Reeshi on September 25, 2009 at 03:07 PM MEST
Actually, immediate data improves performance (write performance in particular). The reason it was only 8K was a bug which I believe was recently fixed:
6740060 iscsit does not send or handle MaxRecvDataSegmentLength key properly
But MaxRecvDataSegmentLength is still set to 32K, which I think is too low. It should be set to 128K. I made a similar change in my private gate a few weeks ago and saw a consistent write improvement of 20% (throughput).
Posted by Sumit Gupta on October 04, 2009 at 07:43 PM MEST
In case someone asks, looking in the Sun docs for "wce" with the intention of finding "write cache enable" will not return happiness. The property to set is "wcd" for "write cache disable." Obviously, the fishworks interface will make this more user-friendly. For command line aficionados:
To see the current setting:
stmfadm list-lu -v LU-NAME
To enable:
stmfadm modify-lu --lu-prop wcd=false LU-NAME
See the stmfadm command for details.
Posted by Richard Elling on October 09, 2009 at 02:36 AM MEST