I describe here the effort I spearheaded studying the performance
characteristics of the OpenStorage platform and the ways in which our
team of engineers delivered real out-of-the-box improvements to the
product that is shipping today.
One of the joys of working on the OpenStorage NAS appliance was
that solutions we found to performance issues could be immediately
transposed into changes to the appliance without further process.
The first big wins
We initially stumbled on two major issues, one for NFS synchronous
writes and one for the CIFS protocol in general. The NFS problem was a
subtle one involving the distinction between O_SYNC and O_DSYNC writes
in the ZFS intent log, and it was slowing our threaded synchronous
write test by up to a factor of 20. Fortunately I had a history of
studying that part of the code and could quickly identify the problem
and suggest a fix. This was tracked as
6683293: concurrent O_DSYNC writes to a fileset can be much improved
over NFS.
The following week, turning to CIFS studies, we saw severe
scalability limitations in the code. Here again I was fortunate to be
the first one to hit this. The problem was that, to manage CIFS
requests, the kernel code was using simple kernel allocations sized to
accommodate the largest possible request. Such large allocations and
deallocations cause what is known as a storm of TLB shootdown
cross-calls, which limits scalability.
Incredibly though, after implementing the trivial fix, I found that
the rest of the CIFS server was beautifully scalable code with no other
barriers. So with one quick and simple fix (using kmem caches) I could
demonstrate a great scalability improvement to CIFS. This was tracked
as
6686647: smbsrv scalability impacted by memory
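To make the idea behind the fix concrete, here is a toy Python analogy of an object cache: buffers are handed back to a free list and reused instead of being allocated and released for every request. This is only a sketch of the general technique, not the smbsrv code and not the Solaris kmem cache interface. The names `BufferCache` and `serve_requests` and the 64 KiB buffer size are hypothetical.

```python
class BufferCache:
    """Toy free-list cache of fixed-size buffers (illustrative only)."""

    def __init__(self, buf_size):
        self.buf_size = buf_size
        self.free_list = []      # buffers released by callers, ready for reuse
        self.fresh_allocs = 0    # times we had to allocate from scratch

    def alloc(self):
        # Reuse a cached buffer when one is available.
        if self.free_list:
            return self.free_list.pop()
        self.fresh_allocs += 1
        return bytearray(self.buf_size)

    def free(self, buf):
        # Keep the buffer around instead of releasing its memory.
        self.free_list.append(buf)


def serve_requests(cache, count):
    """Handle `count` requests in sequence, each borrowing one buffer."""
    for _ in range(count):
        buf = cache.alloc()
        buf[0] = 1               # stand-in for real request processing
        cache.free(buf)
    return cache.fresh_allocs


if __name__ == "__main__":
    cache = BufferCache(64 * 1024)   # hypothetical maximum request size
    print("fresh allocations:", serve_requests(cache, 1000))
```

In this sequential toy run, a thousand requests trigger a single fresh allocation. In the kernel, avoiding the repeated map and unmap of large buffers is what removes the cross-call storm.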
Since those two protocol problems were identified early on, I must say
that no other serious protocol performance problems have come up. While we
can always find incremental improvements to any given test, our
current implementation has held up to our testing so far.
In the next phase of the project, we did a lot of work on improving
network efficiency at high data rates. In order to deliver the
throughput that the server is capable of, we must use 10Gbps network
interfaces, and the ones available on the NAS platforms are based on
the Neptune networking interface running the nxge driver.
Network Setup
I collaborated on this with Alan Chiu, who already knew a lot about
this network card and its driver tunables, so we could quickly hash
out the issues. We had to decide on a proper out-of-the-box setup
involving:
- how many MSI-X interrupts to use
- whether to use networking soft rings or not
- what bcopy threshold to use in the driver as opposed to
binding DMA
- whether or not to use the new Large Segment Offload (LSO)
technique for transmits
We basically knew where we wanted to go here. We wanted many
interrupts on the receive side so as not to overload any CPU, and we
wanted to avoid layered soft rings, which reduce efficiency. We wanted
a low bcopy threshold so that DMA binding would be used more
frequently, as the default value was too high for this x64-based
platform. And LSO was providing a nice boost to efficiency. That got
us to a proper efficiency level.
However, we noticed that under stress and with a high number of
connections our efficiency would drop by a factor of 2 or 3. After
much head scratching we traced this to the use of too many TX DMA
channels. It turns out that with this driver and architecture, using
only a few channels leads to more stickiness in the scheduling and
much greater efficiency. We settled on 2 TX rings as a good
compromise. That got us to a level of 8-10 CPU cycles per byte
transferred in network code (more on Performance Invariants).
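To put the 8-10 cycles per byte figure in perspective, the short calculation below converts it into the CPU capacity that network code alone would consume at 10Gbps line rate. It is back-of-the-envelope arithmetic, not a measurement. The function name `cores_needed` and the 2.5 GHz clock speed are assumptions made for illustration.

```python
def cores_needed(cycles_per_byte, link_gbps, core_ghz):
    """Cores' worth of cycles spent in network code at full line rate."""
    bytes_per_sec = link_gbps * 1e9 / 8          # bits per second to bytes
    cycles_per_sec = cycles_per_byte * bytes_per_sec
    return cycles_per_sec / (core_ghz * 1e9)


if __name__ == "__main__":
    CORE_GHZ = 2.5   # hypothetical clock speed, not taken from the text
    for cpb in (8, 10):
        n = cores_needed(cpb, link_gbps=10, core_ghz=CORE_GHZ)
        print(f"{cpb} cycles/byte at 10Gbps: about {n:.1f} cores")
```

Under that assumed clock speed, saturating one 10Gbps link costs the equivalent of four to five fully busy cores. That is why a 2X or 3X drop in efficiency matters so much.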
Interrupt Blanking
Studying an open source alternative controller, we also found that on
1 of 14 metrics we were slower. That was traced to the interrupt
blanking parameter that NICs use to gain efficiency. What we found
here was that by reducing our blanking to a small value we could
leapfrog the competition (from 2X worse to 2X better) on this test
while preserving our general network efficiency. We were then on par
or better on every one of the 14 tests.
Media Streaming
When we ran thousands of 1 Mb/s media streams from our systems we
quickly found that the file-level software prefetching was hurting us.
So we initially disabled the code in our lab to run our media studies,
but at the end of the project we had to find an out-of-the-box setup
that could preserve our media results without impairing maximum read
streaming. At some point we realized that we were hitting
6469558: ZFS prefetch needs to be more aware of memory pressure. It
turns out that the internals of the zfetch code are set up to manage 8
concurrent streams per file and can read ahead up to 256 blocks or
records, in this case 128K each. So when we realized that with 1000s
of streams we could read ahead ourselves out of memory, we knew what
we needed to do. We decided on setting up 2 streams per file reading
ahead up to 16 blocks, and that seems quite sufficient to retain our
media serving throughput while keeping some prefetching capabilities.
I also note here that NFS client code will itself recognize streaming
and issue its own readahead. The backend code is then reading ahead of
client readahead requests. So we were kind of getting ahead of
ourselves here. Read more about it at
cndperf
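The memory arithmetic behind that decision can be worked out directly from the numbers above. The sketch below computes an upper bound on outstanding readahead, assuming every prefetch stream of every file is reading ahead to its limit. The function name and the figure of 1000 files are illustrative, and real usage would normally sit below this bound.

```python
RECORD_SIZE = 128 * 1024   # 128K records, as in the text


def max_readahead_bytes(files, streams_per_file, blocks_per_stream,
                        record_size=RECORD_SIZE):
    """Upper bound on prefetched data if every stream reads ahead fully."""
    return files * streams_per_file * blocks_per_stream * record_size


if __name__ == "__main__":
    GIB = 1024 ** 3
    before = max_readahead_bytes(1000, streams_per_file=8,
                                 blocks_per_stream=256)
    after = max_readahead_bytes(1000, streams_per_file=2,
                                blocks_per_stream=16)
    print(f"default limits: {before / GIB:.1f} GiB")
    print(f"tuned limits:   {after / GIB:.1f} GiB")
    print(f"reduction:      {before // after}x")
```

Even if each media file only ever drives a single prefetch stream, 256 blocks of 128K is 32 MiB per file. So thousands of streams can still pin tens of gigabytes under the default limits.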
To slog or not to slog
One of the innovative aspects of this OpenStorage server is the use of
read- and write-optimized solid state devices; see for instance
The Value of Solid State Devices.
Those SSDs are beautiful devices designed to help latency but not
throughput. A massive commit is actually better handled by regular
storage, not SSDs. It turns out that it was actually dead easy to
instruct the ZIL to recognize massive commits and divert its block
allocation strategy away from the SSD toward the common pool of
disks. We see two benefits here: the massive commits will speed up
(preventing the SSD from becoming the bottleneck), but more importantly
the SSD will now be available as a low latency device to handle
workloads that rely on low latency synchronous operations. One should
note here that the ZIL is a "per filesystem" construct, so while one
filesystem might be working on a large commit, another filesystem from
the same pool might still be running a series of small transactions and
benefit from the write-optimized SSD.
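The diversion policy described above can be sketched as a simple per-filesystem decision. This is a simplified illustration and not the actual ZIL code. The 1 MiB threshold, the function names and the filesystem names are all hypothetical.

```python
SLOG_LIMIT_BYTES = 1024 * 1024   # hypothetical cutoff for a "massive" commit


def choose_log_device(commit_bytes, slog_available=True,
                      limit=SLOG_LIMIT_BYTES):
    """Pick where one filesystem's intent-log blocks get allocated."""
    if slog_available and commit_bytes <= limit:
        return "slog"            # small commit: latency matters, use the SSD
    return "main pool"           # large commit: throughput matters, use disks


def place_commits(pending):
    """Decide independently for each filesystem (the ZIL is per filesystem)."""
    return {fs: choose_log_device(size) for fs, size in pending.items()}


if __name__ == "__main__":
    pending = {
        "pool/video": 64 * 1024 * 1024,   # one massive commit
        "pool/db": 8 * 1024,              # small synchronous transaction
    }
    for fs, device in place_commits(pending).items():
        print(f"{fs}: {device}")
```

Because the choice is made per filesystem, the large commit on one filesystem goes to the disks. The small transaction on the other filesystem keeps the SSD's low latency.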
In a similar way, when we first tested the read-optimized SSDs, we
quickly saw that streamed data would be installed in this caching
layer and that it could slow down processing later. Again, the beauty
of working on an appliance and closely with developers meant that by
the following build, those problems had been solved.
Transaction Group Time
ZFS operates by issuing regular transaction groups in which
modifications since the last transaction group are recorded on disk
and the uberblock is updated. This used to be done at a 5 second
interval, but with the recent improvement to the write throttling code
this became a 30 second interval (on light workloads), with the aim of
not generating more than 5 seconds of I/O per transaction group. The
target of 5 seconds of I/O per txg was chosen to maximize the ratio of
data to metadata in each txg, delivering more application throughput.
Now these Storage 7000 servers will typically have lots of I/O
capability on the storage side, and the data/metadata ratio is not as
much of a concern as for a small JBOD storage. What we found was that
we could reduce the target of 5 seconds of I/O down to 1 while still
preserving good throughput. Having this smaller value smoothed out
operation.
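As a rough illustration of why the smaller target smooths things out, the amount of data that goes into one transaction group scales with the number of seconds of I/O it is allowed to generate. The sketch below assumes a pool that can absorb a steady 1 GB/s of writes. That figure and the function name are hypothetical and chosen only to make the arithmetic concrete.

```python
def txg_target_bytes(pool_write_bytes_per_sec, sync_target_sec):
    """Approximate data per txg if syncing it should take the target time."""
    return pool_write_bytes_per_sec * sync_target_sec


if __name__ == "__main__":
    POOL_BW = 10 ** 9    # hypothetical backend write bandwidth: 1 GB/s
    for target in (5, 1):
        size_gb = txg_target_bytes(POOL_BW, target) / 10 ** 9
        print(f"{target}s of I/O per txg: about {size_gb:.0f} GB per burst")
```

With the same backend, a 1 second target means bursts one fifth the size, issued more often. On a system with plenty of disk bandwidth this costs little in data-to-metadata ratio.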
IT JUST WORKS
Well, that is certainly the goal. In my group, we spent the last year
performance testing these OpenStorage systems, finding and fixing
bugs, suggesting code improvements, and looking for better compromises
for common tunables. At this point, we're happy with the state of the
systems, particularly for mirrored configurations with write-optimized
SSD accelerators. Our code is based on a recent OpenSolaris (from
August) that already has a lot of improvements over Solaris 10,
particularly for ZFS, to which we've added specific improvements
relevant to NAS storage. We think these systems will at times deliver
great performance (see Amithaba's results) but almost always shine in
the price/performance categories.