|
I just tracked down a minor performance issue on my desktop in a couple of
minutes, and thought it may make a good example to follow as it demonstrates
using
Analytics
along with dtrace, truss, snoop and kstats - really the entire suite of perf
tools I regularly use on Solaris.
I was downloading a file and happened to run the kstat based nicstat
tool to watch network traffic. After the download finished, traffic
continued:
brendan@deimos:~> nicstat 1
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
20:53:30 nge0 10.81 13.13 35.77 37.69 309.4 356.6 0.02 0.00
20:53:30 lo0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
20:53:31 nge0 28.17 21.71 89.51 91.50 322.2 243.0 0.04 0.00
20:53:31 lo0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
20:53:32 nge0 26.18 19.71 84.15 85.14 318.6 237.0 0.04 0.00
20:53:32 lo0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
20:53:33 nge0 26.89 19.93 80.18 85.13 343.4 239.7 0.04 0.00
20:53:33 lo0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
20:53:34 nge0 26.89 21.50 88.14 91.11 312.3 241.6 0.04 0.00
20:53:34 lo0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C
It's not much, but my desktop should be idle, and yet there are regular
network reads and writes. I'd like to know what that is.
Running snoop should give me a quick idea of the traffic over the wire:
root@deimos:/> snoop -r | cut -c1-80
Using device nge0 (promiscuous mode)
192.168.1.109 -> 192.168.2.156 NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA
192.168.2.156 -> 192.168.1.109 NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK
192.168.1.109 -> 192.168.2.156 NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA
192.168.2.156 -> 192.168.1.109 NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK
192.168.1.109 -> 192.168.2.156 NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA
192.168.2.156 -> 192.168.1.109 NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK
192.168.1.109 -> 192.168.2.156 NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA
192.168.2.156 -> 192.168.1.109 NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK
192.168.1.109 -> 192.168.2.156 NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA
192.168.2.156 -> 192.168.1.109 NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK
[...]
That's not what I was expecting. I'm doing NFS traffic? The only
NFS share I have mounted is my home directory, and I'm not doing any sustained
I/O right now.
Fortunately my home directory server is a
Sun Storage
system, so I can examine the NFS traffic with Analytics from the server side:
I'll begin by drilling down by client, as I know my desktop is "deimos":
This shows I'm doing more NFS IOPS to the home directory server than anyone
else.
Now to drill down further, by type of operation and by file:
".elinks"? elinks is a text based
browser that I use - a modern version of lynx.
Here's a screenshot of elinks viewing osnews:
So why is elinks causing so much NFS traffic?
Right now I could aim
DTrace
at elinks to get a detailed view of what it is up to, but as I'm not worried
about slowing the target in this case, I'll take a quick look with truss to
start with (runs non-root by default, fewer keystrokes and inbuilt syscall translation):
brendan@deimos:~> truss -fp `pgrep elinks`
295062: pollsys(0x08047AC0, 3, 0x08047B68, 0x00000000) = 0
295062: alarm(0) = 0
295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
295062: time() = 1258176283
295062: time() = 1258176283
295062: time() = 1258176283
295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
295062: time() = 1258176283
295062: time() = 1258176283
295062: time() = 1258176283
295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
295062: time() = 1258176283
295062: time() = 1258176283
295062: time() = 1258176283
295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
295062: time() = 1258176283
295062: time() = 1258176283
295062: time() = 1258176283
295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
[...]
Ahh - elinks is repeatedly trying to read a non-existant file from my home
directory, which is causing the NFS traffic.
To get an accurate read on the rate of these opens, I'll use dtrace to
frequency count the filenames that were called with open() by elinks, printing
a summary every second:
root@deimos:/> dtrace -n 'syscall::open*:entry /execname == "elinks"/ {
@[copyinstr(arg0)] = count(); } tick-1sec { printa(@); clear(@); }'
dtrace: description 'syscall::open*:entry ' matched 3 probes
CPU ID FUNCTION:NAME
1 87538 :tick-1sec
/home/brendan/.elinks/formhist 63
1 87538 :tick-1sec
/home/brendan/.elinks/formhist 63
1 87538 :tick-1sec
/home/brendan/.elinks/formhist 63
^C
It's 63 opens per second, all of the ".elinks/formhist" file.
DTrace can dig deeper, shedding light as to why elinks is reading that file.
I can just add "ustack()" as a second key to the DTrace aggregation used to
frequency count filenames, to include the entire user stack trace:
root@deimos:/> dtrace -n 'syscall::open*:entry /execname == "elinks"/ {
@[copyinstr(arg0), ustack()] = count(); } tick-1sec { printa(@); clear(@); }'
dtrace: description 'syscall::open*:entry ' matched 3 probes
CPU ID FUNCTION:NAME
1 87538 :tick-1sec
/home/brendan/.elinks/formhist
libc.so.1`__open64+0x7
libc.so.1`_endopen+0xb8
libc.so.1`fopen64+0x29
elinks`load_formhist_from_file+0x57
elinks`get_form_history_value+0x39
elinks`draw_forms+0xec
elinks`draw_view_status+0x1a
elinks`draw_doc+0x3fd
elinks`refresh_view+0x1b
elinks`draw_formatted+0x115
elinks`tabwin_func+0xb4
elinks`term_send_event+0x154
elinks`redraw_terminal+0x36
elinks`redraw_leds+0xb2
elinks`check_timers+0x87
elinks`select_loop+0x19a
elinks`main+0x43
elinks`_start+0x80
63
I'm not familiar with the elinks source, so I'll read from the bottom of the
stack trace up, looking for high level functions. It looks generic until I get
to redraw_leds(). leds? LEDs? There is a pseudo-LED text display in the
bottom right of the screen. It's calling redraw_terminal(), which is calling
refresh_view(), and then into load_formhist_from_file() - causing the file open.
Doing file opens with the frequent update of the status LEDs sounds like a bug, and
perhaps a known one. A minute with google found that this was known and
fixed in the next version of elinks.
While I was watching, the home directory server for fishworks was serving most of its NFS IOPS for a bug in elinks on my desktop. Which I can eliminate by upgrading my elinks version. This is an example of the best type of performance win: eliminating unnecessary work.
I'd like to show the benefit of the
Hybrid
Storage Pool (HSP) in the simplest possible way, as an addition to
my top
speed results for the
Sun
Storage
7410.
The HSP uses both read and write optimized flash based solid state disks (SSDs) to
improve the performance of file I/O, with the ZFS L2ARC and SLOG (seperate ZIL) technologies:

old model |

new model |

with ZFS |
I've demonstrated both the HSP technologies previously in detail, and explained how they work:
Here I'll test how these HSP technologies perform with simple workloads, to
measure best-case latency.
Disclaimer: Performance isn't simple. Measuring the speed of simple
workloads can be easy; inferring from this how a system will behave in any
given production environment can be quite hard. The best way to understand
performance is to test the system with a workload that closely resembles the
production environment.
Latency
The results I'll measure for the L2ARC and SLOG will be latency, not IOPS or
throughput. It doesn't make sense to compare IOPS values alone while ignoring
the latency of those IOPS ("10,000 IOPS" from flash memory does not equal
"10,000 IOPS" from slow rotating disks!) Throughput
isn't characteristic either - while the HSP delivers great throughput, it usually
does so by minimizing use of the L2ARC and SLOG and using multiple disk
spindles instead.
The latency presented here will be measured from the NFS layer using
Analytics
on the Sun Storage server. This latency will include the overheads from processing the NFS protocol, the ZFS filesystem, and waiting for the flash based
device I/O itself.
The reason is to best portray the delivered performance of the whole product,
rather than showing component performance.
Testing Notes
A couple of notes about performance testing SSDs.
- Age the SSDs: Flash memory based SSDs can be considerably faster the first
time you fill them. The second time they are populated the I/O latency can
become much worse as the data becomes fragmented. (Fragmentation may sound odd
in the context of flash memory, but this behavior does exist.)
So, the SSDs I test are aged - they've been completely filled with data many times
before.
- Use Gbytes of data: Like rotating disks, SSD devices can have
on-disk DRAM to cache I/O. If you are interested in flash memory read latency but
test this by rereading a small amount of data (Mbytes), you may hit from the
on-disk cache instead. I use large working sets (in terms of Gbytes) to avoid
this.
L2ARC
See my
screenshots for
the full introduction to the L2ARC. To characterize it, I'll perform random
512 byte reads over NFSv3 from a single thread on a single client. The target
is a 7410 (Barcelona) with 6 L2ARC devices (Readzillas). Some of the workload
will hit from DRAM, some from the L2ARC, and some from disk. The amount from
each will depend on cache warmth and working set size; what we are interested
in here is the latency.
To find the top speed, the ZFS record size is set to 512 bytes
to match the I/O size: but before you try this yourself, understand that
ZFS record sizes smaller than 4 Kbytes do begin to cost noticeable amounts of
DRAM (bigger metadata/data ratio) and significantly reduce streaming I/O
performance. Before this sounds too unlikely to be interesting, note that a
workload involving thousands of tiny files may behave in a similar way; the ZFS
record size only takes effect if the file becomes bigger than that size.
The following screenshots show NFS read latency heat maps at different vertical
ranges (click for larger versions), each corresponds to the HSP technology returning that read. I've also listed the latency range which covers most of the I/O:
DRAM latency: 0 - 10 us
This shows hits from the ZFS DRAM cache (the ARC: Adaptive Replacement Cache), which are fast as
expected.
L2ARC latency: 128 - 260 us
The L2ARC is returning these 512 byte reads with consistently low latency.
The range may sound high when compared to advertised SSD 512 byte read latency, but
remember that these are aged SSDs, and this time includes the ZFS and NFS
overheads.
Disk latency: 1900 - 9700 us
These 7200 RPM disks are returning the reads with latency from 2 to 10 ms,
as expected for random disk I/O (platter rotation time + head seek time.)
SLOG
See my
screenshots for
the full introduction to the SLOG. To characterize it, I'll perform 512 byte
O_DSYNC writes over NFSv3. Since these will all be handled by the SSD SLOG
device (Logzilla), I'll also show what happens without them - the synchronous
write latency to disk. As before, I'll show screenshots and latency ranges.
SLOG latency: 137 - 181 us
Most of the I/O is in a tight band in the 100 to 200 us range (the
screenshot doesn't look so tight due to the false color palette; see the
range average values on the left.)
Disk latency: 1950 - 9170 us
Without the SSD based SLOG devices, synchronous write I/O is served here
from 7200 RPM disks, with a latency between 2 and 9 ms.
Reality Check
While I've been showing top speeds of the L2ARC and SLOG devices, I can help
set expectations by describing cases where you will be slower (or faster!) than
these top speeds.
- I/O size. To find the top speeds above I used a application I/O size
of 512 bytes. Bigger I/O takes longer, although most of the time for the 512
byte I/O is processing the I/O - not data transfer, so this won't scale as bad
as it may sound. As an example, see this
L2ARC 4 Kbyte I/O
screenshot - it's a little slower than the 512 byte I/O, but not 8 times
slower.
- FS record size. For the L2ARC test I set the filesystem record
size to 512 bytes before creating the files, so that the L2ARC devices would be
accessed in 512 byte blocks. To keep metadata/data ratios in check and to keep
up streaming performance, I don't usually set the record size to below 4
Kbytes (although a smaller record size may be used inadvertently
due to thousands of small files.) The previous
screenshot
was also with a 4 Kbyte FS record size, which (along with the I/O size)
contributed to the slightly higher latency.
- Threads. For these top speed tests, I only used one client process (and
one thread) to perform I/O. If multiple threads access the L2ARC and SLOG
devices simultaneously, then contention can increase the latency. The more
simultaneous threads, the worse it gets. This is the same for disk I/O, but
with the HSP we usually have more disk spindles than SSD devices, so the pool
of disks can in some cases handle more concurrent I/O than the SSDs can (depends
on the number of disks and pool profile.)
- SSD technology. The latencies shown above are for the current SSD
devices we are shipping with the Sun Storage 7000 series. SSD technology has
been improving quickly - so I'd expect these latencies to get better over time.
- Workload. The L2ARC is currently tuned for random reads, and the
SSD SLOG devices are used for synchronous writes. If you test
streaming reads or asynchronous writes, the HSP will deliver great
performance - but it may not use the SSD devices very much. In these
cases the latency will often be better - but that isn't showing SSD latency.
The HSP uses the best tool for the job - which may or may not be the L2ARC and
SLOG devices.
Summary
NFS latency ranges for the L2ARC:
Here DRAM latency doesn't quite register a pixel in height on the linear
plot. The logarithmic plot shows the role of the SSD based L2ARC very well,
in between DRAM and disk.
NFS latency ranges for the SLOG:
Conclusion
When the HSP uses read and write optimized flash memory SSDs the NFS I/O latency can
reach as low as 0.1 ms, compared to 2 to 10 ms without these SSDs (disk + DRAM only.)
These results aren't really surprising
considering the technology.
I've updated
my top
speeds with these latency results.
And as top speeds go, this is as good as it
gets - your latency will be higher with larger I/O sizes.
Today we have released an update to the
Sun Storage
7410,
which upgraded the CPUs from Barcelona to Istanbul:
| 7410 (Barcelona) | 7410 (Istanbul) |
| Max 4 sockets quad core AMD Opteron CPU |
Max 4 sockets of six core AMD Opteron CPU |
| Max 128 Gbytes DRAM | Max 256 Gbytes DRAM |
| HyperTransport 1.0 | HyperTransport 3.0 |
This is per head node, so a 2-way cluster can bring half a Terabyte
of DRAM for filesystem caching.
But what has me most excited is the upgrade of main system bus,
from AMD's HyperTransport 1 to HyperTransport 3. In this blog post I'll
explain why, and post numbers with the new 7410.
The following screenshots show the Maintenance->Hardware screen from the original and the new 7410:
New 7410 Results
The following results were collected from the two 7410s shown above,
and by running the same workload on both systems (I increase the
workload to get the most from the new 7410 system later in this post.):
| Workload | 7410 (Barcelona) | 7410 (Istanbul) | Improvement |
| NFSv3 streaming cached read | 2.15 Gbytes/sec | 2.68
Gbytes/sec | 25% |
| NFSv3 8k read ops/sec | 127,143 | 223,969 | 75% |
A very impressive improvement from what were already great results.
Both of these results are reading a cached working set over NFS, so the
disks are not involved. The CPUs and HyperTransport were upgraded, and these
cached workloads were chosen to push those components to their limits (not the
disks), to see the effect of the upgrade.
The following screenshots are the source of those results, and were taken
from Analytics on the 7410 - showing what the head node really did. These
tests were performed by Sun's Open Storage Systems group (OSSG). I was able to login
to Analytics
on their systems and take screenshots from the tests they
performed after the fact (since
Analytics
archives this data) and check
that these results were consistent with my own - which they are.
Original 7410 (Barcelona)
Streaming cached read:
Notice that we can now reach 2.15 Gbytes/sec for NFSv3 on the original 7410 (60 second average of network throughput, which includes protocol headers.)
When I first blogged about the 7410 after launch, I was reaching 1.90
Gbytes/sec; sometime later that became 2.06 Gbytes/sec. The difference is the
software updates - we are gradually improving our performance release after
release.
8k cached read ops:
As a sanity check, we can multiply the observed NFS read ops/sec by their
known size - 8 Kbytes: 127,143 x 8 Kbytes = 0.97 Gbytes/sec. Our observed
network throughput was 1.01 Gbytes/sec, which is consistent with 127K x 8 Kbyte
read ops/sec (higher as it includes protocol headers.)
New 7410 (Istanbul)
Streaming cached read:
2.68 Gbytes/sec - awesome!
8k cached read ops:
This is 75% faster than the original 7410 - this is no small hardware
upgrade! As a sanity test, this showed 223,969 x 8 Kbytes = 1.71 Gbytes/sec.
On the wire we observed 1.79 Gbytes/sec, which includes protocol headers. This is consistent with the expected throughput.
System
The systems tested above were the Barcelona-based and Istanbul-based 7410, both with max CPU and
DRAM, and both running the latest software (2009.Q3.) The same 41 clients were used to
test both 7410s.
The HyperTransport
The Sun Storage 7410 could support four ports of 10 GbE, with a theoretical
combined maximum throughput of 40 Gbit/sec, or 4.64 Gbytes/sec. However in practice it was
reaching about 2.06 Gbytes/sec when reading cached data over NFS. While
over 2 Gbytes/sec is fantastic (and very competitive), why not over 3 or 4
Gbytes/sec?
First of all, if you keep adding high speed I/O cards to a system, you may run out of system resources to drive them before you run out of slots to plug them into. Just because the system lets you plug them all in, doesn't mean that the CPUs, busses and software can drive it at full speed. So, given that, what specifically stopped the 7410 from going faster?
It wasn't CPU horsepower: we had four sockets of quad-core Opteron and
the very scalable Solaris kernel. The bottleneck was actually the
HyperTransport.
The HyperTransport is used as the CPU interconnect and the path to the
I/O controllers. Any data transferred with the I/O cards (10 GbE cards, SAS
HBAs, etc), will travel via the HTs. It's also used by the CPUs so they can
access each other's memory. In the diagram above, picture CPU0 accessing the memory which
is directly attached to CPU3 - which would require two hops over HT links.
HyperTransport 1
A clue that the HyperTransport (and memory busses) could be the bottleneck
was found with the Cycles Per Instruction (CPI):
walu# ./amd64cpi-kernel 5
Cycles Instructions CPI %CPU
167456476045 14291543652 11.72 95.29
166957373549 14283854452 11.69 95.02
168408416935 14344355454 11.74 95.63
168040533879 14320743811 11.73 95.55
167681992738 14247371142 11.77 95.26
[...]
amd64cpi-kernel
is a simple script I wrote (these scripts are not supported by Sun), to pull the CPI from the AMD CPU PICs (Performance Instrumentation Counters.) The higher the CPI, the more
cycles are waiting for memory loads/stores, which are stalling instructions. A CPI of
over 11 is the highest I've ever seen - a good indication that we are waiting
a significant time for memory I/O.
Also note in the amd64cpi-kernel output that I included %CPU - CPU
utilization. With a CPU utilization of over 95%, how many of you would be
reaching for extra or faster CPU cores to improve the system? This is a problem
for all %CPU measurements - yes, the CPU was processing instructions, but it
wasn't performing 'work' that you assume - instead those instructions are
stalled waiting for memory I/O. Add faster CPUs, and you stall faster (doesn't
help.) Add more cores or sockets, and you could make the situation worse -
spreading the workload over more CPUs can decrease the L1/L2 CPU cache hit
rates, putting even more pressure on memory I/O.
To investigate the high CPI, I wrote more scripts to figure out what the
memory buses and HT buses were doing. My
amd64htcpu script shows
the HyperTransport transmit Mbytes/sec, by both CPU and port (notice in the
diagram each CPU has 3 HT ports.):
walu# ./amd64htcpu 1
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 3170.82 595.28 2504.15 0.00
1 2738.99 2051.82 562.56 0.00
2 2218.48 0.00 2588.43 0.00
3 2193.74 1852.61 0.00 0.00
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 3165.69 607.65 2475.84 0.00
1 2753.18 2007.22 570.70 0.00
2 2216.62 0.00 2577.83 0.00
3 2208.27 1878.54 0.00 0.00
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 3175.89 572.34 2424.18 0.00
1 2756.40 1988.03 578.05 0.00
2 2191.69 0.00 2538.86 0.00
3 2186.87 1848.26 0.00 0.00
[...]
Each number shows:
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 CPU0-1 MCP55 CPU0-2 0.00
1 CPU1-0 CPU1-3 IO55 0.00
2 CPU2-3 CPU2-3 CPU2-0 0.00
3 CPU3-2 CPU3-1 CPU3-2 0.00
This shows traffic on socket 0, HT0 was over 3 Gbytes/sec (and it was getting higher that that, approaching 4 Gbytes/sec - but I didn't keep the command output to include here.) These
HyperTransport 1 links have a theoretical maximum of 4 Gbytes/sec, which we are
approaching. While they may not be at 100% utilization (for a 1 second
interval), we have multiple cores per socket trying to access a resource that
has reasonably high utilization - which will lead to stalling the instructions.
After identifying memory I/O on HyperTransport 1 as a potential bottleneck, we were able
to improve the situation a few ways:
- zero-copy: reducing memory I/O that the kernel needs to do, in this
case by not copying data if possible as it is passed through the kernel stack -
instead managing references to it.
- card-reshuffle: we shuffled the I/O cards so that less traffic needed to go
via the CPU0 to CPU1 HT interconnect. Originally we had all the network cards
on one I/O controller, and all the SAS HBA cards on the other - so any disk I/O
served to the network would travel via that hot HT link.
- thread-binding: certain threads (such as for nxge, the 10 GbE driver) were bound
to specific CPUs, to increase the CPU L1/L2 cache hit rate, which would decrease the remote memory
traffic, which would relieve HT pressure.
With these changes, our performance improved and the CPI was down to about
10. To go further, we needed HyperTransport 3.
HyperTransport 3
HT3 promised to triple the bandwidth, however when I first got a prototype
HT3 system I was dissapointed to discover that the max NFSv3 throughput was the
same. It turned out that I had been sent
upgraded CPUs, but on a HT1 system. If anything, this further confirmed what I
had suspected - faster CPUs didn't help throughput, we needed to upgrade the
HT.
When I did get a HT3 system, the performance was considerably better -
between 25% and 75%. HT links:
topknot# ./amd64htcpu 1
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 5387.91 950.88 4593.76 0.00
1 6432.35 5705.65 1189.07 0.00
2 5612.83 3796.13 6312.00 0.00
3 4821.45 4703.95 3124.07 0.00
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 5408.83 973.48 4611.83 0.00
1 6443.95 5670.06 1166.64 0.00
2 5625.43 3804.35 6312.05 0.00
3 4737.19 4602.82 3060.74 0.00
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 5318.46 971.83 4544.80 0.00
1 6301.69 5558.50 1097.00 0.00
2 5433.77 3697.98 6110.82 0.00
3 4581.89 4464.40 2977.55 0.00
[...]
HT3 is sending more than 6 Gbytes/sec over some links. The CPI was down to
6 (and lower), from 10. The difference to performance numbers was huge:
Instead of 0.5 Gbytes/sec write to disk, we were now approaching 1.
Instead of 1 Gbyte/sec NFS read from disk, we were now approaching 2.
And instead of 2 Gbytes/sec NFS read from DRAM, we were now approaching 3.
Along with the CPU upgrade (which helps IOPS), and DRAM upgrade (helps
caching working sets), the 7410 hardware update was looking to be an incredible upgrade
to what was already a powerful system.
CPU PIC Analysis
If I've wet your appetite for more CPU PIC analysis, on Solaris run "cpustat
-h" and fetch the document it refers to, which will contain a reference for the
CPU PICs for the platform you are on. The scripts I used above are really not
that complicated - they use shell and perl to wrap the output (as the man page
for cpustat even suggests!) Eg, the amd64cpi-kernel tool was:
#!/usr/bin/sh
#
# amd64cpi-kernel - measure kernel CPI and Utilization on AMD64 processors.
#
# USAGE: amd64cpi-kernel [interval]
# eg,
# amd64cpi-kernel 0.1 # for 0.1 second intervals
#
# CPI is cycles per instruction, a metric that increases due to activity
# such as main memory bus lookups.
interval=${1:-1} # default interval, 1 second
set -- `kstat -p unix:0:system_misc:ncpus` # assuming no psets,
cpus=$2 # number of CPUs
pics='BU_cpu_clk_unhalted,sys0' # cycles
pics=$pics,'FR_retired_x86_instr_w_excp_intr,sys1' # instructions
/usr/sbin/cpustat -tc $pics $interval | perl -e '
printf "%16s %16s %8s %8s\n", "Cycles", "Instructions", "CPI", "%CPU";
while (<>) {
next if ++$lines == 1;
split;
$total += $_[3];
$cycles += $_[4];
$instructions += $_[5];
if ((($lines - 1) % '$cpus') == 0) {
printf "%16u %16u %8.2f %8.2f\n", $cycles,
$instructions, $cycles / $instructions, $total ?
100 * $cycles / $total : 0;
$total = 0;
$cycles = 0;
$instructions = 0;
}
}
'
A gotcha for this one is the "sys" modifier on the pics definitions; they make these PICs record activity during both user-code and kernel-code, not just user-code.
My 7410 Results
I've previously posted
many
numbers
covering 7410 performance, although I had yet to collect the full set. I
was missing iSCSI, FTP, HTTP and many others. This hardware upgrade changes
everything - all my previous numbers are now out of date. The numbers
for the new 7410 are so far between 25% and 75% better than what I had posted
previously!
Performance testing is like painting the Golden
Gate Bridge: once you reach the end you must immediately begin at the start
again. In our case, there are so many software and hardware upgrades that
once you approach completing perf testing, the earlier numbers are out of date.
The OSSG group (who gathered the numbers at the start of this post)
are starting to help out so that we can test and share numbers more
quickly.
I've created a new column of numbers on my
summary
post, and I'll fill out the new numbers as I get them.
Update 29-Sep-2009:
I've had a chance to gather max performance results for the Istanbul 7410 with the
2009.Q3 software. The results are excellent, and match what I saw
in development (but I wanted to replicate these on the shipping software
before I shared them.)
The numbers I've shown so far (from the OSSG team) were from running the same
workload on both the 7410s. That's interesting for observing deltas for a
given workload, however I'm also interested in outright maximums possible (so
that I can analyze the bottlenecks responsible.) This may mean increasing a
workload to better stress a more powerful server, which is what I've had to do
here. My results follow, and a description of the setup used.
Max NFSv3 streaming cached read
Top speed test - this shows the fastest we can serve data over the NFSv3
protocol. A 200 Gbyte working set was used (200 x 1 Gbyte files), which
caches entirerly in the 7410's 256 Gbytes of DRAM, and the clients run multiple
threads (usually 10) to repeatedly read over the files with a 1 Mbyte I/O size.
Here's my first attempt:
2.87 Gbytes/sec over the wire! I'd like to break 3.0, but I believe I've
hit our next bottleneck: PCIe gen1. The network traffic is served by two
dual-port 10 GbE cards. Each card can drive either of its 10 GbE ports to full
speed (1.16 Gbytes/sec), but since it is a x8 PCIe gen1 card, it can't drive
both ports at full speed at the same time. That would require 2.32 Gbytes/sec,
and x8 PCIe gen1 has a theoretical max of 2.0 Gbytes/sec - but in pratcise
actually moves between 1.4 to 1.5 Gbytes/sec (depending on PCIe payload
size.) So, while that's the bottleneck for this particular test, this is still
a truly great result - 2.87 Gbytes/sec on the wire.
Apart from the four 10 GbE ports in this 7410, there are also four 1 GbE
ports on the system board. I only need 0.13 Gbytes/sec more to break
3.0...
3.06 Gbytes/sec! This includes protocol headers as it's measured on
the wire, but as I'm using jumbo frames it's mostly payload. Click the image
to see the %CPU utilization - plenty of headroom available. I only hooked up
two of the onboard 1 GbE ports, so I may be able to take this further if I
plugged in the remaining two. For now, roughly 3.06 Gbytes/sec is a very
competitive max throughput for NFSv3.
Max NFSv3 streaming disk read
Apart from reading from DRAM cache (which practical on this 7410 if your
working set is less than 200 Gbytes), I'll also test max streaming read from
disk - uncached:
We are pulling 2.03 Gbytes/sec from disk, which is mostly payload (plus ZFS
metadata); once it goes over the wire it is 2.07 Gbytes/sec (with NFS headers.)
Over 2 Gbytes/sec is a great result - almost double what the Barcelona 7410
could do.
This was tested using two streaming read threads per client, on 20 clients, with
a 1 Mbyte I/O size.
Max NFSv3 streaming disk write
Now for a streaming write test to the 7410, running two write threads across 10 of the clients:
932 Mbytes/sec over the network, and over double when it reaches disk due to
mirroring ((932 + metadata) x 2). I've zoomed out to take a 15 minute average,
as writes can be variable (for reasons on both the server and clients) and I
didn't want to include an unfairly high or low 1 minute average.
This is a great result, and getting close to 1
Gbyte/sec write. I think I could go faster with some better clients, and
possibly break 1 Gbyte/sec.
Max NFSv3 read ops/sec
Now for an IOPS test: to find the upper bound of IOPS, I'll test cached
1 byte reads over NFSv3. The clients run hundreads of threads to
request as many IOPS as the 7410 can serve:
Half a million from a single head node. Not a cluster (some vendors
like to multiply their results by the max heads in a cluster.) Yes, this is 1
byte reads, but the performance is still solid when they increase in size.
Here's my swing at cached 4 Kbyte NFS I/O:
Over 400K! This result may be more impressive (and practical) than reaching half a million 1 byte IOPS.
Configuration
As the filer I was using a single Sun Storage
7410,
with the following config:
- 256 Gbytes DRAM
- 8 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
- 4 sockets of six-core AMD Opteron 2600 MHz CPUs (Istanbul)
- 2 x 2x10 GbE cards (4 x 10 GbE ports total), jumbo frames
- 3 x HBA cards
- noatime on shares, and database size left at 128 Kbytes
It's not a max config system - the 7410 can currently scale to 12 JBODs
and have flash based SSD as read cache and intent log - which
I'm not using for these tests.
The clients are 20 blades, each:
- 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
- 3 Gbytes of DRAM
- 2 x 1 GbE network ports
- mount options:
- read tests: mounted forcedirectio (to skip client caching), and rsize to match the workload
- write tests: default mount options
These clients are great, apart from the CPU clock speed - which at 1600 MHz is
a little low.
The network consists of multiple 10 GbE switches to connect the
client 1 GbE
ports to the filer 10 GbE ports.
Conclusion
For this 7410 upgrade, the extra CPU cores help - but it's more about the
upgrade to the HyperTransport. HT3 provides 3x the CPU interconnect
bandwidth, and dramatically improves the delivered performance of the 7410:
from 25% to 75%. The 7410 was already a powerful server, it's now raised the
bar even higher.
I previously posted
screenshots of the L2ARC:
the ZFS second level cache which
uses read optimized SSDs ("Readzillas") to cache random read workloads.
That's the read side of the
Hybrid Storage Pool. On
the write side, the ZFS
separate intent log
(SLOG) can use write optimized SSDs (which we call "Logzillas") to accelerate performance
of synchronous write workloads.
In the screenshots that follow I'll show how Logzillas have delivered
12x more IOPS and over 20x reduced latency for a synchronous write
workload over NFS. These screenshots are
from Analytics
on the Sun Storage
7000 series.
In particular, the following heat
map shows the dramatic reduction in NFS latency when turning on the Logzillas:
Before the 17:54 line, latency was at the mercy of spinning 7200 RPM disks, and
reached as high as 9 ms.
After 17:54, these NFS operations were served from Logzillas, which consistently
delivered latency much lower than 1 ms. Click for a larger screenshot.
What is a Synchronous Write?
While the use of Logzilla devices can dramatically improve the performance of
synchronous write workloads, what is a synchronous write?
When an application writes data, if it waits for that data to complete writing
to stable storage (such as a hard disk), it is a synchronous write. If the
application requests the write but continues on without waiting (the data may is buffered
in the filesystem's DRAM cache, but not yet written to disk), it is an
asynchronous write. Asynchronous writes are faster for the application, and
is the default behavior.
There is a down side to asynchronous writes - the application doesn't know if the
write completed successfully. If there is a power outage before the write could be
flushed to disk, the write will be lost. For some applications such as database log
writers, this risk is unacceptable - and so they perform synchronous writes
instead.
There are two forms of synchronous writes: individual I/O which is written
synchronously, and groups of previous writes which are synchronously committed.
Individual synchronous writes
Write I/O will become synchronous when:
- A file is opened using a flag such as O_SYNC or O_DSYNC
- NFS clients use the mount option (if available):
- "sync": to force all write I/O to be synchronous
- "forcedirectio": which avoids caching, and the client may decide to make write
I/O synchronous as part of avoiding its cache
Synchronously committing previous writes
Rather than synchronously writing each individual I/O, an application may
synchronously commit previous writes at logical checkpoints in the code. This
can improve performance by grouping the synchronous writes. On ZFS these may
also be handled by the SLOG and Logzilla devices. These are performed by:
- An application calling fsync()
- An NFS client calling commit. It can do this because:
- It closed a file handle that it wrote to (Solaris clients can avoid this using the
"nocto" mount option)
- Directory operations (file creation and deletion)
- Too many uncommitted buffers on a file (eg, FreeBSD/MacOS X)
Examples of Synchronous Writes
Applications that are expected to perform synchronous writes include:
- Database log writers: eg, Oracle's LGWR.
- Expanding archives over NFS: eg, tar creating thousands of small files.
- iSCSI writes on the Sun Storage 7000 series (unless lun write cache is enabled.)
The second example is easy to demonstrate. Here I've taken
firefox-1.5.0.6-source.tar, a tar file containing 43,000 small files, and
unpacked it to an NFS share (tar xf.) This will create thousands of small
files, which becomes a synchronous write workload as these files are created.
Writing of the file contents will be asynchronous, it is just the act of creating
the file entries in their parent directories which is synchronous.
I've unpacked this tar file twice, the first time
without Logzilla devices and the second time with. The difference is clear:
Without logzillas, it took around 20 minutes to unpack the tar file. With logzillas,
it took around 2 minutes - 10x faster. This improvement is also visible as the higher
number of NFS ops/sec, and the lower NFS latency in the heat map.
For this example I used a Sun Storage
7410
with 46 disks and 2 Logzillas, stored in 2 JBODs. The disks and Logzillas can
serve many more IOPS than I've demonstrated here, as I'm only running a single
threaded application from a single modest client as a workload.
Identifying Synchronous Writes
How do you know if your particular write workload is
synchronous or asynchronous? There are many ways to check, here
I'll describe a scenario where a client side application is performing
writes to an NFS file server.
Client side
To determine if your application is performing synchronous writes, one way is to
debug the open system calls. The following example uses truss on Solaris
(try strace on Linux) on two different programs:
client# truss -ftopen ./write
4830: open64("outfile", O_WRONLY|O_CREAT, 0644) = 3
client# truss -ftopen ./odsync-write
4832: open64("outfile", O_WRONLY|O_DSYNC|O_CREAT, 0644) = 3
The second program (odsync-write) opened its file with the O_DSYNC
flag, so subsequent writes will be synchronous. In these examples, the program
was executed by the debugger, truss. If the application is already running,
on Solaris you can use pfiles to examine the file flags for processes
(on Linux, try lsof.)
Apart from tracing the open() syscall, also look for frequent fsync() or
fdsync() calls, which synchronously commit the previously written data.
Server side
If you'd like to determine if you have a synchronous write workload from the
NFS server itself, you can't run debuggers like truss on the target
process (since it's running on another host), so instead we'll need to examine
the NFS protocol. You can do this either with packet sniffers (snoop,
tcpdump), or with
DTrace if it and its
NFS provider are available:
topknot# dtrace -n 'nfsv3:::op-write-start { @[args[2]->stable] = count(); }'
dtrace: description 'nfsv3:::op-write-start ' matched 1 probe
^C
2 1398
Here I've frequency counted the "stable" member of the NFSv3 write protocol,
which was "2" some 1,398 times. 2 is a synchronous write workload (FILE_SYNC),
and 0 is for asynchronous:
enum stable_how {
UNSTABLE = 0,
DATA_SYNC = 1,
FILE_SYNC = 2
};
If you don't have DTrace available, you'll need to dig this out of the NFS protocol
headers. For example, using snoop:
# snoop -d nxge4 -v | grep NFS
[...]
NFS: ----- Sun NFS -----
NFS:
NFS: Proc = 7 (Write to file)
NFS: File handle = [A2FC]
NFS: 7A6344B308A689BA0A00060000000000073000000A000300000000001F000000
NFS: Offset = 96010240
NFS: Size = 8192
NFS: Stable = FSYNC
NFS:
That's an example of a synchronous write.
The NFS client may be performing a synchronous-write-like workload by frequently
calling the NFS commit operation. To identify this, use whichever tool is available
to show NFS operations broken down by type (Analytics on the Sun Storage 7000;
nfsstat -s on Solaris)
SLOG and Synchronous Writes
The following diagram shows the difference when adding a seperate intent log
(SLOG) to ZFS:
Major components:
- ZPL: ZFS POSIX Layer. Primary interface to ZFS as a filesystem.
- ZIL: ZFS Intent Log. Synchronous write data for replay in the event of a crash.
- DMU: Data Management Unit. Transactional object management.
- ARC: Adaptive Replacement Cache. Main memory filesystem cache.
- ZIO: ZFS I/O Pipeline. Processing of disk I/O.
For the most detailed information of these and the other components of ZFS,
you can browse their source code at the
ZFS Source Tour.
When data is asynchronously written to ZFS, it is buffered in memory and
gathered periodically by the DMU as transaction groups, and then
written to disk. Transaction groups either succeed or fail as a whole, and are part
of the ZFS design to always deliver on disk data consistency.
This periodical writing of transaction groups can improve asynchronous writes
by aggregating data and streaming it to disk. However the interval of these writes is in
the order of seconds, which makes it unsuitable to serve synchronous writes directly -
as the application would stall until the next transaction group was synced.
Enter the ZIL.
ZFS Intent Log
The ZIL handles synchronous writes by
immediately writing their data and information to stable storage, an "intent log",
so that ZFS can claim that the write completed. The written data hasn't reached
its final destination on the ZFS filesystem yet, that will happen sometime later
when the transaction group is written. In the meantime, if there is a system failure,
the data will not be lost as ZFS can replay that intent log and write the data to
its final destination. This is a similar principle as Oracle redo logs, for example.
Separate Intent Log
In the above diagram on the left, the ZIL stores the log data on the same disks as
the ZFS pool. While this works, the performance of synchronous writes can suffer
as they compete for disk access along with the regular pool requests (reads, and
transaction group writes.) Having two different workloads compete for the same
disk can negatively effect both workloads, as the heads seek between the hot data
for one and the other. The solution is to give the ZIL its own dedicated
"log devices" to write to, as pictured on the right. These dedicated log devices
form the separate intent log: the SLOG.
Logzilla
By writing to dedicated log devices, we can improve
performance further by choosing a device which is best suited for fast writes.
Enter Logzilla. Logzilla was the name we gave to write-optimized
flash-memory-based solid state disks (SSDs.) Flash memory is known for slow
write performance, so to improve this the Logzilla device buffers the write in DRAM
and uses a super-capacitor to power the device long enough to write the DRAM buffer
to flash, should it lose power. These devices can write an I/O as fast as 0.1 ms
(depends on I/O size), and do so consistently. By using them as our SLOG, we can
serve synchronous write workloads consistently fast.
Speed + Data Integrity
It's worth mentioning that when ZFS synchronously writes to disk, it uses
new ioctl()s
to ensure that the data is flushed properly to the disk platter, and isn't just
buffered on the disk's write cache (which is a small amount of DRAM inside the disk
itself.) Which is exactly what we want to happen - that's
why these writes are synchronous. Other filesystems didn't bother to do this (eg, UFS),
and believed that data had reached stable storage when it may have just been cached. If
there was a power outage, and the disk's write cache is not battery backed, then those
writes would be lost - which means data corruption for the filesystem. Since
ZFS waits for the disk to properly write out the data, synchronous writes on ZFS
are slower than other filesystems - but they are also correct. There is
a way to turn off this behaviour in ZFS (zfs_nocacheflush), and ZFS will perform as
fast or faster than other filesystems for synchronous writes - but you've also
sacrificed data integrity, so this is highly unrecommended. By using fast SLOG
devices on ZFS, we get both speed and data integrity.
I won't go into too much more detail of the inner workings of the SLOG, which is best
described by the author
Neil Perrin.
Screenshots
To create a screenshot to best show the effect of Logzillas, I applied a synchronous
write workload over NFS from a single threaded process on a single client.
The target was the same 7410 used for
the tar test, which has 46 disks and 2 Logzillas. At first I let the
workload run with the Logzillas disabled, then enbled them:
This is the effect of adding 2 Logzillas to a pool of 46 disks, on both
latency and IOPS. My last post discussed the
odd latency pattern
that these 7200 RPM disks causes for synchronous writes, which looks a bit like an icy
lake.
The latency heat map shows the improvement very well, but it's also shown
something unintended: These heat maps use a false color palette which draws
the faintest details much darker than they (linearly) should be, so that they are
visible. This has made visible a minor and unrelated performance issue: those
faint vertical trails on the
right side of the plot. These were every 30 seconds, and was when ZFS flushed
a transaction group to disk - which stalled a fraction of NFS I/O while this
happened. The fraction is so small it's almost lost in the rounding, but appears
to be about 0.17% when
examining
the left panel numbers. While minor, we are working to fix it. This perf issue
is known internally as the "picket fence". Logzilla is still
delivering a fantastic improvement over 99% of the time.
To quantify the Logzilla improvement, I'll now zoom to the before and after periods
to see the range averages from Analytics:
Before
With just the 46 disks:
Reading the values from the left panels:
with just the 46 disks, we've averaged 143 NFS synchronous writes/sec. Latency has reached
as high as 8.93ms - most of the 8 ms would be worst case rotational latency for a
7200 RPM disk.
After
46 disks + 2 Logzillas:
Nice - a tight latency cloud from 229 to 305 microseconds, showing very fast
and consistent responses from the Logzillas.
We've now averaged 1,708 NFS synchronous writes/sec - 12x faster than without the
Logzillas. Beyond 381 us, latency has rounded down to zero ops/sec, and was mostly in the
267 to 304 microsecond range. Previously latency stretched from a few to over 8 ms,
making the delivered NFS latency improvement reach over 20x.
This 7410 and these Logzillas can handle much more than 1,708 synchronous writes/sec,
if I apply more clients and threads (I'll show a higher load demo in a moment.)
Disk I/O
Another way to see the effect of Logzillas is through disk I/O. The following
shows before and after in seperate graphs:
I've used the Hierarchial breakdown feature to highlight the JBODs in green.
In the top plot (before), the I/O to each JBOD was equal - about 160 IOPS.
After enabling the Logzillas, the bottom plot shows one of the JBODs is now
serving over 1800 IOPS - which is the JBOD with the Logzillas in.
There are spikes of IOPS every 30 seconds in both these graphs. That is when
ZFS flushes a transaction group to disk - commiting the writes to their final
location. Zooming into the before graph:
Instead of highlighting JBODs, I've now highlighted individual disks (yes,
the pie chart is pretty now.) The non-highlighted slice is for the system disks,
which are recording the numerous Analytics statistics I've enabled.
We can see that all the disks have data flushed to them every 30 seconds, but
they are also being written to constantly. This constant writing is the synchronous
writes being written to the ZFS intent log, which is being served from the pool of
46 disks (since it isn't a separate intent log, yet.)
After enabling the Logzillas:
Now those constant writes are being served from 2 disks:
HDD 20 and HDD 16, which are both in the /2029...012 JBOD. Those are the Logzillas,
and are our separate intent log (SLOG):
I've highlighted one in this screenshot, HDD 20.
Mirrored Logzillas
The 2 Logzillas I've been demonstrating have been acting as individual log
devices. ZFS lists them like this:
topknot# zpool status
pool: pool-0
state: ONLINE
[...]
NAME STATE READ WRITE CKSUM
pool-0 ONLINE 0 0 0
[...many lines snipped...]
mirror ONLINE 0 0 0
c2t5000CCA216CCB913d0 ONLINE 0 0 0
c2t5000CCA216CE289Cd0 ONLINE 0 0 0
logs
c2tATASTECZEUSIOPS018GBYTES00000AD6d0 ONLINE 0 0 0
c2tATASTECZEUSIOPS018GBYTES00000CD8d0 ONLINE 0 0 0
spares
c2t5000CCA216CCB958d0 AVAIL
c2t5000CCA216CCB969d0 AVAIL
Log devices can be mirrored instead (see zpool(1M)), and zpool will list them
as part of a "mirror" under "logs". On the Sun Storage 7000 series, it's
configurable from the BUI (and CLI) when creating the pool:
With a single Logzilla, if there was a system failure when a client was
performing synchronous writes, and if that data was written to the Logzilla
but hadn't been flushed to disk, and if that Logzilla also fails, then data
will be lost. So it is system failure + drive failure + unlucky timing. ZFS
should still be intact.
It sounds a little paranoid to use mirroring, however from experience system
failures and drive failures sometimes go hand in hand, so it is understandable
that this is often chosen for high availability environments. The default for the
Sun Storage 7000 config when you have multiple Logzillas is to mirror them.
Expectations
While the Logzillas can greatly improve performance, it's important to understand
which workloads this is for and how well they work, to help set realistic expectations.
Here's a summary:
- Logzillas do not help every write workload; they are for synchronous
write workloads, as described earlier.
- Our current Logzilla devices for the Sun Storage 7000 family deliver as high
as 100 Mbytes/sec each (less with small I/O sizes), and as high as 10,000 IOPS
(less with big I/O sizes).
- A heavy multi threaded workload on a single Logzilla device can queue I/O,
increasing latency. Use multiple Logzilla devices to serve I/O concurrently.
Stepping up the workload
So far I've only used a single threaded process to apply the workload.
To push the 7410 and Logzillas to their limit, I'll multiply this workload by 200.
To do this I'll use 10 clients, and on each client I'll add another workload thread every
minute until it reaches 20 per client. Each workload thread performs 512 byte synchronous
write ops continually. I'll also upgrade the 7410: I'll now use 6 JBODs
containing 8 Logzillas and 132 data disks, to give the 7410 a fighting chance against
this bombardment.
8 Logzillas
We've reached 114,000 synchronous write ops/sec over NFS, and the delivered NFS latency
is still mostly less than 1 ms - awesome stuff! We can study the numbers in the left
panel of the latency heat map to quantify this: there were 132,453 ops/sec total,
and (adding up the numbers) 112,121 of them completed faster than 1.21 ms - which is 85%.
The bottom plot suggests we have driven this 7410 close to its CPU limit (for what is
measured as CPU utilization, which includes bus wait cycles.) The Logzillas
are only around 30% utilized. The CPUs are 4 sockets of 2.3 GHz quad core Opteron - and for
them to be the limit means a lot of other system components are performing
extremely well. There are now faster CPUs available, which we need to make available
in the 7410 to push that ops/sec limit much higher than 114k.
No Logzillas
Ouch! We are barely reaching 7,000 sync write ops/sec, and our latency is pretty
ordinary - often 15 ms and higher. These 7,200 RPM disks are not great to serve
synchronous writes from alone, which is why
Adam created the Hybrid Storage Pool model.
When comparing the above two screenshots, note the change in the vertical scale.
With Logzillas, the NFS latency heat map is plotted to 10 ms; without, it's plotted
to 100 ms to span the higher latency from the disks.
Config
The full description of the server and clients used in the above demo is:
Server
- Sun Storage 7410
- Latest software version: 2009.04.10.2.1,1-1.15
- 4 x quad core AMD Opteron 2.3 GHz CPUs
- 128 Gbytes DRAM
- 2 x SAS HBAs
- 1 x 2x10 GbE network card (1 port connected)
- 6 JBODs (136 disks total, each 1 Tbyte 7,200 RPM)
- Storage configured with mirroring
- 8 Logzillas (installed in the JBODs)
Clients
- 10 x blade servers, each:
- 2 x quad core Intel Xeon 1.6 GHz CPUs
- 3 Gbytes DRAM
- 2 x 1 GbE network ports (1 connected)
- mounted using NFSv3, wsize=512
The clients are a little old. The server is new, and is a medium sized 7410.
What about 15K RPM disks?
I've been comparing the performance of write optimized SSDs vs cheap 7,200 RPM SATA
disks as the primary storage. What if I used high quality 15K RPM disks instead?
Rotational latency will be half. What about seek latency? Perhaps if the disks
are of higher quality the heads move faster, but they also have to move further to
span the same data (15K RPM disks are usually lower density than 7,200 RPM disks), so that
could cancel out. They could have a larger write cache, which would help. Lets be
generous and say that 15K RPM disks are 3x faster than 7,200 RPM disks.
So instead of reaching 7,000 sync write ops/sec, perhaps we could reach 21,000
if we used 15K RPM disks. That's a long way short of 114,000 using Logzilla SSDs.
Comparing the latency improvement also shows this is no contest. In fact, to
comfortably match the speed of current SSDs, disks would need to spin at 100K RPM
and faster. Which may never happen. SSDs, on the other hand, are getting
faster and faster.
What about NVRAM cards?
For the SLOG devices, I've been using Logzillas - write optimized SSDs.
It is possible to use NVRAM cards as SLOG devices, which should
deliver much faster writes than Logzilla can. While these may be an
interesting option, there are some down sides to consider:
- Slot budget: A Logzilla device consumes one disk bay, whereas an NVRAM card
would consume one PCI-E slot. The Sun Storage 7410 has 6 PCI-E slots, and a current
maximum of 288 disk bays. Consuming disk bays for Logzillas may be a far
easier trade-off than spending one or more PCI-E slots.
- Scaling: With limited slots compared to disk bays, the scaling of PCI-E cards
will be very coarse. With Logzillas in disk bays it is easier to fine tune their
number to match the workload, and to keep scaling to a dozen devices and more.
- Complexity: One advantage of the Logzillas is that they exist in the pool
of disks, and in clustered environments can be accessed by the other head node
during failover. NVRAM cards are in the head node, and would
require a high speed cluster interconnect to keep them both in sync should one
head node fail.
- Capacity: NVRAM cards have limited capacity when compared to Logzillas to begin
with, and in clustered environments the available capacity from NVRAM cards may
be halved (each head node's NVRAM cards must support both pools.) Logzillas are
currently 18 Gbytes, whereas NVRAM cards are currently around 2 Gbytes.
- Thermal: Battery backed NVRAM cards may have large batteries attached to the card.
Their temperature envelope needs to be considered when placed in head nodes, which
with CPUs and 10 GbE cards can get awfully hot (batteries don't like getting
awfully hot.)
- Cost: These are expected to be more expensive per Gbyte than Logzillas.
Switching to NVRAM cards may also make little relative difference for NFS clients
over the network, when other latencies are factored. It may be an interesting
option, but it's not necessarily better than Logzillas.
Conclusion
For synchronous write workloads, the ZFS intent log can use dedicated
SSD based log devices
(Logzillas) to greatly improve performance. The screenshots above from Analytics
on the Sun Storage 7000 series showed 12x more IOPS and reaching over 20x reduced
latency.
Thanks to
Neil Perrin for creating the ZFS SLOG,
Adam Leventhal for designing it into the
Hybrid Storage Pool, and Sun's Michael Cornwell and STEC for making the Logzillas
a reality.
Logzillas as SLOG devices work great, making the demos for this blog entry easy
to create. It's not every day you find a performance
technology that can deliver 10x and higher.
Here is another example of
latency art
(interesting I/O latency heat maps.) This is a screenshot from
Analytics
on the Sun Storage
7000 series.
The configuration was 2 JBODs (48 disks) configured with mirroring, and a single
NFS client performing a synchronous write workload (opened with O_DSYNC), from one
thread writing 8 Kbyte I/O sequentially.
This is a crippled config - if you really
cared about synchronous writes (some applications perform them) you can use SSD based
separate intent logs ("Logzillas") which are part of the
Hybrid Storage
Pool strategy. Since I didn't, these synchronous writes will be served at disk
speed, which for these 7200 RPM disks will be between 0 and 10 ms. Here's the latency
(click for a larger version):
Woah! That's Insane.
This is not what I was expecting. Remember that this workload is very simple -
just one thread performing sequential synchronous writes. The magnitude of the latency
is as expected for these disks (and no Logzilla), but the strange diagonal lines?
Latency is creeping up or down second after second, sometimes at the same time (the 'X's.)
It looks a bit like an icy lake, and this particular example has a distinct
'X' in it.
I'd guess this is related to how ZFS is placing the data sequentially on disks as they
rotate - perhaps each diagonal represents one track. With time this can be answered
using DTrace;
the first step would be to
see if individual disks contribute single lines to the heat map...
The point of this post is to show
another interesting latency heat map, one that we don't yet fully understand (leave a comment
if you have another theory about why this happens.)
Before Analytics, I analyzed disk latency using averages or coarse distributions,
which don't have the resolution to reveal patterns like this. Indeed, for
other filesystems (UFS, ...) there may be similar or even more bizarre patterns.
In my next post I'll demo Logzillas - the solution to this latency.
As part of my role in Fishworks, I push systems to their limits to investigate and solve bottlenecks. Limits can be useful to consider as a possible upper bound of performance - as it shows what the target can do. I previously posted my
results for the Sun Storage 7410, which is our current top performing product. Today the Sun Storage 7310 has been
launched, which is an entry level offering that can be clustered for high availability environments like the 7410.
The following summarises the performance limits I found for a single 7310 head node, along with the current 7410 results for comparison:
As shown above, the 7310 has very reasonable performance in comparison to the high end 7410. (As of this blog post, I've yet to update my 7410 results
for the 2009.Q2 software update, which gave about a 5% performance boost.)
Next I'll show the source of these results - screenshots taken from the 7310 using
Analytics, to measure how the target NFS server actually performed. I'll finish by describing the 7310 and clients used to perform these tests.
NFSv3 streaming read from DRAM
To find the fastest read throughput possible, I used 10 Gbytes of files (working set) and had 20 clients run two threads per client to read 1 Mbyte I/O through them repeatedly:
Over this 10 minute interval, we've averaged 1.08 Gbytes/sec. This includes both inbound NFS requests and network protocol headers, so the actual data transferred will be a little less. Still, breaking 1 Gbyte/sec of delivered NFS for an entry level server is a great result.
NFSv3 streaming read from disk
This time 400 Gbytes of files were used as the working set, to minimize caching in the 7310's 16 Gbytes of DRAM. As before, 20 clients and 2 threads per client read through the working set repeatedly:
This screenshot from Analytics includes the disk bytes/sec, to confirm that this workload really did read from disk. It averaged 780 Mbytes/sec - a solid result. The 792 Mbytes/sec average on the network interfaces includes the NFS requests and network protocol headers.
NFSv3 streaming write to disk
To test streaming writes, 20 clients ran 5 threads per client performing writes with a 32 Kbyte I/O size:
While there were 477 Mbytes/sec on the network interfaces (which includes network protocol headers and ACKs), at the disk level the 7310 has averaged 1.09 Gbytes/sec. This is due to software mirroring, which has doubled the data sent to the storage devices (plus ZFS metadata). The actual data bytes written will be a little less than 477 Mbytes/sec.
NFSv3 max read IOPS
I tested this with the 7410, more as an experiment than of practical value. This is the most 1 byte reads that NFS can deliver to the clients, with the 10 Gbyte working set entirerly cached in DRAM. While 1 byte I/O isn't expected, that doesn't render this test useless - it does give the absolute upper bound for IOPS:
The 7310 reached 182,282 reads/sec, and averaged around 180K.
NFSv3 4 Kbyte read IOPS from DRAM
For a more realistic test of read IOPS, the following shows an I/O size of 4 Kbytes, and a working set of 10 Gbytes cached in the 7310's DRAM. Each client runs an additional thread every minute (stepping up the workload), from one to ten threads:
This screenshot shows %CPU utilization, to check that this ramping up of workload is pushing some limit on the 7310 (in this case, what is measured as CPU utilization.) With 10 threads running per client, the 7310 has served a whopping 110,724 x 4 Kbyte read ops/sec, and still has a little CPU headroom for more. Network bytes/sec was also included in this screenshot to double check the result, which should be at least be 432 Mbytes/sec (110,724 x 4 Kbytes), which it is.
Configuration
As a filer, I used a single
Sun Storage 7310 with the following config:
- 16 Gbytes DRAM
- 2 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
- 1 socket of quad-core AMD Opteron 2300 MHz CPU
- 1 x 2x10 GbE card, jumbo frames
- 1 x HBA card
- noatime on shares, and database size left at 128 Kbytes
It's not a max config system - the 7310 can currently scale to 4 JBODs and have 2 sockets of quad-core Opteron.
The clients were 20 blades, each:
- 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
- 3 Gbytes of DRAM
- 1 x 1 GbE network ports
- Running Solaris, and NFS mounted forcedirectio for read testing to avoid client caching
These are great, apart from the CPU clock speed - which at 1600 MHz is a little low.
The network consists of multiple 10 GbE switches to connect the client 1 GbE ports to the filer 10 GbE ports.
Conclusion
We've spent years working on maximizing performance of our highest end offering, the Sun Storage 7410. New family members of the Sun Storage 7000 series inherit most of this work, and really hit the ground running. When
Adam Leventhal designed the 7310 product, we weren't thinking of a system capable of 1 Gbyte/sec - but as the pieces came together we realized the performance could in fact be very good, and was. And it's worth emphasizing - the results above are all from a single socket 7310; the 7310 can be configured with two sockets of quad core Opteron!
For performance testing storage products, in particular the Sun Storage
7000 series,
I previously posted the top
10 suggestions
for performance testing, and the
little stuff
to check. Here I'll post load generation scripts I've been using to measure the
limits of
the Sun Storage
7410 (people have been asking for these.)
Load Generation Scripts
Many storage benchmark tools exist, including
FileBench which
can apply sophisticated workloads and report various statistics. However the performance
limit testing I've blogged about has not involved sophisticated workloads at all - I've
tested sequential and random I/O to find the limits of throughput and IOPS. This is
like finding out a car's top speed and its 0-60 mph time -
simple and useful metrics, but aren't sophisticated such as finding out its best
lap time at
Laguna Seca Raceway.
While a lap time would be a more comprehensive test of a car's performance, the simple metrics can be a better measure of specific abilities, with fewer external factors to interfere with the result. A lap time will be more affected by the driver's ability and weather conditions, for example, unlike top speed alone.
The scripts that follow are very simple, as all they need
to do is generate load. They don't generate a specific level of load or measure the
performance of that load, they just apply load as fast as they can. Measurements
are taken on the target - which can be a more reliable way of measuring what actually
happened (past the client caches.) For anything more complex, reach for a real
benchmarking tool.
Random Reads
This is my randread.pl script, which takes a filename as an argument:
#!/usr/bin/perl -w
#
# randread.pl - randomly read over specified file.
use strict;
my $IOSIZE = 8192; # size of I/O, bytes
my $QUANTA = $IOSIZE; # seek granularity, bytes
die "USAGE: randread.pl filename\n" if @ARGV != 1 or not -e $ARGV[0];
my $file = $ARGV[0];
my $span = -s $file; # span to randomly read, bytes
my $junk;
open FILE, "$file" or die "ERROR: reading $file: $!\n";
while (1) {
seek(FILE, int(rand($span / $QUANTA)) * $QUANTA, 0);
sysread(FILE, $junk, $IOSIZE);
}
close FILE;
Dead simple. Tune $IOSIZE to the I/O size desired.
This is designed to run over NFS or CIFS, so
the program will spend most of its time waiting for network responses - not
chewing over its own code, and so Perl works just fine. Rewriting this in C
isn't going to make it much faster, but it may be fun to try and see for
yourself (be careful with the resolution of rand(), which may not have the
granularity to span files bigger than 2^32 bytes.)
To run randread.pl, create a file for it to work on, eg:
# dd if=/dev/zero of=10g-file1 bs=1024k count=10k
which is also how I create simple sequential write workloads. Then run it:
# ./randread.pl 10g-file1 &
Sequential Reads
This is my seqread.pl script, which is similar to randread.pl:
#!/usr/bin/perl -w
#
# seqread.pl - sequentially read through a file, and repeat.
use strict;
my $IOSIZE = 1024 * 1024; # size of I/O, bytes
die "USAGE: seqread.pl filename\n" if @ARGV != 1 or not -e $ARGV[0];
my $file = $ARGV[0];
my $junk;
open FILE, "$file" or die "ERROR: reading $file: $!\n";
while (1) {
my $bytes = sysread(FILE, $junk, $IOSIZE);
if (!(defined $bytes) or $bytes != $IOSIZE) {
seek(FILE, 0, 0);
}
}
close FILE;
Once it reaches the end of a file, it loops back to the start.
Client Management Script
To test the limits of your storage target, you'll want to run these scripts on
a bunch of clients - ten or more. This is possible with some simple shell scripting.
Start by setting up ssh (or rsh) so that a master server (your desktop) can login
to all the clients as root without prompting for a password (ssh-keygen,
/.ssh/authorized_keys ...). My clients are named dace-0 through to dace-9,
and after setting up the ssh keys the following succeeds without a password prompt:
# ssh root@dace-0 uname -a
SunOS dace-0 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
Since I have 10 clients, I'll want an easy way to execute commands on them all at
the same time, rather than one by one. There are lots of simple ways to do this,
here I've created a text file called clientlist with the names of the clients:
# cat clientlist
dace-0
dace-1
dace-2
dace-3
dace-4
dace-5
dace-6
dace-7
dace-8
dace-9
which is easy to maintain. Now a script to run commands on all the clients in
the list:
#!/usr/bin/ksh
#
# clientrun - execute a command on every host in clientlist.
if (( $# == 0 )); then
print "USAGE: clientrun cmd [args]"
exit 1
fi
for client in $(cat clientlist); do
ssh root@$client "$@" &
done
Testing that this script works by running uname -a on every client:
# ./clientrun uname -a
SunOS dace-0 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-1 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-2 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-3 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-4 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-5 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-7 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-8 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-6 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
SunOS dace-9 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
Great.
Running the Workload
We've got some simple load generation scripts and a way to run them on our client
farm. Now to execute a workload on our target server, turbot.
To prepare for this:
- Scripts are available to the clients on an NFS share,
/net/fw/tools/perf. This makes it easy to adjust the scripts and have the clients run the latest version, rather than installing the scripts one by one on the clients.
- One share is created on the target NFS server for every client (usually more
interesting than just using one), and is named
the same as the client's hostname (/export/dace-0 etc.)
Creating a directory on the clients to use as a mount point:
# ./clientrun mkdir /test
Mounting the shares with default NFSv3 mount options:
# ./clientrun 'mount -o vers=3 tarpon:/export/`uname -n` /test'
The advantage of using the client's hostname as the share name is that it
becomes easy for our clientrun script to have each client mount their own share,
by getting the client to call uname -n to construct the share name.
(The single forward quotes in that command are necessary.)
Creating files for our workload. The clients have 3 Gbytes of DRAM each, so
10 Gbytes per file will avoid caching all (or most) of the file on the client,
since we want our clients to apply load to the NFS server and not hit from their
own client cache:
# ./clientrun dd if=/dev/zero of=/test/10g-file1 bs=1024k count=10k
This applys a streaming write workload from 10 clients, one thread (process) per
client. While that is happening, it may be interesting to login to the NFS server and
see how fast the write is performing (eg, network bytes/sec.)
With the test files created, I can now apply a streaming read workload like so:
# ./clientrun '/net/fw/tools/perf/seqread.pl /test/10g-file1 &'
That will apply a streaming read workload from 10 clients, one thread per client.
Run it multiple times to add more threads, however the client cache is more likely to interfere when trying this (one thread reads what the other just cached); on Solaris you can try adding the mount option forcedirectio to avoid the client cache altogether.
To stop the workload:
# ./clientrun pkill seqread
And to cleanup after testing is completed:
# ./clientrun umount /test
Stepping the Workload
Running one thread per client on 10 clients may not be enough to stress powerful
storage servers like the 7000 series. How many do we need to find the limits?
An easy way to find out is to step up the workload over time, until the target stops
serving any more.
The following runs randread.pl on the clients, starting with one and running another
every 60 seconds until ten are running on each client:
# ./clientrun 'for i in 1 2 3 4 5 6 7 8 9 10; do
/net/fw/tools/perf/randread.pl /test/10g-file1 & sleep 60; done &' &
This was executed with an I/O size of 4 Kbytes. Our NFS server turbot is a Sun Storage 7410.
The results from
Analytics on turbot:
Worked great, and we can see that after 10 threads per client we've pushed the target to
91% CPU utilization, so we are getting close to a limit (in this case, available
CPU.) Which is the aim of these type of tests - to drive load until we reach some
limit.
I included network bytes/sec in the screenshot as a sanity check; we've reached
138180 x 4 Kbyte NFS reads/sec, which would require at least 540 Mbytes/sec of
network throughput; we pushed 602 Mbytes/sec (includes headers.) 138K IOPS is
quite a lot - this server has 128 Gbytes of DRAM, so 10 clients with a 10 Gbyte file
per client means a 100 Gbyte working set (active data) in total, which has entirerly cached in
DRAM on the 7410. If I wanted to test disk performance (cache miss), I can increase
the per client filesize to create a working set much larger than the target's DRAM.
This type of testing can be useful to determine how fast a storage server can go - it's top speed. But that's all. For a better test of application performance, reach for a real benchmarking tool, or setup a test environment and run the application with a simulated workload.
With the release of the Sun Storage
7000 series
there has been much interest in the products performance, which I've been
demonstrating. In my
previous post I listed
10 suggestions
for performance testing - the big stuff you need to get right. The 10 suggestions are applicable to all perf testing on the 7000 series.
Here I'll dive into the smaller details of max performance tuning, which may only be of interest if you'd like to see the system reach its limits.
The little stuff
The following is a tuning checklist for achieving
maximum performance on the Sun Storage
7410,
particularly for finding performance
limits.
Please excuse the brevity of some descriptions - this was originally written as an internal
Sun document and has been released here in the interests of transparency. This kind of tuning is used during product development, to drive systems as fast as possible to identify and solve bottlenecks.
These can all be useful points to consider, but they do not apply
to all workloads. I've seen a number of systems that were misconfigured with random tuning suggestions found on the Internet. Please understand what you are doing before making changes.
Sun Storage 7410 Hardware
Use the max CPU and max DRAM option (currently 4 sockets of quad core Opteron, and 128 Gbytes of DRAM.)
Use 10 GbE. When using 2 ports, ideally use 2 x 10 GbE cards and one port on
each - which balances across two PCI-E slots, two I/O controllers, and both CPU to I/O
HyperTransports. Two ports on a single 10 GbE card will be limited by the PCI-E slot
throughput.
Using LACP to trunk over 1 GbE interfaces is a
different way to increase network bandwidth, but you'll have
port hashing to balance in small test environments (eg, fewer than 10 clients),
which can be a headache based on the client attributes (my 20 client test farm all
have even-numbered IP address and mac-addresses!)

Balance load across I/O controllers. There are two I/O controllers in the 7410, the MCP55 and the IO55. The picture on the right shows which PCI-E slots are served by which controller, labeled A and B. When using two 10 GbE cards, they should (already) be installed in the top left and bottom right slots, so that their load is balanced across each controller. When three HBAs are used, they are put in the bottom left and both middle slots; if I'm running a test that only uses two of them, I'll use the middle ones.
Get as many JBODs as possible with the largest disks possible. This will improve
random I/O for cache busting workloads - although considering the amount of
cache possible (DRAM plus Readzilla), to be cache busting it may become too artificial to be interesting.
Use 3 x HBAs, configure dual paths to chains of 4 x JBODs (if you have the
full 12.)
Consider Readzilla for random I/O tests - but plan for a long (hours)
warmup. Make sure the share database size is small (8 Kbytes) before file
creation. Also use multiple readzillas for concurrent I/O (you get about
3100 x 8 Kbyte IOPS from each of the current STECs.) Note that Readzilla
(read cache) can be enabled on a per filesystem basis.
Use Logzilla for synchronous (O_DSYNC) write workloads, which may include database
operations and small file workloads.
Sun Storage 7410 Settings
Configure mirroring on the pool. If there is interest in a different profile,
test it in addition to mirroring. Mirroring will especially help random IOPS read
and write workloads that are cache busting.
Create multiple shares for the clients to use, if this resembles the
target environment. Depending on the workload, it may improve performance by a tiny
amout for clients not to be hammering the same share.
Disable access time updates on the shares.
128K recsize for streaming. For streaming I/O tests: 128 Kbyte
"database size" on the shares before file creation.
8K recsize for random. Random I/O tests: set the share "database size"
to match the application record size. I generally
wouldn't go smaller than 4 Kbytes - if you make this extremely small then the ARC
metadata to reference tiny amounts of data can begin to consume Gbytes of DRAM.
I usually use 8 Kbytes.
Consider testing with LZJB compression on the shares on to see if it
improves
performance (it may relieve back-end I/O throughput.) For compression, make sure
your working set files actually contain data and aren't all zeros - ZFS has some tricks
for compressing zero data that will artificially boost performance.
Consider suspending DTrace based Analytics during tests for an extra
percent or so of performance (especially the by-file, by-client, by-latency, by-size
and by-offset ones.) The default Analytics
(listed in the HELP wiki) are negligible, and I leave them on during my limit testing
(without them I couldn't take screenshots.)
Don't benchmark a cold server (after boot.) Run a throwaway benchmark first
until it fills DRAM, then cancel it and run the real benchmark. This avoids a one-off
kmem cache buffer creation penalty, that may cost 5% performance or so only in the minutes immediately after boot.
Network
Use jumbo frames on clients and server ports (not needed for management). The 7410 has plenty of horsepower to drive 1 GbE with or without jumbo frames, so their use on 1 GbE is more to help the clients keep up. On 10 GbE the 7410s peak throughput will improve by about 20% with jumbo frames. This ratio would be higher, but we are using LSO (large send offload) with this driver (nxge) which keeps the non-jumbo frame throughput pretty high to start with.
Check your 10 Gbit switches. Don't assume a switch with multiple
10 GbE ports can drive them at the same time; some 10 Gbit switches we've tested cap at
11 Gbits/sec. The 7410 can have 4 x 10 GbE ports - so make sure the switches can
handle the load, such as by using multiple switches. You don't want to test
the switches by mistake.
Clients
Lots of clients. At least 10. 20+ is better. Everyone who tries to
test the 7410 (including me) gets client bound at some point.
A single client can't drive 10 GbE, without a lot of custom
tuning. And if the clients have 1 GbE interfaces, it should (obviously) take at least 10
clients to drive 10 GbE.
Connect the clients to dedicated switch(s) connected to the 7410.
You can reduce client caching by booting the clients with less DRAM: eg,
eeprom physmem=786432 for 3 Gbytes on Solaris x86. Otherwise the test
can hit from client cache and test the client instead of the target!
Client Workload
NFSv3 is generally the fastest protocol to test.
Consider umount/mount cycles on the clients between runs to flush the client
cache.
NFS client mount options: Tune the "rsize" and "wsize" option to match the
workload. eg, rsize=8192 for random 8 Kbyte IOPS tests - this reduces
unnecessary read-ahead; and rsize=131072 for streaming (and also try setting
nfs3_bsize to 131072 in /etc/system on the clients for the streaming tests.)
For read tests, try "forcedirectio" if your NFS client supports this option (Solaris does) - this especially helps clients apply a heavier workload by not using their own cache. Don't leave this option enabled for write tests.
Whatever benchmark tool you use, you want to try multiple threads/processes
per client. For max streaming tests I generally use 2 processes per client
with 20 clients - so thats 40 client processes in total; for max random IOPS
tests I use a lot more (100+ client processes in total.)
Don't trust benchmark tools - most are misleading in some way. Roch
explains Bonnie++ here, and we have
more posts planned for other tools. Always
double check what the benchmark tool is doing using Analytics on the 7410.
Benchmark working set: very important! To properly test the 7410, you
need to plan different total file sizes or working sets for the benchmarks.
Ideally you'd run one benchmark that the
server could entirely cache in DRAM, another that cached in Readzilla, and
another that hit disk.
If you are using a benchmark tool without tuning the file size, you are probably getting this wrong. For example, the defaults for iozone are very small (512 Mbytes and less) and can entirely cache on the client. If I want to test both DRAM and disk performance on a 7410 with 128 Gbytes of DRAM, I'll use a total file size of 100 Gbytes for the DRAM test, and at least 1 Terabyte (preferably 10 Terabytes) for the disk test.
Keep a close eye on the clients for any issues - eg, DTrace kernel activity.
Getting the best numbers involves tuning the clients as much as possible.
For example, use a large value for autoup (300); tune tcp_recv_hiwat (for read
tests, 400K should be good in general, 1 MB or more for long latency links.) The aim is to eliminate any effects from the available clients, and have results which are bounded by the targets performance.
Aim for my limits:
That will help sanity check your numbers - to see if they are way off.
The Sun Storage 7410 doesn't need special tuning at all (no messing with /etc/system settings.) If it did, we'd consider that a bug we should fix. Indeed, this is part of what
Fishworks is about - the expert tuning has already been done in our products. What there is left for the customer is simple and industry common: pick mirroring or double parity RAID, jumbo frames, no access time updates and tuning the filesystem record size. The clients require much, much more tuning and fussing when doing these performance tests.
In my next post, I'll show the simple tools I use to apply test workloads.
|
|