I've been studying the popular Bonnie++ load generator to see whether it
is a suitable benchmark for network attached storage (NAS) such as the
Sun Storage 7000 line.
So far I've looked only at single-client runs, and Bonnie++ does not appear
to be an appropriate tool in this environment: as we'll see here, many of
its tests stress either the networking environment or the strength of the
client-side CPU.
The first interesting thing to note is that Bonnie++ works
on a data set that is double the client's memory. This addresses
some of the client-side caching concerns one could otherwise have. In a
NAS environment, however, the amount of memory present on the server is
not considered by a default Bonnie++ run. My client had 4GB, leading to a
working set of 8GB, while the server had 128GB of memory.
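The sizing argument above can be written down as a few lines of arithmetic. This is only a sketch using the memory sizes quoted in this article; the function name is made up for illustration and is not part of Bonnie++.

```python
GIB = 2**30

def bonnie_working_set(client_ram_bytes: int) -> int:
    """Default data-set size as described above: twice the client's memory."""
    return 2 * client_ram_bytes

client_ram = 4 * GIB     # client memory quoted in this article
server_ram = 128 * GIB   # server memory quoted in this article

working_set = bonnie_working_set(client_ram)
# If the whole data set fits in server memory, reads may never touch disk.
fits_in_server_memory = working_set <= server_ram

print(f"working set: {working_set // GIB} GiB")
print(f"fits in server memory: {fits_in_server_memory}")
```

With these numbers the working set is 16 times smaller than the server's memory, so the server can serve it entirely from cache.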
The Bonnie++ output looks like this:
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
v2c01            8G 81160  92 109588 38 89987  67 69763  88 113613 36  2636  67
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   687  10 +++++ +++  1517   9   647  10 +++++ +++  1569   8
v2c01,8G,81160,92,109588,38,89987,67,69763,88,113613,36,2635.7,67,16,687,10,+++++,+++,1517,9,647,10,+++++,+++,1569,8
Method
I used a combination of Solaris truss(1), reading the Bonnie++ code,
looking at AmberRoad's Analytics data, and a custom Bonnie d-script in
order to understand how each test triggered system calls on the client
and how those translated into NAS server load. In the d-script, I
characterise the system calls by their average elapsed time as well as by
the time spent waiting for a response from the NAS server. The time
spent waiting is the operational latency that one should be interested
in when characterising a NAS, while the remaining time relates to the
client CPU strength along with the client NFS implementation. Here is
what I found when trying to explain the performance of each test.
Writing with putc()
Easy enough: this test creates a file using the single-character putc()
stdio library call.
This is clearly a client CPU test, with most of the time spent
in user space running putc(). Every 8192 putc() calls, the stdio library
issues a write(2) system call. That syscall is still a client CPU test,
since the data is absorbed by the client cache. What we test here is
the client's single-CPU performance and the client NFS implementation.
On a 2 CPU / 4GB V20z running Solaris, we observed on the server, using
Analytics, a network transfer rate of 87 MB/sec.
Results : 87 MB/sec of writes. Limited by single CPU speed.
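To make the buffering behavior concrete, here is a toy model of a stdio-style buffer that counts how many write calls a stream of single-character puts would generate. It is a sketch only: the class is hypothetical, it does no real I/O, and the 8192-byte buffer size is simply the figure observed above.

```python
class CountingBufferedWriter:
    """Toy model of a stdio-style buffer: one write call per full buffer."""

    def __init__(self, buffer_size: int = 8192):
        self.buffer_size = buffer_size
        self.buffered = 0      # bytes currently sitting in the buffer
        self.write_calls = 0   # simulated write(2) calls issued so far

    def putc(self) -> None:
        self.buffered += 1
        if self.buffered == self.buffer_size:
            self.flush()

    def flush(self) -> None:
        # Only issue a write if there is something to write.
        if self.buffered:
            self.write_calls += 1
            self.buffered = 0


writer = CountingBufferedWriter()
for _ in range(1_000_000):
    writer.putc()
writer.flush()  # final partial buffer, as on fclose()

# Same arithmetic for the full 8 GiB working set used in this article.
writes_for_8gib = (8 * 2**30) // 8192

print(writer.write_calls)   # 123
print(writes_for_8gib)      # 1048576
```

One million putc() calls collapse into 123 write calls, which is why nearly all the time in this test is spent in user space rather than in the kernel or on the wire.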
Writing intelligently...done
This one is more clever: it writes a file using sequential 8K write
system calls.
In this test the CPU is much relieved. The application issues 8K write
system calls to the client NFS, and these are absorbed by memory
on the client. With an OpenSolaris client, no over-the-wire request is
sent for such an 8K write. However, after 4 such 8K writes we reach the
natural 32K chunk advertised by the server, and that causes the
client to asynchronously issue a write request to the server. The
asynchronous nature means that the application does not
wait for the response, and the test keeps going on CPU. The process
now races ahead, generating more 8K writes and 32K asynchronous NFS
requests. If we manage to generate such requests at a greater rate than
the responses come back, we consume all allocated asynchronous threads. On
Solaris this maps to nfs4_max_threads (8) threads. When all 8
asynchronous threads are waiting for a response, the application
finally blocks, waiting for a previously issued request to get a
response and free an async thread.
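The limit imposed by a fixed pool of asynchronous threads can be estimated with a simple bytes-in-flight calculation. This is a back-of-the-envelope sketch, assuming each of the 8 threads holds exactly one 32K request at a time; the function names are made up for illustration and nothing here was measured.

```python
def max_async_throughput(threads: int, request_bytes: int,
                         response_time_s: float) -> float:
    """Upper bound in bytes/sec with at most `threads` requests in flight."""
    return threads * request_bytes / response_time_s


def max_response_time(threads: int, request_bytes: int,
                      target_bytes_per_s: float) -> float:
    """Largest per-request response time that still sustains the target rate."""
    return threads * request_bytes / target_bytes_per_s


THREADS = 8                # nfs4_max_threads, as quoted above
REQUEST_BYTES = 32 * 1024  # 32K write requests, as quoted above

# Response time below which 8 in-flight requests can sustain 113 MB/sec
# (MB taken as 10**6 bytes, an assumption).
budget_s = max_response_time(THREADS, REQUEST_BYTES, 113e6)
print(f"{budget_s * 1e3:.2f} ms")  # 2.32 ms
```

Under these assumptions, the 8 threads only become the bottleneck if the average response time for a 32K write exceeds roughly 2.3 ms; below that, the wire is the limit.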
Since generating 8K write system calls to fill the client cache is faster
than the network connection between the client and the server, we
eventually reach this point. The steady state of this test is that
Bonnie++ is waiting for data to transfer to the server. This happens
at the speed of a single NFS connection, which for us saturated the
1Gbps link we had. We observed 113MB/sec, which is network line rate
once protocol overheads are taken into account.
To get more throughput on this test, one could use jumbo frame Ethernet
instead of the default 1500-byte frame size, as this would reduce the
protocol overhead slightly. One could also configure the server and
client to use 10Gbps Ethernet links.
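The effect of frame size on usable bandwidth can be estimated from per-frame overheads. The sketch below assumes standard Ethernet framing (preamble, header, FCS, inter-frame gap) and IPv4 plus TCP headers without options; it ignores RPC and NFS headers, so real NFS throughput will sit somewhat below these figures.

```python
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12  # preamble+SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20             # IPv4 + TCP, no options (assumption)


def tcp_goodput_mb_s(mtu: int, link_bits_per_s: float = 1e9) -> float:
    """Best-case TCP payload rate in MB/sec (10**6 bytes) for a given MTU."""
    payload = mtu - IP_TCP_HEADERS        # application bytes per frame
    wire_bytes = mtu + ETHERNET_OVERHEAD  # bytes of link time per frame
    return link_bits_per_s / 8 * payload / wire_bytes / 1e6


standard = tcp_goodput_mb_s(1500)
jumbo = tcp_goodput_mb_s(9000)
print(f"MTU 1500: {standard:.1f} MB/sec")  # 118.7
print(f"MTU 9000: {jumbo:.1f} MB/sec")     # 123.9
```

On a 1Gbps link the gain from jumbo frames is only about 5 MB/sec in this model, which is consistent with calling the reduction in overhead slight.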
One could also use LACP link aggregation of 1Gbps network ports to
increase the throughput. LACP increases the aggregate throughput of
multiple network connections, but not that of a single socket. By default
a Solaris client establishes a single connection (clnt_max_conns = 1) to a
server (1 connection per target IP). So using multiple aggregated
links _and_ tuning clnt_max_conns could yield extra throughput here.
With a single connection, the only way to reach additional throughput
is a faster network link between client and server.
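Why a single connection cannot benefit from aggregation is easiest to see in a sketch of flow-based link selection. This is an illustration only: real aggregations pick the outgoing port with a policy-dependent hash over L2/L3/L4 header fields, whereas the CRC32 hash, the addresses, and the port numbers below are all assumptions made up for the example.

```python
import zlib


def pick_link(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
              n_links: int) -> int:
    """Map a flow to one member link; the same flow always gets the same link."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_links


N_LINKS = 2

# A single NFS connection (hypothetical addresses) always uses one link...
single = pick_link("10.0.0.1", 1023, "10.0.0.2", 2049, N_LINKS)
single_again = pick_link("10.0.0.1", 1023, "10.0.0.2", 2049, N_LINKS)

# ...while several connections get a chance to spread across the links.
flows = [("10.0.0.1", 1023 - i, "10.0.0.2", 2049) for i in range(4)]
links_used = {pick_link(*flow, n_links=N_LINKS) for flow in flows}

print(single == single_again)  # True
print(sorted(links_used))
```

Every packet of one connection hashes to the same member link, so more connections (for example via clnt_max_conns) are needed before additional links can carry any traffic.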
More commonly, we expect to saturate the client 1Gbps connectivity
here, not much of a stress for a Sun Storage 7000 server.
Results : 113 MB/sec of writes. Network limited.
Rewriting...done
This one gets a little interesting. It reads 8K, lseeks back to
the start of the block, overwrites the 8K with new data, and loops.
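The system call pattern of this test can be sketched as follows. Bonnie++ itself is written in C++, so this Python version only mirrors the read / lseek / write loop described above; the way the data is modified (flipping every bit) and the temporary file are assumptions for the sake of a runnable example.

```python
import os
import tempfile

CHUNK = 8192  # 8K blocks, as in the test described above


def rewrite_in_place(path: str) -> int:
    """Read each 8K block, seek back, overwrite it; return blocks rewritten."""
    blocks = 0
    fd = os.open(path, os.O_RDWR)
    try:
        while True:
            data = os.read(fd, CHUNK)
            if not data:
                break  # end of file
            # Step back to the start of the block we just read.
            os.lseek(fd, -len(data), os.SEEK_CUR)
            # Overwrite it with "new" data (every bit flipped).
            os.write(fd, bytes(b ^ 0xFF for b in data))
            blocks += 1
    finally:
        os.close(fd)
    return blocks


# Small demonstration on a 4-block temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * (4 * CHUNK))
    demo_path = tmp.name

n_blocks = rewrite_in_place(demo_path)
with open(demo_path, "rb") as f:
    content = f.read()
os.remove(demo_path)

print(n_blocks)  # 4
```

Over NFS, the file offset travels with each read and write request, so the lseek in the middle of the loop costs nothing on the wire.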
So here we read, lseek back, and overwrite. For the NFS protocol, lseek is
a no-op since every over-the-wire write is tagged with its target
offset. In this test we are effectively stream reading the file from
the server and stream writing the file back to the server. The stream
write behavior is much like the previous test: we never need to
block the process unless we consume all 8 asynchronous threads.
Similarly, 8K sequential reads are recognised by our client NFS as
streaming access, which triggers asynchronous readahead requests. We
use 4 (nfs4_nra) requests for 32K blocks ahead of the point currently
being read. What we observed here was that, of 88 seconds of elapsed
time, 15 were spent in writes and 20 in reads. However, only a small
portion of that was spent waiting for responses; it was almost all CPU
time spent interacting with the client NFS. This implies that readahead
and asynchronous writeback were working without becoming bottlenecks. The
Bonnie++ process took 50 sec of the 88 sec, and a big chunk of this, 27
sec, was spent waiting off CPU. I struggle somewhat with this
interpretation, but I do know from the Analytics data on the server
that the network is seeing 100 MB/sec of data flowing in each
direction. This must also be close to network saturation. The wait
time attributed to Bonnie++ in this test seems to be related to kernel
preemption. As Bonnie++ comes out of its system calls, we see events
such as this one in dtrace:
unix`swtch+0x17f
unix`preempt+0xda
genunix`post_syscall+0x59e
genunix`syscall_exit+0x59
unix`0xfffffffffb800f06
17570
This must be to service the kernel threads of higher priority, likely
the asynchronous threads being spawned by the reads and writes.
This test is then a stress test of a bidirectional flow of 32K data
transfers. Just like the previous test, to improve the numbers one
would need to improve the network throughput between the
client and server. Once that is done, it could potentially also benefit
from faster and more numerous client CPUs.
Results : 100MB/sec in each direction, network limited.
Reading with getc()...done
Reads the file one character at a time.
Back to a test of the client CPU, much like the first one.
We see that the readaheads are working great, since little time is spent
waiting (0.4 of 114 seconds). Given that this test does 1 million
read calls in 114 seconds, the average latency can be estimated at 114
usec.
Results : 73MB/sec, single CPU limited on the client.
Reading intelligently...done
start 'em...done...done...done...
Reads with 8K system calls, sequentially.
This test seems to use 3 spawned Bonnie++ processes to read the files.
The reads are of size 8K and we needed 1M of them to read our 8GB
working set. With Analytics we observed no I/O done on the server,
since it had 128GB of cache available to it. The network, on the other
hand, is saturated at 118 MB/sec.
The dtrace script shows that the 1M read calls collectively spend 64
seconds waiting (most of that for NFS responses). That implies a 64 usec
read response time for this sequential workload.
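The per-call figures quoted for the two read tests are simple divisions, reproduced here as a sketch. The only assumption is that the 8GB working set is exactly 8 GiB read in 8 KiB calls, which gives slightly more than the rounded "1 million" used in the text, so the results come out a little lower.

```python
def per_call_usec(total_seconds: float, calls: int) -> float:
    """Average time per call in microseconds."""
    return total_seconds / calls * 1e6


# Number of 8K reads needed to cover the 8GB working set.
read_calls = (8 * 2**30) // (8 * 1024)

# "Reading with getc()": 114 seconds elapsed, almost all of it on CPU.
getc_usec = per_call_usec(114, read_calls)

# "Reading intelligently": 64 seconds spent waiting across all read calls.
block_wait_usec = per_call_usec(64, read_calls)

print(read_calls)                # 1048576
print(f"{getc_usec:.1f}")        # 108.7
print(f"{block_wait_usec:.1f}")  # 61.0
```

Rounding the call count down to 1 million gives the 114 usec and 64 usec figures used in the text.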
Results : 118MB/sec, limited by Network environment.
start 'em...done...done...done...
Here it seems that Bonnie++ starts 3 helper processes,
which are used to read the files in the "Reading intelligently" test.
Create files in sequential order...done.
Here we see 16K files being created (with
creat(2)) and then closed.
This test creates and closes 16K files and took 22 seconds in our
environment. 19 seconds were used for the creates, of which 17.5 were
spent waiting for responses. That means a 1ms response time for file
creates. The test seems to be single threaded. Using Analytics we observe
13500 NFS ops per second to handle those file creates. We do see some
activity on the write-biased SSD, although very modest at 2.64MB/sec.
Given that the test is single threaded, we can't tell whether this metric
is representative of the NAS server's capability. More likely it is
representative of the single-thread capability of the whole environment,
made up of: client CPU, client NFS implementation, client network driver
and configuration, network environment including switches, and the
NAS server.
Results : 744 file creates per second per thread. Limited by operational latency.
Here is the Analytics view captured for this test and the
following 5 tests.
Stat files in sequential order...done.
This test was too elusive to observe, possibly because it works against cached stat information.
Delete files in sequential order...done.
Here we unlink(2) the 16K files.
The test calls the unlink system call for each of the 16K files. The run
takes 10.294 seconds, giving a rate of 1591 unlinks per second. Each call
goes off CPU, waiting about 600 usec for a server response.
Much like the file create test above, while we get information about
the single-threaded unlink time in this environment, it is
obviously not representative of the server's capabilities.
Results : 1591 unlinks per second per thread. Limited by operational latency.
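The create and unlink rates follow directly from the file count and the timings quoted above, and the wait time per call puts a ceiling on what a single thread can achieve. The sketch below redoes that arithmetic, assuming "16K files" means 16 x 1024; the variable names are made up for illustration.

```python
FILES = 16 * 1024  # "16K files" taken as 16384 (assumption)

# File creates: 22 s for the test, 17.5 s of it waiting on the server.
creates_per_sec = FILES / 22
create_wait_ms = 17.5 / FILES * 1e3

# Unlinks: 10.294 s for the test, about 600 usec waiting per call.
unlinks_per_sec = FILES / 10.294
# A single thread that waits 600 usec per call cannot exceed this rate.
unlink_ceiling_per_sec = 1 / 600e-6

print(f"{creates_per_sec:.0f} creates/s, {create_wait_ms:.2f} ms wait each")
print(f"{unlinks_per_sec:.0f} unlinks/s, ceiling {unlink_ceiling_per_sec:.0f}/s")
```

The observed 1591 unlinks per second sits just under the ceiling of about 1667 per second implied by the 600 usec wait, which is what "limited by operational latency" means for a single thread.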
Create files in random order...done.
We recreate 16K files, closing each one but also running a stat()
system call on each.
Stat files in random order...done.
Elusive as above.
Delete files in random order...done.
We remove the 16K files.
I could not discern in the "random order" tests any meaningful
differences from the sequential order ones.
Analytics screenshot of Bonnie++ run
Here is the full screenshot from Analytics, including disk and CPU
data.
The takeaway here is that a single instance of Bonnie++ does not generally
stress one Sun Storage 7000 NAS server, but it will stress the client CPU
and 1Gbps network connectivity. There is no multi-client support in
Bonnie++ (that I could find).
One can certainly start multiple clients simultaneously, but since the
different tests would not be synchronized, the output of Bonnie++ would
be very questionable. Bonnie++ does have a multi-instance
synchronisation mode, but it is based on semaphores and so only works if
all instances are running within the same OS environment.
So in a multi-client test, only the total elapsed time would be of
interest, and that would be dominated by the streaming performance,
as each client would read and write its working set 3 times over the
wire. File create and unlink times would also contribute to the total
elapsed time of such a test.
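To see how much of the elapsed time the streaming phases account for, one can divide the working set by the rate observed in each phase. This is a rough sketch only: it reuses the single-client rates from this article, takes MB as 10**6 bytes and the working set as 8 GiB (both assumptions), and the rates would certainly differ once several clients share a server and network.

```python
WORKING_SET_MB = 8 * 2**30 / 1e6  # 8 GiB expressed in MB of 10**6 bytes

# Single-client rates observed in this article, in MB/sec.
phase_rates = {
    "Writing with putc()": 87,
    "Writing intelligently": 113,
    "Rewriting": 100,  # 100 MB/sec in each direction
    "Reading with getc()": 73,
    "Reading intelligently": 118,
}

phase_seconds = {name: WORKING_SET_MB / rate
                 for name, rate in phase_rates.items()}
streaming_total = sum(phase_seconds.values())

for name, seconds in phase_seconds.items():
    print(f"{name:24s} {seconds:6.1f} s")
print(f"{'total':24s} {streaming_total:6.1f} s")
```

With these inputs the streaming phases add up to roughly 450 seconds, against 22 and about 10 seconds for the sequential create and unlink tests, so the streaming performance dominates the total.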
For a single-node, multi-instance Bonnie++ run, one would need
a large client, with at least 16 x 2GHz CPUs and about 10Gbps worth
of network capability, in order to properly test one Sun Storage 7410
server. Otherwise, Bonnie++ is more likely to show client and network
limits, not server ones. As for unlink capabilities, the topic is a
complex and important one that certainly cannot be captured
with simple commands. The interaction with snapshots and the I/O load
generated on the server during large unlink storms need to be studied
carefully in order to understand the competitive merits of different
solutions.
In Summary, here is what governs the performance of the
individual Bonnie++ tests :
| Writing with putc()... | 87 MB/sec | Limited by client's single CPU speed |
| Writing intelligently... | 113 MB/sec | Limited by network conditions |
| Rewriting... | 100 MB/sec | Limited by network conditions |
| Reading with getc()... | 73 MB/sec | Limited by client's single CPU speed |
| Reading intelligently... | 118 MB/sec | Limited by network conditions |
| start 'em...done...done...done... | | |
| Create files in sequential order... | 744 create/s | Limited by operational latency |
| Stat files in sequential order... | not observable | |
| Delete files in sequential order... | 1591 unlink/s | Limited by operational latency |
| Create files in random order... | same as sequential | |
| Stat files in random order... | same as sequential | |
| Delete files in random order... | same as sequential | |
So Bonnie++ won't tell you much about our server's
capabilities. Unfortunately, the clustered mode of Bonnie++ won't coordinate
multiple client systems and so cannot be used to stress a server.
Bonnie++ could be used to stress a NAS server using a single large
multi-core client with very strong networking capabilities.
In the end, though, I don't expect to learn much about our servers over
and above what is already known. For that, please check out our links here:
Low Level Performance of Sun Storage
Analyzing the Sun Storage 7000
Designing Performance Metrics...
Sun Storage 7xxx Performance Invariants
Here is the bonnie.d d-script used and the output generated, bonnie.out.
Interesting to see your results with Bonnie, it is indeed a pretty limited tool... however, the 7410 is a pretty big beast... have you considered running bonnie++ or even any other benchmarks against lighter-weight systems like the 7110 or 7210?
Posted by Erik LaBianca on January 12, 2009 at 05:22 AM MET