As promised in my last post, I upgraded my Infiniband test fabric to include a more powerful Sun Storage 7410. As luck would have it, Brendan just finished up his tests for the 7410 with the Istanbul processor upgrade and the system was available for IB testing. In my last set of experiments, I quickly exhausted my CPU, memory, and disk capabilities with the 8 clients connected to my IB fabric. Here, I've significantly upgraded my filer and added two more clients.

Fabric Configuration
Filer: Sun Storage 7410, with the following config:
- 256 Gbytes DRAM
- 8 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
- 4 sockets of six-core AMD Opteron 2600 MHz CPUs (Istanbul)
- 2 Sun DDR Dual Port Infiniband HCA
- 3 HBA cards
- noatime on shares, and database record size left at 128 Kbytes
Clients: 10 clients, each with the following config:
- 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
- 3 Gbytes of DRAM
- 1 Sun DDR Dual Port Infiniband HCA Express Module
- mount options:
  - read tests: mounted forcedirectio (to skip client caching), and rsize to match the workload
  - write tests: default mount options
Switches: 2 internal Sun DataCenter 3x24 Infiniband switches (A and C)
Subnet manager:
- Centos 5.2
- Sun HPC Software, Linux Edition
- 2 Sun DDR Dual Port Infiniband HCA
Most of the performance results you'll find reported for Infiniband (RDMA or IPoIB) are limited to cached workloads. While these types of tests help to evaluate the raw capabilities of the transport, they don't necessarily show how a storage system behaves or what the possible benefits are. Brendan chose his tests and workloads to demonstrate the 7410's maximum capabilities. The goal of the following experiments is to duplicate what Brendan demonstrated for ethernet and point out where the bottlenecks or problem spots are for Infiniband.
RDMA
NFS over the RDMA protocol is available in the 2009.Q3 software release for clients that support it. RDMA (Remote Direct Memory Access) moves data from memory on one host directly to memory on another host. The details of moving data between hosts are left to hardware, in our case the Infiniband HCAs. The advantage is that we can bypass the network and device software stacks and eliminate many of the data copies performed by the CPU. We should see a reduction in CPU utilization and an increase in the amount of data we can transfer between clients and the NFS server.
Max NFSv3 streaming cached read
This test demonstrates the maximum read throughput we can achieve over NFSv3/RDMA. The test reads a 1GByte file cached entirely in DRAM from the SS7410 filer to 10 clients. Each client is running 10 threads that are each performing 128KB read accesses from the filer and dumping the data into their DRAM. This test is effectively the same test used to publish typical results for the IB transport.
I am able to reach a bit beyond Brendan's 3.06 GBytes/sec with half the number of clients and reduce my CPU utilization to just 30%. In the graph above, we can calculate the throughput by multiplying the number of read IOPS (24041) by the read size (128KB), which gives 3.15 GBytes/sec. For confirmation, I can observe the throughput for both IB ports on the subnet manager, where we reach 3.18 GBytes/sec. The 3.18 GBytes/sec figure at the port level includes additional header information imposed by the transport.
mlx4_0 LID/Port XMIT bytes/second RECV bytes/second
5/2 3729230 1500731797
mthca0 LID/Port XMIT bytes/second RECV bytes/second
3/2 2682860 1580370155
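The IOPS-times-size estimate quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up for illustration, and it assumes that 128KB means 128 x 1024 bytes and that GBytes/sec is reported in decimal units (10^9 bytes).

```python
def throughput_bytes_per_sec(iops: int, io_size_kbytes: int) -> int:
    """Throughput implied by an operation rate and a fixed I/O size.

    Assumes 1 KByte = 1024 bytes (an assumption, not stated in the post).
    """
    return iops * io_size_kbytes * 1024


# NFS operations per second and I/O size taken from the cached read test.
nfs_rate = throughput_bytes_per_sec(24041, 128)
print(f"{nfs_rate / 1e9:.2f} GBytes/sec")  # prints "3.15 GBytes/sec"
```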
The bottleneck, however, is the PCI Express 1.0 I/O interconnect. The PCI Express 1.0 root complexes can (in practice) reach only 1.4-1.5 GBytes/sec. Using Brendan's amd64htcpu script, we can see that the PCIe interconnects are at or near their maximums:
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 5011.33 1374.05 4594.51 0.00
1 6982.65 6366.86 1890.57 0.00
2 5392.58 4343.28 5773.35 0.00
3 5228.30 5664.78 4247.36 0.00
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 4852.97 1329.00 4442.26 0.00
1 7011.03 6385.20 1893.62 0.00
2 5361.24 4331.55 5741.79 0.00
3 5201.71 5643.10 4244.37 0.00
Socket HT0 TX MB/s HT1 TX MB/s HT2 TX MB/s HT3 TX MB/s
0 6257.99 1705.76 5716.17 0.00
1 6036.49 5462.77 1614.62 0.00
2 5380.50 4360.88 5827.16 0.00
3 5207.53 5586.87 4231.31 0.00
Max NFSv3 streaming disk read
As much as I tried, I could not achieve a workload confined strictly to disk reads. The problem is not with the SS7410 but rather with the number of clients in my fabric. In order to obtain results for this test, I will have to increase the number or capabilities of my clients.
Max NFSv3 streaming disk write
Using the same 10 IB clients I used in my read experiments, I will drive 2 streaming write threads per client. Each thread uses a 32KB block size to stream to a separate file residing on a separate share.
I was pleasantly surprised to see that we can indeed break the 1 GByte/sec maximum Brendan saw with ethernet. The 1 GByte/sec result is obtained by multiplying the NFS write IOPS by the write size. I am unable to sanity check this result against the network throughput in Analytics, as we are bypassing the TCP/IP stack. I can, though, confirm the throughput on the fabric subnet manager using the port counters exported by each HCA port. According to the port counters, I am seeing a receive rate of roughly 1 GByte/sec. Using the port counters is not precise, as the time it takes to collect the information varies and the counters (being 32 bits in length) can wrap. But the counters do provide a way to confirm our transport throughput in the absence of Analytics for RDMA. On the subnet manager, mlx4_0 (LID/Port 5/2) is attached to switch A and mthca0 (LID/Port 3/2) is attached to switch C in the IB fabric topology.
mlx4_0 LID/Port XMIT bytes/second RECV bytes/second
5/2 333697 518640843
mthca0 LID/Port XMIT bytes/second RECV bytes/second
3/2 3030 518400821
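Since the port counters are described as 32-bit values that can wrap and the sampling interval varies, a per-second rate has to be derived from two samples and the measured elapsed time. The sketch below shows one way to do that. The function name and sample values are hypothetical, and it assumes the counter is expressed in bytes and wraps at most once between samples. The last lines simply add the two RECV rates listed above.

```python
def counter_rate(prev: int, curr: int, elapsed_sec: float, bits: int = 32) -> float:
    """Per-second rate from two samples of a wrapping counter.

    Assumes at most one wrap occurred between the two samples.
    """
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    # Modular subtraction gives the right delta even when curr < prev.
    delta = (curr - prev) % (1 << bits)
    return delta / elapsed_sec


# Hypothetical samples: the counter wrapped between the two reads.
print(counter_rate(4294967000, 704, 1.0))

# Sum of the RECV rates reported for the two ports in the write test.
total_recv = 518640843 + 518400821
print(f"{total_recv / 1e9:.2f} GBytes/sec")
```

Measuring the elapsed time around each collection, rather than assuming a fixed interval, reduces the error from variable collection time mentioned above.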
Max NFSv3 read ops/sec
As was the case with streaming reads from disk, my clients are insufficiently configured to push a maximum workload. I will need to increase the number of clients and try again.
IPoIB
The IPoIB protocol sends and receives packets through the TCP/IP network stack. Unlike RDMA, which bypasses the network stack, IPoIB incurs the performance costs inherent in the traditional TCP/IP software stack.

I re-ran the tests described above and summarize the results here.
| | RDMA | IPoIB |
|---|---|---|
| NFSv3 Streaming DRAM Read | 3.18 GBytes/second | 2.24 GBytes/second |
| NFSv3 Streaming Disk Read | | Not Available |
| NFSv3 Streaming Write | 1.00 GBytes/second | 753 MBytes/second |
| NFSv3 Max IOPS | | |
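To put the two transports side by side, the ratio of RDMA to IPoIB throughput can be computed from the two rows that have results for both. This is a small illustrative sketch: the dictionary and function names are made up, and the figures are the ones in the table, converted to bytes per second using decimal units.

```python
# (RDMA, IPoIB) throughput in bytes/second, taken from the summary table.
results = {
    "NFSv3 Streaming DRAM Read": (3.18e9, 2.24e9),
    "NFSv3 Streaming Write": (1.00e9, 753e6),
}


def rdma_over_ipoib(rdma: float, ipoib: float) -> float:
    """Ratio of RDMA throughput to IPoIB throughput for one test."""
    return rdma / ipoib


for name, (rdma, ipoib) in results.items():
    print(f"{name}: RDMA is {rdma_over_ipoib(rdma, ipoib):.2f}x IPoIB")
```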
As I build up my IB fabric with more or better clients, I'll update the results that I was unable to capture this time around. The next step is to build out and attach the 7410 to a QDR-based fabric with at least 20 clients. This should provide a client workload large enough to push the 7410 to its maximum potential.

