Tuesday May 06, 2008
Today uperf is being open sourced. Uperf is one of the key network performance measurement tools that we use in the Performance Technologies (PT) group at Sun. It is unique in its ability to simulate a wide variety of network traffic, which can be specified easily in XML format, and in its ability to gather comprehensive network performance data from both the System Under Test (SUT) and the remote clients.
Let me give you an example. Suppose I wish to evaluate how my server responds to HTTP GET requests for medium-sized files or images. In this scenario, the server, which is the SUT, receives the GET requests, which are small (under 64 bytes) messages (we are excluding the network stack overhead here). In response it sends back a file or image of size 64 KB. Here is the uperf XML profile corresponding to this network scenario. The profile states that the SUT receives requests from three clients, which are specified by the shell environment variables host1, host2, and host3. The number of connections is specified by the variable nth. The transaction defines the pattern of the network traffic, which is a 64-byte read followed by a 64 KB write. This pattern repeats on every connection for the specified duration of 120 seconds.
One of the best ways to use uperf is to write a wrapper script that runs a workload under different scenarios and then parses the uperf output to generate a comprehensive report. Here is a bash script that does this. The script has been designed to use the output of vmstat 1 on Solaris; an equivalent script may be written for Linux. I examine how my SUT performs with the HTTP GET workload for different numbers of connections.
My network consists of my SUT connected to three clients with 10 Gigabit Ethernet Sun Neptune cards through a Cisco 6509 switch.
Now if I run my script as:
./runuperf.sh ~/ 199.199.1.1 199.199.1.2 199.199.1.3
I get the following neat output, which reports throughput as well as CPU utilization.
Section: uperf tests using the HTTP GET network profile
#Conn  Wnd   cpu(usr)  cpu(sys)  cpu(idle)  Throughput (Mbps)
1      256k  0         11        89         1585.10
10     256k  0         27        73         3631.86
30     256k  1         78        21         5790.68
100    256k  1         98        1          6577.13
300    256k  1         99        0          6537.30
1000   256k  2         98        0          6629.40
2000   256k  2         98        0          6652.14
3000   256k  1         99        0          6673.01
So I can easily see that, for the HTTP GET workload, my SUT scales up to about 100 simultaneous network connections. Beyond that point the CPU is almost fully utilized and throughput levels off at about 6.5 Gbps.
It is important to remember that uperf is used in the above example to emulate network traffic for HTTP GET. The above does not mean that my SUT will scale to only 100 HTTP
Note that I have parsed the uperf report to extract the bits that were most useful for me. Uperf reports are pretty comprehensive and I would refer you to the uperf documentation for the same.
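The scaling conclusion drawn from the report above can also be checked mechanically. Here is a minimal Python sketch that finds the first connection count whose throughput is within a chosen tolerance of the peak. The rows are copied from the report; the function name and the 5% tolerance are my own assumptions, not part of uperf.

```python
# Rows copied from the report above: (connections, throughput in Mbps).
RESULTS = [
    (1, 1585.10), (10, 3631.86), (30, 5790.68), (100, 6577.13),
    (300, 6537.30), (1000, 6629.40), (2000, 6652.14), (3000, 6673.01),
]


def saturation_point(results, tolerance=0.05):
    """Return the first (connections, throughput) row whose throughput
    is within `tolerance` of the peak throughput in `results`.

    The 5% default is an arbitrary illustrative threshold.
    """
    if not results:
        raise ValueError("no results to analyze")
    peak = max(tput for _, tput in results)
    for conns, tput in results:
        if tput >= (1.0 - tolerance) * peak:
            return conns, tput


if __name__ == "__main__":
    conns, tput = saturation_point(RESULTS)
    print(f"throughput saturates at about {conns} connections ({tput} Mbps)")
```

With the data above, the row for 100 connections is the first one that comes within 5% of the peak, which matches the reading of the table.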
Now if I were smart like Charles, I would draw a neat graph, and then show it to my manager :)
Have fun uperfing, and please do post your comments here.
Thursday Mar 20, 2008
Virtualization
Since virtualization is one of the hottest areas of growth today, it seems like a good time to blog about virtualization and networking. This is one of the beauties of blogging: just writing about a topic in a public forum motivates one to do more research and become more thorough and proficient with the subject.
So why is virtualization so hot? It is primarily because, as servers grow more and more powerful, virtualization allows multiple hosts to be consolidated on one physical system. The benefits of consolidation are many, mainly savings in power and administrative costs. The hosts can run very different operating systems. The challenge is to run each host independently of the others, so that the performance of one host does not depend on the performance of another, even though they all share the resources of the same physical system.
So what is the challenge that virtualization brings to networking? Simply put, sharing I/O is challenging. Why? Consider other components such as CPU and memory. Since modern servers have multiple CPUs, we can simply assign the desired number of CPUs to each host and not allow hosts to touch each other's CPUs. If a single CPU needs to be shared, that too can be done with a scheduling algorithm that follows a Time Division Multiplexing (TDM)-like approach. How about memory? Since memory is always managed as virtual memory, all we need to do is partition the memory and be careful with the paging algorithm. This is not always simple, because of memory locality issues in a Non-Uniform Memory Access (NUMA) system. But more on that later.
Now let us consider I/O. It is hard to partition peripheral devices across multiple hosts. Consider a Network Interface Card (NIC). Suppose two hosts do network I/O using this NIC simultaneously. Who resolves this conflict? Who coordinates the device instructions so that DMA mappings do not overlap with each other? How should network bandwidth be distributed fairly between the two hosts? These are challenging problems.
In comes the role of the hypervisor. The hypervisor is a thin layer of software which interfaces between the virtual hosts and the physical machine. Simply put, in a virtualized environment, it is the hypervisor which plays the role of managing all the resources, such as CPU, memory, and I/O, and coordinating all the instructions sent by the virtual hosts.
So now let us talk about virtualization and networking. Here are the prominent ways in which network I/O works over virtualized environments today. The hypervisor plays different roles depending on the solution chosen by the vendor.
Software solutions
Binary Translations:
The idea here is to trap the privileged instructions issued by the guest operating system (OS) at the hypervisor layer and translate them into safe instructions. Binary translation has historically been used by VMware to support virtualization of unmodified OSs such as Microsoft Windows. The guest OS is completely unaware of the hypervisor and issues instructions assuming it is executing on a bare-metal x86 box. The hypervisor classifies all instructions issued into two broad categories: those that may be executed directly (called non-privileged) and those that need to be translated (called privileged). Privileged instructions are translated on the fly and executed.
The biggest advantage of this technique is that it does not require any modification to the guest OS. However, performance often suffers because of the on-the-fly translation, and therefore the virtualization industry is moving towards paravirtualization and hardware-assisted virtualization.
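To make the classify-and-translate idea concrete, here is a toy Python sketch. It is only an illustration of the control flow and not how any real hypervisor works: instructions are plain strings, the PRIVILEGED set is a small illustrative subset of x86 ring-0 instructions, and the vmm_emulate_* call targets are hypothetical names invented for this example.

```python
# Illustrative subset of x86 instructions that may only run in ring 0.
PRIVILEGED = {"hlt", "lgdt", "lidt", "invlpg"}


def translate(instructions):
    """Toy binary translator over a list of assembly-like strings.

    Non-privileged instructions are passed through unchanged (direct
    execution); privileged ones are replaced by a call into a
    hypothetical hypervisor emulation routine.
    """
    translated = []
    for instr in instructions:
        opcode = instr.split()[0].lower()
        if opcode in PRIVILEGED:
            # Hypothetical safe replacement handled by the hypervisor.
            translated.append(f"call vmm_emulate_{opcode}")
        else:
            translated.append(instr)
    return translated


if __name__ == "__main__":
    guest_code = ["mov eax, 1", "hlt", "add eax, 2"]
    for line in translate(guest_code):
        print(line)
```

Every instruction has to pass through the classification step, which hints at why translation at run time costs performance compared to the other approaches.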

Paravirtualization:
In paravirtualization, the guest OS is modified to recognize the hypervisor and interact with it. The best examples of this technique are the open source Xen and Solaris XVM. In Solaris XVM, network I/O is handled by the Xen frontend driver, whose source code is available here. The frontend driver interacts with the Solaris XVM backend driver (found here), which runs in the control domain, also known as Dom0. Dom0 controls and manages the network and other I/O devices directly. Thus the network path from every guest OS is Guest OS -> Dom0 -> external world for transmit, and the reverse for receive. Dom0 plays the role of the arbitrator when multiple guest domains contend for network I/O.
Paravirtualization typically performs better than binary translation because the hypervisor does not have to inspect each and every instruction. Moreover, it works well in cases like guest domain to guest domain communication, since Dom0 can recognize such traffic and avoid sending the packets to the hardware. However, paravirtualized solutions require careful design (for example, to ensure that Dom0 does not become a bottleneck), and therefore carry a higher cost of support and maintenance.
Hardware solutions
Intel Virtualization Technology (VT) and AMD Pacifica virtualization technologies: Since 2006, both Intel and AMD have provided hardware support for virtualization. The hardware traps any privileged instruction and sends it to the hypervisor. This allows unmodified OSs to run on the Xen hypervisor on supported hardware. As an example, we can now run Windows XP, Solaris, and Linux with Solaris XVM on the same box. Support for hardware virtualization, although currently at an early stage, is expected to grow and become dominant in the coming years. But as of now, paravirtualized solutions generally outperform hardware-assisted solutions.
PCI-Express Technologies: I/O Virtualization (IOV)
The PCI-Express community is currently standardizing technologies that allow multiple OSs running simultaneously within a single computer to natively share PCI-Express devices. There are two main technologies undergoing standardization: single-root I/O virtualization and multi-root I/O virtualization. The idea here is to allow an OS to handle its own IOV-compliant interface over PCI-Express, while the underlying device is also shared by other virtual OSs running in the system. This will allow more parallelism in hardware and reduce the role of the hypervisor in arbitrating amongst multiple OSs competing for the same I/O.
The industry is currently in flux, moving from software-based virtualization solutions to hardware-assisted ones. How much the performance of hardware solutions will improve over time is difficult to predict. Therefore, paravirtualized solutions are still expected to be dominant for some time. It is interesting to see that most vendors support both hardware and software solutions for now.
Thursday Feb 28, 2008
Let us consider the process by which the operating system (OS) handles network I/O. For simplicity we consider the receive path. While there are some differences between Solaris, Linux, and other flavors of Unix, I will try to generalize the steps to construct a high-level representative picture. Here is an outline of the steps:
1. When packets are received, the Network Interface Card (NIC) performs a Direct Memory Access (DMA) to transfer the data to main memory. Once the amount of data prescribed by the interrupt coalescing parameter has been received, an interrupt is raised to inform the device driver of this event. The device driver assigns a data structure called the receive descriptor to handle the memory location identified by the DMA.
2. In the interrupt handling context, the device driver handles the packet in the DMA memory. The packet is processed through the network protocol stack (MAC, IP, TCP layers) in the interrupt context and is ultimately copied to the TCP socket buffer. The work of the interrupt handler ends at this stage. Solaris GLDv3 based drivers have a tunable to employ independent kernel threads (also known as soft rings) to handle the network protocol stack so that the interrupt CPU does not become the bottleneck. This is sometimes required on UltraSPARC based systems because of the large number of cores that they support.
3. The application thread, usually executing as a user-level process, then reads the packet from the socket buffer and processes the data appropriately.

Thus, data transfer between the NIC and the application may involve at least two copies of data: one from the DMA memory to kernel space, and the other from kernel space to user space. In addition, if the application is writing data to the disk, there may be an additional copy of data from memory to the disk. Such a large number of copies has high overhead, particularly when the network transmission line rate is high. Moreover, the CPU becomes increasingly burdened with the large amount of packet processing and copying.
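A rough back-of-the-envelope calculation shows why these copies hurt at high line rates. The Python sketch below is only an estimate under simplifying assumptions of my own: the NIC's DMA writes each received byte to memory once, each CPU copy reads and writes each byte once, and caching effects are ignored. The function name is hypothetical.

```python
def receive_memory_traffic(line_rate_bps, cpu_copies):
    """Estimate memory traffic in bytes/second on the receive path.

    Assumptions (simplified, caches ignored):
      - the NIC's DMA writes every received byte to memory once;
      - every CPU copy reads each byte once and writes it once.
    """
    bytes_per_second = line_rate_bps / 8
    dma_traffic = bytes_per_second
    copy_traffic = 2 * cpu_copies * bytes_per_second
    return dma_traffic + copy_traffic


if __name__ == "__main__":
    # 10 Gbps line rate with the two copies described above.
    traffic = receive_memory_traffic(10e9, cpu_copies=2)
    print(f"{traffic / 1e9:.2f} GB/s of memory traffic")
```

Under these assumptions, receiving at 10 Gbps (1.25 GB/s of payload) with two copies generates about five times that amount of memory traffic, before the application has done any useful work with the data.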
The following techniques have been studied to improve the end-system performance.
Protocol Offload Engines (POE)
Offload engines implement the functionality of the network protocol in on-chip hardware (usually in the NIC), which reduces the CPU overhead for network I/O. The most common offload engines are TCP Offload Engines (TOEs). TOEs have been demonstrated to deliver higher throughput as well as reduce the CPU utilization for data transfers. Although POEs improve network I/O performance, they do not completely eliminate the I/O bottleneck, as the data still must be copied to the application buffer space.
Moreover, TOE has a number of drawbacks, because of which it is not supported in the Linux kernel. Patches to provide TOE support to Linux were rejected for many reasons, which are documented here. The main reasons are: (i) difficulty of applying security updates, since TOE resides firmly in hardware; (ii) inability of TOE to perform well under stress; (iii) vulnerability to SYN flooding attacks; and (iv) difficulties in long-term kernel maintenance as TOE hardware evolves.
Zero-Copy Optimizations
Zero-copy optimizations, such as the sendfile() implementation in Linux 2.4, aim to reduce the number of copy operations between kernel and user space. As an example, in sendfile(), only one copy of data occurs when data is transferred from the file to the NIC. Numerous zero-copy enabled versions of TCP/IP have been developed, and implementations are available for Linux, Solaris, and FreeBSD. A limitation of most zero-copy implementations is the amount of data that may be transferred. As an example, sendfile() has a maximum file size limit of 2 GB. Although zero-copy techniques improve performance, they do not eliminate the contention for end-system resources.
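As an illustration of how an application uses sendfile(), here is a small Python sketch. It relies on socket.sendfile() from the Python standard library, which uses os.sendfile() where the platform provides it and falls back to ordinary send() calls otherwise. The socket pair, the temporary file, and the 64 KB payload are stand-ins of my own for a real client connection and a real file, and the function names are hypothetical.

```python
import os
import socket
import tempfile
import threading


def send_file(path, sock):
    """Send the whole file at `path` over `sock`; return bytes sent."""
    with open(path, "rb") as f:
        # Uses os.sendfile() where available, plain send() otherwise.
        return sock.sendfile(f)


def demo(size=64 * 1024):
    """Send `size` random bytes from a temp file across a socket pair."""
    payload = os.urandom(size)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    server, client = socket.socketpair()
    received = bytearray()

    def reader():
        # Drain the receiving end so the sender never blocks for long.
        while len(received) < size:
            chunk = client.recv(65536)
            if not chunk:
                break
            received.extend(chunk)

    thread = threading.Thread(target=reader)
    thread.start()
    try:
        sent = send_file(path, server)
    finally:
        server.close()
        thread.join()
        client.close()
        os.unlink(path)
    return sent, bytes(received) == payload


if __name__ == "__main__":
    sent, intact = demo()
    print(f"sent {sent} bytes, payload intact: {intact}")
```

Note that the file contents never pass through a buffer owned by the sending code when os.sendfile() is in use; the kernel moves the data from the file to the socket directly.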
Remote DMA (RDMA)
The RDMA protocol implements both POEs and zero-copy optimizations.
RDMA allows data to be written directly to, or read directly from, the application buffer without the involvement of the CPU or OS. It thus avoids the overhead of the network protocol stack and context switches, and allows transfers to continue in parallel with other executing tasks. However, apart from cluster computing environments, the acceptance of RDMA has been rather limited because of the need for a separate networking infrastructure. Moreover, RDMA raises security concerns, particularly in the setting of remote end-to-end data transfers.
Large Send Offload (LSO)/ Large Receive Offload (LRO)
LSO and LRO are NIC features that allow the network protocol stack to process large (up to 64 KB) segments. The NIC hardware splits the segments into 1500-byte MTU-sized packets for send (LSO) and combines incoming MTU-sized packets into a large segment for receive (LRO). LSO and LRO help save CPU cycles consumed in the network protocol stack because a single call can handle a 64 KB segment. LSO/LRO are supported in most NICs and are known to improve the CPU efficiency of networking considerably.
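The savings can be seen from simple arithmetic. The Python sketch below mimics in software what LSO does in hardware: it splits one large segment into MTU-sized packets. I assume 40 bytes of headers per packet (20-byte IP plus 20-byte TCP, no options), which is my own simplification; the function name is hypothetical.

```python
def lso_split(payload_len, mtu=1500, header_len=40):
    """Return the payload sizes of the packets that one large segment
    of `payload_len` bytes is split into.

    Assumes `header_len` bytes of IP + TCP headers per packet
    (40 bytes with no options), so each packet carries at most
    mtu - header_len bytes of payload.
    """
    mss = mtu - header_len
    if mss <= 0:
        raise ValueError("header_len must be smaller than mtu")
    full_packets, remainder = divmod(payload_len, mss)
    sizes = [mss] * full_packets
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    sizes = lso_split(64 * 1024)
    print(f"{len(sizes)} packets, last one carries {sizes[-1]} bytes")
```

Under these assumptions a 64 KB segment becomes 45 packets on the wire, so without LSO the protocol stack would be traversed 45 times instead of once for the same data.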
Transport Protocol Mechanisms
There are several approaches to optimizing TCP performance. Most focus on improving the Additive Increase Multiplicative Decrease (AIMD) congestion control algorithm of TCP, which can be inefficient at very high bit rates because a single packet loss may sharply cut the transfer rate. Also, the congestion control algorithm in TCP has been shown not to scale in high Bandwidth Delay Product (BDP) settings (connections with both high bandwidth and high Round Trip Time (RTT)).
To remedy these problems, a large variety of TCP variants that improve on the congestion control algorithm have been proposed, such as FAST, High-Speed TCP (HS-TCP), Scalable TCP, BIC-TCP, and many others.
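To see why standard AIMD struggles in high BDP settings, consider how long it takes to recover from a single loss. The Python sketch below uses the textbook model, which is a simplification of real TCP: the congestion window is halved on a loss and then grows by one packet per RTT. The 10 Gbps link, 100 ms RTT, and 1500-byte packets are example values of my own, and the function names are hypothetical.

```python
def bdp_packets(bandwidth_bps, rtt_s, packet_bytes=1500):
    """Window size (in packets) needed to fill the pipe."""
    return bandwidth_bps * rtt_s / (8 * packet_bytes)


def rtts_to_recover(full_window):
    """Count RTTs for AIMD to grow back to `full_window` after one loss.

    Textbook model: multiplicative decrease halves the window, then
    additive increase adds one packet per RTT.
    """
    cwnd = full_window / 2.0
    rtts = 0
    while cwnd < full_window:
        cwnd += 1.0
        rtts += 1
    return rtts


if __name__ == "__main__":
    rtt = 0.1  # seconds
    window = bdp_packets(10e9, rtt)
    rtts = rtts_to_recover(window)
    print(f"window: {window:.0f} packets")
    print(f"recovery: {rtts} RTTs, about {rtts * rtt / 60:.0f} minutes")
```

In this model a 10 Gbps connection with a 100 ms RTT needs a window of roughly 83,000 packets, and a single loss costs over an hour of additive increase before the link is full again. This is the behavior that the TCP variants listed above try to improve.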
Monday Nov 19, 2007
Just read an article from Nyquist Capital about Google designing its own 10 GigE switches. It's interesting how the authors traced the source of thousands of SFP+ components to Google to determine this. The article discusses how the strategy is very similar to Google designing its own servers and compute farms from basic components, as opposed to buying servers from companies like Sun. The article also mentions Google adding 5K+ 10 GigE ports a month to manage its 500,000+ compute nodes.
That's a ton of money being saved, given the cost of 10 GigE switches in the market today. Which brings us to the question: what is so proprietary about a switch that vendors like Cisco, Juniper, Woven, etc. can sell them at a huge price and a huge margin? Does IOS-X from Cisco have some patented stuff that open source software cannot implement or doesn't have? I would love to see the above Google technique revolutionize the switch market and motivate some startup to design a switch fabric based on commodity hardware.