The Jupiter Interconnect
Relaxed Ordering is important to the Sun SPARC Enterprise M-series server I/O architecture. The SPARC Enterprise servers use a network of switches and crossbars to connect CPUs, memory access controllers (MACs), and I/O controllers (IOCs). This internal network, called the Jupiter interconnect (sometimes called the Jupiter bus, although it is not a "bus" at all), employs error detection and correction mechanisms and will retry transactions if a protocol error occurs between two nodes in the network. As a result, it is possible for one transaction to "pass" another, so that the agent issuing the transactions sees them complete in the opposite order from the order in which they were issued.
For example, consider an M9000 system with two system boards shown in the following figure:
[Figure: block diagram of System Board #1 and System Board #2]
Each system board has four I/O hostbridges, four MACs (memory access controllers), and four CPU chips. The IOCs, MACs and CPU chips on a system board are interconnected by four SC chips (system controller), and the SCs connect the system boards to the crossbar unit (XBU). [If you're curious, each SC has direct connections to all four MACs, all four CPU chips, and one of the hostbridges.]
Now let's take two simple transactions: Transaction TA and Transaction TB. Transaction TA is a write from a hostbridge on system board #1 to memory on system board #2. The transaction must go from the hostbridge to the SC on system board #1 (SC#1), then to the XBU, then to the SC on system board #2 (SC#2), then to the MAC on system board #2 (MAC#2). Transaction TB is a write from the same hostbridge to memory on the same system board. This transaction must go from the hostbridge to the SC on system board #1 (SC#1), then directly to the MAC on system board #1 (MAC#1). The following scenario shows how the transactions could get reordered while in flight:
- Hostbridge issues transaction TA to SC#1 on the same system board.
- Hostbridge issues TB to SC#1.
- SC#1 issues TA to XBU.
- SC#1 issues TB to MAC#1 on same system board.
- XBU issues TA to SC#2 on destination system board.
- MAC#1 commits data to RAM, sends acknowledge back to hostbridge that TB is complete.
- SC#2 issues TA to MAC#2.
- MAC#2 commits data to RAM, sends acknowledge back to hostbridge that TA is complete.
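The reordering above can be sketched with a toy latency model. This is only an illustration: the hop counts follow the paths described in the text, but the uniform cost of one time unit per link and the function names are assumptions made for the sketch, not measured Jupiter interconnect behavior.

```c
/* Assumed cost of crossing one link, in arbitrary time units. */
enum { HOP_COST = 1 };

/*
 * Number of links a write crosses from the hostbridge to the target MAC.
 *   Local:  hostbridge -> SC#1 -> MAC#1                 (2 links)
 *   Remote: hostbridge -> SC#1 -> XBU -> SC#2 -> MAC#2  (4 links)
 */
int hops_to_mac(int remote)
{
    return remote ? 4 : 2;
}

/* Time at which the MAC commits a write that was issued at issue_time. */
int commit_time(int issue_time, int remote)
{
    return issue_time + hops_to_mac(remote) * HOP_COST;
}
```

With TA (remote) issued at time 0 and TB (local) issued at time 1, commit_time(0, 1) is 4 and commit_time(1, 0) is 3, so TB commits before TA even though it was issued later.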
In order for the hostbridge to maintain the strict PCI ordering rules, it is necessary for the hostbridge to wait until the first transaction completes before issuing the next transaction. Using the above example, if TA and TB must adhere to the PCI strict ordering rules, the scenario would look very different:
- Hostbridge waits for all outstanding writes to be acknowledged.
- Hostbridge issues TA to SC#1.
- SC#1 issues TA to XBU.
- XBU issues TA to SC#2 on destination system board.
- SC#2 issues TA to MAC#2.
- MAC#2 commits data to RAM, sends acknowledge back to hostbridge that TA is complete.
- Hostbridge issues TB to SC#1.
- SC#1 issues TB to MAC#1.
- MAC#1 commits data to RAM, sends acknowledge back to hostbridge that TB is complete.
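To get a feel for what strict ordering costs, here is a rough sketch that compares the two scenarios. It assumes each link takes one time unit in each direction, that the acknowledge travels back along the same path, and that the hostbridge can issue one new write per time unit when it does not have to wait. All of these are simplifications for illustration, and the function names are made up.

```c
#include <stddef.h>

/* Time from issuing a write until its acknowledge returns (assumed symmetric). */
int round_trip(int hops)
{
    return 2 * hops;
}

/* Strict ordering: each write is issued only after the previous one is acknowledged. */
int strict_total(const int *hops, size_t n)
{
    int t = 0;
    for (size_t i = 0; i < n; i++)
        t += round_trip(hops[i]);
    return t;
}

/* Relaxed ordering: write i is issued at time i; done when the last acknowledge arrives. */
int relaxed_total(const int *hops, size_t n)
{
    int done = 0;
    for (size_t i = 0; i < n; i++) {
        int ack = (int)i + round_trip(hops[i]);
        if (ack > done)
            done = ack;
    }
    return done;
}
```

For TA (4 links) followed by TB (2 links), this model gives 12 time units under strict ordering and 8 when the writes are allowed to overlap. The gap grows with the number of writes in flight.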
When to use Relaxed Ordering
In most cases, relaxed ordering cannot be enabled on every transaction. Take, for example, a typical network interface card (NIC) architecture. The NIC might write a large number of data blocks, followed by an update of a descriptor block indicating that the data is available. When the driver sees the descriptor updated, it processes the data. It doesn't matter in what order the data blocks get committed to RAM. But the descriptor must be written after all of the data is in RAM; otherwise, the driver might see the updated descriptor and read a partially updated data buffer. Therefore, the data writes can employ relaxed ordering, while the descriptor must be strictly ordered so that it will not pass the data writes. Assuming the data transactions are far more numerous and much larger than the descriptor updates, the system will see high write-to-memory performance when relaxed ordering is enabled on the data transactions.
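The NIC rule can be written down as a small model. This is a sketch, not device firmware: the types and function names are hypothetical, and the only policy it encodes is the one described above (data writes are tagged relaxed, the descriptor write is not, and a write may pass earlier writes only if it is tagged relaxed).

```c
#include <stddef.h>

enum write_kind { WRITE_DATA, WRITE_DESCRIPTOR };

struct dma_write {
    enum write_kind kind;
    int relaxed;    /* 1 if the write carries the relaxed ordering attribute */
};

/* Tag a write per the policy in the text: only data writes are relaxed. */
struct dma_write make_write(enum write_kind kind)
{
    struct dma_write w = { kind, kind == WRITE_DATA };
    return w;
}

/* A later write may pass an earlier one only if the later write is relaxed. */
int may_pass(struct dma_write later)
{
    return later.relaxed;
}

/* A commit order is safe if no data block commits after the descriptor. */
int commit_order_is_safe(const enum write_kind *order, size_t n)
{
    int seen_descriptor = 0;
    for (size_t i = 0; i < n; i++) {
        if (order[i] == WRITE_DESCRIPTOR)
            seen_descriptor = 1;
        else if (seen_descriptor)
            return 0;   /* data landed after the descriptor was visible */
    }
    return 1;
}
```

Because the descriptor write is never tagged relaxed, may_pass() is false for it, so the only commit orders it can produce are ones that commit_order_is_safe() accepts.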
An I/O device should only set the relaxed ordering bit in the TLP header if the device is smart enough to know which transactions can be reordered without causing data corruption. Unfortunately, we've encountered some devices which set the relaxed ordering bit incorrectly.
Enabling Relaxed Ordering in Solaris
The Sun SPARC Enterprise servers are the first SPARC servers from Sun that support relaxed ordering. When we first started testing with hardware, we found that several cards did not support relaxed ordering, or did not support it correctly. The SAS controller used in the M4000/M5000 servers, for example, does not support relaxed ordering. The Gigabit Ethernet controller, on the other hand, incorrectly set the relaxed ordering bit in the TLP header for all transactions, including control block updates. To deal with this, we had to turn off relaxed ordering in the GBE controller itself.
Even though the device hardware did not support (or did not enable) relaxed ordering, good throughput from these devices required that they allow relaxed ordering on the Jupiter Interconnect. To deal with this, Sun added a new flag, DDI_DMA_RELAXED_ORDERING, which allows a device driver to specify which DMA buffers may be relaxed ordered. We also modified the SAS and GBE drivers to tag data buffers with the DDI_DMA_RELAXED_ORDERING bit; control buffers were not tagged.
To enable relaxed ordering, a device driver must set the DDI_DMA_RELAXED_ORDERING flag in the dma_attr_flags field of the ddi_dma_attr_t(9S) structure passed to ddi_dma_alloc_handle(9F). Per the ddi_dma_attr_t man page:
    DDI_DMA_RELAXED_ORDERING
    This optional flag can be set if the DMA transactions
    associated with this handle are not required to observe
    strong DMA write ordering among each other, nor with DMA
    write transactions of other handles.
    It allows the host bridge to transfer data to and from
    memory more efficiently and may result in better DMA
    performance on some platforms.
For an example of driver code which uses the DDI_DMA_RELAXED_ORDERING flag to enable relaxed ordering on data buffers, see the bge driver on OpenSolaris.org:
    /*
     * Enable PCI relaxed ordering only for RX/TX data buffers
     */
    if (bge_relaxed_ordering)
        dma_attr.dma_attr_flags |= DDI_DMA_RELAXED_ORDERING;
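The pattern of tagging data buffers but not control buffers might look roughly like the following. To keep the sketch compilable outside a Solaris kernel build, it uses stand-ins: struct sketch_dma_attr stands in for ddi_dma_attr_t(9S), SKETCH_RELAXED_ORDERING stands in for DDI_DMA_RELAXED_ORDERING (its value here is a placeholder), and attr_for_buffer() is a hypothetical helper, not a DDI function. A real driver would pass the resulting attributes to ddi_dma_alloc_handle(9F).

```c
/* Stand-in for ddi_dma_attr_t(9S): only the one field this sketch touches. */
struct sketch_dma_attr {
    unsigned int dma_attr_flags;
};

/* Stand-in for DDI_DMA_RELAXED_ORDERING; the value is a placeholder. */
#define SKETCH_RELAXED_ORDERING 0x1u

enum buf_kind { BUF_DATA, BUF_CONTROL };

/*
 * Hypothetical helper: start from the driver's base attributes and add the
 * relaxed ordering flag only for data buffers, and only when the driver's
 * tunable (like bge_relaxed_ordering in the bge excerpt) is enabled.
 */
struct sketch_dma_attr
attr_for_buffer(struct sketch_dma_attr base, enum buf_kind kind,
    int relaxed_enabled)
{
    if (relaxed_enabled && kind == BUF_DATA)
        base.dma_attr_flags |= SKETCH_RELAXED_ORDERING;
    return base;
}
```

Keeping separate attribute structures for data and control buffers means the descriptor updates stay strictly ordered even when the tunable is on.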
System Considerations
If you're going to deploy a system with a mix of I/O devices that support relaxed ordering (either in the TLP header, or using the DDI_DMA_RELAXED_ORDERING flag) and I/O devices that do not support relaxed ordering, you should consider the system impact. Take, for example, the I/O architecture of a system board on a Sun SPARC Enterprise M9000 server:
[Figure: M9000 I/O Unit block diagram]
The above diagram shows that an M9000 I/O Unit has two I/O controller chips, each IOC has two hostbridges, and each hostbridge contains the root complexes for two PCI-Express slots.
The ideal situation is for all the cards to enable relaxed ordering. On the other hand, let's say you have one legacy card that does not support relaxed ordering (perhaps it's a low-performance card where the vendor did not feel throughput, and therefore relaxed ordering, was important). If you put this low-performance card in, for example, slot 0 along with a high-performance card that supports relaxed ordering in slot 1, both cards will share a single hostbridge and therefore a single Jupiter Interconnect interface. If the hostbridge has some strictly ordered writes to memory from card 0, the relaxed-ordered writes from card 1 may queue up behind the strictly ordered writes.
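A quick way to check a planned M9000 I/O Unit layout for this problem is sketched below. The eight slots (two IOCs, two hostbridges each, two slots per hostbridge) come from the description above. The assumption that slots pair up consecutively as (0,1), (2,3), and so on extends the slot 0 and slot 1 example and should be checked against the actual hardware documentation. The names are invented for the sketch.

```c
enum { SLOTS_PER_HOSTBRIDGE = 2, IOU_SLOTS = 8 };

enum card { CARD_NONE, CARD_STRICT, CARD_RELAXED };

/* Assumed mapping: consecutive slot pairs share a hostbridge. */
int hostbridge_of_slot(int slot)
{
    return slot / SLOTS_PER_HOSTBRIDGE;
}

/* Returns 1 if hostbridge hb serves both a strict and a relaxed card. */
int hostbridge_is_mixed(const enum card slots[IOU_SLOTS], int hb)
{
    int strict = 0, relaxed = 0;
    for (int s = 0; s < IOU_SLOTS; s++) {
        if (hostbridge_of_slot(s) != hb)
            continue;
        if (slots[s] == CARD_STRICT)
            strict = 1;
        if (slots[s] == CARD_RELAXED)
            relaxed = 1;
    }
    return strict && relaxed;
}
```

With a strict card in slot 0 and a relaxed card in slot 1, hostbridge_is_mixed() reports hostbridge 0 as mixed, which is the layout the paragraph above warns about.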
For comparison, here is the I/O Unit for an M4000/M5000:
[Figure: M4000/M5000 I/O Unit block diagram]
In this case, PCI-X slot 0 and PCI-E slots 1 and 2 all share a hostbridge, while PCI-E slots 3 and 4 share the other hostbridge (there is only one IOC with two hostbridges on an M4000/M5000 I/O Unit). While the hostbridge-to-memory latency is not as large on the M4000/M5000 systems, mixing cards that support relaxed ordering under the same hostbridge as cards that require strict ordering can impact I/O throughput. Note that the SAS controller and the Gigabit Ethernet controller already have relaxed ordering enabled using the DDI_DMA_RELAXED_ORDERING flag in their respective drivers.
To maximize write-to-memory throughput, it is best to group cards that do not enable relaxed ordering together below the same set of hostbridges, and group high-performance cards that enable relaxed ordering together below a different set of hostbridges. At the same time, you don't want to oversubscribe the hostbridge. The hostbridge can easily handle a single x8 PCI-Express link writing at its top bandwidth of about 1.7 GB/s; however, two high-performance x8 cards could be limited by the hostbridge's Jupiter interface bandwidth of 2.1 GB/s. Of course, the best arrangement of I/O cards may depend on other factors as well; relaxed ordering is just one thing to keep in mind when building a system.
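The oversubscription arithmetic is simple enough to sketch. The two bandwidth figures are the approximate ones quoted above, expressed in MB/s to stay in integer math. The assumption that cards under one hostbridge share its Jupiter interface evenly is a simplification, and the function names are illustrative only.

```c
enum {
    X8_LINK_MBPS    = 1700,  /* approximate peak write rate of one x8 link */
    HOSTBRIDGE_MBPS = 2100   /* hostbridge Jupiter interface bandwidth */
};

/* Returns 1 if n x8 cards at full rate exceed the hostbridge interface. */
int hostbridge_oversubscribed(int n_cards)
{
    return n_cards * X8_LINK_MBPS > HOSTBRIDGE_MBPS;
}

/* Per-card write ceiling in MB/s, assuming the hostbridge is shared evenly. */
int per_card_ceiling_mbps(int n_cards)
{
    if (n_cards <= 0)
        return 0;
    int share = HOSTBRIDGE_MBPS / n_cards;
    return share < X8_LINK_MBPS ? share : X8_LINK_MBPS;
}
```

One card stays at its link limit of about 1700 MB/s. Two cards would together ask for 3400 MB/s, so under the even-sharing assumption each is held to about 1050 MB/s.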