SPARC Enterprise M-class Servers The Secrets of Olympus

Friday Jul 27, 2007

Alex Noordergraaf has inaugurated his blog Systems Platform Security with a post about Sun SPARC Enterprise M-class Security. This first post is an overview, and he promises more details in the future. Check it out!

Wednesday Jul 11, 2007

The Sun SPARC Enterprise M-class server service processor (XSCF) has several commands that are new or different from previous enterprise-class servers. Most users are probably aware that the 'man' facility is available for all commands; however, as someone once said (I think it was Scott McNealy), "man has the answer to any question, conveniently organized by answer." In other words, if you don't know what command to use, man isn't going to help. Luckily there are a couple of things that can help...

The XSCF man facility does include a standard Intro(8) topic, which provides a complete list of commands and a short synopsis. For example:

    XSCF> man intro

    System Administration                                    Intro(8)

    NAME
         Intro - eXtended System Control Facility (XSCF) man pages

    DESCRIPTION
         This manual contains XSCF man pages.

    LIST OF COMMANDS
         The following commands are supported:

         Intro, intro            eXtended  System  Control   Facility
                                 (XSCF) man pages

         addboard                configure   an    eXtended    System
                                 Board(XSB)  into  the  domain confi-
                                 guration or assigns it to the domain
                                 configuration

         addcodlicense           add  a  Capacity  on  Demand   (COD)
                                 right-to-use  (RTU)  license  key to
                                 the COD license database

         addfru                  add a Field Replaceable Unit (FRU)

         adduser                 create an XSCF user account
    ...

On the other hand, you may sort of know the command you want to run, but aren't sure of the specifics. For example, when setting up the network I sometimes forget whether the command is 'setnet' or 'setnetwork'. As in a standard bash shell, the Tab key can be used to complete the command. For example, 'setnet<TAB>' will expand to 'setnetwork'.

Also, like the bash shell, you can use the double-tab to display a list of possible completions. This comes in handy when you know you need a 'set' command, but you don't recall specifically which command; you can do 'set<TAB><TAB>' and get a complete list of all 'set*' commands. For example (using 'setn<TAB><TAB>' which is a little less verbose):

    XSCF> setn<TAB><TAB>
    setnameserver  setnetwork     setntp         
    XSCF> setn

Finally, all of the XSCF commands consistently implement the -h option to display the command's synopsis. This isn't the full man page, just the synopsis. For example:

    XSCF> setnetwork -h
    usage: setnetwork [-m addr] interface address
           setnetwork -c {up | down} interface
           setnetwork -h
That's usually enough to help you figure out what arguments you need to provide. And when it isn't, you always have man to provide all the details.

Tuesday Jul 10, 2007

Chris Kevlahan here at Sun has done extensive power measurements of the SPARC Enterprise M4000/M5000 servers, and put together a spreadsheet to estimate power usage based on machine configuration. I turned that into Javascript so I could embed it in this blog.

The purpose of this tool is:

  • Estimate and calculate the power consumption of a planned configuration.
  • Estimate and calculate the cooling requirements of a planned configuration.
Due to proprietary agreements with Fujitsu, the power calculator has been removed from this blog.
Notes:
  • Total number of memory boards must be 2 or more.
  • Total number of memory boards must not exceed 4 (M4000) or 8 (M5000).
  • Power usage for PCI/PCI-Express cards is estimated using 14 Watts average; however, 25 Watts is the maximum.

Friday Jul 06, 2007

The Sun SPARC-Enterprise M9000-64 has internal support for up to 64 internal 2.5" SAS disks -- that's more disks than the Sun Fire X4500 (although, the X4500 does take up a lot less space). We were concerned that the system administrator might find it difficult to locate a disk, or map c0t0d0 to a physical disk.

To make service a bit easier, each physical disk bay in the Sun SPARC-Enterprise M-class line of servers has two LEDs: one for power, the other for fault indications. We used the fault LED in blinking mode to act as an indicator so the system administrator can identify and locate a specific disk.

The disk LEDs are accessed using cfgadm(1M). The cfgadm PCI plugin always supported LED manipulation, using a -x option. So we expanded the cfgadm SCSI plugin to use the same syntax. We wanted to add the ability to directly control the disk LEDs, so we added "-x led=LED[,mode=MODE]" (where LED could be power, fault, active or attn, just like PCI, and mode can be on, off or blink).

But we also felt there was a general problem being solved here, that of a "locator" indication. In the case of the M-class servers, the "locator" indication happens to be blinking the fault LED, but in the future, other platforms may use some other method (for example, a separate locator LED, an annoying warble sound like a car alarm coming from the disk, or an LCD display on the front of the server that draws a hand pointing to the disk you want to remove). Our solution should not presuppose that the fault LED, or any LED for that matter, must be used as the locator.

So another -x option was added to the cfgadm SCSI plugin: "-x locator[={on|off}]". This allows a user to turn on or turn off the locator indication, regardless of the underlying implementation.

Here's a snippet from the new cfgadm_scsi(1M) man page showing the two new -x options:

 -x hardware_function   Some of the following  commands  can 
                        only  be  used with SCSI controllers
                        and some only with SCSI devices.

                        In the  following,  controller_ap_id
                        refers  to  an ap_id for a SCSI con-
                        troller,    for     example,     c0.
                        device_ap_id  refers to an ap_id for
                        a   SCSI   device,   for    example:
                        c0::dsk/c0dt3d0.

                        The  following   hardware   specific
                        functions are defined:

                        locator [=on|off] device_ap_id
                            Sets or gets the hard disk loca-
                            tor  LED,  if  it is provided by
                            the platform.  If  the  [on|off]
                            suboption  is not set, the state
                            of  the  hard  disk  locator  is
                            printed.

                        led[=LED,mode=on|off|blink]
                        device_ap_id
                            If  no  sub-arguments  are  set,
                            this  function  print  a list of
                            the  current  LED  settings.  If
                            sub-arguments   are   set,  this
                            function  sets  the  mode  of  a
                            specific LED for a slot.

To give you an idea of how this works, here's some annotated output from a Solaris session:

    Check the current state of the fault LED for disk c0t0d0; it is now off:
    # cfgadm -x led=fault c0::dsk/c0t0d0
    Disk                    Led
    c0t0d0                  fault=off
    
    Turn on the locator indication for disk c0t0d0:
    # cfgadm -x locator=on c0::dsk/c0t0d0
    # cfgadm -x locator c0::dsk/c0t0d0
    Disk                    Led
    c0t0d0                  locator=on
    
    Check the current state of the fault LED; it is now blinking:
    # cfgadm -x led=fault c0::dsk/c0t0d0
    Disk                    Led
    c0t0d0                  fault=blink
    
    Turn off the fault LED:
    # cfgadm -x led=fault,mode=off c0::dsk/c0t0d0
    # cfgadm -x led=fault c0::dsk/c0t0d0
    Disk                    Led
    c0t0d0                  fault=off
    
    And the locator indication is also off:
    # cfgadm -x locator c0::dsk/c0t0d0
    Disk                    Led
    c0t0d0                  locator=off
    

The intention is certainly that the system administrator would use '-x locator' to locate and replace internal SAS disks. But once SCSI FMA is supported in Solaris, we'll have the ability to set the fault LED on disks, and administrators can use '-x led' to view the state of the fault LED.
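As a rough sketch of how the two -x options fit together, the following Python helpers assemble the cfgadm argument lists shown in the session above. The helper names (locator_command, led_command) are hypothetical and exist only for illustration; the option syntax is taken from the man page snippet, and nothing here talks to real hardware.

```python
# Hypothetical helpers that only build argument lists; they do not run cfgadm.
LEDS = ("power", "fault", "active", "attn")   # LED names listed in the post
MODES = ("on", "off", "blink")                # modes listed in the post


def locator_command(device_ap_id, state=None):
    """Build argv for 'cfgadm -x locator[={on|off}] device_ap_id'."""
    if state not in (None, "on", "off"):
        raise ValueError("state must be 'on', 'off', or None")
    function = "locator" if state is None else f"locator={state}"
    return ["cfgadm", "-x", function, device_ap_id]


def led_command(device_ap_id, led=None, mode=None):
    """Build argv for 'cfgadm -x led[=LED[,mode=MODE]] device_ap_id'."""
    if led is None:
        if mode is not None:
            raise ValueError("mode requires an LED name")
        function = "led"
    else:
        if led not in LEDS:
            raise ValueError(f"unknown LED: {led}")
        function = f"led={led}"
        if mode is not None:
            if mode not in MODES:
                raise ValueError(f"unknown mode: {mode}")
            function += f",mode={mode}"
    return ["cfgadm", "-x", function, device_ap_id]


if __name__ == "__main__":
    print(" ".join(locator_command("c0::dsk/c0t0d0", "on")))
    print(" ".join(led_command("c0::dsk/c0t0d0", "fault", "off")))
```

Keeping the locator request separate from the LED request mirrors the design point above: a caller asks for "locate this disk" without knowing which indicator the platform uses.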

Thursday Jul 05, 2007

The new Sun External I/O Expansion Unit, or IO Box, supports PCI-X and PCI-Express hot plug. However, early on we realized that with the IO Box being remote from the host, it would be a challenge figuring out what the hot plug attachment points would be.

Normally, PCI card AP IDs (attachment point IDs) are labeled based on the physical location of the card -- its I/O Unit (IOU) and slot. For example, on Sun SPARC Enterprise M-class servers, the AP ID "iou#0-pci#1" is the slot PCI#1 on I/O Unit IOU#0.

With IO Box, however, there is no fixed physical location for the IO Box slots. The IO Box does connect to a host slot (like iou#0-pci#1), so one could label the AP IDs something like "iou#0-pci#1:iob.pci3" to show that it's PCI slot 3 in an IO Box attached to IOU#0-PCI#1.

On the other hand, this introduces issues when the IO Box is physically remote from the server -- it might not be obvious where this box connects to the host. We don't want customers tracing cables, and a simple mistake could cause you to power off or power on the wrong slot. Something better, something more reliable was needed.

So we augmented the AP ID to include the serial id of the IO Box boat. With this approach, someone can look at an IO Box, write down the serial id and slot they want to power off, then go back to Solaris and power off that slot based on serial id. Similarly, if you have powered off a slot and want to remove the card, you can write down the serial id and slot, then go find the IO Box boat with the matching serial id, and have confidence that you're removing the right card.

The resulting AP ID is a combination of the physical location of the host slot and the serial id of the IO Box boat. An example of the AP ID format looks like this: "iou#0-pci#4:iobE00E7.pcie1", which shows PCIE slot 1 in the IO Box boat with serial id ending in "E00E7". From the AP ID, it is also clear that the IO Box boat is connected to the host using a link card in host slot IOU#0-PCI#4.
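To make the format concrete, here is a minimal Python sketch that splits such an AP ID into its parts. The regular expression is inferred only from the examples in this post (host slots like "iou#0-pci#1", box slots like "iobE00E7.pcie1" or "iobX00FC.pci3"), so treat it as an assumption rather than the official grammar; the ApId type and parse_ap_id name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ApId(NamedTuple):
    iou: int                  # I/O Unit number of the host slot
    host_slot: int            # PCI slot number on that I/O Unit
    serial: Optional[str]     # trailing serial id of the IO Box boat, if any
    bus: Optional[str]        # "pci" (PCI-X boat) or "pcie" (PCI-Express boat)
    box_slot: Optional[int]   # slot number inside the boat


# Pattern inferred from the sample AP IDs in this post (an assumption).
AP_ID_PATTERN = re.compile(
    r"^iou#(?P<iou>\d+)-pci#(?P<host_slot>\d+)"
    r"(?::iob(?P<serial>[0-9A-Za-z]+)\.(?P<bus>pcie|pci)(?P<box_slot>\d+))?$"
)


def parse_ap_id(ap_id: str) -> ApId:
    match = AP_ID_PATTERN.match(ap_id)
    if match is None:
        raise ValueError(f"unrecognized AP ID: {ap_id!r}")
    box_slot = match.group("box_slot")
    return ApId(
        iou=int(match.group("iou")),
        host_slot=int(match.group("host_slot")),
        serial=match.group("serial"),
        bus=match.group("bus"),
        box_slot=int(box_slot) if box_slot is not None else None,
    )


if __name__ == "__main__":
    print(parse_ap_id("iou#0-pci#4:iobE00E7.pcie1"))
    print(parse_ap_id("iou#0-pci#1"))
```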

I should also note that the IO Box boat serial id is prominently featured on the handle, in plain view. There's no need to remove the boat to get to the product nameplate.

Here's some sample output from 'cfgadm -a' showing just the PCI slots:

    # cfgadm -a
    Ap_Id                          Type         Receptacle   Occupant     Condition
    ...
    iou#0-pci#0                    unknown      empty        unconfigured unknown
    iou#0-pci#1                    unknown      empty        unconfigured unknown
    iou#0-pci#2                    unknown      disconnected unconfigured unknown
    iou#0-pci#3                    pci-pci/hp   connected    configured   ok
    iou#0-pci#3:iobX00FC.pci1      unknown      empty        unconfigured unknown
    iou#0-pci#3:iobX00FC.pci2      fibre/hp     connected    configured   ok
    iou#0-pci#3:iobX00FC.pci3      scsi/hp      connected    configured   ok
    iou#0-pci#3:iobX00FC.pci4      unknown      empty        unconfigured unknown
    iou#0-pci#3:iobX00FC.pci5      fibre/hp     connected    configured   ok
    iou#0-pci#3:iobX00FC.pci6      unknown      empty        unconfigured unknown
    iou#0-pci#4                    pci-pci/hp   connected    configured   ok
    iou#0-pci#4:iobE00E7.pcie1     unknown      empty        unconfigured unknown
    iou#0-pci#4:iobE00E7.pcie2     etherne/hp   connected    configured   ok
    iou#0-pci#4:iobE00E7.pcie3     etherne/hp   connected    configured   ok
    iou#0-pci#4:iobE00E7.pcie4     pci-pci/hp   connected    configured   ok
    iou#0-pci#4:iobE00E7.pcie5     pci-pci/hp   connected    configured   ok
    iou#0-pci#4:iobE00E7.pcie6     unknown      empty        unconfigured unknown
    
In the above output, IOU#0-PCI#0 through IOU#0-PCI#4 are the host slots; IOU#0-PCI#3 is connected to a PCI-X IO Box boat, while IOU#0-PCI#4 is connected to a PCI-Express IO Box boat.
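A listing like this is easy to group by host slot. The sketch below is a minimal Python example, assuming the five whitespace-separated columns shown above and using a shortened copy of the sample output; the function name is hypothetical.

```python
# Shortened copy of the sample 'cfgadm -a' output above.
SAMPLE = """\
Ap_Id                          Type         Receptacle   Occupant     Condition
iou#0-pci#2                    unknown      disconnected unconfigured unknown
iou#0-pci#3                    pci-pci/hp   connected    configured   ok
iou#0-pci#3:iobX00FC.pci1      unknown      empty        unconfigured unknown
iou#0-pci#3:iobX00FC.pci2      fibre/hp     connected    configured   ok
iou#0-pci#4                    pci-pci/hp   connected    configured   ok
iou#0-pci#4:iobE00E7.pcie2     etherne/hp   connected    configured   ok
"""


def io_box_slots_by_host_slot(text):
    """Map each host slot to the (box slot, occupant state) pairs behind it."""
    groups = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, '...' elisions, and anything without five columns.
        if len(fields) != 5 or fields[0] == "Ap_Id":
            continue
        ap_id, _type, _receptacle, occupant, _condition = fields
        host_slot, separator, box_slot = ap_id.partition(":")
        if not separator:
            continue  # a host slot itself, not a slot inside an IO Box boat
        groups.setdefault(host_slot, []).append((box_slot, occupant))
    return groups


if __name__ == "__main__":
    for host_slot, slots in io_box_slots_by_host_slot(SAMPLE).items():
        print(host_slot, slots)
```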

Tuesday Jul 03, 2007

One of the most distinctive features of the Sun External I/O Expansion Unit, or IO Box, is the way it is managed.

We knew customers would not want an unmanaged, "black box" for external I/O. The system administrator needs access to the status of the IO Box. They also need to know when the IO Box fails, and why it is failing (and how that failure is affecting the rest of the system). And being able to light "locator" indicators to find one of hundreds of IO Boxes located remotely from the host server was critical for service. Providing a managed IO Box, with fault and status information available using standard protocols, and LEDs controlled by the host, is essential to provide a highly available system.

On the other hand, we didn't want the IO Box to be another system that the customer had to manage. The IO Box does not have a service processor; it doesn't have its own MIB; it doesn't need software upgrades; it doesn't have an ethernet port.

The IO Box is just a bunch of I/O slots available to the host; it shouldn't matter if they are located in the same chassis as, or several meters (or several dozen meters) away from, the CPUs and memory.

The IO Box is fully managed, as a part of the host, reporting status and faults back to the host to which it is connected. The host SP includes IO Box status as part of the overall system status. The IO Box does not require any special cables between the host and the box, other than the PCI-Express cables. Separate cabling would introduce the risk that a customer could cross-wire a box to the wrong host. And we wanted to make the wiring as simple as possible.

This is accomplished by using extra signals in the PCI-Express cable to implement a management connection between the host service processor (SP) and the IO Box. The standard PCI-Express connector has two pins for SMBus: SMCLK (B5) and SMDAT (B6). Using those two pins, the host (SP) is able to talk to a "slave" microcontroller on the link card. The microcontroller then uses spare signals in the cable to communicate in a reliable fashion with the microcontroller on the link card in the IO Box to forward requests from the host SP. The link card in the IO Box then uses SMBus on those same two pins B5 and B6, this time as "master", to access devices in the IO Box, reading or writing any device that the host SP requests. In effect, the I2C devices in the IO Box become "local" to the host service processor, even though the IO Box itself is remote. The microcontrollers on the link cards act as proxies.

With this arrangement, the host SP can retrieve environmental information about the IO Box: temperatures, fan speeds, voltages, currents, switch positions, etc. In addition, the host SP can control the IO Box, turning off power to a power supply unit so it can be removed, lighting the "locator" indicator so the IO Box can be found in the datacenter, etc. And when the IO Box experiences an error, the host SP can gather error information, and factor it into other host-detected errors to diagnose the fault.

IO Box management is entirely optional -- the IO Box as a standalone unit can function with no host management. But management by the host SP provides an added dimension of availability and serviceability which is not found on low-end I/O expansion units.

Friday Jun 29, 2007

As I mentioned in my blog XCP 1041 Now Available, Sun SPARC Enterprise M-class servers support an External I/O Expansion Unit, or IO Box. This week, IO Box started shipping to customers! IO Box is one of my pet projects.

The IO Box addresses a critical problem with previous generations of enterprise servers: The I/O-to-CPU ratio was too low for some customer applications. On a Sun Fire 25K fully populated with I/O boards and CPU boards, there are 72 PCI slots, which is plenty for most customers. But with 72 CPU sockets filled with dual-core UltraSPARC IV CPUs, that yields a PCI-slot to CPU core ratio of 1/2 -- one PCI slot for every two CPU cores. The SPARC Enterprise M9000-64 has 128 PCI-Express slots, and while that may seem like a lot, with 64 dual-core SPARC64-VI CPU chips, that's still just 1 PCI-Express slot per core (and it will get worse when we pack more cores into each CPU chip). Some customers really care about I/O, and a higher I/O-to-CPU ratio is important.
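The ratio arithmetic above can be checked with a few lines of Python. The slot, socket, and core counts are the ones quoted in the paragraph; the function name is just for illustration.

```python
from fractions import Fraction


def slots_per_core(pci_slots, cpu_sockets, cores_per_socket):
    """PCI slots available per CPU core, as an exact fraction."""
    return Fraction(pci_slots, cpu_sockets * cores_per_socket)


# Sun Fire 25K: 72 PCI slots, 72 sockets, dual-core CPUs.
sun_fire_25k = slots_per_core(72, 72, 2)

# SPARC Enterprise M9000-64: 128 PCI-Express slots, 64 dual-core chips.
m9000_64 = slots_per_core(128, 64, 2)

if __name__ == "__main__":
    print("Sun Fire 25K:", sun_fire_25k, "slot per core")
    print("M9000-64:", m9000_64, "slot per core")
```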

The IO Box allows you to connect one PCI-Express slot in the host to an IO Boat, which has six additional, hot-plug-capable PCI-Express or PCI-X slots. Each IO Box can support up to two IO Boats (either PCI-Express or PCI-X), independently connected to the same host. The host-to-box link can either be copper (low cost, but short and bulky, so only really applicable if the server and the IO Box are in the same cabinet) or fibre optic (higher cost, but with 25m cable lengths you can locate the IO Boxes together in a separate cabinet from your servers).

There are other IO expansion units on the market, but the Sun version is really designed for the enterprise-class environment, with features like:

  • Fully redundant, hot-swappable power supplies.
  • Support for Sun's indicator standard, with LEDs for locating field replaceable units (FRUs), showing FRU power state, identifying FRUs that are ready to remove, and faults.
  • Ability to monitor IO Box internal voltages, currents, temperatures, LEDs, power state and switch settings from the host's Service Processor.
  • Host's ability to detect and diagnose IO Box faults seamlessly with other errors and faults in the host.
In short, the IO Box is less a peripheral, and more like a part of the system; it just happens to be located several meters away. Disconnecting compute power and I/O only makes sense if you reconnect them virtually, so there's only one system to manage.

While IO Box is initially supported on the M-class servers, its history predates the Sun/Fujitsu APL agreements, and it will almost certainly be supported on other Sun products in the future. I've got some neat things I'd like to share about IO Box in future posts...

Tuesday Jun 12, 2007

In a prior posting, SBs, XSBs, and LSBs, I wrote how Sun SPARC-Enterprise M-class servers use logical system board (LSB) numbers, not physical system board (SB) numbers, to assign CPU IDs and I/O bus addresses in Solaris. One question that commonly arises is: In Solaris, how can you figure out what the physical system board is for a given CPU ID or I/O device path?

For I/O, the easiest method happens to be the fmtopo command. fmtopo will list all of the I/O devices, the device path, the FRU (Field Replaceable Unit) and FRU Label. Here's a snippet of the output showing the device node /devices/pci@90,600000:

    # /usr/lib/fm/fmd/fmtopo -p
    ...
    hc:///chassis=0/ioboard=0/hostbridge=0/pciexrc=0
            ASRU: dev:////pci@90,600000
            FRU: hc:///component=iou#0
            Label: iou#0
    ...
The PCI-Express root complex pci@90,600000 belongs to the logical system board (LSB) 9 (you can tell from the "90" after the @ sign). But the output of fmtopo shows that the device is actually on the physical FRU iou#0, which is part of SB#0.

Now, knowing that LSB#9 is really SB#0, one can infer that the cpuids associated with LSB#9 (i.e., cpuids 9*32 through 10*32-1, or 288-319) are also on SB#0.
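The cpuid arithmetic is simple enough to capture in two small Python functions. This is only a sketch based on the 32-cpuids-per-LSB spacing used in the example above; the function names are hypothetical.

```python
CPUIDS_PER_LSB = 32  # spacing implied by "9*32 through 10*32-1" above


def cpuid_range(lsb):
    """All cpuid values that belong to the given logical system board."""
    return range(lsb * CPUIDS_PER_LSB, (lsb + 1) * CPUIDS_PER_LSB)


def lsb_of_cpuid(cpuid):
    """Logical system board number for a given cpuid."""
    return cpuid // CPUIDS_PER_LSB


if __name__ == "__main__":
    ids = cpuid_range(9)
    print("LSB#9 cpuids:", ids[0], "through", ids[-1])
    print("cpuid 300 is on LSB#", lsb_of_cpuid(300), sep="")
```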

So, how does fmtopo figure out how to map LSBs to SBs? Turns out that there is one memory controller per LSB, and the memory controller node has two properties of interest, board# and physical-board#. The board# is the LSB number, while the physical-board# is the SB number. Other nodes in the device tree have the board# property (I/O hostbridges, CPUs, etc), but only the memory controller node has the physical-board# property.

To see what I mean, you can use prtconf, for example:

    # /usr/sbin/prtconf -pv | grep "board#"
        physical-board#:  00000000
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
If you look at the full output of prtconf, you'll see that the first two lines belong to the memory controller node (pseudo-mc) with an LSB board# of 9 and an SB physical-board# of 0. The other board# properties belong to the CPUs and I/O host bridges.
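The same lookup can be scripted. The following Python sketch builds an LSB-to-SB map from prtconf text, assuming (as in the sample above) that each physical-board# line is immediately followed, among the matching lines, by the board# line of the same memory controller node. That ordering is an assumption taken from this one sample, and the function name is hypothetical.

```python
import re

# Matches lines such as "    physical-board#:  00000000" or "    board#:  00000009".
PROPERTY = re.compile(r"^\s*(physical-board#|board#):\s*([0-9a-fA-F]+)\s*$")


def lsb_to_sb(prtconf_text):
    """Return {LSB number: SB number} from 'prtconf -pv' style output."""
    mapping = {}
    pending_sb = None  # physical-board# seen, waiting for the paired board#
    for line in prtconf_text.splitlines():
        match = PROPERTY.match(line)
        if not match:
            continue
        name, value = match.group(1), int(match.group(2), 16)
        if name == "physical-board#":
            pending_sb = value
        elif pending_sb is not None:
            mapping[value] = pending_sb   # board# right after physical-board#
            pending_sb = None
        # Other board# lines (CPUs, I/O host bridges) carry no SB information.
    return mapping


SAMPLE = """\
    physical-board#:  00000000
    board#:  00000009
    board#:  00000009
    board#:  00000009
"""

if __name__ == "__main__":
    print(lsb_to_sb(SAMPLE))
```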

Monday Jun 04, 2007

The SPARC Enterprise M-class XSCF User's Manual chapter 8.1.10, "Firmware Update Procedure" will tell you: "XCP import is done by using the XSCF Web", indicating that you need to use your web browser and the Browser User Interface (BUI) feature of the XSCF (the service processor) to upload an XCP firmware image (the software that runs on the service processor). But that really isn't the only way.

XCP 1040 (with which most systems have shipped) has another, undocumented command, called getflashimage(8M). XCP 1041 is now available, and upgrading using getflashimage and flashupdate is easy -- it takes just one minute to download the image with getflashimage, and 15 minutes to install the new version with flashupdate.

Since getflashimage didn't make the XSCF User's Manual, I thought I'd give you a brief overview.

getflashimage allows you to log on to the XSCF and download an XCP image to the XSCF. getflashimage works similarly to wget: you provide a URL and getflashimage downloads the file. getflashimage supports http, https and ftp protocols, and will even allow you to download the XCP image from a USB flash device (which is useful if your XSCF does not connect to the network where the XCP images are).

The synopsis for getflashimage is:

   getflashimage [-v] [-q -{y|n}] [-u username] [-p proxy [-t proxy_type]] URL
   getflashimage -l
   getflashimage [-q -{y|n}] -d
   getflashimage -h
I think most of the options in the first synopsis are pretty straightforward (including the standard '-v' for "verbose", '-q' for "quiet", "-{y|n}" for "yes" or "no"). For example, I can download a flash image from an https server that requires user authentication by doing:
    XSCF> getflashimage -u rjh https://imageserver/images/FFXCP1041.tar.gz
where "rjh" is my webserver user name. getflashimage will prompt for my password, and in about a minute, the image will be downloaded and ready for flashupdate(8M). If you're having problems accessing the server (and the error messages aren't sufficiently clear), you can use the -v option to view the protocol exchange with the server, and see the exact error codes returned by the server (if you find yourself having to use "-v", please let me know what the problem was, and I'll try to improve the error messages). And of course the -p and -t options can be used to access the web through a proxy, where -p is the proxy name or IP address, and -t can be used to specify the proxy type (http, socks4 or socks5; the default proxy type is http).
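To illustrate how the options in the first synopsis line combine, here is a small Python sketch that assembles a getflashimage command line. It only builds a string to type at the XSCF prompt; the helper name and its keyword arguments are hypothetical, and the only rules it enforces are the ones visible in the synopsis (-t nests inside -p) and the proxy types listed above.

```python
PROXY_TYPES = ("http", "socks4", "socks5")  # types named in the post


def getflashimage_download(url, username=None, proxy=None,
                           proxy_type=None, verbose=False):
    """Build 'getflashimage [-v] [-u username] [-p proxy [-t proxy_type]] URL'."""
    if proxy_type is not None:
        if proxy is None:
            raise ValueError("-t is only valid together with -p")
        if proxy_type not in PROXY_TYPES:
            raise ValueError(f"unknown proxy type: {proxy_type}")
    argv = ["getflashimage"]
    if verbose:
        argv.append("-v")
    if username is not None:
        argv += ["-u", username]
    if proxy is not None:
        argv += ["-p", proxy]
        if proxy_type is not None:
            argv += ["-t", proxy_type]
    argv.append(url)
    return " ".join(argv)


if __name__ == "__main__":
    print(getflashimage_download(
        "https://imageserver/images/FFXCP1041.tar.gz", username="rjh"))
```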

In hindsight, it probably would have made sense to do some integrity checking on the image file, maybe verifying its checksum during the download process. Maybe I'll add that to the next release.

The second synopsis line, 'getflashimage -l' (small L), allows you to list the image file that was downloaded, just in case you forgot whether you finished downloading. It will also display the image file size and download date.

The third synopsis line, 'getflashimage -d', lets you delete any and all image files previously downloaded. If you've done the download and flashupdate, you can delete the image at any time, but it doesn't hurt to leave the file around (the space is reserved and can't be used for anything else). On the other hand, if you downloaded an image file and decided you did not want to flashupdate it (perhaps you downloaded the wrong version), you might want to delete it immediately so you don't accidentally flashupdate it later.

getflashimage can be a real lifesaver if you have a slow connection to the XSCF. For example, when I connect from my home (in the Boston area) to Sun's lab in San Diego, using the Browser User Interface can be slow because I have to ftp the XCP image from San Diego to my home workstation, then upload it using Firefox back to the XSCF in San Diego -- a 6,000 mile journey. But with getflashimage, I can ssh to the XSCF, then use getflashimage to very quickly load an XCP image from a file server to the XSCF on the same subnet, without the bits needing to travel cross-country to me on the East Coast. I also tend to prefer command lines over GUIs.


Glossary:
  • XSCF: The service processor (called a system controller on Sun Fire systems).
  • XCP: The firmware that runs on the XSCF (similar to SMS or SCAPP on Sun Fire systems). XCP also includes POST and OBP.

Friday Jun 01, 2007

Sun SPARC Enterprise M-class service processor firmware release XCP 1041 is now available to download. I was looking at the product notes for XCP 1041, and it's not obvious what new features are in it, so here's a quick summary of what's new:
  • Capacity on demand: If you're not familiar with COD, let me summarize by saying that you can get a CPU/Memory Unit (CMU) at a low up-front cost, but not pay the rest until you actually need and use the CPUs and memory. For example, you could buy an M5000 server with four CPUs and get four extra COD CPUs. Normally, you'd use the system with four CPUs. But if you needed more compute capacity, you could call up Sun, buy a license to add another CPU, enter the license key, and then power on the CPU and use it. No boards to install. No downtime. Just enable the CPU and go. We've had this feature for the Sun Fire 3800-6900 and 12K-25K, and now it's available on the M-class servers. I don't think the SPARC Enterprise COD info is posted on sun.com yet, but here's a link to the Sun Fire COD web page.
  • External I/O Expansion Unit: XCP 1041 includes full support for the Sun External I/O Expansion Unit (during development, we just called it "IO Box" for short). This 4RU chassis can connect to a host PCI-Express slot, and gives you six additional PCI-Express or PCI-X slots. Since the link connecting the IO Box and the host is fibre optic, the IO Box can be in the next rack, or across the room. The IO Box product isn't shipping yet, so I'll probably do a full blog posting when it does ship.
In addition to the new features, we fixed a small number of bugs.

If you're going to download and install XCP 1041, you'll want to read the product notes about getflashimage. I'll write a separate posting about that.

Wednesday May 30, 2007

PCI-Express supports the relaxed ordering mechanism originally defined for PCI-X. Relaxed ordering allows certain transactions to violate the strict-ordering rules of PCI; that is, a transaction may be completed prior to other transactions that were already enqueued. In PCI-Express, if the transaction layer protocol (TLP) header has the Relaxed Ordering attribute set to 1, the transaction may use relaxed ordering. In particular, a memory write transaction with the RO attribute set to 1 is allowed to complete before prior write transactions enqueued in the hostbridge ahead of it.

The Jupiter Interconnect

Relaxed Ordering is important to the Sun SPARC Enterprise M-series server I/O architecture. The SPARC Enterprise servers use a network of switches and crossbars to connect CPUs, memory access controllers (MACs), and I/O controllers (IOCs). This internal network, called the Jupiter interconnect (it is sometimes called the Jupiter bus, although it's not a "bus" at all), employs error detection and correction mechanisms, and will retry transactions if a protocol error occurs between two nodes in the network. As a result, it is possible for one transaction to "pass" another transaction, such that the agent issuing the transactions sees them complete in the opposite order from the order in which they were issued.

For example, consider an M9000 system with two system boards shown in the following figure:

[Figure: System Board #1 and System Board #2, each with four I/O hostbridges, four MACs, four CPUs, and four SCs, connected through the XBU (Crossbar Unit).]

Each system board has four I/O hostbridges, four MACs (memory access controllers), and four CPU chips. The IOCs, MACs and CPU chips on a system board are interconnected by four SC chips (system controller), and the SCs connect the system boards to the crossbar unit (XBU). [If you're curious, each SC has direct connections to all four MACs, all four CPU chips, and one of the hostbridges.]

Now let's take two simple transactions: Transaction TA and Transaction TB. Transaction TA is a write from a hostbridge on system board #1 to memory on system board #2. The transaction must go from the hostbridge to the SC on system board #1 (SC#1), then to the XBU, then to the SC on system board #2 (SC#2), then to the MAC on system board #2 (MAC#2). Transaction TB is a write from the same hostbridge to memory on the same system board. This transaction must go from the hostbridge to the SC on system board #1 (SC#1), then directly to the MAC on system board #1 (MAC#1). The following scenario shows how the transactions could get reordered while in flight:

  1. Hostbridge issues transaction TA to SC#1 on the same system board.
  2. Hostbridge issues TB to SC#1.
  3. SC#1 issues TA to XBU.
  4. SC#1 issues TB to MAC#1 on same system board.
  5. XBU issues TA to SC#2 on destination system board.
  6. MAC#1 commits data to RAM, sends acknowledge back to hostbridge that TB is complete.
  7. SC#2 issues TA to MAC#2.
  8. MAC#2 commits data to RAM, sends acknowledge back to hostbridge that TA is complete.
Since the path from the hostbridge to MAC#1 is shorter (2 hops) than from the hostbridge to MAC#2 (4 hops), transaction TB completes prior to transaction TA.
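A toy Python model shows the same effect. It assumes every hop takes the same amount of time and that the hostbridge issues one transaction per time unit; both are simplifications for illustration, not measured properties of the Jupiter interconnect. The hop counts are the ones quoted above.

```python
# Hop counts from the text: TA crosses the XBU (4 hops), TB stays on-board (2 hops).
HOPS = {"TA": 4, "TB": 2}


def completion_order(issue_order, hops, hop_time=1, issue_gap=1):
    """Return transaction names sorted by the time they reach memory.

    Assumes uniform hop latency and a fixed gap between issues (toy model).
    """
    finish = {}
    for position, name in enumerate(issue_order):
        issue_time = position * issue_gap
        finish[name] = issue_time + hops[name] * hop_time
    return sorted(finish, key=finish.get)


if __name__ == "__main__":
    # TA is issued first, yet TB reaches memory first.
    print(completion_order(["TA", "TB"], HOPS))
```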

In order for the hostbridge to maintain the strict PCI ordering rules, it is necessary for the hostbridge to wait until the first transaction completes before issuing the next transaction. Using the above example, if TA and TB must adhere to the PCI strict ordering rules, the scenario would look very different:

  1. Hostbridge waits for all outstanding writes to be acknowledged.
  2. Hostbridge issues TA to SC#1.
  3. SC#1 issues TA to XBU.
  4. XBU issues TA to SC#2 on destination system board.
  5. SC#2 issues TA to MAC#2.
  6. MAC#2 commits data to RAM, sends acknowledge back to hostbridge that TA is complete.
  7. Hostbridge issues TB to SC#1.
  8. SC#1 issues TB to MAC#1.
  9. MAC#1 commits data to RAM, sends acknowledge back to hostbridge that TB is complete.
In the above scenario, the hostbridge is unable to initiate transaction TA until all outstanding write transactions are complete. Then, after initiating transaction TA, the hostbridge cannot start transaction TB until it receives confirmation that TA has completed. And until TB is acknowledged, the hostbridge is unable to initiate any other writes, whether strictly ordered or relaxed ordered. This means the hostbridge is unable to pipeline writes, which can limit write bandwidth. The bandwidth is limited by the latency from the IOC to memory and back. On an M4000, the latency is very low, so the effect of strictly ordered writes is small. However, on an M9000-64 where the hostbridge and MAC are in different cabinets, the latency can be very large; if relaxed ordering is not enabled, the hostbridge write-to-memory bandwidth can be significantly affected.
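The latency limit can be put into a back-of-the-envelope formula: with only one write in flight, bandwidth is roughly the payload size divided by the round-trip latency, while pipelined writes scale with the number in flight until the link saturates. The Python sketch below uses made-up numbers purely to show the shape of the relationship; none of them are M-class measurements.

```python
def strict_write_bandwidth(payload_bytes, round_trip_seconds):
    """One write outstanding at a time: bytes per second."""
    return payload_bytes / round_trip_seconds


def pipelined_write_bandwidth(payload_bytes, round_trip_seconds,
                              outstanding, link_bytes_per_second):
    """Several writes in flight, capped by the link rate: bytes per second."""
    unconstrained = outstanding * payload_bytes / round_trip_seconds
    return min(unconstrained, link_bytes_per_second)


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    payload = 64          # bytes per write
    latency = 1e-6        # seconds, hostbridge to memory and back
    link_rate = 2e9       # bytes per second
    print("strict:   ", strict_write_bandwidth(payload, latency))
    print("pipelined:", pipelined_write_bandwidth(payload, latency, 16, link_rate))
```

Doubling the round-trip latency halves the strictly ordered figure, which is why the effect grows as the hostbridge and memory move farther apart.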

When to use Relaxed Ordering

In most cases, relaxed ordering cannot be enabled on every transaction. Take for example a typical network interface card (NIC) architecture. The NIC might write a large number of data blocks, followed by an update of a descriptor block indicating that the data is available. When the driver sees the descriptor updated, it goes and processes the data. It doesn't matter in what order the data blocks get committed to RAM. But the descriptor must be written after all of the data is in RAM; otherwise, the driver might see the descriptor get updated, and read the partially-updated data buffer.

Therefore, the data writes can employ relaxed ordering; the descriptor must be strictly ordered so that it will not pass the data writes. Assuming the number and size of data transactions are much larger than descriptor updates, the system will see high write-to-memory performance when relaxed ordering is enabled on the data transactions.
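
The NIC example can be sketched as a small timing model. This is a simplification under stated assumptions, not how the hostbridge is implemented: relaxed writes are issued immediately, a strict write waits for every outstanding write and blocks later writes until it is acknowledged, and the latency numbers (4 and 2, echoing the hop counts in the earlier walk-through) are arbitrary units. The names Write and completion_times are hypothetical.

```python
from typing import NamedTuple


class Write(NamedTuple):
    name: str
    latency: int   # issue-to-acknowledge time, arbitrary units
    relaxed: bool  # True if the write may be reordered


def completion_times(writes):
    """Return {name: acknowledge time} for writes issued in list order."""
    done = {}
    barrier = 0   # earliest time the next write may be issued
    last_ack = 0  # latest acknowledge among writes issued so far
    for w in writes:
        # A strict write first waits for every outstanding write.
        issue = barrier if w.relaxed else max(barrier, last_ack)
        ack = issue + w.latency
        done[w.name] = ack
        last_ack = max(last_ack, ack)
        if not w.relaxed:
            # Nothing else is issued until the strict write is acknowledged.
            barrier = ack
    return done


nic = [
    Write("data0", 4, True),
    Write("data1", 2, True),
    Write("descriptor", 2, False),
]
times = completion_times(nic)
print(times)
```

With the descriptor marked strict, it is acknowledged only after both data writes. If the descriptor were (incorrectly) marked relaxed in this model, it would be acknowledged before data0, which is exactly the corruption hazard described above.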

An I/O device should only set the relaxed ordering bit in the TLP header if the device is smart enough to know which transactions can be reordered without causing data corruption. Unfortunately, we've encountered some devices which set the relaxed ordering bit incorrectly.

Enabling Relaxed Ordering in Solaris

The Sun SPARC Enterprise servers are the first SPARC servers from Sun that support relaxed ordering. When we first started testing with hardware, we found that several cards did not support relaxed ordering, or did not support it correctly. The SAS controller used in the M4000/M5000 servers, for example, does not support relaxed ordering. The Gigabit Ethernet controller, on the other hand, incorrectly set the relaxed ordering bit in the TLP header for all transactions, including control block updates. To deal with this, we had to turn off relaxed ordering in the GBE controller itself.

Even though the device hardware did not support (or did not enable) relaxed ordering, good throughput from these devices required that they allow relaxed ordering on the Jupiter Interconnect. To deal with this, Sun added a new flag, DDI_DMA_RELAXED_ORDERING, which allows a device driver to specify which DMA buffers may be relaxed ordered. We also modified the SAS and GBE drivers to tag data buffers with the DDI_DMA_RELAXED_ORDERING bit; control buffers were not tagged.

To enable relaxed ordering, a device driver must set the DDI_DMA_RELAXED_ORDERING flag in the dma_attr_flags field of the ddi_dma_attr_t(9S) structure passed to ddi_dma_alloc_handle(9F). Per the ddi_dma_attr_t man page:


     DDI_DMA_RELAXED_ORDERING

         This optional flag can be set if  the  DMA  transactions
         associated  with this handle are not required to observe
         strong DMA write ordering among each other, nor with DMA
         write transactions of other handles.

         It allows the host bridge to transfer data to  and  from
         memory  more  efficiently  and  may result in better DMA
         performance on some platforms.
For an example of driver code which uses the DDI_DMA_RELAXED_ORDERING flag to enable relaxed ordering on data buffers, see the bge driver on OpenSolaris.org:
    /*
     * Enable PCI relaxed ordering only for RX/TX data buffers
     */
    if (bge_relaxed_ordering)
            dma_attr.dma_attr_flags |= DDI_DMA_RELAXED_ORDERING;

System Considerations

If you're going to deploy a system with a mix of I/O devices that support relaxed ordering (either in the TLP header, or using the DDI_DMA_RELAXED_ORDERING flag) and I/O devices that do not support relaxed ordering, you should consider the system impact.

Take, for example, the I/O architecture of a system board on a Sun SPARC Enterprise M9000 server:

Architecture for the M8000/M9000 I/O Unit
          __________
         |  IOC 0   |
         |  ______  | x8
Jupiter  | | Host |-+---------------[ PCI-E Slot 0
---------+-|Bridge| | x8
Interface| |______|-+---------------[ PCI-E Slot 1
         |  ______  | x8
Jupiter  | | Host |-+---------------[ PCI-E Slot 2
---------+-|Bridge| | x8
Interface| |______|-+---------------[ PCI-E Slot 3
         |__________|
          __________
         |  IOC 1   |
         |  ______  | x8
Jupiter  | | Host |-+---------------[ PCI-E Slot 4
---------+-|Bridge| | x8
Interface| |______|-+---------------[ PCI-E Slot 5
         |  ______  | x8
Jupiter  | | Host |-+---------------[ PCI-E Slot 6
---------+-|Bridge| | x8
Interface| |______|-+---------------[ PCI-E Slot 7
         |__________|
[Forgive the ASCII art -- I'm in Engineering, not Marketing.]

The above diagram shows that an M9000 I/O Unit has two I/O controller chips, each IOC has two hostbridges, and each hostbridge contains the root complexes for two PCI-Express slots.

The ideal situation is if all the cards enable relaxed ordering. On the other hand, let's say you have one legacy card that does not support relaxed ordering (perhaps it's a low-performance card where the vendor did not feel throughput, and therefore relaxed ordering, was important). If you put this low-performance card in, for example, slot 0 along with a high performance card that supports relaxed ordering in slot 1, both cards will share a single hostbridge and therefore a single Jupiter Interconnect interface. If the hostbridge has some strictly-ordered writes to memory from card 0, the relaxed-ordered writes from card 1 may queue up behind the strictly-ordered writes.

For comparison, here is the I/O Unit for an M4000/M5000:

Architecture for the M4000/M5000 I/O Unit
                                         ______
                                        |      |-- SAS Controller
                                        |PCI-X | 
                               ______   |Bridge|-- Gigabit Ethernet
          __________          |      |--|      |
         |   IOC    |         |PCI-E |  |______|--[ PCI-X Slot 0
         |  ______  | x8      |Switch|
Jupiter  | | Host |-+---------|      |------------[ PCI-E Slot 1
---------+-|Bridge| | x8      |______|
Interface| |______|-+-----------------------------[ PCI-E Slot 2
         |  ______  | x8
Jupiter  | | Host |-+-----------------------------[ PCI-E Slot 3
---------+-|Bridge| | x8
Interface| |______|-+-----------------------------[ PCI-E Slot 4
         |__________|

In this case, PCI-X slot 0 and PCI-E slots 1 and 2 all share a hostbridge, while PCI-E slots 3 and 4 share the other hostbridge (there is only one IOC with two hostbridges on an M4000/M5000 I/O Unit). While the hostbridge-to-memory latency is not as large on the M4000/M5000 systems, mixing cards that support relaxed ordering under the same hostbridge as cards that require strict ordering can impact I/O throughput. Note that the SAS controller and the Gigabit Ethernet controller already have relaxed ordering enabled using the DDI_DMA_RELAXED_ORDERING flag in their respective drivers.

To maximize write-to-memory throughput, it is best to group cards that do not enable relaxed ordering together below the same set of hostbridges, and group high-performance cards that enable relaxed ordering together below a different set of hostbridges. At the same time, you don't want to oversubscribe the hostbridge. The hostbridge can easily handle a single x8 PCI-Express link writing at its top bandwidth of about 1.7 GB/s; however, two high-performance x8 cards could be limited by the hostbridge's Jupiter interface bandwidth of 2.1 GB/s. Of course, the best arrangement of I/O cards may depend on other factors as well; relaxed ordering is just one thing to keep in mind when building a system.
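
As a planning aid, the grouping rule can be checked with a few lines of Python. This sketch assumes the M8000/M9000 I/O Unit layout from the diagram above (slots 0 through 7, two slots per hostbridge); the hostbridge numbering 0 through 3 and both function names are my own labels, not anything the platform reports.

```python
def hostbridge_for_slot(slot):
    """M8000/M9000 IOU per the diagram: two PCI-E slots per hostbridge."""
    if not 0 <= slot <= 7:
        raise ValueError("slot must be in the range 0..7")
    return slot // 2


def mixed_hostbridges(relaxed_by_slot):
    """Return hostbridges that carry both strict and relaxed-ordering cards.

    relaxed_by_slot maps a populated slot number to True if the card in
    that slot enables relaxed ordering, False otherwise.
    """
    kinds = {}
    for slot, relaxed in relaxed_by_slot.items():
        kinds.setdefault(hostbridge_for_slot(slot), set()).add(relaxed)
    return sorted(hb for hb, seen in kinds.items() if len(seen) == 2)


# A legacy card in slot 0 next to a relaxed-ordering card in slot 1.
print(mixed_hostbridges({0: False, 1: True}))
```

An empty result means no hostbridge mixes the two kinds of card; it says nothing about oversubscription, which still has to be checked against the bandwidth figures above.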

Tuesday May 22, 2007

The Sun SPARC Enterprise M-series servers introduce several new configuration options compared to the Sun Fire 6900/25K family. In my posting eXtended System Boards I explained XSBs -- how a single physical system board (SB) can be partitioned and configured into domains at the granularity of a CPU. The Sun SPARC Enterprise servers also support a concept called Logical System Boards, or LSBs. LSBs add a new dimension of configuration.

Physical System Boards and the Sun Fire 6800

In the past (using the Sun Fire 6800 as an example, since I happen to have one handy), an SB could be configured into a domain, and the resources on that SB were identified to Solaris based on the board number; similarly, if you knew the resource id, you could infer the physical system board it is on. For example, if psrinfo in Solaris showed

    % psrinfo
    0       on-line   since 03/01/2007 12:16:43
    1       on-line   since 03/01/2007 12:16:44
    2       on-line   since 03/01/2007 12:16:44
    3       on-line   since 03/01/2007 12:16:44
    8       on-line   since 03/01/2007 12:16:44
    9       on-line   since 03/01/2007 12:16:44
    10      on-line   since 03/01/2007 12:16:44
    11      on-line   since 03/01/2007 12:16:44
you could infer that your domain consisted of system boards 0 and 2 (the CPU IDs on an SB start at the SB number times 4, so SB0 contains CPU IDs 0 through 3, while SB2 contains CPU IDs 8 through 11). The PCI hostbridge bus addresses are assigned in a similar fashion. For example:
    % ls -1d /devices/ssm@0,0/pci@*000
    /devices/ssm@0,0/pci@18,600000
    /devices/ssm@0,0/pci@18,700000
    /devices/ssm@0,0/pci@19,600000
    /devices/ssm@0,0/pci@19,700000
    /devices/ssm@0,0/pci@1c,600000
    /devices/ssm@0,0/pci@1c,700000
    /devices/ssm@0,0/pci@1d,600000
    /devices/ssm@0,0/pci@1d,700000
shows the hostbridges on I/O boards 6 and 8. (The math here is a bit more complex. I/O board numbers start at 6 with bus address 0x18, and each I/O board has two host bridges, so IB6 has pci@18 and pci@19, while IB8 has pci@1c and pci@1d.)
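
The arithmetic above is easy to capture in Python. The function names are hypothetical, and the formulas only encode what is described here for the Sun Fire 6800 (four CPU IDs per system board; I/O boards starting at IB6 with bus address 0x18 and two hostbridges each).

```python
def sb_for_cpu(cpu_id):
    """System board number for a Sun Fire 6800 CPU ID (4 CPU IDs per SB)."""
    return cpu_id // 4


def ib_for_hostbridge(bus_addr):
    """I/O board number for a pci@<bus_addr> hostbridge.

    I/O boards start at IB6 with bus address 0x18, two hostbridges each.
    """
    if bus_addr < 0x18:
        raise ValueError("bus address below the first I/O board")
    return 6 + (bus_addr - 0x18) // 2


cpus = [0, 1, 2, 3, 8, 9, 10, 11]
print(sorted({sb_for_cpu(c) for c in cpus}))
print(sorted({ib_for_hostbridge(a) for a in (0x18, 0x19, 0x1C, 0x1D)}))
```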

If a system board experienced a fault and needed to be replaced, or worse, the system board slot was at fault so you could not simply replace the system board, you could reconfigure the system from the System Controller to add CPUs, memory or IO from a different system board to restore the domain to full power. You could, for example, configure SB0 out of the domain, and configure SB1 into the domain. At that point, the domain would be running with CPU IDs 4 through 11 (4 through 7 on SB1, and 8 through 11 on SB2). Similarly, you could replace IB6 with IB7, and the PCI hostbridges would change from pci@18 and pci@19 to pci@1a and pci@1b.

That's all fine, unless your boot device was hanging off IB6. Even if you moved the boot device to IB7, the device paths would all be different. The boot device that was "/devices/ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@0,0:a" would change to "/devices/ssm@0,0/pci@1a,700000/pci@1/SUNW,isptwo@4/sd@0,0:a".

In effect, the CPU IDs and hostbridge bus addresses are physical addresses -- they are calculated based on the physical location of the board.

Logical System Boards and the Sun SPARC Enterprise Servers

The Sun SPARC Enterprise Servers introduce a new concept called Logical System Boards, or LSBs. The LSB number defines the way the CPUs and I/O on an extended system board (or XSB) are identified by a domain.

When an XSB is assigned to a domain, it is given a logical system board number. In effect, the LSB number is a virtual address. And as the Sun Fire 6800 assigns CPU IDs based on physical system board number, the Sun SPARC Enterprise assigns CPU IDs based on logical system board number. The same is true for hostbridge bus addresses. The CPUs on LSB 0 are assigned CPU IDs from the range 0 through 31, and the CPUs on LSB 1 are assigned CPU IDs from the range 32 through 63, regardless of the physical system board hosting the CPU chips. Similarly, the first hostbridge on LSB 0 is pci@0,600000, while the first hostbridge on LSB 1 is pci@10,600000, and so forth.
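
Here is the same idea for LSBs as a short Python sketch. The CPU ID ranges come straight from the text; the hostbridge formula simply extrapolates the two examples given (pci@0 for LSB 0, pci@10 for LSB 1) by assuming a stride of 0x10 per LSB, which is my assumption for the "and so forth". Function names are hypothetical.

```python
def cpu_id_range(lsb):
    """CPU IDs available to a logical system board: 32 per LSB."""
    if lsb < 0:
        raise ValueError("LSB number must be non-negative")
    return range(32 * lsb, 32 * lsb + 32)


def first_hostbridge(lsb):
    """Device node of the first hostbridge, assuming a 0x10 stride per LSB."""
    if lsb < 0:
        raise ValueError("LSB number must be non-negative")
    return f"pci@{0x10 * lsb:x},600000"


for lsb in (0, 1):
    ids = cpu_id_range(lsb)
    print(lsb, ids[0], ids[-1], first_hostbridge(lsb))
```

Note that neither function takes a physical board number: that is the whole point of the logical numbering.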

The mapping from LSB-to-XSB is user configurable; you can choose the LSB number for any XSB almost entirely at will, and LSB numbers can be re-used for different XSBs in different domains. As a result, it is possible that every domain in a chassis could have a CPU with cpuid 0. And every domain could have its boot device below /devices/pci@0,600000. You could have a domain that includes SB 0, 1 and 2 assigned as LSBs 0, 1 and 2, but for some reason it is necessary to replace SB 0. You could then assign SB 3 to the domain as LSB 0. The domain would continue to have cpuid 0 and /devices/pci@0,600000. If you move your boot device over to SB 3's I/O unit (either move the PCI-Express card, or simply move the internal SAS disk), you could boot the domain, and device paths and processor sets would remain unaffected.

If we use the analogy of virtual memory, the domain is the context, the LSB is the virtual address, and the SB (or more specifically, the XSB) is the physical address.

Configuring LSBs

With the added flexibility of XSBs and LSBs, the process of configuring a domain requires some extra steps. The Sun SPARC Enterprise M4000/M5000/M8000/M9000 Servers Administration Guide, Chapter 4 explains the process. The following is an example using a Sun SPARC Enterprise M5000, configured into two domains, each with one system board mapped to LSB 0.

setupfru

The first step is to configure the system boards as either uni-XSB or quad-XSB mode using the setupfru command:
    XSCF> setupfru -x 1 sb 0
    XSCF> setupfru -x 1 sb 1
The above example places SB 0 and SB 1 in uni-XSB mode, so all of the resources on a system board are assigned to domains in a single configuration unit. At this point, SB 0 is referred to as the single XSB 00-0; SB 1 is referred to as the single XSB 01-0.

setdcl

The next step is to establish the mapping from LSB to XSB. Just for illustration, I chose the following mapping:
  • Domain 0:
    • LSB 0 => XSB 00-0 (system board 0)
    • LSB 1 => XSB 01-0 (system board 1)
  • Domain 1:
    • LSB 15 => XSB 00-0 (system board 0)
    • LSB 0 => XSB 01-0 (system board 1)
Note that both domains have an LSB 0 (and they refer to different physical system boards). Also note that the domains have different mappings for the system boards. To define the mapping from XSB-to-LSB you use setdcl, which stands for "set domain component list". Here are the commands to set up the domains:
    # For domain 0, map LSB 0 to XSB 00-0 and LSB 1 to XSB 01-0
    XSCF> setdcl -d 0 -a 0=00-0 1=01-0
    # For domain 1, map LSB 15 to XSB 00-0 and LSB 0 to XSB 01-0
    XSCF> setdcl -d 1 -a 15=00-0 0=01-0
The fact that there's an LSB-to-XSB mapping for a system board for a domain does not mean that the XSB is assigned to the domain. It only means, once the XSB is assigned to the domain, this is the LSB number it will get.

addboard

So obviously the next step is to assign real XSBs to domains. We only have two SBs (and they're in uni-XSB mode, so we only have two XSBs), and two domains, so give each domain one XSB:
    XSCF> # Assign XSB 00-0 to domain 0
    XSCF> addboard -c assign -d 0 00-0
    XSB#00-0 will be assigned to DomainID 0. Continue?[y|n] :y
    XSCF> # Assign XSB 01-0 to domain 1
    XSCF> addboard -c assign -d 1 01-0
    XSB#01-0 will be assigned to DomainID 1. Continue?[y|n] :y
Once an XSB has been assigned to a domain, that domain owns the XSB; the XSB cannot be assigned to more than one domain. For example, if I tried to give XSB 00-0 to domain 1 after it has been assigned to domain 0:
    # Try to assign XSB 00-0 to domain 1 also
    XSCF> addboard -c assign -d 1 00-0
    XSB#00-0 is already assigned to another domain.

We can use showboards to see what we've done:

    XSCF> showboards -a
    XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
    ---- -------- ----------- ---- ---- ---- ------- --------
    00-0 00(00)   Assigned    n    n    n    Unknown Normal
    01-0 01(00)   Assigned    n    n    n    Unknown Normal
The above shows that XSB 00-0 is assigned to domain 0 as LSB 0, and XSB 01-0 is assigned to domain 1, also as LSB 0.
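
The relationship between the domain component list and board assignment can be modelled in a few lines of Python. This is a toy model, not the XSCF implementation: the class and its methods merely borrow the command names, and the check that an XSB must appear in a domain's DCL before it can be assigned is an assumption on my part.

```python
class Chassis:
    """Toy model of the domain component list (DCL) and XSB ownership."""

    def __init__(self):
        self.dcl = {}    # domain id -> {LSB number: XSB name}
        self.owner = {}  # XSB name -> domain id

    def setdcl(self, domain, mapping):
        """Record LSB-to-XSB mappings; this does not assign anything."""
        self.dcl.setdefault(domain, {}).update(mapping)

    def addboard(self, domain, xsb):
        """Assign an XSB to a domain; an XSB has at most one owner."""
        if xsb not in self.dcl.get(domain, {}).values():
            raise ValueError(f"XSB#{xsb} is not in the DCL of domain {domain}")
        if self.owner.get(xsb, domain) != domain:
            raise ValueError(f"XSB#{xsb} is already assigned to another domain")
        self.owner[xsb] = domain

    def showboards(self):
        """Return (XSB, domain id, LSB) rows for every assigned XSB."""
        rows = []
        for xsb, domain in sorted(self.owner.items()):
            lsb = next(n for n, x in self.dcl[domain].items() if x == xsb)
            rows.append((xsb, domain, lsb))
        return rows


chassis = Chassis()
chassis.setdcl(0, {0: "00-0", 1: "01-0"})
chassis.setdcl(1, {15: "00-0", 0: "01-0"})
chassis.addboard(0, "00-0")
chassis.addboard(1, "01-0")
print(chassis.showboards())
```

The same XSB shows up under a different LSB number in each domain's DCL, but only the owning domain's mapping takes effect once the board is assigned.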

Proof

Domain 0 should power on with SB 0 as LSB 0, and should have cpuids starting at 0 and hostbridges starting at pci@0,600000. Domain 1 should power on with SB 1 as LSB 0, also with cpuids starting at 0 and hostbridges starting at pci@0,600000. Just to prove that this worked as expected, I powered on the two domains. If I connect to the console for each domain, I get:
    XSCF> console -yq -d 0

    {0} ok show-disks
    a) /pci@0,600000/pci@0/pci@8/pci@0/scsi@1/disk
    q) NO SELECTION
    Enter Selection, q to quit: q
    {0} ok
    {0} ok exit from console.

    XSCF> console -yq -d 1

    {0} ok show-disks
    a) /pci@0,600000/pci@0/pci@8/pci@0/scsi@1/disk
    q) NO SELECTION
    Enter Selection, q to quit: q
    {0} ok
The {0} shows that both domains have a cpuid 0. And the show-disks shows that both domains have a hostbridge pci@0,600000; in fact, both domains have the exact same device path to completely different SCSI controllers. QED.

Thursday May 17, 2007

Rupert Brauch, Staff Engineer, SPARC Code Generator for the Sun Studio Compilers (and extreme cyclist), in his blog Getting Peak Olympus Performance Using the Studio 12 Compilers writes about the new Sun Studio 12 Compilers optimizations for the SPARC64-VI CPU architecture. Here's a reprint of his blog in case you missed it:

    Getting Peak Olympus Performance Using the Studio 12 Compilers

    The new Sun Studio 12 compilers have been optimized to produce the best performance for SPARC64-VI binaries. It is possible to achieve gains of 30% or more over binaries compiled with the Studio 11 compilers.

    We recommend using the following options when compiling for SPARC64-VI:

    -xchip=sparc64vi
    Generate code that is tuned for SPARC64-VI. The binary will run on any SPARC processor, but will perform best on a SPARC64-VI system.

    -xarch=sparcfmaf -fma=fused
    Both of these options are necessary to enable the usage of the new fused multiply add instructions. The binary will only run on SPARC64-VI, and any future SPARC systems that support the fused multiply add instructions. These instructions improve the performance of some floating point programs.

    -xtarget=sparc64vi
    A combination of -xchip=sparc64vi and -xarch=sparcfmaf. It is still necessary to use -fma=fused to enable fused multiply add instructions.


Thanks Rupert!

Wednesday May 16, 2007

The Sun SPARC Enterprise M-series servers feature a new approach to Solaris/Service Processor communication.

Shared memory has long been a common method for inter-processor communication. Back in the 1980's I worked on embedded systems that used shared RAM to allow a host processor to communicate with digital signal processors (DSPs). The shared RAM was partitioned into mailboxes, with a pair of mailboxes per "application". The host processor would place a command in the incoming mailbox for a given DSP application and signal an interrupt to that DSP. The DSP interrupt service routine would check the mailboxes, find the new command, give it to the application for processing, and place the response in the outgoing mailbox, then interrupt the host processor.

Sun Fire 6800/6900: Shared RAM Mailbox

Years later, when I came to work at Sun I found the same approach was used to allow the embedded service processor (aka, system controller) to send commands to Solaris and get responses back. For example, the Sun Fire 6800/6900 family of servers used this approach. Each domain (a group of UltraSPARC CPUs running a single instance of Solaris) had a separate shared memory with the service processor (SP). The interface between the SP and one Solaris domain looked something like this:

      SP               Domain(s)

 Application(s)      Application(s)
       |                   |
     ioctls              ioctls
       |                   |
 Mailbox Driver      Mailbox Driver
       |                   |
 -----------------------------------
  Hardware: Shared RAM, Interrupts

While the architecture is efficient and effective, one weakness is that it doesn't scale with applications. As new applications are introduced (for example, Dynamic Reconfiguration or Fault Management), a new mailbox needs to be carved out, and a new protocol invented.

Sun Fire 15K/25K: Internal Ethernet

The Sun Fire 15000/25000 servers improved on the basic mailbox approach by running a separate Ethernet connection from the SP to each and every domain (actually, to every expander). The mailbox was used for low-level operations (running POST, booting to OpenBoot, simulating a serial console, etc). But the Ethernet connection, called the Maintenance Area Network, was used for application-level communication. This allowed applications such as Dynamic Reconfiguration to be developed using standard APIs such as sockets and multiplexed using TCP/IP, which helped ease their development. When a new application was rolled out, we didn't need to carve out a new mailbox; we could just use a different TCP port. The one downside is the added cost and complexity of having a separate Ethernet subnet within the Sun Fire 15K chassis.

Sun SPARC Enterprise Approach

For the Sun SPARC Enterprise M-series, we decided the application scalability of the Maintenance Area Network was a benefit, but we wanted to achieve it without having to run a separate Ethernet network. The result is what we called DSCP -- the Domain to Service Processor Communication Protocol. DSCP provides IP communication between the SP and the domain without any new hardware: it uses a single mailbox in the shared RAM, and a pseudo serial driver on top of that mailbox over which we run PPP (the Point-to-Point Protocol). The DSCP stack looks like this:

      SP                 Domain(s)

 Application(s)        Application(s)
       |                     |
  Sockets API           Sockets API
       |                     |
     ipv4                  ipv4
       |                     |
      ppp                   ppp
       |                     |
  tty Driver         tty Driver (dm2s)
       |                     |
 Mailbox Driver     Mailbox Driver (scfd)
       |                     |
 -----------------------------------------
     Hardware: Shared RAM, Interrupts

Configuring DSCP

The Sun SPARC Enterprise Server Administration Guide explains how to set up DSCP, but it is really quite simple. The easiest method is using the syntax:
    setdscp -i NETWORK -m NETMASK
Choose a network address (be sure to pick a subnet that is not in use at your facility) and the corresponding netmask, and setdscp will do the rest. For example, in my lab the subnet 192.168.224.0 is unused, so I do:
    XSCF> setdscp -i 192.168.224.0 -m 255.255.255.0
There are other ways to set up the DSCP network addresses, but this is really the best approach.

setdscp will assign an IP address to the SP, and reserve one IP address for every possible domain (the M9000-64 supports 24 domains, so a maximum of 25 IP addresses are reserved). A common question is: if you're running PPP between the SP and each domain, don't you need two addresses for each domain, one for the domain and one for the SP? No, not really. Since routing is done based on the destination address, we can get away with using the same IP address for the SP on every PPP link. So technically speaking, the NETWORK and NETMASK are not defining a DSCP subnet; they are defining a range of IP addresses from which DSCP selects endpoint addresses. A subtle difference, but still a difference.

On the SP, showdscp will display the IP addresses assigned to each domain and the SP, for example:

    XSCF> showdscp

    DSCP Configuration:

    Network: 192.168.224.0
    Netmask: 255.255.255.0

     Location     Address
    ----------   ---------
    XSCF         192.168.224.1
    Domain #00   192.168.224.2
    Domain #01   192.168.224.3
    Domain #02   192.168.224.4
    Domain #03   192.168.224.5
In Solaris, the prtdscp(1M) command will display the IP address of that domain and the SP (prtdscp is located in /usr/platform/SUNW,SPARC-Enterprise/sbin). You can get the same basic information from ifconfig sppp0:
    % /usr/platform/SUNW,SPARC-Enterprise/sbin/prtdscp
    Domain Address: 192.168.224.2
    SP Address: 192.168.224.1

    % ifconfig sppp0
    sppp0: flags=10010008d1 mtu 1500 index 3
            inet 192.168.224.2 --> 192.168.224.1 netmask ffffff00
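
The address layout in the showdscp output can be reproduced with Python's ipaddress module. This is only a sketch of the pattern visible in that sample output (the SP takes the first usable address, then one address per domain in order); whether setdscp always allocates this way is an assumption, and the function name is hypothetical.

```python
import ipaddress
from itertools import islice


def dscp_addresses(network, netmask, domains):
    """Hand out one address to the SP, then one per domain, in order."""
    net = ipaddress.ip_network(f"{network}/{netmask}")
    hosts = list(islice(net.hosts(), domains + 1))
    if len(hosts) < domains + 1:
        raise ValueError("address range too small for SP plus domains")
    table = {"XSCF": str(hosts[0])}
    for d in range(domains):
        table[f"Domain #{d:02d}"] = str(hosts[d + 1])
    return table


for location, address in dscp_addresses("192.168.224.0", "255.255.255.0", 4).items():
    print(f"{location:<12} {address}")
```

Note that the SP gets a single address here, shared across every PPP link, in line with the explanation above.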

Benefits

Plumbing IP between the Solaris domain and the SP brings the obvious benefit of standards-based communication -- networking applications "just work". For example, you can configure the SP as an NTP server and configure the Solaris domains to use NTP to synchronize their time with the SP, all using the internal DSCP network. You can even use ssh to connect to the SP from a Solaris domain using the DSCP network. Since the SP does not have a hostname on the DSCP network, you need to get the IP address using prtdscp, for example
    ssh `/usr/platform/SUNW,SPARC-Enterprise/sbin/prtdscp -s`
Personally, I create an alias sshsp with the above line.

On the SP side, you can't use ssh or scp directly -- they're not available in the XSCF shell. But you can use them indirectly. You can configure log archiving (see the setarchiving man page) to use one of the domains as an archive host:

    XSCF> setarchiving -t rjh@192.168.224.2:/home/rjh/archive
    XSCF> setarchiving enable
[I'm not sure it makes sense to use a domain as a log archive host -- a catastrophic failure with the system means you also lose your log archive host -- but it is technically possible.]

And when you need to take a snapshot of the system for diagnosis purposes (see snapshot man page), you can specify one of the domains as the snapshot host using the -t option, for example:

    XSCF> snapshot -l -t rjh@192.168.224.2:/home/rjh/snap
    Downloading Public Key from '192.168.224.2'...
    Public Key Fingerprint: 44:9a:ad:55:2e:33:99:2e:fd:b7:47:74:de:ad:be:ef
    Accept this public key (yes/no)? yes
    Enter ssh password for user 'rjh' on host '192.168.224.2':
    Setting up ssh connection to rjh@192.168.224.2...
    Collecting data into rjh@192.168.224.2:/home/rjh/snap/mymachine_10.4.55.144_2007-05-07T19-39-40.zip
    Data collection complete
If your domain has internet access or a DVD burner, this might be the easiest way to get a snapshot back to a Sun Service Engineer.

Security

One of the most important security goals of the DSCP design was to ensure that if one Solaris domain is compromised, an attacker would not be able to affect the SP or another domain in the same chassis. This primary security requirement drove most of the DSCP design approach.

Using PPP provides an added security benefit. Each shared RAM mailbox represents a single PPP connection between Solaris and the SP. This means there is no opportunity for one domain to snoop the traffic between another domain and the SP, and no way for one domain to directly attack another domain using the DSCP network. There is also no routing between DSCP networks (or from DSCP to Ethernet or vice versa) on the SP. The communication paths of each domain are physically isolated.

Most of the protocols used on the Sun SPARC Enterprise servers place the client on the SP and the server on the domain. This means that the SP does not need to open up well-known ports for incoming connections, reducing the opportunity for attacks. Furthermore, the servers running in Solaris use IPsec to authenticate that incoming connections are coming from the SP.

To prevent the domain from attacking the SP, several methods are used. First, all of the authentication and authorization protocols employed for Ethernet users are in place for the DSCP networks. There is no DSCP "back door", so to speak. Further, the SP employs a firewall that blocks all the ports on the DSCP networks except a couple -- ssh and ntp. There are additional features in place, for example, bandwidth limiting to prevent denial-of-service attacks.

Summary

The Domain to Service Processor Communication Protocol enables IP-based communication between the Solaris domain and the SP, in a secure fashion, which enables standards-compliant applications such as ssh and ntp to "just work" between the SP and Solaris domains.

Monday May 14, 2007


Like the Sun Fire midrange (6800/6900) and high-end (15K/25K) servers, the Sun SPARC Enterprise M-series servers allow you to organize system boards (SBs) into hardware domains (called "Dynamic Systems Domains" by marketing). Hardware domains contain CPUs, memory and I/O which are isolated from each other; one hardware domain may be powered on or off regardless of the other hardware domains. Like Sun Fire, SPARC Enterprise system boards consist of four CPU chip sockets, 32 DIMM sockets, and I/O.

The Sun SPARC Enterprise midrange and high-end servers, however, take system boards and hardware domains one step further. Physical system boards can be partitioned into four eXtended system boards (XSBs).

The Sun SPARC Enterprise M4000 can have up to 4 CPU chips and is organized as a single system board. The M5000 can have up to eight CPU chips and is organized as two system boards. On the M8000 and M9000, a CPU/Memory Unit (CMU) plus an I/O Unit (IOU) together form a "system board". The M8000 can have up to four SBs, while the M9000-64 can have up to 16. When all of the resources on a system board are assigned to domains as a single group, the system board is said to be in "Uni-XSB" mode. The following tables show the CPU, memory and I/O resources on the Sun SPARC Enterprise system boards in Uni-XSB mode:

M4000 in Uni-XSB Mode
SB  CPUs   Memory    I/O
--  -----  --------  ------------
00  CPU#0  32 DIMMs  2 SAS Disks
    CPU#1            DVD/DAT
    CPU#2            2 GBE Ports
    CPU#3            PCI-X Slot#0
                     PCI-E Slot#1
                     PCI-E Slot#2
                     PCI-E Slot#3
                     PCI-E Slot#4

M5000 in Uni-XSB Mode
SB  CPUs   Memory    I/O
--  -----  --------  ------------
00  CPU#0  32 DIMMs  2 SAS Disks
    CPU#1            DVD/DAT
    CPU#2            2 GBE Ports
    CPU#3            PCI-X Slot#0
                     PCI-E Slot#1
                     PCI-E Slot#2
                     PCI-E Slot#3
                     PCI-E Slot#4
01  CPU#0  32 DIMMs  2 SAS Disks
    CPU#1            2 GBE Ports
    CPU#2            PCI-X Slot#0
    CPU#3            PCI-E Slot#1
                     PCI-E Slot#2
                     PCI-E Slot#3
                     PCI-E Slot#4

M8000/M9000 SB in Uni-XSB Mode
SB        CPUs   Memory    I/O
--------  -----  --------  ------------
00 to 15  CPU#0  32 DIMMs  PCI-E Slot#1
          CPU#1            PCI-E Slot#2
          CPU#2            PCI-E Slot#3
          CPU#3            PCI-E Slot#4
                           PCI-E Slot#5
                           PCI-E Slot#6
                           PCI-E Slot#7
                           PCI-E Slot#8

Normally on a Sun Fire system you would only be able to create as many domains as you have system boards. However, with the Sun SPARC Enterprise servers, you can configure each system board into four XSBs (quad-XSB mode). This allows you to create domains as small as a single CPU, 8 DIMMs, and I/O. To make it easier to map XSBs back to the physical SB, the number used for XSBs is xx-y where xx is the physical system board, and y is the XSB on that system board. For example, 01-2 would refer to the XSB containing CPU#2 on physical system board #1. The next table shows how the various resources are partitioned among the four XSBs per SB.

M4000 in Quad-XSB Mode
SB  XSB   CPUs   Memory   I/O
--  ----  -----  -------  ------------
00  00-0  CPU#0  8 DIMMs  2 SAS Disks
                          DVD/DAT
                          2 GBE Ports
                          PCI-X Slot#0
                          PCI-E Slot#1
                          PCI-E Slot#2
    00-1  CPU#1  8 DIMMs  PCI-E Slot#3
                          PCI-E Slot#4
    00-2  CPU#2  8 DIMMs  No I/O
    00-3  CPU#3  8 DIMMs  No I/O

M5000 in Quad-XSB Mode
SB  XSB   CPUs   Memory   I/O
--  ----  -----  -------  ------------
00  00-0  CPU#0  8 DIMMs  2 SAS Disks
                          DVD/DAT
                          2 GBE Ports
                          PCI-X Slot#0
                          PCI-E Slot#1
                          PCI-E Slot#2
    00-1  CPU#1  8 DIMMs  PCI-E Slot#3
                          PCI-E Slot#4
    00-2  CPU#2  8 DIMMs  No I/O
    00-3  CPU#3  8 DIMMs  No I/O
01  01-0  CPU#0  8 DIMMs  2 SAS Disks
                          2 GBE Ports
                          PCI-X Slot#0
                          PCI-E Slot#1
                          PCI-E Slot#2
    01-1  CPU#1  8 DIMMs  PCI-E Slot#3
                          PCI-E Slot#4
    01-2  CPU#2  8 DIMMs  No I/O
    01-3  CPU#3  8 DIMMs  No I/O

M8000/M9000 in Quad-XSB Mode
SB        XSB   CPUs   Memory   I/O
--------  ----  -----  -------  ------------
00 to 15  XX-0  CPU#0  8 DIMMs  PCI-E Slot#1
                                PCI-E Slot#2
          XX-1  CPU#1  8 DIMMs  PCI-E Slot#3
                                PCI-E Slot#4
          XX-2  CPU#2  8 DIMMs  PCI-E Slot#5
                                PCI-E Slot#6
          XX-3  CPU#3  8 DIMMs  PCI-E Slot#7
                                PCI-E Slot#8

Note in the above table that on M4000 and M5000 servers, XSB 0 gets the internal disks, DVD, Gigabit Ethernet, PCI-X slot and two PCI-Express slots. XSB 1 gets two PCI-Express slots. XSBs 2 and 3 have no I/O. This is a physical limitation -- the M4000 and M5000 I/O units only have two PCI-Express hostbridges. So, while in theory you could create eight domains with a single CPU each, in reality a domain needs I/O so you can only create four hardware domains in an M5000; two domains in an M4000.
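
The xx-y naming and the M4000/M5000 I/O limitation can be expressed in a few lines of Python. Both function names are hypothetical; the rule that only XSBs 0 and 1 have I/O applies to M4000/M5000 boards in quad-XSB mode, as described above.

```python
def parse_xsb(name):
    """Split an XSB name such as '01-2' into (system board, XSB index)."""
    sb, sep, idx = name.partition("-")
    if sep != "-" or len(sb) != 2 or not sb.isdigit() or idx not in ("0", "1", "2", "3"):
        raise ValueError(f"not an XSB name: {name!r}")
    return int(sb), int(idx)


def has_io_m4000_m5000(name):
    """In quad-XSB mode on an M4000/M5000, only XSBs 0 and 1 have I/O."""
    _, idx = parse_xsb(name)
    return idx in (0, 1)


for xsb in ("00-0", "00-1", "00-2", "01-3"):
    print(xsb, parse_xsb(xsb), has_io_m4000_m5000(xsb))
```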

The M8000 and M9000 system boards, on the other hand, have symmetric XSBs -- each system board has four CPUs, 32 DIMMs, four PCI-Express hostbridges and 8 PCI-Express slots. When they're placed in quad-XSB mode, each XSB has one CPU, 8 DIMMs, one PCI-Express hostbridge and two PCI-Express slots. So a SPARC Enterprise M8000 with 4 system boards can effectively be split into 16 domains.
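
The domain counts quoted above fall out of a simple multiplication: system boards times the number of XSBs per board that have I/O. A Python sketch, using only the figures given in this post (the table names and function name are hypothetical):

```python
# Maximum system boards per server, as stated in the text.
SYSTEM_BOARDS = {"M4000": 1, "M5000": 2, "M8000": 4}

# XSBs per system board that have I/O when the board is in quad-XSB mode.
XSBS_WITH_IO_PER_SB = {"M4000": 2, "M5000": 2, "M8000": 4}


def max_quad_xsb_domains(model):
    """Upper bound on domains, counting only XSBs that have I/O."""
    return SYSTEM_BOARDS[model] * XSBS_WITH_IO_PER_SB[model]


for model in SYSTEM_BOARDS:
    print(model, max_quad_xsb_domains(model))
```

The M9000 is left out on purpose: its domain limit is set by the platform rather than by the count of XSBs with I/O, so this multiplication would overstate it.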

For example, with an M5000, you could place system board 00 in quad-XSB mode, and system board 01 in uni-XSB mode. Then you can create one domain with XSBs 00-0 and 00-1 (call this the "green" domain), and a second domain with 00-2, 00-3 and all of 01 (call this the "blue" domain). Here's what that would look like:

 
Example M5000 With Two Domains

    Domain  SB  XSB        CPUs   Memory    I/O
    green   00  00-0       CPU#0  8 DIMMs   2 SAS Disks
                                            DVD/DAT
                                            2 GBE Ports
                                            PCI-X Slot#0
                                            PCI-E Slot#1
                                            PCI-E Slot#2
    green       00-1       CPU#1  8 DIMMs   PCI-E Slot#3
                                            PCI-E Slot#4
    blue        00-2       CPU#2  8 DIMMs   No I/O
    blue        00-3       CPU#3  8 DIMMs   No I/O
    blue    01  (uni-XSB)  CPU#0  32 DIMMs  2 SAS Disks
                           CPU#1            2 GBE Ports
                           CPU#2            PCI-X Slot#0
                           CPU#3            PCI-E Slot#1
                                            PCI-E Slot#2
                                            PCI-E Slot#3
                                            PCI-E Slot#4

The green domain would have 2 CPUs, 16 DIMMs, and all of the I/O on system board 00, while the blue domain would have 6 CPUs, 48 DIMMs, and all of the I/O on system board 01.
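
Those totals are easy to verify by adding up the XSBs in each domain. The following Python sketch is a rough illustration, not anything XSCF provides: the `XSB_RESOURCES` and `DOMAINS` tables and the `domain_totals` helper are hypothetical names, and labeling the uni-XSB board as 01-0 is an assumption made for this example.

```python
# Illustrative only: the data layout and helper name are assumptions.

# Resources per XSB as (CPUs, DIMMs). Board 00 is in quad-XSB mode, so
# it appears as four XSBs. Board 01 is in uni-XSB mode, so it appears
# as a single entry (labeled 01-0 here) holding the whole board.
XSB_RESOURCES = {
    "00-0": (1, 8),
    "00-1": (1, 8),
    "00-2": (1, 8),
    "00-3": (1, 8),
    "01-0": (4, 32),
}

# Domain membership from the example: green and blue.
DOMAINS = {
    "green": ["00-0", "00-1"],
    "blue": ["00-2", "00-3", "01-0"],
}


def domain_totals(xsbs):
    """Return (total CPUs, total DIMMs) for a list of XSB numbers."""
    cpus = sum(XSB_RESOURCES[x][0] for x in xsbs)
    dimms = sum(XSB_RESOURCES[x][1] for x in xsbs)
    return cpus, dimms


for name, xsbs in DOMAINS.items():
    cpus, dimms = domain_totals(xsbs)
    print(f"{name}: {cpus} CPUs, {dimms} DIMMs")
```

With the example assignment this gives 2 CPUs and 16 DIMMs for green, and 6 CPUs and 48 DIMMs for blue.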

There are some downsides to using quad-XSB mode. The primary issue is availability in the face of hardware failures. Each M4000 or M5000 system board has two SC chips; each M8000/M9000 system board has four. (Officially, these are called "system controller" ASICs; however, due to potential confusion with the Sun Fire System Controllers, I like to just call them SC chips.) The SC chips connect the CPUs, memory and I/O on the system board, and connect the system board to the system crossbar (or, in the case of the M5000, the SCs on one system board connect directly to the SCs on the other system board). The SC chips are shared by all XSBs on a system board. If a system board is in uni-XSB mode and there's a fault internal to an SC chip, the system board (and the domain using that system board) may take a fatal error and be reset. If a system board is in quad-XSB mode, an SC fault may require the entire system board to be reset, which would reset all domains using XSBs on that system board.

Using the M5000 example above, if CPU#0 on system board 00 experienced a fatal error, only the green domain would be reset. However, if one of the SC chips on system board 00 experienced a fault, all XSBs on system board 00 would be affected, and both the blue and the green domains would be reset as a result.
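
The difference between the two failure cases can be sketched as a lookup from a fault to the set of domains it takes down. This is a toy model under stated assumptions, not how XSCF actually detects or handles errors: the `XSB_DOMAIN` map and both helper functions are hypothetical, and 01-0 is used here as a label for the whole of board 01 in uni-XSB mode.

```python
# Illustrative only: a toy fault model, not XSCF behavior.

# XSB-to-domain assignment from the M5000 example. The 01-0 entry
# stands for all of board 01 in uni-XSB mode (an assumed label).
XSB_DOMAIN = {
    "00-0": "green",
    "00-1": "green",
    "00-2": "blue",
    "00-3": "blue",
    "01-0": "blue",
}


def domains_reset_by_cpu_fault(xsb):
    """A fatal CPU error only affects the domain that owns that XSB."""
    return {XSB_DOMAIN[xsb]}


def domains_reset_by_sc_fault(board):
    """An SC fault affects every domain with an XSB on that board."""
    prefix = f"{board:02d}-"
    return {dom for xsb, dom in XSB_DOMAIN.items() if xsb.startswith(prefix)}


print(sorted(domains_reset_by_cpu_fault("00-0")))
print(sorted(domains_reset_by_sc_fault(0)))
print(sorted(domains_reset_by_sc_fault(1)))
```

In this model a CPU fault in 00-0 resets only green, an SC fault on board 00 resets both green and blue, and an SC fault on board 01 resets only blue.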

On the other hand, XSBs do offer a great deal of flexibility. With an M4000, which has only one system board, you can create two domains, something you could never do with a Sun Fire 6900/25K with only one system board. On larger systems, you can configure domains down to the CPU level rather than the system board level. If the impact of losing more than one domain to a single hardware failure is acceptable, then quad-XSB mode offers unprecedented flexibility and configurability.