The XSCF man facility does include a standard Intro(8) topic, which provides a complete list of commands and a short synopsis. For example:
XSCF> man intro
System Administration Intro(8)
NAME
Intro - eXtended System Control Facility (XSCF) man pages
DESCRIPTION
This manual contains XSCF man pages.
LIST OF COMMANDS
The following commands are supported:
Intro, intro eXtended System Control Facility
(XSCF) man pages
addboard configure an eXtended System
Board(XSB) into the domain confi-
guration or assigns it to the domain
configuration
addcodlicense add a Capacity on Demand (COD)
right-to-use (RTU) license key to
the COD license database
addfru add a Field Replaceable Unit (FRU)
adduser create an XSCF user account
...
On the other hand, you may sort of know the command you want to run, but aren't sure of the specifics. For example, to set up the network I sometimes forget if the command is 'setnet' or 'setnetwork'. As in a standard bash shell, the tab key can be used to complete the command. For example, 'setnet<TAB>' will expand to 'setnetwork'.
Also, like the bash shell, you can use the double-tab to display a list of possible completions. This comes in handy when you know you need a 'set' command, but you don't recall specifically which command; you can do 'set<TAB><TAB>' and get a complete list of all 'set*' commands. For example (using 'setn<TAB><TAB>' which is a little less verbose):
XSCF> setn<TAB><TAB>
setnameserver setnetwork setntp
XSCF> setn
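The completion behaviour described above can be sketched in a few lines of Python. This is only an illustration of prefix matching, not how the XSCF shell is implemented; the command list is a hypothetical subset taken from the examples in this post, and the single-TAB rule is simplified to "expand only on a unique match".

```python
# Hypothetical subset of XSCF commands, taken from the examples in the text.
COMMANDS = ["addboard", "adduser", "setnameserver", "setnetwork", "setntp"]

def completions(prefix, commands=COMMANDS):
    """Double-TAB: return every command that starts with the typed prefix."""
    return sorted(c for c in commands if c.startswith(prefix))

def complete(prefix, commands=COMMANDS):
    """Single-TAB (simplified): expand only when exactly one command matches."""
    matches = completions(prefix, commands)
    return matches[0] if len(matches) == 1 else prefix

print(completions("setn"))  # the candidates a double-TAB would list
print(complete("setnet"))   # unique match, so the prefix is expanded
```

With this list, 'setn' has three candidates and is left alone, while 'setnet' has one and expands to 'setnetwork'.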
Finally, all of the XSCF commands consistently implement the -h option to display the command's synopsis. This isn't the full man page, just the synopsis. For example:
XSCF> setnetwork -h
usage: setnetwork [-m addr] interface address
       setnetwork -c {up | down} interface
       setnetwork -h
That's usually enough to help you figure out what arguments you need to provide. And when it isn't, you always have man to provide all the details.
The purpose of this tool is:
Due to proprietary agreements with Fujitsu, the power calculator has been removed from this blog.
To make service a bit easier, each physical disk bay in the Sun SPARC-Enterprise M-class line of servers has two LEDs: one for power, the other for fault indications. We used the fault LED in blinking mode to act as an indicator so the system administrator can identify and locate a specific disk.
The disk LEDs are accessed using cfgadm(1M). The cfgadm PCI plugin always supported LED manipulation, using a -x option. So we expanded the cfgadm SCSI plugin to use the same syntax. We wanted to add the ability to directly control the disk LEDs, so we added "-x led=LED[,mode=MODE]" (where LED could be power, fault, active or attn, just like PCI, and mode can be on, off or blink).
But we also felt there was a general problem being solved here, that of a "locator" indication. In the case of the M-class servers, the "locator" indication happens to be blinking the fault LED, but in the future, other platforms may use some other method (for example, a separate locator LED, an annoying warble sound like a car alarm coming from the disk, or an LCD display on the front of the server that draws a hand pointing to the disk you want to remove). Our solution should not presuppose that the fault LED, or any LED for that matter, must be used as the locator.
So another -x option was added to the cfgadm SCSI plugin: "-x locator[={on|off}]". This allows a user to turn on or turn off the locator indication, regardless of the underlying implementation.
Here's a snippet from the new cfgadm_scsi(1M) man page showing the two new -x options:
-x hardware_function Some of the following commands can
only be used with SCSI controllers
and some only with SCSI devices.
In the following, controller_ap_id
refers to an ap_id for a SCSI con-
troller, for example, c0.
device_ap_id refers to an ap_id for
a SCSI device, for example:
c0::dsk/c0dt3d0.
The following hardware specific
functions are defined:
locator [=on|off] device_ap_id
Sets or gets the hard disk loca-
tor LED, if it is provided by
the platform. If the [on|off]
suboption is not set, the state
of the hard disk locator is
printed.
led[=LED,mode=on|off|blink]
device_ap_id
If no sub-arguments are set,
this function print a list of
the current LED settings. If
sub-arguments are set, this
function sets the mode of a
specific LED for a slot.
To give you an idea of how this works, here's some annotated output from a Solaris session:
# cfgadm -x led=fault c0::dsk/c0t0d0
Disk                    Led
c0t0d0                  fault=off
Turn on the locator indication for disk c0t0d0:
# cfgadm -x locator=on c0::dsk/c0t0d0
# cfgadm -x locator c0::dsk/c0t0d0
Disk                    Led
c0t0d0                  locator=on
Check the current state of the fault LED; it is now blinking:
# cfgadm -x led=fault c0::dsk/c0t0d0
Disk                    Led
c0t0d0                  fault=blink
Turn off the fault LED:
# cfgadm -x led=fault,mode=off c0::dsk/c0t0d0
# cfgadm -x led=fault c0::dsk/c0t0d0
Disk                    Led
c0t0d0                  fault=off
And the locator indication is also off:
# cfgadm -x locator c0::dsk/c0t0d0
Disk                    Led
c0t0d0                  locator=off
The intention is certainly that the system administrator would use '-x locator' to locate and replace internal SAS disks. But once SCSI FMA is supported in Solaris, we'll have the ability to set the fault LED on disks, and administrators can use '-x led' to view the state of the fault LED.
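To make the option syntax concrete, here is a small Python sketch that builds the cfgadm argument lists for the two -x options described above. The LED and mode names come from the text; the helper names led_args and locator_args are hypothetical, and the sketch only assembles arguments, it does not run cfgadm.

```python
LEDS = {"power", "fault", "active", "attn"}   # LED names listed in the text
MODES = {"on", "off", "blink"}                # LED modes listed in the text

def led_args(ap_id, led=None, mode=None):
    """Build argv for 'cfgadm -x led[=LED[,mode=MODE]] ap_id'."""
    option = "led"
    if led is not None:
        if led not in LEDS:
            raise ValueError("unknown LED: %s" % led)
        option += "=" + led
        if mode is not None:
            if mode not in MODES:
                raise ValueError("unknown mode: %s" % mode)
            option += ",mode=" + mode
    elif mode is not None:
        raise ValueError("a mode needs an LED name")
    return ["cfgadm", "-x", option, ap_id]

def locator_args(ap_id, state=None):
    """Build argv for 'cfgadm -x locator[={on|off}] ap_id'."""
    if state not in (None, "on", "off"):
        raise ValueError("locator state must be 'on' or 'off'")
    option = "locator" if state is None else "locator=" + state
    return ["cfgadm", "-x", option, ap_id]

print(" ".join(locator_args("c0::dsk/c0t0d0", "on")))
print(" ".join(led_args("c0::dsk/c0t0d0", "fault", "off")))
```

The two printed lines match the "turn on the locator" and "turn off the fault LED" commands from the session above.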
Normally, PCI card AP IDs (attachment point IDs) are labeled based on the physical location of the card -- its I/O Unit (IOU) and slot. For example, on Sun SPARC Enterprise M-class servers, the AP ID "iou#0-pci#1" is the slot PCI#1 on I/O Unit IOU#0.
With IO Box, however, there is no fixed physical location for the IO Box slots. The IO Box does connect to a host slot (like iou#0-pci#1), so one could label the AP IDs something like "iou#0-pci#1:iob.pci3" to show that it's PCI slot 3 in an IO Box attached to IOU#0-PCI#1.
On the other hand, this introduces issues when the IO Box is physically remote from the server -- it might not be obvious where this box connects to the host. We don't want customers tracing cables, and a simple mistake could cause you to power off or power on the wrong slot. Something better, something more reliable, was needed.
So we augmented the AP ID to include the serial id of the IO Box boat. With this approach, someone can look at an IO Box, write down the serial id and slot they want to power off, then go back to Solaris and power off that slot based on serial id. Similarly, if you have powered off a slot and want to go remove the card, you can write down the serial id and slot, then go find the IO Box boat with the matching serial id, and have confidence that you're removing the right card.
The resulting AP ID is a combination of the physical location of the host slot and the serial id of the IO Box boat. An example of the AP ID format looks like this: "iou#0-pci#4:iobE00E7.pcie1", which shows PCIE slot 1 in the IO Box boat with serial id ending in "E00E7". From the AP ID, it is also clear that the IO Box boat is connected to the host using a link card in host slot IOU#0-PCI#4.
I should also note that the IO Box boat serial id is prominently featured on the handle, in plain view. There's no need to remove the boat to get to the product nameplate.
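The AP ID format described above is regular enough to pick apart with a short Python sketch. The regular expression is my own reading of the examples in this post (host slot, then an optional ":iobSERIAL.pciN" or ".pcieN" suffix), not a format definition from Solaris, and the function name parse_ap_id is hypothetical.

```python
import re

# Assumed pattern, inferred from the example AP IDs in the text.
AP_ID = re.compile(r"^iou#(\d+)-pci#(\d+)(?::iob(\w+)\.(pcie|pci)(\d+))?$")

def parse_ap_id(ap_id):
    """Split an AP ID into host slot and (optional) IO Box boat slot."""
    m = AP_ID.match(ap_id)
    if m is None:
        raise ValueError("unrecognized AP ID: %s" % ap_id)
    iou, host_slot, serial, bus, box_slot = m.groups()
    info = {"iou": int(iou), "host_slot": int(host_slot)}
    if serial is not None:
        info.update(serial=serial, bus=bus, box_slot=int(box_slot))
    return info

print(parse_ap_id("iou#0-pci#4:iobE00E7.pcie1"))
print(parse_ap_id("iou#0-pci#1"))
```

For "iou#0-pci#4:iobE00E7.pcie1" this yields host slot 4 on IOU 0, boat serial suffix "E00E7", and PCIE slot 1 in the boat.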
Here's some sample output from 'cfgadm -a' showing just the PCI slots:
# cfgadm -a
Ap_Id                         Type         Receptacle   Occupant     Condition
...
iou#0-pci#0                   unknown      empty        unconfigured unknown
iou#0-pci#1                   unknown      empty        unconfigured unknown
iou#0-pci#2                   unknown      disconnected unconfigured unknown
iou#0-pci#3                   pci-pci/hp   connected    configured   ok
iou#0-pci#3:iobX00FC.pci1     unknown      empty        unconfigured unknown
iou#0-pci#3:iobX00FC.pci2     fibre/hp     connected    configured   ok
iou#0-pci#3:iobX00FC.pci3     scsi/hp      connected    configured   ok
iou#0-pci#3:iobX00FC.pci4     unknown      empty        unconfigured unknown
iou#0-pci#3:iobX00FC.pci5     fibre/hp     connected    configured   ok
iou#0-pci#3:iobX00FC.pci6     unknown      empty        unconfigured unknown
iou#0-pci#4                   pci-pci/hp   connected    configured   ok
iou#0-pci#4:iobE00E7.pcie1    unknown      empty        unconfigured unknown
iou#0-pci#4:iobE00E7.pcie2    etherne/hp   connected    configured   ok
iou#0-pci#4:iobE00E7.pcie3    etherne/hp   connected    configured   ok
iou#0-pci#4:iobE00E7.pcie4    pci-pci/hp   connected    configured   ok
iou#0-pci#4:iobE00E7.pcie5    pci-pci/hp   connected    configured   ok
iou#0-pci#4:iobE00E7.pcie6    unknown      empty        unconfigured unknown
In the above output, IOU#0-PCI#0 through IOU#0-PCI#4 are the host slots; IOU#0-PCI#3 is connected to a PCI-X IO Box boat, while IOU#0-PCI#4 is connected to a PCI-Express IO Box boat.
We knew customers would not want an unmanaged, "black box" for external I/O. The system administrator needs access to the status of the IO Box. They also need to know when the IO Box fails, and why it is failing (and how that failure is affecting the rest of the system). And being able to light "locator" indicators to find one of hundreds of IO Boxes located remotely from the host server was critical for service. Providing a managed IO Box, with fault and status information available using standard protocols, and LEDs controlled by the host, is essential to provide a highly available system.
On the other hand, we didn't want the IO Box to be another system that the customer had to manage. The IO Box does not have a service processor; it doesn't have its own MIB; it doesn't need software upgrades; it doesn't have an ethernet port.
The IO Box is just a bunch of I/O slots available to the host; it shouldn't matter if they are located in the same chassis as, or several meters (or several dozen meters) away from, the CPUs and memory.
The IO Box is fully managed, as a part of the host, reporting status and faults back to the host to which it is connected. The host SP includes IO Box status as part of the overall system status. The IO Box does not require any special cables between the host and the box, other than the PCI-Express cables. Separate cabling would introduce the risk that a customer could cross-wire a box to the wrong host. And we wanted to make the wiring as simple as possible.
This is accomplished by using extra signals in the PCI-Express cable to implement a management connection between the host service processor (SP) and the IO Box. The standard PCI-Express connector has two pins for SMBus: SMCLK (B5) and SMDAT (B6). Using those two pins, the host (SP) is able to talk to a "slave" microcontroller on the link card. The microcontroller then uses spare signals in the cable to communicate in a reliable fashion with the microcontroller on the link card in the IO Box to forward requests from the host SP. The link card in the IO Box then uses SMBus on those same two pins B5 and B6, this time as "master", to access devices in the IO Box, reading or writing any device that the host SP requests. In effect, the I2C devices in the IO Box become "local" to the host service processor, even though the IO Box itself is remote. The microcontrollers on the link cards act as proxies.
With this arrangement, the host SP can retrieve environmental information about the IO Box: temperatures, fan speeds, voltages, currents, switch positions, etc. In addition, the host SP can control the IO Box, turning off power to a power supply unit so it can be removed, lighting the "locator" indicator so the IO Box can be found in the datacenter, etc. And when the IO Box experiences an error, the host SP can gather error information, and factor it into other host-detected errors to diagnose the fault.
IO Box management is entirely optional -- the IO Box as a standalone unit can function with no host management. But management by the host SP provides an added dimension of availability and serviceability which is not found on low-end I/O expansion units.
The IO Box addresses a critical problem with previous generations of enterprise servers: The I/O-to-CPU ratio was too low for some customer applications. On a Sun Fire 25K fully populated with I/O boards and CPU boards, there are 72 PCI slots, which is plenty for most customers. But with 72 CPU sockets filled with dual-core UltraSPARC IV CPUs, that yields a PCI-slot to CPU core ratio of 1/2 -- one PCI slot for every two CPU cores. The SPARC Enterprise M9000-64 has 128 PCI-Express slots, and while that may seem like a lot, with 64 dual-core SPARC64-VI CPU chips, that's still just 1 PCI-Express slot per core (and it will get worse when we pack more cores into each CPU chip). Some customers really care about I/O, and a higher I/O-to-CPU ratio is important.
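The ratios quoted above are simple arithmetic, and a few lines of Python make it easy to replay them. The slot, chip and core counts are the ones from the text; the quad-core line is a hypothetical "what if" to show how the ratio drops as cores per chip increase.

```python
def slots_per_core(pci_slots, cpu_chips, cores_per_chip):
    """PCI slots available per CPU core on a fully populated system."""
    return pci_slots / (cpu_chips * cores_per_chip)

sun_fire_25k = slots_per_core(72, 72, 2)    # dual-core UltraSPARC IV
m9000_64 = slots_per_core(128, 64, 2)       # dual-core SPARC64-VI
m9000_64_quad = slots_per_core(128, 64, 4)  # hypothetical quad-core chips

print(sun_fire_25k, m9000_64, m9000_64_quad)
```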
The IO Box allows you to connect one PCI-Express slot in the host to an IO Boat, which has six additional, hot-plug-capable PCI-Express or PCI-X slots. Each IO Box can support up to two IO Boats (either PCI-Express or PCI-X), independently connected to the same host. The host-to-box link can either be copper (low cost, but short and bulky, so only really applicable if the server and the IO Box are in the same cabinet) or fibre optic (higher cost, but with 25m cable lengths you can locate the IO Boxes together in a separate cabinet from your servers).
There are other IO expansion units on the market, but the Sun version is really designed for the enterprise-class environment, with features like:
While IO Box is initially supported on the M-class servers, its history predates the Sun/Fujitsu APL agreements, and it will almost certainly be supported on other Sun products in the future. I've got some neat things I'd like to share about IO Box in future posts...
For I/O, the easiest method happens to be the fmtopo command. fmtopo will list all of the I/O devices, the device path, the FRU (Field Replaceable Unit) and FRU Label. Here's a snippet of the output showing the device node /devices/pci@90,600000:
# /usr/lib/fm/fmd/fmtopo -p
...
hc:///chassis=0/ioboard=0/hostbridge=0/pciexrc=0
    ASRU: dev:////pci@90,600000
    FRU: hc:///component=iou#0
    Label: iou#0
...
The PCI-Express root complex pci@90,600000 belongs to the logical system board (LSB) 9 (you can tell from the "90" after the @ sign). But the output of fmtopo shows that the device is actually on the physical FRU iou#0, which is part of SB#0.
Now, knowing that LSB#9 is really SB#0, one can infer that the cpuids associated with LSB#9 (i.e., cpuids 9*32 through 10*32-1, or 288-319) are also on SB#0.
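The arithmetic in the paragraph above can be written down as a short Python sketch. The 32-cpuids-per-LSB figure comes from the text; reading the LSB number from the high hex digit of the hostbridge bus address is my assumption based on the "90" in pci@90,600000, and both function names are hypothetical.

```python
CPUIDS_PER_LSB = 32  # from the text: LSB 9 owns cpuids 9*32 through 10*32-1

def lsb_cpuids(lsb):
    """Range of cpuids that belong to a logical system board."""
    return range(lsb * CPUIDS_PER_LSB, (lsb + 1) * CPUIDS_PER_LSB)

def lsb_from_hostbridge(node):
    """'pci@90,600000' -> 9 (assumes the LSB is the bus address divided by 0x10)."""
    bus_address = int(node.split("@")[1].split(",")[0], 16)
    return bus_address >> 4

lsb = lsb_from_hostbridge("pci@90,600000")
ids = lsb_cpuids(lsb)
print(lsb, ids[0], ids[-1])
```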
So, how does fmtopo figure out how to map LSBs to SBs? Turns out that there is one memory controller per LSB, and the memory controller node has two properties of interest, board# and physical-board#. The board# is the LSB number, while the physical-board# is the SB number. Other nodes in the device tree have the board# property (I/O hostbridges, CPUs, etc), but only the memory controller node has the physical-board# property.
To see what I mean, you can use prtconf, for example:
# /usr/sbin/prtconf -pv | grep "board#"
    physical-board#:  00000000
    board#:  00000009
    board#:  00000009
    board#:  00000009
    board#:  00000009
    board#:  00000009
    board#:  00000009
    board#:  00000009
    board#:  00000009
    board#:  00000009
    board#:  00000009
If you look at the full output of prtconf, you'll see that the first two lines belong to the memory controller node (pseudo-mc) with an LSB board# of 9 and an SB physical-board# of 0. The other board# properties belong to the CPUs and I/O host bridges.
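If you wanted to pull the LSB-to-SB mapping out of that grep output with a script, a sketch along these lines would do it. It assumes, as in the sample above, that each physical-board# line is immediately followed by the board# line of the same memory controller node and that the values are hexadecimal; the function name lsb_to_sb is hypothetical.

```python
SAMPLE = """\
    physical-board#:  00000000
    board#:  00000009
    board#:  00000009
    board#:  00000009
"""

def lsb_to_sb(grep_output):
    """Map LSB number -> SB number from 'prtconf -pv | grep "board#"' output."""
    mapping = {}
    pending_sb = None  # SB number waiting for its matching board# line
    for line in grep_output.splitlines():
        name, _, value = line.strip().partition(":")
        if not value.strip():
            continue
        number = int(value.strip(), 16)
        if name == "physical-board#":
            pending_sb = number
        elif name == "board#" and pending_sb is not None:
            mapping[number] = pending_sb
            pending_sb = None
    return mapping

print(lsb_to_sb(SAMPLE))
```

The board# lines that are not preceded by a physical-board# line (CPUs, hostbridges) are simply ignored.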
XCP 1040 (with which most systems have shipped) has another, undocumented command, called getflashimage(8M). XCP 1041 is now available, and upgrading using getflashimage and flashupdate is easy -- it takes just one minute to download the image with getflashimage, and 15 minutes to install the new version with flashupdate.
Since getflashimage didn't make the XSCF User's Manual, I thought I'd give you a brief overview.
getflashimage allows you to log on to the XSCF and download an XCP image to the XSCF. getflashimage works similarly to wget: you provide a URL and getflashimage downloads the file. getflashimage supports the http, https and ftp protocols, and will even allow you to download the XCP image from a USB flash device (which is useful if your XSCF does not connect to the network where the XCP images are).
The synopsis for getflashimage is:
getflashimage [-v] [-q -{y|n}] [-u username] [-p proxy [-t proxy_type]] URL
getflashimage -l
getflashimage [-q -{y|n}] -d
getflashimage -h
I think most of the options in the first synopsis are pretty straightforward (including the standard '-v' for "verbose", '-q' for "quiet", "-{y|n}" for "yes" or "no"). For example, I can download a flash image from an https server that requires user authentication by doing:
XSCF> getflashimage -u rjh https://imageserver/images/FFXCP1041.tar.gz
where "rjh" is my webserver user name. getflashimage will prompt for my password, and in about a minute, the image will be downloaded and ready for flashupdate(8M). If you're having problems accessing the server (and the error messages aren't sufficiently clear), you can use the -v option to view the protocol exchange with the server, and see the exact error codes returned by the server (if you find yourself having to use "-v", please let me know what the problem was, and I'll try to improve the error messages). And of course the -p and -t options can be used to access the web through a proxy, where -p is the proxy name or IP address, and -t can be used to specify the proxy type (http, socks4 or socks5; the default proxy type is http).
In hindsight, it probably would have made sense to do some integrity checking on the image file, maybe verifying its checksum during the download process. Maybe I'll add that to the next release.
The second synopsis line, 'getflashimage -l' (small L), allows you to list the image file that was downloaded, just in case you forgot whether you finished downloading. It will also display the image file size and download date.
The third synopsis line, 'getflashimage -d', lets you delete any and all image files previously downloaded. If you've done the download and flashupdate, you can delete the image at any time, but it doesn't hurt to leave the file around (the space is reserved and can't be used for anything else). On the other hand, if you downloaded an image file and decided you did not want to flashupdate it (perhaps you downloaded the wrong version), you might want to delete it immediately so you don't accidentally flashupdate it later.
getflashimage can be a real lifesaver if you have a slow connection to the XSCF. For example, when I connect from my home (in the Boston area) to Sun's lab in San Diego, using the Browser User Interface can be slow because I have to ftp the XCP image from San Diego to my home workstation, then upload it using Firefox back to the XSCF in San Diego -- a 6,000 mile journey. But with getflashimage, I can ssh to the XSCF, then use getflashimage to very quickly load an XCP image from a file server to the XSCF on the same subnet, without the bits needing to travel cross-country to me on the East Coast. I also tend to prefer command lines over GUIs.
If you're going to download and install XCP 1041, you'll want to read the product notes about getflashimage. I'll write a separate posting about that.
For example, consider an M9000 system with two system boards shown in the following figure:
[Figure: System Board #1 and System Board #2]
Each system board has four I/O hostbridges, four MACs (memory access controllers), and four CPU chips. The IOCs, MACs and CPU chips on a system board are interconnected by four SC chips (system controller), and the SCs connect the system boards to the crossbar unit (XBU). [If you're curious, each SC has direct connections to all four MACs, all four CPU chips, and one of the hostbridges.]
Now let's take two simple transactions: Transaction TA and Transaction TB. Transaction TA is a write from a hostbridge on system board #1 to memory on system board #2. The transaction must go from the hostbridge to the SC on system board #1 (SC#1), then to the XBU, then to the SC on system board #2 (SC#2), then to the MAC on system board #2 (MAC#2). Transaction TB is a write from the same hostbridge to memory on the same system board. This transaction must go from the hostbridge to the SC on system board #1 (SC#1), then directly to the MAC on system board #1 (MAC#1). The following scenario shows how the transactions could get reordered while in flight:
In order for the hostbridge to maintain the strict PCI ordering rules, it is necessary for the hostbridge to wait until the first transaction completes before issuing the next transaction. Using the above example, if TA and TB must adhere to the PCI strict ordering rules, the scenario would look very different:
Therefore, the data writes can employ relaxed ordering; the descriptor must be strictly ordered so that it will not pass the data writes. Assuming the number and size of data transactions are much larger than descriptor updates, the system will see high write-to-memory performance when relaxed ordering is enabled on the data transactions.
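To get a feel for why strict ordering costs throughput, here is a toy timing model in Python. It is only a sketch: the latencies and the issue gap are made-up numbers in arbitrary time units, chosen so that the cross-board write (TA) takes longer than the local write (TB), and the model ignores everything about the real interconnect except "wait for completion" versus "issue back to back".

```python
ISSUE_GAP = 1  # hypothetical time between back-to-back issues

def completion_times(latencies, strict):
    """Return the completion time of each write, issued in list order.

    strict=True:  the next write is not issued until the previous one completes.
    strict=False: writes are issued back to back and may complete out of order.
    """
    issue_time = 0
    done = []
    for latency in latencies:
        finish = issue_time + latency
        done.append(finish)
        issue_time = finish if strict else issue_time + ISSUE_GAP
    return done

# Hypothetical latencies: TA crosses the XBU to the other board, TB stays local.
TA, TB = 10, 4
relaxed = completion_times([TA, TB], strict=False)
strict = completion_times([TA, TB], strict=True)
print("relaxed:", relaxed, "all done at", max(relaxed))
print("strict: ", strict, "all done at", max(strict))
```

In the relaxed case TB completes before TA (the reordering described above) and both writes finish sooner; in the strict case TB cannot even start until TA is done.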
An I/O device should only set the relaxed ordering bit in the TLP header if the device is smart enough to know which transactions can be reordered without causing data corruption. Unfortunately, we've encountered some devices which set the relaxed ordering bit incorrectly.
Even though the device hardware did not support (or did not enable) relaxed ordering, good throughput from these devices required that they allow relaxed ordering on the Jupiter Interconnect. To deal with this, Sun added a new flag, DDI_DMA_RELAXED_ORDERING, which allows a device driver to specify which DMA buffers may be relaxed ordered. We also modified the SAS and GBE drivers to tag data buffers with the DDI_DMA_RELAXED_ORDERING bit; control buffers were not tagged.
To enable relaxed ordering, a device driver must set the DDI_DMA_RELAXED_ORDERING in the dma_attr_flags in the ddi_dma_attr_t(9S) structure passed to ddi_dma_alloc_handle(9F). Per the ddi_dma_attr_t man page:
DDI_DMA_RELAXED_ORDERING
    This optional flag can be set if the DMA transactions
    associated with this handle are not required to observe
    strong DMA write ordering among each other, nor with DMA
    write transactions of other handles.
    It allows the host bridge to transfer data to and from
    memory more efficiently and may result in better DMA
    performance on some platforms.
For an example of driver code which uses the DDI_DMA_RELAXED_ORDERING flag to enable relaxed ordering on data buffers, see the bge driver on OpenSolaris.org:
/*
 * Enable PCI relaxed ordering only for RX/TX data buffers
 */
if (bge_relaxed_ordering)
        dma_attr.dma_attr_flags |= DDI_DMA_RELAXED_ORDERING;
Take, for example, the I/O architecture of a system board on a Sun SPARC Enterprise M9000 server:
[Figure: M9000 I/O Unit]
The above diagram shows that an M9000 I/O Unit has two I/O controller chips, each IOC has two hostbridges, and each hostbridge contains the root complexes for two PCI-Express slots.
The ideal situation is if all the cards enable relaxed ordering. On the other hand, let's say you have one legacy card that does not support relaxed ordering (perhaps it's a low-performance card where the vendor did not feel throughput, and therefore relaxed ordering, was important). If you put this low-performance card in, for example, slot 0 along with a high performance card that supports relaxed ordering in slot 1, both cards will share a single hostbridge and therefore a single Jupiter Interconnect interface. If the hostbridge has some strictly-ordered writes to memory from card 0, the relaxed-ordered writes from card 1 may queue up behind the strictly-ordered writes.
For comparison, here is the I/O Unit for an M4000/M5000:
[Figure: M4000/M5000 I/O Unit]
In this case, PCI-X slot 0 and PCI-E slots 1 and 2 all share a hostbridge, while PCI-E slots 3 and 4 share the other hostbridge (there is only one IOC with two hostbridges on an M4000/M5000 I/O Unit). While the hostbridge-to-memory latency is not as large on the M4000/M5000 systems, mixing cards that support relaxed ordering under the same hostbridge as cards that require strict ordering can impact I/O throughput. Note that the SAS controller and the Gigabit Ethernet controller already have relaxed ordering enabled using the DDI_DMA_RELAXED_ORDERING flag in their respective drivers.
To maximize write-to-memory throughput, it is best to group cards that do not enable relaxed ordering together below the same set of hostbridges, and group high-performance cards that enable relaxed ordering together below a different set of hostbridges. At the same time, you don't want to oversubscribe the hostbridge. The hostbridge can easily handle a single x8 PCI-Express link writing at its top bandwidth of about 1.7 GB/s; however, two high-performance x8 cards could be limited by the hostbridge's Jupiter interface bandwidth of 2.1 GB/s. Of course, the best arrangement of I/O cards may depend on other factors as well; relaxed ordering is just one thing to keep in mind when building a system.
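The oversubscription point is just a comparison of two numbers, sketched here in Python. The 1.7 GB/s and 2.1 GB/s figures are the approximate ones quoted above; treating every card as writing at full x8 bandwidth at the same moment is a worst-case assumption of mine.

```python
X8_CARD_GB_S = 1.7      # approximate top write bandwidth of one x8 card
HOSTBRIDGE_GB_S = 2.1   # Jupiter interface bandwidth of one hostbridge

def oversubscribed(x8_cards):
    """True if the cards' combined peak demand exceeds the hostbridge."""
    return x8_cards * X8_CARD_GB_S > HOSTBRIDGE_GB_S

for cards in (1, 2):
    print(cards, "card(s):", "oversubscribed" if oversubscribed(cards) else "ok")
```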
In the past (using the Sun Fire 6800 as an example, since I happen to have one handy), an SB could be configured into a domain, and the resources on that SB were identified to Solaris based on the board number; similarly, if you knew the resource id, you could infer the physical system board it is on. For example, if psrinfo in Solaris showed
% psrinfo
0       on-line   since 03/01/2007 12:16:43
1       on-line   since 03/01/2007 12:16:44
2       on-line   since 03/01/2007 12:16:44
3       on-line   since 03/01/2007 12:16:44
8       on-line   since 03/01/2007 12:16:44
9       on-line   since 03/01/2007 12:16:44
10      on-line   since 03/01/2007 12:16:44
11      on-line   since 03/01/2007 12:16:44
you could infer that your domain consisted of system boards 0 and 2 (the CPU IDs on an SB start at the SB number times 4, so SB0 contains CPU IDs 0 through 3, while SB2 contains CPU IDs 8 through 11). The PCI hostbridge bus addresses are assigned in a similar fashion. For example:
% ls -1d /devices/ssm@0,0/pci@*000
/devices/ssm@0,0/pci@18,600000
/devices/ssm@0,0/pci@18,700000
/devices/ssm@0,0/pci@19,600000
/devices/ssm@0,0/pci@19,700000
/devices/ssm@0,0/pci@1c,600000
/devices/ssm@0,0/pci@1c,700000
/devices/ssm@0,0/pci@1d,600000
/devices/ssm@0,0/pci@1d,700000
shows the hostbridges on I/O boards 6 and 8. (The math here is a bit more complex. I/O board numbers start at 6 with bus address 0x18, and each I/O board has two host bridges, so IB6 has pci@18 and pci@19, while IB8 has pci@1c and pci@1d.)
If a system board experienced a fault and needed to be replaced, or worse, the system board slot was at fault so you could not simply replace the system board, you could reconfigure the system from the System Controller to add CPUs, memory or IO from a different system board to restore the domain to full power. You could, for example, configure SB0 out of the domain, and configure SB1 into the domain. At that point, the domain would be running with CPU IDs 4 through 11 (4 through 7 on SB1, and 8 through 11 on SB2). Similarly, you could replace IB6 with IB7, and the PCI hostbridges would change from pci@18 and pci@19 to pci@1a and pci@1b.
That's all fine, unless your boot device was hanging off IB6. Even if you moved the boot device to IB7, the device paths would all be different. The boot device that was "/devices/ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@0,0:a" would change to "/devices/ssm@0,0/pci@1a,700000/pci@1/SUNW,isptwo@4/sd@0,0:a".
In effect, the CPU IDs and hostbridge bus addresses are physical addresses -- they are calculated based on the physical location of the board.
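The Sun Fire 6800 numbering rules described above fit in two small Python functions. The constants (four CPU IDs per SB, I/O boards starting at 6 with bus address 0x18, two hostbridges per I/O board) are the ones given in the text; the function names are hypothetical.

```python
def sb_cpu_ids(sb):
    """CPU IDs on a Sun Fire 6800 system board: they start at SB number * 4."""
    return list(range(sb * 4, sb * 4 + 4))

def ib_hostbridges(ib):
    """Hostbridge nodes on an I/O board: IB6 starts at 0x18, two per board."""
    base = 0x18 + (ib - 6) * 2
    return ["pci@%x" % base, "pci@%x" % (base + 1)]

print(sb_cpu_ids(0), sb_cpu_ids(2))
for ib in (6, 7, 8):
    print("IB%d:" % ib, ib_hostbridges(ib))
```

Swapping IB6 for IB7 changes the answer from pci@18/pci@19 to pci@1a/pci@1b, which is exactly why the boot device path changes.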
When an XSB is assigned to a domain, it is given a logical system board number. In effect, the LSB number is a virtual address. And as the Sun Fire 6800 assigns CPU IDs based on physical system board number, the Sun SPARC Enterprise assigns CPU IDs based on logical system board number. The same is true for hostbridge bus addresses. The CPUs on LSB 0 are assigned CPU IDs from the range 0 through 31, and the CPUs on LSB 1 are assigned CPU IDs from the range 32 through 63, regardless of the physical system board hosting the CPU chips. Similarly, the first hostbridge on LSB 0 is pci@0,600000, while the first hostbridge on LSB 1 is pci@10,600000, and so forth.
The mapping from LSB-to-XSB is user configurable; you can choose the LSB number for any XSB almost entirely at will, and LSB numbers can be re-used for different XSBs in different domains. As a result, it is possible that every domain in a chassis could have a CPU with cpuid 0. And every domain could have its boot device below /devices/pci@0,600000. You could have a domain that includes SB 0, 1 and 2 assigned as LSBs 0, 1 and 2, but for some reason it is necessary to replace SB 0. You could then assign SB 3 to the domain as LSB 0. The domain would continue to have cpuid 0 and /devices/pci@0,600000. If you move your boot device over to SB 3's I/O unit (either move the PCI-Express card, or simply move the internal SAS disk), you could boot the domain, and device paths and processor sets would remain unaffected.
If we use the analogy of virtual memory, the domain is the context, the LSB is the virtual address, and the SB (or more specifically, the XSB) is the physical address.
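The analogy can be made concrete with a small Python sketch of the SB replacement example above. What Solaris sees is computed from the LSB number only, so remapping LSB 0 from SB 0 to SB 3 changes nothing in the domain's view. The 32 cpuids per LSB come from the text; the "bus address is LSB times 0x10" rule is my inference from pci@0 and pci@10, and all names here are hypothetical.

```python
CPUIDS_PER_LSB = 32  # from the text: LSB 0 -> 0..31, LSB 1 -> 32..63

def first_cpuid(lsb):
    return lsb * CPUIDS_PER_LSB

def first_hostbridge(lsb):
    # Assumption: the bus address is the LSB number times 0x10.
    return "/devices/pci@%x,600000" % (lsb * 0x10)

def domain_view(lsb_to_sb):
    """What the domain sees: a function of LSB numbers only, never the SB."""
    return {lsb: (first_cpuid(lsb), first_hostbridge(lsb)) for lsb in lsb_to_sb}

before = {0: "SB 0", 1: "SB 1", 2: "SB 2"}  # LSB -> physical board
after = dict(before)
after[0] = "SB 3"                           # replace SB 0 with SB 3 as LSB 0

print(domain_view(after)[0])
print(domain_view(before) == domain_view(after))
```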
setupfru command:
XSCF> setupfru -x 1 sb 0
XSCF> setupfru -x 1 sb 1
The above example places SB 0 and SB 1 in uni-XSB mode, so all of the resources on a system board are assigned to domains in a single configuration unit. At this point, SB 0 is referred to as the single XSB 00-0; SB 1 is referred to as the single XSB 01-0.
setdcl, which stands for "set domain component list". Here are the commands to set up the domains:
# For domain 0, map LSB 0 to XSB 00-0 and LSB 1 to XSB 01-0
XSCF> setdcl -d 0 -a 0=00-0 1=01-0
# For domain 1, map LSB 15 to XSB 00-0 and LSB 0 to XSB 01-0
XSCF> setdcl -d 1 -a 15=00-0 0=01-0
The fact that there's an LSB-to-XSB mapping for a system board for a domain does not mean that the XSB is assigned to the domain. It only means, once the XSB is assigned to the domain, this is the LSB number it will get.
XSCF> # Assign XSB 00-0 to domain 0
XSCF> addboard -c assign -d 0 00-0
XSB#00-0 will be assigned to DomainID 0. Continue?[y|n] :y
XSCF> # Assign XSB 01-0 to domain 1
XSCF> addboard -c assign -d 1 01-0
XSB#01-0 will be assigned to DomainID 1. Continue?[y|n] :y
Once an XSB has been assigned to a domain, that domain owns the XSB; the XSB cannot be assigned to more than one domain. For example, if I tried to give XSB 00-0 to domain 1 after it has been assigned to domain 0:
# Try to assign XSB 00-0 to domain 1 also
XSCF> addboard -c assign -d 1 00-0
XSB#00-0 is already assigned to another domain.
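The distinction between "mapped" and "assigned" can be modeled in a few lines of Python. This is a toy model of the rules described above, not of the XSCF firmware: the class name Chassis is hypothetical, and its setdcl and addboard methods merely borrow the command names.

```python
class Chassis:
    def __init__(self):
        self.dcl = {}    # (domain, lsb) -> xsb: the mapping, many domains may share an XSB
        self.owner = {}  # xsb -> domain: the assignment, at most one domain per XSB

    def setdcl(self, domain, lsb, xsb):
        """Record which LSB number an XSB would get in this domain."""
        self.dcl[(domain, lsb)] = xsb

    def addboard(self, domain, xsb):
        """Assign an XSB to a domain; refuse if another domain owns it."""
        current = self.owner.get(xsb)
        if current is not None and current != domain:
            raise ValueError("XSB#%s is already assigned to another domain." % xsb)
        self.owner[xsb] = domain

chassis = Chassis()
for domain, lsb, xsb in [(0, 0, "00-0"), (0, 1, "01-0"), (1, 15, "00-0"), (1, 0, "01-0")]:
    chassis.setdcl(domain, lsb, xsb)
chassis.addboard(0, "00-0")
chassis.addboard(1, "01-0")
try:
    chassis.addboard(1, "00-0")
except ValueError as err:
    print(err)
```

Both domains keep a mapping for XSB 00-0, but only domain 0 owns it.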
We can use showboards to see what we've done:
XSCF> showboards -a
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
00-0 00(00)   Assigned    n    n    n    Unknown Normal
01-0 01(00)   Assigned    n    n    n    Unknown Normal
The above shows that XSB 00-0 is assigned to domain 0 as LSB 0, and XSB 01-0 is assigned to domain 1, also as LSB 0.
XSCF> console -yq -d 0
{0} ok show-disks
a) /pci@0,600000/pci@0/pci@8/pci@0/scsi@1/disk
q) NO SELECTION
Enter Selection, q to quit: q
{0} ok
{0} ok exit from console.
XSCF> console -yq -d 1
{0} ok show-disks
a) /pci@0,600000/pci@0/pci@8/pci@0/scsi@1/disk
q) NO SELECTION
Enter Selection, q to quit: q
{0} ok
The {0} shows that both domains have a cpuid 0. And the show-disks shows that both domains have a hostbridge pci@0,600000; in fact, both domains have the exact same device path to completely different SCSI controllers. QED.
The new Sun Studio 12 compilers have been optimized to produce the best performance for SPARC64-VI binaries. It is possible to achieve gains of 30% or more over binaries compiled with the Studio 11 compilers.
We recommend using the following options when compiling for SPARC64-VI:
-xchip=sparc64vi
Generate code that is tuned for SPARC64-VI. The binary will run on any SPARC processor, but will perform best on a SPARC64-VI system.
-xarch=sparcfmaf -fma=fused
Both of these options are necessary to enable the usage of the new fused multiply add instructions. The binary will only run on SPARC64-VI, and any future SPARC systems that support the fused multiply add instructions. These instructions improve the performance of some floating point programs.
-xtarget=sparc64vi
A combination of -xchip=sparc64vi and -xarch=sparcfmaf. It is still necessary to use -fma=fused to enable fused multiply add instructions.
Shared memory has long been a common method for inter-processor communication. Back in the 1980's I worked on embedded systems that used shared RAM to allow a host processor to communicate with digital signal processors (DSPs). The shared RAM was partitioned into mailboxes, with a pair of mailboxes per "application". The host processor would place a command in the incoming mailbox for a given DSP application and signal an interrupt to that DSP. The DSP interrupt service routine would check the mailboxes, find the new command, give it to the application for processing, and place the response in the outgoing mailbox, then interrupt the host processor.
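The mailbox handshake described above can be sketched in a few lines of Python. This is only an illustrative model, not the original embedded code: the class names (MailboxPair, Dsp, Host), the "filter" application and the string commands are all hypothetical, and the "interrupts" are plain method calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class MailboxPair:
    """One pair of shared-RAM mailboxes for a single DSP application."""
    incoming: Optional[str] = None  # host -> DSP command slot
    outgoing: Optional[str] = None  # DSP -> host response slot


class Dsp:
    def __init__(self, applications: Dict[str, Callable[[str], str]]):
        self.applications = applications  # application name -> command handler

    def interrupt(self, shared_ram: Dict[str, MailboxPair], host: "Host") -> None:
        """Interrupt service routine: scan the mailboxes for new commands."""
        for name, box in shared_ram.items():
            if box.incoming is None:
                continue
            # Take the command out of the incoming mailbox and process it.
            command, box.incoming = box.incoming, None
            box.outgoing = self.applications[name](command)
            # Tell the host that a response is waiting.
            host.interrupt(shared_ram, name)


class Host:
    def __init__(self) -> None:
        self.responses: Dict[str, str] = {}

    def send(self, shared_ram, dsp, name, command) -> None:
        """Place a command in the incoming mailbox, then interrupt the DSP."""
        shared_ram[name].incoming = command
        dsp.interrupt(shared_ram, self)

    def interrupt(self, shared_ram, name) -> None:
        """Host-side interrupt handler: collect the response and free the slot."""
        box = shared_ram[name]
        self.responses[name], box.outgoing = box.outgoing, None


# Hypothetical application that simply acknowledges each command.
shared_ram = {"filter": MailboxPair()}
dsp = Dsp({"filter": lambda command: "ack " + command})
host = Host()
host.send(shared_ram, dsp, "filter", "set-gain 3")
print(host.responses["filter"])  # prints "ack set-gain 3"
```

Note that every application needs its own mailbox pair and its own command format, which is where the scaling problem with this design comes from.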
| SP | Domain(s) |
| Application(s) | Application(s) |
| ioctls | ioctls |
| Mailbox Driver | Mailbox Driver |
| Hardware: Shared RAM, Interrupts |
While the architecture is efficient and effective, one weakness is that it doesn't scale with applications. As new applications are introduced (for example, Dynamic Reconfiguration or Fault Management), a new mailbox needs to be carved out, and a new protocol invented.
| SP | Domain(s) |
| Application(s) | Application(s) |
| Sockets API | Sockets API |
| ipv4 | ipv4 |
| ppp | ppp |
| tty Driver | tty Driver (dm2s) |
| Mailbox Driver | Mailbox Driver (scfd) |
| Hardware: Shared RAM, Interrupts |
setdscp -i NETWORK -m NETMASK
Choose a network address (be sure to pick a subnet that is not in use at your facility) and the corresponding netmask, and setdscp will do the rest. For example, in my lab the subnet 192.168.224.0 is unused, so I do:
XSCF> setdscp -i 192.168.224.0 -m 255.255.255.0
There are other ways to set up the DSCP network addresses, but this is really the best approach.
setdscp will assign an IP address to the SP, and reserve one IP address for every possible domain (the M9000-64 supports 24 domains, so a maximum of 25 IP addresses are reserved). A common question is: if you're running PPP between the SP and each domain, don't you need two addresses for each domain, one for the domain and one for the SP? No, not really. Since routing is done based on the destination address, we can get away with using the same IP address for the SP on every PPP link. So technically speaking, the NETWORK and NETMASK are not defining a DSCP subnet; they are defining a range of IP addresses from which DSCP selects endpoint addresses. A subtle difference, but still a difference.
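To make the "range of addresses" idea concrete, here is a small Python sketch that derives one SP address plus one address per domain from a NETWORK and NETMASK. The sequential ordering (SP first, then domains in ID order) is an assumption inferred from the showdscp output in this post, not a documented description of how setdscp picks addresses, and the function name dscp_address_plan is made up for illustration.

```python
import ipaddress
import itertools


def dscp_address_plan(network, netmask, domain_count):
    """Return a label -> address mapping: one SP address, one per domain."""
    pool = ipaddress.IPv4Network(f"{network}/{netmask}")
    needed = domain_count + 1  # one shared SP address plus one per domain
    usable = list(itertools.islice(pool.hosts(), needed))
    if len(usable) < needed:
        raise ValueError(f"{pool} cannot supply {needed} endpoint addresses")
    # Assumption: the SP takes the first usable address, domains follow in order.
    plan = {"XSCF": usable[0]}
    for domain_id in range(domain_count):
        plan[f"Domain #{domain_id:02d}"] = usable[domain_id + 1]
    return plan


for location, address in dscp_address_plan("192.168.224.0", "255.255.255.0", 4).items():
    print(f"{location:<10} {address}")
```

The same SP address appears on every PPP link, so the plan needs only domain_count + 1 addresses rather than two per domain.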
On the SP, showdscp will display the IP addresses assigned to each domain and the SP, for example:
XSCF> showdscp
DSCP Configuration:
Network: 192.168.224.0
Netmask: 255.255.255.0
Location Address
---------- ---------
XSCF 192.168.224.1
Domain #00 192.168.224.2
Domain #01 192.168.224.3
Domain #02 192.168.224.4
Domain #03 192.168.224.5
In Solaris, the prtdscp(1M) command will display the IP address of that domain and the SP (prtdscp is located in /usr/platform/SUNW,SPARC-Enterprise/sbin). You can get the same basic information from ifconfig sppp0:
% /usr/platform/SUNW,SPARC-Enterprise/sbin/prtdscp
Domain Address: 192.168.224.2
SP Address: 192.168.224.1
% ifconfig sppp0
sppp0: flags=10010008d1
From Solaris, you can ssh to the SP over the DSCP link using the SP address reported by prtdscp, for example:
ssh `/usr/platform/SUNW,SPARC-Enterprise/sbin/prtdscp -s`
Personally, I create an alias sshsp with the above line.
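If you would rather pick up the addresses from a script than from a shell alias, the two-line prtdscp output shown earlier is easy to parse. The sketch below is a hypothetical helper (parse_prtdscp is not part of Solaris); it works on captured text and assumes the "Label: address" layout shown in this post.

```python
def parse_prtdscp(output):
    """Turn 'Label: address' lines from prtdscp into a dictionary."""
    addresses = {}
    for line in output.splitlines():
        label, separator, value = line.partition(":")
        if separator:  # skip blank or unexpected lines
            addresses[label.strip()] = value.strip()
    return addresses


# Sample text copied from the prtdscp output shown in this post.
sample = """Domain Address: 192.168.224.2
SP Address: 192.168.224.1
"""
print(parse_prtdscp(sample)["SP Address"])  # prints 192.168.224.1
```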
On the SP side, you can't use ssh or scp directly -- they're not available in the XSCF shell. But you can use them indirectly. You can configure log archiving (see the setarchiving man page) to use one of the domains as an archive host:
XSCF> setarchiving -t rjh@192.168.224.2:/home/rjh/archive
XSCF> setarchiving enable
[I'm not sure it makes sense to use a domain as a log archive host -- a catastrophic failure with the system means you also lose your log archive host -- but it is technically possible.]
And when you need to take a snapshot of the system for diagnosis purposes (see the snapshot man page), you can specify one of the domains as the snapshot host using the -t option, for example:
XSCF> snapshot -l -t rjh@192.168.224.2:/home/rjh/snap
Downloading Public Key from '192.168.224.2'...
Public Key Fingerprint: 44:9a:ad:55:2e:33:99:2e:fd:b7:47:74:de:ad:be:ef
Accept this public key (yes/no)? yes
Enter ssh password for user 'rjh' on host '192.168.224.2':
Setting up ssh connection to rjh@192.168.224.2...
Collecting data into rjh@192.168.224.2:/home/rjh/snap/mymachine_10.4.55.144_2007-05-07T19-39-40.zip
Data collection complete
If your domain has internet access or a DVD burner, this might be the easiest way to get a snapshot back to a Sun Service Engineer.
Using PPP provides an added security benefit. Each shared RAM mailbox represents a single PPP connection between Solaris and the SP. This means there is no opportunity for one domain to snoop the traffic between another domain and the SP, and no way for one domain to directly attack another domain using the DSCP network. There is also no routing between DSCP networks (or from DSCP to Ethernet or vice versa) on the SP. The communication paths of each domain are physically isolated.
Most of the protocols used on the Sun SPARC Enterprise servers place the client on the SP and the server on the domain. This means that the SP does not need to open up well-known ports for incoming connections, reducing the opportunity for attacks. Furthermore, the servers running in Solaris use IPsec to authenticate that incoming connections are coming from the SP.
To prevent the domain from attacking the SP, several methods are used. First, all of the authentication and authorization protocols employed for Ethernet users are in place for the DSCP networks. There is no DSCP "back door", so to speak. Further, the SP employs a firewall that blocks all the ports on the DSCP networks except a couple -- ssh and ntp. There are additional features in place, for example, bandwidth limiting to prevent denial-of-service attacks.
The Sun SPARC Enterprise midrange and high-end servers, however, take system boards and hardware domains one step further. Physical system boards can be partitioned into four eXtended system boards (XSBs).
The Sun SPARC Enterprise M4000 can have up to four CPU chips and is organized as a single system board. The M5000 can have up to eight CPU chips and is organized as two system boards. On the M8000 and M9000, a CPU/Memory Unit (CMU) plus an I/O Unit (IOU) together form a system board. The M8000 can have up to four system boards, while the M9000-64 can have up to 16. When all of the resources on a system board are assigned to a domain as a single group, the system board is said to be in "Uni-XSB" mode. The following table shows the CPU, memory and I/O resources on the Sun SPARC Enterprise system boards in Uni-XSB mode:
[Table: CPU, memory and I/O resources per system board in Uni-XSB mode]
Normally on a Sun Fire system you would only be able to create as many domains as you have system boards. However, with the Sun SPARC Enterprise servers, you can configure each system board into four XSBs (quad-XSB mode). This allows you to create domains as small as a single CPU, 8 DIMMs, and I/O. To make it easier to map XSBs back to the physical SB, the number used for XSBs is xx-y where xx is the physical system board, and y is the XSB on that system board. For example, 01-2 would refer to the XSB containing CPU#2 on physical system board #1. The next table shows how the various resources are partitioned among the four XSBs per SB.
[Table: CPU, memory and I/O resources per XSB in Quad-XSB mode]
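The xx-y numbering is simple enough to handle mechanically. As a minimal sketch (the Xsb type and parse_xsb function are hypothetical helpers, not part of XSCF), an XSB number can be split back into its physical system board and its position on that board:

```python
from typing import NamedTuple


class Xsb(NamedTuple):
    board: int  # physical system board number (the xx part)
    unit: int   # XSB within that system board (the y part), 0 through 3

    def __str__(self):
        return f"{self.board:02d}-{self.unit}"


def parse_xsb(name):
    """Split an XSB number such as '01-2' into system board and XSB index."""
    board_text, separator, unit_text = name.partition("-")
    if not separator or not board_text.isdigit() or not unit_text.isdigit():
        raise ValueError(f"not an XSB number: {name!r}")
    xsb = Xsb(int(board_text), int(unit_text))
    if xsb.unit > 3:
        raise ValueError(f"a system board has only four XSBs: {name!r}")
    return xsb


xsb = parse_xsb("01-2")
print(xsb.board, xsb.unit, str(xsb))  # prints "1 2 01-2"
```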
Note in the above table that on M4000 and M5000 servers, XSB 0 gets the internal disks, DVD, Gigabit Ethernet, PCI-X slot and two PCI-Express slots. XSB 1 gets two PCI-Express slots. XSBs 2 and 3 have no I/O. This is a physical limitation -- the M4000 and M5000 I/O units only have two PCI-Express hostbridges. So, while in theory you could create eight domains with a single CPU each, in reality a domain needs I/O so you can only create four hardware domains in an M5000; two domains in an M4000.
The M8000 and M9000 system boards, on the other hand, have symmetric XSBs -- each system board has four CPUs, 32 DIMMs, four PCI-Express hostbridges and 8 PCI-Express slots. When they're placed in quad-XSB mode, each XSB has one CPU, 8 DIMMs, one PCI-Express hostbridge and two PCI-Express slots. So a SPARC Enterprise M8000 with 4 system boards can effectively be split into 16 domains.
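The domain counts quoted above follow from counting the XSBs that actually have I/O. A rough sketch of that arithmetic, using only the figures given in this post (the function name and the platform_limit parameter are illustrative assumptions):

```python
# XSBs per system board that have I/O in quad-XSB mode, as described above:
# two on the M4000/M5000 (XSBs 0 and 1), all four on the M8000/M9000.
IO_XSBS_PER_BOARD = {"M4000": 2, "M5000": 2, "M8000": 4, "M9000": 4}


def max_quad_xsb_domains(model, system_boards, platform_limit=None):
    """Upper bound on hardware domains when every board is in quad-XSB mode."""
    domains = IO_XSBS_PER_BOARD[model] * system_boards
    if platform_limit is not None:
        # The platform may support fewer domains than it has XSBs with I/O.
        domains = min(domains, platform_limit)
    return domains


print(max_quad_xsb_domains("M4000", 1))  # prints 2
print(max_quad_xsb_domains("M5000", 2))  # prints 4
print(max_quad_xsb_domains("M8000", 4))  # prints 16
```

On the M9000-64 the 24-domain platform limit, not the XSB count, is the binding constraint, which is what platform_limit is meant to model.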
For example, with an M5000, you could place system board 00 in quad-XSB mode, and system board 01 in uni-XSB mode. Then you can create one domain with XSBs 00-0 and 00-1 (call this the "green" domain), and a second domain with 00-2, 00-3 and all of 01 (call this the "blue" domain). Here's what that would look like:
| SB | XSB | CPUs | Memory | I/O |
| 00 | 00-0 | CPU#0 | 8 DIMMs | 2 SAS Disks, DVD/DAT, 2 GBE Ports, PCI-X Slot#0, PCI-E Slot#1, PCI-E Slot#2 |
| 00 | 00-1 | CPU#1 | 8 DIMMs | PCI-E Slot#3, PCI-E Slot#4 |
| 00 | 00-2 | CPU#2 | 8 DIMMs | No I/O |
| 00 | 00-3 | CPU#3 | 8 DIMMs | No I/O |
| 01 | 01-0 | CPU#0, CPU#1, CPU#2, CPU#3 | 32 DIMMs | 2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slot#1, PCI-E Slot#2, PCI-E Slot#3, PCI-E Slot#4 |
The green domain could have 2 CPUs, 16 DIMMs, and lots of I/O, while the blue domain could have 6 CPUs, 48 DIMMs, and lots of I/O.
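Those totals are just sums over the XSBs in each domain. A quick sketch to check the arithmetic, using the per-XSB figures from the example (the label 01-0 for the uni-XSB board is an assumption that follows the showboards naming; the dictionary and function names are made up):

```python
# (CPUs, DIMMs) contributed by each XSB in the M5000 example.
XSB_RESOURCES = {
    "00-0": (1, 8),
    "00-1": (1, 8),
    "00-2": (1, 8),
    "00-3": (1, 8),
    "01-0": (4, 32),  # system board 01 in uni-XSB mode (assumed label)
}

DOMAINS = {
    "green": ["00-0", "00-1"],
    "blue": ["00-2", "00-3", "01-0"],
}


def domain_totals(xsbs):
    """Sum the CPUs and DIMMs of the XSBs assigned to one domain."""
    cpus = sum(XSB_RESOURCES[xsb][0] for xsb in xsbs)
    dimms = sum(XSB_RESOURCES[xsb][1] for xsb in xsbs)
    return cpus, dimms


for name, xsbs in DOMAINS.items():
    cpus, dimms = domain_totals(xsbs)
    print(f"{name}: {cpus} CPUs, {dimms} DIMMs")
```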
There are some down-sides to using quad-XSB mode. The primary issue is availability in the face of hardware failures. On an M4000 or M5000 there are two SC chips (officially, these are called "system controller" ASICs; however, due to potential confusion with the Sun Fire System Controllers, I like to just call them SC chips); the M8000/M9000 system board has four SC chips. The SC chips connect the CPUs, memory and I/O on the system board, and connect the system board to the system crossbar (or in the case of the M5000, the SCs on one system board connect directly to the SCs on the other system board). The SC chips are shared by all XSBs on a system board. If a system board is in uni-XSB mode and there's a fault internal to an SC chip, the system board (and the domain using that system board) may take a fatal error and be reset. If a system board is in quad-XSB mode, an SC fault may require the entire system board to be reset, which would reset all domains using XSBs on that system board.
Using the M5000 example above, if the system experienced a fatal error in CPU#0 on system board 00 (XSB 00-0), only the green domain would be reset. However, if one of the SC chips on system board 00 experiences a fault, then all XSBs on system board 00 are affected; both the blue and the green domains would be reset as a result.
On the other hand, XSBs do offer a great deal of flexibility. With an M4000, which only has one system board, you can create two domains, something you could never do with a Sun Fire 6900/25K with only one system board. On larger systems, you have the flexibility of configuring domains down to the CPU level, rather than at a system board level. If the impact of losing two domains due to a hardware failure is acceptable, then quad-XSB mode offers unprecedented flexibility and configurability.
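The availability trade-off in the M5000 example can be expressed as a simple lookup: a CPU fault takes down the domain that owns that XSB, while an SC fault takes down every domain with an XSB on that system board. This is a simplified sketch of that reasoning, not a model of the real fault handling; the names are hypothetical and 01-0 is the assumed label for the uni-XSB board.

```python
# Which domain owns each XSB in the M5000 example.
DOMAIN_OF_XSB = {
    "00-0": "green",
    "00-1": "green",
    "00-2": "blue",
    "00-3": "blue",
    "01-0": "blue",  # system board 01 in uni-XSB mode (assumed label)
}


def domains_reset_by_cpu_fault(xsb):
    """A fatal CPU error only resets the domain that owns that XSB."""
    return {DOMAIN_OF_XSB[xsb]}


def domains_reset_by_sc_fault(board):
    """An SC fault resets every domain with an XSB on that system board."""
    prefix = f"{board:02d}-"
    return {domain for xsb, domain in DOMAIN_OF_XSB.items() if xsb.startswith(prefix)}


print(sorted(domains_reset_by_cpu_fault("00-0")))  # prints ['green']
print(sorted(domains_reset_by_sc_fault(0)))        # prints ['blue', 'green']
print(sorted(domains_reset_by_sc_fault(1)))        # prints ['blue']
```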