Reflections on OS integration Eric Schrock's Weblog
Musings about Fishworks, Operating Systems, and the software that runs on them.

Tuesday Jun 02, 2009

Last week, we announced the Sun Storage 7310 system. At the same time, a less significant but still notable change was made to the Sun Storage product line: the Sun Storage 7210 system is now expandable via J4500 JBODs. These JBODs have the same form factor as the 7210: 48 drives in a top-loading 4U chassis. Up to two J4500s can be added to a 7210 system via a single HBA, resulting in up to 142 TB of storage in 12 RU of space. JBODs with 500G or 1T drives are supported.

For customers looking for maximum density without high availability, the combination of 7210 with J4500 provides a perfect solution.

Thursday Nov 20, 2008

In the past, I've discussed the evolution of disk FMA. Much has been accomplished in the past year, but there are still several gaps when it comes to ZFS and disk faults. In Solaris today, a fault diagnosed by ZFS (a device failing to open, too many I/O errors, etc) is reported as a pool name and 64-bit vdev GUID. This description leaves something to be desired, referring the user to run zpool status to determine exactly what went wrong. But the user still has to know how to go from a cXtYdZ Solaris device name to a physical device, and when they do locate the physical device they need to manually issue a zpool replace command to initiate the replacement.

While this is annoying in the Solaris world, it's completely untenable in an appliance environment, where everything needs to "just work". With that in mind, I set about to plug the last few holes in the unified plan:

  • ZFS faults must be associated with a physical disk, including the human-readable label
  • A disk fault (ZFS or SMART failure) must turn on the associated fault LED
  • Removing a disk (faulted or otherwise) and replacing it with a new disk must automatically trigger a replacement

While these seem like straightforward tasks, as usual they are quite difficult to get right in a truly generic fashion. And for an appliance, there can be no Solaris commands or additional steps for the user. To start with, I needed to push the FRU information (expressed as a FMRI in the hc libtopo scheme) into the kernel (and onto the ZFS disk label) where it would be available with each vdev. While it is possible to do this correlation after the fact, it simplifies the diagnosis engine and is required for automatic device replacement. There are some edge conditions around moving and swapping disks, but it was relatively straightforward. Once the FMRI was there, I could include the FRU in the fault suspect list, and using Mike's enhancements to libfmd_msg, dynamically insert the FRU label into the fault message. Traditional FMA libtopo labels do not include the chassis label, so in the Fishworks stack we go one step further and re-write the label on receipt of a fault event with the user-defined chassis name as well as the physical slot. This message is then used when posting alerts and on the problems page. We can also link to the physical device from the problems page, and highlight the faulty disk in the hardware view.

With the FMA plumbing now straightened out, I needed a way to light the fault LED for a disk, regardless of whether it was in the system chassis or an external enclosure. Thanks to Rob's sensor work, libtopo already presents a FMRI-centric view of indicators in a platform agnostic manner. So I rewrote the disk-monitor module (or really, deleted everything and created a new fru-monitor module) so that it both polls for FRU hotplug events and manages the fault LEDs for components. When a fault is generated, the FRU monitor looks through the suspect list, and turns on the fault LED for any component that has a supported indicator. This is then turned off when the corresponding repair event is generated. This also had the side benefit of generating hotplug events phrased in terms of physical devices, which the appliance kit can use to easily present informative messages to the user.
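
The fault and repair handling described above can be sketched roughly as follows. This is only an illustration in Python, not the fru-monitor module itself (which is an fmd plugin); the class name, the method names, and the idea of tracking open fault UUIDs per FRU are all assumptions made for the example.

```python
# Hypothetical sketch of fault/repair LED handling. None of these names
# come from fmd or libtopo; they exist only for illustration.

class FruMonitor:
    def __init__(self, frus_with_indicator):
        # FRU FMRIs for which the topology exposes a fault indicator.
        self.supported = set(frus_with_indicator)
        # FRU FMRI -> set of open fault UUIDs that keep its LED lit.
        self.lit = {}

    def on_fault(self, uuid, suspects):
        """Light the fault LED for every suspect that has a supported indicator."""
        for fru in suspects:
            if fru in self.supported:
                self.lit.setdefault(fru, set()).add(uuid)

    def on_repair(self, uuid):
        """Forget a repaired fault; an LED goes off once no open fault names its FRU."""
        for fru in list(self.lit):
            self.lit[fru].discard(uuid)
            if not self.lit[fru]:
                del self.lit[fru]

    def led_on(self, fru):
        return fru in self.lit
```

Tracking the set of open faults per FRU (rather than a simple on/off flag) is a design choice of this sketch: it keeps the LED lit if two separate faults implicate the same component and only one of them has been repaired.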

Finally, I needed to get disk replacement to work like everyone expects it to: remove a faulted disk, put in a new one, and walk away. The genesis of this functionality was putback to ON long ago as the autoreplace pool property. In Solaris, this functionality only works with disks that have static device paths (namely SATA). In the world of multipathed SAS devices, the device path is really a scsi_vhci node identified by the device WWN. If we remove a disk and insert a new one, it will appear as a new device with no way to correlate it to the previous instance, preventing us from replacing the correct vdev. What we need is physical slot information, which happens to be provided by the FMRI we are already storing with the vdev for FMA purposes. When we receive a sysevent for a new device addition, we look at the latest libtopo snapshot and take the FMRI of the newly inserted device. By looking at the current vdev FRU information, we can then associate this with the vdev that was previously in the slot, and automatically trigger the replacement.
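
The slot correlation can be illustrated with a small Python sketch. It assumes that the slot is identified by the chassis-id in the FMRI authority plus the hc path (for example ses-enclosure=0/bay=3/disk=0), while disk-specific authority fields such as serial, part, and revision are ignored. The function names and the shape of the vdev table are hypothetical; the real logic lives in the Solaris and Fishworks code and is not shown in this post.

```python
# Hypothetical sketch: match a newly inserted disk to the vdev that
# previously occupied the same physical slot, using hc-scheme FMRIs.

def slot_key(fmri):
    """Reduce an hc FMRI to (chassis-id, path), dropping disk-specific fields."""
    if not fmri.startswith("hc://"):
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)
    authority, _, path = fmri[len("hc://"):].partition("/")
    fields = dict(f.split("=", 1) for f in authority.split(":") if "=" in f)
    return fields.get("chassis-id"), path

def find_vdev_to_replace(new_disk_fmri, vdev_frus):
    """vdev_frus maps a vdev GUID to the FRU FMRI stored with that vdev."""
    key = slot_key(new_disk_fmri)
    for guid, old_fmri in vdev_frus.items():
        if slot_key(old_fmri) == key:
            return guid
    return None
```

A disk with a different serial number inserted into the same bay yields the same key as the old FRU, so the lookup returns the GUID of the vdev to replace; a disk inserted into a bay that no vdev ever occupied returns None.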

This process took a lot longer than I would have hoped, and has many more subtleties too boring even for a technical blog entry, but it is nice to sit back and see a user experience that is intuitive, informative, and straightforward - the hallmarks of an integrated appliance solution.

Wednesday Nov 12, 2008

Since our initial product was going to be a NAS appliance, we knew early on that storage configuration would be a critical part of the initial Fishworks experience. Thanks to the power of ZFS storage pools, we have the ability to present a radically simplified interface, where the storage "just works" and the administrator doesn't need to worry about choosing RAID stripe widths or statically provisioning volumes. The first decision was to create a single storage pool (or really one per head in a cluster) [1], which means that the administrator only needs to make this decision once, and doesn't have to worry about it every time they create a filesystem or LUN.

Within a storage pool, we didn't want the user to be in charge of making decisions about RAID stripe widths, hot spares, or allocation of devices. This was primarily to avoid this complexity, but also represents the fact that we (as designers of the system) know more about its characteristics than you. RAID stripe width affects performance in ways that are not immediately obvious. Allowing for JBOD failure requires careful selection of stripe widths. Allocation of devices can take into account environmental factors (balancing HBAs, fan groups, backplane distribution) that are unknown to the user. To make this easy for the user, we pick several different profiles that define parameters that are then applied to the current configuration to figure out how the ZFS pool should be laid out.

Before selecting a profile, we ask the user to verify the storage that they want to configure. On a standalone system, this is just a check to make sure nothing is broken. If there is a broken or missing disk, we don't let you proceed without explicit confirmation. The reason we do this is that once the storage pool is configured, there is no way to add those disks to the pool without changing the RAS and performance characteristics you specified during configuration. On a 7410 with multiple JBODs, this verification step is slightly more complicated, as we allow adding of whole or half JBODs. This step is where you can choose to allocate half or all of the JBOD to a pool, allowing you to split storage in a cluster or reserve unused storage for future clustering options.

Fundamentally, the choice of redundancy is a business decision. There is a set of tradeoffs that express your tolerance of risk and relative cost. As Jarod told us very early on in the project: "fast, cheap, or reliable - pick two." We took this to heart, and our profiles are displayed in a table with qualitative ratings on performance, capacity, and availability. To further help make a decision, we provide a human-readable description of the layout, as well as a pie chart showing the way raw storage will be used (data, parity, spares, or reserved). The last profile parameter is called "NSPF," for "no single point of failure." If you are on a 7410 with multiple JBODs, some profiles can be applied across JBODs such that the loss of any one JBOD cannot cause data loss [2]. This often forces arbitrary stripe widths (with 6 JBODs your only choice is 10+2) and can result in less capacity, but with superior RAS characteristics.
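
One way to see where the 10+2 figure can come from is the following Python sketch. It rests on an assumption that is not spelled out in this post: that a stripe survives the loss of a JBOD only if no JBOD holds more disks of that stripe than there are parity disks. Under that assumption, double parity across 6 JBODs allows at most 6 x 2 = 12 disks per stripe. The function names are hypothetical and this is not the appliance's layout code.

```python
from collections import Counter

def widest_nspf_stripe(jbods, parity=2):
    """Return (data, parity) for the widest stripe that tolerates losing one JBOD,
    assuming at most `parity` disks of a stripe may sit in any single JBOD."""
    width = jbods * parity
    return width - parity, parity

def survives_jbod_loss(stripe_jbods, parity=2):
    """stripe_jbods lists, for each disk in one stripe, the JBOD that holds it."""
    return max(Counter(stripe_jbods).values()) <= parity

print(widest_nspf_stripe(6))  # (10, 2)
```

The second function checks a concrete placement: a 12-disk stripe with two disks in each of six JBODs passes, while putting a third disk of the same stripe in one JBOD fails.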

This configuration takes just two quick steps, and for the common case (where all the hardware is working and the user wants double parity RAID), it just requires clicking on the "DONE" button twice. We also support adding additional storage (on the 7410), as well as unconfiguring and importing storage. I'll leave a complete description of the storage configuration screen for a future entry.


[1] A common question we get is "why allow only one storage pool?" The actual implementation clearly allows it (as in the failed over active-active cluster), so it's purely an issue of complexity. There is never a reason to create multiple pools that share the same redundancy profile - this provides no additional value at the cost of significant complexity. We do acknowledge that mirroring and RAID-Z provide different performance characteristics, but we hope that with the ability to turn on and off readzilla and (eventually) logzilla usage on a per-share basis, this will be less of an issue. In the future, you may see support for multiple pools, but only in a limited fashion (i.e. enforcing different redundancy profiles).

[2] It's worth noting that all supported configurations of the 7410 have multiple paths to all JBODs across multiple HBAs. So even without NSPF, we have the ability to survive HBA, cable, and JBOD controller failure.

Tuesday Nov 11, 2008

With any product, there is always some talk from the enthusiasts about how they could do it faster, cheaper, or simpler. Inevitably, there's a little bit of truth to both sides. Enthusiasts have been doing homebrew NAS for as long as free software has been around, but it takes far more work to put together a complete, polished solution that stands up under the stress of an enterprise environment.

One of the amusing things I like to do is to look back at the total amount of source code we wrote. Lines of source code by itself is obviously not a measure of complexity - it's possible to write complex software with very few lines of source, or simple software that's over engineered - but it's an interesting measure nonetheless. Below is the current output of a little script I wrote to count lines of code [1] in our fish-gate. This does not include the approximately 40,000 lines of change made to the ON (core Solaris) gate, most of which we'll be putting back gradually over the next few months.

C (libak)                 185386        # The core of the appliance kit
C (lib)                    12550        # Other libraries
C (fcc)                    11167        # A compiler adapted from dtrace
C (cmd)                    12856        # Miscellaneous utilities
C (uts)                     4320        # clustron driver
-----------------------   ------
Total C                   226279

JavaScript (web)           69329        # Web UI
JavaScript (shell)         24227        # CLI shell
JavaScript (common)         9354        # Shared javascript
JavaScript (crazyolait)     2714        # Web transport layer (adapted from jsolait)
JavaScript (tst)           40991        # Automated test code
-----------------------   ------
Total Javascript          146615

Shell (lib)                 4179        # Support scripts (primarily SMF methods)
Shell (cmd)                 5295        # Utilities
Shell (tools)               6112        # Build tools
Shell (tst)                 6428        # Automated test code
-----------------------   ------
Total Shell                22014

Python (tst)               34106        # Automated test code
XML (metadata)             16975        # Internal metadata
CSS                         6124        # Stylesheets

[1] This is a raw line count. It includes blank lines and comments, so interpret it as you see fit.
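
For readers who want to produce a similar raw count on their own tree, here is a minimal Python sketch. It is not the script used for the numbers above; the extension-to-language map is an assumption, and it groups only by language rather than by language and directory as the table does.

```python
import os
from collections import Counter

# Hypothetical mapping; adjust to the tree being measured.
LANGUAGES = {
    ".c": "C", ".h": "C",
    ".js": "JavaScript",
    ".sh": "Shell",
    ".py": "Python",
    ".xml": "XML",
    ".css": "CSS",
}

def count_lines(root):
    """Return raw line counts per language, blank lines and comments included."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lang = LANGUAGES.get(os.path.splitext(name)[1])
            if lang is None:
                continue
            # Read as bytes so odd encodings do not abort the count.
            with open(os.path.join(dirpath, name), "rb") as f:
                totals[lang] += sum(1 for _ in f)
    return totals
```

Calling count_lines on a source directory and printing the result sorted by count gives a rough per-language view in the same spirit as the table.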

Monday Nov 10, 2008

It's hard to believe that this day has finally come. After more than two and a half years, our first Fishworks-based product has been released. You can keep up to date with the latest info at the Fishworks blog.

For my first technical post, I thought I'd give an introduction to the chassis subsystem at the heart of our hardware integration strategy. This subsystem is responsible for gathering, cataloging, and presenting a unified view of the hardware topology. It underwent two major rewrites (one by myself and one by Keith) but the fundamental design has remained the same. While it may not be the most glamorous feature (no one's going to purchase a box because they can get model information on their DIMMs), I found it an interesting cross-section of disparate technologies and awash in subtle complexity. You can find a video of myself talking about and demonstrating this feature here.

libtopo discovery

At the heart of the chassis subsystem is the FMA topology as exported by libtopo. This library is already capable of enumerating hardware in a physically meaningful manner, and FMRIs (fault managed resource identifiers) form the basis of FMA fault diagnosis. This alone provides us the following basic capabilities:

  • Discover external storage enclosures
  • Identify bays and disks
  • Identify CPUs
  • Identify power supplies and fans
  • Manage LEDs
  • Identify PCI functions beneath a particular slot

Much of this requires platform-specific XML files, or leverages IPMI behind the scenes, but this minimal integration work is common to Solaris. Any platform supported by Solaris is supported by the Fishworks software stack.

Additional metadata

Unfortunately, this falls short of a complete picture:

  • No way to identify absent CPUs, DIMMs, or empty PCI slots
  • DIMM enumeration not supported on all platforms
  • Human-readable labels often wrong or missing
  • No way to identify complete PCI cards
  • No integration with visual images of the chassis

To address these limitations (most of which lie outside the purview of libtopo), we leverage additional metadata for each supported chassis. This metadata identifies all physical slots (even those that may not be occupied), cleans up various labels, and includes visual information about the chassis and its components. And we can identify physical cards based on devinfo properties extracted from firmware and/or the pattern of PCI functions and their attributes (a process worthy of its own blog entry). Combined with libtopo, we have images that we can assemble into a complete view based on the current physical layout, highlight components within the image, and respond to user mouse clicks.
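
The role of the per-chassis metadata in reporting empty slots can be sketched as follows. This Python fragment assumes the metadata is a list of (slot path, label) pairs covering every physical slot and that discovery yields a mapping from slot path to component details; both shapes, and the function name, are hypothetical.

```python
# Hypothetical sketch: overlay discovered components on the static list of
# physical slots so that empty slots still appear in the chassis view.

def chassis_view(metadata_slots, discovered):
    """metadata_slots: list of (slot_path, label) for every physical slot.
    discovered: dict mapping slot_path to details of the component found there."""
    view = []
    for path, label in metadata_slots:
        view.append({
            "slot": path,
            "label": label,                 # cleaned-up label from metadata
            "present": path in discovered,  # False for an empty slot
            "info": discovered.get(path),
        })
    return view
```

Because the loop is driven by the metadata rather than by what was discovered, a slot with nothing in it still produces an entry, marked as not present.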

Supplemental information

However, we are still missing many of the component details. Our goal is to be able to provide complete information for every FRU on the system. With just libtopo, we can get this for disks but not much else. We need to look to alternate sources of information.

kstat

For CPUs, there is a rather rich set of information available via traditional kstat interfaces. While we use libtopo to identify CPUs (it lets us correlate physical CPUs), the bulk of the information comes from kstats. This is used to get model, speed, and the number of cores.

libdevinfo

The device tree snapshot provides additional information for PCI devices that can only be retrieved by private driver interfaces. Despite the existence of a VPD (Vital Product Data) standard, effectively no vendors implement it. Instead, it is read by some firmware-specific mechanism private to the driver. By exporting these as properties in the devinfo snapshot, we can transparently pull in dynamic FRU information for PCI cards. This is used to get model, part, and revision information for HBAs and 10G NICs.

IPMI

IPMI (Intelligent Platform Management Interface) is used to communicate with the service processor on most enterprise class systems. It is used within libtopo for power supply and fan enumeration as well as LED management. But IPMI also supports FRU data, which includes a lot of juicy tidbits that only the SP knows. We reference this FRU information directly to get model and part information for power supplies and DIMMs.

SMBIOS

Even with IPMI, there are bits of information that exist only in SMBIOS, a standard that is supposed to provide information about the physical resources on the system. Sadly, it does not provide enough information to correlate OS-visible abstractions with their underlying physical counterparts. With metadata, however, we can use SMBIOS to make this correlation. This is used to enumerate DIMMs on platforms not supported by libtopo, and to supplement DIMM information with data available only via SMBIOS.

Metadata

Last but not least, there is chassis-specific metadata. Some components simply don't have FRUID information, either because they are too simple (fans) or there exists no mechanism to get the information (most PCI cards). In this situation, we use metadata to provide vendor, model, and part information as that is generally static for a particular component within the system. We cannot get information specific to the component (such as a serial number), but at least the user will be able to know what it is and know how to order another one.

Putting it all together

With all of this information tied together under one subsystem, we can finally present the user complete information about their hardware, including images showing the physical layout of the system. In addition, this also forms the basis for reporting problems and analytics (using labels from metadata), manipulating chassis state (toggling LEDs, setting chassis identifiers), and making programmatic distinctions about the hardware (such as whether external HBAs are present). Over the next few weeks I hope to expound on some of these details in further blog posts.
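
As a rough picture of how details from several sources might be combined for one FRU, consider this Python sketch. The precedence rule (dynamic sources first, static metadata as a fallback) is an assumption made for the example, as are the function name and the field names; the post does not describe the actual merging logic.

```python
# Hypothetical sketch: combine FRU details from several sources, letting
# earlier (more specific) sources take precedence over later ones.

def merge_fru_info(*sources):
    """Each source is a dict of field -> value; empty values are skipped."""
    merged = {}
    for source in sources:
        for field, value in source.items():
            if value and field not in merged:
                merged[field] = value
    return merged

# Example: dynamic data for a power supply plus static chassis metadata.
ipmi = {"model": "PSU-A", "part": "", "serial": "12345"}
metadata = {"vendor": "Sun", "model": "Generic PSU", "part": "300-0000"}
psu = merge_fru_info(ipmi, metadata)
```

Here the model and serial number come from the dynamic source, while the vendor and part number fall through to the metadata because the dynamic source did not supply them.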

Thursday Aug 07, 2008

Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new sensor abstraction layer in libtopo. Rob did most of the hard work; my contribution consisted only of extending the SES enumerator to support the new facility infrastructure.

You can find a detailed description of the changes in the original FMA portfolio here, but it's much easier to understand via demonstration. This is the fmtopo output for a fan node in a J4400 JBOD:

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
    label             string    Cooling Fan  0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f
    target-path       string    /dev/es/ses3

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x1 (LOCATE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x0 (SERVICE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x4 (FAN)
    units             uint32    0x12 (RPM)
    reading           double    3490.000000
    state             uint32    0x0 (0x00)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    discrete
    type              uint32    0x103 (GENERIC_STATE)
    state             uint32    0x1 (DEASSERTED)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

Here you can see the available indicators (locate and service), the fan speed (3490 RPM) and if the fan is faulted. Right now this is just interesting data for savvy administrators to play with, as it's not used by any software. But that will change shortly, as we work on the next phases:

  • Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it's in the system chassis or an external enclosure.
  • Generalization of the disk-monitor fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.
  • Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are against a "vdev" - a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it's possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.
  • Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.

Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.

Sunday Jul 13, 2008

Over the past few years, I've been working on various parts of Solaris platform integration, with an emphasis on disk monitoring. While the majority of my time has been focused on fishworks, I have managed to implement a few more pieces of the original design.

About two months ago, I integrated the libscsi and libses libraries into Solaris Nevada. These libraries, originally written by Keith Wesolowski, form an abstraction layer upon which higher level software can be built. The modular nature of libses makes it easy to extend with vendor-specific support libraries in order to provide additional information and functionality not present in the SES standard, something difficult to do with the kernel-based ses(7d) driver. And since it is written in userland, it is easy to port to other operating systems. This library is used as part of the fwflash firmware upgrade tool, and will be used in future Sun storage management products.

While libses itself is an interesting platform, its true raison d'etre is to serve as the basis for enumeration of external enclosures as part of libtopo. Enumeration of components in a physically meaningful manner is a key component of the FMA strategy. These components form FMRIs (fault managed resource identifiers) that are the target of diagnoses. These FMRIs provide a way of not just identifying that "disk c1t0d0 is broken", but that this device is actually in bay 17 of the storage enclosure whose chassis serial number is "2029QTF0809QCK012". In order to do that effectively, we need a way to discover the physical topology of the enclosures connected to the system (chassis and bays) and correlate it with the in-band I/O view of the devices (SAS addresses). This is where SES (SCSI enclosure services) comes into play. SES processes show up as targets in the SAS fabric, and by using the additional element status descriptors, we can correlate physical bays with the attached devices under Solaris. In addition, we can also enumerate components not directly visible to Solaris, such as fans and power supplies.

The SES enumerator was integrated in build 93 of Nevada, and all of these components now show up in the libtopo hardware topology (commonly referred to as the "hc scheme"). To do this, we walk over all the SES targets visible to the system, grouping targets into logical chassis (something that is not as straightforward as it should be). We use this list of targets and a snapshot of the Solaris device tree to fill in which devices are present on the system. You can see the result by running fmtopo on a build 93 or later Solaris machine:

# /usr/lib/fm/fmd/fmtopo
...

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:serial=2029QTF0000000002:part=Storage-J4400:revision=3R13/ses-enclosure=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=2

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=3

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0386:part=375-3584-01/ses-enclosure=0/controller=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0074:part=375-3584-01/ses-enclosure=0/controller=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=1

...

To really get all the details, you can use the '-V' option to fmtopo to dump all available properties:

# fmtopo -V '*/ses-enclosure=0/bay=0/disk=0'
TIME                 UUID
Jul 14 03:54:23 3e95d95f-ce49-4a1b-a8be-b8d94a805ec8

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
    ASRU              fmri      dev:///:devid=id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________//scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
    label             string    SCSI Device  0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0809QCK012
    server-id         string    
  group: io                             version: 1   stability: Private/Private
    devfs-path        string    /scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
    devid             string    id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________
    phys-path         string[]  [ /pci@0,0/pci10de,377@a/pci1000,3150@0/disk@1c,0 /pci@0,0/pci10de,375@f/pci1000,3150@0/disk@1c,0 ]
  group: storage                        version: 1   stability: Private/Private
    logical-disk      string    c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
    manufacturer      string    SEAGATE
    model             string    ST37500NSSUN750G 0720A0PC3X
    serial-number     string    5QD0PC3X            
    firmware-revision string       3.AZK
    capacity-in-bytes string    750156374016
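
The hc-scheme FMRI strings in this output have a regular shape: an authority section of colon-separated key=value fields, a path of name=instance elements, and (for sensors and indicators) an optional ?type=name suffix. The following Python sketch splits a string along those lines. It is based only on the strings shown in these posts, not on the libtopo parser, and the function name is hypothetical.

```python
# Hypothetical sketch: split an hc-scheme FMRI, as printed by fmtopo, into
# its authority fields, its path elements, and an optional facility suffix.

def parse_hc_fmri(fmri):
    if not fmri.startswith("hc://"):
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)
    rest, _, facility = fmri[len("hc://"):].partition("?")
    authority_str, _, path_str = rest.partition("/")

    authority = {}
    for field in authority_str.split(":"):
        if field:
            key, _, value = field.partition("=")
            authority[key] = value  # e.g. "server-id" maps to ""

    path = []
    for element in path_str.split("/"):
        if element:
            name, _, instance = element.partition("=")
            path.append((name, int(instance)))

    # A facility looks like "indicator=ident" or "sensor=speed".
    fac = tuple(facility.split("=", 1)) if facility else None
    return authority, path, fac
```

Applied to the disk FMRI above, this yields the chassis-id and serial from the authority and a path ending in bay 0, disk 0, which is the physically meaningful location that the logical device name alone does not give you.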

So what does this mean, other than providing a way for you to finally figure out where disk 'c3t0d6' is really located? Currently, it allows the disks to be monitored by the disk-transport fmd module to generate faults based on predictive failure, over temperature, and self-test failure. The really interesting part is where we go from here. In the near future, thanks to work by Rob Johnston on the sensor framework, we'll have the ability to manage LEDs for disks that are part of external enclosures, diagnose failures of power supplies and fans, as well as the ability to read sensor data (such as fan speeds and temperature) as part of a unified framework.

I often like to joke about the amount of time that I have spent just getting a single LED to light. At first glance, it seems like a pretty simple task. But to do it in a way that generalizes across a wide variety of platforms, correlates with physically meaningful labels, and incorporates a diverse set of diagnoses (ZFS, SCSI, HBA, etc) requires an awful lot of work. Once it's all said and done, however, future platforms will require little to no integration work, and you'll be able to see a bad drive generate checksum errors in ZFS, resulting in a FMA diagnosis that identifies the faulty drive, activates a hot spare, and lights the fault LED on the drive bay (wherever it may be). Only then will we have accomplished our goal of an end-to-end storage strategy for Solaris - and hopefully someone besides me will know what it has taken to get that little LED to light.

Saturday Jun 09, 2007

For those of you who have been following my recent work with Solaris platform integration, be sure to check out the work Cindi and the FMA team are doing as part of the Sensor Abstraction Layer project. Cindi recently posted an initial version of the Phase 1 design document. Take a look if you're interested in the details, and join the discussion if you're interested in defining the Solaris platform experience.

The implications of this project for unified platform integration are obvious. With respect to what I've been working on, you'll likely see the current disk monitoring infrastructure converted into generic sensors, as well as the sfx4500-disk LED support converted into indicators. I plan to leverage this work as well as the SCSI FMA work to enable correlated ZFS diagnosis across internal and external storage.

Saturday May 26, 2007

Two weeks ago I putback PSARC 2007/202, the second step in generalizing the x4500 disk monitor. As explained in my previous blog post, one of the tasks of the original sfx4500-disk module was reading SMART data from disks and generating associated FMA faults. This platform-specific functionality needed to be generalized to effectively support future Sun platforms.

This putback did not add any new user-visible features to Solaris, but it did refactor the code in the following ways:

  • A new private library, libdiskstatus, was added. This generic library uses uSCSI to read data from SCSI (or SATA via emulation) devices. It is not a generic SMART monitoring library, focusing only on the three generally available disk faults: over temperature, predictive failure, and self-test failure. There is a single function, disk_status_get(), that returns an nvlist describing the current parameters reported by the drive and whether any faults are present.

  • This library is used by the SATA libtopo module to export a generic TOPO_METH_DISK_STATUS method. This method keeps all the implementation details within libtopo and exports a generic interface for consumers.

  • A new fmd module, disk-transport, periodically iterates over libtopo nodes and invokes the TOPO_METH_DISK_STATUS method on any supported nodes. The module generates FMA ereports for any detected errors.

  • These ereports are translated to faults by a simple eversholt DE. These are the same faults that were originally generated by the sfx4500-disk module, so the code that consumes them remains unchanged.
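The four pieces above form a small pipeline: enumerate nodes, ask each for its disk status, emit an ereport per detected condition, and diagnose ereports into faults. The following is only a rough Python sketch of that flow, not the actual C implementation. TopoNode, disk_status(), poll(), diagnose(), and the ereport/fault class strings are all hypothetical names invented for illustration; only the three fault conditions come from the description above.

```python
from dataclasses import dataclass

# The three fault conditions named above; these short keys are illustrative only.
FAULT_KINDS = ("over-temperature", "predictive-failure", "self-test-failure")


@dataclass
class TopoNode:
    """Hypothetical stand-in for a libtopo disk node (not the real API)."""
    fmri: str
    status: dict  # what a disk-status method might report for this node


def disk_status(node):
    """Plays the role of TOPO_METH_DISK_STATUS: return the node's status."""
    return node.status


def poll(nodes):
    """One polling pass: emit one ereport per detected fault condition."""
    ereports = []
    for node in nodes:
        faults = disk_status(node).get("faults", {})
        for kind in FAULT_KINDS:
            if faults.get(kind):
                # Class string is made up for this sketch.
                ereports.append({"class": f"ereport.disk.{kind}",
                                 "detector": node.fmri})
    return ereports


def diagnose(ereports):
    """Trivial diagnosis: map each ereport one-to-one onto a fault."""
    return [{"class": e["class"].replace("ereport.", "fault.", 1),
             "fru": e["detector"]}
            for e in ereports]
```

The point of the split is that only disk_status() knows how the data is read from the drive; the polling loop and the diagnosis step never need to change when a new transport is added.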

These changes form the foundation that will allow future Sun platforms to detect and react to disk failures, eliminating 5200 lines of platform-specific code in the process. The next major steps are currently in progress:

The FMA team, as part of the sensor framework, is expanding libtopo to include the ability to represent indicators (LEDs) in a generic fashion. This will replace the x4500 specific properties and associated machinery with generic code.

The SCSI FMA team is finalizing the libtopo enumeration work that will allow arbitrary SCSI devices (not just SATA) to be enumerated under libtopo and therefore be monitored by the disk-transport module. The first phase will simply replicate the existing sfx4500-disk functionality, but will enable us to model future non-SATA platforms as well as external storage devices.

Finally, I am finishing up my long-overdue ZFS FMA work, a necessary step towards connecting ZFS and disk diagnosis. Stay tuned for more info.

Saturday Mar 17, 2007

As I continue down the path of improving various aspects of ZFS and Solaris platform integration, I found myself in the thumper (x4500) fmd platform module. This module represents the latest attempt at Solaris platform integration, and an indication of where we are headed in the future.

When I say "platform integration", this is more involved than the platform support most people typically think of. The platform teams make sure that the system boots and that all the hardware is supported properly by Solaris (drivers, etc). Thanks to the FMA effort, platform teams must also deliver a FMA portfolio which covers FMA support for all the hardware and a unified serviceability plan. Unfortunately, there is still more work to be done beyond this, of which the most important is interacting with hardware in response to OS-visible events. This includes the ability to light LEDs in response to faults and device hotplug, as well as monitoring the service processor and keeping external FRU information up to date.

The sfx4500-disk module is the latest attempt at providing this functionality. It does the job, but is afflicted by the same problems that often plague platform integration attempts. It's overcomplicated, monolithic, and much of what it does should be generic Solaris functionality. Among the things this module does:

  • Reads SMART data from disks and creates ereports
  • Diagnoses ereports into corresponding disk faults
  • Implements an IPMI interface directly on top of /dev/bmc
  • Responds to disk faults by turning on the appropriate 'fault' disk LED
  • Listens for hotplug and DR events, updating the 'ok2rm' and 'present' LEDs
  • Updates SP-controlled FRU information
  • Monitors the service processor for resets and resyncs necessary information

Needless to say, every single item on the above list is applicable to a wide variety of Sun platforms, not just the x4500, and it certainly doesn't need to be in a single monolithic module. This is not meant to be a slight against the authors of the module. As with most platform integration activities, this effort wasn't communicated by the hardware team until far too late, resulting in an unrealistic schedule with millions of dollars of revenue behind it. It doesn't help that all these features need to be supported on Solaris 10, making the schedule pressure all the more acute, since the code must soak in Nevada and then be backported in time for the product release. In these environments even the most fervent pleas for architectural purity tend to fall on deaf ears, and the engineers doing the work quickly find themselves between a rock and a hard place.

As I was wandering through this code and thinking about how this would interact with ZFS and future Sun products, it became clear that it needed a massive overhaul. More specifically, it needed to be burned to the ground and rebuilt as a set of distinct, general purpose, components. Since refactoring 12,000 lines of code with such a variety of different functions is non-trivial and difficult to test, I began by factoring out different pieces individually, redesigning the interfaces and re-integrating them into Solaris on a piece-by-piece basis.

Of all the functionality provided by the module, the easiest thing to separate was the IPMI logic. The Intelligent Platform Management Interface is a specification for communicating with service processors to discover and control available hardware. Sadly, it's anything but "intelligent". If you had asked me a year ago what I'd be doing at the beginning of this year, I'm pretty sure that reading the IPMI specification would have been at the bottom of my list (right below driving stakes through my eyeballs). Thankfully, the IPMI functionality needed was very small, and the best choice was a minimally functional private library, designed solely for the purpose of communicating with the Service Processor on supported Sun platforms. Existing libraries such as OpenIPMI were too complicated, and in their efforts to present a generic abstracted interface, didn't provide what we really needed. The design goals are different, and the ON-private IPMI library and OpenIPMI will continue to develop and serve different purposes in the future.

Last week I finally integrated libipmi. In the process, I eliminated 2,000 lines of platform-specific code and created a common interface that can be leveraged by other FMA efforts and future projects. It is provided for both x86 and SPARC, even though there are currently no supported SPARC machines with an IPMI-capable service processor (this is being worked on). This library is private and evolving quite rapidly, so don't use it in any non-ON software unless you're prepared to keep up with a changing API.

As part of this work, I also created a common fmd module, sp-monitor, that monitors the service processor, if present, and generates a new ESC_PLATFORM_RESET sysevent to notify consumers when the service processor is reset. The existing sfx4500-disk module then consumes this sysevent instead of monitoring the service processor directly.

This is the first of many steps towards eliminating this module in its current form, as well as laying groundwork for future platform integration work. I'll post updates to this blog with information about generic disk monitoring, libtopo indicators, and generic hotplug management as I add this functionality. The eventual goal is to reduce the platform-specific portion of this module to a single .xml file delivered via libtopo that all these generic consumers will use to provide the same functionality that's present on the x4500 today. Only at this point can we start looking towards future applications, some of which I will describe in upcoming posts.

Wednesday Mar 14, 2007

I've been heads down for a long time on a new project, but occasionally I do put something back to ON worth blogging about. Recently I've been working on some problems which leverage sysevents (libsysevent(3LIB)) as a common transport mechanism. While trying to understand exactly what sysevents were being generated from where, I found the lack of observability astounding. After poking around with DTrace, I found that tracking down the exact semantics was not exactly straightforward. First of all, we have two orthogonal sysevent mechanisms, the original syseventd legacy mechanism, and the more recent general purpose event channel (GPEC) mechanism, used by FMA. On top of this, the sysevent_impl_t structure isn't exactly straightforward, because all the data is packed together in a single block of memory. Knowing that this would be important for my upcoming work, I decided that adding a stable DTrace sysevent provider would be useful.

The provider has a single probe, sysevent:::post, which fires whenever a sysevent post attempt is made. It doesn't necessarily indicate that the sysevent was successfully queued or received. The probe has the following semantics:

# dtrace -lvP sysevent
   ID   PROVIDER            MODULE                          FUNCTION NAME
44528   sysevent           genunix                    queue_sysevent post

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: syseventchaninfo_t *
                args[1]: syseventinfo_t *

The 'syseventchaninfo_t' translator has a single member, 'ec_name', which is the name of the event channel. If this is being posted via the legacy sysevent mechanism, then this member will be NULL. The 'syseventinfo_t' translator has three members, 'se_publisher', 'se_class', and 'se_subclass'. These mirror the arguments to sysevent_post(). The following script will dump all sysevents posted to syseventd(1M):

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
	printf("%-30s  %-20s  %s\n", "PUBLISHER", "CLASS",
	    "SUBCLASS");
}

sysevent:::post
/args[0]->ec_name == NULL/
{
	printf("%-30s  %-20s  %s\n", args[1]->se_publisher,
	    args[1]->se_class, args[1]->se_subclass);
}

And the output during a cfgadm -c unconfigure:

PUBLISHER                       CLASS                 SUBCLASS 
SUNW:usr:devfsadmd:100237       EC_dev_remove         disk
SUNW:usr:devfsadmd:100237       EC_dev_branch         ESC_dev_branch_remove
SUNW:kern:ddi                   EC_devfs              ESC_devfs_devi_remove
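
On a busy system this output gets long quickly, so it can help to post-process it. As a small sketch (in Python rather than D, and assuming the three-column layout printed by the script above), the hypothetical helper tally_by_class() counts events per class:

```python
from collections import Counter


def tally_by_class(text):
    """Count sysevents per class from the script's three-column output."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything malformed.
        if len(fields) != 3 or fields[0] == "PUBLISHER":
            continue
        counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    sample = """\
PUBLISHER                       CLASS                 SUBCLASS
SUNW:usr:devfsadmd:100237       EC_dev_remove         disk
SUNW:usr:devfsadmd:100237       EC_dev_branch         ESC_dev_branch_remove
SUNW:kern:ddi                   EC_devfs              ESC_devfs_devi_remove
"""
    for cls, n in sorted(tally_by_class(sample).items()):
        print(f"{cls:<20} {n}")
```

The same counting could of course be done inside DTrace with an aggregation; this is just one way to slice output that has already been captured.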

This has already proven quite useful in my ongoing work, and hopefully some other developers out there will also find it useful.

Friday Sep 22, 2006

I've been meaning to get around to blogging about these features that I putback a while ago, but have been caught up in a few too many things. In any case, the following new ZFS features were putback to build 48 of Nevada, and should be available in the next Solaris Express.

Create Time Properties

An old RFE has been to provide a way to specify properties at create time. For users, this simplifies administration by reducing the number of commands which need to be run. It also allows some race conditions to be eliminated. For example, if you want to create a new dataset with a mountpoint of 'none', you first have to create it and the underlying inherited mountpoint, only to remove it later by invoking 'zfs set mountpoint=none'.

From an implementation perspective, this allows us to unify our implementation of the 'volsize' and 'volblocksize' properties, and pave the way for future create-time only properties. Instead of having a separate ioctl() to create a volume and passing in the two size parameters, we simply pass them down as create-time options.

The end result is pretty straightforward:

        # zfs create -o compression=on tank/home
        # zfs create -o mountpoint=/export -o atime=off tank/export
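
For tools that drive zfs(1M) from a script, the same idea maps naturally onto a dictionary of properties. The sketch below is only an illustration: zfs_create_argv() is a hypothetical helper, and the only thing it relies on is the '-o property=value' syntax shown above. It builds the argument vector without running anything.

```python
def zfs_create_argv(dataset, properties=None):
    """Build the argv for a single 'zfs create' with create-time properties."""
    argv = ["zfs", "create"]
    for name, value in (properties or {}).items():
        # Each property becomes its own '-o name=value' pair.
        argv += ["-o", f"{name}={value}"]
    argv.append(dataset)
    return argv


if __name__ == "__main__":
    print(" ".join(zfs_create_argv("tank/export",
                                   {"mountpoint": "/export", "atime": "off"})))
```

On a system with ZFS, the resulting list could be handed to subprocess.run(); the benefit over a create followed by several 'zfs set' calls is that the dataset never exists with the wrong properties.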

'canmount' property

The 'canmount' property allows you to create a ZFS dataset that serves solely as a mechanism for inheriting properties. When we first created the hierarchical dataset model, we had the notion of 'containers' - filesystems with no associated data. Only these datasets could contain other datasets, and you had to make the decision at create-time.

This turned out to be a bad idea for a number of reasons. It complicated the CLI, forced the user to make a create-time decision that could not be changed, and led to confusion when files were accidentally created on the underlying filesystem. So we made every filesystem able to have child filesystems, and all seemed well.

However, there is power in having a dataset that exists in the hierarchy but has no associated filesystem data (or effectively none, by preventing it from being mounted). One can do this today by setting the 'mountpoint' property to 'none'. However, this property is inherited by child datasets, and the administrator cannot leverage the power of inherited mountpoints. In particular, some users have expressed desire to have two sets of directories, belonging to different ZFS parents (or even to UFS filesystems), share the same inherited directory. With the new 'canmount' property, this becomes trivial:

        # zfs create -o mountpoint=/export -o canmount=off tank/accounting
        # zfs create -o mountpoint=/export -o canmount=off tank/engineering
        # zfs create tank/accounting/bob
        # zfs create tank/engineering/anne

Now, both anne and bob have directories at '/export/', except that they are inheriting ZFS properties from different datasets in the hierarchy. The administrator may decide to turn compression on for one group of people or another, or set a quota to limit the amount of space consumed by the group. Or simply have a way to view the total amount of space consumed by each group without resorting to scripted du(1).
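
The reason this works is that 'mountpoint' is inherited (with the child's name appended), while 'canmount' applies only to the dataset it is set on. The following is a toy Python model of that rule, not ZFS code. effective_mountpoint() and mounted_datasets() are hypothetical helpers, and the model ignores details such as 'legacy' mountpoints.

```python
def effective_mountpoint(dataset, local_mountpoints):
    """Resolve a dataset's mountpoint from the nearest locally-set ancestor."""
    parts = dataset.split("/")
    for depth in range(len(parts), 0, -1):
        ancestor = "/".join(parts[:depth])
        if ancestor in local_mountpoints:
            base = local_mountpoints[ancestor]
            if base == "none":
                return None  # 'none' is inherited by every descendant
            suffix = parts[depth:]
            # Children append their own path components to the inherited base.
            return "/".join([base.rstrip("/")] + suffix) if suffix else base
    # Nothing set anywhere: default is /<pool>/<path>.
    return "/" + dataset


def mounted_datasets(datasets, local_mountpoints, canmount_off):
    """Map each mountable dataset to its mountpoint; canmount is not inherited."""
    result = {}
    for ds in datasets:
        if ds in canmount_off:
            continue  # only this dataset is skipped, never its children
        mountpoint = effective_mountpoint(ds, local_mountpoints)
        if mountpoint is not None:
            result[ds] = mountpoint
    return result
```

Feeding in the four datasets from the example, with '/export' set on both parents and canmount off on both, yields only tank/accounting/bob at /export/bob and tank/engineering/anne at /export/anne.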

User Defined Properties

The last major RFE in this wad added the ability to set arbitrary properties on ZFS datasets. This provides a way for administrators to annotate their own filesystems, as well as ISVs to layer intelligent software without having to modify the ZFS code to introduce a new property.

A user-defined property name is one which contains a colon (:). This provides a unique namespace which is guaranteed to not overlap with native ZFS properties. The emphasis is to use the colon to separate a module and property name, where 'module' should be a reverse DNS name. For example, a theoretical Sun backup product might do:

        # zfs set com.sun.sunbackup:frequency=1hr tank/home

The property value is an arbitrary string, and no additional validation is done on it. These values are always inherited. A local administrator might do:

        # zfs set localhost:backedup=9/19/06 tank/home
        # zfs list -o name,localhost:backedup
        NAME            LOCALHOST:BACKEDUP
        tank            -
        tank/home       9/19/06
        tank/ws         9/10/06
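
Software consuming these properties needs a way to tell user-defined names from native ones and to pick out the module portion. A minimal Python sketch, assuming only the rule stated above (a user property is any name containing a colon); split_user_property() is a hypothetical helper and performs none of the other validation the real tools may do:

```python
def split_user_property(name):
    """Return (module, property) for a user-defined name, or None if native."""
    if ":" not in name:
        return None  # no colon: this lives in the native property namespace
    # Split on the first colon; the module is conventionally a reverse DNS name.
    module, _, prop = name.partition(":")
    return module, prop


if __name__ == "__main__":
    for name in ("com.sun.sunbackup:frequency", "localhost:backedup", "compression"):
        print(name, "->", split_user_property(name))
```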

The hope is that this will serve as a basis for some innovative products and home grown solutions which interact with ZFS datasets in a well-defined manner.

Tuesday Aug 22, 2006

More exciting news on the ZFS OpenSolaris front. In addition to the existing ZFS on FUSE/Linux work, we now have a second active port of ZFS, this time for FreeBSD. Pawel Dawidek has been hard at work, and has made astounding progress after just 10 days (!). This is both a testament to his ability as well as the portability of ZFS. As with any port, the hard part comes down to integrating the VFS layer, but Pawel has already made good progress there. The current prototype can already mount filesystems, create files, and list directory contents. Of course, our code isn't completely without portability headaches, but thanks to Pawel (and Ricardo on FUSE/Linux), we can take patches and implement the changes upstream to ease future maintenance. You can find the FreeBSD repository here. If you're a FreeBSD developer or user, please give Pawel whatever support you can, whether it's code contributions, testing, or just plain old compliments. We'll be helping out where we can on the OpenSolaris side.

In related news, Ricardo Correia has made significant progress on the FUSE/Linux port. All the management functionality of zfs(1M) and zpool(1M) is there, and he's working on mounting ZFS filesystems. All in all, it's an exciting time, and we're all crossing our fingers that ZFS will follow in the footsteps of its older brother DTrace.

Monday Jun 12, 2006

As Jeff mentioned previously, Ricardo Correia has been working on porting ZFS to FUSE/Linux as part of Google SoC. Last week, Ricardo got libzpool and ztest running on Linux, which is a major first step of the project.

The interesting part is the set of changes that he had to make in order to get it working. libzpool was designed to be run from userland and the kernel from the start, so we've already done most of the work of separating out the OS-dependent interfaces. The most prolific changes were to satisfy GCC warnings. We do compile ON with gcc, but not using the default options. I've since updated the ZFS porting page with info about gcc use in ON, which should make future ports easier. The second most common change was header files that are available in both userland and kernel on Solaris, but nevertheless should be placed in zfs_context.h, concentrating platform-specific knowledge in this one file. Finally, there were some simple changes we could make (such as using pthread_create() instead of thr_create()) to make ports of the tools easier. It would also be helpful to have ports of libnvpair and libavl, much like some have done for libumem, so that developers don't have to continually port the same libraries over and over.

The next step (getting zfs(1M) and zpool(1M) working) is going to require significantly more changes to our source code. Unlike libzpool, these tools (libzfs in particular) were not designed to be portable. They include a number of Solaris specific interfaces (such as zones and NFS shares) that will be totally different on other platforms. I look forward to seeing Ricardo's progress to know how this will work out.

Tuesday Jun 06, 2006

It's been a long time since the last time I wrote a blog entry. I've been working heads-down on a new project and haven't had the time to keep up my regular blogging. Hopefully I'll be able to keep something going from now on.

Last week the ZFS team put the following back to ON:

PSARC 2006/223 ZFS Hot Spares
PSARC 2006/303 ZFS Clone Promotion
6276916 support for "clone swap"
6288488 du reports misleading size on RAID-Z
6393490 libzfs should be a real library
6397148 fbufs debug code should be removed from buf_hash_insert()
6405966 Hot Spare support in ZFS
6409302 passing a non-root vdev via zpool_create() panics system
6415739 assertion failed: !(zio->io_flags & 0x00040)
6416759 ::dbufs does not find bonus buffers anymore
6417978 double parity RAID-Z a.k.a. RAID6
6424554 full block re-writes need not read data in
6425111 detaching an offline device can result in import confusion

There are a couple of cool features mixed in here, most importantly hot spares, clone swap, and double-parity RAID-Z. I'll focus this entry on hot spares, since I wrote the code for that feature. If you want to see the original ARC case and some of the discussion behind the feature, you should check out the original zfs-discuss thread.

The following features make up hot spare support:

Associating hot spares with pools

Hot spares can be specified when creating a pool or adding devices by using the spare vdev type. For example, you could create a mirrored pool with a single hot spare by doing:

# zpool create test mirror c0t0d0 c0t1d0 spare c0t2d0
# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
        spares
          c0t2d0    AVAIL   

errors: No known data errors

Notice that there is one spare, and it is currently available for use. Spares can be shared between multiple pools, allowing for a single set of global spares on systems with multiple pools.

Replacing a device with a hot spare

There is now an FMA agent, zfs-retire, which subscribes to vdev failure faults and automatically initiates replacements if there are any hot spares available. But if you want to play around with this yourself (without forcibly faulting drives), you can just use 'zpool replace'. For example:

# zpool offline test c0t0d0
Bringing device c0t0d0 offline
# zpool replace test c0t0d0 c0t2d0
# zpool status test
  pool: test
 state: DEGRADED
status: One or more devices has been taken offline by the adminstrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: resilver completed with 0 errors on Tue Jun  6 08:48:41 2006
config:

        NAME          STATE     READ WRITE CKSUM
        test          DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              c0t0d0  OFFLINE      0     0     0
              c0t2d0  ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
        spares
          c0t2d0      INUSE     currently in use

errors: No known data errors

Note that the offline is optional, but it helps visualize what the pool would look like should an actual device fail. Also note that even though the resilver is completed, the 'spare' vdev stays in place (unlike a 'replacing' vdev). This is because the replacement is only temporary. Once the original device is replaced, the spare will be returned to the pool.

Relieving a hot spare

A hot spare can be returned to its previous state by replacing the original faulted drive. For example:

# zpool replace test c0t0d0 c0t3d0
# zpool status test
  pool: test
 state: DEGRADED
 scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:

        NAME             STATE     READ WRITE CKSUM
        test             DEGRADED     0     0     0
          mirror         DEGRADED     0     0     0
            spare        DEGRADED     0     0     0
              replacing  DEGRADED     0     0     0
                c0t0d0   OFFLINE      0     0     0
                c0t3d0   ONLINE       0     0     0
              c0t2d0     ONLINE       0     0     0
            c0t1d0       ONLINE       0     0     0
        spares
          c0t2d0         INUSE     currently in use

errors: No known data errors
# zpool status test
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
        spares
          c0t2d0    AVAIL   

errors: No known data errors

The drive is actively being replaced for a short period of time. Once the replacement is completed, the old device is removed, and the hot spare is returned to the list of available spares. If you want a hot spare replacement to become permanent, you can zpool detach the original device, at which point the spare will be removed from the hot spare list of any active pools. You can also zpool detach the spare itself to cancel the hot spare operation.
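
The spare lifecycle described so far (AVAIL, INUSE, then either back to AVAIL or promoted to a permanent member) can be summarized as a small state model. This is a simplified Python sketch and not how ZFS represents vdevs internally: the Pool class and its method names are hypothetical, a single pool is assumed, and resilvering is ignored entirely.

```python
class Pool:
    """Toy model of hot spare bookkeeping for one pool."""

    def __init__(self, devices, spares):
        self.devices = list(devices)                # active data devices
        self.spares = {s: "AVAIL" for s in spares}  # spare -> state
        self.spared = {}                            # failed device -> covering spare

    def activate_spare(self, failed, spare):
        """Like 'zpool replace <failed> <spare>': the spare covers the device."""
        if failed not in self.devices:
            raise ValueError(f"{failed} is not in the pool")
        if self.spares.get(spare) != "AVAIL":
            raise ValueError(f"{spare} is not an available spare")
        self.spares[spare] = "INUSE"
        self.spared[failed] = spare

    def replace_original(self, failed, new_device):
        """Replace the original device; the spare returns to AVAIL."""
        spare = self.spared.pop(failed)
        self.devices[self.devices.index(failed)] = new_device
        self.spares[spare] = "AVAIL"

    def detach_original(self, failed):
        """Make the spare permanent: it takes the device's place for good."""
        spare = self.spared.pop(failed)
        self.devices[self.devices.index(failed)] = spare
        del self.spares[spare]

    def detach_spare(self, failed):
        """Cancel the hot spare operation; the original device stays."""
        spare = self.spared.pop(failed)
        self.spares[spare] = "AVAIL"
```

Walking the example from this entry through the model: activating c0t2d0 for c0t0d0 marks it INUSE, and replacing c0t0d0 with c0t3d0 leaves the mirror as c0t3d0 and c0t1d0 with c0t2d0 back in the AVAIL state.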

Removing a spare from a pool

To remove a hot spare from a pool, simply use the zpool remove command. For example:

# zpool remove test c0t2d0
# zpool status
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0

errors: No known data errors

Unfortunately, we don't yet support removing anything other than hot spares (it's on our list, we swear). But you can see how hot spares naturally fit into the existing ZFS scheme. Keep in mind that to use hot spares, you will need to upgrade your pools (via 'zpool upgrade') to version 3 or later.

Next Steps

Despite the obvious usefulness of this feature, there is one more step that needs to be done for it to be truly useful. This involves phase two of the ZFS/FMA integration. Currently, a drive is only considered faulted if it 'goes away' completely (i.e. ldi_open() fails). This covers only a subset of known drive failure modes. It's possible for a drive to continually return errors, and yet be openable. The next phase of ZFS and FMA will introduce a more intelligent diagnosis engine to watch I/O and checksum errors as well as the SMART predictive failure bit in order to proactively offline devices when they are experiencing an abnormal amount of errors, or appear like they are going to fail. With this functionality, ZFS will be able to better respond to failing drives, thereby making hot spare replacement much more valuable.
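
To give a feel for what "an abnormal amount of errors" might mean in practice, here is one simple possibility sketched in Python: flag a device once it reports more than N errors within a sliding window of T seconds. This rule, the ErrorRateMonitor class, and the thresholds are all assumptions made for illustration; the actual diagnosis engine had not been designed at the time of writing and may work quite differently.

```python
from collections import deque


class ErrorRateMonitor:
    """Flag a device that exceeds max_errors within window_seconds."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window = window_seconds
        self.times = deque()  # timestamps of errors still inside the window

    def record(self, timestamp):
        """Record one error; return True if the device now looks faulty."""
        self.times.append(timestamp)
        # Drop errors that have aged out of the sliding window.
        while self.times and timestamp - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) > self.max_errors


if __name__ == "__main__":
    monitor = ErrorRateMonitor(max_errors=3, window_seconds=60)
    for t in (0, 10, 20, 30, 200):
        print(t, monitor.record(t))
```

A rule like this catches the drive that stays openable while continually returning errors, without faulting a drive over a single isolated error.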