Tuesday May 12, 2009

On SPARC systems, when there's a memory uncorrectable error (UE), Solaris will determine if the affected page is in user space or kernel space. If it's in user space, the affected user process is killed. We term this a contract kill, and it's way better than panicking the entire OS instance. Also, if the affected process is registered with SMF, it will be restarted automatically. And naturally, FMA will diagnose and message the UE, as well as retire the offending page.

However, until now there's been no notification of which process was killed. That changed today with the putback of 6676374 to snv_116. The most pertinent part of the code change from the user perspective is:

uprintf("Killed process %d (%s) in contract id %d " "due to hardware error\n", p->p_pid, p->p_user.u_comm, ct_id);

Now, on a contract kill you'll get the process id, name, and contract id logged in /var/adm/messages.
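If you want to pick these messages out of a log programmatically, here's a rough sketch in C that matches the format string above. The function name parse_contract_kill is made up for illustration, and I'm assuming only that the message text appears somewhere in the line; whatever syslog prefix comes before it is skipped rather than parsed. Process names containing a ')' would confuse this simple matcher.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Solaris): extract the pid, process
 * name, and contract id from a line containing the contract-kill message.
 * Returns 0 on success, -1 if the line does not contain the message.
 */
static int
parse_contract_kill(const char *line, int *pid, char *name, size_t namelen,
    int *ctid)
{
	const char *start;
	char buf[64];

	assert(line != NULL && pid != NULL && name != NULL && ctid != NULL);
	assert(namelen > 0);

	/* Skip any prefix (timestamp, host, etc.) ahead of the message. */
	start = strstr(line, "Killed process ");
	if (start == NULL)
		return (-1);

	/* Mirrors "Killed process %d (%s) in contract id %d ..." */
	if (sscanf(start, "Killed process %d (%63[^)]) in contract id %d",
	    pid, buf, ctid) != 3)
		return (-1);

	/* Copy the name out, truncating if the caller's buffer is small. */
	(void) snprintf(name, namelen, "%s", buf);
	return (0);
}
```

Feed it each line of the log; a return of 0 means the line was a contract kill and the three output values are filled in.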

For a single UE, this could be viewed as (important) supplemental information, since FMA will loudly message a DIMM fault. However, if subsequent UEs occur in the same DIMM or set of DIMMs previously diagnosed, and the offending DIMM(s) have not been replaced, the subsequent UEs are not loudly messaged by FMA. The reason is that the FMA subsystem recognizes that the subsequent errors are against already-faulted FMRIs, and FMD does not re-message such faults. Without this fix, additional contract kills are silent.

Oh...and if you're wondering about x86 platforms, Solaris has the same capability to contract kill user processes instead of panicking. But, in the x86 world, operating systems are not typically given the chance to react to memory UEs. The industry norm is that the FW/BIOS pulls reset on a memory UE.

:wq

UPDATE: This same change was putback to the Solaris 10 gates today (07/08/2009). It will be part of Solaris 10 Update 8 (S10U8).

Tuesday May 05, 2009

On my desktop, I use "mouse over" semantics - my active window is set based on where my mouse cursor is. And every now and again, some interesting things result. Such as today. I went to flush my browser cache and history with Ctrl-Alt-Delete and apparently didn't have my mouse over the Firefox window. I was surprised when this popped up:

From Google searches, this tool has apparently been around for a while. I don't spend time cruising the application menus (and maybe I should :), so I never knew about it.

It's a nifty little thing. I didn't have much luck with the option to search for open files, but it's nice for a quick glance at a system. Odd thing is my network stats are nil, which I know isn't accurate. Might be the fact that I was VPN'd into Sun's network at the time. Need to check that later.

I found a demo of the System Monitor. It looks a bit dated - at least the menu path has changed. On my snv_111a system, the tool is found in Applications -> System Tools -> Performance Monitor.

:wq

Friday May 01, 2009

Solaris 10 Update 7 is now posted and available for download. And there have been 65+ bug fixes and enhancements for FMA. Here are a few of my favorites (can one have favorite bugs? :) fixed in S10U7:


6540058 libldom enhancements for sun4v root domains
6540055 ETM enhancements for sun4v root domains
6540080 topology enhancements for sun4v root domains

For quite a long time, when running SPARC Logical Domains (LDOMs), FMA had real gaps when the IO subsystem was divided across logical domains. Namely, when an IO root complex is granted to a non-control domain (a so-called "root" domain), FMA for IO was disjoint and could break. Some of the IO diagnosis rules need to pair up root complex ereports (created in the SP) with PCIE fabric ereports (created in Solaris). With these fixes, the event transport plumbing is in place so a given instance of Solaris gets all the ereports it needs to produce an accurate diagnosis.

Update 05/04/2009: Eric Sharakan posted a blog detailing some of the LDOM side requirements needed to ensure FMA is fully featured - namely the LDOMs 1.2 release planned for this summer.


6706543 FMA for Intel Nehalem
There are actually several other bug fixes that go along with the Nehalem support. Please refer to my prior blog entry on Nehalem FMA at http://blogs.sun.com/sdaven/entry/xeon_5500_fma.


6722048 diagnosis of and KA for SUNOS-8000-1L should be split

For those of you who have gotten a SUNOS-8000-1L message, you've been annoyed. It means there's a bug in the FMA stack somewhere, and a diagnosis engine received an ereport it couldn't understand. Before you get excited, this fix doesn't fix all the bugs. But it does help us developers better identify where a bug might be. Several new message IDs are introduced, which better classify why a particular ereport was deemed bad. They are SUNOS-8000-E8, SUNOS-8000-G7, SUNOS-8000-HV, and the unfortunately named SUNOS-8000-FU.


6639248 RFE: Eversholt should allow dynamic SERD engine names
6639255 RFE: Eversholt should allow bumping SERD by an arbitrary value

If you're not developing diagnosis code in Eversholt (which is most everyone on the planet), then you won't care about this. But these changes allow us to do some more interesting things to make diagnosis engines more flexible. I asked for these changes as part of the SPARC/sun4v Platform Independent FMA work. The language extensions allow diagnosis rules to be tailored by ereport payload members. And in the sun4v world, where telemetry is generated outside of Solaris on the Service Processor, we've designed diagnosis rules that can be "guided" by platform-controlled telemetry.
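For the curious, here's a toy sketch in C of what "bumping SERD by an arbitrary value" means. A SERD engine fires once N events have been seen within a time window T; the bump lets a single ereport count as more than one event. This is not the Eversholt or fmd implementation. The type and function names (serd_t, serd_init, serd_record) are invented for illustration, timestamps are plain seconds, and I assume they arrive in non-decreasing order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	SERD_MAX_EVENTS	32	/* illustrative cap on retained events */

typedef struct serd_event {
	long when;		/* arrival time, in seconds */
	unsigned weight;	/* how much this event bumps the count */
} serd_event_t;

typedef struct serd {
	unsigned n;		/* fire when the weighted count reaches n */
	long t;			/* window length, in seconds */
	serd_event_t ev[SERD_MAX_EVENTS];
	size_t count;
} serd_t;

static void
serd_init(serd_t *s, unsigned n, long t)
{
	assert(s != NULL);
	s->n = n;
	s->t = t;
	s->count = 0;
}

/* Record an event worth 'bump' counts; returns 1 if the engine fires. */
static int
serd_record(serd_t *s, long now, unsigned bump)
{
	size_t i, keep = 0;
	unsigned total = 0;

	assert(s != NULL);

	/* Discard events that have aged out of the window. */
	for (i = 0; i < s->count; i++) {
		if (now - s->ev[i].when <= s->t)
			s->ev[keep++] = s->ev[i];
	}
	s->count = keep;

	/* If the table is full, drop the oldest entry to make room. */
	if (s->count == SERD_MAX_EVENTS) {
		memmove(&s->ev[0], &s->ev[1],
		    (SERD_MAX_EVENTS - 1) * sizeof (s->ev[0]));
		s->count--;
	}

	s->ev[s->count].when = now;
	s->ev[s->count].weight = bump;
	s->count++;

	for (i = 0; i < s->count; i++)
		total += s->ev[i].weight;

	return (total >= s->n);
}
```

With an arbitrary bump, a payload member in the ereport can decide how heavily one error counts toward the threshold, rather than every ereport counting as exactly one.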

:wq

This blog copyright 2009 by Scott Davenport