Friday Sep 25, 2009

6734814 Intel address translation Phase II

This change hit build 125 today. Solaris has supported memory page retire since the initial launch of Nehalem EP. Today's putback improves that support in the area of fault replay.

FMA persists page retires (and all other faults) across reboots via the on-disk fault cache. When FMD starts, it consults the fault cache and (provided the affected resource is still in the configuration) replays the cached faults.
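As a rough illustration of that replay rule (not FMD's actual code), here is a minimal Python sketch. The `CachedFault` record, the `replay_faults` function, and the resource and fault-class strings are all hypothetical names made up for this example; the only thing taken from the description above is the rule itself: replay a cached fault only if its resource is still in the configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedFault:
    # Hypothetical record layout; the real fault cache format is not shown here.
    resource: str      # identifier of the affected resource
    fault_class: str   # what kind of fault was diagnosed


def replay_faults(cache, present_resources):
    """Return the cached faults whose resource is still configured.

    Faults for resources that have left the configuration are skipped
    rather than replayed.
    """
    return [fault for fault in cache if fault.resource in present_resources]


# Example data (entirely made up for illustration).
cache = [
    CachedFault("dimm0/page-a", "example.page"),
    CachedFault("dimm3/page-b", "example.page"),
]
present = {"dimm0/page-a"}  # dimm3 was removed before this boot
replayed = replay_faults(cache, present)
```

Here only the fault on `dimm0/page-a` is replayed, because `dimm3/page-b` is no longer present.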

For page retires, the faults are associated with a physical address (PA). Between OS reboots, the memory topology can change: DIMMs can be added or removed, interleaves changed, etc. In such cases, the physical/virtual mappings change, and the PA in the on-disk fault cache could point at a healthy page. FMD would then retire a page that had experienced no errors.

This putback adds code to recalculate the PA (if necessary) after reboots to ensure the correct, faulty page is re-retired.
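To make the idea concrete, here is a toy Python sketch of recalculating a PA after a topology change. It is not the Solaris implementation. It assumes a deliberately simplified model in which DIMMs are mapped back to back with no interleaving, and in which the fault is remembered as a (DIMM, offset) pair; the function names and the topology representation are hypothetical.

```python
def build_bases(topology):
    """Map each DIMM id to (base PA, size) for a list of (dimm_id, size).

    Assumption: DIMMs are laid out contiguously in list order, no interleave.
    """
    bases = {}
    next_base = 0
    for dimm_id, size in topology:
        bases[dimm_id] = (next_base, size)
        next_base += size
    return bases


def pa_to_location(pa, topology):
    """Translate a PA into a (dimm_id, offset) pair under this topology."""
    for dimm_id, (base, size) in build_bases(topology).items():
        if base <= pa < base + size:
            return dimm_id, pa - base
    raise ValueError("PA is outside the configured memory")


def location_to_pa(dimm_id, offset, topology):
    """Translate (dimm_id, offset) back into a PA, or None if it is gone."""
    bases = build_bases(topology)
    if dimm_id not in bases:
        return None
    base, size = bases[dimm_id]
    if offset >= size:
        return None
    return base + offset


# Before reboot: two DIMMs; the faulty page sits at PA 0x1800 (on DIMM "B").
old_topology = [("A", 0x1000), ("B", 0x1000)]
faulty_location = pa_to_location(0x1800, old_topology)

# After reboot: a new DIMM "C" was added in front, shifting every base.
new_topology = [("C", 0x2000), ("A", 0x1000), ("B", 0x1000)]
stale_hit = pa_to_location(0x1800, new_topology)          # lands on healthy "C"
new_pa = location_to_pa(*faulty_location, new_topology)   # follows DIMM "B"
```

In this toy model the stale PA 0x1800 now lands on the healthy DIMM "C", while the recalculated PA follows the faulty page to its new address on DIMM "B".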

:wq

Friday Sep 18, 2009

Build 100 of OpenSolaris introduced fast reboot, quite a nifty feature. However, in certain fault scenarios, having fast reboot on by default is not desired. Fast reboot bypasses BIOS involvement in the boot process. But after certain classes of failure, BIOS engagement is desired. Examples include BIOS deconfiguring faulty components or BIOS collecting error information that may be lost or discarded after OS reboot.

Earlier this week, the putback for 6880616 added interfaces that allow Solaris FMA to disable fast reboot on terminal errors. The FMA changes to take advantage of these interfaces are close behind, planned to make build 125 (6883623).

:wq

Wednesday Sep 16, 2009

6852259 Add FMA support for new members of Nehalem family

This CR went into OpenSolaris today. Solaris FMA is ready to diagnose when Intel's Westmere and Jasper Forest hit the streets.

:wq

This blog copyright 2009 by Scott Davenport