
jdh's blog

theme and variations
Thursday Feb 22, 2007

working remotely: Prague Paradise

This past year, I decided to leverage some intersecting reasons for travelling to Prague on business.  But instead of just doing a business trip, I proposed to my management that I work out of the Sun office in Prague for a short time while also mixing in vacation time, so I could both experience the city and make a connection with Sun folks at a non-U.S. site.  My management was very obliging, and before I knew it, my Prague plan panned out.

I was in Prague for the month of September.  Prague is a truly lovely city. A river runs through the center of the old part, with history and statue laden bridges crossing it.  Many of the beautiful old buildings are lit up at night.  I had an apartment in this part of the city.  The transit system is fabulous.  I used a combination of tram and subway to get to work;  each workday, I got to see the city's architectural highlights from the tram.

Sun is well set up for working at a remote site.  I secured 24x7 access to the Prague Sun site, which meant I could work at hours that suited me.  My laptop was already set up with the tools I needed (mail reader, StarOffice, browser).  I plugged it into the local network, which provided seamless access to my home directory and Sun's tools.  I was able to use a local access number from the office phone to dial into U.S. 800-based conference calls, and could even call in from my apartment by dialing first into the office phone system.  The phone connection was quite good -- only occasionally did I have trouble with voice synchronization.  I had brought my headset with me; it worked perfectly, which made my calls as easy as when calling from my main office or from home.

My husband visited for the middle 2 1/2 weeks.  We were fortunate to have glorious weather for the entire month of September, with blue skies and 70 degree days.  We went to the old churches and the famous castle, to the opera, to pubs, and also to several outdoor hangouts overlooking the city and river.  We explored outside the city for 5 days, and enjoyed both natural beauty and several beautiful and historic smaller towns.  Some motorcycle friends rode over from Marseille and stayed for a couple of days; we showed them what we'd discovered.

Working in Prague was surprisingly comfortable and similar to my work experiences at home, or at other U.S. based Sun locations.  I could have just worked at the Prague site for a few days, then taken a vacation in the Czech Republic, but this experience of vacation days intermingled with days at work gave me the opportunity to get to know Prague and the Sun culture at this site.  It makes me hungry for more such experiences. Next time, I should try for 6 months or a year!

Friday Dec 08, 2006

Complexity: basic system monitoring

The other day I had my second contact with a Real Customer, and I learned something.  (Come to think of it, I learned something at my first meeting as well.  This is a trend I'm trying hard to extend by insinuating myself into meetings with customers.)  We have a bunch of really cool systems at Sun -- across a wide range, from high end to low end, covering both SPARC and x64 platforms.  Some customers buy lots and lots of them.  And then want to manage them.

We also have some pretty neat tools for managing and monitoring systems: Sun Management Center (SunMC) has been around for a number of years, and supports pretty much all of our SPARC platforms.  N1 System Manager (N1SM) is newer; it was initially targeted at the lower end volume systems, starting with x64 platforms, but has been expanding its focus.  Both tools provide a conduit through which to monitor Sun systems.  Both provide a normalized view, hiding the variations in how those systems report status or generate notifications.  Both also provide upward interfaces to enable integration into 3rd party tools.  In this way we're similar to IBM (Director) or HP (System Insight Manager).
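To make "normalized view" a little more concrete, here is a toy sketch in Python.  It is not how SunMC or N1SM actually work; the two platform report formats, the Health values, and every function name below are invented for illustration.  The idea is only that each platform reports status in its own way, and a thin layer maps those reports onto one common vocabulary.

```python
from enum import Enum


class Health(Enum):
    """Common status vocabulary (hypothetical, not a Sun API)."""
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"
    UNKNOWN = "unknown"


def normalize_platform_a(report):
    """Hypothetical platform A reports a numeric severity; 0 means healthy."""
    severity = report["severity"]
    if severity == 0:
        return Health.OK
    return Health.DEGRADED if severity < 3 else Health.FAILED


# Hypothetical platform B reports a free-form state string instead.
_PLATFORM_B_STATES = {
    "nominal": Health.OK,
    "warning": Health.DEGRADED,
    "critical": Health.FAILED,
}


def normalize_platform_b(report):
    """Map platform B's state string; anything unrecognized is UNKNOWN."""
    return _PLATFORM_B_STATES.get(report["state"].lower(), Health.UNKNOWN)


# One normalizer per platform kind.
NORMALIZERS = {
    "platform_a": normalize_platform_a,
    "platform_b": normalize_platform_b,
}


def normalized_view(reports):
    """Turn {host: (platform_kind, raw_report)} into {host: Health}."""
    return {
        host: NORMALIZERS[kind](raw)
        for host, (kind, raw) in reports.items()
    }


if __name__ == "__main__":
    view = normalized_view({
        "host1": ("platform_a", {"severity": 0}),
        "host2": ("platform_b", {"state": "Critical"}),
    })
    for host, health in sorted(view.items()):
        print(host, health.value)
```

A higher level tool then only has to understand the Health values, not each platform's quirks.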

I'd heard that customers aren't always so interested in a 3 tiered management solution, where there is the managed system, an element manager such as N1SM managing the basic hardware infrastructure, and a higher level manager for the data center as a whole.  My main reaction was, those pesky customers, they are never satisfied - this solution is quite sufficient for their needs.

Then in comes the Real Customer, with a big dose of Reality.  It turns out that the second tier running our nifty consolidation layer that knows the special ins and outs of our systems requires some real hardware on which to run.  This isn't just theoretical complexity; this is complexity in terms of additional hardware and software to deploy and manage.

So what should we do?   Provide flexibility.   And that's what we're doing.  We've recently provided some packages that enable direct integration between our systems and third party tools, such as CA Unicenter and MOM.  And this is a trend that we are continuing.   It may not be possible to fully represent some special aspects of our systems through non-Sun tools that focus on enabling management across a wide variety of systems, but we can certainly enable the basic system monitoring and management that is common to all systems.

Saturday Nov 11, 2006

Complexity and Completeness: FMA

Complexity brings joy into the life of an engineer: it is so satisfying to find all the nooks and crannies of a problem and come up with a solution that covers them all. No, wait. Complexity is the nemesis of the engineer: it is so hard to be satisfied with an 80% solution. The desire to provide a complete solution is almost unbearable, beyond all reasonable expectations of the company or customer. Must an engineer resort to damned statistics to prove that an inelegant or incomplete solution is sufficient, and a more efficient use of time? Sufficient for whom? For the consumer of course. Being associated with a less than perfect solution repulses me, yet I am the very customer that would never pay what it costs for perfection.

One of my favorite examples of the agony and ecstasy of complexity is FMA (fault management architecture - see Mike Shapiro's blog). FMA is hard.1 I've looked at the blog entry by past Sun luminary Andy Rudoff, in which he provides a summary of the concepts of FMA.  The concepts are so pure and beautiful. And simple!

But it all starts with fault trees. One must explain every fault that could occur, and every symptom (error) it might produce. Then in the middle there are all the timing issues - how long will related errors take to show up? And will they show up, because after all the paths for communication are not perfect? And at the end is the problem of how to isolate the fault. Who are all the constituents who will be affected, and how do we guarantee there is no race condition between multiple actors?

Right now we take an all-or-nothing approach to a vertical segment of faults -- all CPU faults, for example. Essentially we diagnose a complete subtree of faults, but ignore faults caused by a component closer to the root of the fault tree (maybe the fault is really a power supply problem that affects all components in the box). This is how we make the problem tractable. But the complete subtree approach gets particularly hard for I/O, where components can have a great deal of interaction, not necessarily all in a nice hierarchy, and errors are reflected in all directions. Cindi McGuire has led a herculean effort to wrestle down I/O fault management into something containable and expressible, but getting every device to participate in FMA has stumped this effort. The job is not complete. The ability to diagnose with complete accuracy remains an unrealized vision.
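For readers who like to see the shape of an idea in code, here is a toy sketch in Python of the subtree approach.  It is emphatically not the real FMA diagnosis engine: the tree layout, the fault and symptom names, and the functions are all made up, and it ignores timing and lost error reports entirely.  Each fault lists the symptoms (errors) it is expected to produce, and diagnosis only considers faults inside one chosen subtree.

```python
from dataclasses import dataclass, field


@dataclass
class FaultNode:
    """One node in a toy fault tree (hypothetical structure)."""
    name: str
    symptoms: frozenset = frozenset()   # errors this fault would produce
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and everything beneath it."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root, name):
    """Return the node called `name`, or None if it is not in the tree."""
    for node in walk(root):
        if node.name == name:
            return node
    return None


def diagnose(root, subtree_name, observed):
    """Name the faults in one subtree whose symptoms were all observed.

    Faults outside the chosen subtree (for example, closer to the root)
    are never considered, which is the simplification described above.
    """
    subtree = find(root, subtree_name)
    if subtree is None:
        return []
    observed = set(observed)
    return sorted(
        node.name
        for node in walk(subtree)
        if node.symptoms and node.symptoms <= observed
    )


# Invented example: a power supply sits closer to the root than the CPU.
tree = FaultNode("box", children=[
    FaultNode("power-supply", frozenset({"voltage-low"}), children=[
        FaultNode("cpu", children=[
            FaultNode("cpu-cache", frozenset({"l2-ecc"})),
            FaultNode("cpu-core", frozenset({"core-trap"})),
        ]),
    ]),
])

if __name__ == "__main__":
    errors = {"l2-ecc", "voltage-low"}
    print(diagnose(tree, "cpu", errors))   # CPU subtree only
    print(diagnose(tree, "box", errors))   # whole tree
```

Diagnosing only the "cpu" subtree blames the cache and never looks at the power supply, even though the voltage error is sitting right there in the observations.  Diagnosing from "box" sees both, but requires the whole tree to be written down, which is exactly the hard part.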

So I wonder - would it be so bad to just sprinkle some FMA around? For example, if we have evidence that a particular subset of faults is most common or catastrophic, can we provide just sufficient error reporting and diagnosis to narrow the fault landscape to find the instigators of those most egregious faults? Maybe we allow drivers to be enhanced to some minimal level of error reporting to enable just that type of diagnosis, for example. This might make it easier for driver writers inside and outside of Sun to inch their way along the path of enabling the full FMA vision.

1 let's go shopping

