Open desktop mechanic

Anybody can reboot

Thursday Sep 23, 2004

My uncle is a shade-tree auto mechanic. Whenever my dad had a problem with his 1969 AMC Rambler, Uncle Dell would recommend, "Jack up the radiator cap and drive another car under it." From what I've seen, that's often how problems in the Microsoft desktop world are solved. Reboot or reinstall the O.S. Fear not, whatever demons were possessing your desktop will be exorcised...for a while anyway. But you don't learn much that way.

Last week I was running SunRay Server 3 beta on a Java Desktop System 2.0. A couple of users had problems logging into their NIS home directories, which were automounted under /home from a local NFS server. I logged into my NIS home directory and experienced the same problem. Nautilus and gconfd-2 weren't happy. df -k showed mounts for dozens of users under /home, even though only three of them were being used, and it showed multiple mounts for some users. Earlier I had found that restarting autofs fixed a similar problem so:

/etc/init.d/autofs restart
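
For anyone who would rather not eyeball df -k output, here is a minimal Python sketch of how repeated mounts could be counted. It assumes the mount table is available as text in the usual /etc/mtab layout (device, mount point, type, options). The function name duplicate_mounts and the /home/ prefix default are illustrative choices, not anything from the original setup.

```python
from collections import Counter


def duplicate_mounts(mtab_text, prefix="/home/"):
    """Return {mount_point: count} for mount points under `prefix`
    that appear more than once in mtab-style text."""
    counts = Counter()
    for line in mtab_text.splitlines():
        fields = line.split()
        # Expected layout: device mount_point fstype options dump pass.
        # Skip blank lines, comments, and anything too short to parse.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        mount_point = fields[1]
        if mount_point.startswith(prefix):
            counts[mount_point] += 1
    # Keep only the mount points that show up more than once.
    return {mp: n for mp, n in counts.items() if n > 1}
```

Feeding it the contents of /etc/mtab before and after the autofs restart would show whether the duplicates really went away.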

Well, the extra automounts were gone, but my session was still messed up. I cleared out my .gnome*, .nautilus, .gconf and .gconfd preference files and found the problem only became worse. It would have been easy to just reboot. The system had been working fine for a couple of weeks before that.

Aha, it turns out I accidentally wiped out the global configuration database in /etc/gconf while experimenting with an optimization script. I decided to kill the APOC configuration daemon before I reinstalled the gconf RPMs, but I did it in the clumsiest way and killed all JVM instances running on the machine, including those responsible for core SunRay services! The SunRay clients displayed a box indicating that they couldn't find the server. I should just reboot, but let me look at the SunRay manual. Hmmm, /opt/SUNWut/sbin/utrestart. Like magic my session reappeared, along with the sessions of anyone else who had been sharing that box.

Earlier versions of nautilus/gnome-vfs had a nasty habit of searching for trash and for a writable directory on any share where it could put trash. This was not nice on a SunRay server with hundreds of deep automounted NFS trees. But I thought I remembered that this problem was solved by Sun engineers and other GNOME community members. So my next suspect was autofs. I found a Sun engineer's whitepaper on some autofs deficiencies. Further investigation showed these deficiencies to be unrelated to my immediate problem. Then I remembered that NFS home directories were being shared between Solaris 8/9 GNOME 2.0 and the newer GNOME in Java Desktop System 2.0. The file
 ~/.gnome/gnome-vfs/.trash_entry_cache
contained entries for nearly every user under /home. Apparently even the newer gnome-vfs reads this cache and stats everything it sees there. Autofs notices that someone is looking and mounts the shares. Sure enough, if I launch nautilus without gconfd-2 and with the trash cache in place, mtab immediately fills with extra junk. So now how do we solve the problem of forward and backward compatibility of GNOME configuration files? I think this will take agreement from the entire GNOME community. As configuration moves from flat files into LDAP backends the problem may become irrelevant. In the meantime, I'm glad I didn't reboot.

10:32am  up 30 days, 13:58,  3 users,  load average: 0.13, 0.09, 0.02 
Yeah, this is a beta.
I once explained my reboot philosophy to my brother as:
  • Microsoft Windows: Reboot for minor configuration changes, even to change IP address or upgrade a library!
  • GNU/Linux: You should only reboot when installing new hardware.
  • Solaris: Why would you reboot just to install new hardware?
Apologies if Linux and Microsoft Windows have improved recently, but can you swap out a CPU without rebooting?
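
The nautilus experiment above (launch it with the trash cache in place and watch mtab fill up) can be made repeatable by comparing two snapshots of the mount table. The following Python sketch is only an illustration under assumptions: it expects both snapshots as text in the usual mtab layout, and the helper names mount_points and new_mounts are hypothetical.

```python
def mount_points(mtab_text):
    """Collect the set of mount points named in mtab-style text."""
    points = set()
    for line in mtab_text.splitlines():
        fields = line.split()
        # Expected layout: device mount_point fstype options dump pass.
        if len(fields) >= 2 and not fields[0].startswith("#"):
            points.add(fields[1])
    return points


def new_mounts(before_text, after_text):
    """Return, sorted, the mount points present in the second
    snapshot but absent from the first."""
    return sorted(mount_points(after_text) - mount_points(before_text))
```

Saving a copy of /etc/mtab, starting nautilus, and passing the old and new contents to new_mounts would list exactly which shares the trash cache caused autofs to mount.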

Comments:

GNOME stats everything it can? Maybe you guys should have gone with KDE ... just kidding. Hopefully they will fix this in future releases. On a server that I run X on, a dual-Xeon system, no other processes (including Samba and email) consume as much CPU time as the gnome stuff, even when no one is typing.

Posted by PatrickG on September 23, 2004 at 03:30 PM GMT+00:00 #

AFAIK, the problem is solved in current GNOME; leftover cache files are another story though. Yes, some GNOME components are probably heavier than they need to be. Now that we have a recent distribution running on Solaris with DTrace available, I hope we can provide some useful information to improve performance.

Posted by bnitz on September 23, 2004 at 03:41 PM GMT+00:00 #

Hmmm, you can hotswap CPUs in Solaris... What sort of software-based EMC fixes do you use? Interesting concept.... oftentimes we use embedded software to deal with ESD events. We can't divert all of them from causing soft problems, but we can deal with the effects via software. What's all this stuff about math questions anyhow???

Posted by Ron on September 27, 2004 at 10:22 PM GMT+00:00 #

I don't know how other people solved this sort of problem, and to be honest I'm pretty new to Solaris 10, but Solaris 10's predictive self healing appears to do this.

"Reducing Hardware Failures -- A self-healing system automatically diagnoses problems, and the results can be used to trigger automated reactions such as dynamically taking a CPU, regions of memory, and I/O devices offline before these components can cause a system failure. Solaris Fault Manager isolates and disables faulty components, and helps ensure continuous service even before administrators know there is a problem."

The math questions are to keep spam bots out. Apparently some kiddies write Perl scripts to populate blogs with porn links. 21st century graffiti without the slightest hint of art. Of course a script could be coded to answer the question, so maybe Roller should use nonlinear math problems. Then the script kiddie would have to use something like Sun's ONE Studio's interval math to get the right answer!
[1,2] + [2,3] = ?

Posted by bnitz on September 28, 2004 at 09:45 AM GMT+00:00 #

Tobin Coziahr's blog has some more details on Solaris 10 predictive self-healing. This part of it appears to simplify the management of interdependencies between services.

Posted by bnitz on September 28, 2004 at 02:11 PM GMT+00:00 #
