It usually works
Wednesday Jun 09, 2004
One of the other Sun engineers titled his blog "sometimes it works". Some bugs almost never occur, but when they do occur they are just as vexing, and such intermittent problems can be extremely difficult to track down. Here are two examples of bugs that appeared only about once every 10-30 reboots on a customized Java Desktop System 2.0. The first bug presents the GNOME user with a login failure and a "your session has lasted less than 10 seconds" message. Here are some details:
Xlib: connection to ":0.0" refused by server
Xlib: No protocol specified
/usr/bin/X11/xsetroot: unable to open display ':0'
...

(None of the clients can connect to the X server.)

The logs in /var/lib/gdm show:

AUDIT: Fri Jun 4 15:48:27 2004: 1631 X: client 3 rejected from local host
AUDIT: Fri Jun 4 15:49:56 2004: 1631 X: client 2 rejected from local host
...

Further investigation showed the root cause of this. The problem occurs in the following circumstance:
A janitorial problem caused the second intermittent bug. The underlying Linux distribution (SLED) does not clear the /tmp directory between reboots. Some processes create sockets in /tmp/.ICE-unix named after their process ID. Unfortunately, after a reboot it is possible (in fact likely) for a new process to get the same PID as a long-dead process. Because /tmp/.ICE-unix hasn't been cleaned out, the unlucky process can't create a socket named after its PID because that name already exists. Again, the symptom is an intermittent login failure.
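The collision is easy to reproduce outside of GNOME. Here is a minimal Python sketch, not the actual ICE library code: it uses a temporary directory as a stand-in for /tmp/.ICE-unix, and the helper names (bind_pid_socket, remove_stale_entries, demo) and the PID value are made up for illustration. It shows that a socket file left over from an earlier process blocks a new bind on the same name, and that emptying the directory first makes the bind succeed.

```python
import errno
import os
import socket
import tempfile


def bind_pid_socket(directory, pid):
    """Bind a Unix-domain socket named after `pid` inside `directory`."""
    path = os.path.join(directory, str(pid))
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)  # fails if `path` already exists
    except OSError:
        sock.close()
        raise
    return sock


def remove_stale_entries(directory):
    """Delete every leftover entry; only safe before any client has started."""
    removed = []
    for name in sorted(os.listdir(directory)):
        os.unlink(os.path.join(directory, name))
        removed.append(name)
    return removed


def demo():
    # Temporary directory standing in for /tmp/.ICE-unix (assumption).
    with tempfile.TemporaryDirectory() as ice_dir:
        pid = 1631  # hypothetical PID, reused after the "reboot"

        # "Previous boot": a process binds its socket and dies without cleanup.
        old = bind_pid_socket(ice_dir, pid)
        old.close()  # closing the socket does not remove the file

        # "Next boot": a new process gets the same PID and tries to bind.
        try:
            bind_pid_socket(ice_dir, pid).close()
            collided = False
        except OSError as exc:
            collided = exc.errno == errno.EADDRINUSE

        # Janitorial fix: clear the directory, then the bind succeeds.
        removed = remove_stale_entries(ice_dir)
        bind_pid_socket(ice_dir, pid).close()
        return collided, removed


if __name__ == "__main__":
    print(demo())
```

On a real system the cleanup would belong in the boot scripts, before the display manager starts, rather than in the client itself.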










