Trond Norbye's Weblog

« Support for Monotone... | Main | Trailing whitespace »

http://blogs.sun.com/trond/date/20090911 Friday September 11, 2009

Don't log, dump core!

A common trend as a profession grows and diversifies is the loss of the good, old craftsmanship; software development is no exception. It seems to me developers who use a debugger are a dying breed, and many who do more than "rm" on a corefile are really hard to find. So what's wrong with logging?? Well, I'll suggest that you start off by reading Don't Log, Debug!

I think Tor has some really good points in his blog post. When you write the program you don't know where the bug is, so you will most likely not include enough information to track down the bug anyway. You will most likely have to provide the customer with an instrumented version if you cannot reproduce the problem locally.

I have heard people trying to excuse themselves by saying: "I can't use a debugger to find this issue because it is a timing issue". Well, I don't buy that, because if they have to add more logging from their code the timing will change as well (and possibly mask out the error).

Tor works in "the Java world", whereas I spend my time developing C++/C programs. We have an option as well: coredumps. When you load the corefile into your debugger you can inspect every variable in your application at the time you generated the coredump, and you can look at the callstacks from all of the threads in your program. Personally I find it much more fun to use the debugger to inspect the corefile instead of reading through miles of logfiles...

With this in mind, rethink the excuse with timing issues. Wouldn't it be better to just dump core when you encounter the problem and load the corefile into your favorite debugger :-) .

You may think that dumping core is brutal to your users, because not all failures are fatal errors. You may be able to recover gracefully from some of the errors, but even if you don't know what leads to the error I still generate a coredump so I can dig into the problem. To avoid shutting down the service, I'll just fork off a copy of the program to generate a dump from:

#define recoverable_assert(ev) do_recoverable_assert(ev, #ev, __FILE__, __LINE__)

int do_recoverable_assert(int eval, const char *expression, const char *file, int lineno)
{
   if (eval == 0) {
      if (fork() == 0) {
         fprintf(stderr, "%s:%u: %s\n", file, line, expression);
         abort();
      }
      return 1;
   }

   return 0;
}

... cut ...

if (recoverable_assert((address % 8) == 0) {
   /* the address for the client buffer isn't aligned, start recover */
}

Unless you have modified your environment, you should get coredumps when your program performs an illegal operation. Unfortunately some engineers/managers think it's inappropriate for the user to get a coredump from a program, so they add logic into their program to trap such signals and exit cleanly. Personally I don't think that this is a good idea because the engineers lose valuable information when a problem occurs at a customer's site. With the corefile available you may investigate on a problem you fail to reproduce locally (and if the customer don't want to release the corefile due to security reasons, it is still possible to debug on-site). If the user doesn't want the coredump, they should turn this feature off in his shell/startup script before starting the program.

Comments:

The link to Tor's blog seems not to work. It seems as though it should point to http://blogs.sun.com/tor/entry/code_advice_1_don_t

The recoverable_assert() idea is quite interesting. I do have a couple ideas for improvement of the implementation:

- Either use a double fork or waitid to avoid zombie/defunct processes. Double fork does run the risk of now requiring 3x the swap reservation - potentially resulting in new failure modes.
- Log the current working directory just before abort() so the person knows where to find the core file.
- Consider renaming the core file after it dumps, presumably including a timestamp in the name to prevent successive cores from overwriting the first one. This is only needed if coreadm hasn't been used to set the core pattern.
- Consider an initialization routine (called very early in the program) that can be used to specify the directory where core files should be stored. That is:

char *coredir = NULL;
recoverable_assert_init() {
char *p;
if ((p=getenv("COREDIR") != NULL) {
coredir = strdup(p);
}
}

Then after the fork():

if (coredir != NULL) {
chdir(coredir);
}

This is done early to avoid problems like:

http://bugs.opensolaris.org/view_bug.do?bug_id=6750659
http://opensolaris.org/jive/thread.jspa?threadID=75146

Posted by Mike Gerdts on September 11, 2009 at 01:50 PM CEST #

Thanks for the comments (I fixed the link :-) )

Yeah the recoverable_assert example was just to throw out the idea on how you could fix such a problem. Another thing you could do is to create a second process to monitor the directory for corefiles, and whenever a new one turns up, you can fire up lets say mdb on the corefile, get the callstack and possibly run findleaks or print variables and send the result in an email to you :-)

Posted by Trond Norbye on September 12, 2009 at 01:24 AM CEST #

Post a Comment:
  • HTML Syntax: NOT allowed

Valid HTML! Valid CSS!

This is a personal weblog, I do not speak for my employer.