Trond Norbye's Weblog

http://blogs.sun.com/trond/date/20090911 Friday September 11, 2009

Don't log, dump core!

A common trend as a profession grows and diversifies is the loss of good old craftsmanship, and software development is no exception. It seems to me that developers who use a debugger are a dying breed, and those who do more than "rm" a corefile are really hard to find. So what's wrong with logging? Well, I suggest you start by reading Don't Log, Debug!

I think Tor has some really good points in his blog post. When you write the program you don't know where the bug will be, so your logging will most likely not include enough information to track it down anyway. If you cannot reproduce the problem locally, you will most likely have to provide the customer with an instrumented version.

I have heard people try to excuse themselves by saying: "I can't use a debugger to find this issue because it is a timing issue". Well, I don't buy that, because if they have to add more logging to their code the timing will change as well (and possibly mask the error).

Tor works in "the Java world", whereas I spend my time developing C++/C programs. We have an option as well: coredumps. When you load the corefile into your debugger you can inspect every variable in your application at the time you generated the coredump, and you can look at the callstacks from all of the threads in your program. Personally I find it much more fun to use the debugger to inspect the corefile instead of reading through miles of logfiles...

With this in mind, rethink the timing-issue excuse. Wouldn't it be better to just dump core when you encounter the problem and load the corefile into your favorite debugger? :-)

You may think that dumping core is brutal to your users, because not all failures are fatal errors. You may be able to recover gracefully from some errors, but as long as I don't know what led to the error I still want a coredump so I can dig into the problem. To avoid shutting down the service, I'll just fork off a copy of the program to generate the dump from:

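#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define recoverable_assert(ev) do_recoverable_assert(ev, #ev, __FILE__, __LINE__)

/* Returns 1 if the assertion failed (a forked child dumps core), 0 if it held. */
int do_recoverable_assert(int eval, const char *expression, const char *file, int lineno)
{
   if (eval == 0) {
      if (fork() == 0) {
         /* child: report the failed expression, then abort() to generate the corefile */
         fprintf(stderr, "%s:%d: %s\n", file, lineno, expression);
         abort();
      }
      /* parent: keep running, but tell the caller to start recovery */
      return 1;
   }

   return 0;
}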

... cut ...

if (recoverable_assert((address % 8) == 0)) {
   /* the address for the client buffer isn't aligned, start recover */
}
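
One thing the snippet above leaves open is what happens to the forked child once it has aborted: until the parent waits for it, it lingers as a zombie. Here is a minimal sketch of one way to handle that, assuming a POSIX system. The names fork_and_dump and reap_dump_child are hypothetical helpers made up for this illustration, not part of any library:

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child whose only job is to abort() and
 * thereby generate the corefile. Returns the child's pid to the parent,
 * or -1 if fork() failed. */
static pid_t fork_and_dump(const char *expression, const char *file, int lineno)
{
   pid_t pid = fork();
   if (pid == 0) {
      /* child: report the failed expression, then dump core */
      fprintf(stderr, "%s:%d: %s\n", file, lineno, expression);
      abort();
   }
   return pid;
}

/* Hypothetical helper: wait for the child so it does not linger as a zombie.
 * Returns 1 if the child was terminated by SIGABRT, 0 otherwise. */
static int reap_dump_child(pid_t pid)
{
   int status;
   if (pid <= 0 || waitpid(pid, &status, 0) != pid) {
      return 0;
   }
   return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

Note that the waitpid() call here blocks until the child is gone, and writing a large corefile can take a while. A busy service would probably rather reap the child later, for example from a SIGCHLD handler or with the WNOHANG option.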

Unless you have modified your environment, you should get coredumps when your program performs an illegal operation. Unfortunately some engineers/managers think it's inappropriate for the user to get a coredump from a program, so they add logic to their program to trap such signals and exit cleanly. Personally I don't think this is a good idea, because the engineers lose valuable information when a problem occurs at a customer's site. With the corefile available you can investigate a problem you fail to reproduce locally (and if the customer doesn't want to release the corefile for security reasons, it is still possible to debug on-site). If users don't want the coredump, they should turn this feature off in their shell/startup script before starting the program.
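
Whether a corefile gets written is governed by the process's corefile size limit, the same limit a shell's "ulimit -c" adjusts. As a rough sketch, assuming a POSIX system with getrlimit()/setrlimit(), a program can inspect and change that limit itself. The helper names below are hypothetical, invented for this example. Be aware that some systems ship with a soft limit of zero, and that other system settings (such as where the kernel writes corefiles) also affect whether a dump actually shows up:

```c
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if the soft corefile size limit is
 * non-zero, 0 if it is zero, and -1 if the limit could not be read. */
static int core_dumps_allowed(void)
{
   struct rlimit limit;
   if (getrlimit(RLIMIT_CORE, &limit) != 0) {
      return -1;
   }
   return limit.rlim_cur != 0;
}

/* Hypothetical helper: raise the soft limit as far as the hard limit
 * permits. Returns 0 on success, -1 on error. */
static int enable_core_dumps(void)
{
   struct rlimit limit;
   if (getrlimit(RLIMIT_CORE, &limit) != 0) {
      return -1;
   }
   limit.rlim_cur = limit.rlim_max;
   return setrlimit(RLIMIT_CORE, &limit);
}

/* Hypothetical helper: the programmatic counterpart of "ulimit -c 0".
 * Only the soft limit is lowered, so it can be raised again later. */
static int disable_core_dumps(void)
{
   struct rlimit limit;
   if (getrlimit(RLIMIT_CORE, &limit) != 0) {
      return -1;
   }
   limit.rlim_cur = 0;
   return setrlimit(RLIMIT_CORE, &limit);
}
```

If the hard limit itself is zero, enable_core_dumps() succeeds but core_dumps_allowed() still reports 0; in that case the limit has to be raised by whoever starts the program.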


This is a personal weblog, I do not speak for my employer.