Thursday June 23, 2005 By Greg Nakhimovsky
In May 2005, Morgan Herrington and I created a tool called AppCrash and published an article about it:
Enabling User-Controlled Collection of Application Crash Data With DTrace
Since it was published, we have come up with a few AppCrash updates and additions. Instead of updating the article itself, we decided it's easier and more flexible to put the updates into a blog.
Note that, since all components of AppCrash are scripts, anyone can easily modify or customize them.
Here are the updates we have so far.
In addition to the version described in the article, the following script, "app_crash_global.d", is an alternate version of the AppCrash DTrace script that catches a SIGSEGV or SIGBUS crash in any process running on the system. The "global" version uses the configuration file "/etc/app_crash_global" instead of the ON_APP_CRASH_INVOKE environment variable. If that file is readable and contains the full path to an executable script, the app_crash_global.d daemon runs that script whenever a process crashes. This makes it possible to catch a crash in any process, including system processes such as the X server or the window manager.
% cat app_crash_global.d
#!/usr/sbin/dtrace -qws
#pragma D option strsize=500
/*
* Copyright 2005 Sun Microsystems, Inc. ALL RIGHTS RESERVED
* Use of this software is authorized pursuant to the terms of the license found at
* http://developers.sun.com/berkeley_license.html
*
* By Greg Nakhimovsky and Morgan Herrington, Sun MDE, June 2005.
*
* This DTrace script can be run as a daemon started by root
* (perhaps on boot with an RC script like /etc/rc2.d/S97app_crash),
* or it can be run by any user with appropriate DTrace permissions
* (in which case it would work for that user's applications only).
*
* args[1]->pr_pid is the signal-receiving process id;
* Only react if it is the same as the sending process id (pid).
*
* This (global) version reacts to configuration file /etc/app_crash_global
* if it exists and contains a pointer to the user-provided script.
*/
proc:::signal-send
/(args[2] == SIGBUS || args[2] == SIGSEGV) &&
 pid == args[1]->pr_pid/
{
    stop();
    /*
     * If file /etc/app_crash_global exists (containing a path to
     * the script), run the script as $USER
     */
    system(
        "%s=%d; %s=%d; %s=%d; %s=%s; %s %s %s %s %s %s %s %s %s",
        "CRASH_PID", pid,
        "CRASH_UID", uid,
        "DTRACE_UID", $uid,
        "PROG", execname,
        "[ ! -r /etc/app_crash_global ] && exit 0;",
        "SCRIPT=`/bin/cat /etc/app_crash_global`;",
        "[ -z \"$SCRIPT\" -o ! -x \"$SCRIPT\" ] && exit 0;",
        "if [ $DTRACE_UID -eq 0 -a $CRASH_UID -ne 0 ] ; then",
        " USER_NAME=`/bin/getent passwd $CRASH_UID|/bin/cut -d: -f1`;",
        " /bin/su $USER_NAME -c \"$SCRIPT $CRASH_PID $PROG\";",
        "else ",
        " $SCRIPT $CRASH_PID $PROG; ",
        "fi"
    );
    /* Continue normal processing */
    system("/bin/prun %d", pid);
}
%
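The configuration-file checks that the system() action performs can also be tried on their own. The following is a minimal sketch, not part of AppCrash: the function name resolve_crash_script and its optional argument are hypothetical, and it only mirrors the three tests in the script above (file readable, content non-empty, target executable).

```shell
#!/bin/sh
# Hypothetical helper (not part of AppCrash) mirroring the checks made
# inside the system() action of app_crash_global.d.
# Usage: resolve_crash_script [config_file]
# Prints the configured script path and returns 0 if AppCrash would run
# it; prints nothing and returns 1 otherwise.
resolve_crash_script()
{
    config=${1:-/etc/app_crash_global}
    # No readable configuration file: nothing to do.
    [ -r "$config" ] || return 1
    script=`/bin/cat "$config"`
    # Empty file, or a path that is not executable: nothing to do.
    [ -n "$script" ] && [ -x "$script" ] || return 1
    echo "$script"
}
```

Calling resolve_crash_script with no argument before starting the daemon gives a quick indication of whether /etc/app_crash_global points at something the daemon would actually run.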
It turns out that the Solaris JVM (Java Virtual Machine) uses the SIGSEGV and SIGBUS signals for its own purposes in some cases. As a result, a running AppCrash daemon can produce Java-related crash reports even when nothing has crashed.
When the Solaris JVM really crashes (for example, when native JNI C/C++ code causes a segmentation violation), the JVM calls abort(3C), which generates a SIGABRT signal. We therefore propose the following modification of the AppCrash DTrace script to give the Solaris JVM special treatment. As a base, we are using the "global" version of the AppCrash script described above.
The idea for this implementation is based on the fact that any Solaris binary containing an embedded JVM (such as "java*" programs or Application Server binary "appserv") has one of the JVM libraries {client,server}/libjvm[_g].so linked in. Therefore, the following script makes an additional check in the system() action: if the signal is not SIGABRT (6) (meaning it's either SIGSEGV or SIGBUS) and the pldd(1) output for the current process contains "libjvm", then don't perform the AppCrash action.
#!/usr/sbin/dtrace -qws
#pragma D option strsize=512
/*
* This is a "global" version, working for the entire system.
* Also, with a special treatment for processes with an embedded JVM.
*/
/*
* Copyright 2005 Sun Microsystems, Inc. ALL RIGHTS RESERVED
* Use of this software is authorized pursuant to the terms of the license found at
* http://developers.sun.com/berkeley_license.html
*
* By Greg Nakhimovsky and Morgan Herrington, Sun MDE, June 2005.
*
* This DTrace script can be run as a daemon started by root
* (perhaps on boot with an RC script like /etc/rc2.d/S97app_crash),
* or it can be run by any user with appropriate DTrace permissions
* (in which case it would work for that user's applications only).
*
* args[1]->pr_pid is the signal-receiving process id;
* Only react if it is the same as the sending process id (pid).
*
* This (global) version reacts to configuration file /etc/app_crash_global
* if it exists and contains a pointer to the user-provided script.
*/
proc:::signal-send
/(args[2] == SIGBUS || args[2] == SIGSEGV || args[2] == SIGABRT) &&
 pid == args[1]->pr_pid/
{
    stop();
    /*
     * If file /etc/app_crash_global exists (containing a path to
     * the script), run the script as $USER.
     *
     * Also, do nothing if the signal is SIGSEGV or SIGBUS and
     * the process has an embedded JVM (linked with libjvm.so).
     */
    system(
        "%s=%d; %s=%d; %s=%d; %s=%s; %s=%d; %s %s %s %s %s %s %s %s %s %s",
        "CRASH_PID", pid,
        "CRASH_UID", uid,
        "DTRACE_UID", $uid,
        "PROG", execname,
        "SIG", args[2],
        "[ $SIG -ne 6 ] && /bin/pldd $CRASH_PID|/bin/grep libjvm>/dev/null && exit 0;",
        "[ ! -r /etc/app_crash_global ] && exit 0;",
        "SCRIPT=`/bin/cat /etc/app_crash_global`;",
        "[ -z \"$SCRIPT\" -o ! -x \"$SCRIPT\" ] && exit 0;",
        "if [ $DTRACE_UID -eq 0 -a $CRASH_UID -ne 0 ] ; then",
        " USER_NAME=`/bin/getent passwd $CRASH_UID|/bin/cut -d: -f1`;",
        " /bin/su $USER_NAME -c \"$SCRIPT $CRASH_PID $PROG $SIG\";",
        "else ",
        " $SCRIPT $CRASH_PID $PROG $SIG; ",
        "fi"
    );
    /* Continue normal processing */
    system("/bin/prun %d", pid);
}
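The JVM filtering rule in the script above can be exercised in isolation. The sketch below is illustrative only: the function name skip_for_jvm is hypothetical, and it reads a pldd-style library list on standard input instead of calling pldd(1) itself, so it can be tried without a live process.

```shell
#!/bin/sh
# Hypothetical helper (not part of AppCrash) reproducing the JVM rule:
#   $1    = signal number
#   stdin = list of loaded libraries, one per line (pldd-style)
# Returns 0 if the report should be skipped, non-zero otherwise.
skip_for_jvm()
{
    sig=$1
    # SIGABRT (6) is always reported, even for processes with a JVM.
    [ "$sig" -eq 6 ] && return 1
    # For SIGSEGV or SIGBUS, skip when any library name contains "libjvm".
    grep libjvm > /dev/null
}
```

With this helper, the check in the script would read roughly as "/bin/pldd $CRASH_PID | skip_for_jvm $SIG && exit 0".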
Note that we've added SIGABRT to the list of crashing signals. If any program calls abort(3C), that's a good indication that something has gone very wrong and a crash report is warranted.
Also note that the DTrace script above passes one more argument to shell script $SCRIPT that is invoked on crash: signal number $SIG. It can be useful to know exactly which signal generated the crash report. Therefore, it may also be a good idea to print the signal number in a modified runme_on_app_crash shell script as follows.
% cat runme_on_app_crash
#!/bin/sh
# Template for a user-defined script invoked on application crash.
#
# Copyright 2005 Sun Microsystems, Inc. ALL RIGHTS RESERVED
# Use of this software is authorized pursuant to the terms of the license found at
# http://developers.sun.com/berkeley_license.html
#
# By Greg Nakhimovsky and Morgan Herrington, Sun MDE, June 2005.
#
# $1 = process id of the crashing process.
# $2 = name of the crashing program.
# $3 = number of the signal that triggered the report.
PID=$1
PROG=$2
SIG=$3
APPCRASH_OUT=/var/tmp/appcrash.$PROG.$PID
# Unset ON_APP_CRASH_INVOKE to prevent recursion
ON_APP_CRASH_INVOKE=
export ON_APP_CRASH_INVOKE
# Function to print and run a given command:
print_run()
{
    COMMAND="$@"
    echo "\n> $COMMAND" >> $APPCRASH_OUT 2>&1
    # "eval" is needed to recognize the pipe ("|") character:
    eval $COMMAND >> $APPCRASH_OUT 2>&1
}
# Write a message to console if we have the permissions:
PROCESS_OWNER=`/usr/xpg4/bin/id -u -n`
CONSOLE_OWNER=`/bin/ls -lL /dev/console | /bin/awk '{ print $3; }'`
if [ "$PROCESS_OWNER" = "root" -o "$PROCESS_OWNER" = "$CONSOLE_OWNER" ] ; then
    echo "`/bin/date`: $PROG (pid=$PID) has crashed, see $APPCRASH_OUT" > /dev/console 2>&1
fi
echo "Output from runme_on_app_crash" > $APPCRASH_OUT 2>&1
echo "Program: $PROG" >> $APPCRASH_OUT 2>&1
echo "Process ID: $PID" >> $APPCRASH_OUT 2>&1
echo "Received signal: $SIG" >> $APPCRASH_OUT 2>&1
echo "\nApplication Debugging Data" >> $APPCRASH_OUT 2>&1
echo "--------------------------" >> $APPCRASH_OUT 2>&1
print_run "/bin/pstack $PID"
print_run "/bin/pmap -x $PID"
print_run "/bin/pldd $PID"
print_run "/bin/ptree $PID"
print_run "/bin/pargs -ace $PID"
print_run "/bin/plimit -m $PID"
print_run "/bin/pwdx $PID"
print_run "/bin/pfiles $PID"
# You may want to add "pcred" if interested in user ids:
# print_run "/bin/pcred $PID"
echo "\nSystem Configuration Data" >> $APPCRASH_OUT 2>&1
echo "-------------------------" >> $APPCRASH_OUT 2>&1
print_run "/bin/uname -a"
print_run "/bin/cat /etc/release"
print_run "/usr/sbin/psrinfo -v"
print_run "/usr/sbin/swap -s"
print_run "/usr/sbin/swap -l"
print_run "/usr/sbin/prtconf|/bin/head -2"
print_run "/bin/showrev -p|/bin/cut -d' ' -f2|/bin/sort"
# For more system configuration commands (such as graphics
# configuration) see script "check_config" described in
# http://www.sun.com/technical-computing/ISV/PTCFaq.html#CHECK_CONFIG
# If desired, add application version/build information like this:
# if [ $PROG = "foo" ]; then
# VER=`cat /opt/foo/txt/version.txt`
# elif [ $PROG = "bar" ]; then
# # Use an appropriate method to get the version of program "bar"
# else
# VER="Unknown"
# fi
# print_run "echo Application Version/Build = $VER"
# Optionally, you can automatically email the result to your
# tech support organization, like this:
# /bin/cat $APPCRASH_OUT | /bin/mailx -s \
# "`/bin/uname -n`: $PROG crash: pid $PID" support@whatever.com
%
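The report above records the signal as a bare number. If a name is easier to read, a small lookup along the following lines could be added to runme_on_app_crash. This is only a sketch: the function name sig_name is hypothetical, and the numbers assume Solaris signal numbering (SIGABRT is 6, SIGBUS is 10, SIGSEGV is 11); other systems number SIGBUS differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of AppCrash): translate the signal
# numbers that the DTrace script can pass into their Solaris names.
sig_name()
{
    case "$1" in
        6)  echo SIGABRT ;;
        10) echo SIGBUS ;;
        11) echo SIGSEGV ;;
        # Anything else is reported by number only.
        *)  echo "signal $1" ;;
    esac
}
```

The "Received signal" line could then print both forms, for example "$SIG (`sig_name $SIG`)".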
There are other ways to handle special cases like Java programs, but for now the one described above appears to work adequately.
In the article, we briefly mentioned how AppCrash can be started as a system-wide daemon. Here's a specific example of how this can be done. All the following operations are assumed to be performed by root.
% cat S97app_crash
#!/sbin/sh
# /etc/rc2.d/S97app_crash hardlinked to /etc/init.d/app_crash
#
case "$1" in
'start')
    /opt/app_crash/app_crash.pl
    ;;
'stop')
    /usr/bin/pkill -9 -f app_crash_global.d
    ;;
*)
    echo "Usage: $0 { start | stop }"
    exit 1
    ;;
esac
exit 0
%
% cat app_crash.pl
#!/bin/perl -w
# As described in http://www.webreference.com/perl/tutorial/9/
# and referenced in
# http://developers.sun.com/solaris/articles/app_crash/app_crash.html
use integer;
use strict;
use POSIX qw(setsid);
# Auto-flush STDOUT and STDERR:
select(STDOUT); $| = 1;
select(STDERR); $| = 1;
# Daemonize the program:
chdir '/' or die "Can't chdir to /: $!";
open STDIN, '/dev/null' or die "Can't read /dev/null: $!";
# Uncomment if you want to suppress STDOUT and STDERR:
# open STDOUT, '>>/dev/null' or die "Can't write to /dev/null: $!";
# open STDERR, '>>/dev/null' or die "Can't write to /dev/null: $!";
defined(my $pid = fork) or die "Can't fork: $!";
exit if $pid;
setsid or die "Can't start a new session: $!";
umask 0;
# Start the daemon and exit the Perl script:
system("/opt/app_crash/app_crash_global.d &");
%
% cat install_as_daemon
#!/bin/sh -x
# Run this as root (make sure to do "su -"):
cp /opt/app_crash/S97app_crash /etc/rc2.d
ln /etc/rc2.d/S97app_crash /etc/rc2.d/K97app_crash
ln /etc/rc2.d/S97app_crash /etc/init.d/app_crash
echo "/opt/app_crash/runme_on_app_crash" > /etc/app_crash_global
/etc/init.d/app_crash stop
/etc/init.d/app_crash start
%
/etc/init.d/app_crash start
From this point on, any crashing process triggers an AppCrash action: the runme_on_app_crash script is executed with the permissions of the crashing process's owner.
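One way to check that the daemon reacts is to provoke a harmless crash. The sketch below is an assumption-laden test aid, not part of AppCrash: the function name crash_once is hypothetical, and it relies on the script's predicate (the sender and receiver of the signal are the same process) being satisfied when a child shell sends signal 11 (SIGSEGV) to itself.

```shell
#!/bin/sh
# Hypothetical test aid (not part of AppCrash): start a throwaway child
# shell that sends SIGSEGV (signal 11) to itself, then report how the
# child ended. The parent shell is not affected.
crash_once()
{
    /bin/sh -c 'kill -11 $$'
    echo "child exit status: $?"
}
```

After running crash_once with the daemon active and the template runme_on_app_crash configured, a new report file matching /var/tmp/appcrash.* should appear if everything is wired up as described above.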