Stephan Grell's Weblog

Thursday, July 14, 2005

N1GE 6 - health monitoring

Software such as our Grid Engine can be a critical component in a production environment. Keeping it running correctly has the highest priority. However, there are cases in which the grid goes down or one of its components becomes unavailable. When this happens, the administrator or the software has to react right away. N1GE 6 provides two ways to monitor the correct functioning of its components:

- the heartbeat file at: <CELL>/common/heartbeat
- qping.

Qping was enhanced quite a bit over the different update releases. The u4 update contains a fully functional version, and that is the version I refer to in this post.

1) Heartbeat file:
The heartbeat file contains a simple number that is increased at a fixed interval. If that number does not change for a couple of minutes, the qmaster has most likely stopped running.
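The check described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: it assumes the file holds nothing but the counter as plain text, and the function names, the path argument and the waiting time are my own choices for illustration, not part of N1GE.

```python
import time
from pathlib import Path


def read_heartbeat(path):
    """Return the counter stored in the heartbeat file as an int.

    Assumption: the file contains only the number, as plain text.
    """
    return int(Path(path).read_text().strip())


def heartbeat_is_advancing(path, wait_seconds, sleep=time.sleep):
    """Read the counter twice, wait_seconds apart, and report whether it changed.

    The sleep function is a parameter only so that the check can be
    exercised without really waiting.
    """
    before = read_heartbeat(path)
    sleep(wait_seconds)
    after = read_heartbeat(path)
    return after != before
```

Called with the path <CELL>/common/heartbeat and a waiting time of a couple of minutes, a result of False would suggest that the qmaster is no longer updating the file.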

2) qping:
Qping gives a more comprehensive way of monitoring the grid. It can be used to monitor the qmaster and the execd daemon. Depending on the parameters it is invoked with, one gets a heartbeat replacement or detailed information about the status of the daemon. I will give a short introduction to qping here; for more information, consult the qping(1) man page. The monitoring part of the qping command can be executed from any machine by any user.

Heartbeat file replacement:

Command: qping <MASTER_HOST> $SGE_QMASTER_PORT qmaster 1
         qping <EXECD_HOST> $SGE_EXECD_PORT execd 1

Output:  07/14/2005 14:38:19 endpoint scrabe.workgroup/qmaster/1 at port 7171 is up since 194 seconds

The output format is:
<DATE> <TIME> endpoint <MASTER_HOST/qmaster/1> at port <PORT_NUMBER> is up since <SECONDS> seconds
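For a monitoring script, that line can be split into its parts with a regular expression. The following is a minimal sketch that assumes the line looks exactly like the format shown above. The function name and the dictionary keys are made up for this example.

```python
import re

# Pattern built from the output format shown above (an assumption:
# other qping versions may format this line differently).
_STATUS = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"endpoint (?P<host>[^/\s]+)/(?P<component>[^/\s]+)/(?P<id>\d+) "
    r"at port (?P<port>\d+) is up since (?P<uptime>\d+) seconds$"
)


def parse_qping_status(line):
    """Return the fields of a qping status line as a dict, or None if it does not match."""
    m = _STATUS.match(line.strip())
    if m is None:
        return None
    return {
        "date": m["date"],
        "time": m["time"],
        "host": m["host"],
        "component": m["component"],
        "id": int(m["id"]),
        "port": int(m["port"]),
        "uptime_seconds": int(m["uptime"]),
    }
```

A result of None means the line did not have the expected shape, which a monitoring script would probably want to treat as a failed check.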

Extensive health information:

Command: qping -f <MASTER_HOST> $SGE_QMASTER_PORT qmaster 1
         qping -f <EXECD_HOST> $SGE_EXECD_PORT execd 1

output:
07/14/2005 14:38:10:
SIRM version:             0.1
SIRM message id:          2
start time:               07/14/2005 14:35:05 (1121344505)
run time [s]:             185
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 3
status:                   0
info:                     TET: R (4.71) | EDT: R (0.71) | SIGT: R (184.61) | MT(1): R (6.17) | MT(2): R (4.62) | OK

The important information that the other output does not provide is the per-thread monitoring and the number of messages in the read buffer. The per-thread information allows finer-grained monitoring and makes it possible to detect deadlocks in the master. The number of messages in the read buffer can serve as an indicator of an overloaded qmaster. The qping in updates 4 and 5 shows only one MT thread even though two are used. This will change, as one can see in the output above.
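To use the read buffer count and the per-thread entries in a script, the qping -f output first has to be taken apart. Here is a minimal sketch, assuming the output has the "name: value" layout shown above and that the info line separates its entries with "|". The function names are hypothetical, and the sketch does not interpret the state letter or the number in parentheses; it only extracts them.

```python
import re

# "name: value" lines; the name must start with a letter so that the
# leading timestamp line is skipped.
_FIELD = re.compile(r"^([A-Za-z][^:]*):\s+(.*)$")
# One thread entry of the info line, e.g. "MT(1): R (6.17)".
_THREAD = re.compile(r"^(\S+): (\S+) \(([\d.]+)\)$")


def parse_qping_full(text):
    """Return the 'name: value' lines of the qping -f output as a dict of strings."""
    fields = {}
    for line in text.splitlines():
        m = _FIELD.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields


def parse_thread_info(info):
    """Split the info line into (thread, state, number) tuples and the trailing summary."""
    threads = []
    summary = None
    for part in (p.strip() for p in info.split("|")):
        m = _THREAD.match(part)
        if m:
            threads.append((m.group(1), m.group(2), float(m.group(3))))
        elif part:
            summary = part
    return threads, summary
```

With the parsed fields, a script could compare int(fields["messages in read buffer"]) against a threshold of its own choosing and check each thread entry separately.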

The other functions of qping belong to the debugging and analysis domain and are definitely worth playing with.

( Jul 14 2005, 03:31:55 PM CEST )