As I mentioned previously, the topic of message build-up in the MTA message queues prompted this writeup. Here's the second part of this installment, with details on what you can do if you discover that your deployment has an actual problem with message build-up.
Before You Begin
At a minimum, you need to enable logging for the channels that you want
to troubleshoot. The amount of logging that you set (log level) depends on
your situation. To enable logging on a channel and learn about other options,
see Managing MTA Message and Connection Logs in Sun Java System Messaging
Server 6.3 Administration Guide.
Troubleshooting TCP Channels
This section describes a general approach to troubleshooting the build-up of
messages in TCP message queues to determine if there is an actual problem.
When you suspect that there is something happening that is
more than the store-and-forward aspect of messages building up in a message
queue, begin by using the imsimta qm summarize command.
For more information, see imsimta qm in Sun Java
System Messaging Server 6.3 Administration Reference.
The imsimta qm summarize command can greatly
impact your system if you have a large backlog of messages. Rather than running
this command frequently, consider using the imsimta qm messages channel command.
This command lists the
destination hosts for which messages are queued in the specified channel.
It also lists how many messages are waiting for their next scheduled
retry (the delayed messages column) versus how many are ready to be tried
now (the active messages column). When you use the imsimta qm messages command,
you must specify a channel name; a wildcard is not valid input. For example:
# imsimta qm
qm.maint> messages tcp_local
host          active messages   delayed messages
example.com   0                 2000
sesta.com     0                 3000
In this example, 2,000 messages are waiting for their next scheduled
retry to be delivered to example.com, and 3,000 are waiting
for their next scheduled retry for delivery to sesta.com.
Note - The Messaging Server 6.3 release removed the imsimta
qm messages command. However, Messaging Server 6.3 does contain
a new, useful command—imsimta qm jobs—to
help understand why messages are not being delivered.
This information is also useful for the situation where messages have
failed and are waiting to be retried. If you see there are many messages in
a channel queue, but most of them are delayed, this probably indicates the
problem is with the remote domain. See Destination Host Problems for information on how to work around this problem.
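As a rough illustration of reading this table, the following Python sketch parses output shaped like the example above and flags hosts where nearly all queued messages are delayed. The function names and the 0.9 cutoff are assumptions made for this illustration; nothing here is defined by Messaging Server.

```python
def parse_messages_output(text):
    """Parse the host / active / delayed table into {host: (active, delayed)}."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three fields, the last two being integers.
        # This skips the prompt line and the column header.
        if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            counts[fields[0]] = (int(fields[1]), int(fields[2]))
    return counts


def mostly_delayed(counts, ratio=0.9):
    """Return hosts whose delayed share of queued messages is at least ratio.

    The ratio is an arbitrary cutoff chosen for this sketch.
    """
    flagged = []
    for host, (active, delayed) in counts.items():
        total = active + delayed
        if total and delayed / total >= ratio:
            flagged.append(host)
    return flagged


# Output copied from the example in this note.
sample = """\
qm.maint> messages tcp_local
host active messages delayed messages
example.com 0 2000
sesta.com 0 3000
"""

counts = parse_messages_output(sample)
print(counts)
print(mostly_delayed(counts))
```

Hosts that this flags are candidates for the remote-domain checks described here, not proof of a remote failure.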
Note - A message can be in the process of being tried by a job, waiting on a channel
to be tried, or waiting on a channel to be retried. The active messages column of the messages command includes messages
that have not been tried yet and those that were previously delayed and
are now ready to be tried again; it does not count messages that a job is
currently trying. That is why you might see a zero (0) in this
column. In Messaging Server 6.3, you can see the messages
being retried with the new jobs command.
In iPlanet Messaging Server 5, use top -to channel or top -domain_to channel to analyze what is going on in that channel.
Look for trends on your system. For example, when most of
the mail is destined for one remote domain, check the status of that remote
domain.
Additionally, look in the mail.log_current file
to determine what has happened in recent history when you tried to send mail
to that remote domain.
Use the imsimta qm dir -to address command to select a group of messages. Then use the sequence numbers from the dir listing
to look at the delivery attempt history of some of the messages. Often, you
will find that these messages are all non-delivery notifications for
undeliverable spam. If this is the case, determine how the original
spam messages got into the system in the first place. Verify that the messages
are spam by using the imsimta qm subcommands dir, read, and history. If they are,
think about routing the non-delivery notifications through a different outbound
channel, thereby preventing them from choking the normal tcp_local channel
queue.
For example, use the notificationchannel and dispositionchannel keywords to specify an alternate process channel
to queue delivery status notifications and message disposition notifications, respectively.
Then use source-specific rewrite rules to direct messages from these process
channels to a particular tcp_* channel that is set to use only
a few processes or threads. For more information, see Source-Channel-Specific Rewrite
Rules ($M, $N) in Sun Java System Messaging
Server 6.3 Administration Guide.
Verify that the master process for the channel is started.
The tcp_* channels all use the smtp_client process.
To find out which process dequeues messages for which channel,
see the master_command parameter in the associated
channel block in the job_controller.cnf file.
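The lookup can be sketched in Python as follows. This is only an illustration: the excerpt in the code is hypothetical, and the layout it assumes ([CHANNEL=name] block headers, key=value lines, and ! as a comment marker) should be checked against your own job_controller.cnf file. The master_command values shown are placeholders.

```python
from fnmatch import fnmatchcase


def parse_channel_blocks(text):
    """Return {channel_pattern: {option: value}} for each [CHANNEL=...] block."""
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):  # assumed comment marker
            continue
        if line.startswith("[") and line.endswith("]"):
            kind, _, name = line[1:-1].partition("=")
            # Only channel blocks are collected; other blocks are skipped.
            current = blocks.setdefault(name, {}) if kind.upper() == "CHANNEL" else None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip().lower()] = value.strip()
    return blocks


def master_command_for(channel, blocks):
    """Find the master_command for a channel; an exact name beats a wildcard."""
    if channel in blocks:
        return blocks[channel].get("master_command")
    for pattern, options in blocks.items():
        if fnmatchcase(channel, pattern):
            return options.get("master_command")
    return None


# Hypothetical excerpt; the values are placeholders for illustration only.
excerpt = """\
! illustrative excerpt, not a real configuration
[POOL=DEFAULT]
job_limit=10
[CHANNEL=ims-ms]
master_command=IMTA_BIN:ims_master
[CHANNEL=tcp_*]
master_command=IMTA_BIN:tcp_smtp_client
"""

blocks = parse_channel_blocks(excerpt)
print(master_command_for("tcp_local", blocks))
print(master_command_for("ims-ms", blocks))
```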
Destination Host Problems
When you have determined that messages are queued to an unavailable
remote host, you have two options:
Create a new channel for the host. If this host is consistently
a problem, all future email will go to this new channel. For existing messages
that are enqueued, you can either wait for the problem with the destination
host to be resolved or delete the messages from the queue.
Increase the number of delivery threads for the channel, or
set a ceiling on the number of queued messages that will trigger a new thread
or process to start. See the max_client_threads parameter
in the channel option file and the threaddepth channel
keyword, respectively.
Troubleshooting the ims-ms Channel
This section describes a general approach to troubleshooting the build-up of
messages in the ims-ms channel. The four general cases
where the ims-ms channel shows a build-up of messages in
the queue are:
IMAP_MAILBOX_LOCKED. While
you might see this error in a message file that is briefly in the queue area,
typically such a message file doesn't remain for long. The error only repeats
in a message file in the queue area if the mailbox remains locked for
an extended period. The job controller retries delivery of these messages
after short delays until either the message gets delivered or a different
error is encountered.
IMAP_MAILBOX_BADFORMAT, IMAP_MAILBOX_NOTSUPPORTED. The mailbox is most likely corrupted. This case rarely occurs.
You might want to use the reconstruct command for these
cases.
IMAP_IOERROR. The message
store is most likely corrupted or otherwise inaccessible. This case occurs
even more rarely.
IMAP_QUOTA_EXCEEDED. The
user or users are over quota. This is the most common case, which this technical
note discusses below.
Note - If the channel gets a permanent delivery failure error, then the
message is immediately bounced and does not remain in the ims-ms queue
area.
To troubleshoot the ims-ms channel, use the following
high-level approach:
Perform a similar investigation as you would for tcp_* channels
by using the imsimta qm summarize command to view what
is happening on the system.
Use the imsimta qm history command to examine
the message IDs to detect if there are different sorts of messages. For example,
you might see:
Message id: 800
Filename: /opt/SUNWmsgsr/data/queue/tcp_local/001/ZZf0b0KaNZykG.00
A message file name starting with ZZ indicates
that the message has not been tried yet. The file name begins with a counter that starts
at ZZ and is decremented (ZY, ZX,
and so on) each time the message is tried, fails, and is reenqueued for later
retry. Thus, a ZZ* file has not been tried yet and
has no history.
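To make the counter concrete, here is a minimal Python sketch that turns the two-letter prefix into a number of failed attempts. It assumes the prefix behaves like a two-letter base-26 counter, so that ZA would be followed by YZ; this note only shows ZZ, ZY, and ZX, so the rollover behavior is an assumption. The function name is hypothetical.

```python
import string

LETTERS = string.ascii_uppercase
ZZ_VALUE = 25 * 26 + 25  # numeric value of the starting prefix "ZZ"


def failed_tries(filename):
    """Return how many times the counter has been decremented from ZZ.

    Assumes a two-letter base-26 counter (ZZ, ZY, ..., ZA, YZ, ...).
    """
    prefix = filename[:2]
    if len(prefix) != 2 or any(c not in LETTERS for c in prefix):
        raise ValueError("unexpected queue file name: %r" % filename)
    value = LETTERS.index(prefix[0]) * 26 + LETTERS.index(prefix[1])
    return ZZ_VALUE - value


# The first name is the example from this note; the others are made up.
print(failed_tries("ZZf0b0KaNZykG.00"))  # 0, not tried yet
print(failed_tries("ZYf0b0KaNZykG.00"))  # 1
print(failed_tries("ZXf0b0KaNZykG.00"))  # 2
```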
In general, but not always, when you have non-ZZ*,
non-.HELD files in the queue area, you have the IMAP_QUOTA_EXCEEDED
case. (The frequency with which you see IMAP_MAILBOX_LOCKED conditions probably
depends upon user and email client characteristics. This condition is more
common with users who like to receive and move around lots of large attachments
but it should typically occur rarely.)
For a site that enforces quota, most of the non-ZZ*,
non-.HELD messages in the ims-ms queue
area are probably there because the recipient is over quota. Verify that this
is the case by running the imsimta qm command with the history subcommand. You should see “over quota” in the history
of the over-quota messages.
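A quick way to size the problem is to bucket the file names in the queue area. The Python sketch below does that for a list of names; apart from the first name, which comes from this note, the names are made up, and the bucket labels are this sketch's own. A "retried" file is only a candidate for the over-quota case and still needs to be confirmed with the history subcommand.

```python
from collections import Counter


def classify(filename):
    """Bucket a queue file name as held, untried, or retried."""
    if filename.upper().endswith(".HELD"):
        return "held"
    if filename.startswith("ZZ"):
        return "untried"
    # Non-ZZ*, non-.HELD: tried at least once and requeued.
    return "retried"


# Hypothetical directory listing for illustration.
names = [
    "ZZf0b0KaNZykG.00",
    "ZYa1b2c3d4e5f.00",
    "ZXg6h7i8j9k0l.00",
    "ZZm1n2o3p4q5r.HELD",
]

summary = Counter(classify(name) for name in names)
print(dict(summary))
```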
Note - In iPlanet Messaging Server 5.2, the imsimta qm top command
was enhanced to have more sorting options.
Are there any Q status messages in the mail.log_current file pertaining to the ims-ms channel? When
you see “mailbox is busy” and Q status in the mail.log_current file, then the message is put back on the queue to be retried
later as per the job controller's scheduling and the backoff keyword
on the channel.
If not, check that the ims_master process
is running. Are there any errors in its log file (the imta file)?
The ims_master process could be hung. Use the imsimta
process command to verify running processes.
Use the following strategies for users who become over quota:
Inform users of the need to perform mailbox maintenance to
return to under quota status, or increase their quota.
Reduce the time that mail is queued for over quota accounts
before being bounced back as over quota. See the store.quotagraceperiod configutil parameter. If you don't want to queue email for over
quota accounts (and bounce the message straight back), set this parameter
to 0 (that is, no grace period). This parameter is available in iPlanet Messaging
Server 5 as well.
For Messaging Server 6, you can enable the local.store.overquotastatus configutil parameter. This enables quota enforcement
before messages are enqueued in the MTA and prevents the MTA from filling
up.
More About the ims_master Process
At times you might see the ims_master process shutting
down and starting up in the log file:
# > imta /opt/SUNWmsgsr/logs
# grep "Sun Java" imta
[30/Aug/2006:17:05:05 -0400] learn ims_master[19736]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) shutting down
[30/Aug/2006:17:05:20 -0400] learn ims_master[28310]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) starting up
[30/Aug/2006:17:07:24 -0400] learn ims_master[28310]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) shutting down
[30/Aug/2006:17:07:32 -0400] learn ims_master[28380]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) starting up
[30/Aug/2006:17:19:31 -0400] learn ims_master[28380]: General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006) shutting down
This is normal operation and does not indicate a problem. This is a “notice”
message (not an “error” or “critical” level message).
As with all channel jobs, ims-ms channel jobs shut down
from time to time based on either having nothing to do, or based on “timing
out” (getting old). Then the job controller restarts new jobs as needed.
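If you want to see how long each job actually lived, the notices can be paired up by process ID. The following Python sketch does this for the log lines shown above, written as single lines; the regular expression is tailored to that excerpt and is an assumption about the general log layout, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Matches lines shaped like the ims_master notices quoted in this note.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+\S+\s+ims_master\[(?P<pid>\d+)\]:.*?"
    r"(?P<event>starting up|shutting down)\s*$"
)


def job_lifetimes(lines):
    """Pair 'starting up' and 'shutting down' notices by PID.

    Returns a list of (pid, lifetime_in_seconds). A shutdown without a
    matching startup in the input is ignored.
    """
    started = {}
    lifetimes = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        pid = int(match.group("pid"))
        if match.group("event") == "starting up":
            started[pid] = when
        elif pid in started:
            lifetimes.append((pid, (when - started.pop(pid)).total_seconds()))
    return lifetimes


BANNER = "General Notice: Sun Java(tm) System Messaging Server ims_master 6.2-7.02 (built Jun 13 2006)"
log_lines = [
    "[30/Aug/2006:17:05:05 -0400] learn ims_master[19736]: " + BANNER + " shutting down",
    "[30/Aug/2006:17:05:20 -0400] learn ims_master[28310]: " + BANNER + " starting up",
    "[30/Aug/2006:17:07:24 -0400] learn ims_master[28310]: " + BANNER + " shutting down",
    "[30/Aug/2006:17:07:32 -0400] learn ims_master[28380]: " + BANNER + " starting up",
    "[30/Aug/2006:17:19:31 -0400] learn ims_master[28380]: " + BANNER + " shutting down",
]

for pid, seconds in job_lifetimes(log_lines):
    print(pid, seconds)
```

For the excerpt above, this reports roughly two minutes for process 28310 and roughly twelve minutes for process 28380, which fits jobs exiting when idle or aged out.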
Configuration Issues
You might want to increase the number of processes the job controller
can start for the tcp_local, tcp_intranet,
or other tcp_* channels, or increase the number of threads
each of those processes will start. You might also want to give the tcp_local channel its own pool. If you observe the number of queued messages (total across
all queues) to be greater than 100,000, increase the value of the MAX_MESSAGES option in the job_controller.cnf file. See Job Controller
Configuration File in Sun Java System
Messaging Server 6.3 Administration Reference for more
information.
Additional Information
This section contains additional information to help you understand
MTA operations.
What Are .HELD Messages?
If the MTA detects that messages are bouncing between servers or channels,
delivery is halted and the messages are stored in a file with the suffix .HELD in the msg-svr-base/data/queue/channel directory. Typically, a message loop occurs because each server
or channel thinks the other is responsible for delivery of the message. You
need to manually fix these .HELD messages with the imsimta process
held command.
There is an unfortunate collision of terminology and concepts between .held messages and the hold channel. And worse
still, the command to process .held messages is called release, whereas the command to process messages on the hold channel
is called process_held.
You use the hold channel to hold messages of a recipient
temporarily prevented from receiving new messages. For example, you might
be moving a user's mailbox and want to hold new incoming messages. The hold channel is located in the msg-svr-base/queue/hold directory. Messages are written to this queue as ZZxxx.held files. Because
the job controller doesn't “see” these .held files,
they are not dequeued for delivery. You release these files with the imsimta
qm release command, and the reprocess daemon reprocesses them.
See To Temporarily Hold Messages Using the Hold Channel in Sun Java System Messaging Server 6.3 Administration Guide for
more information.
How Does a Message Become a .HELD Message?
Messaging Server makes use of MAX_*_RECEIVED_LINES options
that you set in the option.dat file to determine when
a message is put into the .HELD state. The most relevant
options and their default values are:
Once a message has looped through the MTA enough to accumulate MAX_LOCAL_RECEIVED_LINES Received: header lines indicating the local MTA, the message becomes .HELD. You can cause the MTA to immediately recognize that it has
connected to itself, rather than waiting to accumulate MAX_LOCAL_RECEIVED_LINES local Received: headers, by specifying the loopcheck keyword
on the appropriate channel(s) in the imta.cnf file.
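The idea of counting local Received: headers against a limit can be sketched in Python. This is not how the MTA implements the check: matching the local host by substring is a simplification, the host names and the sample message are made up, and the limit of 3 is chosen only to keep the example short, not a product default.

```python
from email import message_from_string


def local_received_count(raw_message, local_host):
    """Count Received: headers that mention local_host (simplified match)."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received", [])
    return sum(1 for value in received if local_host.lower() in value.lower())


def would_be_held(raw_message, local_host, max_local_received_lines):
    """True once the local Received: count reaches the configured limit."""
    return local_received_count(raw_message, local_host) >= max_local_received_lines


# Hypothetical looping message: three hops through the same local MTA.
raw = (
    "Received: from mta.example.com by mta.example.com; Mon, 1 Jan 2007 10:00:03 +0000\n"
    "Received: from mta.example.com by mta.example.com; Mon, 1 Jan 2007 10:00:02 +0000\n"
    "Received: from mta.example.com by mta.example.com; Mon, 1 Jan 2007 10:00:01 +0000\n"
    "Received: from client.sesta.com by relay.sesta.com; Mon, 1 Jan 2007 10:00:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

print(local_received_count(raw, "mta.example.com"))
print(would_be_held(raw, "mta.example.com", 3))
```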
For More Information on Troubleshooting the MTA
Use the following to aid in troubleshooting the MTA: