« October 2009
SunMonTueWedThuFriSat
    
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
       
Today
XML

Blog::Navigation

Blog::Editing

Bookmarks::Blogroll

Blog::Referers

Today's Page Hits: 42

Site notes

This page validates as XHTML 1.0, and will look much better in a browser that supports web standards, but it is accessible to any browser or Internet device. It was created using techniques detailed at glish.com/css/.

Powered by Roller Weblogger.
All | Personal | Sun
« A Tale of Two ORBs | Main | Coming changes »
20070509 Wednesday May 09, 2007
Client Side Transport Timeouts and Retries A user recently complained about ORB retry behavior (among other issues).  The following problems are mentioned:
The ORB is logging connectFailure methods, producing a lot of output that looks something like:

18.04.2007 12:26:37 com.sun.corba.ee.impl.transport.SocketOrChannelConnectionImpl <init>
WARNING: "IOP00410201: (COMM_FAILURE) Connection failure: socketType: IIOP_CLEAR_TEXT; hostname: localhost; port: 3700"
org.omg.CORBA.COMM_FAILURE: vmcid: SUN minor code: 201 completed: No
at com.sun.corba.ee.impl.logging.ORBUtilSystemException.connectFailure(ORBUtilSystemException.java:2348)
at com.sun.corba.ee.impl.logging.ORBUtilSystemException.connectFailure(ORBUtilSystemException.java:2369)
at com.sun.corba.ee.impl.transport.SocketOrChannelConnectionImpl.<init>(SocketOrChannelConnectionImpl.java:212)
at com.sun.corba.ee.impl.transport.SocketOrChannelConnectionImpl.<init>(SocketOrChannelConnectionImpl.java:225)
at com.sun.corba.ee.impl.transport.SocketOrChannelContactInfoImpl.createConnection(SocketOrChannelContactInfoImpl.java:104)

We've known for some time that there were some issues with ORB retries. Recently someone filed a high priority bug (6546045, which is a duplicate of 6394769) complaining about too many retries when sending a request to a bad host and port.  I wrote a simple test to reproduce this, and found that the ORB was re-trying 56000 times per minute on my test machine (JDK 1.5_10, Ubuntu 6.06, Athlon 64 3500+ with 2 GB RAM).

The problem is that the ORB needs to retry the entire list of endpoints in a cluster, and wrap from the end of the list to the beginning. The old code intended to use a simple exponential backoff in this case, but it ONLY applied the wait with backoff when the error was either a TRANSIENT or a COMM_FAILURE connectionRebind (both COMPLETED_NO of course). Any other COMM_FAILURE COMPLETED_NO was treated as a "try the next endpoint" kind of retry. Now, this is good in the cluster case, because we want to try the next endpoint quickly. But we also don't want to try all of the endpoints over and over again if they all fail (or, the same single endpoint over and over again, which is what happens in this bug).

The fix is to introduce another use of TcpTimeouts here, and to record the failed endpoints in a list, until we find that the next endpoint to try is already in the list. In that case, we clear the list, sleep for the next timeout, and then start retrying again. That way, we try every endpoint once quickly, and then sleep until we try again, or timeout.  This reduces the number of retries by a factor of at least 1000 with ony a minor impact on responsiveness in the (usually) rare case of a transient failure.

There are a number of ORB config parameters in com.sun.corba.ee.impl.orbutil.ORBConstants (note that this is the GlassFish ORB, not the JDK ORB: see this blog entry).  The constants relevant to transport timeouts are:

Both TcpTimeouts use the same syntax for configuration:

initial:max:backoff[:maxsingle] (a series of 3 or 4 positive decimal integers separated by :)

where:

This works as follows:

  1. The first timeout last for initial milliseconds.
  2. Each subsequent timeout is obtained from the previous by multiplying by the backoff factor (as explained above)
  3. No timeout can exceed maxsingle milliseconds: once this value is reached, any subsequent timeouts have the same value.
  4. The total time spent before the last timeout is less than max.  Note that the last timeout may cause the total time to exceed max.

I've removed the old COMMUNICATIONS_RETRY_TIMEOUT. These changes will be available in GlassFish v2 after the next ORB integration.  I think this will be b47, as  we've probably missed the integration deadline for b46.

Note that these settings are global: once set, the control every request sent to every object reference managed by the ORB instance.  Fine grained control over timeouts is possible, but would require new APIs for RMI-IIOP.


It now looks like it will be b48 (not b47) when this change is integrated.  The ORB integration is taking longer than expected.



Trackback URL: http://blogs.sun.com/ejcorba/entry/client_side_transport_timeouts_and
Comments:

I'm trying to use these timeout properties with Glassfish V2ur2-b04 clients.
I shutdown the Glassfish instance intentionally and, surprisingly, the new InitialContext() client statement does not respect the timeout properties, neither does it rise a NamingException.
Further, I try to lookup some EJB, and no mater what values I have set up in the timeout properties, the lookup takes 1min to complete rising an exception.

Here is my properties file:
java.naming.factory.initial=com.sun.enterprise.naming.SerialInitContextFactory
java.naming.factory.url.pkgs=com.sun.enterprise.naming
java.naming.factory.state=com.sun.corba.ee.impl.presentation.rmi.JNDIStateFactoryImpl
com.sun.corba.ee.transport.ORBWaitForResponseTimeout=5
com.sun.corba.ee.transport.ORBTCPConnectTimeouts=100:500:100:500
com.sun.corba.ee.transport.ORBTCPTimeouts=500:2000:50:1000
org.omg.CORBA.ORBInitialHost=localhost
org.omg.CORBA.ORBInitialPort=3700

Is there something I'm missing ?

Posted by Renato on November 25, 2008 at 07:05 AM PST #

Post a Comment:

Name:
E-Mail:
URL:

Your Comment:

HTML Syntax: NOT allowed
Copyright (C) 2003, opuscula orbis