The ORB is logging connectFailure methods, producing a lot of output that looks something like:
18.04.2007 12:26:37 com.sun.corba.ee.impl.transport.SocketOrChannelConnectionImpl <init>
WARNING: "IOP00410201: (COMM_FAILURE) Connection failure: socketType: IIOP_CLEAR_TEXT; hostname: localhost; port: 3700"
org.omg.CORBA.COMM_FAILURE: vmcid: SUN minor code: 201 completed: No
at com.sun.corba.ee.impl.logging.ORBUtilSystemException.connectFailure(ORBUtilSystemException.java:2348)
at com.sun.corba.ee.impl.logging.ORBUtilSystemException.connectFailure(ORBUtilSystemException.java:2369)
at com.sun.corba.ee.impl.transport.SocketOrChannelConnectionImpl.<init>(SocketOrChannelConnectionImpl.java:212)
at com.sun.corba.ee.impl.transport.SocketOrChannelConnectionImpl.<init>(SocketOrChannelConnectionImpl.java:225)
at com.sun.corba.ee.impl.transport.SocketOrChannelContactInfoImpl.createConnection(SocketOrChannelContactInfoImpl.java:104)
We've known for some time that there were some issues with ORB retries. Recently someone filed a high priority bug (6546045, which is a duplicate of 6394769) complaining about too many retries when sending a request to a bad host and port. I wrote a simple test to reproduce this, and found that the ORB was retrying 56000 times per minute on my test machine (JDK 1.5_10, Ubuntu 6.06, Athlon 64 3500+ with 2 GB RAM).
The problem is that the ORB needs to retry the entire list of endpoints in a cluster, and wrap from the end of the list to the beginning. The old code intended to use a simple exponential backoff in this case, but it ONLY applied the wait with backoff when the error was either a TRANSIENT or a COMM_FAILURE connectionRebind (both COMPLETED_NO of course). Any other COMM_FAILURE COMPLETED_NO was treated as a "try the next endpoint" kind of retry. Now, this is good in the cluster case, because we want to try the next endpoint quickly. But we also don't want to try all of the endpoints over and over again if they all fail (or, the same single endpoint over and over again, which is what happens in this bug).
The fix is to introduce another use of TcpTimeouts here, and to record the failed endpoints in a list, until we find that the next endpoint to try is already in the list. In that case, we clear the list, sleep for the next timeout, and then start retrying again. That way, we try every endpoint once quickly, and then sleep until we try again, or time out. This reduces the number of retries by a factor of at least 1000 with only a minor impact on responsiveness in the (usually) rare case of a transient failure.
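To make the retry policy concrete, here is a minimal sketch of the loop described above. It is not the ORB's actual code: the class name `EndpointRetrySketch`, the `connect` signature, the `tryConnect` and `sleeper` callbacks, and the percentage-based growth of the wait are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongConsumer;
import java.util.function.Predicate;

/** Hypothetical sketch of the retry policy described in the post; not the ORB's code. */
final class EndpointRetrySketch {

    /**
     * Tries each endpoint once per pass with no delay. When the next endpoint
     * to try has already failed in the current pass, the failed set is cleared
     * and we sleep before starting the next pass.
     *
     * @return the endpoint that connected, or null once the wait budget is spent
     */
    static String connect(List<String> endpoints,
                          Predicate<String> tryConnect,   // assumed: true on success
                          long initialWaitMs,
                          long maxTotalWaitMs,
                          int backoffPercent,             // assumed growth rule
                          LongConsumer sleeper) {         // assumed: e.g. a Thread.sleep wrapper
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("no endpoints to try");
        }
        Set<String> failed = new HashSet<>();
        long wait = initialWaitMs;
        long totalWaited = 0;
        int index = 0;
        while (true) {
            String endpoint = endpoints.get(index);
            if (failed.contains(endpoint)) {
                // We have wrapped around: every endpoint failed in this pass.
                if (totalWaited >= maxTotalWaitMs) {
                    return null; // give up: overall timeout reached
                }
                failed.clear();
                sleeper.accept(wait);
                totalWaited += wait;
                wait = wait * (100 + backoffPercent) / 100;
            }
            if (tryConnect.test(endpoint)) {
                return endpoint;
            }
            failed.add(endpoint);
            index = (index + 1) % endpoints.size(); // wrap from end to beginning
        }
    }
}
```

With a single bad endpoint, the same check fires after every attempt, so each failed connect is followed by a sleep rather than an immediate retry.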
There are a number of ORB config parameters in com.sun.corba.ee.impl.orbutil.ORBConstants (note that this is the GlassFish ORB, not the JDK ORB: see this blog entry). The constants relevant to transport timeouts are:
Both TcpTimeouts use the same syntax for configuration:
initial:max:backoff[:maxsingle] (a series of 3 or 4 positive decimal integers separated by :)
where:
This works as follows:
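As a side illustration of the syntax only, the sketch below checks that a string has the `initial:max:backoff[:maxsingle]` shape, meaning 3 or 4 positive decimal integers separated by `:`. It is not the ORB's own parser; `TimeoutSpecSketch` and its field names are hypothetical, and it attaches no meaning to the values beyond their position.

```java
/** Hypothetical holder for a parsed timeout spec; not the ORB's TcpTimeouts class. */
final class TimeoutSpecSketch {
    final int initial;
    final int max;
    final int backoff;
    final Integer maxSingle; // null when the optional fourth field is omitted

    private TimeoutSpecSketch(int initial, int max, int backoff, Integer maxSingle) {
        this.initial = initial;
        this.max = max;
        this.backoff = backoff;
        this.maxSingle = maxSingle;
    }

    /** Parses "initial:max:backoff[:maxsingle]"; throws IllegalArgumentException if malformed. */
    static TimeoutSpecSketch parse(String spec) {
        String[] parts = spec.split(":", -1); // -1 keeps empty trailing fields so "1:2:3:" is rejected
        if (parts.length < 3 || parts.length > 4) {
            throw new IllegalArgumentException("expected 3 or 4 fields: " + spec);
        }
        int[] values = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            if (!parts[i].matches("[0-9]+")) {
                throw new IllegalArgumentException("not a decimal integer: '" + parts[i] + "'");
            }
            values[i] = Integer.parseInt(parts[i]); // NumberFormatException on overflow
            if (values[i] <= 0) {
                throw new IllegalArgumentException("field must be positive: " + parts[i]);
            }
        }
        Integer maxSingle = (parts.length == 4) ? Integer.valueOf(values[3]) : null;
        return new TimeoutSpecSketch(values[0], values[1], values[2], maxSingle);
    }
}
```

For example, `TimeoutSpecSketch.parse("100:500:100:500")` accepts the four-field form, and `TimeoutSpecSketch.parse("500:2000:50")` accepts the three-field form with `maxSingle` left as `null`.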
I've removed the old COMMUNICATIONS_RETRY_TIMEOUT. These changes will be available in GlassFish v2 after the next ORB integration. I think this will be b47, as we've probably missed the integration deadline for b46.
Note that these settings are global: once set, they control every request sent to every object reference managed by the ORB instance. Fine-grained control over timeouts is possible, but would require new APIs for RMI-IIOP.
I'm trying to use these timeout properties with GlassFish V2ur2-b04 clients.
I shut down the GlassFish instance intentionally and, surprisingly, the new InitialContext() client statement does not respect the timeout properties, nor does it raise a NamingException.
Further, when I try to look up an EJB, no matter what values I have set in the timeout properties, the lookup takes 1 minute to complete before raising an exception.
Here is my properties file:
java.naming.factory.initial=com.sun.enterprise.naming.SerialInitContextFactory
java.naming.factory.url.pkgs=com.sun.enterprise.naming
java.naming.factory.state=com.sun.corba.ee.impl.presentation.rmi.JNDIStateFactoryImpl
com.sun.corba.ee.transport.ORBWaitForResponseTimeout=5
com.sun.corba.ee.transport.ORBTCPConnectTimeouts=100:500:100:500
com.sun.corba.ee.transport.ORBTCPTimeouts=500:2000:50:1000
org.omg.CORBA.ORBInitialHost=localhost
org.omg.CORBA.ORBInitialPort=3700
Is there something I'm missing?
Posted by Renato on November 25, 2008 at 07:05 AM PST #