In this entry, we will touch on the crux of GC tuning for SIP applications.
GC Tuning Tips:
Apart from the regular CMS GC tunings
for a server, the following tunings make CMS run in a more
predictable manner. Let's dive into some of these CMS tunings and
see their effect on predictability.
Young Generation Size:
The size of the young generation is
controlled with the '-Xmn' flag. For SIP workloads, keeping this
size as small as possible yields better average response times
and a greater probability of meeting your 95th percentile
response time requirements.
-Xmn
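One way to check that a chosen -Xmn value took effect is to ask the running JVM for its memory pools. The sketch below is only an illustration and was not part of our setup: the class name YoungGenReport is hypothetical, and it assumes that young generation pools can be recognized by "Eden" or "Survivor" in their names (for example "Par Eden Space" and "Par Survivor Space" under ParNew), which may not hold for every collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports the committed size of the young generation pools.
final class YoungGenReport {

    // Assumption: young generation pools carry "Eden" or "Survivor" in their names.
    static boolean isYoungPool(String poolName) {
        return poolName.contains("Eden") || poolName.contains("Survivor");
    }

    // Sums the committed bytes of every pool that looks like a young generation pool.
    static long youngCommittedBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            if (usage != null && isYoungPool(pool.getName())) {
                total += usage.getCommitted();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Young generation committed bytes: " + youngCommittedBytes());
    }
}
```

The reported total may come out somewhat below the -Xmn value, since the JVM may expose only one of the two survivor spaces through this interface.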
Parallel young generation:
Enable the parallel young generation
collector with the following options, and set the number of parallel
GC threads to the number of CPUs or cores less one. The idea is to
leave one core free for handling network interrupts and similar work.
-XX:+UseParNewGC -XX:ParallelGCThreads=N (N = number of cores - 1)
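The "cores less one" rule can be sketched in a few lines. This is a minimal illustration with a hypothetical class name (ParNewFlags). It assumes that Runtime.availableProcessors(), which counts logical processors, is a fair stand-in for the number of cores on your machine, and it never goes below one GC thread.

```java
// Hypothetical helper: derives the ParNew flags from the core count.
final class ParNewFlags {

    // Leaves one core free for interrupt handling, but keeps at least one GC thread.
    static int parallelGcThreads(int cores) {
        return Math.max(1, cores - 1);
    }

    // Builds the flag string for the given number of cores.
    static String flagsFor(int cores) {
        return "-XX:+UseParNewGC -XX:ParallelGCThreads=" + parallelGcThreads(cores);
    }

    public static void main(String[] args) {
        // Assumption: logical processors approximate the cores available to the JVM.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(flagsFor(cores));
    }
}
```

Running it on the target machine prints a flag string that can be pasted into the server's JVM options.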
Maximum Object Tenure Threshold:
In the SIP world, object life spans depend on the timers associated with the transactions. Holding these objects longer in the young generation just adds overhead from copying them or checking them for liveness. In such cases, we are better off promoting them to the old generation. One needs to study the promotion rate after setting this attribute. If a large share of objects is still promoted even with a higher tenuring threshold, set this value to zero.
-XX:MaxTenuringThreshold=N
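To study the promotion rate, one rough approach is to sample the old generation occupancy twice and divide the growth by the elapsed time. The sketch below shows only that arithmetic. The class name PromotionRate is hypothetical, the sample values in main are made up for illustration, and the estimate is only meaningful if no old generation collection ran between the two samples.

```java
// Hypothetical helper: estimates the promotion rate from two old generation samples.
final class PromotionRate {

    // Bytes promoted per second between two samples of old generation usage.
    // Assumption: no old generation collection happened between the samples.
    static double bytesPerSecond(long usedBefore, long usedAfter, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return (usedAfter - usedBefore) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Illustrative values only: old generation grew from 100 MB to 160 MB in 30 seconds.
        long mb = 1024L * 1024L;
        double rate = bytesPerSecond(100 * mb, 160 * mb, 30_000);
        System.out.println("Promotion rate (MB/s): " + rate / mb);
    }
}
```

Comparing this figure before and after changing the tenuring threshold gives a first indication of whether the higher threshold is actually keeping objects out of the old generation.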
Survivor Spaces:
Size the survivor spaces according to
your tenuring threshold. If you decided to set the tenuring threshold
to zero, set the survivor ratio to a high value, which shrinks the
survivor spaces to a minimum. The idea is to promote objects to the
old generation directly, so the survivor spaces go unused.
-XX:SurvivorRatio=N
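-XX:SurvivorRatio=N sets the ratio of Eden to one survivor space, so a higher N means smaller survivor spaces. The sketch below works through that arithmetic for a given young generation size. The class name SurvivorSizing is hypothetical, and the real JVM aligns these sizes internally, so treat the results as approximations.

```java
// Hypothetical helper: approximates Eden and survivor sizes from -Xmn and SurvivorRatio.
final class SurvivorSizing {

    // young = eden + 2 * survivor and eden = ratio * survivor,
    // hence survivor = young / (ratio + 2).
    static long survivorBytes(long youngBytes, int survivorRatio) {
        if (survivorRatio < 1) {
            throw new IllegalArgumentException("survivorRatio must be at least 1");
        }
        return youngBytes / (survivorRatio + 2);
    }

    // Eden is whatever remains after the two survivor spaces.
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorBytes(youngBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long young = 64L * 1024 * 1024; // illustrative: -Xmn64m
        for (int ratio : new int[] {6, 1022}) {
            System.out.println("SurvivorRatio=" + ratio
                    + " eden=" + edenBytes(young, ratio)
                    + " survivor=" + survivorBytes(young, ratio));
        }
    }
}
```

With a 64 MB young generation, a ratio of 6 leaves 8 MB per survivor space, while a ratio of 1022 leaves only 64 KB each, which fits the case where the tenuring threshold is zero.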
CMS Occupancy Fraction:
The CMS initiating occupancy fraction
plays a critical role in controlling what happens in the old
generation. Setting this value low makes CMS run frequently and keeps
the old generation in check at all times. The benefits include better
control over CPU spikes (an important factor for Telco carriers with
regard to overload protection) and a somewhat better handle on CMS
fragmentation failures. This benefit comes at a price: more CPU
utilization. Because CMS runs almost all the time, it uses more CPU
than is strictly required to clean the old generation. Also, although
most of a CMS cycle runs concurrently with the application threads,
some parts of it stop the application threads. Refer to the CMS GC
documentation for more details.
-XX:CMSInitiatingOccupancyFraction=N
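The trade-off can be made concrete with some simple arithmetic: the fraction determines how full the old generation is when a CMS cycle starts, and the remaining space is the headroom the cycle has to finish in. The sketch below is a rough model, not how the JVM itself decides. The class name CmsOccupancy is hypothetical, the promotion rate is assumed to be steady, and fragmentation and the JVM's own heuristics are ignored.

```java
// Hypothetical helper: rough arithmetic around the CMS initiating occupancy fraction.
final class CmsOccupancy {

    // Old generation usage (in bytes) at which a CMS cycle would be initiated.
    static long initiatingThresholdBytes(long oldGenBytes, int fractionPercent) {
        if (fractionPercent < 0 || fractionPercent > 100) {
            throw new IllegalArgumentException("fractionPercent must be in 0..100");
        }
        return oldGenBytes * fractionPercent / 100;
    }

    // Rough time (in seconds) before the old generation fills up once a cycle
    // has started, assuming a steady promotion rate and no space reclaimed yet.
    static double headroomSeconds(long oldGenBytes, int fractionPercent,
                                  double promotedBytesPerSecond) {
        if (promotedBytesPerSecond <= 0) {
            throw new IllegalArgumentException("promotedBytesPerSecond must be positive");
        }
        long free = oldGenBytes - initiatingThresholdBytes(oldGenBytes, fractionPercent);
        return free / promotedBytesPerSecond;
    }

    public static void main(String[] args) {
        long oldGen = 1024L * 1024 * 1024;  // illustrative: 1 GB old generation
        double rate = 2.0 * 1024 * 1024;    // illustrative: 2 MB/s promoted
        for (int fraction : new int[] {40, 60, 80}) {
            System.out.println("fraction=" + fraction + "% headroom(s)="
                    + headroomSeconds(oldGen, fraction, rate));
        }
    }
}
```

A lower fraction gives the concurrent cycle more headroom to complete, at the cost of starting cycles more often and spending more CPU on them.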
Future Work:
Even though we have solved some of the
problems related to 'Predictability' and achieved our goal, some
observations, including the GC pause spikes in the young generation
shown in the graph above, need further study. We have identified two
areas for this. The first is the effect of the 'UseMembar' JVM option
on the young generation. This option was suggested as a way to
control these spikes, but our meticulous SailFin QA stress team
reported regressions in some of their tests, while the performance
team observed some good data with it. The second area is the JVM
option 'UseTLAB', which enables Thread Local Allocation Buffers. For
the current work we left the ergonomics for this option enabled,
meaning we let the JVM study the usage patterns and adjust the buffer
sizes in the Eden region by itself. In my opinion, this option's
ergonomics work pretty well. To investigate the spikes in the young
generation, one could dig deeper into this option and study its
impact. Even though this is just one option in the grand scheme of
CMS JVM options, a lot of work goes on behind the scenes for TLABs.
If you are interested, refer to the great blog entry on this topic by
Jon Masamitsu of the GC team, and you will realize how much work went
into this option.
Another area that might be interesting
to study for SIP workloads is the use of the throughput collector
with response time goals.
Conclusion:
Requirements like response time
predictability and time budgeting can be achieved, to the degree
Telco vendors require, with existing JVM technologies. There is a new
trend in the Telco arena of embracing 'Java' based infrastructure and
moving away from legacy 'C' based infrastructure, and a very good
example of this is the open source Project SailFin. The benefits
Telcos get are enormous: more leverage from existing Java
technologies such as web services, EJB and JMS; continuous innovation
in and around Java based technologies; and strong support from the
surrounding community. This embrace of 'Java' by Telcos for such low
level protocol implementations will spur a new level of requirements
for Java and JVM technologies, and this is good for the community.
The benefit you and I get is a better and more efficient daily
communication infrastructure for staying in touch with kith and kin
across the world.
In addition to the above, the Telco
aspects of Project SailFin, such as performance, scalability and
predictability, are stunning. We, the performance team, think it's a
continuous process and there will always be room for improvement. At
Sun Microsystems, we take these enterprise aspects as seriously as
can be. Project SailFin has a great, knowledgeable community around
it, and if you would like to send us suggestions or observations, we
are always open to and welcome community contributions.
Acknowledgments:
I would like to sincerely thank Y.S.
Ramakrishna (Ramki) from the GC team for his invaluable suggestions
and his relentless efforts, at times building custom VMs for our
analysis, to give us more visibility into the GC generations based on
our requirements. I would also like to thank Brian Doherty from the
JVM team for his initial suggestions on the internal workings of CMS
and for sharing his prior work on CMS. Scott Oaks is our overall
GlassFish performance lead, and I have captured some of his findings
on the effect of the CMS occupancy fraction on CPU utilization.
Thanks to Robert Handle and other Ericsson members who helped us look
at the system from a Telco perspective. Finally, I would like to
thank Madhu Konda (Manager) and Sreeram Duvur (SailFin Architect) for
their continued support and for asking thought provoking questions
that kept us on track in solving problems.