Monday Oct 13, 2008
If we refer to the System Topology diagram in my previous blog,
we find that the internal disks of the T5440 are connected to PCIe-0. Hence
it is not possible to remove PCIe-0 from the Primary (or Control)
Domain. However, it is possible to remove PCIe-1, PCIe-2 and PCIe-3 from
the Primary Domain and allocate them to IO Domains.
In order to create an IO domain using PCIe-1, PCIe-1 has to be removed
from the Primary Domain. This causes the Primary Domain to lose its
primary network interface if it has been using the on-board NICs. If a
network card is available on PCIe-0, the primary network for the
Primary Domain can be switched to the ports on that card before PCIe-1
is removed. If an additional network card is not available, it is
still possible to remove PCIe-1 from the Primary Domain and create an
IO domain (let us call it the Secondary Domain) managing the devices
off PCIe-1. In that case, the Primary Domain provides the boot-disk
service to the Secondary Domain, and the Secondary Domain provides the
primary network service for the Primary Domain. The pseudo-steps below
outline how this can be done.
- In the Primary Domain
  - set the number of VCPUs to 8 (this is just an example number of VCPUs)
  - set the memory to 8GB (just an example size of memory)
  - create a vdisk-server
  - remove PCIe-1 from its control
    - This will cause the Primary Domain to lose its network after reboot
- Reboot the Primary Domain and log back into it from the Console
- To cause the VCPUs for the Secondary Domain to be allocated from T1 (refer to the Topology Diagram), create a dummy domain with the remaining 56 VCPUs from T0. Bind the dummy domain.
- Associate a vdiskserverdevice as the boot-device for the Secondary Domain
- Create the Secondary Domain
  - set the number of VCPUs to 8
  - set the memory to 8GB
  - add PCIe-1 to it
  - add the vdiskserverdevice as the vdisk for this domain
- Bind the domain, install the OS and boot the domain
- Create a vswitch-device on the Secondary Domain
- Reboot the Secondary Domain
- Create a vnet-device for the Primary Domain associated with the above vswitch-device
- Plumb and configure the vnet device on the Primary Domain (assuming the On-Board network ports are connected to the primary network of the Data Center). Now the Primary Domain should have the primary network available.
- Remove the dummy domain and proceed with creating other domains.
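The pseudo-steps above might map onto `ldm` commands roughly as follows. This is only a sketch under several assumptions that are not from the steps themselves: the bus name pci@500 standing for PCIe-1, the domain names "secondary" and "dummy", the backing device /dev/dsk/c1t1d0s2, the network device nxge0 and the 1GB given to the dummy domain are all hypothetical. The `run` helper only prints each command, so the sequence can be reviewed before anything is executed.

```shell
#!/bin/sh
# Dry-run sketch: print the ldm commands for the pseudo-steps above.
# Hypothetical names: bus pci@500 for PCIe-1, domains "secondary" and
# "dummy", backing device /dev/dsk/c1t1d0s2, network device nxge0.
run() { echo "$*"; }

io_domain_plan() {
    # Shrink the Primary Domain and take PCIe-1 away from it.
    run ldm set-vcpu 8 primary
    run ldm set-memory 8G primary
    run ldm add-vds primary-vds0 primary
    run ldm remove-io pci@500 primary
    # (reboot the Primary Domain here and log in from the console)

    # Dummy domain holds the remaining 56 VCPUs of the first chip.
    run ldm add-domain dummy
    run ldm set-vcpu 56 dummy
    run ldm set-memory 1G dummy
    run ldm bind-domain dummy

    # Secondary (IO) domain boots from a vdisk served by the primary.
    run ldm add-vdsdev /dev/dsk/c1t1d0s2 bootvol@primary-vds0
    run ldm add-domain secondary
    run ldm set-vcpu 8 secondary
    run ldm set-memory 8G secondary
    run ldm add-io pci@500 secondary
    run ldm add-vdisk bootdisk bootvol@primary-vds0 secondary
    run ldm bind-domain secondary
    run ldm start-domain secondary

    # Network path back to the Primary Domain through the Secondary.
    run ldm add-vsw net-dev=nxge0 secondary-vsw0 secondary
    # (reboot the Secondary Domain here)
    run ldm add-vnet vnet0 secondary-vsw0 primary

    # The dummy domain is no longer needed.
    run ldm unbind-domain dummy
    run ldm remove-domain dummy
}

io_domain_plan
```

Replacing the body of `run` with `"$@"` would execute the commands instead of printing them; the reboots marked in the comments still have to be done by hand.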
With the above technique, when the Primary Domain is rebooted,
the Secondary Domain may seem to pause until the Primary Domain boots
back. Similarly when the Secondary Domain is rebooted, the Primary
Domain's primary network may appear to freeze until the Secondary
Domain comes back online. But that is far better than losing all the
domains and the applications running in those domains.
Monday Oct 13, 2008
The Sun Fire T5440 can have at most 4 UltraSPARC T2 Plus
processors. Each processor is directly connected to one quarter of
the entire system memory with 1 Gigabyte memory interleaving and owns a
PCIe Root-Complex. When fully populated with processors and memory,
Solaris can see 256 CPUs and 512GB of memory. That is a lot for many
applications, except for some large databases. With this class of
system, it is usually not possible to consume the entire system with a
single instance of most applications. That is in fact a very good
opportunity to consolidate a number of such applications on this system
using LDOMs, thereby reducing power consumption and rack space. An
example is the SugarCRM application. It is a web based application
written in PHP with a MySQL database backend. Yun Chew has
written a nice blog
demonstrating how to consolidate the SugarCRM application on this system
using LDOMs. I can think of many such applications that can be
consolidated on this and on T5140 and T5240 based systems.
In the work done by Yun referred to above, there was no need to create
any IO domains. But because the T5440 has 4 PCIe Root-Complexes, it is
possible to create up to 4 IO domains for applications sensitive to IO
performance. Such applications, like databases, can be run in an IO
domain so that the application has direct access to the physical
disks. The other domains, like application server domains, can access
the database over a virtual NIC. Each of the application server domains
can have another virtual NIC to communicate with the external world.
The good thing about LDOMs based virtualization is that, even if the
Primary Domain goes down, other domains continue to be functional. Many
other virtualization technologies do not have this advantage, which is
why Live Migration is so critical for them.
To get the best performance out of an LDOMs based application
deployment, it is important to understand the system topology a bit so
that it becomes easier to determine what to place where. I have tried
to create a sketch of the system topology below for reference.

When creating domains, the IO and CPU requirements of the applications
that will run in the virtualized environment should be estimated. The
IO performance of a virtualized 1Gig network and of a virtualized disk
is the same as native. But compared to native IO, virtualized IO
consumes more CPU cycles, often in the range of 5%-25% more, depending
on the size and frequency of the IO. Hence, when doing resource
planning for an LDOMs environment, a few points should be considered to
get the best performance from the T5440.
- Is the application CPU intensive?
  - Does it scale up with additional CPUs?
- Is the application Disk or Network IO intensive?
  - Moderately IO intensive applications would consume less than 50% of the maximum IO capacity of the device
- Is the application both CPU and IO intensive?
- How many interrupt sources would the domain need to manage?
  - PCIe based Fibre Channel HBAs normally have 2 interrupt sources.
  - PCIe based 1G network devices have either 1 or 2 interrupt sources, while 10G network devices have 8 interrupt sources.
  - Each virtualized IO device created out of vsw or vds has 1 interrupt source.
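To make the interrupt-source count concrete, here is a small sketch that adds up the figures quoted in the list above for one domain. The function name and the example device mix are hypothetical, and 1G network devices are counted at the upper figure of 2.

```shell
#!/bin/sh
# Sum the interrupt sources a domain would manage, using the per-device
# counts from the list above (1G NICs counted at the upper figure of 2).
# Arguments: fc_hbas nics_1g nics_10g virtual_devices
interrupt_sources() {
    fc=$1; g1=$2; g10=$3; virt=$4
    echo $(( $fc * 2 + $g1 * 2 + $g10 * 8 + $virt * 1 ))
}

# Hypothetical domain: 2 FC HBAs, one 1G NIC, one 10G NIC and
# 4 virtual devices exported through vsw/vds.
interrupt_sources 2 1 1 4    # 4 + 2 + 8 + 4 = 18
```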
The number of VCPUs that need to be allocated to a Domain depends
largely on the ability of the application to make good use of the
VCPUs. In addition to the VCPUs needed by the application, extra VCPUs
should be allocated to handle interrupts. For optimal performance,
VCPUs should be allocated to a domain in multiples of at least 4, and
preferably in multiples of 8 where possible.
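The rounding rule above can be sketched as a small helper. How many extra VCPUs to set aside for interrupts is left to the caller, since no ratio is given here; the function name and the example numbers are hypothetical.

```shell
#!/bin/sh
# Round a VCPU request up to the next multiple (8 unless told otherwise).
# Arguments: app_vcpus interrupt_vcpus [multiple]
plan_vcpus() {
    total=$(( $1 + $2 ))
    mult=${3:-8}
    echo $(( (total + mult - 1) / mult * mult ))
}

plan_vcpus 6 3      # 9 requested, rounded up to 16
plan_vcpus 6 3 4    # same request in multiples of 4 gives 12
```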
In the next section I will describe how to create an IO domain with an Inter-IO Domain Dependency.
Monday Oct 13, 2008
With the introduction of Chip Multi-Threading (CMT) in the SPARC
Processor Family, a new sun4v based architecture was also introduced.

This sun4v interface allows the Operating System to communicate with
the hardware via a layer called the Hypervisor. The Hypervisor provides
a Hardware Abstraction to the Operating System. The Hypervisor itself
is not an Operating System and is delivered with the platform, bundled
with the Firmware. This makes it possible to carve out different groups
of actual Hardware components and present each group to an Operating
System.

This combination of the Hypervisor and a sun4v based Operating System
is the key enabler for LDOMs. LDOMs is supported on all UltraSPARC T1
and UltraSPARC T2 based systems. There are some nice documents
about LDOMs, as well as discussion forums that you can join to post your questions.
LDOMs Concept
An UltraSPARC T1 processor is equipped with up to 8 cores, with 4
Hardware Threads (Strands) per core. Each Hardware Thread is seen as a
CPU by the Operating System. An UltraSPARC T2 processor is also
equipped with up to 8 cores per chip, but with 8 Hardware Threads per
core.
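Since each Hardware Thread shows up as a CPU, the CPU count the Operating System sees is simply chips x cores x threads. A tiny sketch (the function name is made up for illustration):

```shell
#!/bin/sh
# CPUs visible to the OS = chips x cores per chip x threads per core.
cpu_count() { echo $(( $1 * $2 * $3 )); }

cpu_count 1 8 4   # one UltraSPARC T1: 32
cpu_count 1 8 8   # one UltraSPARC T2: 64
cpu_count 4 8 8   # four 8-core, 8-thread chips: 256
```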
When creating domains, CPUs are allocated to a domain. A CPU allocated
to one domain cannot be shared with another domain. Similarly, when
memory is allocated to a domain, the same memory cannot be allocated to
another domain. Hence CPU and memory are partitioned across domains.
However, IO devices like network cards or disks can be shared. When
sharing disks, a single slice of a disk cannot be shared among multiple
domains, but different slices of a disk can be allocated to different
domains. It is also possible to create large files on a mounted
filesystem and make each file available to a domain as a disk.
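As a rough illustration of the disk sharing described above, the sketch below prints the commands that would export one file and two different slices of the same disk to three guests. The paths, volume names, guest names and the vds name primary-vds0 are all assumptions; nothing is executed.

```shell
#!/bin/sh
# Dry-run sketch: export a backend (file or disk slice) as a vdisk.
# All paths and names are hypothetical examples.
run() { echo "$*"; }

# Arguments: backend volume_name domain
export_disk() {
    run ldm add-vdsdev "$1" "$2@primary-vds0"
    run ldm add-vdisk "$2" "$2@primary-vds0" "$3"
}

# A large file on a mounted filesystem, offered to guest1 as a disk.
run mkfile 8g /ldoms/guest1.img
export_disk /ldoms/guest1.img vol1 guest1

# Two different slices of one disk, each given to a different guest.
export_disk /dev/dsk/c1t1d0s4 vol2 guest2
export_disk /dev/dsk/c1t1d0s5 vol3 guest3
```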
The UltraSPARC T1 based T2000 has 2 PCI-e Root-Complexes. The
UltraSPARC T2 based T5120 and T5220 have 1 PCI-e Root-Complex along
with 2 on-chip 10 Gigabit Ethernet ports. The UltraSPARC T2 Plus based
T5140 and T5240 have 2 PCI-e Root-Complexes, and the T5440 has 4. It is
possible to allocate a Root-Complex to a Guest Domain so that the Guest
has direct access to the devices connected to that Root-Complex.
LDOMs Components
- Primary Domain - This is the default, or first, domain that is available with a new system. Initially all system resources are allocated to this domain. This is the only domain that can be used to configure other domains. It is sometimes referred to as the Control Domain.
- Service Domain - A domain that provides disk and network services to other domains. For example, if a Guest Domain makes a Disk Image stored in its filesystem available for booting another domain, then it can be called a Service Domain.
- IO Domain - A domain that owns physical IO devices. When such a domain shares its devices with another domain, it can also be termed a Service Domain.
- Guest Domain - A domain that depends on any of the above three domains for its IO services.
- Virtual Disk Client (vdc) - A device driver component active in the Guest Domain that provides a disk view to the domain.
- Virtual Disk Server (vds) - A device driver component active in the Service Domain that is responsible for the physical IO after receiving requests from the vdc.
- Virtual Network Client (vnet) - Similar to vdc above, but provides a Virtual NIC service to the Guest.
- Virtual Network Switch (vsw) - A switch implementation that communicates with vnet on one side and with the NIC device driver on the other side.
- Virtual Console Concentrator (vcc) - Provides Console access to a Guest Domain.
- MAU - These are the On-Chip Cryptographic Co-Processors. There is 1 MAU per core.
Steps for Creating a Domain
- Some CPU and Memory resources must be removed from the Primary Domain so that they can be allocated to other domains
- A vcc instance needs to be created in the Primary Domain
- A vsw and a vds (Virtual Disk Server) instance need to be created
- At this point a Guest Domain can be created
  - It should be assigned a Console Port (vcc)
  - Its vdc should be associated with a Virtual Disk Service
  - Its vnet should be associated with a vsw
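The steps above could look something like the following `ldm` sequence. It is a sketch, not a tested recipe: the service names, the console port range, the network device e1000g0, the backing device and the resource sizes are assumptions chosen for illustration, and the helper only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the domain creation steps above.
# Service names, port numbers, devices and sizes are hypothetical.
run() { echo "$*"; }

# Argument: name of the Guest Domain to create
create_guest() {
    dom=$1
    # Free CPU and memory from the Primary Domain.
    run ldm set-vcpu 8 primary
    run ldm set-memory 8G primary
    # Create the vcc, vsw and vds services in the Primary Domain.
    run ldm add-vcc port-range=5000-5100 primary-vcc0 primary
    run ldm add-vsw net-dev=e1000g0 primary-vsw0 primary
    run ldm add-vds primary-vds0 primary
    # Create the Guest Domain and attach console, disk and network.
    run ldm add-domain "$dom"
    run ldm set-vcpu 8 "$dom"
    run ldm set-memory 4G "$dom"
    run ldm set-vcons port=5000 "$dom"
    run ldm add-vdsdev /dev/dsk/c1t1d0s2 vol1@primary-vds0
    run ldm add-vdisk vdisk0 vol1@primary-vds0 "$dom"
    run ldm add-vnet vnet0 primary-vsw0 "$dom"
    run ldm bind-domain "$dom"
    run ldm start-domain "$dom"
}

create_guest ldg1
```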
Tony Shoumack wrote a nice blueprint to provide detailed help with domain creation using LDOMs.
The per-core FPUs of UltraSPARC T2 and UltraSPARC T2 Plus are just
functional units of the core. When a Domain needs to execute a Floating
Point instruction, the core associated with the Domain takes care of it.
If the Domain needs to accelerate Cryptographic Operations by offloading
them to the On-Chip Cryptographic Co-Processor, then MAUs need to be
assigned to the domain.
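Assigning MAUs is a single `ldm` call. The sketch below just prints it; the domain name ldg1 and the count of 1 are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: assign MAUs (crypto units) to a domain.
run() { echo "$*"; }

# Arguments: number_of_maus domain
assign_maus() { run ldm set-mau "$1" "$2"; }

assign_maus 1 ldg1
```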
In the next section, I will cover how to allocate devices and CPU to get the best performance.
Monday Oct 13, 2008
In my previous blog
I discussed share-management of resources. Over the years, the
system resources that get managed have evolved, making
share-management more complex. Not long ago, VT100 type dumb
terminals connected to serial line concentrators were a popular
way to share a server system among end users. With the demand for
Desktop Graphics, Sun invented the SunRay
technology, allowing thousands of Graphics Desktops to be concentrated
on a few servers without Graphics Cards. The basic
share-management software in these technologies is similar to what a
Multi-User Operating System provides.
With increasing demand for Name-Space and Configuration isolation, Sun created Zones
(a.k.a. Solaris Containers), a lightweight Operating System level
Virtualization technology. Each Zone represents a whole system with
its own Name-Space and Configuration that can be different from another
Zone. Zones share the same kernel on a given system. But a special type
of Zone, called Branded Zones,
allows Solaris 8 and Solaris 9 Operating System instances to
run on Solaris 10. Branded Zones created on the Solaris 10 x86 Operating
Environment can also run a 32bit Linux OS. CPU and memory resources can
be shared or dedicated. A scheduler called the Fair Share Scheduler helps maintain the balance of CPU usage among the Zones.
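To give a feel for how the Fair Share Scheduler balances CPU, the sketch below computes the share of CPU a Zone is entitled to when all the listed Zones are busy: its own shares divided by the sum of the shares of the competing Zones. The share values are made-up examples, and the result is truncated to a whole percent.

```shell
#!/bin/sh
# Under FSS a busy zone is entitled to (its shares / sum of the shares
# of all busy zones) of the CPU. FSS can be made the default scheduler
# with "dispadmin -d FSS".
# Arguments: my_shares followed by the shares of every busy zone
#            (including this one). Result is a whole percent.
fss_percent() {
    mine=$1; shift
    total=0
    for s in "$@"; do
        total=$(( total + s ))
    done
    echo $(( mine * 100 / total ))
}

fss_percent 20 20 30 50   # three busy zones: 20 of 100 shares = 20
fss_percent 20 20 30      # third zone idle: 20 of 50 shares = 40
```

The second call shows why shares are not a hard cap: when one Zone goes idle, the others are entitled to a larger fraction.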
From the above, it is evident that at least some system resources must
be shared, with active share-management in place, for a setup to be
termed Virtualized. The resources are
- CPU - can be dedicated or shared among the Domains.
- Memory - is normally not shared, but in case it gets shared among Domains, it can lead to performance penalties
- IO - can also be shared or dedicated at a leaf level or an entire IO subsystem can be dedicated to a Domain.
These new sharing requirements introduce the concept of an arbitrator,
which owns all the resources of the system and controls access to
them. This arbitrator is called the Hypervisor. Traditionally a
CPU executes instructions either in user mode or in supervisor mode.
But with multiple Domains accessing the same CPU, a new mode needs to
be introduced: Hyper-Privileged mode. This mode is assigned to the
Hypervisor. The location and exact role played by the Hypervisor in
allocating, dedicating or sharing resources among Domains
differentiates one Virtualization technology from another. Some
Hypervisors are extensions to existing kernels, while other Hypervisors
are part of the System Firmware.
When an IO device is shared by multiple domains, a Proxy
mechanism is normally used. The Proxy performs the actual IO on
behalf of the Guest Domain. The Guest communicates with the Proxy over channels.
The channels are allocated and maintained by the Hypervisor. The actual
functionality provided by a channel depends on the
Virtualization technology used. The Hypervisor is often also
responsible for managing the IO space between the Guest Domains and the
Proxy. It sometimes performs the task of copying the data from one IO
space to another, or it grants access to a piece of memory belonging to a
Domain or Proxy to another Domain or Proxy so that it can relieve itself
of doing the actual copy. This copy can add extra
overhead and is often the source of reduced Virtualized IO performance
when compared to Native IO performance. New features in the PCI-Express
subsystem allow a Guest Domain to do IO directly with the physical
device. This advancement in the PCI-Express subsystem has led
Virtualization Technology providers to come up with two new solutions, viz. Direct-IO and IOV. I will go into the details of these later.
It is apparent from the above that the Guest Operating System needs to
be modified to some extent to allow it to communicate with the Proxy.
When the Guest Operating System needs modification, or is made
Virtualization-aware, it is called Para-Virtualization.
But it is also possible to emulate an entire computer system and present
it to the Guest Operating System. At a minimum, if the IO subsystem is
emulated, then it is possible to run a Guest Domain with an unmodified
Native Operating System. This is often termed Full Virtualization.
Because this technique involves a lot of emulation, its performance often
lags that of Para-Virtualized domains. Accelerating it
requires help from the hardware, which is termed Hardware Assisted Virtualization.
In this new Virtualization space, Sun offers two solutions: xVM Server for the x86 Platform and LDOMs for the SPARC Platform.
In the next section, I will write about LDOMs.
Monday Oct 13, 2008
Over the past couple of years, I have been visiting customers, ranging
from enterprises to small and mid-size web-tier companies and everything
in between (except Telco customers), in an attempt to understand
how Virtualization can help them. I found several use cases and some
confusion.
Why Deploy Virtualization?
The term Virtualization spans several technologies that the industry
has offered so far. At a very basic level, I would classify
Virtualization Technology as a way to share the resources of a single server system across multiple users. The users can be human beings and/or application software. The resources are generally the components that make up a computer, viz.
CPU, Memory, IO and Display/Input devices. Hence, at a very basic level,
the Virtualization technology must actively participate in
share-management of these resources.
I should make a distinction here between Partitioning a large system
(e.g. Domains of a Sun Fire 6900) into Domains, where none of the above
resources are shared (sometimes referred to as Hard Partitioning),
and Server Virtualization (the term Domain is also used in this
context), where at least some of the resources are shared with active
share-management in place. The term Domain encompasses the above set of
resources together with some settings (viz. OS, patches, tunables, software) that can be different from another Domain and yet co-exist in the same Physical system.
While Virtualization is just one way of doing Server Consolidation,
the availability of so many Virtualization Technologies with so many
options has created some misconceptions and confusion.
- Can I really re-deploy my application on a new domain without affecting the end-user experience? The short answer is no, unless it was also possible to do so in a non-Virtualized environment.
- I can create a lot of domains in a single system, creating a situation where CPU and Memory are over-subscribed. What about performance?
- Can I move my Guest from one Virtualization technology to another? The answer depends on several factors:
  - The Instruction Set supported by the source system and the target system should be the same, unless the Virtualization technology on the target system can emulate the source system's Instruction Set.
  - If the Guest Operating System is unaware that it is running in a virtualized environment, then it should be possible to do such migrations.
- I can set up or move domains around and bring up my application in no time. Sometimes that is true; it depends on how complex the IO environment is.
- I can use the same Disk device or Boot Image to boot all my domains. Sharing a Boot Image is not possible in Virtualization. Sharing a Physical Disk is possible with some caveats.
Next, I plan to write about some Virtualization technologies that are
available today. It should help clear up some of the confusion and
misconceptions.
Tuesday Aug 22, 2006
We have just released the "CoolThreads Optimized Open Source Software Stack" (Cool Stack).
This is a collection of some of the most commonly used open source applications, optimized for the Solaris OS and the UltraSPARC platform. Using these binaries can enhance system performance. Cool Stack is built with the Sun Studio 11 Compiler, resulting in performance increases of anywhere between 25% and 200%, depending on your application.
Please visit
http://cooltools.sunsource.net/coolstack/index.html
Tuesday Dec 06, 2005
Awesome Scalability of Sun Fire T2000
The UltraSPARC T1 processor with
CoolThreads technology is unique: it has 8 cores and 4 threads per core, and draws about as much power as a household light bulb. With 32 simultaneous processing threads, it is a breakthrough technology for reducing data center energy consumption without compromising performance or reliability.
Being less power hungry, Sun Fire T2000, with a single UltraSPARC T1 processor,
and 4 built-in Gigabit ports, operating at line speed, is ideal for many
types of web and edge server applications. The Sun Fire T2000 system comes
loaded with Solaris (TM) 10 Operating System and J2SE 1.5 and is suitable
for deploying many applications including
- Simple Web server Applications with lots of static pages
- Java Servlet/JSP based Online Auctioning Applications
- Java Servlet/JSP based Online Banking Applications
- Java Servlet/JSP based Ecommerce Applications
Any of these applications can be taken as-is from a current two-way deployment and easily deployed as a single-instance application on the Sun Fire T2000 system. Contrary to several myths from our Competition, a Java based application does not need to be re-compiled or re-architected for deployment on the new platform; the application class/jar files may well have been created on an x86 based system. Having been involved in numerous Early Access customer evaluations, I have seen C/C++ based applications running on Lintel based systems get ported easily to SPARC (R)/Solaris (TM). Again, contrary to the myths, no application code change is necessary.
Do I need to change my Application to run on T2000?
Occasionally what we do see are scaling issues inside the Application
itself. These applications were designed for 1 to 2 processor systems,
and in production there are a bunch of these systems in a rack making up a
web/application/server farm. We first deploy the application on a
T2000 with the same load that was previously applied to the two-way system.
Looking at the system vital statistics (mpstat/vmstat/netstat), the customer and
I typically feel that we can push more load to this single instance of the
web application. As we increase the load on the single instance of
the application, the hotspots in the application show up in the
system statistics: for example, mpstat shows a high spin-mutex count (smtx), or
there is a sudden increase in
lwp_mutex*() function calls. I am fortunate that
the Solaris (TM) Operating System is bundled with many tools that
can be used to identify hot spots in a customer application. One simple
mechanism for Java based Applications is sending a
kill -QUIT <pid_of_jvm>
to get a thread dump that shows the hot monitors used in the application. Others are
plockstat for
hot Solaris (TM) thread locks, or the extremely powerful DTrace tool to profile
an application.
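For reference, the tools mentioned above might be invoked along these lines. The sketch only prints the command lines for a given process id; the pid 1234, the 5 second mpstat interval and the 997Hz profile rate are arbitrary examples.

```shell
#!/bin/sh
# Print example invocations of the observation tools named above for
# one process. Nothing is executed; the pid passed in is a placeholder.
hotspot_cmds() {
    pid=$1
    # Per-CPU statistics every 5 seconds; watch the smtx column.
    printf '%s\n' "mpstat 5"
    # User-level lock contention (mutexes and rwlocks) in the process.
    printf '%s\n' "plockstat -A -p $pid"
    # Ask a JVM for a thread dump, including monitor ownership.
    printf '%s\n' "kill -QUIT $pid"
    # Sample user stacks of the process with the DTrace profile provider.
    printf '%s\n' "dtrace -n 'profile-997 /pid == \$target/ { @[ustack()] = count(); }' -p $pid"
}

hotspot_cmds 1234
```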
While in the EA program, I have used a Sun Fire T2000 system to highlight hot
spots in customer applications. Some customers decided to modify their code to get
single-instance scaling. This is where the "Changing the Application"
story was born, and it was then generalized to "all applications running on the
new Chip from Sun".
The sophisticated threading architecture of
Sun Java System Web Server 6.1 is highly suitable for the Sun Fire T2000. A single instance
of the Web Server has demonstrated superb scalability on this system. A 64bit version of the Web Server is scheduled to be released in a couple of months. A single instance of this 64bit Web Server can manage Gigabytes of static documents, run a 64bit in-process JVM and manage in excess of 80,000 TCP connections without compromising security or response time.
For Java based applications the following techniques should be
investigated to improve performance
- If the application is using an older JVM, try to move the customer
application to the JDK bundled with the OS
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b04)
This has many UltraSPARC T1 specific optimizations.
- If J2SE 1.5.0_06 is not an option, try to move the customer application
to at least JDK 1.4.2
- If the customer application shows pauses of seconds due to Garbage
Collection, make sure that the customer is using these JVM flags
-XX:+UseParallelGC -XX:ParallelGCThreads=N
'N' can be as small as 2 and at most 8
J2SE 1.5.0_06 also offers -XX:+UseParallelOldGC to parallelize Old Generation
Garbage collection
- Applications configured to use the 64bit JVM and running with a Java Heap size
larger than 4GB can use this flag in addition.
-XX:LargePageSizeInBytes=256m
There are several other flags that can be used to scale a single instance
up on the Sun Fire T2000 system. A list of JVM options is available at this
URL
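Put together, the flags above might end up on a command line like the one this sketch prints. The 6GB heap, the -d64 switch for the 64bit JVM and app.jar are hypothetical, and the helper clamps the GC thread count to the 2 to 8 range suggested above. Note that the `+` form (-XX:+UseParallelOldGC) is what enables the option.

```shell
#!/bin/sh
# Print a java command line using the GC flags discussed above.
# Heap size and app.jar are placeholders, not recommendations.
# Argument: desired number of parallel GC threads
jvm_cmd() {
    n=$1
    # Keep N within the 2..8 range.
    if [ "$n" -lt 2 ]; then n=2; fi
    if [ "$n" -gt 8 ]; then n=8; fi
    echo "java -d64 -Xmx6g -XX:+UseParallelGC -XX:ParallelGCThreads=$n -XX:+UseParallelOldGC -XX:LargePageSizeInBytes=256m -jar app.jar"
}

jvm_cmd 32    # a request for 32 threads is clamped to 8
```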
Java based applications deployed on the Sun Fire T2000 will often perform much better than, or at least
similar to, the two-way system where they are currently running, without any further tuning.
Sometimes, after ensuring that a single instance can scale to its limit, we find
that the Sun Fire T2000 system is not even half utilized. At this point,
we use system consolidation techniques, like running multiple instances of the application, as if consolidating several of those two-way systems into a single Sun Fire T2000 system. Depending on the type of the application, a single Sun Fire T2000 system can consolidate anywhere
from 3 to 8 two-way systems.
Solaris (TM) 10 provides rich facilities for isolation as well
as consolidation. If strong isolation is required, one can use a Zone for
each two-way system that we consolidate. We take the ip-address of the two-way system,
make it the ip-address of the Zone, and run the application in the Zone.
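A Zone that takes over the ip-address of one two-way system could be set up roughly as below. The zone name web1, the zonepath /zones/web1, the address and the interface are assumptions for illustration; the sketch prints the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: commands to create a Zone that owns the ip-address
# of one consolidated two-way system. Names and paths are hypothetical.
# Arguments: zone_name ip_address physical_interface
zone_cmds() {
    zone=$1; addr=$2; nic=$3
    echo "zonecfg -z $zone \"create; set zonepath=/zones/$zone; add net; set address=$addr; set physical=$nic; end; commit\""
    echo "zoneadm -z $zone install"
    echo "zoneadm -z $zone boot"
}

zone_cmds web1 192.168.1.2 ipge2
```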
If isolation is not that important, we can still consolidate several of
those two-way systems into a single Sun Fire T2000. Take this example:
How to Consolidate
An existing application farm is addressed by:
http://www.waterfalls.com
The farm has a load balancer that balances incoming requests across
a rack of 16 two-way systems with ip-addresses in the range
192.168.1.2 ..... 192.168.1.17
Each of these two-way systems runs a Servlet/JSP application in a web container.
As a first step, we pick the interface on the Sun Fire T2000 that will connect to
the load balancer - let it be ipge2.
We plumb the ipge2 interface and assign an ip-address to it - 192.168.1.1
We then create 16 logical interfaces and assign the ip-address of each
of the two-way systems like this:
ifconfig ipge2:1 plumb
ifconfig ipge2:1 192.168.1.2 netmask + broadcast + up
...
...
ifconfig ipge2:16 plumb
ifconfig ipge2:16 192.168.1.17 netmask + broadcast + up
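Rather than typing all 32 lines, the same commands can be generated with a loop. This sketch prints them so they can be checked first; it assumes, as in the example, that logical interface N gets the address 192.168.1.(N+1).

```shell
#!/bin/sh
# Print the plumb/configure commands for the logical interfaces.
# Logical interface N is given the address 192.168.1.(N+1).
# Arguments: interface_name number_of_logical_interfaces
gen_logical_ifs() {
    ifname=$1; count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "ifconfig $ifname:$i plumb"
        echo "ifconfig $ifname:$i 192.168.1.$(( i + 1 )) netmask + broadcast + up"
        i=$(( i + 1 ))
    done
}

gen_logical_ifs ipge2 16
```

Piping the output to `sh` as root would apply the configuration.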
Install a SPARC (R)/Solaris (TM) version of the web container on the
Sun Fire T2000. Create 16 instances of the server. (In the case of Tomcat,
it is as simple as untarring the Tomcat package in 16 different directories,
tomcat1...tomcat16, or using the same binary location for 16 different instances.)
Each web-application server can be bound to a specific ip-address. Configure
each server to bind to one unique logical interface from the above
list, start the servers, and we have just consolidated 16 two-way systems into a
single Sun Fire T2000 system.
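For the Tomcat case, the instance-to-address mapping could be listed with a loop like this. It is a sketch that assumes the binding is done through the `address` attribute of the HTTP Connector in each instance's conf/server.xml, and that port 8080 is used; both are illustrative choices, not taken from the setup above.

```shell
#!/bin/sh
# Print, for each Tomcat instance directory, the Connector element that
# would bind it to its own logical ip-address (instance N gets
# 192.168.1.(N+1)). Port 8080 is a placeholder.
# Argument: number_of_instances
tomcat_bindings() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "tomcat$i/conf/server.xml: <Connector port=\"8080\" address=\"192.168.1.$(( i + 1 ))\" />"
        i=$(( i + 1 ))
    done
}

tomcat_bindings 16
```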