Monday Oct 13, 2008
If we refer to the System Topology diagram in my previous blog,
we find that the internal disks of the T5440 are connected to PCIe-0. Hence
it is not possible to remove PCIe-0 from the Primary (or Control)
Domain. However, it is possible to remove PCIe-1, PCIe-2 and PCIe-3 from
the Primary Domain and allocate them to IO Domains.
In order to create an IO domain using PCIe-1, it has to be removed from
the Primary Domain. This would cause the Primary Domain to lose its primary
network interface if it has been using the on-board NICs. However, if
a network card is available on PCIe-0, the primary network
for the Primary Domain can be switched to the ports on that card
before removing PCIe-1 from the Primary Domain. If an additional network
card is not available, it is still possible to remove PCIe-1
from the Primary Domain and create an IO domain (let us call it the Secondary
Domain) managing the devices off PCIe-1. In that case, the Primary Domain
provides the boot-disk service to the Secondary Domain and the
Secondary Domain provides the primary network service for the
Primary Domain. The pseudo-steps below outline how this can be done.
- In the Primary Domain
- set the number of VCPUs to 8 (this is just an example number of VCPUs)
- set the memory to 8GB (just an example size of memory)
- create a vdisk-server
- remove PCIe-1 from its control
- This would cause the Primary Domain to lose its network after reboot
- Reboot the Primary Domain and log back into the Primary Domain from Console
- To cause the VCPUs for the Secondary Domain to be allocated from T1 (refer to the
Topology Diagram), create a dummy domain with the remaining 56 VCPUs
of T0. Bind the dummy domain.
- Associate a vdiskserverdevice as the boot-device for Secondary Domain
- Create the Secondary Domain
- set the number of VCPUs to 8
- set the memory to 8GB
- add PCIe-1 to it
- add the vdiskserverdevice as the vdisk for this domain
- Bind, install-OS and boot the domain
- Create a vswitch-device on the Secondary Domain
- Reboot the Secondary Domain
- Create a vnet-device for the Primary Domain associated with the above vswitch-device
- Plumb and configure the vnet device on the Primary Domain (assuming the
on-board network ports are connected to the primary network of the data
center). Now the Primary Domain should have the primary network
available.
- Remove the dummy domain and proceed with creating other domains.
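To see why the dummy domain steers the Secondary Domain onto T1, here is a minimal Python sketch. It assumes (a simplification for illustration, not documented behavior) that free strands are handed out in ascending order, first-fit, and that each of the 4 chips exposes 64 strands. The helper names `allocate` and `chips_used` are made up for this example.

```python
STRANDS_PER_CHIP = 64   # 8 cores x 8 threads per chip
CHIPS = 4               # T0..T3 on a fully populated T5440

def allocate(free, count):
    """Take the `count` lowest-numbered free strands (assumed first-fit)."""
    if count > len(free):
        raise ValueError("not enough free strands")
    taken = sorted(free)[:count]
    free.difference_update(taken)
    return taken

def chips_used(strands):
    """Return the set of chip indexes that the given strands live on."""
    return {s // STRANDS_PER_CHIP for s in strands}

free = set(range(STRANDS_PER_CHIP * CHIPS))
primary = allocate(free, 8)      # Primary Domain keeps 8 VCPUs on T0
dummy = allocate(free, 56)       # dummy domain soaks up the rest of T0
secondary = allocate(free, 8)    # so these strands come from T1

print(chips_used(primary), chips_used(dummy), chips_used(secondary))
```

Under that assumption the Secondary Domain ends up entirely on chip 1. Without the dummy domain, its 8 VCPUs would be taken from the free strands that remain on T0.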
With the above technique, when the Primary Domain is rebooted,
the Secondary Domain may seem to pause until the Primary Domain boots
back. Similarly when the Secondary Domain is rebooted, the Primary
Domain's primary network may appear to freeze until the Secondary
Domain comes back online. But that is far better than losing all the
domains and the applications running in those domains.
Monday Oct 13, 2008
The Sun Fire T5440 can have at most 4 UltraSPARC T2 Plus
processors. Each processor is directly connected to one quarter of
the system memory, with 1 GB memory interleaving, and owns a
PCIe Root-Complex. When fully populated with processors and memory,
Solaris sees 256 CPUs and 512GB of memory. That is a lot for most
applications, except for some large databases. With this class of
system, it is usually not possible to consume the entire system with a
single instance of most applications. That is in fact a very good
opportunity to consolidate a number of such applications on this system
using LDOMs, thereby reducing power consumption and rack space. An
example is the SugarCRM application. It is a web-based application
written in PHP with a MySQL database backend. Yun Chew has
written a nice blog
demonstrating how to consolidate SugarCRM on this system
using LDOMs. I can think of many such applications that can be
consolidated on this and on T5140 and T5240 based systems.
In the work done by Yun referred to above, there was no need to create any IO
domains, but because the T5440 has 4 PCIe Root-Complexes, it is possible to
create up to 4 IO domains for applications sensitive to IO performance.
Such applications, like databases, can be run in an IO domain so that
the application has direct access to the physical disks. The other
domains, such as application server domains, can access the database over
a virtual NIC. Each of the application server domains can have another
virtual NIC to communicate with the external world.
The good thing about LDOMs based virtualization is that, even if the Primary
Domain goes down, the other domains continue to function. Many other
virtualization technologies do not have this advantage, which is why
Live Migration is so critical for them.
To get the best performance out of an LDOMs based application deployment,
it is important to understand the system topology a bit, so that it
becomes easier to determine what to place where. I have tried to create
a sketch of the system topology below for reference.

When creating domains, the IO and CPU requirements of the applications that
will run in the virtualized environment should be estimated. The
IO performance of a virtualized 1Gig network and of a virtualized disk is the same
as native. But compared to native IO, virtualized IO consumes more CPU
cycles, often in the range of 5%-25%, depending on the size and
frequency of the IO. Hence, when doing resource planning for an LDOMs
environment, a few questions should be considered to get the best
performance from the T5440:
- Is the application CPU intensive?
- Does it scale up with additional CPUs?
- Is the application Disk or Network IO intensive?
- Moderately IO intensive applications would consume less than 50% of maximum IO capacity of the device
- Is the application both CPU and IO intensive?
- How many interrupt sources would the domain need to manage?
- PCIe based Fibre Channel HBAs normally have 2 interrupt sources.
- PCIe based 1G network devices have either 1 or 2 interrupt sources, while 10G network devices have 8 interrupt sources.
- Each virtualized IO device created out of vsw or vds has 1 interrupt source.
The number of VCPUs that need to be allocated to a Domain depends largely
on the ability of the application to make good use of the VCPUs. In
addition to the VCPUs needed by the application, extra VCPUs should be
allocated to handle interrupts. For optimal performance, VCPUs
should be allocated to a domain in multiples
of 4 at least, and preferably in multiples of 8 where possible.
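As a rough planning aid, the sizing rule can be sketched in Python. The interrupt-source counts come from the list above. Reserving one extra VCPU per interrupt source is an assumption made for this sketch (the text only says that extra VCPUs are needed), and the function name `vcpus_for_domain` is hypothetical.

```python
import math

# Interrupt sources per device type, as listed above.
INTERRUPT_SOURCES = {
    "fc_hba": 2,       # PCIe Fibre Channel HBA
    "nic_1g": 2,       # 1G NIC: 1 or 2, the larger value is used here
    "nic_10g": 8,      # 10G NIC
    "virtual_io": 1,   # each device created out of vsw or vds
}

def vcpus_for_domain(app_vcpus, devices, multiple=8):
    """Round (application VCPUs + interrupt VCPUs) up to a multiple.

    Assumption for this sketch: one extra VCPU per interrupt source.
    """
    interrupt_vcpus = sum(INTERRUPT_SOURCES[d] * n for d, n in devices.items())
    needed = app_vcpus + interrupt_vcpus
    return math.ceil(needed / multiple) * multiple

# An IO domain: 12 VCPUs for the application, 2 FC HBAs and one 10G NIC.
print(vcpus_for_domain(12, {"fc_hba": 2, "nic_10g": 1}))      # 24
# A small guest with 2 virtual devices, rounded to a multiple of 4.
print(vcpus_for_domain(3, {"virtual_io": 2}, multiple=4))     # 8
```

Treat the result as an upper estimate. A domain whose devices are only lightly loaded may need fewer interrupt VCPUs.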
In the next section, I will describe how to create an IO domain with Inter-IO Domain Dependency.
Monday Oct 13, 2008
With the introduction of Chip Multi-Threading (CMT) in the SPARC
Processor Family, a new sun4v based architecture was also introduced.

This sun4v interface allows the Operating System to communicate with the
hardware via a layer called the Hypervisor. The Hypervisor provides a
Hardware Abstraction to the Operating System. The Hypervisor itself is
not an Operating System and is delivered with the platform, bundled with
the Firmware. This makes it possible to carve out different groups
of actual Hardware components and present each group to an Operating System.

This combination of the Hypervisor and a sun4v based Operating System is the
key enabler for LDOMs. LDOMs is supported on all UltraSPARC T1 and
UltraSPARC T2 based systems. There are some nice documents
about LDOMs, as well as discussion forums that you can join to post your questions.
LDOMs Concept
An UltraSPARC T1 processor is equipped with up to 8 cores, with 4 Hardware
Threads (Strands) per core. Each Hardware Thread is seen as a CPU by
the Operating System. An UltraSPARC T2 processor is also equipped with
up to 8 cores per chip, with 8 Hardware Threads per core.
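The CPU counts the Operating System reports follow directly from these numbers. A small Python sketch, assuming fully configured chips (the function name is made up for illustration):

```python
def cpus_seen_by_os(chips, cores_per_chip, threads_per_core):
    """Each hardware thread (strand) shows up as one CPU in the OS."""
    return chips * cores_per_chip * threads_per_core

print(cpus_seen_by_os(1, 8, 4))   # UltraSPARC T1, 8 cores: 32
print(cpus_seen_by_os(1, 8, 8))   # UltraSPARC T2, 8 cores: 64
print(cpus_seen_by_os(4, 8, 8))   # four 8-core, 8-thread chips: 256
```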
When creating domains, CPUs are allocated to a domain. A CPU allocated to
one domain cannot be shared with another domain. Similarly, when memory
is allocated to a domain, the same memory cannot be allocated to
another domain. Hence CPU and memory are partitioned across domains.
However, IO devices like network cards or disks can be shared.
When sharing disks, a single slice of a disk cannot be shared among
multiple domains, but different slices of a disk can be allocated
to different domains. It is also possible to create large files on a
mounted filesystem and make a file available to a domain as a disk.
The UltraSPARC T1 based T2000 has 2 PCI-e Root-Complexes; the UltraSPARC T2 based T5120
and T5220 have 1 PCI-e Root-Complex along with 2 on-chip 10 Gigabit
Ethernet ports; the UltraSPARC T2 Plus based T5140 and T5240 also have 2 PCI-e
Root-Complexes; and the T5440 has 4 PCI-e Root-Complexes. It is possible to
allocate a Root-Complex to a Guest Domain so that the Guest has direct
access to the devices connected to that Root-Complex.
LDOMs Components
- Primary Domain - This is the default or first domain that is available with a
new system. Initially all system resources remain allocated to this
domain. This is the only domain that can be used to configure other
domains. This Domain is sometimes referred to as the Control Domain.
- Service Domain - A domain that provides disk and network services to other
domains. For example, if a Guest Domain makes a Disk Image stored in
its filesystem available for booting another domain, then it can be
called a Service Domain.
- IO Domain - A domain that owns
physical IO devices. When such a domain shares its devices with another
domain, it can also be termed a Service Domain.
- Guest Domain - A domain that depends on any of the above three domains for its IO services.
- Virtual Disk Client (vdc) - A device driver component active in the Guest Domain that provides a disk view to the domain.
- Virtual Disk Server (vds) - A device driver component active in the Service Domain that is
responsible for the physical IO after receiving requests from the vdc.
- Virtual Network Client (vnet) - Similar to vdc above, but provides a Virtual NIC service to the Guest.
- Virtual Network Switch (vsw) - A switch implementation that communicates with vnet on one side and with the NIC device driver on the other side.
- Virtual Console Concentrator (vcc) - Provides Console access to a Guest Domain.
- MAU - These are the On-Chip Cryptographic Co-Processors. There is 1 MAU per core.
Steps for Creating a Domain
- Some CPU and Memory resources must be removed from the Primary Domain so that they can be allocated to other domains
- A vcc instance needs to be created in the Primary Domain
- A vsw and a vds (Virtual Disk Server Device) instance need to be created
- At this time a Guest Domain can be created
- It should be assigned a Console Port (vcc)
- Its vdc should be associated with a Virtual Disk Service
- Its vnet should be associated with a vsw
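The steps above can be sketched as an ordered list of `ldm` command lines. The Python snippet below only prints them; it does not run anything. All names (`ldg1`, `primary-vcc0`, `primary-vsw0`, `primary-vds0`, `vol1`, `nxge0`, the disk path, the port numbers and the sizes) are placeholders chosen for illustration. The sketch also leaves out the Primary Domain reboot and the saving of the configuration, and the subcommand spellings should be checked against the administration guide for the LDoms release in use.

```python
def domain_creation_commands(guest, vcpus, memory, backend, net_dev):
    """Return the ldm command lines, in order, for one guest domain.

    All service, volume and device names are placeholders.
    """
    return [
        # 1. Free resources by shrinking the Primary Domain (example sizes).
        "ldm set-vcpu 8 primary",
        "ldm set-memory 8G primary",
        # 2. Console concentrator, virtual switch and disk server.
        "ldm add-vcc port-range=5000-5100 primary-vcc0 primary",
        f"ldm add-vsw net-dev={net_dev} primary-vsw0 primary",
        "ldm add-vds primary-vds0 primary",
        # 3. Create the guest and give it CPU and memory.
        f"ldm add-domain {guest}",
        f"ldm add-vcpu {vcpus} {guest}",
        f"ldm add-memory {memory} {guest}",
        # 4. Console port, virtual disk and virtual network for the guest.
        f"ldm set-vcons port=5000 {guest}",
        f"ldm add-vdsdev {backend} vol1@primary-vds0",
        f"ldm add-vdisk vdisk1 vol1@primary-vds0 {guest}",
        f"ldm add-vnet vnet1 primary-vsw0 {guest}",
        # 5. Bind resources and start the guest.
        f"ldm bind-domain {guest}",
        f"ldm start-domain {guest}",
    ]

for line in domain_creation_commands("ldg1", 8, "4G", "/dev/dsk/c1t1d0s2", "nxge0"):
    print(line)
```

Keeping the sequence as data makes it easy to review the order before typing anything on the Control Domain.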
Tony Shoumack wrote a nice blueprint to provide detailed help with domain creation using LDOMs.
The per-core FPUs of the UltraSPARC T2 and UltraSPARC T2 Plus are just
functional units of the core. When a Domain needs to execute a Floating
Point instruction, the core associated with the Domain takes care of it.
If the Domain needs to accelerate Cryptographic Operations by offloading them
to the On-Chip Cryptographic Co-Processors, then MAUs need to be
assigned to the domain.
In the next section, I will cover how to allocate devices and CPU to get the best performance.
Monday Oct 13, 2008
In my previous blog
I discussed share-management of the resources. Over the years, the
system resources that get managed have evolved, requiring the
share-management to become more complex. Not long ago, VT100 type dumb
terminals connected to serial line concentrators were a popular
way to share a server system among end users. With the demand for
Desktop Graphics, Sun invented the SunRay
technology, allowing thousands of Graphics Desktops to be concentrated
on a few servers without Graphics Cards. The basic
share-management software in these technologies is similar to what a
Multi-User Operating System provides.
With increasing demand for Name-Space and Configuration isolation, Sun created Zones
(a.k.a. Solaris Containers), a lightweight Operating System level
Virtualization technology. Each Zone represents a whole system with
its own Name-Space and Configuration, which can differ from those of another
Zone. Zones share the same kernel on a given system. But a special type
of Zone, called a Branded Zone,
allows Solaris 8 and Solaris 9 Operating System instances to
run on Solaris 10. Branded Zones created on the Solaris 10 x86 Operating
Environment can also run a 32-bit Linux OS. CPU and memory resources can
be shared or dedicated. A new type of scheduler called the Fair Share Scheduler helps maintain a balance of CPU usage among the Zones.
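The idea behind share-based scheduling can be illustrated with a few lines of Python: when every Zone is busy, each one is entitled to its shares divided by the total shares. The zone names and share counts below are hypothetical, and the sketch models only this ratio, not the scheduler itself.

```python
def cpu_entitlement(shares):
    """Map each zone to its fraction of CPU when every zone is busy."""
    total = sum(shares.values())
    return {zone: s / total for zone, s in shares.items()}

# Hypothetical zones and share counts.
for zone, fraction in cpu_entitlement({"web": 1, "app": 2, "db": 1}).items():
    print(f"{zone}: {fraction:.0%}")
```

With these numbers the `app` zone is entitled to half of the CPU and the other two zones to a quarter each.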
From the above, it is evident that at least some system resources must be
shared, with active share-management in place, for a setup to be
termed Virtualized. The resources are:
- CPU - can be dedicated or shared among the Domains.
- Memory - is normally not shared, but in case it gets shared among Domains, it can lead to performance penalties
- IO - can also be shared or dedicated at a leaf level or an entire IO subsystem can be dedicated to a Domain.
These new sharing requirements introduce the concept of an arbitrator, which
owns all the resources of the system and controls access to these
resources. This arbitrator is called the Hypervisor. Traditionally a
CPU executes instructions either in user mode or in supervisor (privileged) mode.
But with multiple Domains accessing the same CPU, a new mode needs to
be introduced: Hyper-Privileged mode. This mode is assigned to the
Hypervisor. The location and exact role played by the Hypervisor in
allocating, dedicating or sharing resources among Domains
differentiates one Virtualization technology from another. Some
Hypervisors are extensions to existing kernels, while other Hypervisors
are part of the System Firmware.
When an IO device is shared by multiple domains, a Proxy
mechanism is normally used. The Proxy performs the actual IO on
behalf of the Guest Domain. The Guest communicates with the Proxy over channels.
The channels are allocated and maintained by the Hypervisor. The actual
functionality provided by a channel depends on the
Virtualization technology used. The Hypervisor is often also
responsible for managing the IO space between the Guest Domains and the
Proxy. It sometimes performs the task of copying the data from one IO
space to another, or grants one Domain or Proxy access to a piece of memory
belonging to another Domain or Proxy so that it can relieve itself
from doing the actual copy. This copy can add extra
overhead and is often the source of reduced Virtualized IO performance
when compared to Native IO performance. New features in the PCI-Express
subsystem allow a Guest Domain to do IO directly with the physical
device. This advancement in the PCI-Express subsystem has led
Virtualization Technology providers to come up with two new solutions, viz. Direct-IO and IOV. I will go into the details of these later.
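The difference between the copy path and the grant path can be shown with a toy Python model. This is purely conceptual: the `Proxy` class, its `channel` queue and the `grant` flag are inventions for this sketch and do not correspond to any LDOMs or Hypervisor interface.

```python
from collections import deque

class Proxy:
    """Owns the backing store and performs IO on behalf of guests."""

    def __init__(self, disk):
        self.disk = disk            # bytearray standing in for a physical disk
        self.channel = deque()      # stand-in for a hypervisor-managed channel

    def service(self, grant=False):
        """Handle one queued read request and return the data."""
        offset, length = self.channel.popleft()
        view = memoryview(self.disk)[offset:offset + length]
        # grant=True hands back a view of the same memory (no copy);
        # grant=False copies the bytes into a new buffer.
        return view if grant else bytes(view)

proxy = Proxy(bytearray(b"boot-block-data"))

proxy.channel.append((0, 4))        # guest asks for 4 bytes at offset 0
copied = proxy.service()            # copy path

proxy.channel.append((5, 5))        # guest asks for 5 bytes at offset 5
shared = proxy.service(grant=True)  # grant path

print(copied, bytes(shared))
```

On the copy path every byte is moved once more before the Guest sees it, which is where the extra CPU cost of virtualized IO comes from. On the grant path the Guest reads the Proxy's memory directly.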
It is apparent from the above that the Guest Operating System needs to be
modified to some extent to allow it to communicate with the Proxy. When
the Guest Operating System is modified or made
Virtualization-aware, it is called Para-Virtualization.
But it is also possible to emulate an entire computer system and present
it to the Guest Operating System. At a minimum, if the IO subsystem is
emulated, then it is possible to run a Guest Domain with an unmodified
Native Operating System. This is often termed Full Virtualization.
Because this technique involves a lot of emulation, its performance often
lags that of Para-Virtualized domains. Accelerating it
requires help from the hardware, which is termed Hardware Assisted Virtualization.
In this new Virtualization space, Sun offers two solutions - xVM Server for x86 Platform and LDOMs for the SPARC Platform.
In the next section, I will write about LDOMs.
Monday Oct 13, 2008
Over the past couple of years, I have been visiting customers ranging
from enterprises to small and mid-size web-tier companies, and
everything in between except Telco customers, in an attempt to understand
how Virtualization can help them. I found several use cases and some
confusion.
Why Deploy Virtualization?
The term Virtualization spans several technologies that the industry has
offered so far. At a very basic level, I would classify Virtualization
Technology as a way to share the resources of a single server system across multiple users. The users can be human beings and/or application software. The resources are generally the components that make up a computer, viz.
CPU, Memory, IO and Display/Input devices. Hence, at a very basic level,
the Virtualization technology must actively participate in
share-management of these resources.
I should make a
distinction here between Partitioning a large system into Domains (e.g. the Domains
of a Sun Fire 6900), where none of the above resources
are shared (also sometimes referred to as Hard Partitioning), and
Server Virtualization (the term Domain is also used in this context),
where some or all of the resources are shared with active
share-management in place. The term Domain encompasses the above set of
resources with some settings (viz. OS, patches, tunables, software) that can be different from another Domain and yet co-exist in the same Physical system.
While Virtualization is just one way of doing Server Consolidation, the
availability of so many Virtualization Technologies with so many
options has created some misconceptions and confusion.
- Can I really re-deploy my application on a new domain without affecting the
end-user experience? The short answer is no, unless it was also
possible to do so in a non-Virtualized environment.
- I can create a lot of domains in a single system, creating a situation where CPU
and Memory are over-subscribed. What about performance?
- Can I move my Guest from one Virtualization technology to another? The answer depends on several factors
- The Instruction Set supported by the source system and the target system should
be the same, unless the Virtualization technology on the target system can
emulate the source system's Instruction Set.
- If the Guest Operating
System is unaware that it is running in a virtualized environment,
then it should be possible to do such migrations.
- I can set up or move domains around and bring up my application in no time.
Sometimes that is true; it depends on how complex the IO
environment is.
- I can use the same Disk device or Boot Image to
boot all my domains. Sharing a Boot Image is not possible in
Virtualization. Sharing a Physical Disk is possible with some caveats.
Next, I plan to write about some Virtualization technologies that are
available today. It should help clear up some of the confusion and
misconceptions.