Hadoop on CMT
Recently I have been involved in scaling Hadoop for our Chip Multi-Threading (CMT) machines (e.g., systems running up to 256 threads per node.) Hadoop is a Java-based open-source MapReduce framework that enables users to deploy their cloud-friendly workloads across many computing nodes. For example, Yahoo's Hadoop cluster recently performed a 1TB sort in less than a minute! Hadoop has been widely used for various cloud-friendly workloads such as word counting, sorting, index search, content optimization, data mining, machine learning, and so on.
- My webinar@Sun : Scaling Hadoop for Multi-Core and Highly Threaded Systems
Intra-Node Virtualization
One of our key findings in scaling Hadoop on CMT boxes was to create multiple logical nodes within a single node so that we can distribute a large workload across multiple nodes without introducing task startup overheads. Suppose that we want to utilize our system as much as possible. Assigning too many small tasks on a single CMT node incurs significant task startup costs despite achieving higher task parallelism. On the other hand, assigning only few large tasks on a single CMT node increases each job's running time significantly due to the low parallelism.
For this work, we used two virtualization techniques available in our CMT boxes.
- ZONE (a.k.a. Solaris Containers)
- S/W-based virtualization technique embedded in Solaris OS
- Application-level isolation within a single operating system
- All Zones freely share H/W resources (CPU, memory, network, etc)
- LDOM (Logical Domain)
- H/W-assisted virtualization technique requiring CMT boxes with LDOM software installed
- H/W-level isolation: independent operating system and dedicated hardware resources for each LDOM
Logical Domain (LDOM)

This figure shows my example 4 virtual node setup for T5240 box, implemented with LDOM. T5240 has two Niagara processors on board so that it can execute 128 threads simultaneously. The single node has 128GB memory, 16 disk drives, and 4 network interfaces. Below are steps that I followed to create the example setup.
First, download and install LDOM softwares (if not done yet.)
Tips: install LDOM on the latest Solaris (or OpenSolaris) and firmware to avoid extra patching.
My OS version:
root$ cat /etc/release
Solaris Express Community Edition snv_114 SPARC
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 04 May 2009My LDOM version (after LDOM is installed):
root$ /opt/SUNWldm/bin/ldm --version
Logical Domain Manager (v 1.1)
Hypervisor control protocol v 1.3
Using Hypervisor MD v 0.1
System PROM:
Hypervisor v. 1.7.0. @(#)Hypervisor 1.7.0.build_19****PROTOTYPE*** 2008/12/05 10:58 [4.30.0.
OpenBoot v. 4.30.0. @(#)OBP 4.30.0.build_12***PROTOTYPE BUILD*** 2008/12/03 13:57My PATH:
export PATH=$PATH:/opt/SUNWldm/bin;
export MANPATH=$MANPATH:/opt/SUNWldm/man;
Second, create a primary domain.
LDOM setup requires a primary domain (a.k.a. control domain) which manages and control all hardwares. The primary domain serves as a virtual service provider to allowing only virtualized H/W resources to each guest domain (e.g., CPU, memory, network, and disks). Below are example commands.
# creating 16 virtual disks
for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
ldm add-vds primary-vds$i primary
done
# creating virtual switch network ( you can create more)
ldm add-vsw net-dev=nxge1 primary-vsw0 primary
# creating virtual ports to each LDOM OBP (Open Boot Prom): allow console access
ldm add-vcc port-range=5000-5100 primary-vcc0 primary# save initial configuration
ldm add-spconfig initial
ldm list-spconfig- Reboot the system.
# see your primary LDOM configuration after reboot
ldm list-services primary
# enable ldom and virtual network server daemon
root$ svcadm enable ldmd
root@ svcadm enable vntsd
- Add and bind specific services.
# I gave 1 cryptographic unit (out of 8), 8 cpus (out of 128), 7G RAM (out of 128)
# need to give at least 1 MAU to the primary domain
ldm set-mau 1 primary
ldm set-vcpu 7 primary
ldm set-memory 8G primary
# mount read-only Solaris images (if you want to install OS from DVD images)
# probably network-based install might be easier
for i in 0 1 2 3; do
ldm add-vdsdev options=ro /export/home/Solaris/sol-nv-b114-sparc-dvd.iso vol-dvd@primary-vds$i
done# mounded block devices to vds0-15
for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do# method 1: in case of partitions
ldm add-vdsdev /dev/dsk/cXtXdXsX vol-disk@primary-vds$i
# method 2: in case of block files
mkfile 10G /mnt/$i/10G_BL_FILE
ldm add-vdsdev /mnt/$i/10G_BL_FILE vol-disk@primary-vds$idone
# check your primary domain setup
ldm list-bindings primary
Third, create guest domains.
- Configure virtual switches
# remember your physical network information
ifconfig -a- Creating pseudo-IP network (192.168.1.x) using nxge1, while nxge0 (my primary network) is alive
# enable virtual switch
ifconfig vsw0 plumb# disable the physical interface (CAUTION: do this from the console to remain connectivity)
ifconfig nxge1 down unplumb# give the info to the virtual switch
ifconfig vsw0 IP_of_nxge1(=192.168.1.1) netmask netmask_of_nxge1(=0xffffff00) broadcast + upifconfig vsw0
vsw0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 192.168.1.1 netmask ffffff00 broadcast 192.168.1.255
ether 0:14:4f:fb:40:5d# make the setting permanent
mv /etc/hostname.nxge1 /etc/hostname.nxge1.bk
echo primary > /etc/hostname.vsw0# add static pseudo-IPs in /etc/hosts
$grep 192.168 /etc/hosts
192.168.1.1 primary
192.168.1.10 ldom0
192.168.1.11 ldom1
192.168.1.12 ldom2
192.168.1.13 ldom3# you can also use dynamic IP addressing (DHCP) in a similar way
- Create guest LDOMs by inheriting services from the primary domain.
for i in 0 1 2 3; do
ldm add-domain ldom$i
ldm add-vcpu 30 ldom$i
ldm add-memory 30G ldom$i
ldm add-vnet vnet0 primary-vsw0 ldom$i
done# add virtual disks
node=4
for i in 0 1 2 4 5 6 7 8 9 10 11 12 13 14 15; do
num=$(($i/$node))
ldm add-vdisk vdisk-dvd vol-dvd@primary-vds$i ldom${num}
ldm add-vdisk vdisk vol-disk@primary-vds$i ldom${num}
done
# bind and start all guest domains
for i in 0 1 2 3; do
ldm bind-domain ldom$i
ldm start-domain ldom$i
done
Finally, configure guest LDOMs from console
# record ports to OBP for guest domains (e.g., 5000 for ldom0)
ldm list-domain
# case of ldom0 with port 5000
$ telnet localhost 5000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connecting to console "ldom0" in group "ldom0" ....
Press ~? for control options ..# check disks dedicated to this domain
{0} ok show-disks
a) /virtual-devices@100/channel-devices@200/disk@1
b) /virtual-devices@100/channel-devices@200/disk@0
q) NO SELECTION
Enter Selection, q to quit: q# check network dedicated to this domain
{0} ok show-nets
a) /virtual-devices@100/channel-devices@200/network@0
q) NO SELECTION
Enter Selection, q to quit: q# record the device names (necessary to set boot devices later)
{0} ok devalias
vdisk /virtual-devices@100/channel-devices@200/disk@1
vdisk-dvd /virtual-devices@100/channel-devices@200/disk@0
vnet0 /virtual-devices@100/channel-devices@200/network@0
net /virtual-devices@100/channel-devices@200/network@0
virtual-console /virtual-devices/console@1
name aliases
# DVD BOOT #
{0} ok boot /virtual-devices@100/channel-devices@200/disk@0:f -v
--> continue to install OS,
then set network vnet's IP as 192.168.1.X.
# repeat for ldom1, ldom2, ldom3- You can set automatic boot devices
for i in 0 1 2 3; do
# for DVD boot
ldm set-variable auto-boot\?=false ldom$i
ldm set-variable boot-device=/virtual-devices@100/channel-devices@200/\disk@0 ldom$i# for disk boot (after OS installed)
ldm set-variable auto-boot\?=true ldom$i
ldm set-variable boot-device=/virtual-devices@100/channel-devices@200/\disk@1 ldom$i
done
Finally, once everything is done successfully...
$ ldm list-domain
NAME STATE FLAGS CONS VCPU MEMORY UTIL UPTIME
primary active -n-cv- SP 8 7G 1.6% 26d 7h 22m
ldom0 active -n---- 5000 30 30G 0.1% 26d 5h 28m
ldom1 active -n---- 5001 30 30G 0.5% 26d 5h 28m
ldom2 active -n---- 5002 30 30G 0.4% 26d 5h 28m
ldom3 active -n---- 5003 30 30G 0.4% 26d 5h 28m
After all LDOMs can be accessed by SSH, configure each LDOM as Hadoop task trackers. In other words, simply copy Hadoop package to each LDOM and set conf/master and conf/slave files properly. Below is my example LDOM setup. See complete isolations among LDOMs.
Solaris Zones (or Containers)
Zone administration is much simpler.
Create Zone configuration files:
- create zone configuration scripts ( one zone with 4 disk partitions)
zone0_4disk.cfg:
create
set zonepath=/zone/0/root # path to install OS root
set autoboot=true
add net
set physical=nxge1
set address=192.168.1.1/24 # assign pseudo-IP
end
add fs # mount disk partition
set dir=/export/home1
set special=/dev/dsk/c0t0d0s7
set raw=/dev/rdsk/c0t0d0s7
set type=ufs
end
add fs # mount disk partition
set dir=/export/home2; mount disk partition
set special=/dev/dsk/c0t1d0s7
set raw=/dev/rdsk/c0t1d0s7
set type=ufs
end
add fs # mount disk partition
set dir=/export/home3
set special=/dev/dsk/c0t2d0s7
set raw=/dev/rdsk/c0t2d0s7
set type=ufs
end
add fs # mount disk partition
set dir=/export/home4
set special=/dev/dsk/c0t3d0s7
set raw=/dev/rdsk/c0t3d0s7
set type=ufs
end
verify
commit
info
exit
- create zone1_4disk.cfg, zone2_4disk.cfg and zone3_4disk.cfg similarly.
# setup 4 zones
zonecfg -z zone0 -f zone0_4disk.cfg
zonecfg -z zone1 -f zone1_4disk.cfg
zonecfg -z zone2 -f zone2_4disk.cfg
zonecfg -z zone3 -f zone3_4disk.cfg
Install and boot zones:
# list all zones
zoneadm list -vi # should show that 4 zones are ready to boot.
# install zones
zoneadm -z zone0 install
zoneadm -z zone1 install
zoneadm -z zone2 install
zoneadm -z zone3 install
# boot zones
zoneadm -z zone0 boot
zoneadm -z zone1 boot
zoneadm -z zone2 boot
zoneadm -z zone3 boot
Post-install OS configuration:
# pseudo-OS installation (similar to real OS installation)
zlogin -C zoneX
After all ZONES can be accessed by SSH, configure each ZONE as
Hadoop task trackers. Below is
my example ZONE setup. See the global operating system shared by all Zones.

In my next blog entires, I will present my analytic performance model for Hadoop and related tuning tips.