When you are optimizing benchmarks, the typical process involves running the same benchmark
N times and picking an arbitrary one of these N runs as the representative run. Another option is to average the N runs (creating a new run N') and use that as the representative run. In
Fenxi, we have discussed automatically averaging a bunch of runs. Performance data can be of two types:
- Numerical data (throughput, response time, etc.)
- Textual data (OS patch level, syslog messages, etc.)
Averaging numerical data is very easy. Averaging textual data is neither possible nor desirable. However, since we are creating a new run
N', we need to select textual data to be part of this new run. Which run do we pick it from? We are trying to solve this in the
Fenxi project. If you have any thoughts or suggestions regarding this, please feel free to contact us.
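To make the question concrete, here is a minimal sketch of building N' in Python. It is not Fenxi code: the run layout (a dict with "numeric" and "text" parts) and the text_source parameter are assumptions made for illustration. Numerical metrics are averaged, and the textual data is simply copied from one chosen run, which is exactly the policy decision that is still open.

```python
from statistics import mean


def average_runs(runs, text_source=0):
    """Build a synthetic run N' from a list of runs.

    Each run is a dict with 'numeric' (metric name -> float) and
    'text' (key -> str). Numeric metrics are averaged across runs.
    Textual data cannot be averaged, so it is copied from the run at
    index text_source (an assumed policy, not something Fenxi defines).
    """
    if not runs:
        raise ValueError("need at least one run")
    metrics = set(runs[0]["numeric"])
    for run in runs:
        if set(run["numeric"]) != metrics:
            raise ValueError("runs do not share the same numeric metrics")
    averaged = {m: mean(r["numeric"][m] for r in runs) for m in sorted(metrics)}
    return {"numeric": averaged, "text": dict(runs[text_source]["text"])}


# Three made-up runs of the same benchmark.
runs = [
    {"numeric": {"throughput": 100.0, "response_time": 2.0}, "text": {"os_patch": "A"}},
    {"numeric": {"throughput": 110.0, "response_time": 1.0}, "text": {"os_patch": "A"}},
    {"numeric": {"throughput": 120.0, "response_time": 3.0}, "text": {"os_patch": "B"}},
]
n_prime = average_runs(runs)
print(n_prime)
```

Note that the third run has a different patch level, and the averaged run silently hides that. Any real policy would probably need to flag such differences instead of just picking a run.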
We just open sourced a nice performance analysis tool called Fenxi. Fenxi is a pluggable,
Java-based post-processing and performance analysis tool that parses and
loads the data from a variety of tools into a database, and then allows
you to query and compare different sets of performance data. Fenxi can
also be used to graph data from performance tools. Fenxi (Mandarin for
"analyze") is the successor to the Sun-internal tool called Xanadu. It is integrated with the
Faban benchmark harness.
If you have ever worked with performance data, you will soon
realize that:
- Performance Data can get huge.
- Consider a benchmark running for 30 minutes on a 64-core system with hundreds of disks
attached and multiple network interfaces. If you collect
mpstat at 10-second intervals for the whole run, you get 180 samples of 64 CPU lines each, which is
more than 11,000 lines of data! (That is 400 Ctrl-F's if you are using vi
in a regular-sized terminal.) If you collect data from more tools like
vmstat, iostat, trapstat, busstat, cpustat, etc., you will end up with
much more! Going through each of them line by line is not a scalable
approach.
- Performance Data is interrelated.
- The tool outputs are just different views of the system
behavior. We want to look at the system as a whole, rather than at its
individual views. If your incoming network packets peak, the
interrupts in your mpstat output most likely peak too. We may want to see if
throughput was impacted as a result of a burst of writes to our disks,
etc.
- Some performance data makes sense visually.
- For large data, a visual view gives a quick summary of the data.
As Tim Cook puts it, "the human
brain is a powerful pattern-recognition machine - graphs allow you to
spot things you would never see in numbers (like waves of CPU migrations
moving across different cores)". Look at the bottom of the blog for
more details.
- Performance Data should be queryable.
- We want to be able to query, or ask questions of, the performance
data. For example, you might want to know "What are my hot disks?".
Traditionally, people have answered such questions by writing
custom scripts using sed/awk/perl. This can get tedious very fast. We
need a better way of asking questions. In Fenxi, we store the data in
the database, and questions are formulated in SQL.
- Performance Data should be comparable, averageable, etc.
- Since I work in the performance group at Sun, we run a lot of
benchmarks. Since the goal of [most] benchmarks is to maximize the
performance of a system, we are constantly trying out new changes
to the system. Typically, we change a parameter, repeat the benchmark,
and see if performance has improved.
- Performance Data should be sharable.
- We rarely work in isolation. We should be able to share data with
our peers and collaborate on finding performance fixes.
Fenxi tries to solve all of the
above problems.
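As an illustration of the "queryable" point above, here is a small sketch of what asking "What are my hot disks?" in SQL could look like. It uses Python's built-in sqlite3 module as a stand-in database. The iostat table, its columns, and the 50% busy threshold are all made up for this example and are not Fenxi's actual schema.

```python
import sqlite3

# In-memory database standing in for the real performance database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iostat (run_id TEXT, sample INTEGER, device TEXT, pct_busy REAL)"
)
# Hypothetical iostat samples: two intervals, three disks.
rows = [
    ("run1", 0, "sd0", 12.0), ("run1", 0, "sd1", 95.0), ("run1", 0, "sd2", 40.0),
    ("run1", 1, "sd0", 10.0), ("run1", 1, "sd1", 91.0), ("run1", 1, "sd2", 70.0),
]
conn.executemany("INSERT INTO iostat VALUES (?, ?, ?, ?)", rows)


def hot_disks(conn, run_id, threshold=50.0):
    """Return (device, average %busy) for disks at or above the threshold."""
    query = """
        SELECT device, AVG(pct_busy) AS avg_busy
        FROM iostat
        WHERE run_id = ?
        GROUP BY device
        HAVING AVG(pct_busy) >= ?
        ORDER BY avg_busy DESC
    """
    return conn.execute(query, (run_id, threshold)).fetchall()


result = hot_disks(conn, "run1")
print(result)
```

The same question answered with sed/awk would need a custom parser for every tool's output format. Once the data is in tables, changing the question is just a matter of changing the query.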
Sample Graph
Sample Text
You can see a sample database run processed by Fenxi. I urge you to check it out!
My last entry provided some recommendations regarding the use of ZFS
with databases. Time now to share some updated numbers.
Before we go to the numbers, it is
important to note that these results are for the OLTP/Net workload,
which may or may not represent your
workload. These results are also specific to our system configuration,
and may not hold for all system configurations. Please test your own
workload before drawing any conclusions. That said, OLTP/Net is based
on well-known standard benchmarks, and we use it quite extensively to
study performance on our rigs.
| Filesystem   | FS Checksum | Database Checksum [1] | Normalized Throughput [2] |
| UFS Directio | N/A         | No                    | 1.12                      |
| UFS Directio | N/A         | Yes                   | 1.00                      |
| ZFS          | Yes         | No                    | 0.94                      |
[1] Both block checksumming as well as block checking
[2] Bigger is better
Databases usually checksum their blocks to maintain data integrity.
Oracle, for example, uses a per-block checksum, and checksum
checking is on by default. This is typically recommended because most
filesystems do not have a checksumming feature. With ZFS, checksums are
enabled by default. Since
databases are not tightly integrated with the filesystem/volume
manager, a checksum error is normally handled by the database. Since ZFS
includes volume manager functionality, a checksum error is
handled transparently by ZFS (i.e., if you have some kind of redundancy
like mirroring or raidz), and the situation is corrected instead of
a read error being returned to the database. Moreover, ZFS will repair
corrupted blocks via self-healing. While RAS
experts will note that an end-to-end
checksum at the database level is slightly better than an end-to-end
checksum at
the ZFS level, ZFS checksums give you unique advantages while providing
almost the same level of RAS.
If you do not penalize ZFS with
double checksums, you can see that we
are within 6% of our best UFS number. So 6% buys you
provable data integrity, unlimited snapshots, no fsck, and all the
other good features. Quite good in my book.
Of course, this number is
only going to get better as more performance enhancements make it into
the ZFS code.
More about the workload.
The tests were done with OLTP/Net on a 72-CPU Sun Fire E25K connected to 288
15k rpm spindles. We ran the test with around 50% idle time to simulate
real customers. The test was done on Solaris Nevada build 46. Watch
this space for numbers with the latest build of Nevada.

Wednesday Aug 02, 2006
Solaris Internals

Monday Jul 31, 2006
Real-World Performance
Performance for the real-world, where it matters the most.
A major portion of my job (@ PAE) is spent trying to optimize
Solaris for
real customer workloads. We tend to focus on databases, but work with
other applications too. We have tons (both weight-wise and dollar-wise)
of equipment
in our labs, where we try to replicate a real enterprise data center.
Of course, the term "real customer workload" is a loaded term. Since
most big customers are rarely willing to share their workloads, we
have to simulate them or write something close to them in house. Trying to
rewrite every customer's workload is not a scalable approach. Hence
we have developed a workload called OLTP/Net that can be retrofitted
to fit most customer workloads. Using several tuning knobs, we can control
the amount of reads, writes, network packets per transaction, connects,
disconnects, etc. Think of it as a super workload! We have used it
quite effectively to simulate several customer workloads.
There is a big difference between trying to get the best numbers for a benchmark and
replicating a customer's setup. PAE has traditionally focused on getting
the most out of the system. Our machines typically run at 100% utilization,
run the latest and greatest Solaris builds, and have a lot of tunings applied to
the system. We believe fully in Cary Millsap's statement:
"Each CPU cycle that passes by unused is a cycle that
you will never have a chance to use again; it is
wasted forever. Time marches irrevocably onward."
(Performance Management: Myths & Fact, Cary V. Millsap,
Oracle Corp, June 28, 1999)
However, many customers run their machines at less than 100%
utilization to leave enough headroom for growth. When machines are not
running at 100% utilization, things like idle loop performance matter
a lot. If you have followed Solaris
releases closely, you will have seen several
enhancements to the idle loop that increase the efficiency of
lightly loaded systems by quite a bit. Similarly, we have seen
quite a few UFS + Database performance enhancements over the past few
releases of Solaris.
So while benchmark numbers do matter, real performance also matters, and we
are working on it!

Monday Dec 12, 2005
Six OS's on a disk? Wait I can do seven!!
Update:
In my previous blog I showed how to install 6 OS's on a disk. Well, actually you can have seven (7). Disk partitions are numbered from 0 to 7. Ignoring slice 2, that leaves us with 7 free slices on which to install an OS. I have yet to log on to a machine with 7 OS's on a disk, though!!
Richard Elling pointed out that you could also
use slice 2 (the loopback/backup/overlap slice). So that's 8. He also mentions that some SCSI devices support 16 slices, so you could fit quite a few more OS installations!
Maybe we should have a competition for how many OS's you have installed on a single disk.
My personal best is 6.

Friday Dec 02, 2005
Six OS's in one disk? Yes it is possible
Six (6) OS's in one disk
Do you want to install 6 OS's on a single disk? If so, read on.
The goal is to have
6 bootable OS's on a single disk. Why should one do it? For better
sharing, more reliability, easier comparisons between OS versions,
quicker recovery, and so on. BTW, I have only tried this on sparc.
Although I am sure that people have been doing this for ages, I first
heard it from Charles Suresh,
who encouraged me to go ahead and give it a try.
Create Partitions
Disk partitions are usually numbered 0 through 7, with 2 being the overlap slice.
For our experiment, we set 1 to be the swap. We sized the other
partitions equally, with 0 being a little smaller than the others and 7 taking the remainder. On my
36G disk, the partition table looks like the following:
Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm    2178 -  5655        4.79GB    (3478/0/0)  10047942
  1       swap    wu       0 -  2177        3.00GB    (2178/0/0)   6292242
  2     backup    wm       0 - 24619       33.92GB    (24620/0/0) 71127180
  3       root    wm    5656 -  9285        5.00GB    (3630/0/0)  10487070
  4       root    wm    9286 - 12915        5.00GB    (3630/0/0)  10487070
  5       root    wm   12916 - 16545        5.00GB    (3630/0/0)  10487070
  6       root    wm   16546 - 20175        5.00GB    (3630/0/0)  10487070
  7       root    wm   20176 - 24619        6.12GB    (4444/0/0)  12838716
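When carving up a disk by hand like this, it is easy to leave a gap or overlap two slices. The following Python sketch is one way to sanity check a layout before installing anything. The cylinder ranges and block counts are copied from the partition table above. The check_layout helper is purely illustrative and not part of any Solaris tool.

```python
# (first cylinder, last cylinder, blocks) per slice, copied from the table.
slices = {
    0: (2178, 5655, 10047942),
    1: (0, 2177, 6292242),
    3: (5656, 9285, 10487070),
    4: (9286, 12915, 10487070),
    5: (12916, 16545, 10487070),
    6: (16546, 20175, 10487070),
    7: (20176, 24619, 12838716),
}
# Slice 2 (backup/overlap) spans the whole disk.
backup = (0, 24619, 71127180)


def check_layout(slices, backup):
    """Verify the slices tile the disk with no gaps or overlaps.

    Returns the number of blocks per cylinder derived from the backup slice.
    """
    blocks_per_cyl = backup[2] // (backup[1] - backup[0] + 1)
    expected_start = backup[0]
    # Walk the slices in on-disk order (sorted by first cylinder).
    for num, (first, last, blocks) in sorted(slices.items(), key=lambda kv: kv[1][0]):
        if first != expected_start:
            raise ValueError(f"slice {num} leaves a gap or overlaps at cylinder {first}")
        if (last - first + 1) * blocks_per_cyl != blocks:
            raise ValueError(f"slice {num} block count does not match its cylinders")
        expected_start = last + 1
    if expected_start != backup[1] + 1:
        raise ValueError("slices do not cover the whole disk")
    return blocks_per_cyl


print(check_layout(slices, backup))
```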
Install The OS
Install Solaris from any source. I typically download the images from
nana.eng and use my jumpstart server. You can also install from CD,
DVD, etc. Once you install on a slice, you can dd(1) it to other slices and
fix /etc/vfstab. This is
the fastest way of installing multiple Solaris instances on a disk. If
you want another version, or a different build, bfu is your friend. You
can also save off these slices to some /net/... place and restore an
OS at will (again using dd
both ways, since you need to preserve the boot blocks). If you slice
multiple machines this way, you can even copy slices across machines
(assuming same architecture, etc.); more scripts are needed to change
/etc/hosts, hostname, net/*/hosts, etc.
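The "fix /etc/vfstab" step is needed because the copied slice still names the original slice as its root device. Here is a rough Python sketch of that fix. It relies on the standard vfstab layout (device to mount, device to fsck, mount point, FS type, fsck pass, mount at boot, mount options) and rewrites only the root entry. The function name and the sample file contents are made up for illustration, and any other entries that refer to the old layout would still need attention by hand.

```python
def retarget_root(vfstab_text, new_slice):
    """Point the root (/) entry of a vfstab at new_slice, e.g. 'c1t1d0s3'.

    Comment lines and all other entries (such as swap) are left untouched.
    """
    lines = []
    for line in vfstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 3 and not fields[0].startswith("#")
        if is_entry and fields[2] == "/":
            fields[0] = "/dev/dsk/" + new_slice   # block device to mount
            fields[1] = "/dev/rdsk/" + new_slice  # raw device to fsck
            line = "\t".join(fields)
        lines.append(line)
    return "\n".join(lines) + "\n"


# A made-up vfstab as it would look right after dd'ing slice 0 onto slice 3.
sample = (
    "#device\tdevice\tmount\tFS\tfsck\tmount\tmount\n"
    "/dev/dsk/c1t1d0s1\t-\t-\tswap\t-\tno\t-\n"
    "/dev/dsk/c1t1d0s0\t/dev/rdsk/c1t1d0s0\t/\tufs\t1\tno\t-\n"
)
fixed = retarget_root(sample, "c1t1d0s3")
print(fixed)
```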
Install via Jumpstart: Setup Profile
If you like things automated, you could perform a hands-off install via
custom jumpstart. The first step is to set up the profile for your
server. Since you want to preserve the
existing partitions, you have to use the
preserve keyword. The
profile for my machine looks like the following:
$ cat zeeroh_class
install_type    initial_install
system_type     server
partitioning    explicit
dontuse         c1t0d0
filesys         c1t1d0s0 existing /
filesys         c1t1d0s1 existing swap
filesys         c1t1d0s3 existing /s3 preserve
filesys         c1t1d0s4 existing /s4 preserve
filesys         c1t1d0s5 existing /s5 preserve
filesys         c1t1d0s6 existing /s6 preserve
filesys         c1t1d0s7 existing /s7 preserve
cluster         SUNWCall
To install an OS on another slice, just change the root slice (
c1t1d0s0 in the profile above).
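If you install on several slices, editing the profile by hand each time gets repetitive. The sketch below generates a profile for a chosen root slice in Python. It is only an illustration: the make_profile helper is hypothetical, and it assumes that every other OS slice should be mounted under /sN and preserved, which mirrors the profile shown earlier but may not be what you want.

```python
def make_profile(disk, root_slice, swap_slice=1, skip_disk="c1t0d0"):
    """Return custom jumpstart profile text that installs onto root_slice.

    Assumption: all other slices (except swap) are mounted at /sN and
    preserved, following the layout used in this post.
    """
    if root_slice in (2, swap_slice) or not 0 <= root_slice <= 7:
        raise ValueError("root slice must be one of the OS slices")
    lines = [
        "install_type initial_install",
        "system_type server",
        "partitioning explicit",
        f"dontuse {skip_disk}",
    ]
    for n in (0, 1, 3, 4, 5, 6, 7):  # slice 2 is the overlap slice
        dev = f"{disk}s{n}"
        if n == root_slice:
            lines.append(f"filesys {dev} existing /")
        elif n == swap_slice:
            lines.append(f"filesys {dev} existing swap")
        else:
            lines.append(f"filesys {dev} existing /s{n} preserve")
    lines.append("cluster SUNWCall")
    return "\n".join(lines) + "\n"


# Profile that installs onto slice 4 instead of slice 0.
print(make_profile("c1t1d0", 4))
```

Calling make_profile("c1t1d0", 0) gives the same filesys entries as the profile above.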
Make sure that the directory where the profiles are stored is shared
read-only.
Also ensure that you have a sysidcfg
file set up correctly.
[neel@slc-olympics] config > cat sysidcfg
name_service=NIS {domain_name=xxx.yyy.sun.com}
root_password=XXXXXXXXXX
security_policy=NONE
system_locale=en_US
terminal=vt100
timezone=US/Pacific
timeserver=localhost
network_interface=PRIMARY{protocol_ipv6=no}
Run the check script.
Note that these profiles can be stored on any server. That machine does
not need to have anything special installed. You only need to make sure
that the location of the profile and the other custom jumpstart scripts
is shared via NFS in "read-only" mode.
Jumpstart
On the jumpstart server (abc.yyy in my case),
we added our machine to the list of clients as follows:
./add_install_client -i bbb.aaa.xxx.xxx -e a:b:c:d:e:f -c slc-olympics:/export/config -p slc-olympics:/export/config zorrah sun4u
Now reboot your machine as follows:
$ reboot -- net - install
Booting via multiple disks/partitions
- Find the path (ls -l /dev/rdsk/..)
- At the ok prompt, type show-disks and select disk
- Type nvalias diskX # this pastes the selected path
- init 0
- boot diskX

Tuesday Nov 15, 2005
Introduction
I guess an introduction is necessary!
I am Neelakanth Nadgir and I am part of the PA2E
(Performance Architecture, and Availability Organization)
group. I work out of Menlo Park, CA. My professional
interests include scalability, networking, filesystems,
distributed systems, etc.
Before joining PA2E, I worked at Sun's Market
Development Engineering, where I spent 4 years working on
performance tuning, porting, sizing, and ISV account
management.
I have also been involved with several open source projects. I am an active member of the JXTA community and jointly started two projects, viz. the
Ezel Project and the
JNGI Project. I have
also served as webmaster for the
GNU project
for 2 years. I also contributed to the
Mozilla project in the past by providing sparc binaries and misc performance fixes.
Before working at Sun, I graduated with a master's in
Computer Science from
Texas Tech University at Lubbock, TX (GO Raiders!). My thesis was on the reliability
of distributed systems, where I devised a faster algorithm for
calculating minimal file spanning trees. I have a Bachelor's degree in Computer Science from
Karnatak University, India.
My other interests include
cricket and tropical aquarium fish (African cichlids in
particular).
My favorite fish is known as
Pseudotropheus demasoni. My wife got me hooked on the
aquarium hobby after we got married, and before I knew it, we had more than 60 fish in 6 tanks :-)
I plan to use this blog to share the knowledge that
I have gained from working with lots of cool people here at
Sun. Stay tuned for more insights!