High Performance Productivity

     
 

DTrace solution for mission-critical applications


Please read Frederic's blog first for a detailed introduction and for how the solution to this tricky problem was identified with DTrace.

In summary, a telephone application of the service and hosting provider Portrix had a problem: after the application had been running for a couple of days, the system time went up and eventually became so high that the voice quality suffered. Restarting the application was not a solution; the machine had to be rebooted.

Our first suggestion, and the traditional approach, was to use the Sun profiler. We wanted to get three profiles: the first at the beginning, the second after some normal load, and the last one when the system time is high. All three profiles can then be loaded into the Sun Analyzer for an easy comparison.
This would have given us two things: first, a timeline showing where the unexpectedly high system time is spent, and second, a performance profile that can be used to tune the application.

The latest version of the Sun profiler has an option to attach to a running process and record a profile:

%collect -P <pid> -t 180

This uses dbx to attach, collects data for 180 seconds, and then detaches and finishes the experiment.

If you are using an older version of the profiler, you can do this through the debugger directly:

dbx
  attach PID
  collector enable
  cont
  (let it run for as long as you want, then stop it with Control-C)
  ^C

Both of these methods use the debugger to attach to a running process. The debugger has to stop the process to do this (which is why you see the cont after the attach command). Furthermore, there is another risk: the debugger and profiler may catch or delay signals that are important to the application. For a mission-critical application serving real users, this is completely unacceptable. At best, the application can handle the interruption and users will experience a small delay; at worst, the telephone conversation is cut.

How can you possibly debug or profile an application without even being able to use a debugger? Well, Solaris has a solution: more than 30,000 probes come with the kernel and are waiting for you to listen to them. The D scripting language is used to enable these probes and to produce formatted, readable output.

The most general way to query a running system is with the following dtrace command lines. The first samples, every 10 ms, which function is currently executing and counts the hits per function, while the second does the same for entire call stacks.

#dtrace -n 'profile-10ms{@a[func(arg0)]=count()} END{trunc(@a,100)}'
#dtrace -n 'profile-10ms{@a[stack()]=count()} END{trunc(@a,100)}' 

Start either of these commands, let it run for about one minute, and stop it with Control-C. Firing up the first one, we see the following output:

 unix`mutex_vector_enter                                         629
 unix`default_lock_delay                                         770
 SUNW,UltraSPARC-T1`cpu_mutex_delay                              889
 unix`cas_delay                                                 1214
 zaptel`process_timers                                          6242
 unix`cpu_halt                                                 40277 

The second one looks into the call stacks, and at the end of its output we see:

zaptel`__zt_process_getaudio_chunk+0xfbc
zaptel`__zt_transmit_chunk+0x70
zaptel`zt_receive+0xd7c
ztdummy`ztdummy_timer+0x30
genunix`cyclic_softint+0xa4
unix`cbe_level1+0x8
unix`intr_thread+0x168
60

SUNW,UltraSPARC-T1`bcopy+0x68c
zaptel`__zt_getbuf_chunk+0x20c
zaptel`__zt_transmit_chunk+0x28
zaptel`zt_receive+0xd7c
ztdummy`ztdummy_timer+0x30
genunix`cyclic_softint+0xa4
unix`cbe_level1+0x8
unix`intr_thread+0x168
96

From this listing you can already see that the ztdummy driver is doing strange things and that it is the culprit behind the high system time. This also explains why restarting the application did not help: the driver was not restarted.
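To put numbers on the first listing, the aggregation output can be parsed and converted into shares. The following is only a minimal sketch in Python: the helper names `parse_aggregation` and `busy_shares` are made up for this illustration, and it assumes that unix`cpu_halt represents idle CPUs and can be left out of the total. Only the functions visible in the listing above are counted.

```python
import re

# Tail of the aggregation output from the first dtrace command above.
SAMPLE = """\
 unix`mutex_vector_enter                                         629
 unix`default_lock_delay                                         770
 SUNW,UltraSPARC-T1`cpu_mutex_delay                              889
 unix`cas_delay                                                 1214
 zaptel`process_timers                                          6242
 unix`cpu_halt                                                 40277
"""


def parse_aggregation(text):
    """Return {function_name: sample_count} from 'name   count' lines."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\S.*?)\s+(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def busy_shares(counts, idle=("unix`cpu_halt",)):
    """Return each function's share of the non-idle samples.

    Assumption: the names in 'idle' mark idle CPUs and are excluded.
    """
    busy = {name: n for name, n in counts.items() if name not in idle}
    total = sum(busy.values())
    return {name: n / total for name, n in busy.items()}


if __name__ == "__main__":
    shares = busy_shares(parse_aggregation(SAMPLE))
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{share:6.1%}  {name}")
```

Among the listed non-idle functions, zaptel`process_timers accounts for roughly 64% of the samples, which fits the picture given by the stacks.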

The problem with the dtrace command lines above is that they sample the entire system, in both user and kernel space, with no restriction to a PID, program, user, or CPU. The profile probe fires with either arg0 or arg1 set: arg0 is set when you are in the kernel, arg1 when you are in user code. Typically you want to narrow the sampling down to a PID or at least an execname (pid == $1, execname == "myprog", etc.). You can do this by adding a predicate such as /cpu==6/ before the action block, and by passing arg0 to func() for kernel functions or arg1 to ufunc() for user functions.

dtrace -n 'profile-10ms/cpu==6/{@a[func(arg0)]=count()} END{trunc(@a,100)}'
dtrace -n 'profile-10ms/pid==1234/{@a[func(arg0)]=count()} END{trunc(@a,100)}'

Or you can use this script (kindly provided by Jim Fiori), which samples a single PID only.

# cat uprof.d

#!/usr/sbin/dtrace -qs

BEGIN
{
        interval = $2 - 1;
}

profile-997
/arg1 && pid == $1/
{
        @s1[ustack(5)] = count();
        @s2[ufunc(arg1)] = count();
        s1total++;
}

tick-1sec
/--interval <= 0/
{
        printf("\nHottest stacks...\n");
        trunc(@s1, 30);
        printa(@s1);
        printf("\nHottest functions...\n");
        trunc(@s2, 30);
        printa(@s2);
        printf("\nTOTAL samples %d\n", s1total);
        exit(0);
}

# chmod 755 uprof.d
# ./uprof.d <pid> <#seconds>

There is another tool in the Sun Studio war chest, project D-Light, that allows you to attach to a running process and observe whatever attributes you want. Under the covers it uses DTrace, so anything you can collect with DTrace can be shown in D-Light. Since D-Light is graphical (think of the timeline view that the Analyzer has), the best DTrace data to display has time on the X-axis and the data you want to observe on the Y-axis. Here is a whitepaper that gives an overview of this.


Caution with Lustre 1.6.7, and a new configuration whitepaper


Lustre 1.6.7 had some serious bugs in the MDT server and was withdrawn from the download list. These bugs have been fixed, and a new version, 1.6.7.1, has been placed in the download area. If you downloaded or used Lustre 1.6.7, please upgrade. Personally, I had a problem mounting the Lustre file system (for both OSS and MDT). After creating the volumes and installing the patches, here on a RedHat CentOS 5.2 machine, I could run
# mkfs.lustre --fsname lustre --mdt --mgs /dev/vg00/mdt

but then I could not mount it
# mount -t lustre /dev/vg00/mdt /mdt
mount.lustre: mount /dev/vg00/mdt1 at /mdt failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
Note 'alias lustre llite' should be removed from modprobe.conf


and /proc/filesystems lists only ldiskfs, not lustre.
With Lustre 1.6.6 everything works fine.

Furthermore, there is an excellent whitepaper explaining the configuration and benchmarks of two different hardware setups. The first uses a Sun Fire X4250 OSS server connected to a Sun Storage J4200 array with twelve 300 GB SAS drives, while the second uses one Sun Fire X4540 server (THOR) with 48 internal 1 TB 7200 rpm SATA drives. The first setup uses its disks in RAID 0, the second in RAID 6. All configuration descriptions include an HA (high availability) version. Please download this excellent paper from here.


Lustre quick start guide is available


As promised in my last blog, Torben Kling's whitepaper on a step-by-step Lustre setup is available now! Please get it from here. The whitepaper explains everything from installing Linux, creating the virtual volumes, and downloading the Lustre packages to setting up a metadata server, the object storage servers, and the clients, and it closes with some examples of managing the file system. Congratulations to Torben for this!

Lustre Parallel File System for CFD analysis Part 2


As already said, Lustre stores files, or blocks of files, which it treats as objects, on one or more OSTs. In Lustre terminology this is called striping. You will need striping:

- if your file is too large to be stored on a single OST.
- if the required aggregate bandwidth for a single file cannot be offered by a single OST.
- if a client, i.e., your program running on the cluster, needs more bandwidth than a single OSS can offer.

Lustre allows you to configure the number of stripes, the stripe size, and the servers (OSS) to use for every file, directory, or directory tree.
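The criteria listed above translate into a simple lower bound on the stripe count. The following Python sketch only illustrates that arithmetic: the function name `min_stripe_count` and all the numbers are hypothetical, and it assumes that capacity and bandwidth scale linearly with the number of OSTs.

```python
import math


def min_stripe_count(file_size, ost_free_space, required_bw, ost_bw):
    """Smallest stripe count that satisfies both size and bandwidth.

    Assumption: every OST offers the same free space and bandwidth,
    and both add up linearly across OSTs. All arguments are in bytes
    or bytes per second.
    """
    by_capacity = math.ceil(file_size / ost_free_space)
    by_bandwidth = math.ceil(required_bw / ost_bw)
    return max(1, by_capacity, by_bandwidth)


if __name__ == "__main__":
    # Hypothetical case: a 3 TB file, 1 TB free per OST,
    # 900 MB/s wanted for the file, 250 MB/s delivered per OST.
    print(min_stripe_count(3 * 10**12, 10**12, 900 * 10**6, 250 * 10**6))
```

In this made-up case capacity alone would ask for three stripes, but the bandwidth target pushes the count to four.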

Practically, the smallest recommended stripe size is 512 KB because Lustre tries to batch I/O into 512 KB chunks over the network. This is a good amount of data to transfer at one time.
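To see how the stripe size and stripe count decide where a byte of a file ends up, here is a minimal Python sketch. It assumes plain round-robin (RAID 0 style) placement across the stripe objects; the function name `locate` is made up for this illustration, and the 512 KB stripe size with four stripes is just an example setting.

```python
STRIPE_SIZE = 512 * 1024   # example stripe size in bytes (512 KB)
STRIPE_COUNT = 4           # example number of stripe objects (OSTs)


def locate(offset, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Map a file offset to (object index, offset inside that object).

    Assumption: stripes are dealt out round-robin, stripe 0 to object 0,
    stripe 1 to object 1, and so on, wrapping around after stripe_count.
    """
    stripe_number = offset // stripe_size
    object_index = stripe_number % stripe_count
    object_offset = (stripe_number // stripe_count) * stripe_size \
        + offset % stripe_size
    return object_index, object_offset


if __name__ == "__main__":
    for off in (0, STRIPE_SIZE, 4 * STRIPE_SIZE, 5 * STRIPE_SIZE + 10):
        print(off, locate(off))
```

With these settings a sequential read of 2 MB touches all four objects once, which is where the aggregate bandwidth comes from; with a stripe count of 1 every offset maps to the same object.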

You may run into the problem that the file you are using is treated as a single object. In this case, the file (or object) is stored on a single OST, i.e., a single disk, and you do not see any performance improvement.
 
 
 
 

Lustre Parallel File System for CFD analysis


Whether you are looking at crash simulations, implicit-explicit computations, or CFD analysis, all of which compute numerical solutions for very different physical models, they have one thing in common: the data sets keep getting bigger and bigger. This is true for the input data, the temporary scratch files, and the final output data. Traditionally, I/O times have been considered small compared to the run times of the solver. This may no longer be true today. Not all ISV codes offer a parallel I/O option, and where they do, it is not always easy to use.

Look, for example, at the StarCD input files for StarCD V3, V4, and finally StarCCM+: the same geometry, a 34M-cell case, uses 3.5 GB for the StarCD V3 .geom file and climbs to 5.1 GB when converted into the V4 .ccmg file (or the CCM+ .sim file).
 
 
 
 

Blogging is like HPC benchmarking


HPC benchmarks are characterized by the urgency to run them, and by the common thought that everything is doable in a single day.

Apparently, this is very similar to setting up a blog. What was thought to be a five-minute exercise took the whole day (minus lunch time, as I live in France and lunch time is a French must).

It took the whole morning to prepare my iMac to run a VNC server. As I am running Mac OS X 10.3.9, many applications do not work because they need 10.4. Finally I could install the "Vine Server" from Redstone. Then I had to configure IP forwarding on my Linksys router. After this, we entered the already mentioned lunch break.

This can be compared to finding the desired benchmark hardware and installing the desired operating system and the correct software versions. By the way, OpenSolaris is out, and running the most stunning desktop I have ever seen.

During the afternoon it was back to business, with my boss running a VNC client on my desktop ... no more secrets from now on. The next pain we had to go through was reconfiguring my Sun accounts, as I already had an older private subscription. Finally, I managed to change the settings in my account and could use the default Sun account for login. For quite a while, I felt like a dog chasing its own tail ...

Again, this is very similar to benchmarking, when you try to enter the NFS-mounted directory and you see that the NFS server is down, or that the FlexLM license is not working. By the way, the latest FlexLM problem was due to the geographical location of the physical server ... I still wonder how they can figure this out, but they can ...

Now we had to create three additional accounts, on Technorati, Google, and FeedBurner ... and finally, yes, it was a five-minute exercise.

The parallel in HPC benchmarking: you can now take your scripts and run the data sets for 10 minutes of run time ... and your day is over.

Here my day was over as well, but as I had spent it virtually with my real boss, he did not ask any questions about what I had done during the day.





 
 
 
 
© Gunter Roeth