Friday, September 02, 2005 | Paul Rogers' Weblog Notes during my pilgrimage
Katrina's impact on our customers
My job at Sun is to assist our customers with the architecture, design and implementation of complex systems. Naturally I am reluctant to mention customer names lest there be legal or public relations issues. However, I spent a large part of last year preparing a customer in New Orleans for Business Continuity and Disaster Recovery, and now those plans have been executed. I will be oblique about the precise details, but two large systems integrators are assisting one of our government agencies with a large Enterprise Resource Planning application that involves a web front end, application servers in the middle tier and a very large database on the back end. The task was so large that I worked extensively with one of our partners, Mr. Chip Elmblad of Sub2 Technology Consulting. Feeding this system are many computers and users from around the world. There are also many systems associated with the core of the ERP system that perform functions like reporting and ad hoc queries. Hopefully this diagram will help you get the picture.
Yesterday's email brought a link to Nicholas Carr's provocative article in the MIT Sloan Management Review entitled "The End of Corporate Computing." The summary of the article begins with this quote: "Information technology is undergoing an inexorable shift from being an asset that companies own — in the form of computers, software and myriad related components — to being a service that they purchase from utility providers. Three technological advances are enabling this change: virtualization, grid computing and Web services." It concludes with this paradigm-shattering, future-as-a-tsunami-coming-at-you assessment: "IT's shift from an in-house capital asset to a centralized utility service will overturn strategic and operating assumptions, alter industrial economics, upset markets and pose daunting challenges to every user and vendor. The history of the commercial application of IT has been characterized by astounding leaps, but nothing that has come before — not even the introduction of the personal computer or the opening of the Internet — will match the upheaval that lies just over the horizon."
Customer Engineering Conference 2005
Top Ten Great Things about CEC
9. My roommate, Mike Belch, from the UK. If he ever gets it together and gets started on his external blog, I will add him to my blogroll. Excellent guy who is mad about motorcycles.
So I had some fun this weekend playing around with Solaris 10 x86 on my Toshiba 9100 laptop. Here are a few notes and observations. First, an embarrassing confession. My network got screwed up a while back. When I booted up, everything came up. An ifconfig -a and a netstat -rn command showed reasonable values, but Mozilla would barf on most sites after loading part of the data. I tried a couple of nslookup www.cnn.com and dig google.com commands to make sure name resolution was working. The name lookups worked but still Mozilla wouldn't cooperate. Friday afternoon I finally checked /etc/nsswitch.conf and it was hosed. It looked like part of my xorg.conf had overlaid it. I copied /etc/nsswitch.dns to /etc/nsswitch.conf, rebooted and voila, back in business. It is annoying to me that nslookup and dig don't complain about the bogus nsswitch file (they query the DNS servers directly and never consult it), but it's all better now.
An email from a friend today asked the question "What do you suggest to help a programmer understand Solaris Memory internals?" I thought about it and suggested Richard McDougall and Jim Mauro's book Solaris Internals. However, that book is a perfect illustration of my theory of the "half life of information." The book was released in the year 2000 and covered Solaris 7. Messrs. McDougall, Leventhal, Cantrill, Bonwick, Price, Shrock et al. have been extremely busy and have much improved Solaris from the days of priority paging in Solaris 7. In Solaris 8 and beyond the page scanning algorithm is now called Cyclical Page Cache, so the book is outdated in some respects. The term 'half life' is drawn from radioactivity and refers to "the length of time in which half the nuclei of a species of radioactive substance would decay."
The image of 'information half life' is how much of the material in the book from 2000 is still accurate. My belief is that much of the material in the book is still relevant since the early architecture of Solaris has carried through to Solaris 10 (download and play with your copy from here.) The information in the book has been updated for later versions of Solaris (8 to 10) in a set of 367 slides, dated November 2004, in an Adobe Acrobat file available here. Those of you on dialup do not want to download that file, and you are already mad at me because of the number of images on my page.
One of my first posts was a plea to sign up for the United Devices Grid to participate in cancer research in cooperation with the University of Oxford and the National Foundation for Cancer Research.
Here's part of the gang that got together at the Q center outside of Chicago last week for Sun's Immersion Week. This fine group is part of the Central US Data Center Practice that got together for a 'Birds of a Feather' meeting Thursday night. Standing on the left is Bill Pilarski, our fearless Practice Manager, and standing on the right side is Brian Ahearn, our Director. Squatting 2nd from the right is Phil Morris, our CTO. We got together to learn about Sun's new technology and strategy for the next year. As usual for this type of gathering, the classes contained important material but some presenters could have had better skills. The Solaris 10 Dtrace sessions and Zones sessions were good but I was in too few of them. Famous Sun Bloggers who I know were there include John Clingan, Glenn Brunette, John Beck, and Bart Smaalders. If I missed any other famous bloggers who attended, I apologize.
FireEngine aka Solaris 10 Network Stack
How did they get this past the lawyers??? They are actually saying that the new network stack is up to 45% faster. For a performance guy, this announcement is truly amazing. This article also discusses the coming 10 Gigabit networking. You can download the latest version of Solaris 10 x86 from here and take this screaming network stack out for a spin. Run your own speed comparisons against Linux, Windows, or whatever. (Disclaimer: your results may vary. Please do not use ftp as a networking benchmark; it sucks. Use the ttcp utility.)
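For anyone without ttcp handy, the idea behind it can be sketched in a few lines of Python: time a memory-to-memory TCP transfer so that disk I/O and protocol chatter (the things that make ftp a poor benchmark) stay out of the number. This is only a rough illustration and not ttcp itself; the function name measure_tcp_throughput and its parameters are made up for this example, and because it runs over the loopback interface it exercises the local stack only, not a real network card.

```python
import socket
import threading
import time


def measure_tcp_throughput(total_bytes=8 * 1024 * 1024,
                           chunk_size=64 * 1024,
                           host="127.0.0.1"):
    """Send total_bytes from memory over TCP and time the receiving side.

    Returns (bytes_received, seconds, megabits_per_second).
    Hypothetical helper for illustration; not part of ttcp.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))          # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}

    def receiver():
        conn, _ = server.accept()
        received = 0
        start = time.perf_counter()
        while True:
            data = conn.recv(chunk_size)
            if not data:            # sender closed the connection
                break
            received += len(data)
        result["seconds"] = time.perf_counter() - start
        result["bytes"] = received
        conn.close()

    thread = threading.Thread(target=receiver)
    thread.start()

    # The payload lives in memory, so no disk reads are being timed.
    payload = b"\0" * chunk_size
    sender = socket.create_connection((host, port))
    sent = 0
    while sent < total_bytes:
        n = min(chunk_size, total_bytes - sent)
        sender.sendall(payload[:n])
        sent += n
    sender.close()

    thread.join()
    server.close()
    mbits = result["bytes"] * 8 / 1e6 / result["seconds"]
    return result["bytes"], result["seconds"], mbits


if __name__ == "__main__":
    received, seconds, mbits = measure_tcp_throughput()
    print(f"{received} bytes in {seconds:.4f} s = {mbits:.1f} Mbit/s")
```

To compare two operating systems you would run the receiver on one machine and the sender on another, which is exactly the job ttcp already does well; use the real tool for numbers you intend to publish.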
Good News - Niagara in the public eye
Yesterday's news was depressing, but The Inquirer has this article, Sun's Niagara Falls Neatly into Multithreaded Place, discussing our 8 CPU core massively multithreaded processor code named Niagara. The diagram below attempts to illustrate the text of the article which says in part, "On a macro level, it will have eight cores, each core capable of running 4 threads in parallel, for 32 concurrently running threads." Naturally the illustration is chopped off at 4 cores, but it's for illustrative purposes only. The C's in the diagram are compute time for the thread and the M's are the memory latency of the thread. By switching between threads on a core, we hope to minimize the time spent waiting for memory to catch up.
The front page of the Wall Street Journal was depressing today with a lead article entitled Drag on High-Tech Recovery: Companies Do More with Less (Free this week only.) A few relevant quotes (read 'em and weep with me):
When I started this blog I said that I did performance and capacity planning for Sun's customers. I want to offer up a technical study or two to help others with performance issues. I entitled this Capacity vs Performance in order to highlight the difference. Often a capacity limitation manifests itself as a performance issue. In order to differentiate between performance and capacity, performance might be defined as 'How fast it is going' while capacity is 'the maximum performance of the system or an individual component.' Imagine capacity as the dump truck carrying a load and performance as a sports car racing. Even a sports car has to slow down for corners. Not to be too simple, but we need to look at each component of the system's performance: CPU, memory, network, disk and tape.
One specific example was a customer who has a directory on the internet. Their customers submit searches from multiple sites and the Service Level Agreement (SLA) was no more than 5% of requests with response times of over 3 seconds. Currently 15% of requests take more than 3 seconds, which puts our customer in a penalty situation. The system is a 6800 with 12x900MHz CPUs. Unfortunately someone attempted to fix the problem by 'throwing more iron' at it and adding CPUs and memory without knowing why there was a problem. Let's look at a few numbers. From vmstat:
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr m0 m1 m1 m1 in sy cs us sy id
0 2 0 8948920 5015176 374 642 10 12 13 0 2 1 2 1 2 132 2694 1315 14 3 83
0 19 0 4089432 188224 466 474 50 276 278 0 55 5 5 4 3 7033 6191 2198 19 4 77
0 19 0 4089232 188304 430 529 91 211 211 0 34 8 6 5 4 6956 9611 2377 16 5 79
0 18 0 4085680 188168 556 758 96 218 217 0 40 12 4 6 4 6979 7659 2354 18 6 77
0 18 0 4077656 188128 520 501 75 217 216 0 46 9 3 5 2 7044 8044 2188 17 5 78
There is something odd about these numbers.
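If you would rather have a script do the squinting, here is a minimal Python sketch that pulls the run queue (r), blocked queue (b) and idle CPU (id) columns out of the vmstat sample above. The function names, the 22-column layout and the 50% idle threshold are assumptions made for this illustration; they match this particular listing, not every vmstat invocation. The first data row is skipped because vmstat's first line of output is a summary since boot rather than an interval sample.

```python
# Data rows copied from the vmstat sample in the post.
VMSTAT_SAMPLE = """\
0 2 0 8948920 5015176 374 642 10 12 13 0 2 1 2 1 2 132 2694 1315 14 3 83
0 19 0 4089432 188224 466 474 50 276 278 0 55 5 5 4 3 7033 6191 2198 19 4 77
0 19 0 4089232 188304 430 529 91 211 211 0 34 8 6 5 4 6956 9611 2377 16 5 79
0 18 0 4085680 188168 556 758 96 218 217 0 40 12 4 6 4 6979 7659 2354 18 6 77
0 18 0 4077656 188128 520 501 75 217 216 0 46 9 3 5 2 7044 8044 2188 17 5 78
"""

EXPECTED_COLUMNS = 22  # assumption: the column layout shown in this post


def parse_vmstat(text):
    """Return one dict per numeric row with the queue and CPU columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a full numeric row.
        if len(fields) != EXPECTED_COLUMNS or not all(f.isdigit() for f in fields):
            continue
        values = [int(f) for f in fields]
        rows.append({
            "r": values[0],     # run queue
            "b": values[1],     # blocked queue
            "us": values[-3],   # user CPU %
            "sy": values[-2],   # system CPU %
            "id": values[-1],   # idle CPU %
        })
    return rows


def blocked_but_idle(rows, idle_threshold=50):
    """Interval samples where processes are blocked while CPU sits idle."""
    interval_rows = rows[1:]  # first row is the since-boot summary
    return [row for row in interval_rows
            if row["b"] > row["r"] and row["id"] > idle_threshold]


if __name__ == "__main__":
    for row in blocked_but_idle(parse_vmstat(VMSTAT_SAMPLE)):
        print(f"blocked={row['b']} runnable={row['r']} idle={row['id']}%")
```

Every interval sample in this listing is flagged, which is the combination the discussion that follows picks apart.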
On vmstat, we look at the right 3 columns, us=user, sy=system and id=idle, so there is over 50% idle CPU available to throw at the problem. One way to detect a memory problem is to look at the sr, Scan Rate, column of vmstat (near the middle of the display.) If the page scanner ever starts running, or sr gets over 0, then we need to dig deeper into the memory system. The very odd part of this display is that the blocked queue on the left of the display has 18 or 19 processes in it but there are no processes in the run queue. That means we are blocking somewhere in Solaris without using all the CPUs available to us. So now, we need to turn to the I/O subsystem. With Solaris 8, the iostat command has a new switch, -C which will aggregate I/Os at the controller level. My favorite iostat command is iostat -xnMCz -T d (interval in seconds) (count of iostat outputs):
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
396.4 10.7 6.6 0.1 0.0 20.3 0.0 49.9 0 199 c1
400.2 8.8 6.7 0.0 0.0 20.2 0.0 49.4 0 199 c3
199.3 6.0 3.3 0.0 0.0 10.1 0.0 49.4 0 99 c1t0d0
197.1 4.7 3.3 0.0 0.0 10.2 0.0 50.4 0 100 c1t1d0
198.2 3.7 3.4 0.0 0.0 9.4 0.0 46.3 0 99 c3t0d0
202.0 5.1 3.3 0.0 0.0 10.8 0.0 52.4 0 100 c3t1d0
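A quick consistency check on the per-disk rows above: Little's law says the average number of requests in service equals the arrival rate times the average time each request spends in service. In iostat terms that means actv should be roughly (r/s + w/s) multiplied by asvc_t. The sketch below is just that arithmetic in Python; the function name predicted_queue is made up for the example, and the numbers are copied from the listing.

```python
# (device, reads/s, writes/s, actv, asvc_t in milliseconds) from the listing
DISKS = [
    ("c1t0d0", 199.3, 6.0, 10.1, 49.4),
    ("c1t1d0", 197.1, 4.7, 10.2, 50.4),
    ("c3t0d0", 198.2, 3.7, 9.4, 46.3),
    ("c3t1d0", 202.0, 5.1, 10.8, 52.4),
]


def predicted_queue(reads_per_s, writes_per_s, service_ms):
    """Little's law: requests in service = arrival rate * time in service."""
    return (reads_per_s + writes_per_s) * (service_ms / 1000.0)


if __name__ == "__main__":
    for device, reads, writes, actv, asvc_ms in DISKS:
        estimate = predicted_queue(reads, writes, asvc_ms)
        print(f"{device}: predicted {estimate:.2f} in service, iostat actv {actv}")
```

The predictions land within about 0.1 of the reported actv for all four disks, so the columns hang together: each disk has roughly ten requests outstanding, and each request takes about 50 milliseconds to come back.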
Whoa! On controller 1 we are doing 396 reads per second and on controller 3 we are doing 400 reads per second. On the right side of the data we see that iostat thinks the controller is almost 200% busy (iostat error...never checked to see if there has been a bug filed.) So then the individual disks are doing almost 200 reads per second and iostat figures that's 100% busy on the disks. That leads us to a rule of thumb, or heuristic, that individual disks perform at approximately 150 I/Os per second. This does not apply to LUNs or LDEVs from the big disk arrays. So our examination of the numbers lets us suggest adding 2 disks to each controller and re-laying out the data. Unfortunately, due to the disk array configurations, we could only add 1 disk to each controller. That did improve the situation as seen by the next iostat:
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
410.6 5.4 4.8 0.0 0.0 5.7 0.0 13.7 0 218 c1
386.0 9.0 4.6 0.0 0.0 5.3 0.0 13.4 0 211 c3
129.4 2.2 1.5 0.0 0.0 1.9 0.0 14.7 0 73 c1t0d0
139.4 1.8 1.6 0.0 0.0 2.3 0.0 16.0 0 79 c1t1d0
141.8 1.4 1.7 0.0 0.0 1.5 0.0 10.4 0 66 c1t2d0
133.0 1.0 1.6 0.0 0.0 2.1 0.0 15.6 0 76 c3t0d0
125.4 2.2 1.5 0.0 0.0 1.9 0.0 14.6 0 72 c3t1d0
127.6 5.8 1.5 0.0 0.0 1.4 0.0 10.2 0 63 c3t2d0
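The two iostat listings can be compared against the 150 I/Os per second rule of thumb with a few lines of Python. This is a minimal sketch: the function names are invented for the example, the rows are the per-disk lines copied from the listings, and the 150 figure is the heuristic from this post rather than a specification for any particular drive.

```python
RULE_OF_THUMB_IOPS = 150  # the post's heuristic for one physical disk

# Per-disk rows from the first listing (two disks per controller).
BEFORE = """\
199.3 6.0 3.3 0.0 0.0 10.1 0.0 49.4 0 99 c1t0d0
197.1 4.7 3.3 0.0 0.0 10.2 0.0 50.4 0 100 c1t1d0
198.2 3.7 3.4 0.0 0.0 9.4 0.0 46.3 0 99 c3t0d0
202.0 5.1 3.3 0.0 0.0 10.8 0.0 52.4 0 100 c3t1d0
"""

# Per-disk rows from the second listing (three disks per controller).
AFTER = """\
129.4 2.2 1.5 0.0 0.0 1.9 0.0 14.7 0 73 c1t0d0
139.4 1.8 1.6 0.0 0.0 2.3 0.0 16.0 0 79 c1t1d0
141.8 1.4 1.7 0.0 0.0 1.5 0.0 10.4 0 66 c1t2d0
133.0 1.0 1.6 0.0 0.0 2.1 0.0 15.6 0 76 c3t0d0
125.4 2.2 1.5 0.0 0.0 1.9 0.0 14.6 0 72 c3t1d0
127.6 5.8 1.5 0.0 0.0 1.4 0.0 10.2 0 63 c3t2d0
"""


def disk_iops(text):
    """Return {device: reads/s + writes/s} for each 11-column data row."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 11:
            continue
        try:
            reads, writes = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # header row such as "r/s w/s ... device"
        result[fields[-1]] = reads + writes
    return result


def over_rule_of_thumb(iops_by_disk, limit=RULE_OF_THUMB_IOPS):
    """Devices whose combined I/O rate exceeds the heuristic limit."""
    return sorted(dev for dev, iops in iops_by_disk.items() if iops > limit)


if __name__ == "__main__":
    for label, listing in (("before", BEFORE), ("after", AFTER)):
        iops = disk_iops(listing)
        print(label, "busiest disk:", round(max(iops.values()), 1),
              "over limit:", over_rule_of_thumb(iops))
```

With two disks per controller all four drives sit above the heuristic; with a third disk on each controller none do, although the busiest drive is still not far below it.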
We are still close to the top end of the performance of an individual disk but we dropped from 15% of transactions out of the SLA down to 6 or 7% of transactions out of the SLA. And the CPUs look good:
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr m0 m1 m1 m1 in sy cs us sy id
0 2 0 9283064 5482928 787 1293 36 0 0 0 0 0 23 0 13 5145 14763 1394 27 6 67
0 1 0 6547512 2483056 869 984 110 0 0 0 0 0 14 0 8 5377 8114 1372 23 6 71
0 1 0 6525816 2461496 1190 1230 0 0 0 0 0 0 0 0 0 6414 17808 1402 33 9 58
0 1 0 6516240 2451976 1316 481 0 0 0 0 0 0 0 0 0 5432 8226 1509 30 7 63
0 1 0 6506616 2442768 684 660 0 0 0 0 0 0 0 0 0 5188 16922 1259 26 7 67
Now we still have plenty of idle CPU time and only 1 or 2 in the blocked queue. It would have been nice to be able to add 2 disks to each controller but even 1 disk on each alleviated this problem. After this, the customer studied some of the internal design of their directory search algorithms. As the proverb says, fixing one performance or capacity limitation only exposes the next issue.
The point of this exercise is looking at all the numbers and attempting to locate the precise nature of the problem. Do not assume adding CPUs and memory will fix all performance problems. In this case the search programs were exceeding the capacity of the disk drives, which manifested itself as a performance problem of transactions with extreme response times. All those CPUs were waiting on the disk drives. One other thing to note in this example is that there were no 'magic' /etc/system parameters to tweak. There are fewer and fewer knobs (or parameters) in Solaris to adjust to improve performance. (2004-10-18 20:39:25.0)
Team, I tend to participate in aliases and share problems I have run into. Tonight I have to confess that I croaked my Windows 2000 partition when I attempted to dual boot my laptop last Saturday.
First let me reassure you that I know this can be done and I completed the process Thursday night. However, I will confess my failure in order to save you heartache and grief. I am excited about trying out the new features of Solaris 10 x86 like DTrace and Containers. Download your copy of Solaris 10 x86 here. I first cleaned up my hard drive by deleting outdated files and taking out the trash. Then I defragmented my drive using the Windows 2000 Disk Defragmenter. This is important for resizing the Windows partition. Then I did follow RULE #1 and made a backup of my important files. I used the backup facility of the Nero tool (burning CDs, not Rome, get it?) I made 2 mistakes here. Mistake #1 was that I did not follow RULE #2, Verify your backup, and so several files would not restore later because of media problems. Mistake #2 was that I backed up my data starting at the level of 'My Documents,' not at the level of my User ID (one level up), which would have included my Application Data folder, that is, my bookmarks file and my Outlook PST file. Now I have 'recent' backups of those files, but I lost 2 weeks of data when I thought I had fresh backups of these files. My problem was not understanding some features of the tools I attempted to use. I picked the tool Partition Commander to resize the 40GB hard drive into a 25GB partition for Windows 2000 and a 14+GB partition for Solaris 10 x86. Unfortunately for me, Partition Commander installed a utility, checkmbr (Check Master Boot Record), which automatically attempts to reinstall a base Master Boot Record. When you install another OS like Linux or Solaris x86, the new OS must update the master boot record and offer you the choice of which OS partition to boot. The repartition worked and the Solaris x86 install worked fine. I rebooted Solaris x86 several times and it was fine.
The problem occurred when I rebooted the Windows 2000 partition and the automatic utility checkmbr found the Solaris boot partition chooser in the master boot record. It attempted to restore it to its original state and then neither partition would boot. I believe you can and should do this. There are issues in doing this that are challenging but documents like this one can help you. I happened to have a Toshiba Tecra 9100 laptop which needs some BIOS setting changes: disable USB Legacy FDD support; disable USB Legacy support for keyboard and mouse, if that is a separate setting; and disable the parallel port. On Thursday night, I got out my Knoppix CD, which everyone should have in their CD case as a rescue CD. It has a utility, qtparted, which I used to partition my hard drive. Other versions of Linux also have this utility. I then rebooted Windows 2000 and let checkdisk run to get used to the new partition size. Then I took my stack of 4 Solaris 10 x86 CDs and ran the install. Success! I am ironing out a few display issues but looking forward to writing my first DTrace program tomorrow. (2004-10-13 20:34:44.0)
Everybody is positively mental over Grid computing (aka distributed computing) today. You may be curious about the Grid phenomenon but worried that you have not yet installed Solaris x86 at home. Not to worry. I know you are going to make the switch Real Soon Now, but even before you do, you can participate in the grid with your home computer even if it runs an operating system from another company. :)
Grids have been around a long time. In the 90's I participated in SETI@Home aka Search for Extraterrestrial Intelligence aka Search for Little Green Men. It's an impressive project that has 5.2 million users who have donated 2 million YEARS of computer time to the project. Today they are working at 66.5 teraflops (trillion floating point operations per second). That's some serious computational rock and roll. For my money though, Little Green Men is just so last millennium. I mean, if Agent Mulder has given up the search, why am I still working on it? :)
Another of my favorite grid projects is Folding@Home. This Grid project is modeling "protein folding, misfolding, aggregation and related diseases" (like Alzheimer's.) Currently they have 171,628 CPUs running Windows, Mac OS X and Linux, which means that they have 196 teraflops on the problems. The image preceding this paragraph is a beta amyloid peptide, "thought to be responsible for nerve cell death in Alzheimer's Disease." It is a part of Projects 722-724 so your computer could be a part of helping with Alzheimer's research. Truly a worthy cause and you can join up here.
I was folding proteins for Professor Pande, but 2 years ago I had a colon polyp removed that was becoming cancerous. So these days my computers are working on cancer research headed up by the University of Oxford. There are over a million members with almost 3 million computers working on the problem and yesterday 270,000 result sets were submitted. If you are concerned about cancer, download the software from here. Then join my team, SunONE. I am not actually the captain but I liked the name of the team.
I urge you to pick one of these projects and let your computer(s) crunch numbers while you are not actively using them. Each one of these processes runs in the background at a lower priority than all other work your computer is doing. Most can be paused, but you will not notice them in the background because your OS only schedules them when there are idle cycles available. If you are surfing the web, they back off. Even while you are typing they work, and trust me, a modern CPU is doing a lot of waiting around even if you type 300 words per minute. If "a mind is a terrible thing to waste," wasting good computer cycles is just criminal. Thanks for your support (2004-10-08 08:08:29.0)
I've been with Sun for 7 years in Professional Services. I primarily do large systems, HA clusters, Oracle 9i RAC clusters and performance and capacity planning on these systems for our customers. I will talk more about my work later. (2004-10-06 19:55:56.0)