Wednesday March 28, 2007 | Richard McDougall's Weblog Commentary from Race Control |
Eeek: Time isn't accurate in a VM!
I noticed that time was all over the place on my Solaris guest, and it made me wonder just how to measure and quantify time on a virtualized guest. The problem is that if you use gettimeofday() in the guest as a reference, it too may not be accurate. So I used an external time reference to measure the guest and, lo and behold, time was indeed out!
On my most recent VMware test configuration, with snv (Solaris Nevada) running as a VMware guest on Ubuntu, Solaris was jumping forward by several seconds or minutes at random times. The test host was Ubuntu 6.10 on a 2x2-core Opteron rev. E system.
Basically, the problem is that the Ubuntu host power manages the Opteron cores, and it seems that some virtualization layers (in this case VMware) don't take into account that the time stamp counters (TSCs) are not in sync across cores. When this happens, time jumps forward at random intervals, sometimes by up to an hour. This particular problem only happens on NUMA systems with non-synchronous TSCs.
To solve the problem, I bound my snv guest to a core and told VMware not to adjust the TSC. Here's what I have as a description: "The host.noTSC and ptsc.noTSC lines enable a mechanism that tries to keep the guest clock accurate even when the time stamp counter (TSC) is slow."
processor0.use = FALSE
processor1.use = FALSE
processor2.use = FALSE
host.cpukHz = 2200000
host.noTSC = TRUE
ptsc.noTSC = TRUE
Here's what I did to quantify the issue: I ran a small timing program under an externally controlled, timed benchmark, once on a reference (unvirtualized) host and once on the virtualized snv guest. That way I knew the real elapsed time rather than relying on what the guest was telling me. Interestingly, the guest was indeed lying about its notion of wall-clock time. Here's what 18 seconds looks like on (a) a VMware Solaris guest and (b) a reference machine (an old SPARC); that is, both of these tests ran for exactly 18 real seconds. Sec is the number of seconds since the start according to gettimeofday(), and tod is the delta of gettimeofday() between 1s sleeps (in microseconds).
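For illustration, here is a minimal Python sketch of the kind of sampler that could produce the Sec/tod/hrtime columns. It is not the original test program: the function names are hypothetical, time.time() stands in for gettimeofday(), and time.monotonic() stands in for a high-resolution timer such as gethrtime(). Both clocks are read inside the guest, so the run as a whole still has to be timed from an external reference to know the real elapsed time.

```python
import time


def sample_clocks(samples=19, interval=1.0, sleep=time.sleep,
                  wall=time.time, mono=time.monotonic):
    """Sleep `interval` seconds `samples` times and record clock deltas.

    Returns rows of (sec, tod_us, hr_us):
      sec    - whole seconds since the start according to the wall clock,
               taken at the beginning of each sleep
      tod_us - wall-clock delta across the sleep, in microseconds
      hr_us  - monotonic-clock delta across the sleep, in microseconds
    The clock and sleep functions are parameters only so that the loop
    can be exercised with fake clocks.
    """
    rows = []
    start_wall = wall()
    prev_wall, prev_mono = start_wall, mono()
    for _ in range(samples):
        sleep(interval)
        now_wall, now_mono = wall(), mono()
        rows.append((
            int(prev_wall - start_wall),
            round((now_wall - prev_wall) * 1e6),
            round((now_mono - prev_mono) * 1e6),
        ))
        prev_wall, prev_mono = now_wall, now_mono
    return rows


def suspicious(rows, expected_us=1_000_000, tolerance_us=100_000):
    """Return rows whose wall-clock delta is far from the sleep length.

    The 100 ms tolerance is an arbitrary choice for this sketch.
    """
    return [row for row in rows if abs(row[1] - expected_us) > tolerance_us]
```

Calling sample_clocks() with the defaults takes about 19 seconds and yields one row per sleep; passing the result to suspicious() picks out rows like the one marked "Argh!" in the guest output.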
Reference:
Sec  tod        hrtime
0    1009354    1009476
1    1009801    1009817
2    1010484    1010505
3    1009315    1009333
4    1009905    1009923
5    1009909    1009930
6    1009905    1009924
7    1009911    1009945
8    1009842    1009869
9    1009897    1009918
10   1009887    1009906
11   1009918    1009933
12   1009903    1009920
13   1009913    1009931
14   1009892    1009910
15   1009908    1009928
16   1009895    1009921
17   1009900    1009929
18   1009876    1009896
snv Guest:
Sec  tod        hrtime
0    1000665    1000702
1    1007738    1007756
2    177169798  177169813  <= Argh!
179  1011251    1011275
180  1008404    1008432
181  1009989    1010016
182  1009618    1009644
183  1009896    1009924
184  1009747    1009766
185  1000265    1000291
186  1009336    1009360
In the Solaris VMware guest with numanode = "1" set, it gets better, but now time runs slow (setting this binds the guest onto a NUMA- and time-coherent set of cores):
Sec  tod        hrtime
0    1000122    1000139
1    1004468    989472
2    4973883    939326
6    1009682    1009689
7    1005019    991275
8    4975168    939355
13   1009630    1009638
14   1003097    989955
With the following params set:
processor0.use = FALSE
processor1.use = FALSE
processor2.use = FALSE
host.cpukHz = 2200000
host.noTSC = TRUE
ptsc.noTSC = TRUE
Sec  tod        hrtime
0    1004787    1004911
1    1009783    1009802
2    1009914    1009935
3    1009894    1009913
4    1009895    1009913
5    1009900    1009918
6    1037644    1037680
7    1002091    1002117
8    1009910    1009929
9    1009897    1009920
10   1009904    1009923
11   1009893    1009913
12   1009916    1009934
13   1009876    1009894
14   1009893    1009918
15   1009873    1009891
16   1009901    1009921
17   1009883    1009911
18   1009903    1009922
Success!
( Mar 28 2007, 06:39:01 PM PDT ) Permalink Comments [2]
New Features of the Solaris Performance Wiki
We've added a few new features and some more content at the Solaris Performance Wiki. Namely:
Of course, everyone is welcome to contribute!
[ T: OpenSolaris Solaris ] ( Mar 15 2007, 01:30:21 PM PDT ) Permalink
I put up a few of the Blackbox release-day photos at my website. IMHO, a great new breakthrough; putting datacenters next to power stations and eliminating the power transmission costs is actually a pretty big deal when deploying at massive scale.
On a related note, this site is running in an OpenSolaris zone, connected via gigabit ethernet to an OpenSolaris ZFS NFS file server, alongside solarisinternals.com and about a dozen other websites... ( Oct 19 2006, 01:55:30 PM PDT ) Permalink
Last week I presented on server consolidation & virtualization technologies in Paris, at Sun's SunUP-Network conference. We had a good turnout, around 100 customers from around Europe. There was tremendous interest in virtualization, and some of the customers are quite a long way down the path of deploying virtualization. I presented on OS virtualization vs hardware virtualization, and talked about the differences between the two, including some of the performance studies we have been doing. Some of the interesting tidbits from the discussions:
All in all a very interesting meeting and set of discussions!
( Sep 24 2006, 07:46:56 PM PDT ) Permalink Comments [1]
DTrace, MDB: Solaris Internals Podcast
During Jim's recent west coast tour, we were apparently overheard talking in the local about Solaris Internals. Catch the Podcast here, or the raw mp3 here.
[ T: OpenSolaris Solaris ] ( Aug 04 2006, 03:50:57 PM PDT ) Permalink
Fun at the OpenSolaris Users Group last night
Jim and I had fun at last night's Silicon Valley OpenSolaris Users Group. We finally met a few more great OpenSolaris community members, including Ben Rockwood, who has blogged about the meeting already! We ran a quiz and gave away a set of signed books too.
I just posted the slides we used to talk about the book here, for reference.
[ T: OpenSolaris Solaris ] ( Jul 26 2006, 04:24:08 PM PDT ) Permalink Comments [3]
I'm very happy to finally be able to say that Solaris Internals is shipping! I received a box from the same batch that went to Amazon this week, and Amazon have updated their status to available. We expect the 2nd book (Solaris Performance and Tools) to ship next week. Also, we've started creating a Performance FAQ for the 2nd book. It's in early stages right now, but growing quickly. On a final note, Jim and I hope to do a talk about the books at the Silicon Valley OpenSolaris Users Group next week in Santa Clara; hope to see you there!
[ T: OpenSolaris Solaris ] ( Jul 19 2006, 04:49:02 PM PDT ) Permalink Comments [4]
Performance, Observability, DTrace and MDB
Solaris Internals, 2nd Edition is finally done! At 5:30am this morning, Jim, Brendan Gregg and I submitted two completed books to the publisher. You may notice two easter eggs here: first, there are now TWO books, and second, there is another primary author in the fold, Brendan Gregg.
The first of the two books is an update to Solaris Internals, for Solaris 10 and OpenSolaris. It covers Virtual Memory, File systems, Zones, Resource Management, Process Rights etc. (all the good stuff in S10). This book is about 1100 pages. The TOC for this book is here.
The second book is aimed at administrators who want to learn about performance and debugging. It's basically the book to read to understand and learn DTrace, MDB and the Solaris Performance tools, and a methodology for performance observability and debugging. This book is about 550 pages. The TOC for this book is here.
We need your help to name the two books. The current proposals are:
We would very much like to hear your thoughts on what you feel would be great titles and subtitles for the books. We welcome and look forward to your thoughts on the titles!
[ T: OpenSolaris Solaris ] ( Apr 03 2006, 03:42:43 PM PDT ) Permalink Comments [21]
Previously confidential SPARC docs released via OpenSPARC
I see the OpenSPARC folks have opened up the specifications for the Niagara processor AND the new Hypervisor over at the OpenSPARC community website.
[ T: NiagaraCMT OpenSPARC ] ( Jan 25 2006, 11:40:30 AM PST ) Permalink
Solaris Internals - 2nd Edition
It's coming. Really! Jim, the team and I think we are within a couple of weeks of finishing the writing phase. You can check the TOC here, and please do comment.
Solaris Internals, 2nd Edition
[ T: OpenSolaris Solaris ] ( Jan 17 2006, 06:29:01 PM PST ) Permalink Comments [17]
You've no doubt heard a lot of noise about a new chip from Sun code-named Niagara. It's Sun's first chip-level multiprocessor, with 32 virtual CPUs (threads) on a single chip. But wait, isn't this just another product release on the roadmap? Heck no. This is the dawn of the CMT era, which I believe represents a significant shift in the way we build and deploy massive-scale systems. The official name is UltraSPARC T1, but personally I like the code-name Niagara. Today, we released two systems around the Niagara chip, the T1000 and T2000.
I was convinced of the significance of CMT about two years ago by Rick Hetherington, Distinguished Engineer and architect of the Niagara-based system. I was working with an extreme-scale web provider here in the Bay Area that rolls out thousands of web-facing servers. So many, in fact, that they had already concluded that server power consumption was responsible for up to 40% of the cost of running their data center, due to the relationship between power, AC, UPS, floorspace and infrastructure costs. I went in with an open mind, considering SPARC (at the time), commodity x86, and a range of low-power x86 options.
Rick Hetherington and Kunle Olukotun (a founding architect of the chip) started sketching out how much throughput they would expect from their CMT design: eight 1.2GHz cores on a single 60 watt die, which was still being taped out at the time. Being a skeptic, I threw in some what-if questions comparing the throughput from some of the new breakaway x86 ultra-low-power cores, like the AMD Geode or VIA EPIAs -- about 1GHz @ 10-20 watts. It turns out that they were right; while the Geodes and EPIAs were much more efficient than commodity x86, none of these options came close to the throughput per watt and cost per throughput delivered by a single die with many cores.
Two years later, it seems obvious that the more cores you put on a single die, the greater the savings in both cost and power, and so began the tag-line "cool-threads". Check out Sim Datacenter, a downloadable power simulator for the datacenter. I'm pleased today to be able to walk you through some of today's Niagara blog entries from the microprocessor, hardware, operating system and application performance teams. There are some great articles on all aspects of the technical details around Niagara:
A hearty congratulations to the whole team who brought this technology together. I've personally observed one of the most significant cross-company collaboration efforts ever -- this technology brought together teams from the microprocessor group, the Solaris kernel group, the JVM, compilers, and application experts all across the company over the past two years, with an enthusiasm level that's hard to put into words. On a final note, there are two easter eggs: Oracle have just announced that they recognise Niagara as a 2-CPU system from a licensing perspective. And we've Open Sourced SPARC! We hope you enjoy exploring CMT and the new Niagara-based servers. We'll be opening up a discussion forum shortly, to connect you directly with the developers and application performance experts who work with these systems. Stay tuned!
[ T: NiagaraCMT OpenSolaris Solaris] ( Dec 06 2005, 12:25:34 PM PST ) Permalink Comments [22]
Welcome to the CMT Era!
Richard McDougall: Today is the release of the most exciting processor development in the last decade: UltraSPARC T1 - the first Chip Level Multithreading based system from Sun code-named Niagara. Today, you'll find an exciting set of discussions direct from the experts; discussing CMT processor principles, blazing application performance, and what all the buzz around "cool threads" is about. Check out my introductory story linking to all the discussions. ( Dec 05 2005, 12:44:35 PM PST ) Permalink Comments [1]
Tuning for Maximum Sequential I/O Bandwidth
John asks the question "what does maxphys do, and how should I tune it?" The maxphys parameter used to be the authoritative limit on the maximum I/O transfer size in Solaris. A large transfer size, if the device and I/O layer support it, generally provides better large I/O throughput, because disks prefer larger transfers when you are trying to get absolute maximum throughput. For a single disk, 64k is generally the point at which maximum transfer rate occurs, and for disk arrays, typically 1MB.
Historically, there was a maxphys of 56k set, due to some older (VME?) bus max xfer limitations. The maxphys limit was increased to 128k on SPARC around the Solaris 2.6 timeframe. Since Solaris 7, the sd/ssd drivers (SCSI and Fibre Channel) override maxphys if the device supports tag queuing, up to a default of 1MB. On x86/x64, the default is still 56k (we need to look at this!).
In summary, the defaults for SPARC with SCSI or Fibre Channel are optimal. You can always check by doing a large I/O test and observing the average I/O size:
# dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=8192k
# iostat -xnc 3
cpu
us sy wt id
10 4 0 86
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
22.4 0.0 22910.7 0.0 0.0 1.0 0.1 44.6 0 100 c0t0d0
Here we can see that we are doing 22 reads per second and 22MB/s; thus the disk is performing optimal 1MB I/Os. A simple set of tests with varying block sizes will also help identify the best I/O size for your device.
One more comment about max transfer size: this parameter (the driver's max transfer size) is read by UFS's newfs command and used to set the cluster size of the file system, i.e. the number of contiguous blocks to read ahead or write behind. If you put a file system on a device and want to see large I/Os, you'll need to ensure that the maxcontig parameter also reflects the device's max transfer size. It can be tweaked afterwards with tunefs.
# mkfs -m /dev/dsk/c0t0d0s0
mkfs -F ufs -o nsect=248,ntrack=19,bsize=8192,fragsize=1024,cgsize=22,free=1,rps=120,nbpi=8155,opt=t,apc=0,gap=0,nrpos=8,maxcontig=128,mtb=n /dev/dsk/c0t0d0s0 18588840
Here you can see that maxcontig is 128 8k blocks, which is 1MB. If you tune maxphys, then reset the file system's max cluster size with tunefs afterwards, too.
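The arithmetic behind both checks is simple enough to script. The following Python sketch is only an illustration: the function names are made up, and the parser assumes the iostat -xn column order shown above (r/s first, kr/s third, device name last). It derives the average read size from one device line and the file system cluster size from maxcontig and bsize.

```python
def avg_read_size_kb(reads_per_sec, kb_read_per_sec):
    """Average size of one read in KB: throughput divided by operation rate."""
    if reads_per_sec <= 0:
        raise ValueError("no reads in this interval")
    return kb_read_per_sec / reads_per_sec


def parse_iostat_line(line):
    """Return (device, average read size in KB) for one device line.

    Assumes the 11 columns of the sample above:
    r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
    """
    fields = line.split()
    if len(fields) != 11:
        raise ValueError("unexpected number of columns")
    reads_per_sec = float(fields[0])
    kb_read_per_sec = float(fields[2])
    return fields[10], avg_read_size_kb(reads_per_sec, kb_read_per_sec)


def cluster_size_bytes(maxcontig, bsize):
    """UFS cluster size: maxcontig contiguous blocks of bsize bytes each."""
    return maxcontig * bsize
```

Fed the sample line from the iostat output, parse_iostat_line() gives an average read of roughly 1023 KB for c0t0d0, and cluster_size_bytes(128, 8192) gives 1048576 bytes, which matches the 1MB figures quoted in the text.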
[ T: OpenSolaris Solaris ] ( Nov 11 2005, 12:12:55 PM PST ) Permalink Comments [8]
It looks like there is now an NFS option for the Buffalo Terastation, and a community around the device: ( Oct 31 2005, 10:27:33 AM PST ) Permalink Comments [5]
CMT is coming: Is your application ready?
We're close to seeing some of the most exciting SPARC systems in over a decade. The new Niagara-based systems are the most aggressive CMT systems the industry has seen to date, with 32 threads in a single chip. A chip like this will be able to deliver the performance of up to 15 UltraSPARC processors while using less than one third of the power. This represents a compelling advantage not only in performance, but also as a significant reduction in power, cooling and space.
Since even a single Niagara chip presents itself to software as a 32-processor system, the ability of system and application software to exploit multiple processors or threads simultaneously is becoming more important than ever. As CMT hardware progresses, the software is required to scale accordingly to fully exploit the parallelism of the chip. Current efforts are delivering successful scaling results for key applications. Oracle, Sun Web Server and SAP are among many examples of applications that have already shown scalability which can fully exploit all the threads of a Niagara-based system.
To maximize the success of CMT systems we need renewed focus on application scalability. Many of the applications we migrate to CMT systems will have been developed on low-end Linux systems; they may never have been tested on a higher-end system. The Association for Computing Machinery (ACM) is running a special feature on the impact of CMT on software this month. There are several relevant articles in this issue:
"the transition to CMPs is inevitable because past efforts to speed up processor architectures with techniques that do not modify the basic von Neumann computing model, such as pipelining and superscalar issue, are encountering hard limits. As a result, the microprocessor industry is leading the way to multicore architectures"
"Throughput computing is the first and most pressing area where CMPs are having an impact. This is because they can improve power/performance results right out of the box, without any software changes, thanks to the large numbers of independent threads that are available in these already multithreaded applications."
"We can break down the TCO (total cost of ownership) of a large-scale computing cluster into four main components: price of the hardware, power (recurring and initial data-center investment), recurring data-center operations costs, and cost of the software infrastructure. ...And it gets worse. If performance per watt is to remain constant over the next few years, power costs could easily overtake hardware costs, possibly by a large margin."
"We need to consider the effects of the change in the degree of scaling on the way we architect applications, on which operating system we choose, and on the techniques we use to deploy applications - even at the low end."
"But concurrency is hard. Not only are today's languages and tools inadequate to transform applications into parallel programs, but also it is difficult to find parallelism in mainstream applications, and - worst of all - concurrency requires programmers to think in a way humans find difficult. Nevertheless, multicore machines are the future, and we must figure out how to program them. The rest of this article delves into some of the reasons why it is hard, and some possible directions for solutions."
In addition to the ACM Queue articles, there was a recent NetTalk on Scaling My Apps, featuring Bryan Cantrill.
There will be a followup experts exchange on this topic, where customers can live chat with the technical scaling experts. Also, look for a new whitepaper on scaling applications for CMT, from Denis Sheahan of the Niagara architecture group. ( Oct 31 2005, 09:39:46 AM PST ) Permalink Comments [5]