Friday December 07, 2007
I've now been at Sun for the best part of two decades. It was Solaris and SPARC which first attracted me to Sun, and it's exciting to see both very much alive and kicking all these years on. But this didn't happen by chance. Sun currently spends $2B per year on R&D, making Sun one of the top 50 technology investors worldwide.
Having myself spent the last five years doing R&D, why have I decided to move back into Sun UK's field organisation? Simply, it's because I think Sun has a very compelling story to tell, and a very exciting portfolio of products and services to offer. In short, I think I'm going to have a lot of fun!
In my new role I'm finding that the thinking behind my Killer Combination, Wicked Bible and Brief History postings is resonating well with a lot of people. Quite frankly, I'm astonished by the number of downloads of my extremely minimalist presentation given to the Sun HPC Consortium in Dresden.
Such has been the interest of late that I thought it would be worth sharing my latest minimalist slide deck: Solaris: greater than the sum of its parts. A lot of the material may be familiar, although I have been experimenting with tag clouds as an alternative to boring bullet points. My basic thesis is: Sun is where the innovation is, so why flirt with imitations?
I've been a big Lego fan for as long as I can remember. Whilst some kids are content to follow the supplied instruction guides, the real thrill for me has always been that Lego allows me to design and build my very own creations, limited only by my imagination. I feel the same way about UNIX.
UNIX has always been about innovation. UNIX has always provided a rich set of simple, consistent, elegant and well defined interfaces which enable the developer to "go create". This "innovation elsewhere" often takes the seed innovation to places unforeseen by its inventors, and this in turn leads to new innovations.
Lego has undergone a similar evolution. At first I only had chunky 2x2 and 2x4 bricks in red and white to work with. Then came 1xN bricks and 1/3 height plates and more colours. Next came the specialised pieces (e.g. window frames, door frames, wheels, flat plates, sloping roof tiles, fences and so on). But all the time these innovations extended the original, well-designed interfaces, with a big commitment to compatibility, thus preserving investment in much the same way as the Solaris ABI (application binary interface).
Obviously, there are places where these parallels break down, but I think we can push the analogy a little further yet [note to self: a USENIX paper?]. In my mind, Solaris is somewhat akin to Lego Technics, and Solaris 10 to Lego Mindstorms. And in this vein, I see Linux rather in the Betta Builder mould (i.e. innovative interfaces copied from elsewhere, actually cheaper and lighter, but not quite the same build quality as the original implementation). And this is where I'm going to get a little more provocative.
In my new presentation I experiment with tag clouds to enumerate some of Sun's more important UNIX innovations over time. The first cloud lists prehistoric stuff from SunOS 3 and 4 days. The second focusses mostly on Solaris 2. The third focusses on Solaris 10. And while Sun may not be able to take sole credit for ideas such as /proc and mmap, it can claim to have the first substantive implementations.
The fourth tag cloud is included to demonstrate that Sun does not suffer from terminal NIH (not invented here) syndrome. Indeed, I think it recognises that Sun is a pretty good judge of excellence elsewhere (most of the time).
Whatever you think of the detail (and I concede some of it could do with a little more research) I do think it is helpful to ask "where does the innovation happen?". At the very least, I think I've shown that there is heaps of innovation in Solaris which we simply take for granted.
To put it another way: as a Solaris enthusiast I can't help feeling at ease in a Linux environment because I find so many familiar objects from home (I guess a GNU/BSD devotee might say something similar). That's not to deny the achievements of the Linux community in implementing interfaces invented elsewhere, but when I look at the flow of innovation between Solaris and Linux it does feel rather like a one-way street.
We live in interesting times! My own particular area of interest is multithreading. With the dawning of the brave new world of large scale chip multithreading Solaris seems uniquely placed to ride the next wave. This is not by accident. Sun has made a huge investment in thread scalability over the past 15 years.
One of my slides asks "What is the essential difference between single and multithreaded processes?" For some this is not a trivial question. For some it depends on which thread library is being used. But with Solaris, where application threads have been first class citizens ever since Solaris 10 first shipped, the answer is simply "the number of threads".
Enough of this! The weekend is upon us. Where's my Lego?
( Dec 07 2007, 07:48:27 AM PST ) Permalink
libMicro is a portable, scalable microbenchmarking framework which Bart Smaalders and I put together a little while back. It is available to the world via the OpenSolaris website, under the CDDL license, so there really is nothing to stop you recreating the data in this posting.
Although designed for testing individual APIs, the libMicro framework has proven useful for other investigations. For instance, the memrand case, which does negative stride pointer chasing, can be configured to test processor cache and memory latencies...
huron$ bin/memrand -s 128m -B 1000000 -C 10
prc thr usecs/call samples errors cnt/samp size
memrand 1 1 0.15232 12 0 1000000 134217728
huron$
The above shows 12 samples (we asked for at least 10) of 1,000,000 negative stride pointer references striped across 128MB of memory. The platform is a Sun SPARC Enterprise T5220 server, with an UltraSPARC T2 processor running at 1.4GHz. This simple test indicates a memory read latency of 152ns.
But we can also use libMicro's multiprocess and multithread scaling capabilities to extend this test to measure memory throughput scaling...
huron$ for i in 1 2 4 8 16 32 64; do bin/memrand -s 128m -B 1000000 -C 10 -T $i; done
prc thr usecs/call samples errors cnt/samp size
memrand 1 1 0.15223 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 2 0.15176 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 4 0.15208 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 8 0.25472 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 16 0.26242 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 32 0.24964 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 64 0.24063 12 0 1000000 134217728
huron$
This shows that up to 4 concurrent threads see 152ns latency, with 64 threads (i.e. full processor utilisation) seeing 240ns latency, which equates to a throughput of 267 million memory reads per second (i.e. 64 / 0.240e-6). Just to set this in context, here are some data for a quad socket Tigerton system running at 2.93GHz...
tiger$ for i in 1 2 4 8 16; do bin/memrand -s 128m -B 1000000 -C 10 -T $i; done
prc thr usecs/call samples errors cnt/samp size
memrand 1 1 0.15559 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 2 0.15621 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 4 0.15667 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 8 0.17726 12 0 1000000 134217728
prc thr usecs/call samples errors cnt/samp size
memrand 1 16 0.18654 12 0 1000000 134217728
tiger$
This shows a peak throughput of about 86 million memory reads per second (i.e. 16 / 0.186e-6), making the single chip UltraSPARC T2 processor's throughput 3x that of its quad chip rival. Of course, mileage will vary greatly from workload to workload, but pretty impressive nonetheless, heh?
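The throughput figures quoted above follow directly from the latencies: if each thread completes one read per latency period, N concurrent threads complete N / latency reads per second. A minimal sketch of that arithmetic, using the rounded latencies from the text (the function name is purely illustrative):

```python
def reads_per_second(threads: int, latency_seconds: float) -> float:
    """Aggregate reads/second when each thread completes one read per latency period."""
    return threads / latency_seconds

# Rounded latencies as quoted in the text above.
t2 = reads_per_second(64, 0.240e-6)        # UltraSPARC T2, 64 threads
tigerton = reads_per_second(16, 0.186e-6)  # quad socket Tigerton, 16 threads

print(f"T2:       {t2 / 1e6:.0f} million reads/s")
print(f"Tigerton: {tigerton / 1e6:.0f} million reads/s")
print(f"ratio:    {t2 / tigerton:.1f}x")
```

This only restates the simple latency model used in the post; it says nothing about how other workloads would behave.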
( Nov 02 2007, 05:43:22 PM PDT ) Permalink
"Congratulations, you have been randomly selected to win a free all accommodation included vacation to Florida Bahamas. Press 9 for more information."
How lucky am I?!
I received this automated call twice within ten minutes to two different phone numbers! How random is that?!
So what possessed me to hang up? Well, if you have any sense, you'll do the same. If these people can't even be honest about the way they choose your number, why should you trust them with anything else?
Moral: If something appears too good to be true, it probably is.
Anyway, must dash as I've got to book my tickets for Nigeria. A very friendly lady who is the widow of some ex quasi government official needs my help laundering a few million dollars...
( Sep 28 2007, 03:12:04 AM PDT ) Permalink Comments [1]
I've just read a couple of intriguing posts which discuss the possibility of hosting UFS filesystems on ZFS zvols. I mean, who in their right mind...? The story goes something like this ...
# zfs create tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# touch /ufs/file
# zfs snapshot tank/ufs@snap
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# ls -l /ufs_clone/file
Whoopy doo. It just works. How cool is that? I can have the best of both worlds (e.g. UFS quotas with ZFS datapath protection and snapshots). I can have my cake and eat it!
Well, not quite. Consider this variation on the theme:
# zfs create tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# date >/ufs/file
# zfs snapshot tank/ufs@snap
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# cat /ufs_clone/file
What will the output of the cat(1) command be?
Well, every time I've tried it so far, the file exists, but it contains nothing.
The reason for this is that whilst the UFS metadata gets updated immediately (ensuring that the file is created), the file's data has to wait a while in the Solaris page cache until the fsflush daemon initiates a write back to the storage device (a zvol in this case).
By default, fsflush will attempt to cover the entire page cache within 30 seconds. However, if the system is busy, or has lots of RAM -- or both -- it can take much longer for the file's data to hit the storage device.
Applications that care about data integrity across power outages and crashes don't rely on fsflush to do their dirty (page) work for them. Instead, they tend to use raw I/O interfaces, or fcntl(2) flags such as O_SYNC and O_DSYNC, or APIs such as fsync(3C), fdatasync(3RT) and msync(3C).
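As an illustration of the last approach, here is a minimal sketch of an application flushing its own data instead of waiting for fsflush. It uses Python's os.fsync() wrapper around the underlying call; the helper name and the file used in the demo are hypothetical.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and block until the OS reports it has been flushed to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # do not return until the data has been flushed
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "file")
        durable_write(target, b"hello\n")
        with open(target, "rb") as f:
            print(f.read())
```

Opening the file with os.O_SYNC or os.O_DSYNC (where the platform provides them) is the alternative mentioned above: every write then waits for the device, at the cost of more synchronous I/O.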
On systems with large amounts of RAM, the fsflush daemon can consume inordinate amounts of CPU. It is not uncommon to see a whole CPU pegged just scanning the page cache for dirty pages. In configurations where applications take care of their own write flushing, it is considered good practice to throttle fsflush with the /etc/system parameters autoup and tune_t_fsflushr. Many systems are configured for fsflush to take at least 5 minutes to scan the whole of the page cache.
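The effect of throttling can be put in rough numbers. A minimal sketch, assuming fsflush wakes every tune_t_fsflushr seconds and examines roughly tune_t_fsflushr/autoup of memory on each wake-up. The 300GB RAM size, the one second default wake-up interval and the throttled interval of 5 seconds are illustrative assumptions, and the model ignores everything else the daemon does.

```python
def fsflush_plan(autoup: int, tune_t_fsflushr: int, ram_gb: float) -> dict:
    """Rough scan rates for the given settings (a back-of-envelope model, not kernel code)."""
    wakeups_per_pass = autoup / tune_t_fsflushr
    return {
        "wakeups_per_pass": wakeups_per_pass,          # wake-ups needed to cover all of RAM
        "gb_per_wakeup": ram_gb / wakeups_per_pass,    # memory examined on each wake-up
        "gb_per_second": ram_gb / autoup,              # average scan rate
    }

# 30 second pass as described above; 1 second wake-up interval is an assumption.
default = fsflush_plan(autoup=30, tune_t_fsflushr=1, ram_gb=300)
# Five minute pass as mentioned above; 5 second wake-up interval is an assumption.
throttled = fsflush_plan(autoup=300, tune_t_fsflushr=5, ram_gb=300)

for name, plan in (("default", default), ("throttled", throttled)):
    print(f"{name}: {plan['gb_per_second']:.0f} GB/s average, "
          f"{plan['gb_per_wakeup']:.0f} GB per wake-up")
```

Under these assumptions, stretching the pass from 30 seconds to five minutes cuts the average scan rate by a factor of ten, which is the point of the tuning.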
From this it is clear that we need to take a little more care before taking a snapshot of a UFS filesystem hosted on a ZFS zvol. Fortunately, Solaris has just what we need:
# zfs create tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# date >/ufs/file
# lockfs -wf /ufs
# zfs snapshot tank/ufs@snap
# lockfs -u /ufs
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# cat /ufs_clone/file
Notice the addition of just two lockfs(1M) commands. The first blocks any writers to the filesystem and causes all dirty pages associated with the filesystem to be flushed to the storage device. The second releases any blocked writers once the snapshot has been cleanly taken.
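If this sequence were scripted, the unlock would need to happen even when the snapshot fails, otherwise writers stay blocked. A minimal sketch of that ordering, assuming the mount point is passed explicitly to lockfs; the function names are hypothetical and the demo only prints the commands rather than running them.

```python
import shlex
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]

def consistent_snapshot(volume: str, mountpoint: str, snap: str, run: Runner) -> None:
    """Write-lock and flush a UFS filesystem, snapshot its zvol, then unlock."""
    run(["lockfs", "-wf", mountpoint])                # block writers, flush dirty pages
    try:
        run(["zfs", "snapshot", f"{volume}@{snap}"])  # the zvol is now quiescent
    finally:
        run(["lockfs", "-u", mountpoint])             # always release blocked writers

def dry_run(argv: Sequence[str]) -> None:
    """Print the command instead of executing it."""
    print("#", shlex.join(argv))

consistent_snapshot("tank/ufs", "/ufs", "snap", dry_run)
```

To execute the commands for real, a runner such as `lambda argv: subprocess.run(argv, check=True)` could be passed in place of dry_run.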
Of course, this will be nothing like as quick as the initial example, but at least it will guarantee that you get all the data you are expecting. It's not just no data we should be concerned about, but also stale data (which is much harder to detect).
I suppose this may be a useful workaround for folk waiting for some darling features to appear in ZFS. However, don't forget that "there's no such thing as a free lunch"! For instance, hosting UFS on ZFS zvols will result in the double caching of filesystem pages in RAM. Of course, as a SUNW^H^H^H^HJAVA stock holder, I'd like to encourage you to do just that!
Solaris is a wonderfully well-stocked tool box full of great technology that is ideal for solving many real world problems. One of the joys of UNIX is that there is usually more than one way to tackle a problem. But hey, be careful out there! Make sure you do a good job, and please don't blame the tools when you screw up. A good rope is very useful. Just don't hang yourself!
Technorati Tags: OpenSolaris, Solaris, ZFS
( Sep 11 2007, 07:57:21 AM PDT ) Permalink

The table on the right was directly lifted from a report exploring the scalability of a fairly traditional client-server application on the Sun Fire E6900 and E25K platforms.
The system boards in both machines are identical, only the interconnects differ. Each system board has four CPU sockets, with a dual-core CPU in each, yielding a total of eight virtual processors per board.
The application client is written in COBOL and talks to a multithreaded C-ISAM database on the same host via TCP/IP loopback sockets. The workload was a real world overnight batch of many "read-only" jobs run 32 at a time.
The primary metric for the benchmark was the total elapsed time. A processor set was used to contain the database engine, with no more than eight virtual processors remaining for the application processes.
The report concludes that the E25K's negative scaling is due to NUMA considerations. I felt this had more to do with perceived "previous convictions" than fact. It bothered me that the E6900 performance had not been called into question at all or explored further.
The situation is made a little clearer by plotting the table as a graph, where the Y axis is a throughput metric rather than the total elapsed time.
Although the E25K plot does indeed appear to show negative scalability (which must surely be somehow related to data locality), it is the E6900 plot which reveals the real problem.
The most important question is not "Why does the E25K throughput drop as virtual processors are added?" but rather "Why does the E6900 hardly go any faster as virtual processors are added?"
Of course there could be many reasons for this (e.g. "because there are not enough virtual processors available to run the COBOL client").
However, further investigation with the prstat utility revealed severe thread scalability bottlenecks in the multithreaded database engine.
Using prstat's -m and -L flags it was possible to see microstate accounting data for each thread in the database. This revealed a large number of threads in the LCK state.
Some very basic questions (simple enough to be typed on a command line) were then asked using DTrace and these showed that the "lock waits" were due to heavy contention on a few hot mutex locks within the database.
Many multithreaded applications are known to scale well on machines such as the E25K. Such applications will not have highly contended locks. Good design of data structures and clever strategies to avoid contention are essential for success.
This second graph may be purely hypothetical but it does indicate how a carefully written multithreaded application's throughput might be expected to scale on both the E6900 and the E25K (taking into account the slightly longer inter-board latencies associated with the latter).
The graph also shows that something less than perfect scalability may still be economically viable on very large machines -- i.e. it may be possible to solve larger problems, even if this is achieved less efficiently.
As an aside, this is somewhat similar to the way in which drag takes over as the dominant factor limiting the speed of a car -- i.e. it may be necessary to double the engine size to increase the maximum speed by less than 50%.
Working with the database developer it was possible to use DTrace to begin "peeling the scalability onion" (an apt metaphor for an iterative process of diminishing returns -- and tears -- as layer after layer of contention is removed from code).
With DTrace it is a simple matter to generate call stacks for code implicated in heavily contended locks. Breaking such locks up and/or converting mutexes to rwlocks is a well understood technique for retrofitting scalability to serialised code, but it is beyond the scope of this post. Suffice it to say that some dramatic results were quickly achieved.
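For readers unfamiliar with the idea of breaking up a hot lock, here is a minimal sketch of lock striping. It is not the database's code; the class, the key names and the stripe count are all hypothetical. In CPython the global interpreter lock hides any speed-up, so the sketch shows only the structure of the technique.

```python
import threading

class StripedCounter:
    """Per-key counters guarded by several locks instead of one hot mutex."""

    def __init__(self, stripes: int = 16) -> None:
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._tables = [dict() for _ in range(stripes)]

    def _stripe(self, key: str) -> int:
        return hash(key) % len(self._locks)

    def increment(self, key: str) -> None:
        i = self._stripe(key)
        with self._locks[i]:              # only keys in the same stripe contend
            self._tables[i][key] = self._tables[i].get(key, 0) + 1

    def get(self, key: str) -> int:
        i = self._stripe(key)
        with self._locks[i]:
            return self._tables[i].get(key, 0)

def worker(counter: StripedCounter, keys: list, rounds: int) -> None:
    for _ in range(rounds):
        for key in keys:
            counter.increment(key)

counter = StripedCounter()
keys = [f"file{i}" for i in range(8)]
threads = [threading.Thread(target=worker, args=(counter, keys, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.get("file0"))
```

With a single lock every increment would serialise; with striping, threads touching keys in different stripes never wait for each other.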
Using these techniques the database scalability limit was increased from 8 to 24 virtual processors in just a matter of days. Sensing that the next speed bump might take a lot more effort, some other really cool Solaris innovations were called on to go the next step.
The new improved scalable database engine was now working very nicely alongside the COBOL application on the E25K in the same processor set with up to 72 virtual processors (already somewhere the E6900 could not go).
For this benchmark the database consisted of a working set of about 120GB across some 100,000 files. With well in excess of 300GB of RAM in each system it seemed highly desirable to cache the data files entirely in RAM (something which the customer was very willing to consider).
The "read-only" benchmark workload actually resulted in something like 200MB of the 120GB dataset being updated each run. This was mostly due to writing intermediate temporary data (which is discarded at the end of the run).
Then came a flash of inspiration. Using clones of a ZFS snapshot of the data together with Zones it was possible to partition multiple instances of the application. But the really cool bit is that ZFS snapshots are almost instant and virtually free.
ZFS clones are implemented using copy-on-write relative to a snapshot. This means that most of the storage blocks on disk and filesystem cache in RAM can be shared across all instances. Although snapshots and partitioning are possible on other systems, they are not instant, and they are unable to share RAM.
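The sharing can be pictured with a toy model (this is not how ZFS is implemented; the classes and sizes are illustrative only). Each clone keeps only the blocks it has changed and falls back to the snapshot for everything else.

```python
class Snapshot:
    """An immutable mapping of block number to block contents."""

    def __init__(self, blocks: dict) -> None:
        self.blocks = dict(blocks)

class Clone:
    """A writable view of a snapshot that stores only the blocks it changes."""

    def __init__(self, origin: Snapshot) -> None:
        self.origin = origin
        self.private = {}

    def read(self, n: int) -> bytes:
        if n in self.private:             # block was rewritten by this clone
            return self.private[n]
        return self.origin.blocks[n]      # otherwise share the snapshot's block

    def write(self, n: int, data: bytes) -> None:
        self.private[n] = data            # copy-on-write: the origin is never modified

    def private_bytes(self) -> int:
        return sum(len(b) for b in self.private.values())

BLOCK = b"x" * 4096
snapshot = Snapshot({n: BLOCK for n in range(1000)})
clones = [Clone(snapshot) for _ in range(5)]
clones[0].write(7, b"y" * 4096)           # one clone updates one block

shared = sum(len(b) for b in snapshot.blocks.values())
private = sum(c.private_bytes() for c in clones)
print(f"shared: {shared} bytes, private: {private} bytes")
```

Five clones of a 1000 block snapshot cost one extra block here, which mirrors the benchmark where only a small fraction of the dataset was updated on each run.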
The E25K's 144 virtual processors (on 18 system boards) were partitioned into a global zone and five local zones of 24 virtual processors (3 system boards) each. The database was quiesced and a ZFS snapshot taken. This snapshot was then cloned five times (once per local zone) and the same workload run against all six zones concurrently (in the real world the workload would also be partitioned).
The resulting throughput was nearly five times that of a single 24 virtual processor zone, and almost double the capacity of a fully configured E6900.
All of the Solaris technologies mentioned in this posting are pretty amazing in their own right. The main reason for writing is to underline how extremely powerful the combination of such innovative technologies can be when applied to real world problems. Just imagine what Solaris could do for you!
Solaris: the whole is greater than the sum of its parts.
Technorati Tags: OpenSolaris, Solaris, ZFS, Zones, DTrace
( Jul 10 2007, 05:24:21 PM PDT ) Permalink
A week ago I was presenting A Brief History Of Solaris at the Sun HPC Consortium in Dresden. My slideware is pretty minimalist (audiences generally don't respond well to extended lists of bullet points), but it should give you a flavour of my presentation style and content. For more, see Josh Simon's writeup.
My main point is that although Solaris is a good place to be because it has a consistent track record of innovation (e.g. ONC, mmap, dynamic linking, audaciously scalable SMP, threads, doors, 64-bit, containers, large memory support, zones, ZFS, DTrace, ...), the clincher is that these innovations meet in a robust package with long term compatibility and support.
Linus may kid himself that ZFS is all Solaris has to offer, but the Linux community has been sincerely flattering Sun for years with its imitation and use of so many Solaris technologies. Yes, there is potential for this to work both ways, but until now the traffic has been mostly a one way street.
As a colleague recently pointed out it is worth considering questions like "what would Solaris be without the Linux interfaces it has adopted?" and "what would Linux be without the interfaces it has adopted from Sun?" (e.g. NFS, NIS, PAM, nsswitch.conf, ld.so.1, LD_*, /proc, doors, kernel slab allocator, ...). Wow, isn't sharing cool!
Solaris: often imitated, seldom bettered.
Technorati Tags: OpenSolaris, Solaris, ZFS, Zones, Linux
( Jul 03 2007, 09:50:15 AM PDT ) Permalink Comments [7]
Are you concerned about silent data corruption? You should be. History shows that silent data corruption has the potential to end your career!
In 1631 printers introduced a single bit error into an edition of the King James Bible. They omitted an important "not" from Exodus 20:14, making the seventh commandment read "Thou shalt commit adultery."
These unfortunate professionals were fined £300 (roughly a lifetime's wages). Most of the copies of this "Wicked Bible" were recalled immediately, with only 11 copies known to exist today (source Wikipedia, image Urban Dictionary).
One of the problems with silent data corruption is that we may not even notice that it is there when we read the data back. Indeed, we may actually prefer the adulterated version. Happily the 1631 error stands out against the rest of the text in which it is found.
ZFS protects all its data (including metadata) with non-local checksums. This means that it is impossible for silent data corruption introduced anywhere between the dirty spinning stuff and your system memory to go unnoticed (what happens from there onwards is entirely up to you).
ZFS is able to repair corrupted data automatically provided you have configured mirrored or RAIDZ storage pools. It's just a shame the British authorities didn't take the ditto blocks in Deuteronomy 5:18 into account way back in 1631.
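The detect-and-repair idea can be sketched with a toy two-way mirror. This is only an illustration, not ZFS code: SHA-256 is used for convenience rather than as a claim about ZFS defaults, and the class name is hypothetical. The key point is that the expected checksum is held apart from the data it protects.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # SHA-256 is an arbitrary choice for this sketch.
    return hashlib.sha256(data).digest()

class MirroredBlock:
    """Two copies of a block, with the expected checksum stored separately."""

    def __init__(self, data: bytes) -> None:
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)    # kept apart from the data it protects

    def read(self) -> bytes:
        for copy in self.copies:
            if checksum(bytes(copy)) == self.expected:
                good = bytes(copy)
                # Self-heal: overwrite any copy that fails verification.
                for i, other in enumerate(self.copies):
                    if checksum(bytes(other)) != self.expected:
                        self.copies[i] = bytearray(good)
                return good
        raise IOError("all copies failed checksum verification")

block = MirroredBlock(b"Thou shalt not commit adultery.")
block.copies[0][11:15] = b""              # silently drop "not " from one copy
print(bytes(block.copies[0]))             # the corrupted copy
print(block.read())                       # verified data from the good copy
print(bytes(block.copies[0]))             # the bad copy has been repaired
```

Without a second copy the corruption would still be detected, but read() could only report the error rather than repair it.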
Hmmm, does that make the Bible prior art for United States Patent 20070106862?
Technorati Tags: OpenSolaris, Solaris, ZFS
( Jul 02 2007, 04:27:53 AM PDT ) Permalink
"Sesame Street was brought to you today by the letter zee ..." was probably the first time I was aware of a problem. Whilst I do try to contextualise my zees and zeds, sometimes the audience is just too diverse (or I am just too old). I am grateful to my French-Canadian friend and colleague Roch Bourbonnais for proposing an alternative. So "zer-eff-ess" it is, then! After all, ZFS really is zer one true filesystem, n'est-ce pas?
Technorati Tags: OpenSolaris, Solaris, ZFS
( Jun 29 2007, 07:00:14 AM PDT ) Permalink Comments [1]
When I were a lad "RAID" was always an acronym for "Redundant Array of Inexpensive Disks". According to this Wikipedia article 'twas always thus. So, why do so many people think that the "I" stands for "Independent"?
Well, I guess part of the reason is that when companies started to build RAID products they soon discovered that they were far from inexpensive. Stuff like fancy racking, redundant power supplies, large nonvolatile write caches, multipath I/O, high bandwidth interconnects, and data path error protection simply doesn't come cheap.
Then there's the "two's company, three's a crowd" factor: reliability, performance, and low cost ... pick any two. But just because the total RAID storage solution isn't cheap, doesn't necessarily mean that it cannot leverage inexpensive disk drives.
However, inexpensive disk drives (such as today's commodity IDE and SATA products) provide a lot less in terms of data path protection than more expensive devices (such as premium FC and SCSI drives). So RAID has become a somewhat elitist, premium technology, rather than goodness for the masses.
Enter ZFS.
Because ZFS provides separate checksum protection of all filesystem data and metadata, even IDE drives can be deployed to build simple RAID solutions with high data integrity. Indeed, ZFS's checksumming protects the entire data path from the spinning brown stuff to the host computer's memory.
This is why I rebuilt my home server around OpenSolaris using cheap non-ECC memory and low cost IDE drives. But ZFS also promises dramatically to reduce the cost of storage solutions for the datacentre. I'm sure we will see many more products like Sun's X4500 "Thumper".
ZFS - Restoring the "I" back in RAID
Technorati Tags: OpenSolaris, Solaris, NAS, ZFS, RAID
( May 16 2007, 12:01:46 PM PDT ) Permalink Comments [3]
When my trusty Solaris 10 Athlon 32-bit home server blew up I enlisted a power hungry dual socket Opteron workstation as a stop-gap emergency measure. I also used the opportunity to upgrade from SVM/UFS to ZFS. But the heat and the noise were unacceptable, so I started thinking about a quieter and greener alternative ...
My initial intention was to build a simple ZFS-based NAS box, but after an abortive attempt to do things on the really cheap with a £20 mobo and £25 CPU (one of which didn't work, and I'm still waiting for Ebuyer to process my RMA), I decided I needed to make an effort to keep up with the Gerhards.
Although I'd seen Chris Gerhard's post about a system built around an ASUS M2NPV-VM, when I searched for this mobo at Insight UK (where Sun employees can get a useful discount, free next working day delivery, and expect to be treated as a valued customer), I was unable to find it. So instead, I opted for the cheaper ASUS M2V-MV (£38) ... and soon regretted it.
My config also included: an Antec SLK3000B case (£29), a quiet Trust PW-5252 420W PSU (£19), an AMD Athlon 64 3000+ 1.8GHz 512K L2 CPU (£45), two Kingston ValueRAM 1GB 667 DDRII DIMMs (£52 each), and two Seagate Barracuda 400GB 7200.10 ATA100 drives (£82 each). I also threw in a spare 80GB SATA drive and DVD rewriter I just happened to have lying around. Grand total: £399.
However, despite upgrading to the latest firmware, I couldn't get Casper's PowerNow stuff to work. My disappointment grew whilst talking to Chris about his ASUS M2NPV-VM. Not only had he got PowerNow working (rev 0705 firmware minimum required), but the mobo included gigabit networking, NVIDIA graphics, and greater memory capacity. By this time, feature creep had also set in. I could see that the machine might also be a useful workstation, and ZFS compression probably meant I could use as much CPU as I could get (sadly, the current version of Casper's code supports only a single core config).
Then I discovered that Insight UK did indeed stock the ASUS M2NPV-VM after all! It's just that their search engine is broken. So I decided to bite the bullet (£56) ... I have since reused the ASUS M2V-MV in a dual core Ubuntu config, but that's a story for another day ... perhaps.
To find out how much power I was saving, I invested in a Brennenstuhl PM230 inline power meter (shown above). Machine Mart only wanted £20 for it, and unlike some other cheap units, it does a proper job with the power factor. The only issue I've found is the crazy positioning of the power outlet relative to the display and control buttons (it's pretty obvious that UK power plugs were not considered in the original design). Anyway, here are some results:
| Mode \ Config | 2x Opteron 2.2GHz RioWorks HDAMB 2GB DDRII 6x 7200RPM | 1x Athlon 64 1.8GHz ASUS M2NPV-VM 2GB DDRII 3x 7200RPM | Intel Core Duo 1.83GHz Apple MacBook Pro 2GB DDR2 1x 5400RPM |
| standby | 40W (£23 PA) | 4W (£2 PA) | 7W (£4 PA) |
| idle | 240W (£137 PA) | 78W (£47 PA) | 34W (£19 PA) |
| idle + charging | - | - | 60W (£34 PA) |
| 1 loop | 260W (£149 PA) | 111W (£64 PA) | 50W (£29 PA) |
| 2 loops | 280W (£160 PA) | 111W (£64 PA) | 55W (£32 PA) |
| 2 loops + charging | - | - | 81W (£46 PA) |
The above calculated annual electricity costs are based on a charge of 9p per kWh. Since a home server spends most of its time idle, I calculate that my new machine will save me at least £90 per year relative to my stop-gap Opteron system. That hardly pays for the upgrade, but it does salve my green conscience a little ... just not much!
mbp$ ssh basket
Last login: Fri Apr 27 16:12:16 2007
Sun Microsystems Inc. SunOS 5.11 snv_55 October 2007
basket$ prtdiag
System Configuration: System manufacturer System Product Name
BIOS Configuration: Phoenix Technologies, LTD ASUS M2NPV-VM ACPI BIOS Revision 0705 01/02/2007
==== Processor Sockets ====================================
Version                          Location Tag
-------------------------------- --------------------------
AMD Athlon(tm) 64 Processor 3000+ Socket AM2
==== Memory Device Sockets ================================
Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
unknown in use 0   A0                  Bank0/1
unknown empty  0   A1                  Bank2/3
unknown in use 0   A2                  Bank4/5
unknown empty  0   A3                  Bank6/7
==== On-Board Devices =====================================
==== Upgradeable Slots ====================================
ID  Status    Type             Description
--- --------- ---------------- ----------------------------
1   in use    PCI              PCI1
2   available PCI              PCI2
4   available PCI Express      PCIEX16
5   available PCI Express      PCIEX1_1
basket$
Technorati Tags: OpenSolaris, Solaris, NAS, ZFS
( Apr 27 2007, 09:04:24 AM PDT ) Permalink
My current electronics hobby project is to build some ADAT "Lightpipe" repeaters and patch panels for my home studio (yes, there are commercial products out there, but they are very pricey for what they do).
ADAT uses the same plastic fiberoptic "TOSLINK" technology found on many domestic CD/DVD players, games consoles, AV processors etc, but carries up to 8x 48kHz (or 4x 96kHz) 24-bit PCM signals (i.e. a higher bandwidth requirement than traditional S/PDIF or Dolby5.1 applications).
Generally, I use RS-Components for such projects, but I've had real problems sourcing the TOSLINK transmitter and receiver parts in the UK. They are only made by Toshiba and Sharp, and despite being ubiquitous in domestic audio appliances are very hard/expensive for hobbyists to find at a reasonable price.
Googling around revealed one enthusiast who used to sell them for £6 a piece, and another UK online supplier currently quoting over £17 per device. At those prices, one might as well buy the commercial product!
RS will source non-catalogue items for account holders, but they require a £200 minimum order (plus they quote 10+ days for delivery). The big distributor EBV don't sell to the little guy at all (they require a trade account, and a minimum order of £1000).
Enter Digi-Key. They're based in the USA, but have localised online ordering for the UK and many other countries. About 36 hours ago I placed an order for 20x TOTX147L and 20x TORX147L parts (3v, shuttered, 15Mbps transmitter and receiver). The price was about £1 each, plus £10 handling because my order was under £75, and plus £12 shipping because my order was under £100.
The UPS man knocked on the door a couple of hours ago, taking COD for the £12 VAT duty due. Amazing! 40 parts for just £74. But what really blew my socks off was that it arrived about half an hour before the RS order I placed the same evening (for all the other bits I need for my project).
( Jul 13 2006, 04:26:12 AM PDT ) Permalink
It is my intention to make this a series of short postings, each looking at one or two specific tunables. Hopefully, this "Part 1" will be followed by a "Part 2" and a "Part 3" and so on -- until I run out of interesting things to say on the subject. But I'm not the most prolific blogger, so don't hold your breath!
I suffer from blogger's block. Part of my problem is gauging the amount of context which needs filling out before I can get to the meat of what I want to say. For this short series of posts I am not going to go into lots of detail about Solaris multithreading architecture. In Solaris 8 we began a U-turn on our famous MxN application threads architecture when we introduced the alternate 1-1 implementation. In Solaris 9 we dropped the MxN implementation altogether, and the thankless task of explaining why this was a good thing fell to me. I'm going to assume that you've read Multithreading in the Solaris Operating Environment, most of which still applies in Solaris 10 and OpenSolaris (we've just made it even better).
"It betrayed Isildur to his death. And some things that should not have been forgotten were lost. History became legend, legend became myth, and for two and half thousand years the Ring passed out of all knowledge. Until when chance came, it ensnared a new bearer. The Ring came to the creature Gollum, who took it deep into the tunnels of the Misty Mountains. And there, it consumed him. The Ring brought to Gollum unnatural long life. For five hundred years it poisoned his mind. And in the gloom of Gollum's cave, it waited. Darkness crept back into the forest of the world. Rumor grew of a shadow in the East, whispers of a nameless fear, and the Ring of Power perceived. Its time had now come." -- J R R Tolkien, The Lord Of The Rings
The stuff I'm going to mention used to be hidden and was intended solely for our own use -- not yours. We believe Solaris should perform well out of the box, and that the provision of tunables amounts to an admission of failure. We have never formally published or documented these things before because: we don't think you really need them; we don't want them to become de facto standards; and because we need the freedom to remove or change them in future. In many cases tweaking these tunables will actually hurt application performance. And they are a very blunt instrument, affecting all the threads and synchronisation objects within a process. So although some regions of your code may speed up, other regions may suffer, with no net gain overall.
Please use this information responsibly. It is unlikely ISVs have tested their applications with anything other than the default values -- I know we don't. Be up-front with your suppliers (including us) when dealing with support issues, and make sure you eliminate these tunables early from their inquiries.
With the advent of OpenSolaris some things which should have been forgotten are going to be found. Indeed, you will find the names of all the tunables I'm going to mention here. We can't hide this stuff any more -- nor do we want to. With OpenSolaris we say "what's ours is yours".
Of course, there may be cases where fiddling with these tunables actually helps application performance. This may be because the applications themselves are poorly written, but it may also indicate that we have further work to do in improving Solaris performance out of the box. Please do share your data! And sharing this information with your suppliers -- especially some of the debugging features -- may actually speed the resolution of some support issues.
For now, I'm just going to tell you what the tunables are. You'll have to wait for subsequent posts (or go and grok the source) for an understanding of what they actually do and how they may be useful to you.
In /usr/src/lib/libc/port/threads/thr.c we find these names:
QUEUE_SPIN ADAPTIVE_SPIN RELEASE_SPIN MAX_SPINNERS QUEUE_FIFO QUEUE_VERIFY QUEUE_DUMP STACK_CACHE COND_WAIT_DEFER ERROR_DETECTION ASYNC_SAFE DOOR_NORESERVE
The function etest() specifies a maximum value for each variable and ensures that any user-defined value falls between zero and this limit. The default values are defined elsewhere in the code. In OpenSolaris the tunables are defined as follows:
envvar                     limit      default
_THREAD_QUEUE_SPIN         1000000    1000
_THREAD_ADAPTIVE_SPIN      1000000    1000
_THREAD_RELEASE_SPIN*      1000000    500
_THREAD_MAX_SPINNERS       100        100
_THREAD_QUEUE_FIFO         8          4
_THREAD_QUEUE_VERIFY**     1          0
_THREAD_QUEUE_DUMP         1          0
_THREAD_STACK_CACHE        10000      10
_THREAD_COND_WAIT_DEFER    1          0
_THREAD_ERROR_DETECTION    1          0
_THREAD_ASYNC_SAFE         1          0
_THREAD_DOOR_NORESERVE     1          0
All but the last one also exist in Solaris 10 (and for now, the LIBTHREAD_ variants will also work). Making a change is as simple as setting an environment variable (just remember that all envvars are generally inherited by child processes).
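To make the mechanism concrete, here is a minimal Python sketch of how a limit-checked, environment-driven tunable might be read, and how a value can be scoped to a single child process. It is not the libc code. The limits and defaults are a few rows copied from the table; the names `effective_value` and `run_with_tunable` are hypothetical, and three behaviours are assumptions: out-of-range values are clamped rather than rejected, unparsable values are ignored, and the `_THREAD_` spelling wins over `LIBTHREAD_` when both are set.

```python
import os
import subprocess
import sys

# (limit, default) pairs for a few of the tunables in the table above.
TUNABLES = {
    "QUEUE_SPIN": (1000000, 1000),
    "MAX_SPINNERS": (100, 100),
    "QUEUE_FIFO": (8, 4),
    "STACK_CACHE": (10000, 10),
    "ERROR_DETECTION": (1, 0),
}


def effective_value(environ, name):
    """Return the value a tunable would take given an environment mapping."""
    limit, default = TUNABLES[name]
    for prefix in ("_THREAD_", "LIBTHREAD_"):
        raw = environ.get(prefix + name)
        if raw is None:
            continue
        try:
            value = int(raw)
        except ValueError:
            continue  # assumption: unparsable settings are ignored
        # Assumption: values outside 0..limit are clamped into range.
        return max(0, min(value, limit))
    return default


def run_with_tunable(name, value, argv):
    """Run argv with one tunable set, leaving this process's environment alone."""
    child_env = dict(os.environ)          # the child inherits everything else
    child_env["_THREAD_" + name] = str(value)
    return subprocess.run(argv, env=child_env, capture_output=True, text=True)


if __name__ == "__main__":
    # 99 is above the limit for QUEUE_FIFO, so this prints the limit, 8.
    print(effective_value({"_THREAD_QUEUE_FIFO": "99"}, "QUEUE_FIFO"))
    # The child sees the variable; on anything but Solaris libc it changes nothing.
    child = [sys.executable, "-c",
             "import os; print(os.environ.get('_THREAD_QUEUE_SPIN'))"]
    print(run_with_tunable("QUEUE_SPIN", 0, child).stdout.strip())
```

Setting the variable for one command only, as `run_with_tunable` does, keeps an experiment from leaking into every other process started from the same shell.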
I hope this has whetted your appetite for the next post, but in the meantime "hey, be careful out there"!
Tag: Solaris, OpenSolaris
( Mar 01 2006, 01:53:40 AM PST ) Permalink
The following abstracts were submitted for Sun's internal Customer Engineering Conference 2006. Of course there is no guarantee that this material will be accepted by the CEC panel, but I'd be happy to present the same (or similar) material at other events. If you're interested, please drop me a line.
Many purchasing, configuration and development choices are made on the basis of benchmark data. Industry organisations such as SPEC and TPC exist to inject a measure of realism and fairness into the exercise. However, such benchmarks are not for the faint-hearted (e.g. they require considerable hardware, software and people resources). Additionally, the customer may feel that an industry-standard benchmark is not sufficiently close to their own perceived requirements. Yet building a bespoke benchmark for a real world application workload is an order of magnitude harder than going with something "off the peg". It is at this point that an alarming number of customers make the irrational leap to some form of microbenchmarking -- whether it is good old "dd" to test an I/O subsystem, or perhaps LMbench's notion of "context switch latency". The whole is rarely greater than the sum of its parts, but the issue often ignored is that a microbenchmark -- by its very definition -- only considers one tiny component at a time, and then only covers a small subset of functionality in total. Furthermore, it is often observed that some microbenchmarks are very poor predictors of actual system performance under real world workloads.
Is there any place for microbenchmarking? Certainly, we need to be aware that customers may be conducting ill-advised tests behind closed doors. But should we ever dare engage in such dubious activities ourselves? In short: yes! In the right hands microbenchmarks can highlight components likely to respond well to tuning, and assist in the tuning process itself. This session will focus on libMicro: an in-house, extensible, portable suite of microbenchmarks first used to drive performance improvements in Solaris 10. The libMicro project was driven by the conviction that "If Linux is faster, it's a Solaris bug". However, some of the initial data made the case so strongly that we chose to adopt the Monsters Inc. slogan "We scare because we care" at first! libMicro is now available to you and your customers under the CDDL via the OpenSolaris programme. Key components of libMicro will be demonstrated during this session. The demo will include data collection, reporting and adding of new cases to the suite.
Note: I was taking them seriously about 2500 chars and two paragraphs.
The Unified Process Model is one of the best kept secrets in Solaris 10. Yet this "so what?" feature entailed changes to over 1600 source files. But was it all a waste of effort? For over a decade Sun has been recognised as a thought leader in software multithreading, but did we lose the plot when we dropped the idealistic two level MxN implementation for something much simpler in Solaris 9?
To both of these questions we must answer a resounding "No!". Indeed, the Unified Process Model, under which every process is now potentially a multithreaded process, was only made possible by a simpler, more scalable, more reliable, more maintainable, realistic one level 1:1 implementation. And all this goodness just happens to coincide with the CoolThreads revolution. As other vendors chime in with CMT, Solaris is streets ahead of Linux and other platforms in being able to deliver real benefits from this technology. It is extremely important that we are able to understand, articulate and exploit this synchronicity.
Note: this time I realised that they didn't really mean 2500 chars!
Wonder what all the fuss is about? Need a good reason before you engage your brain with this stuff? Think this may be one new trick too far for an aging dog? Just curious? Then this session is for you! We have a reputation for making DTrace come alive for even the most skeptical and indifferent of crowds -- D is certainly not for "dull" at our shows! Don't worry, we won't get you bogged down in syntax or architecture. But we will convince you of the dynamite that is the DTrace observability revolution -- that, or you are dumber than we thought! Everything you see will happen live. We don't use any canned scripts. Anything could happen. You'd be a fool to miss it!
Notes: This was a joint submission from me and Jon Haslam. We've found our combination of sound technical content and Brit humour very effective at getting across the DTrace value proposition to a wide audience. We first did our double act (Jon types while Phil talks) at SUPerG 2004. Following rave reviews we were asked to present a plenary session at SUPerG 2005.
Technorati Tags: OpenSolaris, Solaris
( Feb 06 2006, 03:08:57 AM PST ) Permalink Comments [0]
How could I not be excited about CoolThreads?! Regular readers of Multiple Threads will be aware of my technical white paper on Multithreading in the Solaris Operating Environment, and of more recent work making getenv() scale for Solaris 10. And there's a lot more stuff I'm itching to blog -- not least libMicro, a scalable framework for microbenchmarking which is now available to the world via the OpenSolaris site.
For my small part in today's launch of the first CoolThreads servers, I thought it would be fun to use libMicro to explore some aspects of application level mutex performance on the UltraSPARC T1 chip. In traditional symmetric multiprocessor configurations mutex performance is dogged by inter-chip cache to cache latencies.
To applications, the Sun Fire T2000 Server looks like a 32-way monster. Indeed, log in to one of these babies over the network and you soon get the impression of a towering, floor-breaking, hot-air-blasting, mega-power-consuming beast. In reality, it's a cool, quiet, unassuming 2U rackable box with a tiny appetite!
Eight processor cores -- each with its own L1 cache and four hardware strands -- share a common on-chip L2 cache. The thirty-two virtual processors see very low latencies from strand to strand, and core to core. But how does this translate to mutex performance? And is there any measurable difference between inter-core and intra-core synchronization?
For comparison, I managed to scrounge a Sun Fire V890 Server with eight UltraSPARC IV processors (i.e. 16 virtual processors in all). Both machines are clocked at 1.2GHz.
First up, I took libMicro's cascade_mutex test case for a spin. Literally! This test takes a defined number of threads and/or processes and arranges them in a ring. Each thread has two mutexes on which it blocks alternately; and each thread manipulates the two mutexes of the next thread in the ring such that only one thread is unblocked at a time. Just now, I'm only interested in the minimum time taken to get right around the loop.
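The ring is easier to picture with something concrete. The following is a minimal Python sketch of the idea, not libMicro's code: the function name `cascade` and the exact hand-off scheme are assumptions based only on the description above. It relies on the fact that a Python `threading.Lock` may be released by a thread other than the one that acquired it. Every lock starts out held, each thread blocks on its own two locks alternately, and each thread wakes its successor by releasing one of the successor's locks.

```python
import threading
import time


def cascade(n_threads, rounds):
    """Pass a token around a ring of threads; return (wake order, us per loop)."""
    # locks[i][0] and locks[i][1] are the two locks thread i blocks on alternately.
    locks = [(threading.Lock(), threading.Lock()) for _ in range(n_threads)]
    for pair in locks:
        for lock in pair:
            lock.acquire()  # everything starts locked, so every thread blocks
    order = []

    def worker(i):
        for r in range(rounds):
            # Block until the previous thread in the ring hands over the token.
            locks[i][r % 2].acquire()
            order.append((r, i))
            if i + 1 < n_threads:
                # Wake the next thread for the same round.
                locks[i + 1][r % 2].release()
            else:
                # End of the ring: wake thread 0 for the next round.
                locks[0][(r + 1) % 2].release()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    start = time.perf_counter()
    locks[0][0].release()  # inject the token
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return order, elapsed / rounds * 1e6


if __name__ == "__main__":
    order, us_per_loop = cascade(8, 1000)
    print(f"{us_per_loop:.1f} us/loop")
```

Because a lock acquired in one round stays held until the predecessor releases it two rounds later, no thread ever has to re-arm its own locks. Timings from this sketch say more about the Python interpreter than about the hardware, so they are not comparable with the figures quoted here.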
The default application mutex implementation in Solaris uses an adaptive algorithm in which a thread waiting for a mutex does a short spin for the lock in the hope of avoiding a costly sleep in the kernel. However, in the case of an intraprocess mutex the waiter will only attempt the spin as long as the mutex holder is running (there is no point spinning for a mutex held by a thread which is making no forward progress).
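The spin-then-sleep idea can be sketched in a few lines of Python. This is only an illustration of the general shape, not the Solaris implementation: `adaptive_acquire` and `spin_limit` are hypothetical names, the spin count is arbitrary, and the check that the holder is actually running is left out because user-level Python has no way to observe it.

```python
import threading


def adaptive_acquire(lock, spin_limit=1000):
    """Try the lock without blocking up to spin_limit times, then block.

    Returns "spun" if the lock was obtained during the spin phase and
    "blocked" if the caller had to fall back to a blocking acquire.
    """
    for _ in range(spin_limit):
        if lock.acquire(blocking=False):
            return "spun"
    lock.acquire()  # the costly path: sleep until the holder lets go
    return "blocked"


if __name__ == "__main__":
    lock = threading.Lock()
    print(adaptive_acquire(lock))  # uncontended, so the first try succeeds
```

The trade-off is the usual one: a short spin wins when the holder is about to release, and wastes cycles when it is not.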
With 16 threads running cascade_mutex the T2000 achieved a blistering 11.9us/loop (that's less than 750ns per thread)! The V890, on the other hand, took a more leisurely 25.3us/loop. Clearly, mutex synchronization can be very fast with CoolThreads!
Naturally, spinning is not going to help the cascade_mutex case if you have more runnable threads than available virtual processors. With 32 active threads the V890 loop time rockets to 850us/loop, whereas the T2000 (with just enough hardware strands available) manages a very respectable 32.4us/loop. Only when the T2000 runs out of virtual processors does the V890 catch up (due to better single thread performance). At 33 threads the T2000 jumps to 1140us/loop versus 900us/loop on the V890.
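The loop times are easier to compare once they are divided by the number of threads in the ring. The short Python sketch below does that arithmetic using only the figures quoted in the text; the helper names are hypothetical. At 16 threads the T2000 works out at roughly 744ns per hand-off, about 2.1 times quicker than the V890; at 32 threads the gap is roughly 26 times; at 33 threads the ratio drops below one.

```python
# (machine, threads) -> microseconds per loop, as quoted in the text.
LOOP_TIMES_US = {
    ("T2000", 16): 11.9,
    ("V890", 16): 25.3,
    ("T2000", 32): 32.4,
    ("V890", 32): 850.0,
    ("T2000", 33): 1140.0,
    ("V890", 33): 900.0,
}


def per_thread_ns(machine, threads):
    """Average time per hand-off in nanoseconds."""
    return LOOP_TIMES_US[(machine, threads)] * 1000.0 / threads


def t2000_advantage(threads):
    """V890 loop time divided by T2000 loop time (below 1 means the V890 wins)."""
    return LOOP_TIMES_US[("V890", threads)] / LOOP_TIMES_US[("T2000", threads)]


if __name__ == "__main__":
    for threads in (16, 32, 33):
        for machine in ("T2000", "V890"):
            ns = per_thread_ns(machine, threads)
            print(f"{machine:5} {threads:2d} threads: {ns:8.0f} ns per thread")
        print(f"      T2000 advantage: {t2000_advantage(threads):.2f}x")
```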
libMicro's cascade_mutex case clearly shows that UltraSPARC T1 delivers incredibly low latency synchronization across 32 virtual processors. Whilst this is a good thing in general it is particularly good news for the many applications which use thread worker pools to express their concurrency.
In Part 2 we will explore the small difference cascade_mutex sees between strands on the same core and strands on different cores. Stay tuned!
Technorati Tags: OpenSolaris, Solaris, NiagaraCMT
( Dec 06 2005, 11:07:52 AM PST ) Permalink Comments [0]