There is something I just love about German heavy metal. American heavy metal tends to be a little much for me. It just comes off overwrought and corny. Metallica's Black Album is my favorite of the lot. But there's just something hopelessly charming about German metal. Rammstein is a good example. You may remember them, as they caught some airtime in the US around 2000 with Du Hast. (It was the only song on alternative radio in German, so it was kinda hard to miss.) There's just something about the tone of the music that makes it so much more palatable to me than the American equivalent. It probably also helps that I speak German.
Last time I was in Regensburg, my colleagues took me out to a metal bar, and I found a new favorite song: Wir Werden Alle Sterben by Knorkator. If you love heavy metal, it's definitely worth the $0.99 to download. The gist of the lyrics is that the singer had a conversation with his manager, in which his manager suggested that he needed to write a song to uplift the spirits of his fans. This song is the result of that conversation. The title and first line of the chorus translate to "We're all going to die." Quaint, eh? The juxtaposition of the song's super heavy metal riffs with an upbeat, bouncy chorus singing that we're all going to die is just too much to resist. Go check it out!
Reuti just reminded me of a nice application of one of the new features we added in Grid Engine 6.1. Before 6.1, resource requests were limited to simple boolean AND and OR expressions. For example, when submitting a job, a user might request "-l a=sol-x*|sol-amd64 -l mem_free=4G -l exclusive=TRUE", meaning that the job must run on a Solaris i386 or AMD64 machine, and the machine must have at least 4GB of memory free, and the job wants exclusive access to the host. (AND is represented by multiple -l switches.) There was no way, however, to request, for example, Solaris on anything but x86.
Enter 6.1. With 6.1 we introduced full boolean expressions for resource requests. A user can now make requests like, "-l a=sol-*&!sol-sparc*". (The job must run on Solaris, but not on SPARC or SPARC64.) Even better, you can create complex boolean statements, like "-l a=(sol-*&!*-x86)|(lx2[46]-*&!(*-x86|*-ia64))". (The job must run on either Solaris on anything but x86 or Linux on anything except x86 or Itanium.)
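To make the semantics of those expressions concrete, here is a minimal Python sketch of how such a request could be evaluated against a host's architecture string. It is not Grid Engine's actual matcher. The function name matches is made up for illustration, and the operator precedence (! binds tighter than &, which binds tighter than |) is an assumption on my part.

```python
import fnmatch
import re

# A token is a parenthesis, one of the three operators, or a wildcard pattern.
_TOKEN = re.compile(r"[()&|!]|[^()&|!\s]+")


def matches(expression: str, value: str) -> bool:
    """Evaluate a boolean wildcard expression against one resource value.

    Illustrative model only; precedence (! over & over |) is assumed.
    """
    tokens = _TOKEN.findall(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "|":
            take()
            right = parse_and()  # always parse the right side, then combine
            result = result or right
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            take()
            right = parse_not()
            result = result and right
        return result

    def parse_not():
        token = peek()
        if token is None or token in ")&|":
            raise ValueError("pattern expected")
        take()
        if token == "!":
            return not parse_not()
        if token == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return result
        # Plain pattern: shell-style wildcards, including sets like [46].
        return fnmatch.fnmatchcase(value, token)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result
```

With this model, matches("sol-*&!sol-sparc*", "sol-amd64") accepts the host, while the same expression rejects "sol-sparc64".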
Now, to the title problem. In the email that prompted this post, Reuti responded to a question about how to submit a job to any host, except for one. With 6.1, the answer is simple. Grid Engine has a built-in complex called hostname, or h for short. Using the new boolean expressions, it's very simple to request "-l h=!badhostname", which allows the job to run on any machine except the one named badhostname.
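If you submit from a script rather than by hand, a small Python sketch like the following builds that request as an argument list. The function name and the job.sh script name are placeholders, not anything Grid Engine defines. Passing the arguments as a list also sidesteps a practical snag: many interactive shells treat ! as history expansion, so on the command line the request usually needs quoting.

```python
from typing import List


def exclude_host_command(bad_host: str, job_script: str) -> List[str]:
    """Build a qsub argument vector that keeps a job off one host."""
    # h is the short name of the built-in hostname complex;
    # the leading "!" negates the match.
    return ["qsub", "-l", "h=!" + bad_host, job_script]


# job.sh is a placeholder for your own job script.
command = exclude_host_command("badhostname", "job.sh")
```

The resulting list can be handed to subprocess.call(command) on a submit host.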
I'm a little slow on the draw, but in case you haven't noticed already, Grid Engine 6.2 Beta 2 is now ready for download! Go pull it down and give it a whirl!
In case you haven't noticed yet, we've posted the JavaOne Hands-on Labs from this year to developers.sun.com. Among the labs posted there are the Compute Server lab (7410), which includes my old DRMAA lab as an appendix, and the Project Darkstar lab (7400). If you missed them at JavaOne, now's your chance to do them from the comfort of home.
A common thing to want to do with Grid Engine is to let users request that their jobs be run as the only thing on the host(s). The naïve approach would be for the user to request a number of slots equal to the number of slots offered by the hosts, but for a plethora of reasons, that doesn't work. (Among the reasons are that we might not have the same number of slots per host, and more importantly, unless we're using a parallel environment that is configured for fill-up allocation, a job can't request all the slots on a host.) Let's talk through an approach that does work.
Let's think through this problem. A natural approach for a Grid Engine administrator would be to create a special queue on each host to which all other queues are subordinated. When jobs are running in that queue, then all other jobs on the system are suspended. That approach solves the problem (mostly), but it's a bit heavy-handed. Whenever an exclusive job gets put on a host, other jobs on that host get suspended until it is finished. If there is a steady stream of exclusive jobs, non-exclusive jobs could starve.
To fix that problem, you could set up circular subordination: make the other queues subordinate to the exclusive queue and the exclusive queue subordinate to the other queues. The effect of this circular subordination is that there can never be jobs in both the exclusive queue and any other queue, preventing the starvation issue. (If a job is running in a non-exclusive queue, the exclusive queue is unavailable (suspended), and vice versa.)
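The mutual exclusion effect is easy to see in a toy model. The Python sketch below is only an illustration of the idea, not how the scheduler is implemented. The queue names exclusive.q and all.q are made up, and it assumes a subordination threshold of one running job.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Queue:
    name: str
    running_jobs: int = 0
    # Queues that are suspended while this queue has running jobs.
    subordinates: List[str] = field(default_factory=list)


def suspended_queues(queues: Dict[str, Queue], threshold: int = 1) -> Set[str]:
    """Return the names of queues suspended by a busy superior queue."""
    suspended = set()
    for queue in queues.values():
        if queue.running_jobs >= threshold:
            suspended.update(queue.subordinates)
    return suspended


def can_start(queues: Dict[str, Queue], name: str) -> bool:
    """A job can start in a queue only if that queue is not suspended."""
    return name not in suspended_queues(queues)


# Circular subordination: each queue is subordinate to the other.
queues = {
    "exclusive.q": Queue("exclusive.q", subordinates=["all.q"]),
    "all.q": Queue("all.q", subordinates=["exclusive.q"]),
}
```

As soon as one of the two queues has a running job, the other one is suspended, so whichever queue gets a job first locks out the other until it drains.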
Another problem that crops up is keeping non-exclusive jobs from accidentally ending up in the exclusive queue. That problem is easily solved with a forced resource assigned to the exclusive queue. With a forced resource, only jobs that either request the resource or explicitly request the exclusive queue can run in the exclusive queue.
There's another problem. How do you keep multiple exclusive jobs from all running in the exclusive queue on the same host? One answer would be to only give the exclusive queue one slot. That works for non-parallel jobs and parallel jobs that are only allowed to run one slave per host. It does not work for parallel or parametric jobs where more than one task could (or should) run on a single host. One solution would be to change the forced resource to a forced integer consumable with a value equal to the number of slots. A job could then theoretically request as much of that resource as each host has, making sure that there isn't any left over for other jobs. Unfortunately, that won't work. First, we still have the problem that our hosts might not all have the same number of slots. We could try to solve that problem by setting the exclusive queue's consumable's value to 1. That guarantees that only one job can get the resource. The problem there is that a parallel job consumes one set of resources for each slave, so a parallel job with two slaves on a host will need 2 of our consumable. We could try requesting 1/<num_slaves_per_host> of the consumable for such a parallel job, so that after multiplying by the number of slaves on the host, we end up with a request for 1. That only works, however, if every host will be running the same number of slaves per host, and if we know how many that is ahead of time. "But, wait!" you say. "The consumable is an integer, so even if we request less than 1, we should still consume the entire resource!" You'd think so, but you'd be wrong. It turns out that if one job requests half of our resource, another job can still be assigned the other half, defeating our strategy.
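The arithmetic behind the fractional request is worth spelling out. This Python sketch assumes a consumable with a capacity of 1 per host and a job that requests 1/<num_slaves_per_host> per slave; the function name is made up for illustration.

```python
from fractions import Fraction


def consumed_on_host(slaves_on_host: int, assumed_slaves_per_host: int) -> Fraction:
    """Amount of the per-host consumable (capacity 1) a parallel job uses.

    Each slave consumes the per-slave request, so the total on a host is
    the request multiplied by the number of slaves that landed there.
    """
    per_slave_request = Fraction(1, assumed_slaves_per_host)
    return per_slave_request * slaves_on_host


# Guess matches reality: two slaves, each requesting 1/2, consume exactly 1.
exact = consumed_on_host(2, 2)
# Only one slave lands on the host: half the resource is left for another job.
leftover = 1 - consumed_on_host(1, 2)
```

The second case is the one that breaks the scheme: the half that is left over can be handed to another job that also asks for a fraction.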
In order to solve the problem, we need to fundamentally prevent the scheduler from looking at hosts that are running exclusive jobs. Well, one way to do that would be to remove the host from the host group that is associated with the queues whenever a job is put in the exclusive queue. We can do that from a prolog on the exclusive queue.
    qconf -dattr queue hostlist $HOST <queue>
You'd then have to add the host back from the epilog.
    qconf -aattr queue hostlist $HOST <queue>
Now, the circular subordination makes sure that jobs can run either in the exclusive queue or the other queues (but not both), our forced complex makes sure that only jobs that request exclusivity get it, and our prolog/epilog make sure that the scheduler cannot put multiple exclusive jobs on the same host. But, you guessed it, there's still a problem.
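If you would rather write the prolog and epilog in something other than shell, a Python sketch along these lines wraps the two qconf calls. The function names are made up, the host name is taken from the HOST environment variable with a fallback to the local host name (an assumption about your environment), and whoever runs the script needs permission to change the queue configuration.

```python
import os
import socket
import subprocess
from typing import List


def hostlist_command(action: str, host: str, queue: str) -> List[str]:
    """Build the qconf call that removes a host from, or adds it to, a hostlist."""
    flag = {"remove": "-dattr", "add": "-aattr"}[action]
    return ["qconf", flag, "queue", "hostlist", host, queue]


def update_hostlist(action: str, queue: str) -> int:
    """Run qconf for the local host; call with "remove" in the prolog
    and "add" in the epilog. Returns qconf's exit status."""
    host = os.environ.get("HOST") or socket.gethostname()
    return subprocess.call(hostlist_command(action, host, queue))
```

The prolog would call update_hostlist("remove", queue) and the epilog update_hostlist("add", queue), with queue set to whichever queue the host should disappear from.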
Once a job starts running in the exclusive queue, everything works as intended. The problem is that the scheduler may put more than one exclusive job on the same host at the same time. Because the host isn't removed from the host group until an exclusive job starts, we need to keep the scheduler from scheduling multiple exclusive jobs at the same time. That's where load adjustments come in. We can create a new resource, say exclusive_load, and set a load threshold for the exclusive queue based on that resource, say exclusive_load=1. By adding something like exclusive_load=50 to the job_load_adjustments attribute in the scheduler config (and probably also setting the load_adjustment_decay_time to something small, like 0:0:30), we force the scheduler to consider a host's exclusive queue to be full (for the current scheduler interval) whenever a job is put there. After the decay interval, the host becomes available to the scheduler again, but by that time the prolog should have removed it from the host group.
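To get a feel for the numbers, here is a rough Python model of the load adjustment. It assumes the adjustment decays linearly from its full value at dispatch to zero at the end of the decay time, which is my reading of how the scheduler behaves; check sched_conf(5) before relying on it. The function names are made up.

```python
def adjusted_load(adjustment: float, seconds_since_dispatch: float,
                  decay_seconds: float) -> float:
    """Artificial load still attributed to a host after a job was dispatched.

    Assumes a linear decay from the full adjustment to zero.
    """
    remaining = max(0.0, 1.0 - seconds_since_dispatch / decay_seconds)
    return adjustment * remaining


def queue_blocked(load: float, threshold: float) -> bool:
    """The queue instance is closed to new jobs while load meets the threshold."""
    return load >= threshold


# exclusive_load=50 adjustment, exclusive_load=1 threshold, 30 second decay.
still_blocked = queue_blocked(adjusted_load(50, 29, 30), 1)
open_again = not queue_blocked(adjusted_load(50, 30, 30), 1)
```

Under that assumption, a large adjustment relative to the threshold keeps the exclusive queue closed for nearly the whole decay time, which is the window the prolog needs.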
QED (Whew!)
By the way, credit for the host group/load adjustment idea goes to Roland Dittel. Unfortunately, Roland doesn't have a blog, so I can't link to it. If you run into Roland, be sure to tell him how much you'd love to see him start blogging.
I've been working in the Grid Engine team for over five years, and I'm still learning about features of the product that I never knew about. One more was just brought to my attention.
When configuring a queue in Grid Engine, you can configure a prolog and epilog. The prolog is a script or binary that is run by the shepherd before running a job. The epilog is the same, except that it comes after a job finishes. When you set the prolog and epilog for a queue, all jobs that run in that queue inherit that prolog and epilog. (A job cannot specify its own prolog and epilog, but look for that to change in a future release. (Actually, if you configure your queue's prolog and epilog to read a custom environment variable in the job's environment and exec the path it contains, you can effectively allow a job to specify its own prolog and epilog by setting them in the environment variables.))
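Here is a minimal Python sketch of that environment variable trick. The variable name JOB_PROLOG is made up, as are the function names; the queue's prolog would be this script, and a job would opt in by exporting the variable at submission (for example with qsub -v). Keep in mind that the job's script then runs as whatever user the queue's prolog runs as.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name; pick whatever suits your site.
HOOK_VARIABLE = "JOB_PROLOG"


def hook_path(environment: Mapping[str, str],
              variable: str = HOOK_VARIABLE) -> Optional[str]:
    """Return the job-supplied script path, or None if there is nothing to run."""
    path = environment.get(variable, "").strip()
    # Only accept an absolute path to something executable.
    if path and os.path.isabs(path) and os.access(path, os.X_OK):
        return path
    return None


def run_hook(environment: Mapping[str, str] = os.environ) -> None:
    """Replace this process with the job's own prolog, if it asked for one."""
    path = hook_path(environment)
    if path is not None:
        os.execv(path, [path])
```

An epilog works the same way with a second variable name.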
The epilog and prolog are well known tools. What I never noticed, though, is that not only can you specify a path, but you can also specify the user as whom the prolog or epilog should run. For example, if you set the queue's prolog to root@/path/to/my/prolog, the shepherd will execute the prolog as root, no matter who submitted the job. This is really helpful if your prolog and/or epilog needs to do something that has restricted access, such as mounting a directory or modifying the grid configuration. Because only the administrator can change the queue configuration, this feature is not a big security risk. (Actually, this feature is a compelling reason for restricting who has manager rights on your grid. Anyone who is recognized as a grid manager could change a queue to run a malicious prolog/epilog as root, submit a job to that queue, and compromise the system.)
The Solaris Cluster team today contributed over 2 million lines of source code to the open source community, completing the promise we made almost one year ago to open source the complete Solaris Cluster product under the name, Open High Availability Cluster.
Listen to a podcast with Meenakshi Kaul-Basu, Director of Availability Products, or read the official press release.
Solaris Cluster is Sun's High Availability Cluster offering. The product community hosts a blog and wiki where you can also find more information.
Open HA Cluster is part of OpenSolaris, available in the HA Clusters Community Group on OpenSolaris.org.
The Open HA Cluster source code is available under the Common Development and Distribution License (CDDL).
Phase 1, on June 27, 2007, contained the source for almost all the Sun Cluster agents.
Phase 2, on December 4, 2007, included the source for Sun Cluster Geographic Edition disaster recovery software.
Phase 3, announced today, and delivered six months ahead of schedule, contains the source for the core Solaris Cluster product, consisting of over 2 million lines of source code!
The open source code does not include some encumbered Solaris Cluster source code. Nonetheless, users can build a completely usable HA Cluster from this source with the Sun Studio 11 product.
Also available is source for parts of the Solaris Cluster Automated Test Environment (SCATE), source for the Solaris Cluster man pages, and source for Solaris Cluster Globalization (G11N).
CTI for TET, which is part of the SCATE test infrastructure, has been separately open sourced on the testing community under the Artistic License. This framework supports both Solaris Cluster and ON test suites.
In addition to the source code, there is a binary distribution of OHAC, called Solaris Cluster Express (SCX), that runs on Solaris Express Community Edition.
Consider getting involved in the HA Clusters community group:
Making Grid Engine HA with Open High Availability Cluster and OpenSolaris
At the Open Source Grid & Cluster Conference a couple of weeks ago, Ashu from the Solaris Cluster team gave a 30-minute presentation about building a highly available Grid Engine cluster using the Open HA Cluster project. (Open HA Cluster is the open-sourced Solaris Cluster.) If you've got a spare 30 minutes, it's worth a look.
I just posted this information as an answer to a question on the Grid Engine users mailing list, but I thought it was useful enough to post here, too. If you're new to Grid Engine and trying to understand what a queue is, hopefully this explanation will help.
Let's take it from the top. A queue is where a job runs, not where it waits to run. When a job is in the qw (queued and waiting) state, it has not yet been assigned to a queue. A job that has been assigned to a queue is in the r (running) state (or transferring or suspended). In the pre-6.0 days, a queue could only exist on a single host. With 6.0, we introduced the idea of cluster queues. A cluster queue is a queue that can span multiple hosts. Under the covers, it's essentially a group of pre-6.0 queues, all with the same name, and each on a different host. With one caveat. A pre-6.0 queue is composed of a long list of required attributes, like slots, pe_list, user_list, etc. Starting with 6.0, that long list of attributes is only required for the cluster queue. All of the queue instances that belong to that cluster queue inherit the attribute values from it. The queue instances are allowed, however, to override those attribute values with local settings. A common example of that is the slots attribute. When you install an execution daemon using the install_execd script, it will add a slots setting for the queue instance of all.q on that host (noted as all.q@host). And if it wasn't already clear, pre-6.0 "queue" == post-6.0 "queue instance". Post-6.0 "queue" == "cluster queue".
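The inheritance between a cluster queue and its queue instances behaves like a layered lookup. This Python sketch models it with a ChainMap; the host names and attribute values are made up for illustration, and it is not how Grid Engine stores its configuration.

```python
from collections import ChainMap

# Attribute values defined once on the cluster queue (illustrative values).
cluster_queue = {"slots": 1, "pe_list": "make"}

# Per-host overrides, such as the slots setting install_execd adds.
overrides = {
    "host1": {"slots": 4},
    "host2": {},
}


def queue_instance(host: str) -> ChainMap:
    """Effective settings for all.q@host: local overrides first,
    then whatever the cluster queue defines."""
    return ChainMap(overrides.get(host, {}), cluster_queue)
```

Here queue_instance("host1")["slots"] comes from the local override, while queue_instance("host2")["slots"] falls through to the cluster queue's value.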
So, aside from governing the number of free slots on a host, what does a queue do? It controls the execution context of jobs that run in it. It determines what parallel environments are available, what file, memory, and CPU time limits should be applied, how the job should be started, stopped, suspended, and resumed, what the job process's nice value is, etc.
Queues also have a concept of subordination. A queue that is subordinated to another queue will be suspended (along with all the jobs running in it) when jobs are running in that other queue. By default, the subordinated queue will be suspended when the other queue is full, but you can set the number of jobs required to suspend the subordinated queue. 1 is a common value, meaning that the subordinated queue should be suspended if any jobs are running in the other queue. Subordination trees can be arbitrarily complex. Circular subordination schemes are permitted, producing a sort of mutual exclusion effect.
One other oddity to point out is that the slot count for a queue is not really a queue attribute. It's actually a queue-level resource (aka complex). To allow multiple queues on the same host to share that host's CPUs without oversubscribing, you can set the slots resource at the host level. Doing so sets a host-wide slot limit, and all queues on that host must then share the given number of slots, regardless of how many slots each queue (or queue instance) may try to offer.
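The effect of a host-level slots setting can be sketched as taking the smaller of two budgets. The function below is an illustration with made-up names, not Grid Engine code.

```python
def free_slots(queue_slots: int, queue_used: int,
               host_slots: int, host_used_total: int) -> int:
    """Slots a queue instance can still offer when the host also caps slots.

    queue_used counts jobs in this queue instance; host_used_total counts
    jobs in all queue instances on the host.
    """
    queue_budget = queue_slots - queue_used
    host_budget = host_slots - host_used_total
    return max(0, min(queue_budget, host_budget))


# A 4-CPU host with two queues that each offer 4 slots, capped at 4 host-wide.
# Three jobs are running in the first queue, none in the second.
second_queue_free = free_slots(queue_slots=4, queue_used=0,
                               host_slots=4, host_used_total=3)
```

Without the host-level limit, the second queue would offer all four of its slots and the host could end up running seven jobs on four CPUs.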
Since we're talking about resources, let's talk about one of the common queue/resource configuration patterns. By default, there's nothing (other than access lists) to prevent a stray job from wandering into a queue. That's bad for queues that govern expensive resources or that represent special access, like a priority queue. To solve this problem, the most common approach is to create a resource that is forced. A forced resource (one that has FORCED in the requestable column) has the property that any queue or host that offers that resource can only be used by jobs requesting that resource (or that queue or host, in which case, the resource request is implicit). By assigning such queues forced resources, you can guarantee that stray jobs can't end up in the queue. A nice side effect is that you can also assign an urgency to that resource, meaning that jobs requesting that resource (or the queue to which it's assigned) gain (or lose) priority when being scheduled.
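The forced resource rule boils down to a small eligibility check. The Python sketch below models it with sets; the resource and queue names are made up, and the real scheduler of course considers much more than this.

```python
from typing import Set


def may_run(job_resources: Set[str], job_queues: Set[str],
            queue_name: str, forced_resources: Set[str]) -> bool:
    """Can this job be placed in a queue that offers forced resources?"""
    # Requesting the queue by name implies the request for its forced resources.
    if queue_name in job_queues:
        return True
    # Otherwise the job must request every forced resource the queue offers.
    return forced_resources <= job_resources


# Hypothetical names: a priority queue guarded by a forced "priority" resource.
forced = {"priority"}
stray_job_allowed = may_run(set(), set(), "priority.q", forced)
```

A job that requests nothing is kept out, while a job that requests either the resource or the queue itself gets in. A queue with no forced resources accepts any job under this check.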
For more information on the above topics, I recommend looking at the man pages for queue_conf(5), complex(5), and sge_priority(5).
Even if you can't speak German, this is a really cool video:
The basic gist is that Constantin and crew think the Thumper is a really cool machine, but that at $50k, it's a bit expensive for a developer to have under his desk. As an alternative, they propose a stack of inexpensive storage, in this case, a jumble of USB sticks. Using ZFS, Constantin pools together the sticks in RAID groups of three as one big storage pool. To demonstrate ZFS's recovery features, he copies a video onto the storage pool, starts the video playing, and then disconnects one of the USB hubs. Even though the storage pool loses 1/4 of its devices, the video continues playing without interruption. He then plugs the missing hub back in and shows that ZFS automatically reconstructs the pool, reintegrating the missing USB sticks. Constantin then does a ZFS export, removes all the sticks, shuffles them thoroughly, sticks them all back in in an unknown order, and then does a ZFS import. ZFS then sorts out which sticks are which and rebuilds the pool.
There's a new effort afoot in the OpenSolaris world to build a community around using OpenSolaris (and Solaris) in high performance computing. If you're interested in HPC on OpenSolaris, head over to the OpenSolaris HPC Community and have a look-see.
As a product of the community, the HPC Stack project is attempting to define what software would be needed to build a complete and useful HPC software stack for OpenSolaris. Right now, the main discussion for the HPC stack project is actually happening on the HPC Community mailing list, but feel free to jump on either list and voice your opinion.
Keep in mind that both communities are still young and in an evolving state, but you should count that as a good thing. It gives you the chance to jump in early and make a big difference in the community and/or project direction. I'm looking forward to seeing your input on the lists!
If you haven't seen this presentation yet, it's worth a look. It's a really nice overview of how to create an interesting and compelling presentation. Of course, having something useful to say also helps.