An interesting thread just went by on the Grid Engine users alias about how to disable job submissions during grid maintenance. After several suggestions, Andreas posted a nice solution worth sharing.
The obvious solutions are to disable things: disable the queue, disable the host, stop the qmaster, etc. What Andreas suggested was instead to create an empty user set (aka user list or access list) and set that as the ACL for your queue(s) (via the user_lists queue attribute). When a queue has the user_lists attribute set, only users who are members of one of the listed user lists are allowed to submit jobs to that queue. If the attribute contains only a reference to an empty list, then no user is allowed to submit jobs to that queue.
I might extend Andreas' solution a little to say that instead of being empty, the user list should contain only the administrative user who is doing the maintenance. That way, the administrator can submit test jobs to make sure that the grid works before opening it back up to the public.
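As a rough sketch of the idea, the Python below only builds the qconf command lines for entering and leaving maintenance mode; it does not run them. The user name sgeadmin, the list name maintenance and the queue names are made up for illustration, and I'm assuming that setting user_lists back to NONE is how you'd reopen a queue that had no access list before.

```python
import shlex


def maintenance_commands(admin_user, acl_name, queues):
    """Return qconf invocations (argv lists) that restrict queues to one user list."""
    queue_list = ",".join(queues)
    return [
        # Add the administrator to the user list (hypothetical names).
        ["qconf", "-au", admin_user, acl_name],
        # Make that user list the only ACL on the given queues.
        ["qconf", "-mattr", "queue", "user_lists", acl_name, queue_list],
    ]


def reopen_commands(queues):
    """Return the qconf invocation that clears the ACL again (assumes NONE was the old value)."""
    return [["qconf", "-mattr", "queue", "user_lists", "NONE", ",".join(queues)]]


if __name__ == "__main__":
    for argv in maintenance_commands("sgeadmin", "maintenance", ["all.q"]):
        print(shlex.join(argv))
    for argv in reopen_commands(["all.q"]):
        print(shlex.join(argv))
```

Printing the commands first, rather than executing them, gives you a chance to review them before touching a production grid.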
Grid Engine is a Distributed Resource Manager (DRM). Its job is to match workload to resources in the best way possible, otherwise known as scheduling jobs. What Grid Engine doesn't do is provide an automated means to run certain jobs at certain times of the day. For example, I have an application that I want started at 8:00 and stopped at 17:00, and I want that to happen every weekday of the year. You could certainly use Grid Engine to find a home for the application instance when you start it and to stop the job without needing to know where it ended up, but you can't give Grid Engine an application path and a schedule and have it do the starting and stopping for you automatically. At least not out of the box.
For that kind of automated job management, one would normally use some kind of job management tool, like cron or Autosys. In some cases, however, you can use Grid Engine to solve the problem. In this post, I'll talk about how you'd configure Grid Engine to do automated job management, and the administrative implications of doing so.
Grid Engine does have a feature that can be controlled by a schedule, and that's the enabling/disabling/suspending/resuming of queues. By configuring a calendar, you can tell Grid Engine to enable, disable, suspend or resume a queue according to a specified periodic schedule. (The difference between disabling and suspending is that a disabled queue stops accepting jobs, whereas a suspended queue stops accepting jobs and suspends all currently executing jobs.)
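To make the calendar idea concrete, here is a small Python model of the 8:00 to 17:00 weekday schedule from the example above. It is only an illustration of the decision a calendar encodes; it is not Grid Engine's calendar syntax, and the function name and state strings are my own.

```python
from datetime import datetime


def queue_state(now, start_hour=8, end_hour=17):
    """Return 'enabled' inside the weekday window and 'suspended' outside it."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    in_window = start_hour <= now.hour < end_hour
    return "enabled" if is_weekday and in_window else "suspended"


if __name__ == "__main__":
    print(queue_state(datetime.now()))
```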
Grid Engine also has support for checkpointing environments. While we do not provide any native checkpointing facilities, we allow you to configure the system to recognize and use whatever checkpointing environments your jobs may be using. One of the advantages given to a job by using a checkpointing environment is that Grid Engine is able to migrate the job under certain conditions. For example, when a job is suspended, if that job is using a checkpointing environment, the job can instead be migrated to another machine. A migration essentially consists of executing a checkpoint, stopping the executing job and resubmitting it. When the job is scheduled to a new machine, it can read its previous state from the last checkpoint and continue where it left off. In order for Grid Engine to know how to initiate a checkpoint, you tell it in the checkpointing environment configuration what commands to execute for what actions.
Why the tangent on checkpointing? Remember that what we want to do is have a job started and stopped at scheduled times. Using calendars, we could configure a queue that is suspended during the times when our jobs aren't supposed to be running. If we submit our jobs to that queue, when the queue is suspended, the jobs will also be suspended, but that's not quite what we're looking for. That scenario doesn't provide any way for the job to do cleanup after the day's operations, and if it's caught in mid operation, that operation gets suspended until the queue is resumed, possibly leaving the data in an inconsistent state. We can, however, configure a checkpointing environment that does nothing. If we say, when we submit our jobs to our queue, that they use this "null" checkpointing environment, Grid Engine will know that instead of suspending them, it can migrate them. Because the checkpointing environment has no action commands configured, the migration becomes just terminating the job and resubmitting it. Because our jobs request to be run in our special queue, and because the queue is at that point suspended, our jobs will remain pending until the next time the queue is resumed. Nifty, huh? You could also get the same effect from setting the suspend method for the queue to use qmod -rj to reschedule the job instead of sending it the usual SIGSTOP.
Let's look at what that means administratively. The administrator is managing the schedule of the queues instead of the schedule of the jobs. The jobs will run on indefinitely, being started, stopped, and restarted based on the suspension and resumption of the queues. This scheme has three implications. First, it means that jobs must not be allowed to end, whether through failure, termination or "natural causes." To that end, it may be useful to configure the epilog script for all of the queues to exit with exit code 99. Exit code 99 tells Grid Engine that the job should be rescheduled. Having the epilog scripts always return 99 guarantees that if a job ever ends for any reason, it will automatically be restarted. Second, it means that the administrator has to think in terms of indirect effects instead of direct actions. For example, to change the schedule for a job, instead of changing anything about the job, the administrator has to change the schedule for the queue in which the job will run. Third, every job schedule requires a separate queue, as each queue is only able to adhere to a single schedule. If you have an environment where there are a large number of jobs with differing schedules, you can end up with a large number of queues, which can make maintenance a little complicated.
Another administrative consideration for this solution is the lack of a means to see a single-source, comprehensive schedule of what jobs will run when. To build such a schedule, you'd first have to look at all the configured calendars and their associated queues and place them on a time line. Then you'd have to go through the list of all running and pending jobs and find the jobs that are bound to those queues, placing them in the appropriate slots on the time line. Grid Engine provides no such tool. Given, however, that all of the Grid Engine command-line tools are 100% scriptable, it would certainly be possible to write such a tool, and it probably wouldn't be that difficult. (And it would probably be written in Perl.) You'd use qconf -scall to see the list of calendars; qconf -scal name to see a calendar's configuration; parse the year and week fields to find out when it's active; qconf -sql to see the list of queues; qconf -sq name to see a queue's configuration; parse the calendar field to find out what calendar is controlling it; qstat -xml to see all job data in XML format; qstat -j jobid -xml to see XML data about a job; parse the hard_queue_list field to see what queue it requested; and finally put all the data together into a single chart.
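Here is a minimal sketch of the queue-to-calendar part of such a tool, in Python rather than Perl. It assumes the output of qconf -sq is a series of "name value" lines, and it works on a made-up sample instead of calling qconf; the helper names and the sample queue are hypothetical.

```python
def parse_qconf_config(text):
    """Parse 'name value' lines, as assumed for qconf -sq output, into a dict."""
    config = {}
    # Join continuation lines, assumed to end with a backslash.
    for raw in text.replace("\\\n", "").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        config[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return config


def queues_by_calendar(queue_configs):
    """Group queue names by the calendar that controls them, skipping NONE."""
    grouped = {}
    for cfg in queue_configs:
        calendar = cfg.get("calendar", "NONE")
        if calendar != "NONE":
            grouped.setdefault(calendar, []).append(cfg["qname"])
    return grouped


# Made-up sample standing in for the output of: qconf -sq daytime.q
SAMPLE = """\
qname      daytime.q
hostlist   @allhosts
seq_no     0
calendar   workhours
"""

if __name__ == "__main__":
    print(queues_by_calendar([parse_qconf_config(SAMPLE)]))
```

The same parser could be pointed at qconf -scal output to pull out the year and week fields, though interpreting those fields is the harder part and is not attempted here.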
Given that it's not a straightforward way to do job management, why would you want to use Grid Engine for that purpose? You'd do it so that your managed applications can share the same resources as the rest of the work that is being done on the grid, increasing efficiency, and because if Grid Engine is handling the job management instead of bringing in an additional tool, then there's one less thing to manage, fail over, back up, etc. Is it a perfect solution? No. But if it bothers you, I invite you to make it better. That's the glory of Grid Engine being an open source project! You, the user, have the opportunity to make changes and contribute them back, so that we can include them in the next rev of the product.
In case you didn't notice, the Distributed Resource Management Application API (DRMAA) has become one of the first two official recommendations from the Open Grid Forum. Along with this long-awaited official recommendation status came an exciting surprise: Platform now has a DRMAA implementation for their LSF product! The implementation comes courtesy of the FedStage Developer Network. FedStage has only produced a C binding for LSF so far (and I'm told that's all that's planned), but since the Grid Engine Java™ language binding is built on top of the C binding, it's a small step to get the Java language binding working for LSF.
To make the good news even better, not only does Platform officially endorse the implementation (I'm told they funded it), but FedStage has released it under an Apache 2.0 open source license!
With the sudden rise in customer and ISV interest and the addition of LSF to the family, I have a feeling that DRMAA has finally reached critical mass. To find out more, check out the DRMAA 1.0 IDL binding specification and the Grid Engine C and Java language binding tutorials.
One of our awesome Grid Engine community members, Rayson Ho, just sent me a link to a great article from SLAC about planning for and installing their Sun Project Blackbox system. (Thanks, Rayson!) If you're interested in what it's like to install a half-million-dollar data center in a box, definitely check it out!
Earlier this week I helped out with an RFP that included a section for Sun Grid Engine. After I sent in my contributions, it occurred to me that some of the information might be useful to others. Below is my description of the Grid Engine scheduler. It gives a pretty thorough idea of what the scheduler is capable of. In case you were wondering.
Sun Grid Engine supports mixed workloads of batch, interactive, parallel and parametric jobs.
The Sun Grid Engine scheduler is a highly configurable workload scheduler, providing a variety of options for matching workload (jobs) to available resources. The work of the scheduler is done in two distinct steps. The first step is the selection of jobs to be scheduled. The selection step is carried out by applying the various scheduler policies to arrive at a final order of importance for pending jobs. Jobs that are granted the same priority by the scheduler policies will be placed in order of submission.
After the jobs have been sorted in priority order according to the scheduler policies in place, the second step of workload scheduling takes place. In the second step, the scheduler matches the resources requested by the jobs to the resources offered by execution machines, in the order determined by the selection step. The scheduler looks at four things to finally decide where a job will be executed. First, the scheduler filters the job's potential execution host list by the job's “hard resource requests.” A job can only run on an execution host which provides the resources the job has declared as necessary. Second, the scheduler filters the list of potential execution hosts by the job's “soft resource requests.” If a job wants, but does not need, a particular resource, the job will be run on an execution host offering that resource if possible. If no such execution host is available, that resource request is ignored. Third, the scheduler will select an execution host from the list of potential execution hosts according to host load, which is the host's 5-minute average load divided by the number of processors. The least loaded execution host from the list of potential execution hosts will be selected as the destination for the job. Lastly, the scheduler will select a queue on the selected execution host according to the available queues' sequence numbers. The queue with the lowest sequence number will be selected as the destination for the job.
In the above paragraph, the default configuration was discussed. In actuality, the scheduler's behavior is highly configurable. In the third step, the execution host is selected by host “load.” The scheduler's concept of load can be configured to reflect the priorities of the organization. By default it is the normalized 5-minute load average, but it can be based on any resource in the system. Free memory, CPU speed and free disk space are examples of other common metrics. In the above paragraph, steps three and four can be swapped. In that case, the scheduler will filter the execution host list first by hard resource requests, then by soft resource requests. Then it will select a destination queue for the job based on queue sequence numbers. Finally, it will select a destination execution host for the job from the list of execution hosts on which the destination queue is available.
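The four default placement steps can be sketched in a few lines of Python. This is a toy model of the description above, not the scheduler's actual code: resources are reduced to plain sets of names, every class and function name is invented for the example, and the configurable load formula and the swapped step order are left out.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    seq_no: int


@dataclass
class Host:
    name: str
    load_avg: float   # 5-minute load average
    processors: int
    resources: set    # names of the resources the host offers
    queues: list      # Queue instances available on this host

    @property
    def normalized_load(self):
        return self.load_avg / self.processors


def place_job(hosts, hard, soft):
    """Return (host name, queue name) for a job, or None if no host qualifies."""
    # Step 1: keep only hosts that offer every hard-requested resource.
    candidates = [h for h in hosts if hard <= h.resources and h.queues]
    if not candidates:
        return None
    # Step 2: prefer hosts that also satisfy the soft requests, if any exist.
    preferred = [h for h in candidates if soft <= h.resources]
    if preferred:
        candidates = preferred
    # Step 3: pick the least loaded host (load average per processor).
    host = min(candidates, key=lambda h: h.normalized_load)
    # Step 4: pick the queue with the lowest sequence number on that host.
    queue = min(host.queues, key=lambda q: q.seq_no)
    return host.name, queue.name
```

For example, a job with a hard request for linux and a soft request for gpu lands on a GPU host if one exists, and otherwise on the least loaded Linux host.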
The scheduler supports three classes of scheduler policies: entitlement, urgency and custom.
The entitlement policy consists of three components: share tree policy, functional policy and override ticket policy.
The share tree policy is a fair-share policy that attempts to ensure that target resource shares are achieved over a configurable period of time. Entitlement shares are configured in a directed acyclic graph, called a share tree, that describes how resource usage is to be shared among users. The shares of users who have no jobs waiting to be run are shared among the other users according to the share distribution defined by the share tree. Users who have received more than their target share of the resource may later be penalized to ensure that resource usage over a given period matches the target resource shares.
The functional policy describes the relative target shares of resources for users, departments, projects, and individual jobs. Each job gets a total share of resources based on the sum of the shares it receives from its submitting user, the submitting user's department, the project to which the job belongs and the share assigned directly to the job. The shares of users who have no jobs waiting to be run are shared among the other users according to the functional share distribution. The functional policy is non-historical, meaning that users who gain extra shares of resources will not be later penalized for the overage.
The override ticket policy provides a means for the administrator to give (and later remove) extra priority to a user, department, project or individual job with respect to the share tree and functional policies. The extra priority provided by the override ticket policy ultimately results in increased resource shares for the target user, department, project or job.
The urgency policy consists of three components: deadline time urgency, wait time urgency and resource urgency.
Deadline time urgency is priority associated with a job that increases as a job approaches its start time deadline. As the current time approaches a job's start time deadline, the job's urgency will approach the maximum deadline urgency in an asymptotic fashion. When the current time has reached a job's deadline time, that job is assigned the maximum deadline urgency. To prevent abuse of the system, only a select set of users is allowed to submit jobs with start time deadlines.
Wait time urgency is priority that is collected by a job as it remains in a pending state, waiting to be scheduled. Every second spent waiting accrues more priority for the job.
Resource urgency is a priority associated with resources that is inherited by jobs requesting those resources. Assigning a resource such an urgency ensures that jobs requesting that resource are given priority in the scheduler's job selection process, in turn ensuring that the resource is in use as much as possible.
The custom policy is a number between -1023 and 1024 that represents the priority of a job, with higher numbers representing higher priorities, as described by POSIX 1003.1b. It's called the custom policy because it can be used by external processes to apply custom scheduling orders.
The job priorities derived from each of the three scheduling policies are normalized to a value between 0 and 1 within their respective policies, weighted and summed together to arrive at a final priority value. The weights used for each policy are configurable. By default the custom policy is weighted most heavily, and the urgency policy is weighted least heavily, with the entitlement policy in the middle. By default, the weights differ by an order of magnitude, ensuring a clear interaction among the policy contributions to the overall job priority.
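The normalize, weight and sum calculation can be illustrated with a short Python sketch. The weights below are illustrative values picked only to follow the ordering and the order-of-magnitude spacing described above; they are not taken from a real scheduler configuration. The min-max normalization, the 0.5 fallback when all jobs are equal, and the function names are likewise assumptions of this example.

```python
# Illustrative weights only, one order of magnitude apart.
WEIGHTS = {"custom": 1.0, "entitlement": 0.1, "urgency": 0.01}


def normalize(values):
    """Scale values to the range 0..1 within one policy (min-max, an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # all jobs equal under this policy
    return [(v - lo) / (hi - lo) for v in values]


def final_priorities(jobs):
    """jobs maps job name -> {policy: raw value}; returns job name -> priority."""
    names = list(jobs)
    totals = {name: 0.0 for name in names}
    for policy, weight in WEIGHTS.items():
        normalized = normalize([jobs[name][policy] for name in names])
        for name, value in zip(names, normalized):
            totals[name] += weight * value
    return totals


def dispatch_order(jobs):
    """Sort by descending priority; ties keep submission (insertion) order."""
    totals = final_priorities(jobs)
    return sorted(jobs, key=lambda name: -totals[name])
```

Because Python's sort is stable, jobs with equal final priority stay in the order they were added, which mirrors the rule that equally ranked jobs are placed in order of submission.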
In addition to the scheduler policies, the scheduler's decision-making process can be influenced by queue configuration, resource configuration, calendar configuration and resource quota sets. Through the above mechanisms, an organization's business rules and grid topology can be modeled through the grid, directing the scheduler to make decisions within the boundaries expressed by the reality of the grid environment. In particular, the above mechanisms provide the means to place fine-grained limits on the execution of jobs, such as how many can be simultaneously active, when they are allowed to be active, on which hosts they are allowed to be executed and by whom.
This news is a bit old by now, but better late than never. The proceedings from the Sun Grid Engine Workshop '07 are now online. We've posted a summary of the meeting and the presentations in PDF format. I was really impressed by the quality and the usefulness of the presentations this year. If you couldn't make it to Germany in person, definitely check out the slides online.
The code freeze for the 6.1u3 release has now happened, so if you want to check out the V61_BRANCH and give it a whirl, please do. (You'll have to compile it yourself, of course.) If you discover any issues, please email them to the dev@gridengine.sunsource.net mailing list.
The latest version of the 6.1 Advance Reservation snapshot is now available for download. Snapshot 3 has had the u2 changes merged in, plus it includes some new features from the main trunk: some scheduler improvements, some protocol optimizations, and a new flatfile spooling parser. Because of the new flatfile spooling parser, upgrading from a previous 6.1 or 6.1AR release using flatfile spooling will be a little trickier than usual. The reason for the new flatfile spooling parser is to consolidate spooling libraries. In 5.3, there was one flatfile spooling library, and it used all hand-written spooling code. With 6.0, we developed a new spooling library based on Lex, but we were gun-shy about replacing the qmaster's spooling library with this new piece of code. Instead, we only replaced qconf's spooling code, resulting in there being two flatfile spooling libraries. Now that the no-longer-new spooling library has had time to burn in, we're confident enough to make the switch. With 6.2, we will go back to having only one flatfile spooling library.
Perhaps most importantly, the 6.1 Advance Reservation snapshot 3 contains the latest advance reservation updates. If you're interested in advance reservation, e.g. if you work at TACC, check it out.
If you were at the Sun Grid Engine Workshop '07 in September, you heard Andy talk about the new development efforts around Sun Grid Engine 6.2. One of those development projects was to make the scheduler process into a thread in the qmaster.
When we made the transition from 5.3 to 6.0, one of several major changes was that Grid Engine became multi-threaded. Remember the commd? In 6.0 the commd went away because it was replaced with a multi-threaded communications library. The qmaster itself went from being a single-threaded processing loop to being a heavily multi-threaded daemon. The idea of the qmaster also absorbing the scheduler was discussed back then, but the transition was already complicated enough.
Now that the multi-threaded qmaster has had time to settle, we've decided it's time to shake the snow globe again. Why? Well, every scheduling run, the qmaster has to pass the scheduler everything it needs to know to make its scheduling decisions. Bringing the scheduler into the qmaster as a thread allows the scheduler direct access to the data without sending it over the wire. It's not all hugs and puppies, though. Both the qmaster and the scheduler make high demands on the same set of data. Locking will be an issue that will take a little time to optimize. Don't expect the first release that has the scheduler as a qmaster thread to be an order of magnitude faster. We may eventually get there, but it will take some time.
The potential performance increase from reduced data traffic is nice, but for me the really exciting part is that once the scheduler is a thread in the qmaster, it's a small additional step to make the scheduler into 2 threads. Once there is more than one scheduler thread, going from 2 to 4 to 8 is just a matter of tuning.
What good are multiple scheduler threads? Well, the most obvious answer is that they can share the task of scheduling jobs, potentially improving scheduling time by parallelizing the operation. Even more interestingly, multiple scheduler threads could potentially each manage a different scheduler domain. Have you ever wanted to enable the fair share (share tree) policy during working hours and the functional policy at night? Having two scheduler threads with two different scheduler domains and two different scheduler configurations would let you do exactly that. No promises on when or if scheduler domains will become a product feature, but the ground work has been laid by making the scheduler into a qmaster thread.
Why bring this up now? Well, the engineer working on the project is going to do his first check-in today. It's not yet 100%, but it should work well enough to try. (It still fails some of our regression tests.) To get a build, check out the EB_PRE_SAAT tag and compile. Feedback is appreciated. You can send comments to the dev@gridengine.sunsource.net mailing list.
It's almost 8:00 on Saturday morning. My wife is still asleep. I'm just hanging out, babysitting a pair of smoking pork butts. Since there's not much to do at 8:00 on a Saturday morning while one's wife is still sleeping, I figured now's a good time to catch up on the blogging I've put off for the last three months while I've been on the road.
To get the ball rolling, let's start with an easy one. The video below is from one of our very clever and talented Sun folks. Every year at SuperComputing, as the conference is closing down, he has a mini film festival at the Sun booth of all his videos. (If you're there on the last day, stop by for some quality entertainment.)
I just found myself looking for an answer to a picky question about using generic methods on the Java™ platform. After reading through the Generics tutorial, my question was still unanswered. With a little searching, I found Angelika Langer's excellent Java Generics FAQ, and I learned everything I wanted to know. If you want to really understand Generics, give this site a read.