Monday Jul 27, 2009

The New Jobs Drawer in Ops Center 2.5

Because Ops Center is designed to manage thousands of servers, we designed it differently than other, more traditional management systems.  In particular, the core of the Ops Center Controller is a giant job queuing system.  Almost no actions in Ops Center are synchronous.  When the user requests an action, a job is created and queued for execution.  Then the job is picked up by a Proxy for execution against a group of managed Servers or OSs.  Because this is all asynchronous, jobs can be queued against thousands of systems and executed in an orderly manner -- without blocking the user interface until they complete.
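To make the queuing idea concrete, here is a minimal Python sketch of the pattern, not Ops Center's actual implementation.  The function name run_jobs, the (job_id, action, targets) tuple shape, and the use of threads to stand in for Proxies are all assumptions made for illustration.

```python
import queue
import threading


def run_jobs(jobs, num_proxies=2):
    """Queue (job_id, action, targets) tuples and let worker threads drain them.

    The worker threads are a hypothetical stand-in for Proxies.
    """
    pending = queue.Queue()
    results = {}
    lock = threading.Lock()

    def proxy_worker():
        while True:
            job = pending.get()
            if job is None:  # sentinel: no more work for this worker
                pending.task_done()
                return
            job_id, action, targets = job
            # Run the action against every target in the job.
            outcome = {target: action(target) for target in targets}
            with lock:
                results[job_id] = outcome
            pending.task_done()

    workers = [threading.Thread(target=proxy_worker) for _ in range(num_proxies)]
    for worker in workers:
        worker.start()

    # Queuing returns immediately; the caller is not blocked on execution.
    for job in jobs:
        pending.put(job)
    for _ in workers:
        pending.put(None)

    # This sketch waits at the end only so it can return the results.
    pending.join()
    for worker in workers:
        worker.join()
    return results
```

In a real system the caller would not wait at the end; it would poll or be notified of job state instead, which is the part the Jobs Drawer covers.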

However, with all these jobs starting and completing on different schedules, Ops Center has to do a really good job of explaining the state of the world to the user.  What jobs are processing?  What has completed?  Did anything fail?  Thus, in Ops Center 2.5, we've redesigned the user experience to make job manager status even easier to access.  In versions 2.0 and 2.1, there was a section in the left-nav that took you to a dedicated screen to access job info.  Now it's available all the time.

Take a look at the screenshot below.  At the bottom-left corner of the screen is the word "Jobs" followed by a set of icons and numbers.  This shows you the current status of the job manager at a glance, all the time.  Each of those numbers tells you how many jobs are in a given state: how many have completed, how many are processing, how many have failed, and so on.

Beyond that, each of those icons is a button (with a rollover tooltip to remind you of the exact meaning of the icon) that allows you to access the Jobs Drawer.  Let's start by clicking on the icon with the Yellow Arrow.  This opens the Jobs Drawer and shows us all the jobs in the system.

Next, we might want to just focus in on the jobs that failed, so we click on the Red Stop Sign.  That filters the jobs to only show ones with Failed status.  The screen below shows what this looks like.
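The per-state counters and the Failed filter amount to a tally and a filter over the job list.  A rough Python sketch follows; the dictionary layout and the status strings are made up for the example and are not taken from the product.

```python
from collections import Counter


def status_counts(jobs):
    """Tally how many jobs are in each state (the numbers next to the icons)."""
    return Counter(job["status"] for job in jobs)


def filter_by_status(jobs, status):
    """Keep only the jobs in one state, as clicking an icon does."""
    return [job for job in jobs if job["status"] == status]


# Hypothetical job records; names and statuses are illustrative only.
jobs = [
    {"id": 1, "status": "completed"},
    {"id": 2, "status": "failed"},
    {"id": 3, "status": "processing"},
    {"id": 4, "status": "failed"},
]
counts = status_counts(jobs)
failed = filter_by_status(jobs, "failed")
```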

Now, of course, when you see you have two failed jobs you'll want to find out why.  You can double click any of the jobs in the list and bring up the details of that job (example below).

The job details show you each step in the job (many jobs have multiple components) and the specific target(s) that may have failed (jobs can be directed at multiple hosts and may succeed against some and fail on others).  This then gives you the info you need to investigate the failure, determine the problem and then, if you so choose, rerun the job against failed targets with just the click of a button.
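The "rerun against failed targets" idea can be sketched in a few lines of Python.  This is only an illustration under assumed names: a job result is modeled as a mapping from target name to True (succeeded) or False (failed), and failed_targets and rerun_failed are hypothetical helpers, not Ops Center APIs.

```python
def failed_targets(job_result):
    """Return the targets whose last run failed, in sorted order."""
    return sorted(target for target, ok in job_result.items() if not ok)


def rerun_failed(action, job_result):
    """Run the action again on failed targets only and merge the outcomes.

    Targets that already succeeded are left untouched.
    """
    retried = {target: action(target) for target in failed_targets(job_result)}
    return {**job_result, **retried}
```

For example, with a first result of {"host-a": True, "host-b": False}, only host-b is retried, and host-a keeps its earlier success.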

Thursday Jul 23, 2009

xVM Ops Center at Supercomputing

Last month, Sun was a sponsor of the International Supercomputing Conference (ISC) in Germany, and the Ops Center team was there to show off the latest in how Ops Center can be used in High Performance Computing (HPC) environments.  Prasad Pai is our own HPC rockstar who has been responsible for major supercomputing installations like TACC, KISTI, and Clemson choosing Ops Center as part of their infrastructure.

Prasad was on hand to show how Ops Center can best be used in these kinds of environments.  In particular, Ops Center's scalability and agentless hardware monitoring make it a great choice for HPC sites.  Below you can see a picture of Prasad demonstrating Ops Center at the conference.

Over the past couple of months, we've been upgrading customers from older Ops Center versions to the newest available 2.1 version -- TACC just recently upgraded to 2.1.  However, there's even more coming in 2.5 that HPC customers will like.  In particular, because today's supercomputers use so many nodes, scalability is really important.  To that end, we've built a 1,000 node cluster (nicknamed Nessie) of older hardware that we can use for internal scalability testing.  Because Nessie is dedicated to the task, we can really abuse it in ways we wouldn't do to our customers' production (or even pre-production) systems.  With this ability to try extreme use-cases we've been able to dramatically increase performance at large scale.  I plan to post a few stats about our performance work here soon.

Wednesday Jul 22, 2009

New Smart Groups Feature in Ops Center 2.5

We've reached some key internal development milestones for Ops Center 2.5, so I thought it was a good occasion to share more key bits of what's coming in this next release.  One small, but very cool feature is called Smart Groups.  Ops Center has always offered the ability to create arbitrary groupings of assets, and it still does.  However, it now offers the ability to use several pre-fabricated Smart Groups that act as queries against the data model and create automatic associations.  It's not rocket science, but it is incredibly convenient!

Let me show you how it works.  These are some screen shots I took off of a development system this morning.  When you open Ops Center 2.5, much of it looks familiar, but there are some key changes.  One of them is in the left-hand side nav bar.  In particular, there is now a drop down menu that allows you to select different filters.  In the screenshot below, it's set to the default "All Assets" filter (which is pretty much the only filter there was in 2.1).  Note these assets are a collection of hardware, operating systems, and virtual machines (both SPARC and x86).

Now, if you click on that drop down, you'll see all the pre-built filters that are now available.  These filters allow you to quickly select custom views depending on the kinds of operations you want to do.  They also provide you quick access to heterogeneous groups -- which means you can take common actions across the group.

Below you'll see the screen you get when you select the Operating Systems view.  You'll note it includes automatically built sub-groups for each major OS.  This makes it easy to do something like run a security compliance report against all your Red Hat systems.  Also, the inspector in the center pane now shows summary information for the group -- like the systems using the most CPU and memory.  This can help you quickly identify a server that may be in trouble.

And lastly, here you can see the quick breakdown that comes from the Systems filter.  This shows you all the servers, and breaks them down by different processor types.  Need to do a firmware check against all your SPARC systems?  Easy!  Need to do an emergency power down on your whole data center?  Easy!
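Since Smart Groups behave like saved queries over the asset data, the automatic sub-grouping can be pictured as a group-by on one attribute.  The Python below is a minimal sketch of that idea; the asset records and the attribute names ("os", "cpu") are invented for the example and do not reflect Ops Center's real data model.

```python
from collections import defaultdict


def smart_group(assets, key):
    """Group asset names by one attribute, skipping assets that lack it."""
    groups = defaultdict(list)
    for asset in assets:
        if key in asset:
            groups[asset[key]].append(asset["name"])
    return dict(groups)


# Hypothetical inventory; field names and values are illustrative only.
assets = [
    {"name": "db01", "os": "Solaris", "cpu": "SPARC"},
    {"name": "web01", "os": "Red Hat", "cpu": "x86"},
    {"name": "web02", "os": "Red Hat", "cpu": "x86"},
    {"name": "spare01", "cpu": "SPARC"},  # bare hardware, no OS yet
]
by_os = smart_group(assets, "os")
by_cpu = smart_group(assets, "cpu")
```

Because the groups are computed from the data rather than maintained by hand, a newly discovered asset lands in the right sub-group the next time the query runs.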

While none of this is really complicated, it's sure to make admins' lives easier -- and that is Ops Center's main job.  In the next few days, I'll post a few other cool bits coming in 2.5.  One final note: as part of this re-organization we've dropped the term Gear from the interface.  It seems some people found this term to be either confusing or even distasteful.  We've generally moved to using the term Asset.  Let me know here if you like or dislike the change (if you have an opinion)!

Monday Jul 13, 2009

Recent 2.1 Docs Updates

The writers have been busy lately adding new items to the docs set for Ops Center, and I wanted to call attention to three recent additions that have been featured over at the xVM Blog.

Ops Center 2.1 Quick Start Guides - Your fastest way to get up and running

Highly Available Ops Center Controllers - How to build a stand-by controller, so your Ops Center is always up

Working with Windows - How to use Ops Center 2.1 to monitor Windows instances (bare metal or virtualized)

Please go check them out!

It's going to be a busy few weeks here.  We're coming up on feature freeze for Ops Center 2.5 (which will be our best release yet).  Be sure to stay tuned for more news on that in the coming days.