Back in late February, I was pleased to receive an invitation to present Sun Grid Engine (SGE) at the Sun Tech Days in Saint Petersburg (note to Florida fans: I'm referring to the "original" city in Russia).
Despite some difficulties getting the trip organized (mostly the visa process, more about that in an upcoming personal blog...), I'm really glad I've come. Not only is it a great chance to meet Sun technology experts and a very enthusiastic group of users, it's also my first opportunity to visit Russia, and seeing the "Venice of the North" was enticing.
Preparing for the talk was much more of a struggle than I anticipated. First, I didn't have much time since I've been trying to put out some fires at work and also take some required training. Next, my initial laptop upgrades using Live Upgrade went perfectly (from build 70b to 79a, and then to 85), but the boot of snv_85 showed some graphical desktop problems. Then a configuration mistake and the overwrite of my good build 79a led me to finally re-install the whole system using snv_82. Finally, preparing the demos for SGE required installing the product and setting up zones, both of which were "first time" tasks for me, and I made some beginner mistakes. :(
Luckily, I did have a demo provided by Ravi Nallan from my grid team, and William Roche from the kernel team (both from my Solaris RPE team) very patiently gave me good advice on the zones setup and helped diagnose and fix all my mistakes over a 3-day period. We had actually gotten the demo nearly working about 1 hour before the talk was scheduled to start, but we still needed to fix a Java class manifest problem, which we did with the help of Scott Ritter and Roman Strobl. So it was very satisfying to go deliver the talk and not experience the dreaded "demo effect" at all, where the demo mysteriously and unexpectedly fails. The only things failing during the talk were the microphones: I used 3 different headset mikes before we finally gave up and used a big old hand-held job (which made typing the demo a challenge).
Some of the questions and answers during the talk were:
- Q: Is it possible to run Java programs on different execution platforms despite differences like path names?
A: Yes, this is possible, and generally the submitted job would refer to the same Java program available from shared storage. If needed, a wrapper script could be used to handle platform differences (slash vs. backslash characters) and then call the Java program.
- Q: I'm using OpenMPI on Java. Will future Java implementations be easier to use?
A: I actually didn't fully understand the question during the talk, but apparently the point is that JINI does offer OpenMPI support, but it would be easier for platform portability if that were part of the base JDK. Later notes: there is a project, JGrid, which aims at a grid deployment that facilitates an MPI style of programming for Java. One can also make OpenMPI calls through JNI by calling the native (C/Fortran) MPI functions. Otherwise, MPI under SGE is no different from a typical MPI cluster setup. It is just that the nodes are found dynamically rather than statically, and most of the environment setup is handled by SGE.
- Q: If a very large job is submitted, will SGE take into account addition of more nodes?
A: Once all the processes are launched, they won't be re-assigned to new nodes. However, if some jobs are still spooled but not scheduled, then those could run on new nodes (once they have been installed with the SGE exec daemon and configured by the grid administrator).
- Q: Is this product actually used for any serious commercial work?
A: Well, yes, we do have customers who pay for licenses and support contracts, and they log cases when they have problems (those sometimes lead to escalations for code fixes). And yes, it's for serious work, and they are always eager for a rapid solution or fix. Just one of many serious commercial applications is animated films: see the DNA Productions and Open RSP web sites for some examples.
- Q: SGE and other DRMs have accounting/billing information. Do any companies use grids to sell resource time to customers?
A: Actually, Sun did have a service marketed with the slogan "$1 per CPU hour" which used SGE to manage the grid, but I'm not sure about its current status. Yes, other companies have created similar offerings. SGE does offer an ARCo module for accounting/billing, something many other DRMs do not have.
- Q: What is the largest size of grid used?
A: A very good question which pertains to current deployments, the largest of which is at the TACC supercomputing center at the University of Texas at Austin. SGE currently scales to 60000 nodes (CPUs), and engineering improvements have been made in the upcoming version 6.2 to reduce key bottlenecks and allow support of 90000 nodes. At the TACC, I believe they use 4-CPU blades, which still means an impressive 15000 to 22500 hosts (60000 / 4 and 90000 / 4)!
- Q: What are the characteristics of the licence?
A: I didn't know about the SISSL (Sun Industry Standards Source License) licence, but I have found this description.
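
Coming back to the first question above (Java programs on different execution platforms), here is a minimal sketch of what such a wrapper could look like, written in Python. It is only an illustration under assumptions of mine: the shared-storage roots (`S:/grid`, `/shared/grid`), the jar name, and the function name `build_java_command` are all hypothetical and not part of SGE or of the demo from the talk.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical shared-storage mount points (assumptions, not SGE settings);
# adjust them to wherever the shared file system is mounted at your site.
SHARED_ROOTS = {
    "Windows": PureWindowsPath("S:/grid"),
    "default": PurePosixPath("/shared/grid"),
}


def build_java_command(jar_relpath, args, system=None):
    """Return the argv list that launches the shared jar on this platform.

    jar_relpath is written with forward slashes, relative to the shared root.
    system defaults to the platform the wrapper is running on.
    """
    system = system or platform.system()
    root = SHARED_ROOTS.get(system, SHARED_ROOTS["default"])
    # Joining through the platform's own path class picks the right separator
    # (backslash on Windows, slash elsewhere).
    jar = root.joinpath(*jar_relpath.split("/"))
    return ["java", "-jar", str(jar), *args]


# Typical use inside the wrapper that gets submitted as the job:
#   import subprocess, sys
#   sys.exit(subprocess.call(build_java_command("apps/demo.jar", sys.argv[1:])))
```

The job submitted to the grid would then be the wrapper rather than `java` itself, so the same submission works whichever kind of execution host picks it up.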