ISV engineering's HPC web log For HPC ISVs & OSS

Thursday Mar 19, 2009

 In my prior blogs in this series on integrating "Sun Grid Engine and MSC.Software's MD Nastran" I described the following:

1. "Sun Grid Engine and MD Nastran" [recommended SGE configurations/queues for MD Nastran users]  

2. "Part 1--How to submit "MD Nastran" (serial) jobs with Sun Grid Engine"

3. "Part 2--How to submit "MD Nastran" DMP (Distributed Memory Parallel) jobs with Sun Grid Engine"  

4. "Part 3--How to configure consumable resources (Disk, Memory, and License Tokens)"

In this final blog in the series I'll describe some SGE configuration details I mentioned in the earlier blogs.

1.  How to create "runtime limiting queues"
2.  How to create an "SGE parallel environment"

First, a quick comment on my use of SGE's CLI (command line interface) instead of the GUI tool (QMON).
I continue to give examples using the CLI because I've found it to be an extremely flexible and powerful way to change settings once you become familiar with SGE. [That said, I still go back to QMON when I'm not sure how something works; it's a really great way to see how a feature or setting relates to the other components within SGE.]



Section I:  "How to create runtime limiting queues"

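Step #1: Execute the qconf command to make a small, medium, and large queue (each "qconf -aq" call opens an editor containing a queue configuration template; save and exit to create the queue)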

#qconf -aq small.q
#qconf -aq medium.q
#qconf -aq large.q

Step #2: Modify the queues "small.q", "medium.q", and "large.q" to have different h_rt (elapsed runtime) limits.


--> First, set small.q to have complex value h_rt=5 minutes
# qconf -mattr queue complex_values h_rt=00:05:00 small.q
"complex_values" of "small.q" is empty - Adding new element(s).
root@tm19-232 modified "small.q" in cluster queue list

--> Second, set medium.q to have complex value h_rt=30 minutes
# qconf -mattr queue complex_values h_rt=00:30:00 medium.q
"complex_values" of "medium.q" is empty - Adding new element(s).
root@tm19-231 modified "medium.q" in cluster queue list

--> Third, set large.q to have complex value h_rt=99 hours (a "very large number")
# qconf -mattr queue complex_values h_rt=99:00:00 large.q
"complex_values" of "large.q" is empty - Adding new element(s).

Now verify that the above limits have been set correctly.
For example, to check the 5-minute time limit for small.q (qconf displays it as 300 seconds):
tm19-231:/dpl/sge.sc08#qconf -sq small.q
qname                 small.q
hostlist              tm19-231
.....
complex_values        h_rt=300
.....
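
The same check can be repeated for the other two queues. Here is a minimal sketch, assuming the HH:MM:SS limits used above, that prints each qconf command together with the value in seconds I would expect "qconf -sq" to report afterwards. The helper names hms_to_seconds and rt_limit_report are my own (hypothetical), not SGE commands, and nothing is changed in SGE because the commands are only printed.

```shell
#!/bin/sh
# Convert an HH:MM:SS limit to seconds, the unit "qconf -sq" displays.
# hms_to_seconds and rt_limit_report are hypothetical helper names.
hms_to_seconds() {
    # awk reads each field as a decimal number, so "05" is 5 (no octal surprise).
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# For each queue/limit pair used above, print the qconf command that sets
# the limit and the h_rt value in seconds to look for in "qconf -sq".
rt_limit_report() {
    for pair in small.q=00:05:00 medium.q=00:30:00 large.q=99:00:00; do
        queue=${pair%%=*}    # text before the "="
        limit=${pair#*=}     # text after the "="
        echo "qconf -mattr queue complex_values h_rt=$limit $queue"
        echo "  expect: complex_values h_rt=$(hms_to_seconds "$limit")"
    done
}

rt_limit_report
```

The three limits convert to 300, 1800, and 356400 seconds respectively.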


Section II:   How to make a parallel environment (PE) called "nastran" for Nastran DMP (Distributed Memory Parallel) jobs:
First, some background.
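What is a parallel environment?  In SGE, a parallel environment (PE) is a named configuration that tells the scheduler how many slots parallel jobs may use, how those slots are allocated across one or more hosts, and which procedures to run when starting and stopping a parallel job.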
Before you continue you might want to also read my earlier blog on MD Nastran's DMP (Distributed Memory Parallel) capability:  

Here are the steps for creating a Nastran "parallel environment" in SGE.

Step #1: Create the "nastran" parallel environment

First check to see what parallel environment(s) already exist in your SGE environment:
tm19-231:/dpl/sge.sc08#qconf -spl
make

Now add a  parallel environment  (let's call it "nastran") to SGE using the following "qconf -ap" command:
[the -ap option (add parallel environment) displays an editor containing a parallel environment configuration template.]
tm19-231:/dpl/sge.sc08#qconf -ap nastran
pe_name           nastran
slots              0
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

Now edit the default parameter settings. I changed two of them as follows: (1) "slots 16", to utilize all cores in my two 8-core machines, and (2) "allocation_rule $round_robin", so that SGE assigns a job's slots to the hosts in turn, one slot per host at a time, which spreads each job across the machines instead of filling one host first. Note that the configuration shown below also has "control_slaves" set to TRUE and "job_is_first_task" set to FALSE.

tm19-231:/dpl/sge.sc08#qconf -sp nastran
pe_name            nastran
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
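
The same parallel environment can also be created without the interactive editor, by writing the settings to a file and loading it with "qconf -Ap" (add a parallel environment from a file). Here is a minimal sketch that reuses the values from the listing above. The helper name write_nastran_pe and the temporary file path are my own choices, and the qconf call only runs on a host where SGE is installed.

```shell
#!/bin/sh
# Write the "nastran" PE definition (values copied from the listing above)
# to the file named by $1. write_nastran_pe is a hypothetical helper name.
write_nastran_pe() {
    # The quoted 'EOF' keeps $round_robin literal instead of expanding it.
    cat > "$1" <<'EOF'
pe_name            nastran
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
EOF
}

pe_file=${TMPDIR:-/tmp}/nastran_pe.$$
write_nastran_pe "$pe_file"

# Load the file only where SGE is installed; otherwise just display it.
if command -v qconf >/dev/null 2>&1; then
    qconf -Ap "$pe_file"
else
    cat "$pe_file"
fi
rm -f "$pe_file"
```

Keeping the definition in a file also makes it easy to put under version control or to reuse on another cluster.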



Step #2: Attach/associate one or more "queues" to the "nastran" parallel environment using "qconf -mq"
...for this example I'm only attaching queue "large.q" to the parallel environment "nastran".

tm19-231:/dpl/sge.sc08#qconf -mq large.q
qname                 large.q
hostlist              tm19-231 tm19-232
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
"/var/tmp/9512-O3JEa7" 50 lines, 1494 characters

....now edit the above "pe_list"  setting to add the "nastran" parallel environment
to this "large.q" queue.

qname                 large.q
hostlist              tm19-231 tm19-232
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make nastran
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
"/var/tmp/9512-O3JEa7" 50 lines, 1494 characters

Step #3: You're now ready to submit jobs to the "nastran" parallel environment:

[Only queues that have been associated with the "nastran" parallel environment are available for these jobs. In my example only "large.q" is associated with it, but you can also add other queues (like the small.q and medium.q queues described earlier). In addition, the queue must satisfy any resource requirements specified with "qsub -l" when you submit your job.]
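
Whether a queue is associated with the parallel environment can be checked from its pe_list line before submitting. Here is a minimal sketch, assuming the pe_list value fits on a single line of the "qconf -sq" output. The helper name pe_attached is my own (hypothetical). The comments also show "qconf -aattr", which adds a value to a queue's list attribute without opening the editor.

```shell
#!/bin/sh
# Succeed if PE name $1 appears in the pe_list line of "qconf -sq" output
# read from stdin. pe_attached is a hypothetical helper name.
# Assumes pe_list fits on one line (no backslash continuation).
pe_attached() {
    awk -v pe="$1" '
        $1 == "pe_list" {
            for (i = 2; i <= NF; i++) if ($i == pe) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Non-interactive alternative to editing pe_list in "qconf -mq":
#   qconf -aattr queue pe_list nastran large.q
# Then confirm the result on the SGE host:
#   qconf -sq large.q | pe_attached nastran && echo "nastran attached"

# Demonstration on the two pe_list lines shown in the listings above:
echo "pe_list               make" | pe_attached nastran || echo "before: not attached"
echo "pe_list               make nastran" | pe_attached nastran && echo "after: attached"
```

Matching whole words avoids a false positive when another PE name merely contains "nastran".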

As described in a prior blog, here's the script I used to submit a Nastran job to the parallel environment "nastran".

#cat nastran_qsub.sh
qsub -pe nastran 2 -q large.q  -l mem_free=6G -l nastran_tokens=`token_estimate.sh`  -l export_size=10G  -S /bin/ksh nastran_sge.sh

The above SGE command line defines the following:
1. "qsub" : qsub is the SGE command used to submit the Nastran job submittal script "nastran_sge.sh" to the parallel environment "nastran" I configured within SGE.  The standard Nastran job submittal command "mdnast2008..." that Nastran users are familiar with is in the script "nastran_sge.sh" (covered in that prior blog).

2. "-pe nastran 2" : This tells qsub to submit "nastran_sge.sh" to the SGE parallel environment "nastran" and to request two slots.  The "nastran_sge.sh" script will then start two MD Nastran jobs (running in parallel) on one or more of the host machines defined in the queue (large.q).
[Queues suitable for this job, like large.q, are those associated with the "nastran" parallel environment through their pe_list setting. Suitable queues must also satisfy the resource requirements given with the qsub -l options (see item 4 below).]

3. "-q large.q" : I configured three SGE queues for my Nastran environment (small.q, medium.q, and large.q), each with a different limit on the elapsed time allowed for jobs within the queue.  In this example I chose large.q to handle a job with a long elapsed time.

4. "-l mem_free=6G -l nastran_tokens=`token_estimate.sh` -l export_size=10G" : These options request the "complex resource attributes" that I configured within SGE. If any of these requirements (free memory, tokens, or disk space) is not met, the job will not be dispatched.
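
To experiment with different slot counts or queues, the command line described in items 1 to 4 can be assembled from its parts and printed before it is run. This is a minimal sketch. The helper name nastran_qsub_cmd is my own (hypothetical), and the token count is passed in as a plain number because token_estimate.sh is not reproduced here.

```shell
#!/bin/sh
# Assemble the qsub command line described above and print it (dry run).
# nastran_qsub_cmd is a hypothetical helper name, not part of SGE.
#   $1 = number of slots, $2 = queue name, $3 = Nastran license tokens
nastran_qsub_cmd() {
    slots=$1; queue=$2; tokens=$3
    echo "qsub -pe nastran $slots -q $queue" \
         "-l mem_free=6G -l nastran_tokens=$tokens -l export_size=10G" \
         "-S /bin/ksh nastran_sge.sh"
}

# Example with a made-up token count of 50:
nastran_qsub_cmd 2 large.q 50
```

Once the printed line looks right, it can be pasted into the shell or piped to sh on the submit host.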



This concludes my series of blogs on integrating "Sun Grid Engine and MD Nastran".  Feel free to send me comments or suggestions on additional configurations/queues/settings that you think might be useful to MD Nastran users. Based on the input I receive I may start up another series of blogs on this topic to cover those additional suggestions.





