fastDNAml_1.2.2p Don Berry University Information Technology Services Indiana University, Bloomington, IN dkberry@indiana.edu ----------------------------------------------------------------------------- FastDNAml_1.2.2p is a parallel version of Gary Olsen's fastDNAml_1.2.2. It consists of a master program, which constructs the trees to be evaluated, a set of worker programs which do the actual evaluation and branch length optimization, and a "foreman" program which manages all the workers. This structure is based on the P4 parllel code, fastDNAml_1.0.6, that Olsen, Hideo Matsuda, Ray Hagstrom and Ross Overbeek released in 1992. In their code they used the Argonne P4 message passing library to pass messges among all the processes. In fastDNAml_1.2.2p, this library has been replaced by a set of special fastDNAml communication functions, which in turn invoke functions from either the PVM or MPI message passing libraries. This isolates the main code from any specific message passing library, so that ports to new libraries will be easier. The fastDNAml communication functions for MPI are contained in file comm_mpi.c, and those for PVM are in comm_pvm.c. There is also a comm_seq.c, which contains basically stubs, so that one can even compile a sequential code. This sequential version could then replace Gary's original v1.2.2 code. P4 has rather fallen out of favor for parallel programming (although it provides the basic "abstract device interface" for the Argonne MPICH implementation of MPI), but if a P4 version is ever needed, a comm_p4.c file can easily be constructed without touching the rest of the code. Another major upgrade to the P4 code is installation of some fault-tolerance for running on networks of computers, where nodes may drop out or become unreachable. In fact, the PVM code has been run on a world-wide network consisting of several nodes of an IBM SP at Indiana University, several more SP nodes at National University of Singapore, and several DEC Alpha machines at Australian National University in Canberra. It recovered and continued the calculation just fine when various worker processes were killed. The MPI version is based on the MPI-1 standard, which assumes a static set of processes. MPI was originally targeted to massively parallel machines where one could assume that either all processes would stay up for the duration of the run, or they would all go down together. Consequently, MPI assumes tighter coupling among processes. In some implementations, if one process dies, the whole application may abort, although it will probably keep running if a process just becomes temporarily unreachable. So the fault-tolerance built into fastDNAml_1.2.2p is partially defeated by some finicky MPI implementations. In contrast, an advantage of PVM is that it was designed to be used in dynamic environments such as local area networks of workstations. PVM includes functions for dynamically adding and deleting hosts to/from the virtual machine, and for starting and stopping processes. To make use of these features, fastDNAml_1.2.2p includes a special user interface program called "cmon". With it, a user can add hosts and spawn and kill worker processes. (Cmon currently just has the most basic functions. As we see other functions that might be useful, they will be added to future releases.) Cmon is discussd more at the end of this document. The programs comprising the parallel application ------------------------------------------------------ Before using the parallel program, you should read Gary's documentation in file fastDNAml_doc_1.2.txt. It explains his original v1.2.2 serial code. This file gives an overview of how the parallel program works, how to compile it, and how to run it. It also lists some differences regarding input and output. The serial application is called "fastDNAml", just as in Gary's code, while the master processes of the MPI and PVM applications are "mpi_fastDNAml" and "pvm_fastDNAml", respectively. The MPI and PVM applications actually consist of four programs: mpi_fastDNAml pvm_fastDNAml This is the "top-level" program of the four, and the user would run either mpi_fastDNAml or pvm_fastDNAml to start up the parallel application, just as he would run fastDNAml to start up the serial application. But in the parallel case, this program just figures out which trees to evaluate, and sends them off to the foreman for eventual evaluation. Because of this, *_fastDNAml is often called the "master" program. (In fact the source code is master.c.) mpi_foreman pvm_foreman Recieves trees from *_fastDNAml and controls their evaluation by a set of worker processes. This process combines the functions of the dispatch and merge programs in the old Argonne p4 version of fastDNAml, and includes some fault recovery features to deal with workers which may die or become unreachable. This is necessary for running on fault-prone clusters of workstations. mpi_worker pvm_worker The program that does all the work of optimizing branch lengths and evaluating tree likelihoods. You would run several instances of this program. mpi_dnaml_mon pvm_dnaml_mon Monitors the parallel program and prints progress reports, or "tele- metry" to stderr. The user can specify three levels of telemetry. Under some implementations of MPI and PVM, this process may require its own processor. If you do not care about getting telemetry during the run, you can omit the monitor and recover use of its processor for an extra worker. The run can still be monitored by looking at the .bst file, which is output by the master process. If you *do* want to run the monitor, you must specify it on the command line rather than mpi_fastDNAml or pvm_fastDNAml. Also, *_dnaml_mon is used to monitor only the parallel application. The serial code monitors itself if you specify the -d option as described below. cmon Mpi_dnaml_mon and pvm_dnaml_mon print out all the telemetry on a scrolling screen, as though it were a stream of printer paper. The cmon monitor uses the curses library to update a single screen with the telemetry as the run progresses. (cmon is my unimaginative acroynm for "curses monitor".) This monitor is only available for the PVM version. In addition to providing a more useful display of the telemetry, cmon also allows user to interactively add new hosts to the virtual machine and to start and stop worker processes. Compiling -------------------------------- There is a makefile for the IBM SP and one for LINUX clusters. To make the serial version for a LINUX Beowulf cluster, just type, make -f Makefile.LINUX serial To make the MPI and PVM parallel versions, you will first have to set several MPI and PVM macros at the beginning of the makefile to match the locations of MPI and PVM on your machine. Then make the MPI or PVM versions by typing either make -f Makefile.LINUX mpi or make -f Makefile.LINUX pvm You can also make each program separately, for example, make -f Makefile.LINUX mpi_fastDNAml You have to make cmon separately: make -f Makefile.LINUX cmon Program arguments, options, input and output --------------------------------------------- The input sequence data files for fastDNAMl_1.2.2p are the same as Gary describes for his fastDNAml_1.2.2 (see fastDNAml_doc_1.2.txt). But fastDNAml_1.2.2p does not read them from stdin. Rather, you give the file name on the command line and it reads the file directly. This may be an inconvenience if you have been taking advantage of the shell scripts to massage input files and feed them to fastDNAml via its stdin, but that method ran into some complications in the parallel code. I hope to iron these problems out in a future release. Besides the required input data file name, there are four optional command line arguments, and the PVM version requires specification of the number of workers desired. Suppose you want to run the PVM version, and you want to monitor it. The command line would take the following form: pvm_dnaml_mon [-d infolevel] -p #_workers [-n run_name] [-w work_dir] [-s] \ sequence_file or cmon [-d infolevel] -p #_workers [-n run_name] [-w work_dir] [-s] sequence_file The square brackets indicate optional arguments. Meanings are as follows: -d infolevel This causes *_dnaml_mon to output telemetry about the run to its stderr. There are 4 levels, each one of which includes the telemetry of the levels below it: 0 = (default) Currently, this does the same as level 1. 1 = measure and print out at the end of the run the wallclock time, user CPU time, and system CPU time for each process or thread 2 = print a "step time" after every fifth taxon has been added to the tree 3 = print the number of every taxon added to the tree 4 = print a "+" sign for every tree the master sends to the foreman for evaluation, and a "-" sign for every evaluated tree the foreman receives back from a worker. These levels are also pertinent to the cmon monitor, but it updates tree count fields and CPU and wallclock time fields on the terminal screen for each of the workers, rather than printing + and - signs. To get the most telemtry when using cmon, use -d4. -p #_workers The number of workers you want. This is only for the PVM program. For the MPI program, you specify the programs to be run in some sort of command or process group file (see "Running the MPI Application" below). See the documentation for the MPI installed on your machine. -n run_name A name for the output files. If you omit this, it defaults to the name of the sequence data file. FastDNAml_v1.2.2p writes four files, each named with the run name and a 3-character extension: run_name.log - Written by dnaml_mon. A line is written to this file at each step time, the same information the monitor writes to its stderr for infolevel 2. This file is written independent of the infolevel option. run_name.cpt - A checkpoint file, written by the master after each taxon is added (and the best tree found) and after each round of branch swapping. If the run is iterrupted this file can be attached to the end of the sequence data file and the run can be restarted from that point using the R option. run_name.bst - The "best tree" file, output by the master as the run progresses. This file contains details about the order of addition of the taxa, how many trees were tested, what the best likelihoods were, and an ASCII drawing of the final tree with branch lengths and confidence intervals. run_name.trf - The final best tree. -w work_dir This option allows you to set the working directory at run time, a feature which can be useful in certain cirumstances. It is passed to the monitor, master, foreman, and all workers. The workers need a working directory only if they are going to read the sequence data file themselves (see -s option). This is the only file I/O they do. Work_dir can be either a relative or absolute path. The default is the current working directory. -s Instructs only the master to read the sequence data file, and ship it to the workers in a message. Without this option, each worker must be able to read the sequence data file itself. This could be a problem on some systems if you want to run hundreds of workers, or on heterogeneous clusters where the machines do not have a common file system. To avoid the headache of staging the data on every file system, you can select the -s option. On the other hand, if the sequence data file is large, and you are running the program on a slow or wide-area network, it could take some time for the file to be shipped out to all the workers. If you want to do a lot of runs with the same sequence data, the effort of copying the file by hand to every computer might be amortized over the many runs. Running the MPI Application -------------------------------- Different MPI implementations have different ways of starting up programs. Here we will describe how to do it under MPICH, the implementation from Argonne National Lab, using the ch_p4 device. First create a p4 procgroup file containing the machines you want to use and the programs to run on them. For example, if a Beowulf cluster consists of machines ppc00, ppc01,...ppc32, and you want to run the monitor and 5 workers, you could create the following procgroup file: ppc00 0 ppc00 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_foreman ppc00 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_fastDNAml ppc06 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker ppc07 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker ppc08 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker ppc09 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker ppc10 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker Procgroups are described in the MPICH documentation from Argonne. In this example we are assuming you start mpi_dnaml_mon by hand on ppc00. Each line says "run 1 instance of the given program on the specified machine". The first line is there to indicate mpi_dnaml_mon was started by hand. Since mpi_dnaml_mon, mpi_fastDNAml, and mpi_foreman accumulate very little user time relative to the workers, it is usually possible to run them all on the same node (in this case ppc00) without hurting performance. You can list the programs in any order. For example,if you do not want to run mpi_dnaml_mon and you want to run the mpi_foreman on ppc08, the procgroup file could be, ppc00 0 ppc00 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker ppc06 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker ppc07 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker ppc08 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_foreman ppc09 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker ppc10 1 /home/jsmith/fastDNAml_1.2.2p/src/mpi_worker In this case, mpi_fastDNAml (started by hand) and one worker would run on ppc00. Let's say this procgroup file is mpi56.pg. To run the program on the sequence data file test56.phy, and assuming testdata is the current working directory, you would enter the command, mpirun -p4pg mpi56.pg mpi_fastDNAml -n mpi56 -s test56.phy Here, infolevel = 0 (because we are not running dnaml_mon) run_name = mpi56 sequence_file = test56.phy work_dir = current working directory and we have requested mpi_fastDNAml to ship the sequence data to all workers. An example set of output files for a run similar to this is given in the testdata directory. The command used for that run was mpirun -p4pg mpi56.pg mpi_dnaml_mon -d4 -nmpi56 -s test56.phy The output files the run produced are, mpi56.bst -- the best-tree file mpi56.cpt -- the last checkpoint file, in Newick format. mpi56.log -- the logfile output by mpi_dnaml_mon mpi56.trf -- the final best tree, also in Newick format. mpi56.mon -- this is stderr captured from dnaml_mon. PBS Batch Jobs of the MPI App on Beowulf Machines -------------------------------------------------- Since fastDNAml consists of 4 different programs, it is a little tricky to run it under PBS. The problem is PBS creates a PBS_NODEFILE containing the machines it has chosen, and you have to convert this into a p4 procgroup file like the ones shown above. A perl script called mkp4pg is included in the testdata directory for doing this. There is also an example PBS script, mpi_test.pbs. Study that script, and you should be able to figure out how to make it work on your system. MPICH/PBS is the only environment we have for batch MPI programs on our Beowulf cluster at IU. If you have some other MPI implementation and/or batch queueing system, please tell us how you got it to work and we will inlcude it in this documentation. LoadLeveler Batch Jobs of the MPI App on the IBM SP ---------------------------------------------------- MPI programs on the the SP are run under control of POE (the Parallel Operating Environment). There is an example LoadLeveler batch script, css0_ip_02x04.ll, in the testdata directory which runs the MPI application on two Winterhawk2 nodes, using all 4 processors of each node. It uses internet protocol over the SP switch. Since mpi_fastDNAml is a MPMD (Multi-Program, Multi-Data) type of parallel program, you have to specify the "-pgmmodel mpmd" option to POE and provide a file (in this case SP_poe_08.cmd) listing the programs to run. Once you get past the hurdle of learning LoadLeveler and a little POE stuff, the rest of the story is the same as already described. Running the PVM Application -------------------------------- PVM is a more natural fit to the master/worker structure of this code than MPI. MPI is good for tightly coupled parallel applications where relatively large messages containing highly structured data are passed frequently among proces- ses. But fastDNAml_1.2.2p does relatively infrequent communication of small to medium size messages, and hence has a very low communication/computation ratio. A tree of n taxa can be converted to a Newick string of around n*42 bytes, which is only 4200 bytes for even 100 taxa. A message of this size could be sent over a 10 Mbit/sec ethernet network in less than 10 msec. If a worker process spends 30 seconds finding the optimum branch lengths and evaluating likelihoods, and it takes another 10 msec to send the result back, the comm/comp ratio would be less than 6.7E-4, a very favorable number. This is why fastDNAml is a good par- allel application for networks of workstations. The communication/computation ratio is still good even for wide area networks of workstations and/or multi- processors, as long as the network is not too congested. PVM was designed to ease the construction and execution of parallel applications on such hetero- geneous clusters. We will not describe the PVM system here. The PVM developers have set up a good web site at http://www.epm.ornl.gov/pvm/pvm_home.html, which has, among other information, an HTML version of their book, "PVM: Parallel Virtual Machine A Users' Guide and Tutorial for Networked Parallel Computing". The book is also availble from MIT press. Assuming you have a basic understanding of how PVM works, here is how to run pvm_fastDNAml. Start the PVM console, specifying a hostfile if you wish, or adding the hosts after starting it. Then either "quit" the console or go to another window and start up pvm_dnaml_mon (if you want to monitor the run) or pvm_fastDNAml (if you do not). The command line arguments are almost the same as for the MPI version. The one difference is that you specify the number of workers to spawn with a -p option, instead of specifying the total number of processes to run as in a P4 procgroup. For instance, to do an unmonitored run with 8 workers on the hosts in hostfile, first start up the virtual machine by typing pvm hostfile then either "quit" PVM or go to another window and type $HOME/pvm3/bin/$PVM_ARCH/pvm_fastDNAml -p8 -ntest56 \ -w$HOME/fastDNAml_1.2.2/testdata -s test56.phy If you do want to monitor the run, say at infolevel 3, then type $HOME/pvm3/bin/$PVM_ARCH/pvm_dnaml_mon -d3 -p8 -ntest56 \ -w$HOME/fastDNAml_1.2.2p/testdata -s test56.phy You must make sure the executables are where PVM can find them. You can specify this with "ep" lines in the hostfile, or use the default location, $HOME/pvm3/bin/$PVM_ARCH. Also, we have used the -s option here to make the master ship the input file test56.phy to all the workers. Pvm_dnaml_mon will spawn the master. If the master starts up successfully and can read the input file, it will spawn the the foreman, informing it to spawn the 8 workers. This is a different startup method from the MPI version. Since MPI does not have any dynamic process management functions, all of the MPI processes start simultan- eously (at least from the programmer's and user's point of view) and they ren- dezvous to see who is alive and how to reach each other. If the program crashes, look in /tmp/pvml. on each machine. There are a number of errors that will be logged there. The most common are that an executable or the input data file were not found. PBS Batch Jobs of the PVM App -------------------------------------------------- As with the MPI application, running pvm_fastDNAml as a PBS batch job is a little tricky. Some batch queuing systems, such as IBM's LoadLeveler, are able to set up and manage a parallel virtual machine for batch runs, much as the PVM console does for interactive runs. PBS does not have this capability. Rather, you have to start up and shutdown the virtual machine yourself from within your batch script. PBS publishes the nodes it allocates to your job in a file specified in environment variable PBS_NODEFILE. In your PBS script you then feed this file to the PVM master daemon, pvmd3, which will start up a PVM daemon on each node. Then you run the program the same way as described above. After your run is finished, you must shut down the parallel virtual machine. The easiest way is to send a SIGTERM to the master pvmd3. Before it terminates, it sends the signal to all the other pvmd3s, thus shutting down the whole virtual machine. There is an example PBS script called pvm_test.pbs in the testdata directory. We used this on our Beowulf cluster at Indiana University. How to use the cmon monitor -------------------------------------------------- The section, "Program arguments, options, input and output" described how to start up cmon. Here I will explain the (very few) commands you can give it interactively. add -- Add a new host to the virtual machine. This is exactly like the "add" command in the PVM console. spawn -- Spawn a worker on the specified host. kill -- cmon maintains a list of workers on the curses screen, listing their PVM task IDs, the hosts they are running on, and their CPU usage and wallclock runtimes. To kill a process use the "kill" command and specify the task ID. The process will not be killed until it returns the tree it is currently working on. Thus no trees are lost by killing processes. You can even kill all the workers. The master and foreman will just wait until you spawn at least one new worker, and then calculations will resume. quit -- quit the monitor. BUT, do not quit before the program is finished. In the future, I hope to make cmon attachable to and detachable from a running PVM application, but that is harder than it sounds. So for now, once you start cmon, pvm_fastDNAml, pvm_foreman, and the pvm_workers, leave cmon running until the whole program is finished. Well, that's it. Cmon is not a finished product, but rather a proof-of-concept for dynamic process management of parallel fastDNAml. It allows enough features that users can play with it and perhaps suggest new features that would be useful. For one, it would be nice to incorporate all the features of the standard PVM console into cmon, so you would not have to start up the console AND cmon. You could just start up cmon, "add" a bunch of machines to the virtual machine, start up fastDNAml, dynamically adjust the infolevel (the -d option), and so on. These, and probably many other things are all feasible, but it's a lot of work, and I just wanted to get something running to play with. Known Problems -------------------------------------------------- 1. In cmon, you can't backspace if you make a typo in entering a command. Solution is to just enter the botched command, let cmon tell you it doesn't understand it, and reenter it correctly. 2. In the MPI version, if you make a mistake in specifying the sequence data file, mpi_master aborts the whole application. This is what I designed it to do, but in our MPICH implemenation of MPI on our Beowulf cluster, not all the MPI processes get killed. So if this happens, you should check each of the machines and kill any remaining processes. 3. In spite of all my efforts to provide useful error messages, sometimes they just don't come out right. Gary Olsen was very careful to cover every possible error that could happen in his serial code, and all his error handling is still intact in the parallel version. But mine, for handling errors about working directory path, sequence file name, command line errors, and others related to the parallel code, is still a bit clunky. 4. The foreman has to decide when a worker has gone AWOL. Currently, this is done by just setting a timeout clock every time a worker is sent a tree to work on. If the worker does not send back a result within the timeout period, the foreman marks it AWOL and gives its tree to another worker. The timeout period is hard-coded to 120 seconds. If a tree actually takes longer than that to evaluate, the foreman will go into an infinite loop, trying over and over again to get some worker to do the work in 120 seconds. If you think this is happening, edit foreman.c search for "120", change it to some larger time and recompile. 5. If you are running on a massively parallel machine which has only one /tmp directory used by all the nodes, do not use the -s option. The way workers deal with a shipped sequence data file is to write the message out to /tmp and then open that file and read it normally. If a bunch of workers are all writing large files to /tmp, that could be a problem. (The C library function tmpfile is used to create and open the file. So the file is automatically deleted when closed.)