Thursday, January 17, 2008

Running parallel pw.x on Lonestar at TACC: on site / Condor-G / Condor birdbath APIs

(1) Submit to the LSF queue on Lonestar:
bsub -I -n 4 -W 0:05 -q development -o pwscf.out ibrun /home/teragrid/tg459247/vlab/espresso/bin/pw.x < /home/teragrid/tg459247/vlab/__CC5f_7/Pwscf_Input

(2) Submit through a Condor-G script file. The Globus RSL parameters are documented at http://www.globus.org/toolkit/docs/2.4/gram/gram_rsl_parameters.html
The actual script file follows:
=============================================
executable = /home/teragrid/tg459247/vlab/bin/pw_mpi.x
transfer_executable = false
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
transfer_input_files = /home/leesangm/catalina/VLAB_Codes/__CC5f_7/008-O-ca--bm3.vdb,/home/leesangm/catalina/VLAB_Codes/__CC5f_7/__cc5_7,/home/leesangm/catalina/VLAB_Codes/__CC5f_7/Mg.vbc3
universe = grid
grid_resource = gt2 tg-login.tacc.teragrid.org/jobmanager-lsf
output = tmpfile.out.$(Cluster)
error = condorG.err.$(Cluster)
log = condorG.log.$(Cluster)
input = /home/leesangm/catalina/VLAB_Codes/__CC5f_7/Pwscf_Input
x509userproxy = /tmp/x509up_u500
globusrsl = (environment=(PATH /usr/bin))\
(jobtype=mpi)\
(count=4)\
(queue=development)\
(maxWallTime=5)

queue


(3) Submit through the Condor birdbath APIs
This is almost the same as serial job submission except for setting the wall clock time. When you generate the globusrsl string, add

(maxWallTime=yourWallMaxTime)
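
For example, starting from the GlobusRSL attribute of the serial example in the January 4 entry below, the MPI version could be set up like this (only a sketch; the values mirror the script above, and maxWallTime is in minutes):
------------------------------------------------------------------------------------------------------------------------------
// A sketch: the same ClassAdStructAttr used for GlobusRSL in the serial example,
// with jobtype=mpi, count, and maxWallTime added (values mirror the script above).
ClassAdStructAttr mpiGlobusRsl = new ClassAdStructAttr("GlobusRSL", ClassAdAttrType.value2,
    "\"(environment=(PATH /usr/bin))(jobtype=mpi)(count=4)(queue=development)(maxWallTime=5)\"");
------------------------------------------------------------------------------------------------------------------------------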

Friday, January 11, 2008

Job submission to TG machines

Color code
Blue: Serial pw.x is ready to run and accessible from the Task Executor
Red: pw.x installation failed.
Green: Serial + MPI pw.x is ready to run and accessible from the Task Executor
==================================================
machine    hostname                              architecture  job submission  job manager
------------------------------------------------------------------------------------------
BigRed     login.bigred.iu.teragrid.org          ppc64         GT4             loadleveler
*QueenBee  login-qb.lsu-loni.teragrid.org                      GT4
NCAR       tg-login.frost.ncar.teragrid.org      i686          GT4
*Abe       login-abe.ncsa.teragrid.org           Intel64       GT4             pbs
Cobalt     login-co.ncsa.teragrid.org            ia64          GT4/GT2         pbs/fork
Mercury    login-hg.ncsa.teragrid.org            ia64          GT4/GT2         pbs/fork
Tungsten   login-w.ncsa.teragrid.org             ia32          GT4/GT2         LSF/fork
ORNL       tg-login.ornl.teragrid.org            i686          GT4/GT2         pbs/fork
*BigBen    tg-login.bigben.psc.teragrid.org      AMD Opteron   GT4/GT2         pbs
*Rachel    tg-login.rachel.psc.teragrid.org                    GT4/GT2         pbs
Purdue     tg-login.purdue.teragrid.org                        GT4/GT2         pbs
*sdsc BG   bglogin.sdsc.edu                      ppc64         GT4/GT2         no job manager??
*sdsc DS   dslogin.sdsc.edu                      002628DA4C00  GT4/GT2         loadleveler/fork
sdsc IBM   tg-login.sdsc.teragrid.org            ia64          GT4/GT2         pbs/fork
lonestar   tg-login.lonestar.tacc.teragrid.org   ia64          GT4/GT2         LSF/fork
maverick   tg-viz-login.tacc.teragrid.org        sun4u         GT4/GT2         sge/fork
*ranger    tg-login.ranger.tacc.teragrid.org                   GT4             sge/fork
IA-VIS     tg-viz-login.uc.teragrid.org          i686          GT4/GT2         pbs
IA-64      tg-login.uc.teragrid.org              ia64          GT4/GT2         pbs/fork
=================================================

*QueenBee : could not login
*Abe doesn't support single-sign-on
*BigBen: could not login
*Abe: could not login
*Rachel: could not login
*Purdue: could not login
*sdsc BlueGene: unknown job manager?
*sdsc DataStar: unusual architecture?
*ranger: could not login

Compiling espresso in the TG machines

To run the executables on the TG machines, you first have to build the executables on each site.
Here are the instructions for installing serial-run espresso. README.install was very useful.

* Cobalt, Mercury, and Tungsten (NCSA)

step 1. Copy espressoXXX.tar.

step 2. In the espresso directory, set the environment variables to select the architecture and paths.
setenv BIN_DIR /home/ac/quakesim/vlab/espresso/bin
setenv PSEUDO_DIR /home/ac/quakesim/vlab/espresso/pseudo
setenv TMP_DIR /home/ac/quakesim/vlab/espresso/tmp
setenv ARCH linux64
setenv PARA_PREFIX
setenv PARA_POSTFIX

note: for a serial build, PARA_PREFIX MUST be left empty. For a parallel build, set

setenv PARA_PREFIX "mpirun -np 2"
setenv PARA_POSTFIX

step 2.5. Make sure you have tmp, pseudo, and bin directories under your espresso directory.

step 3. ./configure

step 4. make all

* Lonestar: parallel pw.x, ph.x
step 1. setenv PARA_PREFIX "mpirun"
step 2. setenv ARCH linux64
step 3. ./configure
step 4. make all

Submitting a job to PBS [1]: on site with the command line

(0) Create a script file that displays the hostname of the machine. Name the file "test".
#!/bin/sh
/bin/hostname
(1) Submit the job "test" to the PBS queue.
qsub -o test.out -e test.err test
(2) Check the result files (test.out and test.err).

*Useful guide
http://www.teragrid.org/userinfo/jobs/pbs.php

Friday, January 4, 2008

Submitting a job to LSF job queue [3]: through CondorG with birdbath APIs

To submit a job to LSF through Condor-G with the birdbath APIs, your ClassAdStructAttr array should contain the required attributes. I retrieved the list of keywords from my previous example: Submitting a job to LSF job queue [2]. After submitting the condor job from example [2], run condor_q -l, and you will get a list of valid keywords. Please note that even if you use a wrong keyword, condor WON'T throw any exception, and your job WON'T go through. (Now I'm pulling my hair.) Therefore, first get something running correctly even if it's not with the birdbath APIs, and start from the valid keywords generated from that example.

* The attributes In, Out, and Err are used for specifying standard input, output, and error redirections. Therefore, if your executable uses standard input/output and redirects them to files, those files should be specified with these attributes.

* In this case, pw.x generates multiple files besides the stdout output file. The attribute
TransferOutput specifies the files that should be transferred after the process is done.

* The attribute GlobusRSL is equivalent to the keyword globusrsl in the script file for command-line submission with condor_submit.

* Many many thanks to Marlon for helping me out!!


The actual ClassAdStructAttr[] is the following:
------------------------------------------------------------------------------------------------------------------------------
ClassAdStructAttr[] extraAttributes =
{
new ClassAdStructAttr("GridResource", ClassAdAttrType.value3, gridResourceVal),
new ClassAdStructAttr("TransferExecutable",ClassAdAttrType.value4,"FALSE"),
new ClassAdStructAttr("Out", ClassAdAttrType.value3, tmpDir+"/"+"pwscf-"+clusterId+".out"),
new ClassAdStructAttr("UserLog",ClassAdAttrType.value3, tmpDir+"/"+"pwscf-"+clusterId+".log"),
new ClassAdStructAttr("Err",ClassAdAttrType.value3, tmpDir+"/"+"pwscf-"+clusterId+".err"),
new ClassAdStructAttr("In",ClassAdAttrType.value3, workDir+"/"+"Pwscf_Input"),
new ClassAdStructAttr("ShouldTransferFiles", ClassAdAttrType.value2,"\"YES\""),
new ClassAdStructAttr("WhenToTransferOutput", ClassAdAttrType.value2,"\"ON_EXIT\""),
new ClassAdStructAttr("StreamOut", ClassAdAttrType.value4, "TRUE"),
new ClassAdStructAttr("StreamErr",ClassAdAttrType.value4,"TRUE"),

new ClassAdStructAttr("TransferOutput",ClassAdAttrType.value2,
"\"pwscf.pot, pwscf.rho, pwscf.wfc, pwscf.md, pwscf.oldrho, pwscf.save, pwscf.update\""),

new ClassAdStructAttr("TransferOutputRemaps",ClassAdAttrType.value2,
"\"pwscf.pot="+tmpDir+"/"+"pwscf-"+clusterId+
".pot; pwscf.rho="+tmpDir+"/"+"pwscf-"+clusterId+
".rho;pwscf.wfc="+tmpDir+"/"+"pwscf-"+clusterId+
".wfc; pwscf.md="+tmpDir+"/"+"pwscf-"+clusterId+
".md; pwscf.oldrho="+tmpDir+"/"+"pwscf-"+clusterId+
".oldrho; pwscf.save="+tmpDir+"/"+"pwscf-"+clusterId+
".save; pwscf.update="+tmpDir+"/"+"pwscf-"+clusterId+".update\""),

new ClassAdStructAttr("GlobusRSL", ClassAdAttrType.value2,
"\"(queue=development)(environment=(PATH /usr/bin))(jobtype=single)(count=1)\""),

new ClassAdStructAttr("x509userproxy",ClassAdAttrType.value3,proxyLocation),

};
------------------------------------------------------------------------------------------------------------------------------

Pwscf output files?

On my local machine, pw.x generates these output files besides the standard output:
pwscf.pot, pwscf.rho, pwscf.wfc
unless I reuse the tmp directory.

However, on Lonestar, it generates
pwscf.md pwscf.oldrho pwscf.pot pwscf.rho pwscf.save pwscf.update pwscf.wfc

To be safe, I transfer all of the possible files from the remote machine.
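
Since the TransferOutputRemaps string shown above is easy to get wrong by hand, a small helper along these lines could build it from the file list. This is only a sketch; buildOutputRemaps is a hypothetical helper, not part of any API:
------------------------------------------------------------------------------------------------------------------------------
// Sketch of a hypothetical helper that builds the TransferOutputRemaps value
// ("src1=dest1; src2=dest2; ...") from the pw.x output file names.
public class RemapBuilder {
    static String buildOutputRemaps(String[] files, String tmpDir, String clusterId) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < files.length; i++) {
            String ext = files[i].substring(files[i].lastIndexOf('.'));  // e.g. ".pot"
            if (i > 0) sb.append("; ");
            sb.append(files[i]).append("=")
              .append(tmpDir).append("/pwscf-").append(clusterId).append(ext);
        }
        return sb.append("\"").toString();
    }

    public static void main(String[] args) {
        String[] outputs = {"pwscf.pot", "pwscf.rho", "pwscf.wfc", "pwscf.md",
                            "pwscf.oldrho", "pwscf.save", "pwscf.update"};
        // Prints the same kind of value used for the TransferOutputRemaps attribute above.
        System.out.println(buildOutputRemaps(outputs, "/tmp/myTmpDir", "123"));
    }
}
------------------------------------------------------------------------------------------------------------------------------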

Submitting a job to LSF job queue [2]: through CondorG with condor_submit

Now I tried to submit the same job, pw.x, with remote input files through the Condor-G command line, condor_submit. In this example, I use input files stored on my local machine. Therefore, my script file has to specify the input files as in the following lines.

---------------------------------------------------------------------------------------------------------------------------------
executable = /home/teragrid/tg459282/vlab/pw.x
transfer_executable = false
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
transfer_input_files = /home/leesangm/catalina/VLAB_Codes/__CC5f_7/008-O-ca--bm3.vdb,/home/leesangm/catalina/VLAB_Codes/__CC5f_7/__cc5_7,/home/leesangm/catalina/VLAB_Codes/__CC5f_7/Mg.vbc3

universe = grid
grid_resource = gt2 tg-login.tacc.teragrid.org/jobmanager-lsf
output = tmpfile.out.$(Cluster)
error = condorG.err.$(Cluster)
log = condorG.log.$(Cluster)
input = /home/leesangm/catalina/VLAB_Codes/__CC5f_7/Pwscf_Input

globusrsl = (queue=development)\
(environment=(PATH /usr/bin))\
(jobtype=single)\
(count=1)

queue
---------------------------------------------------------------------------------------------------------------------------------

This script file is almost the same as a normal condor submit script except for the globusrsl keyword. This is the simple case for a serial job. For parallel jobs, the globusrsl has to be modified (see the January 17 entry above for the MPI version).

Then submit the condor job:
condor_submit script_file_name

Submitting a job to LSF job queue [1] : On the Cluster

* Useful LSF commands:
bsub: submit jobs
bjobs: display information about jobs
bkill: send a signal to kill jobs
For more commands,
http://its.unc.edu/dci/dci_components/lsf/lsf_commands.htm

* Useful options of the bsub command
-q: name of the queue
-n: desired number of processors
-W: wall-time limit for batch jobs, -W [hours]:[minutes]
-i: input file
-o: output file
-e: error file

Example LSF submission of pw.x on Lonestar:
bsub -q development -n 1 -W 15 -i "Pwscf_Input" -o "myout.out" ../pw.x

Thursday, January 3, 2008

Building a Client of the Task Executor

To generate a client of the vlab Task Executor service, we first have to create the stub code and compile it.
If the service is running on localhost, the WSDL file is located at
http://localhost:8080/task-executor/services/TaskExecutor?wsdl
With this WSDL file, we can generate the Java classes with WSDL2Java, which is included in the Axis package.
java org.apache.axis.wsdl.WSDL2Java http://localhost:8080/task-executor/services/TaskExecutor?wsdl
Then compile and jar the generated Java code.
The jar files required to run WSDL2Java are the following:
  • axis-1.4.jar
  • activation-1.1.jar
  • commons-discovery-0.2.jar
  • saaj.jar
  • jaxrpc.jar
  • mail-1.4.jar
  • wsdl4j-1.5.1.jar
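
As a rough usage sketch of the generated stubs (the class and method names below, TaskExecutorServiceLocator and getTaskExecutor, are assumptions based on Axis's usual naming convention for a service/port named TaskExecutor; check what WSDL2Java actually emits for this WSDL), a client would obtain the port from the generated locator and then invoke the operations defined in the WSDL:
------------------------------------------------------------------------------------------------------------------------------
// Sketch only: TaskExecutorServiceLocator, TaskExecutor, and getTaskExecutor follow
// Axis's usual naming convention and may differ from the classes WSDL2Java generates
// for this particular WSDL.
public class TaskExecutorClient {
    public static void main(String[] args) throws Exception {
        // The locator defaults to the endpoint address recorded in the WSDL.
        TaskExecutorServiceLocator locator = new TaskExecutorServiceLocator();
        TaskExecutor port = locator.getTaskExecutor();
        // Invoke the operations declared in the WSDL on "port" from here.
    }
}
------------------------------------------------------------------------------------------------------------------------------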