[sharp-discuss] Sharp on a cluster

Evan Bursey ehbursey@lbl.gov
Thu, 26 Feb 2004 13:05:55 -0800


Hello,

I am trying to get Sharp running on a Linux cluster.  The cluster is a
"Warewulf" cluster using the Sun Grid Engine queue system.  The
webserver is running on the cluster's master node.  I have requested and
received licences for all of the computers in the cluster (the master
node and 12 slave nodes).  I have been testing AutoSharp with the
included if3-c.0 tutorial data.

I can run AutoSharp in interactive mode (where the jobs are run on the
cluster master node) without any problems.  The jobs run to completion
and produce what looks like good output.

I am still working out the details of submitting jobs to the cluster's
queue.  Do any Sharp users have experience submitting AutoSharp jobs to
a Sun Grid Engine queue?

I have tried to submit jobs via rsh with mixed luck.  Most jobs quit at 
the first round of sharp with a complaint that /usr/bin/time cannot be
found.  /usr/bin/time does exist on the cluster's master node, but not
on the slaves.  I'm in the middle of taking care of this.  At this
point, I would be happy if all the jobs made it this far, because... 

Some jobs that are submitted via rsh crash before the first round of
sharp.  They crash at the "collecting and analyzing all data" stage with
the complaint:

unable to get resolution limits for file
/home/software/packages/sharp/users/ehbursey/None.sharp/datafiles/16-if3-c.data.mtz

Still other jobs crash at the "collecting and analyzing all data" stage
with a different complaint (from CAD2_w2.log):

***  Error
 From LWASSN : Duplicate column labels in output file, columns   8 and 12 both have the label FMIDw2
 CAD:  *** Program Terminated

Other jobs complain about an invalid licence.  The nodes that make this
complaint make it at the "Finding sites" stage.  The error message is
in PKMAPS_w1_ano_set1.log.  I'm assuming that the licence really is
invalid for these four nodes, although I wonder why the job makes it
this far without complaining.   

Running checkBDG.sh reports that none of the nodes has a valid licence,
although I did receive one and it is in $BDG_home/.licence.  Only a
couple of nodes complain about licensing, but I wonder whether these
flaky rsh problems I'm seeing might be related to the .licence file.
Would the jobs even start without a valid licence, or would they stop
partway through the process, as I'm seeing here?  Would the job always
give a licence-related error, or is it possible that the error would
appear at a later stage?
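
Two of the symptoms above depend on files being present on every node:
/usr/bin/time and the .licence file.  A per-node check along the
following lines might help separate "file missing on this node" from
other causes.  It is only a sketch under assumptions: the function names
are made up for illustration, the node names and the expanded licence
path in the usage note are placeholders, and it assumes each node
accepts rsh from wherever the script is run.

```python
import shlex
import subprocess


def build_check_command(checks):
    """Build one shell command that prints 'ok PATH' or 'missing PATH' per check.

    checks is a list of (test_flag, path) pairs, e.g. ("-x", "/usr/bin/time").
    The result is printed instead of being signalled through an exit status,
    because rsh does not reliably pass the remote exit status back.
    """
    parts = []
    for flag, path in checks:
        q = shlex.quote(path)  # quoting also means $VARS are NOT expanded remotely
        parts.append(f"if [ {flag} {q} ]; then echo ok {q}; else echo missing {q}; fi")
    return "; ".join(parts)


def parse_report(output):
    """Turn 'ok PATH' / 'missing PATH' lines into {path: True/False}."""
    report = {}
    for line in output.splitlines():
        status, _, path = line.partition(" ")
        if status in ("ok", "missing") and path:
            report[path] = status == "ok"
    return report


def check_nodes(nodes, checks, runner=("rsh",)):
    """Run the checks on every node and return {node: {path: bool}}.

    runner is the remote-shell command prefix (an assumption: rsh is on PATH).
    A node that cannot be reached gets an empty report.
    """
    command = build_check_command(checks)
    results = {}
    for node in nodes:
        try:
            proc = subprocess.run(
                [*runner, node, command],
                stdin=subprocess.DEVNULL,  # keep the remote shell off our stdin
                capture_output=True,
                text=True,
                timeout=30,
            )
            results[node] = parse_report(proc.stdout)
        except (OSError, subprocess.TimeoutExpired):
            results[node] = {}
    return results
```

For example, check_nodes(["node01", "node02"], [("-x", "/usr/bin/time"),
("-r", "/path/to/BDG_home/.licence")]) would report, per node, whether
the time binary is executable and the licence file is readable.  The
licence path has to be written out in full, since the quoting prevents
$BDG_home from being expanded on the remote side.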

If this looks familiar to anyone, or if you have any thoughts on all of
this, I would be interested to read your comments.

Thanks very much for your help,
Evan Bursey