[sharp-discuss] Sharp on a cluster

Evan Bursey ehbursey@lbl.gov
Mon, 01 Mar 2004 16:42:54 -0800


Dear Clemens,

Thank you very much for your help.  I now have Sharp working on our
Warewulf cluster, with jobs being submitted through the Sun Grid
Engine.  There were multiple problems with my initial installation.

First, I reinstalled CCP4 and compiled from patched sources using your
ccp4_cv.sh script.

Next, the licence file did not contain entries for four of our 13
nodes.  I have requested and received new entries for our cluster.  The
nodes all appear to be working now.

Finally, the start.sh, restart.sh, and resume.sh scripts were failing
due to the variables being passed to qsub incorrectly.  A working
start.sh is listed below.


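#!/bin/sh
# start.sh: submit a SHARP job to the Sun Grid Engine.
SGE_ROOT=/sge
export SGE_ROOT
# This works for submitting jobs to the Sun Grid Engine.
# Each BDG_* variable is passed with its own -v option; the values are
# quoted so that each one reaches qsub as a single argument.
/sge/bin/glinux/qsub -v "BDG_home=$BDG_home" \
                        -v "BDG_user=$BDG_user" \
                        -v "BDG_project=$BDG_project" \
                        -v "BDG_job=$BDG_job" \
                        -v "BDG_type=$BDG_type" \
                        -v "BDG_log=$BDG_log" \
                        -v "BDG_err=$BDG_err" \
                        "$BDG_home/$BDG_gui/submit/sungrid/start" &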
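
Since the failures came from variables not reaching qsub correctly, a
small pre-flight check can catch an unset variable before submission.
This is only a sketch and not part of the SHARP distribution: the
function name check_bdg_vars is made up for illustration, and the list
of variables is simply the ones used in the start.sh above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SHARP): report any unset or empty
# variable among those that start.sh relies on.
check_bdg_vars() {
    missing=""
    for name in BDG_home BDG_user BDG_project BDG_job \
                BDG_type BDG_log BDG_err BDG_gui; do
        # Indirect lookup of the variable called $name (portable sh).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "unset or empty:$missing" >&2
        return 1
    fi
    return 0
}
```

Calling "check_bdg_vars || exit 1" just before the qsub line would stop
the script early instead of submitting a job with an empty value.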

Again, thank you very much for your help.

Evan Bursey

On Fri, 2004-02-27 at 02:12, Clemens Vonrhein wrote:
> Dear Evan,
> 
> Some initial remarks (before getting into details about your
> problems):
> 
>   1. any kind of queuing system should be supported if it can be
>      adapted from the examples we provide (for DQS and LSF). The way
>      to do this is:
> 
>     % cd /where/ever/sharp
>     % cp -r submit submit.local
>     % rm sushi/submit
>     % ln -s ../submit.local sushi/submit
> 
>     % cd submit.local
>     % cp -r dqs MySubmit
> 
>     % vi start.dat       # and restart.dat, resume.dat
>       ==> see example line for 'dqs' and substitute this with
>       'MySubmit'
> 
>     You then need to adapt the files in the new subdirectory MySubmit.
> 
>     (see also detailed description in the installation manual).
> 
>   2. ideally, your cluster should have identical machines (in terms of
>      software): so the same set of packages installed, the same CCP4
>      (and ARP/wARP etc) installation, same paths and mount-points etc.
> 
>   3. when configuring each MASTER (i.e. each machine that should be
>      running a SHARP/autoSHARP job) it is then recommended to 'clone'
>      the configuration of the master node (_IF_ everything is
>      identical, that is!)
> 
> > Some jobs that are submitted via rsh crash before the first round of
> > sharp.  They crash at the "collecting and analyzing all data" stage with
> > the complaint:
> > 
> > unable to get resolution limits for file
> > /home/software/packages/sharp/users/ehbursey/None.sharp/datafiles/16-if3-c.data.mtz
> 
> Either CCP4 isn't properly installed (or working) on that particular
> machine, or 'awk', 'grep' or similar UNIX tools aren't properly
> installed. Is there a file
> 
>   ...ehbursey/None.sharp/datafiles/16-if3-c.data.mtzlog
> 
> which looks like a normal 'MTZDUMP' output? Does it contain any error
> messages? Did you have a look into the $BDG_home/sushi/logs/error_log
> file?
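
The log checks suggested above can be run in one go on the affected
node. This is a rough sketch: the function name check_log is invented
here, and searching for the word "error" is only a guess at what a
failure looks like in these logs, not a documented format.

```shell
#!/bin/sh
# Hypothetical helper: say whether a log file exists, is non-empty,
# and contains lines mentioning "error" (case-insensitive).
check_log() {
    log=$1
    if [ ! -s "$log" ]; then
        echo "missing or empty: $log"
        return 2
    fi
    if grep -i -n "error" "$log"; then
        return 1        # matching lines were printed by grep
    fi
    echo "no error lines in $log"
    return 0
}
```

For example, check_log "$BDG_home/sushi/logs/error_log" prints any
matching lines with their line numbers; the same call can be pointed at
the .mtzlog file.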
> 
> > Still other jobs crash at the "collecting and analyzing all data" stage
> > with a different complaint (from CAD2_w2.log):
> > 
> > ***  Error
> >  From LWASSN : Duplicate column labels in output file, columns   8 and 
> > 12 both have the label FMIDw2
> >  CAD:  *** Program Terminated
> 
> This could be related to earlier problems ...
> 
> > Other jobs complain about an invalid licence.  The nodes that make this
> > complaint make it at the "Finding sites" stage.  The error message is
> > in  PKMAPS_w1_ano_set1.log.  I'm assuming that the licence really is
> > invalid for these four nodes, although I wonder why the job makes it
> > this far without complaining.   
> 
> This is the first time the licence key is checked. You can check the
> validity of your .licence file by doing:
> 
>   % cd /where/ever/sharp
>   % source ./setup.csh
>   % bin/linux_exe/sharp
> 
> ON EACH node! This should only complain about a missing parameter file
> (and _not_ about the licence key).
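
With many nodes, the per-node check above is easier as a loop. A sketch
only: the node names, the use of rsh, and matching on the word
"licence" in the output are all assumptions (the exact wording of the
licence message is not given in this thread). The remote commands are
run under csh because setup.csh is a csh script.

```shell
#!/bin/sh
# Hypothetical sketch: run the SHARP binary on each node and flag any
# node whose output mentions the licence.
SHARP_DIR=${SHARP_DIR:-/where/ever/sharp}   # placeholder path, adjust locally
REMOTE=${REMOTE:-rsh}                       # assumed remote shell command

check_licence_on_nodes() {
    for node in "$@"; do
        # The inner single quotes keep the command line intact for the
        # remote csh; stdin is closed so rsh cannot swallow input.
        out=$($REMOTE "$node" csh -c \
            "'cd $SHARP_DIR && source ./setup.csh && bin/linux_exe/sharp'" \
            2>&1 </dev/null)
        case $out in
            *[Ll]icence*) echo "$node: licence problem?" ;;
            *)            echo "$node: ok (no licence message)" ;;
        esac
    done
}
```

It would be called as, say, check_licence_on_nodes node01 node02, with
the real host names in place of these made-up ones.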
> 
> > Running checkBDG.sh reports that none of the nodes have a valid licence,
> > although I did receive one and it is in $BDG_home/.licence.  Only a
> > couple of nodes complain about licensing, but I wonder if these flakey
> > rsh problems I'm seeing might be related to the .licence file?  Would
> > the jobs even start without a valid licence, or would they stop partway
> > through the process, as I'm seeing here?  Would the job always give a
> > licence-related error or is it possible that the error would appear at a
> > later stage?
> 
> Jobs will always start, even if you haven't got a valid licence. If
> you get a message about 'invalid licence' it means you haven't
> requested all licence keys: see 
> 
>   http://www.globalphasing.com/sharp/
> 
>     and
> 
>   http://www.globalphasing.com/sharp/restricted/request.html
> 
> on how to request additional licence keys.
> 
> 
> I think the first thing to do is to make sure all machines have the
> same set of packages installed and all of these are at the same
> version. Then make sure that crystallographic software (CCP4,
> ARP/wARP) is visible in the same way on all nodes.
> 
> Cheers
> 
> Clemens
>