Check

You can always check if the software is correctly set up via

  which process
  process -h
 

This assumes you have already run the

  module load ccp4-workshop
  clusterme

commands (on the NX/NoMachine server) and then

  module load ccp4-workshop

again on the compute/cluster node.
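You can confirm on either node that the module is actually loaded via the standard environment-modules command

  module list

which should show ccp4-workshop among the currently loaded modules.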

Note: if in your terminal you see a prompt like

[FEDID@cs04r-sc-com99-07 ~]$ 
                ^^^

you are already on the compute node reserved for you. If on the other hand you see something like

[FEDID@cs05r-sc-serv-04 ~]$ 
                ^^^^

then you are still on the main NoMachine server.
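If in doubt, you can also print the machine name directly (hostname is a standard Linux command):

  hostname

On a compute node this shows something like cs04r-sc-com99-07, on the main NoMachine server something like cs05r-sc-serv-04 (as in the prompts above).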


Tutorial/example data

Here are some example datasets - all from recently collected and deposited PDB entries for which the raw diffraction images are also available (PD = proteindiffraction.org). The images are already placed on the DLS computers: see the full path to those images below.

For each dataset we give the PDB code (with PDBpeep and Table-1 links), the location of the images on the DLS computers, the archived raw data (PD) and some notes:

  • 6ORC (PDBpeep, Table-1, PD)
    Images: /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/6ORC/Images
    Notes:  very quick to run, Se-MET for phasing
  • 6YNQ (PDBpeep, Table-1, PD)
    Images: /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/6YNQ/Images
    Notes:  very quick to run, small ligand
  • 7KRX (PDBpeep, Table-1, PD)
    Images: /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/7KRX/Images
    Notes:  fast to run - and contains anomalous signal, interesting ligand
  • 7K1L (PDBpeep, Table-1, PD)
    Images: /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/7K1L/Images
    Notes:  twinning, anisotropy, interesting ligand
  • 6VWW (PDBpeep, Table-1, PD)
    Images: /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/6VWW/Images
    Notes:  twinning
  • 6W9C (PDBpeep, Table-1, PD)
    Images: /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/6W9C/Images
    Notes:  two sweeps, low completeness, anisotropy, multiple lattices?, radiation damage?, might contain some anomalous signal (Zn)

Caveat

Remember that we don't provide a graphical interface to start an autoPROC run (although there is a lot of graphical output). You won't need much experience with the terminal/shell and command line, but a little bit is necessary after all. See also here for some hopefully helpful pointers.
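In case the command line is new to you, the following few standard Linux commands cover most of what this tutorial needs:

  pwd                # show which directory you are currently in
  ls                 # list the files in the current directory
  cd 6ORC.01         # change into a (sub)directory
  cd ..              # go up one directory level
  less 6ORC.01.lis   # page through a text file (quit with "q")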


Simple run

You can run interactively using e.g.

  process -I /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/6ORC/Images -d 6ORC.01 | tee 6ORC.01.lis

Detailed explanation of command

Let's have a closer look at that command (you can skip that if you are already a Linux guru):

  • process is the name of the program that runs autoPROC.
  • we now provide different command-line flags (arguments) to that command:
    • these are separated by spaces
    • the typical form is -x value, i.e. a flag followed by a value (with a space between them)
    • if the value contains spaces, we have to write -x "value1 value2", i.e. add quotes around it (usually double-quotes)
  • the first flag -I expects as a value the directory name where the images are located
    • here we give it the full path as described in the above table
  • the second flag -d tells the program in what (sub)directory all output should go
    • it is always a good idea to keep input and output separate, i.e. don't run the processing within the directory containing the images (you can, but this can quickly get messy)
    • anticipate right from the start that you might need to run processing several times, so use some kind of numbering system (better than using names like "new", "newer" and "newest")
    • this is why we use a form of <identifier>.<number> here: we might not need all that information, but calling files/directories test, new, old etc is a very bad idea ;-)
    • it is also not a good idea to run everything always in the same directory: make use of a directory hierarchy to order your projects
  • at the end comes a bit of "Linux magic" (see the short sketch after this list):
    • the program will write a fair amount of information to standard output
    • in a terminal, this means that all that (potentially useful) information is written into the terminal ... and will disappear once it gets too long (and scrolling back no longer reaches it)
    • at the latest it will disappear once you log out of the computer (or it gets rebooted, you close the terminal application etc)
    • so it is a good idea to save the output
    • this could also be done with ... > 6ORC.01.lis, but then we wouldn't "see" what is being written into that file (listing)
    • so using the tee command we get the best of both worlds: all standard output will be saved while at the same time it will also be shown within the terminal ... it's a "T junction" ;-)
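A minimal sketch of the difference (with ... standing in for the flags discussed above, as elsewhere in these notes):

  # plain redirection: everything goes into the file, the terminal shows nothing
  process ... > 6ORC.01.lis

  # with tee: everything goes into the file AND is shown in the terminal
  process ... | tee 6ORC.01.lis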

Main reporting and analysis

The full report of processing (including all analysis of indexing, processing, space-group decisions, scaling and the final set of fully processed data ready for deposition) can be found in the summary.html file within the output directory. So browse to your current directory (in a file browser) and look for the 6ORC.01/summary.html file. Its location is also reported on standard output (and you might be able to just click on it there to open it). You could also cd into the relevant directory and just start firefox summary.html.
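For instance, using the run directory from the example above:

  cd 6ORC.01
  firefox summary.html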

Sources of information regarding autoPROC and XDS

A good first stop is the built-in help of the program itself:

  process -h

Additional tools and files for inspection

Although the summary.html file is the first stop to see processing progress (and final results), there are additional resources for you:

  • the standard output (i.e. the file you saved this to, e.g. 6ORC.01.lis) will suggest running some jiffies to visualise the images with predictions. If you do
  grep gpx.sh 6ORC.01.lis

you will see something like

running 6ORC.01/status/01_run_idxref_01/gpx.sh
running 6ORC.01/status/03_index/gpx.sh
running 6ORC.01/status/04_integ/gpx.sh
running 6ORC.01/status/05_postref/gpx.sh
running 6ORC.01/status/06_process/gpx.sh

that provides you with scripts to visualise predictions at different stages. Usually only those related to indexing (idxref) and the final one are of interest. Especially if you suspect multiple lattices (or multiple indexing solutions): this allows you to see the predictions for each of the (significant) indexing solutions.
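To launch one of them, e.g. the visualisation for the indexing stage of the example run above (any of the listed scripts can be started the same way):

  sh 6ORC.01/status/03_index/gpx.sh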

If you are "only" interested in the final data quality, have a look at the following files (within the result directory and also referenced from summary.html):

  • report_staraniso.pdf (PDF report, good for archiving, printing and keeping)
  • Data_1_autoPROC_STARANISO_all.cif (deposition-ready PDBx/mmCIF file, including full set of data-quality metrics)
  • staraniso_alldata-unique.mtz (MTZ file of final, processed data - STARANISO analysis: ready for experimental phasing, molecular-replacement, refinement etc)
  • summary.tar.gz (archive of summary.html and all referenced files, i.e. plots, MTZ, PDF and mmCIF files: good idea to always backup/transfer at least that file)
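To get that archive off the DLS systems you could, for example, use scp (the destination below is just a placeholder for your own machine):

  scp 6ORC.01/summary.tar.gz user@your.home.machine:/some/backup/directory/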

Some ideas about "advanced" processing

After having done data processing with all options at their default values, a careful analysis of the summary.html reporting might already give you some ideas about possible changes. One could set an explicit space group (SG) via

  process symm="P212121" ...

(... symbolises any additional arguments as discussed above). Or a SG/cell combo:

  process symm="P212121" cell="43 67 112 90 90 90" ...

There are also a variety of so-called macros: these are predefined collections of parameter settings for typical tasks. One of the most commonly used macros is

  process -M LowResOrTricky ...

You might want to have a closer look at the manual or the reference card (PDF) for other suggestions: autoPROC has a very large number of potential parameter settings that can be used to modify its behaviour. The wiki contains a large set of worked-through examples too.
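Macros, keywords and flags can be freely combined. A hypothetical second attempt at the 6ORC example (note the new output directory and listing file, following the numbering scheme discussed above) could look like

  process -M LowResOrTricky \
      -I /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/6ORC/Images \
      -d 6ORC.02 | tee 6ORC.02.lis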


Processing your own data

There shouldn't be anything different from the notes above when it comes to processing your own data. One note though: in our experience during these workshops, a fair number of "skeletons" see the light of day, i.e. problematic datasets that have been sitting on disk for a long time at students' home institutions and have caused all kinds of problems. These might come from unusual beamlines/instruments or non-standard settings. So be prepared to provide as much background information as possible about the actual instrument and data collection (back in the day): the automatic detection of accurate beamline/instrument parameters might not recognise some of those "interesting" datasets right away.

If you are processing so-called mini-cbf files (file ending *.cbf) that are additionally compressed (i.e. *.cbf.gz), you might need to tell XDS to use a so-called plugin to avoid an unnecessarily large number of file accesses and conversion steps. For that, add

  autoPROC_XdsKeyword_LIB="/dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/xds-zcbf/build/xds-zcbf.so"

to your command (somewhere after the initial process ... and remember to separate it from other command-line flags/arguments with spaces!)
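Put together, such a command might look like this (the image directory and output names are placeholders for your own data):

  process -I /path/to/your/images -d mydata.01 \
      autoPROC_XdsKeyword_LIB="/dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/xds-zcbf/build/xds-zcbf.so" \
      | tee mydata.01.lis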

Submitting jobs to the cluster queues

It can also be useful to submit any processing job to the DLS computer clusters. For that you will need to write a little shell script (e.g. called run.sh) that could contain something like

#!/bin/sh

module load ccp4-workshop

process \
    -I /dls/i04-1/data/2021/mx29507-1/processing/ClemensVonrhein/Tutorials/6ORC/Images \
    -d 6ORC.01 > 6ORC.01.lis 2>&1

What does that file contain?

  • the first line (the so-called "shebang") tells the system which interpreter (shell) to use for everything that follows
  • remember to load all relevant packages
  • the actual command can be written exactly how you would otherwise type it in the terminal
    • here we place different parts of that command on different lines for better readability
    • the preceding line ends with a backslash \, signaling a continuation
    • there should be no additional character/space after that backslash
    • no backslash at the end of the full command

We can then submit that job via

  chmod +x run.sh
  qsub -pe smp 16 -cwd run.sh

and see it in the cluster queues via

  qstat
 

Note: we have seen that sometimes a submitted job seems to hang for quite some time, especially when handling compressed (*.gz or *.bz2) files - but your mileage may vary.
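Since the script redirects all output into 6ORC.01.lis, you can follow the progress of a running job with the standard tail command:

  tail -f 6ORC.01.lis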

Setup - optional (work-in-progress)

To make your life easier later, you can run the commands below as-is on the Diamond/DLS computers (after having connected via NoMachine): this needs to be done only once!

  echo "alias x ='module load ccp4-workshop'" >> ~/.bashrc_user
  echo "alias c ='x; clusterme'" >> ~/.bashrc_user
  mkdir ~/.ssh
  chmod 0700 ~/.ssh
  ssh-keygen -t rsa -b 4096 -f  ~/.ssh/rsa_clusterme

(just hit Enter/Return when asked for a passphrase, to leave it empty). Then

  cat ~/.ssh/rsa_clusterme.pub >> ~/.ssh/authorized_keys

to add it to your authorized SSH keys.

Whenever connecting to a fresh NoMachine session, run (only once per session):

  ssh-add ~/.ssh/rsa_clusterme
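You can check that the key has been picked up by the SSH agent via

  ssh-add -l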

After that the command

  c

should work fully automatically in any terminal (or terminal tab) you open: it should connect to your compute/cluster node without asking for a password. After that, setting up the CeBEM-CCP4 workshop environment (on the cluster/compute node) is done by just typing

  x