RunningAutoProcAtSynchrotrons

Content:

Introduction

Making use of compute clusters
Making use of multiple CPUs/threads

The autoPROC XML file

Presenting results to users

summary.html
summary_inlined.html
ISPyB compatible XML
PDF reports
Deposition-ready PDBx/mmCIF files

Some suggested settings

Accessing image files

Some proposed runs for automatic data-processing with autoPROC
Some proposed options when providing re-processing capability for autoPROC

Processing multiple sweeps from the same crystal together
Processing multiple sweeps from different crystals
Selecting subset of images
Enforcing a given spacegroup or spacegroup and cell
Special handling of (large signal) anomalous/dispersive signal datasets

Example (for Diamond MX beamlines)

External links to documents at synchrotrons/beamlines

Some remarks about image headers

Introduction

Running autoPROC at a synchrotron is no different from running it in the home lab. However, many synchrotrons have optimised their infrastructure for speed and efficiency when it comes to running data-processing jobs, so a few settings can help you get the most out of autoPROC when using those resources.

Please make sure to check with your local IT contact for details about resources and usage policies.

Make sure that the use of autoPROC is referenced adequately according to the licence conditions. Any website (internal or external) describing data processing systems at your synchrotron/beamline that use autoPROC should (1) clearly describe that usage, (2) inform users of the need to reference autoPROC in any publication and (3) provide a pointer to our main autoPROC website.

Making use of compute clusters

As described in the XDS documentation, the spot-searching (COLSPOT) and integration (INTEGRATE) stages can be distributed to multiple compute nodes. Once this has been configured within the XDS installation, autoPROC can be told about this feature using the MAXIMUM_NUMBER_OF_JOBS keyword - e.g. with

% process autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_JOBS=4 ...

Please make sure that all required packages are also available, configured and useable on such a compute cluster.

Note: be aware that this kind of processing does change the actual results - maybe only in small ways for a high quality crystal on a very stable instrument. The integration (INTEGRATE) step will split the various DELPHI blocks into parts that are then processed by independent jobs (be that on different cluster nodes or as different jobs on the same machine), all using the same initial set of parameters describing the crystal and the experiment. Without this feature, each DELPHI block (post-)refines these parameters so that the next block starts with a better set of parameters for integration, which matters if those parameters change significantly during the rotation of the crystal. With the multi-job feature of XDS, this only happens within the DELPHI blocks of each individual (independent) job, but not across jobs. We haven't done any systematic analysis of the effects on normal or problematic datasets/experiments, but you should be aware of this difference, which is the price of a potentially faster runtime.

Making use of multiple CPUs/threads

By default, autoPROC will use all available threads for the XDS and AIMLESS steps. This can be controlled globally via e.g.

% process -nthreads 8 ...

or separately for XDS and AIMLESS. E.g. to use 16 threads for XDS but 32 for AIMLESS one would use:

% process autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_PROCESSORS=16 -nthreads 32 ...

The number of threads used during the INTEGRATE step should be aligned with the number of images processed for each DELPHI block for optimal performance. Using 32 threads for a DELPHI block of 50 images should not be faster than using 25 threads - while using 64 images in each DELPHI block would be equally fast when using 32 threads. The DELPHI parameter itself (given in degrees) is a hint to XDS on how to split the total number of images into blocks. Often, the default of 5.0 degrees fits well with the individual image width (0.05, 0.1, 0.2 deg/image etc) and the resulting number of images per block will then also fit the total number of images. If that is not the case, XDS will increase the number of images per DELPHI block slightly to stay (on average) close to the default of 5.0 degrees per block.

Bottom line: there is a fair amount of interdependency between the image width, the total number of images collected, the number of threads (MAXIMUM_NUMBER_OF_PROCESSORS) and MAXIMUM_NUMBER_OF_JOBS - only if everything is an integer multiple of each other will the available CPUs/threads be used optimally.
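
The arithmetic behind the thread/block remarks above can be sketched in a few lines of Python. This is only a simplified model and not a description of XDS internals: it assumes that each thread handles one image at a time, so that a block needs ceil(images/threads) "rounds", and that the number of images per block is simply DELPHI divided by the image width. The function names are made up for this illustration.

```python
import math

def images_per_block(delphi_deg, image_width_deg):
    """Images in one DELPHI block (assumption: DELPHI / image width, rounded)."""
    return max(1, round(delphi_deg / image_width_deg))

def rounds_per_block(n_images, n_threads):
    """Rounds needed if every thread handles one image at a time (assumption)."""
    return math.ceil(n_images / n_threads)

# The three cases mentioned in the text all need two rounds per block
for n_images, n_threads in [(50, 32), (50, 25), (64, 32)]:
    print(n_images, "images,", n_threads, "threads:",
          rounds_per_block(n_images, n_threads), "rounds")

# Images per block for the default DELPHI of 5.0 degrees
for width in (0.05, 0.1, 0.2):
    print(width, "deg/image:", images_per_block(5.0, width), "images per block")
```

In this model, threads only pay off fully when the number of images per block is an integer multiple of the thread count.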

The autoPROC XML file

autoPROC will write two ISPyB-compatible XML files at the end of processing (into the output directory):

autoPROC.xml describes the final isotropic dataset (usually truncate-unique.mtz)
autoPROC_staraniso.xml describes the final, anisotropically analysed dataset (usually staraniso_alldata-unique.mtz)

The default name can be changed via the autoPROC_CreateXmlFile parameter, e.g.

% process -d 01 ...                                           # 01/autoPROC.xml
% process ...                                                 # ./autoPROC.xml
% process -d 01 autoPROC_CreateXmlFile=`pwd`/01/ispyb.xml ... # 01/ispyb.xml

Releases after 13th Dec 2015 allow injection of your own XML elements, e.g. via

% process autoPROC_CreateXml_LocalElements="AutoProcContainer:AutoProcScalingContainer:AutoProcIntegrationContainer:Image:datasetID=12345" ...

which would insert <datasetID>12345</datasetID> into the

<AutoProcContainer>
  <AutoProcScalingContainer>
    <AutoProcIntegrationContainer>
      <Image>

hierarchy.
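
To illustrate how the colon-separated specification maps onto the XML hierarchy, here is a small Python sketch using the standard xml.etree.ElementTree module. It is not how autoPROC implements this internally: the function name insert_local_element is hypothetical, and the sketch assumes the last item is always of the form name=value and the first item is the root element.

```python
import xml.etree.ElementTree as ET

def insert_local_element(root, spec):
    """Insert <name>value</name> at the colon-separated path given in spec."""
    *path, leaf = spec.split(":")
    name, _, value = leaf.partition("=")
    if not path or path[0] != root.tag:
        raise ValueError("specification must start with the root element")
    node = root
    for tag in path[1:]:
        child = node.find(tag)
        if child is None:
            # create intermediate elements that do not exist yet
            child = ET.SubElement(node, tag)
        node = child
    ET.SubElement(node, name).text = value
    return root

root = ET.Element("AutoProcContainer")
insert_local_element(
    root,
    "AutoProcContainer:AutoProcScalingContainer:"
    "AutoProcIntegrationContainer:Image:datasetID=12345",
)
print(ET.tostring(root, encoding="unicode"))
```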

If there are any missing items, incorrect or inconsistent information in those XML files, please let us know immediately!

Presenting results to users

autoPROC provides a large amount of output - both in terms of files, but also in terms of annotated results that both novice and experienced users should find useful. We invest a lot of effort to make the autoPROC output as understandable and educational as possible. It should not only help users to understand the quality of their final dataset better, but also provide helpful information and suggestions that might improve future experiments. Furthermore, it can sometimes provide indications of local issues with setup or instruments that beamline staff can use.

Therefore, we would like to have as much of autoPROC's "added value" visible to the user as technically possible. For that reason we provide a whole range of result files that should be made available to the user - namely:

summary.html is the main output file describing in detail the whole autoPROC process, including explanations, links to the manual, all relevant plots, warning messages (about multiple lattices, ice-rings, unstable parameter refinement, overloads, ...), detailed anisotropy analysis via STARANISO and much more. Making this visible and available to the user would be our preferred option - if technically possible.

Since the 20180515 release there is the possibility to create a version of that file with all images and referenced files (as far as possible) inlined into the actual HTML page: summary_inlined.html. This will allow the full display of all plots and graphs even after moving autoPROC results to another computer or system. The generation of that file can be triggered by using the "-M ReportingInlined" macro - but be aware that the resulting HTML can be rather large (>100 MB) because of the inlined content. However, it provides a one-stop solution for showing all diagnostic and processing results, so it seems ideal for use in an automatic beamline setup.

Since the 20211020 release there is the possibility to control a few of the CSS styles regarding the left-hand menu and the appearance of the X-scrollbar. If your particular way of presenting our HTML file requires some modifications (e.g. due to you using a fixed window/frame size or such), please get in contact with us.

To have all explanation and documentation links reference our online version of those pages (as opposed to the locally installed pages), the use of autoPROC_CreateSummaryManualDir="http://www.globalphasing.com" is recommended. This will ensure that those links are valid even after updates of the local software or other changes to the site-specific layout.

autoPROC.xml (traditional, isotropic analysis) and autoPROC_staraniso.xml (anisotropic analysis with STARANISO), providing ISPyB-compatible information (mainly the final data scaling/merging statistics). These also reference several result files mentioned below.

truncate-unique.mtz and staraniso_alldata-unique.mtz (final, scaled and merged MTZ reflection data)

truncate-unique.table1 and staraniso_alldata-unique.table1 (ASCII formatted table of final scaling/merging statistics - for overall, inner and outer resolution shells)

truncate-unique.stats and staraniso_alldata-unique.stats (ASCII formatted table of final scaling/merging statistics - as a function of resolution)

report.pdf and report_staraniso.pdf (PDF files with summary of final scaled/merged dataset - including several pages of plots and graphs)

Data_1_autoPROC_STARANISO_all.cif and Data_2_autoPROC_TRUNCATE_all.cif (deposition-ready mmCIF reflection data, containing several data blocks of scaled data)

If there are technical reasons why some of those results can't be presented, please let us know and we will try to work towards a solution. If the above files are stored and made available to users (including the visualisation of results provided in HTML and PDF files), the normal output from autoPROC could be removed or archived. The HTML file in particular will contain a full record of the run conditions and should allow users to process the data again (e.g. at their home institution) using the same commands and software versions.

Finally, please make sure that the use of autoPROC is referenced adequately and that users are aware of the use of autoPROC for particular results they might have achieved automatically from synchrotron systems.

Some suggested settings

By default, standard output from process will contain escape sequences to produce bold or underlined text. If storing standard output is required (and this would be a good idea), then setting the environment variable autoPROC_HIGHLIGHT to "no" will prevent this.
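
For log files that were already stored with highlighting switched on, the escape sequences can be removed afterwards. The following Python sketch assumes that the highlighting consists of standard ANSI SGR sequences (ESC [ ... m), which is not confirmed here; the sample string is invented for illustration and is not real autoPROC output.

```python
import re

# ANSI "Select Graphic Rendition" sequences, e.g. ESC[1m (bold), ESC[4m (underline)
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def strip_highlighting(text):
    """Remove bold/underline escape sequences from captured standard output."""
    return ANSI_SGR.sub("", text)

# Invented sample line, not real autoPROC output
sample = "\x1b[1mSpacegroup\x1b[0m : \x1b[4mP21\x1b[0m"
print(strip_highlighting(sample))
```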

Some sites/beamlines might require specific settings (regarding header information, local coordinate convention, goniostat configurations etc). Since the 20210224 release, the meta-data is analysed to extract a detector identifier (if available) together with a date: these are then used to look up site-specific settings in a distributed database $autoPROC_home//autoPROC/lib/detector-site.def to obtain the most likely correct parameters. These would typically include changes to the rotation axis direction, beam centre convention, vertical/horizontal rotation axes etc. If this is not intended, please add "do_setup=no" to your process command-line.

Furthermore, some pre-defined settings might already be available within our distribution: please see

% process -M list

output for such "macros". You could write your own macro (see "process -M show" for examples) or check our database of known settings. If you know of wrong or missing settings in those tables or have any other information regarding specific beamlines: please let us know.

Although the default parameters for running autoPROC are the result of processing a very large number of datasets over the years, some settings that would speed-up a job could be used - with the caveat that this means running autoPROC in non-default mode.

These settings could include

restricting the number of images to use for spot-searching, e.g. using 10 degrees of images distributed over 4 ranges within the first 180 degrees of data (released 14th Dec 2015):

      XdsSpotSearchNumImagesAngularRange="10.0"
      XdsSpotSearchNumRanges=4
      XdsSpotSearchAngularRange=180

Please be aware that this might significantly hamper autoPROC's ability to detect multiple lattices and ice-rings (and take corrective measures). By using such settings the user might not become aware of serious problems with the dataset.

restricting the number of pictures to produce showing the diffraction image and predictions:

      autoPROC_CreateGpxPicturesAtRotationAngles="0"
      autoPROC_CreateGpxPicturesAtStages="process"

This would only create those pictures at the final processing stage (caveat: the default settings would also create pictures when potential multiple lattices are analysed) and only for the first image (caveat: potentially missing poorer diffraction patterns that are visible only at different angles).
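
To get a feeling for what the spot-search restriction above means in terms of image numbers, here is a small Python sketch. The placement of the ranges (evenly spaced, starting at the first image) is purely an assumption for illustration, and autoPROC may distribute them differently; the function name and its arguments are hypothetical and only mirror the three parameters shown above.

```python
def spot_search_ranges(image_width, total_degrees=10.0, n_ranges=4,
                       angular_range=180.0):
    """Return (first, last) image numbers of evenly spaced wedges (assumption)."""
    per_range = round(total_degrees / n_ranges / image_width)  # images per wedge
    spacing = round(angular_range / n_ranges / image_width)    # images between wedge starts
    return [(k * spacing + 1, k * spacing + per_range) for k in range(n_ranges)]

# 0.1 deg/image: 4 wedges of 25 images each within the first 1800 images
print(spot_search_ranges(0.1))
```

With 0.1 degree images this would use only 100 of the first 1800 images for spot-searching, which shows why multiple lattices or ice-rings visible only at other angles could be missed.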

Accessing image files

Remember that data-processing involves a large amount of disk I/O, especially reading of image files. If this is accommodated in special ways at the synchrotron site, it should be taken advantage of. Specifically:

try accessing the images from the fastest location possible (if they are visible/stored on multiple filesystems)

avoid accessing compressed *.bz2 images: although XDS can handle them, it does so by uncompressing them on-the-fly each time an image is requested. Compressed *.gz images should be fine: autoPROC makes use of the xds-zcbf plugin in those cases.

autoPROC can take advantage of the LIB= settings for XDS when reading images: just add the relevant autoPROC_XdsKeyword_LIB=/where/ever/some/thing to the process command-line

Some proposed runs for automatic data-processing with autoPROC

Apart from the specific details regarding distribution (across nodes and threads), some possible usages of autoPROC could be:

# all defaults:
% process ...

# explicitly assume anomalous signal:
% process -ANO ...

# explicitly assume no anomalous signal:
% process -noANO ...

# use XSCALE for scaling (instead of AIMLESS - not recommended!):
% process -M ScalingX ...

# use CC(1/2) as high-resolution criterion (instead of default I/sigI):
% process -M HighResCutOnCChalf ...

# in case of "poor" diffraction:
% process -M LowResOrTricky ...

# going for pure speed (with all the obvious caveats this entails):
% process -M fast ...

# with known SG:
% process symm=P21 ...

# with known SG and cell:
% process symm=P21 cell="34 45 56 90 98 90" ...

# with reference dataset available
% process -ref /where/ever/ref.mtz ...

Of course these can be combined. See also your local autoPROC reference card at

$autoPROC_home/docs/autoproc/manual/autoproc_reference_card.pdf

Some proposed options when providing re-processing capability for autoPROC

Apart from using autoPROC with (more or less) defaults on each sweep of data, it can easily accommodate a wide array of re-processing options - especially since it was designed for multi-sweep processing right from the start.

One default feature of autoPROC is to automatically combine different sweeps into a single, merged dataset if the wavelength value of those sweeps is identical. This is defined via the parameter

WavelengthSignificantDigits=5

If the wavelength value written into the image header can change between sweeps, you might need to reset that criterion to something more appropriate for your given setup (especially what/how wavelength values are written into image headers), e.g.

% process WavelengthSignificantDigits=4 ...
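
The effect of this parameter can be illustrated with a short Python sketch. It assumes that two wavelengths count as identical when they agree after rounding to the given number of significant digits, which may not be exactly how autoPROC compares them; the sweep names and wavelength values are invented.

```python
from collections import defaultdict

def round_sig(value, digits):
    """Round a value to the given number of significant digits."""
    return float(f"{value:.{digits}g}")

def group_by_wavelength(sweeps, digits=5):
    """Group sweep names by wavelength rounded to significant digits (assumption)."""
    groups = defaultdict(list)
    for name, wavelength in sweeps:
        groups[round_sig(wavelength, digits)].append(name)
    return dict(groups)

# Invented example: sweeps A and B differ only in the fifth significant digit
sweeps = [("A", 0.97631), ("B", 0.97634), ("C", 1.2821)]
print(group_by_wavelength(sweeps, digits=5))  # A and B kept separate
print(group_by_wavelength(sweeps, digits=4))  # A and B grouped together
```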

The most common reasons for re-processing might be

Processing multiple sweeps from the same crystal together

This could be due to multi-wavelengths/MAD data, multiple orientations when using a multi-axis goniostat, (pseudo-)helical scans, interleaved data collection, inverse beam etc. If all images reside in the same directory, just using

% process -I /where/ever/images ...

will be fine. If data is in separate directories, one can use

% process -Id "A,/where/ever/scan1,test_####.cbf,1,900" \
          -Id "B,/where/ever/scan2,test_####.cbf,1,900" \
          ...

To handle multi-orientation data correctly, the instrument/goniostat description needs to be correct and up-to-date - see:

% process -M list

and

% x_kappa -list

for our currently distributed, beamline/instrument specific settings. If you are running a multi-axis instrument at your beamline, please contact us with updates and calibration datasets from time to time!

Processing multiple sweeps from different crystals

Since the orientations of those datasets are unrelated, you need to run with

% process EnsureConsistentIndexing=no ...

This will avoid transforming orientation matrices between different sweeps according to some defined instrument/goniostat model.

Selecting subset of images

This can easily be done using e.g.

% process -Id "A,/where/ever/scan1,test_####.cbf,1,600" \
          -Id "B,/where/ever/scan2,test_####.cbf,201,900" \
          ...

If you want to exclude images in the middle of a sweep, you could run e.g.

% process -Id "A1,/where/ever/scan1,test_####.cbf,1,200" \
          -Id "A2,/where/ever/scan1,test_####.cbf,401,600" \
          -Id "B,/where/ever/scan2,test_####.cbf,201,900" \
          ...

However, this would treat potentially very small wedges as separate datasets. Another option would be to first create a directory with symbolic links to the images wanted

% mkdir tmpA
% ln -s /where/ever/scan1/test_0[01][0-9][0-9].cbf tmpA/.
% ln -s /where/ever/scan1/test_0200.cbf tmpA/.
% ln -s /where/ever/scan1/test_0[45][0-9][0-9].cbf tmpA/.
% rm tmpA/test_0400.cbf
% ln -s /where/ever/scan1/test_0600.cbf tmpA/.

and then run

% process -Id "A,`pwd`/tmpA,test_####.cbf,1,600" \
          -Id "B,/where/ever/scan2,test_####.cbf,201,900" \
          ...
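
The symbolic-link approach above can also be scripted, which avoids hand-crafting shell glob patterns for each range. The following Python sketch is one possible way of doing this: the helper names image_name and link_image_subset are hypothetical, and the demonstration runs on empty placeholder files in a temporary directory rather than on real images.

```python
import re
import tempfile
from pathlib import Path

def image_name(template, number):
    """Expand a template such as test_####.cbf for one image number."""
    return re.sub(r"#+", lambda m: str(number).zfill(len(m.group())), template)

def link_image_subset(src_dir, dest_dir, template, ranges):
    """Symlink all images within the (first, last) ranges into dest_dir."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for first, last in ranges:
        for number in range(first, last + 1):
            name = image_name(template, number)
            target = src_dir / name
            if not target.exists():
                raise FileNotFoundError(target)
            (dest_dir / name).symlink_to(target.resolve())
            linked.append(name)
    return linked

# Demonstration with empty placeholder files instead of real images
with tempfile.TemporaryDirectory() as tmp:
    scan1 = Path(tmp) / "scan1"
    scan1.mkdir()
    for n in range(1, 901):
        (scan1 / image_name("test_####.cbf", n)).touch()
    # keep images 1-200 and 401-600, i.e. exclude 201-400
    linked = link_image_subset(scan1, Path(tmp) / "tmpA",
                               "test_####.cbf", [(1, 200), (401, 600)])
    print(len(linked), linked[0], linked[-1])
```

The resulting directory would then be given to process via -Id with the full image range (1 to 600), exactly as in the command above.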

Enforcing a given spacegroup or spacegroup and cell

This can be implemented via

% process symm=P21 ...

% process symm=P21 cell="40 50 45 90 90.3 90" ...

Special handling of (large signal) anomalous/dispersive signal datasets

To avoid very large anomalous/dispersive differences being classified as outliers, one can run with

% process -ANO ExpectLargeHeavyAtomSignal=yes ...

% process -ANO ExpectLargeHeavyAtomSignal=yes ExpectLargeHeavyAtomSignalScaleAndMerge=yes ...

Example (for Diamond MX beamlines)

A script like

#!/bin/sh

module load autoPROC
module load global/cluster

process -I /where/ever/images \
  autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_PROCESSORS=16 \
  autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_JOBS=4 \
  -d autoPROC.dir | tee autoPROC.log

can be used to take advantage of the compute cluster (forkintegrate has been configured accordingly, the COLSPOT step is not configured for multi-node execution). Of course, any additional command-line arguments can also be added - e.g.

#!/bin/sh

module load autoPROC
module load global/cluster

process -I /where/ever/images \
  symm=P6122 cell="93 93 130 90 90 120" \
  autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_PROCESSORS=16 \
  autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_JOBS=4 \
  -d autoPROC.dir | tee autoPROC.log

Submission would then be done with

% qsub -pe smp 16 -cwd run.sh

(where run.sh is the above script).

External links to documents at synchrotrons/beamlines

Please note that not all of these pages will be up-to-date: developments at synchrotrons/beamlines usually move much faster than documentation can keep up with. If you notice any out-of-date (or non-existent) documentation during your beamline visit, please let us know - especially if you are in a position to help improve those documents.

It would be nice if all beamlines that make use of autoPROC would (1) document this on their user pages, (2) link to our own pages in some way and (3) make the requirement for proper citations/referencing in publications and depositions clear to users.

Synchrotron	Links
ALBA	https://www.cells.es/en/beamlines/bl13-xaloc/running-your-experiment
ESRF	http://www.esrf.eu/UsersAndScience/Experiments/MX/How_to_use_our_beamlines/Run_Your_Experiment/autoproc-global-phasing
	http://www.esrf.eu/UsersAndScience/Experiments/MX/Software/PXSOFT
	http://www.esrf.eu/UsersAndScience/Experiments/MX/How_to_use_our_beamlines/Run_Your_Experiment/automatic-data-processing
SLS	https://www.psi.ch/sls/pxi/status
	https://www.psi.ch/sls/pxii/pxii-manual
	https://www.psi.ch/sls/pxii/status
	https://www.psi.ch/sls/pxiii/data-processing-and-analysis
Diamond	http://www.diamond.ac.uk/Beamlines/Mx/I24.html
	http://www.diamond.ac.uk/Beamlines/Mx.html
ALS	http://bl1231.als.lbl.gov/xtalprogs/xtalprogs.php
	http://bl831.als.lbl.gov/~gmeigs/links/links.html
	http://www.mbc-als.org/manual.html
Soleil	http://www.synchrotron-soleil.fr/Recherche/LignesLumiere/PROXIMA1/UserInfo
PETRA-III	http://www.embl-hamburg.de/services/mx/software/
	http://www.embl-hamburg.de/SoftwareManuals/#data
Australian Synchrotron	http://www.synchrotron.org.au/aussyncbeamlines/macromolecular-crystallography/faqs-mx-beamlines
APS	https://ls-cat.org/links.html
	https://www.gmca.aps.anl.gov/computing/software_crystallography.html#autoPROC
SSRL	http://smb.slac.stanford.edu/facilities/software/xtal_software/
MAX-IV	https://www.nsc.liu.se/support/presto/index.html
	https://www.maxiv.lu.se/accelerators-beamlines/beamlines/biomax/user-access/data-handling-and-processing-at-biomax/
	https://www.maxiv.lu.se/accelerators-beamlines/beamlines/biomax/user-access/processing-results-using-ispyb/
	https://www.maxiv.lu.se/fragmax/fragmaxapp/

Some remarks about image headers

Within autoPROC, the imginfo program is responsible for extracting meta-data from all supported image formats. Whenever changes are introduced into the raw diffraction data produced by a specific beamline/detector combo, please make sure that imginfo still produces the same summary output as before.

It would be great if synchrotron beamlines could check how other synchrotrons/beamlines are writing image file headers and stick as closely as possible to those. At the moment (20210413) we have e.g. a probably unnecessary number of variants for the Detector keyword in mini-cbf headers:

 Detector: (null), S/N E-32-0119
 Detector: ADSC HF-4M, S/N H401,
 Detector: D19@ILL curved detector
 Detector: Dectris Eiger 16M, E-32-0107
 Detector: Dectris Eiger 16M, S/N E-32-0100
 Detector: Dectris Eiger 16M, S/N E-32-0101
 Detector: Dectris Eiger 16M, S/N E-32-0102
 Detector: Dectris Eiger 16M, S/N E-32-0104
 Detector: Dectris Eiger 16M, S/N E-32-0108
 Detector: Dectris Eiger 16M, S/N E-32-0110
 Detector: Dectris Eiger 16M, S/N E-32-0113
 Detector: Dectris Eiger 16M, S/N E-32-0115
 Detector: Dectris Eiger 16M, S/N E-32-0116
 Detector: Dectris Eiger 4M, S/N E-08-0104
 Detector: Dectris Eiger 9M, S/N E-18-0101
 Detector: Dectris Eiger 9M, S/N E-18-0102
 Detector: Dectris Eiger 9M, S/N E-18-0103
 Detector: Dectris Eiger 9M, S/N E-18-0104
 Detector: Dectris Eiger2 9M, S/N E-18-0110
 Detector: Eiger 16M, S/N (null)
 Detector: Eiger 4M, S/N (null)
 Detector: PILATUS 12M, S/N 120-0100
 Detector: PILATUS 2M, S/N 24-0103, Elettra
 Detector: PILATUS 2M, S/N 24-0107 Diamond
 Detector: PILATUS 2M, S/N 24-0109
 Detector: PILATUS 2M-F, S/N 24-0109-F
 Detector: PILATUS 2MF, S/N 24-0109-F
 Detector: PILATUS 6M
 Detector: PILATUS 6M Prosport+, S/N 60-0100 Diamond
 Detector: PILATUS 6M, 60-0103, IMCA-CAT
 Detector: PILATUS 6M, S/N 60-0101 SSRL
 Detector: PILATUS 6M, S/N 60-0101, 
 Detector: PILATUS 6M, S/N 60-0102, PSI
 Detector: PILATUS 6M, S/N 60-0104, ESRF ID29
 Detector: PILATUS 6M, S/N 60-0104, ESRF ID30B
 Detector: PILATUS 6M, S/N 60-0106, Soleil
 Detector: PILATUS 6M, S/N 60-0107, BNL
 Detector: PILATUS 6M, S/N 60-0108, Alba
 Detector: PILATUS 6M, S/N 60-0113, 
 Detector: PILATUS 6M, S/N 60-0116-F, ESRF ID23
 Detector: PILATUS 6M, S/N 60-0118
 Detector: PILATUS 6M, S/N 60-0118, HZB-BESSYII BL14.1
 Detector: PILATUS 6M, SN 60-0001, X06SA@SLS.PSI.CH
 Detector: PILATUS 6M-F, S/N 60-0105-F
 Detector: PILATUS 6M-F, S/N 60-0112-F
 Detector: PILATUS 6M-F, S/N 60-0114-F
 Detector: PILATUS 6M-F, S/N 60-0115-F
 Detector: PILATUS 6M-F, S/N 60-0117-F
 Detector: PILATUS 6MF, S/N 60-0102-F, PSI
 Detector: PILATUS 6MF, SN 60-0001-F, X06SA@SLS.PSI.CH
 Detector: PILATUS 6MF-0109
 Detector: PILATUS3 2M, S/N 24-0118, ESRF ID23
 Detector: PILATUS3 2M, S/N 24-0118, ESRF ID30
 Detector: PILATUS3 2M, S/N 24-0124
 Detector: PILATUS3 300K, S/N 3-0226
 Detector: PILATUS3 6M, S/N 60-0119
 Detector: PILATUS3 6M, S/N 60-0122
 Detector: PILATUS3 6M, S/N 60-0126
 Detector: PILATUS3 6M, S/N 60-0128, ESRF ID29
 Detector: PILATUS3 6M, S/N 60-0128, ESRF ID30
 Detector: PILATUS3 6M, S/N 60-0131
 Detector: PILATUS3 6M, S/N 60-0132
 Detector: PILATUS3 6M, S/N 60-0134
 Detector: PILATUS3 6M, S/N 60-0135
 Detector: PILATUS3 6M, S/N 60-0136

Even though the specification does allow for any kind of string here, it would be helpful to stick with some sensible, common syntax ;-)
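
To show why such variation is awkward for any software that needs to identify a detector, here is a small Python sketch that tries to extract the serial number from a few of the header lines listed above and reports detectors that appear under more than one spelling. The regular expression is only a guess at the serial number formats visible in this list and is not how imginfo works.

```python
import re
from collections import defaultdict

# Guess at serial number formats seen above, e.g. E-32-0100, 60-0118, 24-0109-F
SERIAL = re.compile(r"\b(?:[A-Z]-)?\d{1,3}-\d{4}(?:-F)?\b")

def serial_number(header_line):
    """Return the serial number found in a Detector header line, or None."""
    match = SERIAL.search(header_line)
    return match.group(0) if match else None

headers = [
    "Detector: Dectris Eiger 16M, E-32-0107",
    "Detector: Dectris Eiger 16M, S/N E-32-0100",
    "Detector: PILATUS 2M-F, S/N 24-0109-F",
    "Detector: PILATUS 2MF, S/N 24-0109-F",
    "Detector: PILATUS 6M, S/N 60-0118",
    "Detector: PILATUS 6M, S/N 60-0118, HZB-BESSYII BL14.1",
    "Detector: PILATUS 6M, SN 60-0001, X06SA@SLS.PSI.CH",
    "Detector: Eiger 16M, S/N (null)",
]

# Collect all spellings seen for each serial number
variants = defaultdict(set)
for line in headers:
    variants[serial_number(line)].add(line)

for serial, lines in variants.items():
    if serial is not None and len(lines) > 1:
        print(serial, "->", len(lines), "spellings")
```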