Pipedream documentation

This software is proprietary to and embodies the confidential technology of Global Phasing Limited (GPhL). Possession, use, duplication or dissemination of the software is authorised only pursuant to a valid written licence from GPhL.

Authors: (2011 - 2024) A. Sharff, P. Keller, C. Vonrhein, O. Smart, T. Womack, C. Flensburg, W. Paciorek and G. Bricogne
Contact: buster-develop@globalphasing.com
Version: 1.4.1

Partial support from EU projects: SILVER (FP7-HEALTH-F3-2010-260644)

1. Pipedream.

1.1. What is Pipedream?

Pipedream is an "expert" system to link and automate [a] data processing with autoPROC, [b] a "limited" molecular replacement stage with Phaser, [c] structure refinement with BUSTER and (where requested) [d] automated ligand fitting with Rhofit with [e] subsequent BUSTER post-refinement of the top solution. The required input for Pipedream is an input data set, either in the form of unprocessed diffraction images or as a pre-processed mtz file, an input model and optionally, an associated mtz file. Consistent relationships between these items are expected, as detailed below.

1.2. Scope and Limitations.

Pipedream has been specifically designed as a pipeline tool to facilitate the use and integration of Global Phasing’s primary software packages (autoPROC, BUSTER and Rhofit) into a (high-throughput) fragment/ligand screening pathway. As such, its scope is quite rigidly defined and a number of limitations on its use apply.

It is anticipated that the primary use for Pipedream is where multiple data sets have been collected on a single target, differing only in the soaking/co-crystallisation conditions of the crystals. As such, it requires that the structure used as the input model for structure solution be essentially identical to that present in the crystals from which the various datasets have been collected - not only the same protein and same sequence, but CRITICALLY, the same space group and cell dimensions (allowing for small differences in the latter due to non-isomorphism, changes in environment due to soaking/co-crystallisation and freezing)!

The input model should be an appropriate APO structure. This may be a native structure or indeed a structure with a known ligand bound. Any "non-protein" molecules that are present in the binding site of interest (i.e. where you are looking for new bound ligands) MUST be removed from the input model. Other built-in ligands, such as cofactors, prosthetic groups, ions etc. (i.e. other HETATM records) can be retained. Careful consideration should be given to any water molecules in the input model. As the chemical environment in soaked/co-crystallised crystals may well differ from that of the "apo" structure used as the input model, the water structure in the "apo" model may not be fully conserved. Therefore, Pipedream will remove all water molecules by default. Where specific water structure is known or assumed to be conserved, the explicit argument "-keepwater" can be used to override this behaviour (see Optional arguments). IMPORTANT: Conserved water molecules, together with other non-covalently bound HETATM records, in the input pdb file MUST have been assigned the chain id corresponding to the protein chain to which they are bound, otherwise they will not be retained.

Pipedream is (currently) NOT designed as a tool for automated structure solution. It cannot deal with data that require experimental phase determination, nor can it deal with cases that require full Molecular Replacement.

1.2.1. Minimum input

A reference structure (pdb file) and associated reference mtz file. The latter is optional though strongly recommended. If both are given, Pipedream will confirm that the space group of the reference structure and mtz file match and that the cell dimensions are substantially the same, otherwise it will terminate immediately. Where given, the reference mtz file MUST contain a set of structure factors, ideally those from which the reference pdb structure was refined. If the reference mtz file also contains a Free R set, this will be "transferred" to the experimental data (whether input as raw images or as a pre-processed mtz file). This is considered good practice as it allows proper cross-validation of all related datasets that are refined against the input model (Pozharski et al. Acta D (2013), 69, 150-167). Where a reference mtz file is not given, Pipedream will "back calculate" structure factors from the reference model, will generate a new Free R set and will combine them into an ersatz reference mtz file for use in subsequent steps. We advise against doing this if at all possible.
An input data set. This can be in the form of unprocessed images, in which case the data will be automatically processed by autoPROC, the output directory of an independent autoPROC run, or a file containing already processed scaled/merged data. This can be an mtz file, a Scalepack reflection file or a d*TREK reflection file. Both Scalepack and d*TREK reflection files will be converted to mtz format before further processing. Where an autoPROC output directory or a file of pre-processed data is input, Pipedream will confirm that the space group of the input data matches the reference structure and that their cell dimensions are "similar", otherwise it will terminate. If the input data file contains structure factor amplitudes, they will be used. However, if only intensities are available, Pipedream will automatically run truncate to calculate structure factor amplitudes. Note that Pipedream will terminate if it finds more than one set of structure factor amplitudes unless a unique F/SIGF pair has been specified. Pipedream will also ensure that the input and reference mtz files are consistently indexed, reindexing the input mtz file if necessary. Note that Pipedream will only accept experimental data as unprocessed images if autoPROC is available. If not, only pre-processed data, in the form of an autoPROC output directory, an mtz file, Scalepack or d*TREK reflection file, will be accepted.
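When preparing many runs, it can be convenient to check that a reference pdb/mtz pair looks consistent before launching Pipedream. The following is a minimal sketch of such a pre-flight check, not a reproduction of Pipedream's own test. The function names are hypothetical and the tolerances are placeholders chosen purely for illustration, since this section does not state the limits that Pipedream applies.

```python
def cells_similar(cell_a, cell_b, max_length_fraction=0.05, max_angle_deg=2.0):
    """Compare two unit cells given as (a, b, c, alpha, beta, gamma).

    The tolerances are illustrative placeholders only; they are NOT the
    limits that Pipedream itself uses.
    """
    lengths_ok = all(
        abs(x - y) / y <= max_length_fraction
        for x, y in zip(cell_a[:3], cell_b[:3])
    )
    angles_ok = all(
        abs(x - y) <= max_angle_deg
        for x, y in zip(cell_a[3:], cell_b[3:])
    )
    return lengths_ok and angles_ok


def check_reference_pair(sg_pdb, cell_pdb, sg_mtz, cell_mtz):
    """Raise ValueError if the reference pdb and mtz do not look paired."""
    # Compare space group symbols ignoring spacing and case,
    # e.g. "P 21 21 21" and "P212121" are treated as equal.
    if sg_pdb.replace(" ", "").upper() != sg_mtz.replace(" ", "").upper():
        raise ValueError(f"space group mismatch: {sg_pdb!r} vs {sg_mtz!r}")
    if not cells_similar(cell_pdb, cell_mtz):
        raise ValueError("cell dimensions differ by more than the chosen tolerance")
```

The space group and cell values would have to be read from the files by other means (for example with CCP4 tools); that step is deliberately left out of the sketch.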

Pipedream is designed to ensure that all output structures (and maps) are in the same asymmetric unit as the model used as input to Pipedream, so that all output structures and maps from multiple runs of Pipedream (using the same reference model) are directly superimposable. For this purpose, all input data sets are examined to check that they are consistently indexed with the reference structure and that both input and reference data conform to the CCP4 definition of the asymmetric unit for the appropriate space group. Thus it is important that the reference mtz file should be directly associated with the reference pdb file. If this is not the case, the limited MR procedure may not be successful and in any case the consistency checks to ensure that the output is superimposable over the input will in effect be bypassed. This caveat also applies in the event that the same mtz file is used as both the input and reference data. Although not proscribed, this is definitely not recommended and Pipedream will generate a warning if it detects that this is the case.

Pipedream would usually be run with all of the required input data/files specified on the command line. However, a plugin mechanism is provided to allow a user-provided script to furnish any or all of the required input data/files to Pipedream; see Appendix C for full details.

1.3. Multiple models.

Conformational change in proteins is a well-known and well-studied phenomenon. Such changes can be extremely localised, limited to alterations in side-chain conformation, or they can be much more extensive, such as rigid body domain movements. Localised loop movements are frequently observed in proteins, particularly in response to ligand / cofactor binding.

Such loop movements can be of a large enough magnitude that refinement alone is unable to deal with them - hence it is important to pick the correct input model for refinement. For example, if you are looking at a protein where a loop occludes the known ligand binding site in its apo state, but moves out of the site in response to the presence of a ligand, then it does not make sense to use an apo model, with the loop in the "in" conformation, in refinement against data where a ligand is bound. Refinement alone is unlikely to move the loop out of the binding site and subsequent ligand fitting will fail, resulting in a false negative. Conversely, it is similarly unwise to refine a model in the ligand-bound conformation against data from a crystal in the apo conformation, as this can result in a false positive.

However, in the context of running a fragment/ligand screening pathway, how do you know ahead of time whether or not the soaked ligand has bound, and therefore which input model to use with Pipedream?

Pipedream deals with this issue by allowing the input of multiple models, performing an initial refinement on all of them, after which it makes a decision as to which best matches the experimental data. It will then carry on with refinement and ligand fitting using this model alone.

1.3.1. Procedure

In order to determine which model best matches the experimental data, Pipedream looks at main-chain real-space correlation coefficients (calculated against the refined electron density maps, with the CCP4 program edstats) after initial refinement. The model that best matches the data is the one with the highest mean CC. However, this can be very insensitive when calculated over the entire structure. Therefore, to increase the sensitivity of the method, Pipedream calculates and uses the mean CC only for residues where there is significant conformational change.

The preferred method of identifying these residues is for them to be explicitly defined by the user. However, if they are left undefined, Pipedream will attempt to automatically identify regions showing conformational change by stepping through the structure and looking at pairwise RMS deviations for each residue. By default, any residue which has an RMSD of greater than 1.5Å in any of the pairwise comparisons will be selected. Note that in the eventuality that no residues are found above the defined RMSD threshold, Pipedream will select as the best match the model with the lowest Rfree after initial refinement.
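The selection logic described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Pipedream's implementation: the data layout (dictionaries of per-residue coordinates and per-residue CC values) and all function names are hypothetical, the models are assumed to be superimposed already, and the atoms entering each per-residue RMSD are whatever the caller supplies, since the text does not state which atoms Pipedream uses.

```python
import math
from itertools import combinations


def residue_rmsd(atoms_a, atoms_b):
    """RMSD between two equally ordered lists of (x, y, z) coordinates."""
    sq = [sum((p - q) ** 2 for p, q in zip(a, b)) for a, b in zip(atoms_a, atoms_b)]
    return math.sqrt(sum(sq) / len(sq))


def select_residues(models, threshold=1.5):
    """Residues whose RMSD exceeds `threshold` (in Angstrom) in ANY model pair.

    models: {model_name: {residue_id: [(x, y, z), ...]}}
    Only residues present in every model are compared.
    """
    common = set.intersection(*(set(m) for m in models.values()))
    selected = set()
    for a, b in combinations(models.values(), 2):
        for res in common:
            if residue_rmsd(a[res], b[res]) > threshold:
                selected.add(res)
    return sorted(selected)


def pick_model(cc, rfree, residues):
    """Choose the best model after initial refinement.

    cc:    {model_name: {residue_id: real-space CC}}
    rfree: {model_name: Rfree}
    """
    if not residues:
        # No residue above the RMSD threshold: fall back to the lowest Rfree.
        return min(rfree, key=rfree.get)

    def mean_cc(model):
        return sum(cc[model][r] for r in residues) / len(residues)

    # Otherwise the model with the highest mean CC over the selected residues wins.
    return max(cc, key=mean_cc)
```

Restricting the mean to the selected residues is what gives the comparison its sensitivity: averaged over the whole chain, a handful of badly fitting loop residues would barely change the mean CC.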

1.3.2. Requirements and limitations

The method employed currently is designed to distinguish between conformational changes produced by main-chain differences, such as loop movements. Additional, limited domain movement (such as hinge-bending) can also be accommodated by running Pipedream with an appropriately constituted rigid body definition file, so that the initial refinement can correct for relative domain shifts before model comparison.

However, the current method is not sensitive to conformational change caused solely by side-chain movements.

The input model requirements listed in section 1.2 (same space group and cell as the experimental data, same protein, same sequence) apply to ALL input models. In addition, all of the input models must be directly superimposable. Furthermore, they must all share a common residue numbering and chain identification scheme.

Importantly, unmodelled residues in one or more of the input models (presumably due to disorder) are to be avoided if at all possible. Any significant number of unmodelled residues in any of the input models (unless missing from ALL of them) could potentially compromise the ability of Pipedream to select the correct model.

Pipedream deals with unmodelled regions slightly differently, depending on how they have been defined. Where Pipedream is left to determine automatically which regions to compare, it will remove from consideration any residue, regardless of pairwise RMS deviation, which does not appear in all of the input models. The potential drawback is that regions where there is genuine conformational change could be excluded from analysis if those regions are not present in one or more of the input models. Where the residues for analysis are specified by the user (the preferred method), any residues missing in one of the models would be assigned a default CC of 0 for that model. Again, a significant number of missing residues from one or more of the input models could potentially compromise the ability of Pipedream to select the correct model.
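The two treatments of unmodelled residues can be illustrated with a short Python sketch. It is only a schematic reading of the paragraph above, with hypothetical function names and data structures; it is not taken from the Pipedream source.

```python
def residues_for_auto_mode(residues_per_model):
    """Automatic mode: only residues present in ALL input models are considered."""
    sets = [set(residues) for residues in residues_per_model]
    return sorted(set.intersection(*sets))


def mean_cc_user_defined(cc_by_residue, residues):
    """User-defined mode: a residue missing from a model scores CC = 0 for it.

    cc_by_residue: {residue_id: CC} for the residues that exist in this model
    residues:      the residue ids specified by the user
    """
    return sum(cc_by_residue.get(r, 0.0) for r in residues) / len(residues)
```

For example, with user-specified residues 10, 11 and 12, a model that lacks residue 12 but fits the other two with CC values of 0.9 and 0.8 obtains a mean of about 0.57, and so loses to a complete model that fits all three with a CC of only 0.7. This is why missing residues can bias the choice of model.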

Bear in mind that the intended use and scope of Pipedream imply that the ONLY difference between the input models (and their internal PDB attributes) should be in (relatively) localised conformational changes.

1.3.3. Recommended use

ALL input models must be superimposed on each other before input into Pipedream (using CCP4 program gesamt or coot) - Pipedream will not superimpose them itself.

Whilst there is no limit to the number of models that can be input, if there are a limited number of distinct conformational states, we would recommend using only the one (or two) best models representative of each conformational state. Adding more and more very similar models may simply increase the CPU time required without any improvement in precision in arriving at the correct solution.

Although Pipedream can be used to automatically identify the regions that differ (as described above), the preferred method is to tell Pipedream which residues to use for structure comparison, using either (or indeed both) the -seqin1 or -seqin2 options (see Chapter 3 for a description of the use of these options).

Ideally, analysis of the input models should be based on comparison of at least 5 residues. Although Pipedream does not enforce a minimum number of residues, if run in automatic residue determination mode, it will note in the main output if the number of residues selected above the threshold is below 4. In this case, you may want to re-run Pipedream lowering the default RMSD threshold.

1.4. Program dependencies and acknowledgements.

As well as autoPROC, BUSTER and Rhofit, Pipedream will run various CCP4 programs. In particular, Pipedream requires version 2.5.6 (or later) of Phaser, which is installed in CCP4 versions from 6.4.0 onwards. Pipedream also requires the reduce program from MolProbity.

Pipedream also incorporates buster-report, which has a number of external dependencies, such as mogul and grace. For further details please see the locally installed software installation instructions in <installation root directory>/docs/installation. Pipedream will test for the availability of these dependencies and if certain ones cannot be satisfied will not attempt to run buster-report.

We are grateful to Tassos Perrakis, Robbie Joosten and the Netherlands Cancer Institute (NKI) for permission to distribute and make use of programs pepflip and SideAide, components of the PDB_REDO suite (http://www.cmbi.ru.nl/pdb_redo/), in Pipedream.

You can test that all of Pipedream’s dependencies have been satisfied by running pipedream -checkdeps.

2. Pipedream architecture

Pipedream runs several packages, each generating its own output. To keep the output from each stage separate and clear, Pipedream generates a specific directory structure. Definition of a root directory <ROOT> in which to create this structure is obligatory.

Pipedream can be run manually. However, it has been written so that it can be called automatically, with multiple runs executed in the background or submitted to remote machines. Thus, it does not write any information to standard output, unless problems with the input files or mistakes made in invoking Pipedream prevent normal execution. All output is written to disk and may be reviewed at leisure. A summary of the main output is written into the file <ROOT>/summary.out.

Stage 1: Input x-ray diffraction data are processed with autoPROC (unless a pre-processed mtz file is used as the primary input). Output is written into the directory <ROOT>/process, with the standard output from autoPROC in <ROOT>/process.out.

Stage 2: The degree of non-isomorphism observed between crystals, especially after soaking experiments, can easily exceed the limit that can be corrected by rigid body refinement. Thus, Phaser is used in a specific mode to run a very "limited" molecular replacement procedure. This has the advantage that it is fast and can deal with fairly significant molecular movements due to non-isomorphism and/or conformational changes due to ligand binding. The angular range allowed for the function is matched to what is accepted as a reasonable degree of cell dimension variability - see Appendix A. Whilst this angular range can be doubled in cases of more extensive non-isomorphism, the procedure CANNOT cope with the more significant transformations seen where the search model has a different cell / symmetry to the data. By default, the input structure is treated as a single rigid unit, regardless of the number of protein chains present in the model. However, where the asymmetric unit contains multiple chains (whether homomeric or heteromeric) and if so desired, individual chains or groups of chains can be defined as separate units (with the -chains option) and will be treated independently. This approach may well be beneficial in such cases. However, one caveat that should be borne in mind is that translational NCS in the input model could lead to a possible failure mode. If the input model is known to contain chains related by translational NCS then either they should not be treated as independent units, or the brute force translation function should be selected (with the -btf keyword).

Stage 3: The structure is then refined with BUSTER, with the explicit aims of producing a) the best refined model consistent with the data and b) the best difference density to aid in the identification of potentially bound ligands. Three different refinement protocols are available ("default", "thorough" and "quick"), the choice of which is dependent on the size and degree of movement / flexibility observed in the target structure, the quality of the experimental data and the degree of change in the target structure (relative to the starting model). Pipedream supports two different protocol sets as of October 2020. The original (version 1) refinement protocols can be accessed via the -v1 command-line option for backwards compatibility. However, we recommend the use of the current (version 2) protocols that we have found to give better results on average. Please see Appendix B for full details of the refinement protocols.

The section below gives an overview of the three current refinement protocols:

default: 2 runs of BUSTER refinement

`Run_1:`	Standard refinement, including Rigid Body at big cycle 1
`Run_2:`	adding TLS, 1st and 2nd shell water placement

thorough: 3 runs of BUSTER refinement

`Run_1:`	Standard refinement, including Rigid Body at big cycle 1
`Run_2:`	adding TLS, 1st shell water placement
`Run_3:`	TLS, 2nd shell water placement

quick: performs a single BUSTER refinement run

`Run_1:`	Standard refinement, including Rigid Body at big cycle 1, TLS, 1st and 2nd shell water placement

The output for the final run (independent of chosen protocol) is written into the directory <ROOT>/refine, with the standard output from BUSTER in <ROOT>/refine.out. The final model from this BUSTER run is used for subsequent ligand fitting and refinement of the resulting complex structure.

Unmodelled density elicitation

Note: version 2 protocols only.

Following completion of whichever of the above refinement protocols has been selected, an additional run of BUSTER will be performed as follows:

`Run_L:`	TLS, -L, calculation of map coefficients limited to 1.9Å resolution

Note that the ONLY output from this BUSTER run that is used in further steps is the set of map coefficients in the output refine.mtz file. The output model is discarded. The -L option used in this run places "waters" in the model to try to explain unmodelled density. At the end of the penultimate big cycle, connected networks of these "waters" (connected by density) are identified and removed prior to the final big cycle, hopefully enhancing the density features in which they had been placed. However, many of the other "waters" placed by -L will be left in the model and since their initial placement did not follow the same rules used to place genuine waters (i.e. close proximity to a hydrogen bond donor / acceptor), some may not be genuine waters.

Therefore, the subsequent run of Rhofit (see Stage 4 below) will use the output model from the final refinement run (in directory "refine", which has had only genuine waters added) together with the map coefficients from the unmodelled density elicitation run (directory "refine-L"). Extensive internal testing has shown us that the procedure to analyse density maps for the location, extent and shape of potentially unmodelled regions works best when those maps are computed to a maximum resolution of about 1.9Å. Computing electron density maps at higher resolution tends to break the connectivity of (potentially) unmodelled density regions and therefore complicates the analysis step at this point.

Which protocol to use.

Results of internal testing suggest that the default protocol should be appropriate for many cases and that would certainly be our recommendation in the first instance (hence it is the default). For larger proteins, particularly complexes with more than one chain in the asymmetric unit, or structures which show a higher degree of variability and conformational flexibility, particularly due to soaking / ligand binding, the thorough protocol may be more appropriate. For fairly rigid proteins that show little flexibility or variation (especially in the face of soaking / ligand binding), where the input model has been fully characterised and refined and it is known that the target structure will be nearly identical to the starting (APO) model in terms of crystal packing, domain arrangement, loop and side-chain conformations, the quick protocol may be sufficient. Please be aware that all BUSTER refinements have their own convergence criteria (so won’t necessarily run for the maximum specified number of iterations) - which means that in such situations the default or thorough protocols might not be much slower, whilst at the same time providing the safety net of being able to handle more complex and / or unexpected situations.

Model remediation.

Amino acid sidechains can often be seen to shift, sometimes adopting totally different conformations, between datasets collected from different crystals. This can be a response to multiple differences between individual crystals, particularly differences in soaking with different compounds. The shifts seen in sidechains can be beyond the ability of standard refinement to correct.

SideAide, part of the PDB_REDO suite, can be run to check the modelled sidechain conformations against the electron density and refit them (if indicated) by searching all allowed rotamers to find the best fit. In addition, SideAide can rebuild sidechains that have been stubbed (please note that this is NOT the default as used in Pipedream).

Pepflip, also part of the PDB_REDO suite, can be run to check for and correct any peptide backbone flips. This is NOT run by default by Pipedream.

Model remediation (using SideAide and pepflip) can be requested in Pipedream with the -remediate keyword.

Please note that this option CANNOT be used in conjunction with the quick refinement protocol.

Where called in conjunction with the default protocol, it will be run in between the 1st and 2nd rounds of refinement.

Where called in conjunction with the thorough protocol, it will be run in between the 2nd and 3rd rounds of refinement.

After remediation, the modelcompare program (also part of the PDB_REDO suite) is run to analyse and compare the output model from SideAide with the model output from the preceding BUSTER refinement. As well as generating a summary of the impact of running SideAide (and pepflip) that is presented in the final summary.out, it also generates Scheme and Python scripts that can be read into coot to aid visualisation of the impact of SideAide (and pepflip).
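The position of the remediation step within each protocol can be summarised by a small Python sketch. The names used here (PROTOCOL_RUNS, refinement_schedule) are hypothetical and serve only to restate the rules given above; they do not correspond to anything inside Pipedream, and the additional Run_L of the version 2 protocols is not shown.

```python
# Number and order of BUSTER runs per protocol, as listed in the overview above.
PROTOCOL_RUNS = {
    "default": ["Run_1", "Run_2"],
    "thorough": ["Run_1", "Run_2", "Run_3"],
    "quick": ["Run_1"],
}


def refinement_schedule(protocol="default", remediate=False):
    """Return the ordered list of steps for a protocol, with optional remediation."""
    runs = list(PROTOCOL_RUNS[protocol])
    if remediate:
        if protocol == "quick":
            raise ValueError("-remediate cannot be combined with the quick protocol")
        # Remediation sits immediately before the last refinement run:
        # between runs 1 and 2 (default) or between runs 2 and 3 (thorough).
        runs.insert(len(runs) - 1, "remediation")
    return runs
```

In both permitted cases the remediated model therefore still passes through one complete BUSTER run before ligand fitting.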

Log Likelihood Outlier removal:

All of the individual runs of BUSTER run by Pipedream, both pre- and post- Rhofit, analyse the input data and will exclude reflections from the output mtz that are flagged as 7-sigma Log Likelihood outliers (these are predominately poorly measured reflections or those within ice-rings). Subsequent BUSTER runs within Pipedream will use the mtz file from the previous round, with these outliers rejected. To prevent Log Likelihood outlier rejection, specify the -nologlikerej command-line option.

Multiple model input:

If run with more than one input model, Pipedream will run stage 2 and the first cycle of refinement (as defined by the specified refinement protocol in stage 3 above) for each of the input models. After automatic analysis of the conformational differences between the refined models (unless the -seqin1 and/or the -seqin2 options are specified), Pipedream will select which of the refined models gives the best fit to the data over the selected residues. Subsequent steps are only carried out with this one model. Following selection, the remaining refinement cycles (unless the quick protocol was specified) are run on the selected model.

Stage 4: If specified with one or more CIF restraint dictionaries (from Grade2 or other similar dictionary generator) for soaked/co-crystallised ligands, Rhofit will be run for each specified ligand in turn to attempt to locate and fit the ligand into the refined structure. Output is written into the directory <ROOT>/rhofit-<dictionary name>. Standard output from rhofit is in <ROOT>/rhofit-<dictionary name>.out. Unless specifically told to ignore non-crystallographic symmetry (or if there is no ncs) Pipedream will assume that the number of potential ligand binding sites is equal to the observed ncs in the input model. By default, Rhofit will attempt to fit <ncs> copies of the ligand. If there is no ncs, Rhofit will attempt to fit a single copy of the ligand. If you expect to see the ligand bound to more than one site per monomer then you will need to tell Rhofit how many "clusters" to identify and fit. See Optional arguments for more details.

The "top" solution from Rhofit will be automatically post-refined by a further run of BUSTER, unless Pipedream is specifically told not to with the -nopostref option. The default is to perform a single full, standard BUSTER run. However, if the intention is simply to update the ligand fit and generate new maps, a short BUSTER refinement can be requested (using the -M ShortRunVoid macro). Alternatively, in cases where the fitted ligand is quite large and/or has several degrees of freedom, or the protein structure is quite large and flexible, a more thorough post-refinement can be requested (with the -postthorough option). This will run two rounds of BUSTER. By default, post-refinement will also refine the occupancy of the ligand(s) fitted by Rhofit, unless specifically told not to do so. Prior to post-refinement, hydrogen atoms will be added to the ligand only, at full occupancy, if the resolution is 2.0Å or worse, or to both protein and ligand, again at full occupancy, if the resolution is better than 2.0Å. The resolution limit below which the full protein and ligand will be hydrogenated can be altered with the -hydrogenation_res option and hydrogens can be added at zero occupancy if the -hydrogenation_zeroocc option is used. The output from this run will be written to <ROOT>/postrefine-<dictionary name>, with standard output written to <ROOT>/postrefine-<dictionary name>.out.
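The hydrogenation rule applied before post-refinement can be written down as a short Python sketch. The function name and return structure are hypothetical; the threshold and occupancy arguments stand in for the effect of the -hydrogenation_res and -hydrogenation_zeroocc options as described above, and the code is an illustration rather than Pipedream's own logic.

```python
def hydrogenation_plan(resolution, threshold=2.0, zero_occupancy=False):
    """Decide what gets hydrogens before post-refinement.

    resolution:     high-resolution limit of the data in Angstrom
    threshold:      limit below which protein AND ligand are hydrogenated
    zero_occupancy: add the hydrogens at zero rather than full occupancy
    """
    # "2.0 A or worse" means a numerically larger (or equal) resolution value.
    if resolution >= threshold:
        targets = ["ligand"]
    else:
        targets = ["protein", "ligand"]
    return {"targets": targets, "occupancy": 0.0 if zero_occupancy else 1.0}
```

With the defaults, a 2.0Å data set has hydrogens added to the ligand only, whereas a 1.8Å data set has them added to both protein and ligand.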

Note: The implementation of Rhofit in Pipedream allows fitting of a single ligand or, where a crystal has been soaked in a cocktail of compounds, fitting each component independently to allow the user to determine which, if any, component has bound, i.e. to answer the question "Does compound A or B or … etc bind?". It CANNOT be used to successively fit multiple compounds into a structure, i.e. to answer the question "Do compounds A and B and … etc all bind?".

buster-report will also be run (unless its dependencies are not satisfied) to give a concise report on the outcome of refinement. If both Rhofit and subsequent post-refinement have been requested, buster-report will be run on the output of post-refinement. If not, it will be run on the final output of the initial refinement protocol. The output from buster-report will be written to <ROOT>/report.
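For scripts that collect results from many runs, the output locations named in this chapter can be gathered in one place. The sketch below is a convenience helper under stated assumptions: the function name and dictionary keys are hypothetical, only paths mentioned explicitly in the text are included, and the exact form of <dictionary name> (for example with or without a file extension) is not specified here, so the names are passed through unchanged.

```python
from pathlib import PurePosixPath


def expected_outputs(root, dictionaries=()):
    """Map descriptive keys to the output locations described for a Pipedream run."""
    root = PurePosixPath(root)
    paths = {
        "summary": root / "summary.out",
        "process_dir": root / "process",
        "process_log": root / "process.out",
        "refine_dir": root / "refine",
        "refine_log": root / "refine.out",
        "report_dir": root / "report",
    }
    # One Rhofit and one post-refinement output per ligand dictionary.
    for name in dictionaries:
        paths[f"rhofit_dir:{name}"] = root / f"rhofit-{name}"
        paths[f"rhofit_log:{name}"] = root / f"rhofit-{name}.out"
        paths[f"postrefine_dir:{name}"] = root / f"postrefine-{name}"
        paths[f"postrefine_log:{name}"] = root / f"postrefine-{name}.out"
    return paths
```

Not every path will exist after every run: for instance, the process directory is only relevant when autoPROC is run, and the post-refinement output is absent if -nopostref was given.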

3. How to run Pipedream?

To invoke Pipedream, simply use the command:

% pipedream <options>

A basic invocation of Pipedream would look something like:

% pipedream -imagedir <directory> -d <output directory> -xyzin input.pdb -hklref input.mtz

3.1. Details of command-line arguments

`no argument or -help or -h`	Quick help message listing most important arguments.
`-hh`	Also list some more advanced options.
`-help process`	Quick help message listing most important autoPROC arguments.
`-help refine`	Quick help message listing most important BUSTER arguments.
`-help rhofit`	Quick help message listing most important Rhofit arguments.

3.1.1. Minimum required arguments

`-imagedir [directory name]`	Directory containing the raw images. This directory should contain the images for a single dataset only. Pipedream can cope with datasets that have been collected in multiple scans (for example high and low resolution passes or scans collected with multiple orientations on a kappa/Eulerian goniostat), provided adequate information relating these scans is provided in the image headers (and, for multi-axis goniometers, in a local configuration file for the relevant beamline).
`- or -`
`-imagescan <scan definition>`	Use this option to input a specified set of images. The scan definition is the same form as the -Id option in autoPROC, i.e. <idN>,<dirN>,<templateN>,<fromN>,<toN>. To find sets of images in a particular directory and output scans in the correct format, you can run the command find_images -l -d <dir>. Multiple scans can be input by multiple invocations of -imagescan. Please note though that multiple scans MUST be images collected at the same wavelength. Pipedream CANNOT deal with images collected at multiple wavelengths.
`- or -`
`-h5master <dir/master.h5>`	Use this option if the input data are a set of Eiger HDF5 files. Give the FULL path and name of the <template>_master.h5 file.
`- or -`
`-autoprocdir [directory or file name]`	Output directory from a previous run of autoPROC (or autoPROC summary.tar.gz file). Pipedream will read the appropriate output mtz file from the autoPROC output directory (or summary.tar.gz file) as well as reporting on the processing statistics.
`- or -`
`-hklin filename.mtz/sca/ref`	Input scaled & merged mtz/scalepack/dTREK file. Scalepack or dTREK reflection files will be automatically converted to mtz format. Pipedream CANNOT accept unscaled/unmerged data. If the input file does not contain structure factor amplitudes, truncate will be run automatically. The data will also be automatically reindexed (if required) to ensure that it is consistently indexed with the reference mtz file. A Free R flag will also be added if one is not present and the -nofreeref option is also specified. If more than one set of structure factor amplitudes are present, Pipedream will terminate rather than make an arbitrary decision on which amplitudes to use, unless a unique F/SIGF pair is specified (see below).
`-d [directory name]`	Output directory. All pipedream output will be written in a defined tree under this directory. Specifying an output directory is COMPULSORY to ensure that the output from a run of Pipedream is kept separate from any other, and enables the output to be separated from the input data, which may in any case be desirable as part of the data management policy in your research group.
`-xyzin <pdbinputs>`	Input pdb file(s). Enter as a comma separated list if more than one input structure is specified. These structures should be of the same target protein as the input data and they are ALL expected to have the same cell and space group! If more than one model is input, they must all be superimposable! IMPORTANT: These structures should be APO structures. They should NOT contain any ligands in the binding site(s) of interest (where you are looking for bound ligands)! However, they should contain any associated co-factors that are not expected to be affected by the soaking of the putative ligand.
`-hklref filename.mtz`	OPTIONAL (but strongly recommended) Reference MTZ file. This file should go together with the reference pdb file (where multiple pdb files are specified, the reference mtz file should go with the first input model). It MUST contain a set of structure factors and also the Free R set that was used in refining the input reference structure. If it does NOT contain a Free R set, Pipedream will terminate unless the -nofreeref option is also specified, in which case it will generate a new Free R set. If a reference mtz file is NOT specified, Pipedream will "back calculate" structure factors from the reference structure together with generation of a new Free R set and use these as the reference set (where multiple input models have been input, structure factors will be calculated from the first input model). In this case, as a reference Free R set is clearly not available for use, specifying the -nofreeref option is compulsory.

Note: If autoPROC is not installed, the -imagedir and -imagescan options will be disabled and only the -hklin and -autoprocdir options will be available.

The expected cell dimensions and space group will be read directly from the reference mtz file and autoPROC will ensure (where the symmetry allows the possibility of alternate indexing) that the experimental data are indexed consistently with the reference. Given that the reference pdb and mtz files are paired, this ensures that the limited molecular replacement should be successful, and has the added advantage that where pipedream is run on a series of structures, they will all end up in the same asymmetric unit and will therefore be directly superimposable. If the reference mtz file contains a Free R set, this set will be used for the processed data. Thus all data sets processed with Pipedream using the same input mtz file will have a common Free R set. As previously described, use of a common Free R set is good practice, in this context, to allow for proper cross-validation between structures.
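
As a rough illustration of how the mandatory arguments fit together, the following Python sketch assembles a Pipedream argument list and rejects combinations that the rules above exclude. The helper name `build_pipedream_command` and its parameters are hypothetical and not part of Pipedream; only the command-line flags come from this document. It handles a single data source and does not cover repeated `-imagescan` invocations.

```python
from typing import List, Optional, Sequence

# Data-source flags named in this section; one of them must be chosen.
DATA_SOURCE_FLAGS = ("-imagedir", "-imagescan", "-h5master", "-autoprocdir", "-hklin")


def build_pipedream_command(
    data_flag: str,
    data_value: str,
    outdir: str,
    xyzin: Sequence[str],
    hklref: Optional[str] = None,
    nofreeref: bool = False,
) -> List[str]:
    """Assemble an argument list for pipedream (hypothetical helper)."""
    if data_flag not in DATA_SOURCE_FLAGS:
        raise ValueError(f"unknown data source flag: {data_flag}")
    if not xyzin:
        raise ValueError("at least one input model is required (-xyzin)")
    if hklref is None and not nofreeref:
        # Without a reference mtz there is no reference Free R set to copy,
        # so the document makes -nofreeref compulsory in this case.
        raise ValueError("-nofreeref is compulsory when -hklref is not given")

    # Multiple input models are passed as one comma-separated list.
    cmd = ["pipedream", data_flag, data_value, "-d", outdir,
           "-xyzin", ",".join(xyzin)]
    if hklref is not None:
        cmd += ["-hklref", hklref]
    if nofreeref:
        cmd.append("-nofreeref")
    return cmd
```

The resulting list could be passed to `subprocess.run`; whether a given reference mtz file actually contains a Free R set still has to be checked separately.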

3.1.2. Optional arguments

Further optional arguments are grouped into options for autoPROC, BUSTER and Rhofit (see pipedream -help).

Multiple model input options:

`-seqin1 <seqin.dat>`	File containing a comma-separated list of the residues to be used for structure difference analysis (in the form `<residue name> <chain id> <residue number>`, e.g. GLY A 34,ALA A 35,THR A 96)
`- and/or -`
`-seqin2 "residue list"`	Double-quote enclosed, comma-separated list of the residues to be used for structure difference analysis (in the form `<residue name> <chain id> <residue number>`, e.g. GLY A 34,ALA A 35,THR A 96). If this is used together with the -seqin1 option, a combined list of residues listed through both options will be used.
`-rmsd <number>`	Threshold value for pairwise RMS deviation between residues to be selected for analysis (default = 1.5Å). This option can only be used if neither the -seqin1 nor the -seqin2 option is specified.
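
The residue list format used by -seqin1 and -seqin2 can be illustrated with a short Python sketch. The names `Residue`, `parse_residue_list` and `combine_residue_lists` are hypothetical, and dropping duplicates when combining the two lists is an assumption; the document only states that a combined list is used.

```python
from typing import Iterable, List, NamedTuple


class Residue(NamedTuple):
    name: str      # residue name, e.g. GLY
    chain: str     # chain id, e.g. A
    number: int    # residue number, e.g. 34


def parse_residue_list(text: str) -> List[Residue]:
    """Parse 'GLY A 34,ALA A 35' into a list of Residue entries."""
    residues = []
    for item in text.split(","):
        fields = item.split()
        if not fields:
            continue  # tolerate empty entries such as a trailing comma
        if len(fields) != 3:
            raise ValueError(
                f"expected '<residue name> <chain id> <residue number>', got {item!r}"
            )
        name, chain, number = fields
        residues.append(Residue(name, chain, int(number)))
    return residues


def combine_residue_lists(*lists: Iterable[Residue]) -> List[Residue]:
    """Merge several lists, keeping first-seen order (de-duplication is an assumption)."""
    seen = set()
    combined = []
    for residue_list in lists:
        for residue in residue_list:
            if residue not in seen:
                seen.add(residue)
                combined.append(residue)
    return combined
```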

autoPROC options (only where autoPROC is installed):

`-cell <"a b c al be ga">`	Cell dimensions. This will override the cell read from the reference mtz file. Not generally recommended.
`-mproc <macro name>`	Comma-separated list of autoPROC macros.
`-kappa <site name>`	Specify site for use of kappa/eulerian goniometer. Use without an argument to list available sites.
`-beam "x y"`	Specify direct beam position (in double quotes). Default is to use direct beam position as specified in the image header.
`-beamtransform <option>`	Double-quote enclosed x and y transformation of direct beam (from header). Possibilities are: x,y x,-y -x,y -x,-y y,x y,-x -y,x -y,-x
`-beaminit`	Test all 8 transformation possibilities of direct beam position.
`-beamrefine`	Try to determine and refine direct beam position automatically (use with caution!!).
`-apcommands "process options"`	Double-quote enclosed list of autoPROC command line options. See autoPROC documentation for further details.
`-useiso`	Use the isotropically scaled output file (truncate-unique.mtz) output by truncate in place of the anisotropically scaled output from Staraniso (staraniso_alldata-unique.mtz) in all subsequent steps.

Data acceptance criteria:

The primary goal of Pipedream is to automate processing and structure refinement for ligand detection. We consider that there are certain minimum criteria that the data need to meet to make looking for ligands, particularly small ligands, viable.

Where raw images have been input, the data are checked against these criteria and if any of these checks fail, then Pipedream will terminate cleanly. The current criteria are based on resolution, completeness and Rpim. The default values of these can be changed with the following options:

`-rmin <number>`	Minimum acceptable high resolution limit (default = 3.5Å).
`-rpim <number>`	Maximum acceptable rpim (default = 25%).
`-completeness <number>`	Minimum acceptable overall data completeness (default = 60%).
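
The three checks can be expressed as a small Python sketch. This is only an illustration of the criteria as described here, not Pipedream's own code: the names `AcceptanceCriteria` and `failed_checks` are hypothetical, and treating values exactly at a threshold as acceptable is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AcceptanceCriteria:
    rmin: float = 3.5           # -rmin: worst acceptable high resolution limit (Å)
    rpim: float = 25.0          # -rpim: maximum acceptable Rpim (%)
    completeness: float = 60.0  # -completeness: minimum overall completeness (%)


def failed_checks(
    high_res_limit: float,
    rpim: float,
    completeness: float,
    criteria: Optional[AcceptanceCriteria] = None,
) -> List[str]:
    """Return the names of the criteria that the data fail (empty if all pass)."""
    criteria = criteria or AcceptanceCriteria()
    failures = []
    if high_res_limit > criteria.rmin:
        # A numerically larger limit means lower-resolution data.
        failures.append("resolution")
    if rpim > criteria.rpim:
        failures.append("rpim")
    if completeness < criteria.completeness:
        failures.append("completeness")
    return failures
```

For instance, data extending to 3.8Å would fail with the defaults but pass with `AcceptanceCriteria(rmin=4.0)`, which corresponds to running with `-rmin 4.0`.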

Optional "Molecular Replacement" arguments:

`-chains "chain list" or "ALL"`	Double-quote enclosed, space-separated list of individual chains/multimers to move independently. For example `-chains "A B C D"` will move chains independently, whereas `-chains "AB CD"` will treat chains A & B as a single movable unit and chains C & D as another single movable unit. If the argument "ALL" is specified then all protein chains present will move independently. By default, if this option is not specified, Pipedream will treat the entire input model as a single unit. Note: all hetero-atoms/groups MUST have the same chain id as their associated protein molecule or they will be lost!
`-btf`	Use the brute force translation function. The default is the fast translation function, which is the faster protocol; however, if the input model has translational NCS, you may get better results from the brute force translation function. This option should only be used in combination with the `-chains` option.
`-mrres [<reslow>] <reshigh>`	Resolution limits for MR. Most of the time, the defaults (low res limit left unset and high res limit set to 3.0Å) are adequate and should not need to be changed. However, for very large, multimeric structures you may need to restrict the resolution range.
`-bigrotrange`	Double the angular range for the rotation function from ±5.0° (default) to ±10.0°. See Appendix A.
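
To make the grouping rules of the `-chains` option concrete, the following Python sketch turns a `-chains` argument into a list of independently moving units. The function `parse_chains` is a hypothetical helper, and it assumes single-character chain ids, since a group such as "AB" is read as chains A and B.

```python
from typing import List, Optional, Sequence


def parse_chains(arg: Optional[str], protein_chains: Sequence[str]) -> List[List[str]]:
    """Return the movable units implied by a -chains argument.

    arg is the text inside the double quotes, or None if -chains is not given.
    protein_chains lists the chain ids present in the input model.
    """
    if arg is None:
        # Default: the entire input model is treated as a single unit.
        return [list(protein_chains)]
    if arg.strip() == "ALL":
        # Every protein chain moves independently.
        return [[chain] for chain in protein_chains]
    # Space-separated groups; each character in a group is one chain id.
    return [list(group) for group in arg.split()]
```

With chains A to D present, `"A B C D"` gives four units, `"AB CD"` gives two units of two chains each, and omitting the option gives one unit containing all four chains.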

BUSTER options:

`-quick`	Select the "quick" refinement protocol. Single round of BUSTER refinement for quickest results.
`-thorough`	Select the "thorough" refinement protocol. Three rounds of BUSTER refinement.
`-mrefine <macro name>`	Comma separated list of BUSTER macros.
`-rigid <rigid.dat>`	Perform rigid body refinement using rigid body definitions as specified in the input file. Default is to define one rigid body per chain.
`-noautoncs`	Turn off autoncs (default is ON).
`-target <filename.pdb>`	Turn on target restraints. If specified without a pdb file, then the file specified by -xyzin is used.
`-sequence <TNT sequence file>`	Correct TNT format sequence file. Use of this option should only be considered where there are known issues with automatically generated sequence files that would require manual intervention. This option CANNOT be used with multiple model input.
`-l <dictionaries>`	Comma separated list of refmac-style CIF restraint dictionaries for pre-existing ligands or prosthetic groups.
`-abcommands "refine options"`	Double quote enclosed list of BUSTER command line options. See BUSTER documentation for further details.
`-fss "FP,SIGFP"`	Double quote enclosed unique F,SIGF pair. ONLY use if primary input data is an mtz file containing more than one F/SIGF pair.

Remediation (PDB_REDO) options:

`-remediate`	Run SideAide to refit side chains. This option cannot be used in conjunction with the quick refinement option.
`-sidechainrebuild`	Also allow SideAide to rebuild stubbed sidechains.
`-runpepflip`	Also run pepflip to check for and correct peptide bond flips.

Rhofit options:

`-rhofit <dictionaries>`	Run Rhofit if specified. Comma separated list of refmac-style CIF restraint dictionaries.
`-keepH`	Keep hydrogen atoms on the ligand in the fit.
`-nochirals`	Ignore CHIRAL restraints in fitting/output. Chiral centres can then invert as needed.
`-allclusters`	Fit the ligand to every potential binding site.
`-xclusters <n>`	Produce ligand fits for the <n> best possible binding sites. Default: fit to <ncs> best sites.
`-rhoquick`	Run fewer trials than usual.
`-rhothorough`	Run more trials than usual.
`-rhocommands "Rhofit options"`	Double quote enclosed list of Rhofit command line options. See Rhofit documentation for further details.
`-nocorrelsort`	Sort Rhofit output solutions by Rhofit score, rather than correlation coefficient.
`-postref`	Post-refine the top solution from Rhofit (default option if -rhofit is defined).
`-postquick`	Quick post-refinement of the top solution from Rhofit (uses ShortRunVoid macro).
`-postthorough`	Thorough post-refinement of the top solution from Rhofit.
`-nopostref`	Do not run any post-refinement but terminate after Rhofit.
`-hydrogenation_res <res>`	Where the resolution limit is better than <res>, the model will be fully hydrogenated after Rhofit (default <res>: 2.0Å)
`-hydrogenation_zeroocc`	For models with resolution better than `-hydrogenation_res`, add hydrogens at zero occupancy (default: add at full occupancy)
`-nooccref`	Do not refine ligand occupancy in post-refinement

Data Input Options:

-plugin "<identifier>"

Run defined plugin program with argument "<identifier>" to retrieve and furnish details of one or more of the required input data sources to Pipedream. Please see Appendix C for a full description of the set-up and operation of the Pipedream data plugin mechanism. Note that any of the mandatory data inputs not provided through this call must be specified individually on the command line. The argument <identifier> should be specified inside double quotes.

General options:

`-nofreeref`	Acknowledgement that the reference mtz file DOES NOT contain a Free R set and that it is OK to generate one de novo. This command is COMPULSORY if a reference mtz file is not specified, or if the reference mtz file does not include a FreeR set. This is not generally recommended.
`-keepwater`	DO NOT remove waters that are present in the input model (default is to remove them). NOTE: In order for waters (or indeed ANY HETATM’s) to be retained they MUST be assigned the same chain id as the protein chain to which they are associated.
`-nowateradd`	DO NOT add/remove waters in initial BUSTER protocols. Use with care!
`-nobr`	Do not run buster-report.
`-nthreads <integer>`	Number of processes to use (for autoPROC, Phaser and BUSTER). A negative value will use (all)/n.
	Default = use individual program defaults.
`-help process|refine|rhofit`	Print help for either autoPROC, BUSTER or Rhofit.
`-macro process|refine`	Print list of available macros for either autoPROC or BUSTER.
`-v`	Write progress of run to standard output.

Advanced options:

`-nolmr`	Skip the limited MR step and proceed directly to BUSTER refinement. This option is not generally recommended. Use with care!
`-v1`	Revert to using the version 1 pre-Rhofit refinement protocols.
`-nologlikerej`	Do not remove log-likelihood outlier reflections during the refinement protocol.

4. Location of Pipedream output.

All of the output from Pipedream will be written in a defined directory tree in the output directory specified with the -d option.

`<root>/process/`	Location of autoPROC output.
`<root>/process/truncate-unique.mtz`	Final output from autoPROC. Used as the input for subsequent processes.
`<root>/process.out`	Standard output from autoPROC.
`<root>/MR/`	Location of limited MR output.
`<root>/MR/phaser.3.pdb`	Final output from limited MR. Used as the input for subsequent processes.
`<root>/MR/phaser.1.mtz`	Mtz file containing map coefficients from Phaser.
`<root>/refine/`	Location of BUSTER output (final cycle).
`<root>/refine/refine.(pdb,mtz)`	Final output from BUSTER (final cycle). Used as input for Rhofit (if run).
`<root>/refine.out`	Standard output from BUSTER (final cycle).
`<root>/refine-L/`	Location of BUSTER output (unmodelled density elicitation cycle). Not where the -v1 option is specified.
`<root>/refine-L/refine.mtz`	Final output from BUSTER (unmodelled density elicitation cycle). Used as input for Rhofit (if run). Not where the -v1 option is specified.
`<root>/refine-L.out`	Standard output from BUSTER (unmodelled density elicitation cycle). Not where the -v1 option is specified.
`<root>/rhofit-<dictionary name>/`	Location of Rhofit output for ligand <dictionary name>.
`<root>/rhofit-<dictionary name>.out`	Standard output from Rhofit.
`<root>/postrefine-<dictionary name>/`	Location of BUSTER post-refinement output.
`<root>/postrefine-<dictionary name>.out`	Standard output from BUSTER post-refinement.
`<root>/report-<dictionary name>/`	Output from buster-report. Can be viewed with firefox <root>/report-<dictionary name>/index.html.

Where multiple models have been input, all directories and output (primarily for the limited MR and initial refinement stages) relating to the individual input models will be written into directories named <number of input pdb file>-<name of input pdb file>.
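
The naming scheme can be sketched in Python as follows. The helpers `rhofit_output_paths` and `model_subdir` are hypothetical. Taking `<dictionary name>` and `<name of input pdb file>` to be the file name without its final extension is an assumption, inferred from the sample summaries in this section (for example `rhofit-grade-LIG/` for `grade-LIG.cif` and `1-1w50` for `1w50.pdb`).

```python
from pathlib import Path
from typing import Dict


def rhofit_output_paths(root: str, dictionary: str) -> Dict[str, Path]:
    """Expected per-ligand output locations under the -d directory."""
    name = Path(dictionary).stem  # strips only the final extension (.cif)
    base = Path(root)
    return {
        "rhofit_dir": base / f"rhofit-{name}",
        "rhofit_log": base / f"rhofit-{name}.out",
        "postrefine_dir": base / f"postrefine-{name}",
        "postrefine_log": base / f"postrefine-{name}.out",
        "report_index": base / f"report-{name}" / "index.html",
    }


def model_subdir(index: int, pdb_file: str) -> str:
    """Directory name used for the index-th input model (numbered from 1)."""
    return f"{index}-{Path(pdb_file).stem}"
```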

In addition, the file <root>/summary.out contains the main summary of the results (and any warning or error messages) from each stage in the process. Note that this file is primarily intended to be read by eye. For automated data harvesting, Pipedream writes out all of the results in machine-readable XML format in <root>/summary.xml.

A typical summary.out file (for a single pdb file input) is:

 =======================================
    Processing and Refinement Summary
 =======================================

 Pipedream version: 1.0.0  <2014-05-12>

 Run by fbloggs on bijvoet at 12:14:10 on Thu Dec  4 2014
 Run from /home/fbloggs/pipedream

 Command run: pipedream  -hklin 4j0p.mtz -xyzin 1w50.pdb -hklref 1w50.mtz \
              -rhofit grade-LIG.cif -postref -d output

 All output in /home/fbloggs/pipedream/output


 ==================================================
 ************* Input data is MTZ file *************
 ==================================================


 Checking indexing consistency against reference mtz file 1w50.mtz.

 No need to reindex input data.

 Copying Freer column from the reference file 1w50.mtz to the input mtz file.
 Any pre-existing Freer set in the input file will be discarded.
 Consistently indexed mtz file with reference Freer is in consistent-input.mtz



 ==================================================
 ******************* limited MR *******************
 ==================================================

 Limited MR procedure run with 1 independently defined units.

 MR solution found with score (TFZ) 55.6

 For further details please see MR/*{rotation or translation}.out
 Output pdb file: MR/phaser.3.pdb




 ==================================================
 ****** BUSTER refinement (default protocol) ******
 ==================================================


 Initial:                R = 0.2638,     Rfree = 0.2771
 After 1st refinement:   R = 0.2371,     Rfree = 0.2528
 Final:                  R = 0.2103,     Rfree = 0.2409


 For further details please see refine.out
 Output files:
              refine/refine.pdb
              refine/refine.mtz



 ==================================================
 *********** Ligand Fitting with Rhofit ***********
 ==================================================


 ++++++++++++++++++++++++++++++++++++++++++
 | Running rhofit with ligand *grade-LIG* |
 ++++++++++++++++++++++++++++++++++++++++++

 For output and further details please see rhofit-grade-LIG/

                             rhofit           ligand LigProt  Poorly
                              total   Correl  strain contact fitting
  File               Chain    score   coeff    score   score   atoms
  ===================================================================

   Hit_00_00_000.pdb   A    -2308.1   0.9171     8.9     0.0    0/26

 BUSTER post-refinement
 ======================


 Initial:        R = 0.2075,     Rfree = 0.2262
 Final:          R = 0.1858,     Rfree = 0.2152


 For further details please see postrefine-grade-LIG.out
 Output files:
              postrefine-grade-LIG/refine.pdb
              postrefine-grade-LIG/refine.mtz

 buster-report output:
              report-grade-LIG/index.html

 =======================================



 Run took 01:04:09 h:m:s to complete
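
For a quick check of a run, the R and Rfree lines in a summary like the one above can be pulled out with a regular expression. The following Python sketch is only a convenience based on the layout of this sample; the function `extract_r_factors` is hypothetical, the layout of summary.out is not guaranteed to be stable, and summary.xml remains the intended source for automated harvesting.

```python
import re
from typing import List, Tuple

# Matches lines such as " Initial:     R = 0.2638,     Rfree = 0.2771"
R_LINE = re.compile(
    r"^\s*(?P<stage>[^:=]+):\s+R = (?P<r>\d+\.\d+),\s+Rfree = (?P<rfree>\d+\.\d+)\s*$"
)


def extract_r_factors(summary_text: str) -> List[Tuple[str, float, float]]:
    """Return (stage, R, Rfree) for every R/Rfree line, in file order."""
    results = []
    for line in summary_text.splitlines():
        match = R_LINE.match(line)
        if match:
            results.append(
                (match["stage"].strip(), float(match["r"]), float(match["rfree"]))
            )
    return results
```

Applied to the summary above, this would return the three refinement lines followed by the two post-refinement lines; distinguishing the two sections would need the section headers to be tracked as well.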

A typical summary.out file (for multiple pdb file input) is:

 =======================================
    Processing and Refinement Summary
 =======================================

 Pipedream version: 1.0.0  <2014-05-12>

 Run by fbloggs on bijvoet at 15:35:56 on Thu Jan 29 2015
 Run from /home/fbloggs/pipedream


 Command run: pipedream  -hklin 4ke1/4ke1.mtz -nofreeref -xyzin \
              1w50.pdb,4dh6.pdb,4j0p.pdb -rhofit 1R6.grade_PDB_ligand.cif \
              -rhothorough -postref -seqin1 seq.list -d multiple-seqin3

 All output in /home/fbloggs/pipedream/multiple-seqin3




 Reference structure factors (multiple-seqin3/1-1w50_nowater.mtz) have been
 back-calculated from reference model with sfall!



 ==================================================
 ************* Input data is MTZ file *************
 ==================================================


 Checking indexing consistency against reference mtz file
 multiple-seqin3/1-1w50_nowater.mtz.

 No need to reindex input data.

 Using Freer column already present in the input mtz file.



 ==================================================
 ****************** Input models ******************
 ==================================================



 You are running pipedream with 3 input pdb files.
 Limited MR and initial refinement will be run on each
 of the input models, after which the model that best
 fits the data will be chosen. Further steps will only
 be run on the selected model.

 The input models (in order) are:

 1: 1w50.pdb (located in current directory)
 2: 4dh6.pdb (located in current directory)
 3: 4j0p.pdb (located in current directory)




 ==================================================
 ******************* limited MR *******************
 ==================================================

 Limited MR procedure run with 1 independently defined units.


 1-1w50: MR solution found with score (TFZ) 52.5

        For further details please see 1-1w50/MR/*{rotation or translation}.out
        Output pdb file: 1-1w50/MR/phaser.3.pdb


 2-4dh6: MR solution found with score (TFZ) 56.7

        For further details please see 2-4dh6/MR/*{rotation or translation}.out
        Output pdb file: 2-4dh6/MR/phaser.3.pdb


 3-4j0p: MR solution found with score (TFZ) 53.3

        For further details please see 3-4j0p/MR/*{rotation or translation}.out
        Output pdb file: 3-4j0p/MR/phaser.3.pdb


 ==================================================
 ***************** Model selection ****************
 ==================================================

 For the results of initial refinement and the edstats output
 for each of the input models, please see:

 multiple-seqin3/1-1w50/refine1.out
 multiple-seqin3/1-1w50/refine1/edstats.out

 multiple-seqin3/2-4dh6/refine1.out
 multiple-seqin3/2-4dh6/refine1/edstats.out

 multiple-seqin3/3-4j0p/refine1.out
 multiple-seqin3/3-4j0p/refine1/edstats.out


 The  residues, as input, that will be used to assess
 which one of the input models gives the best fit to
 the input data are listed in the file:

  multiple-seqin3/comparison-residues.list

 NOTE: Any residues from the input list that are not
 present in one (or more) of the input models will be
 automatically assigned a Z-score of 0 for that
 particular model. Please be aware that a significant
 number of "missing" residues could potentially
 compromise the model selection process!

 The average Z-score of the real-space sample
 correlation coefficient (ZCCm) over the selected
 residues for each of the input models are:

 average ZCCm = 5.5000, for model multiple-seqin3/1-1w50/refine1/refine.pdb
 average ZCCm = 9.1000, for model multiple-seqin3/2-4dh6/refine1/refine.pdb
 average ZCCm = 5.0875, for model multiple-seqin3/3-4j0p/refine1/refine.pdb
 ****************************************************
 On the basis of having the highest mean ZCCm score,
 over the selected residue range, the model selected
 as the best match to the input experimental data is

 multiple-seqin3/2-4dh6/refine1/refine.pdb

 refined from 4dh6.pdb

 Subsequent steps will proceed using this model only!

 ****************************************************

 ==================================================
 ****** BUSTER refinement (default protocol) ******
 ==================================================

 Initial:                R = 0.2706,     Rfree = 0.2961
 After 1st refinement:   R = 0.2740,     Rfree = 0.3055
 Final:                  R = 0.2210,     Rfree = 0.2569


 For further details please see refine.out
 Output files:
              refine/refine.pdb
              refine/refine.mtz

 ==================================================
 *********** Ligand Fitting with Rhofit ***********
 ==================================================

 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 | Running rhofit with ligand *1R6.grade_PDB_ligand* |
 +++++++++++++++++++++++++++++++++++++++++++++++++++++

 For output and further details please see rhofit-1R6.grade_PDB_ligand/

                             rhofit           ligand LigProt  Poorly
                              total   Correl  strain contact fitting
  File               Chain    score   coeff    score   score   atoms
  ===================================================================

   Hit_00_00_000.pdb   A    -2260.9   0.8363    28.3     0.0    0/41

 BUSTER post-refinement
 ======================

 Initial:        R = 0.2524,     Rfree = 0.2746
 Final:          R = 0.1949,     Rfree = 0.2330

 For further details please see postrefine-1R6.grade_PDB_ligand.out
 Output files:
              postrefine-1R6.grade_PDB_ligand/refine.pdb
              postrefine-1R6.grade_PDB_ligand/refine.mtz

 buster-report output:
              report-1R6.grade_PDB_ligand/index.html

 =======================================

 Run took 01:47:02 h:m:s to complete
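
The model selection step reported in this summary can be illustrated with a short Python sketch: the mean ZCCm is taken over the selected residues, residues missing from a model count as 0, and the model with the highest mean is chosen. The function names and the per-residue scores used in the usage note are hypothetical; this is not Pipedream's implementation.

```python
from typing import Mapping, Sequence


def mean_zccm(scores: Mapping[str, float], residues: Sequence[str]) -> float:
    """Average ZCCm over the selected residues; missing residues score 0."""
    if not residues:
        raise ValueError("no residues selected for comparison")
    return sum(scores.get(residue, 0.0) for residue in residues) / len(residues)


def select_best_model(
    per_model_scores: Mapping[str, Mapping[str, float]],
    residues: Sequence[str],
) -> str:
    """Return the label of the model with the highest mean ZCCm."""
    return max(
        per_model_scores,
        key=lambda model: mean_zccm(per_model_scores[model], residues),
    )
```

With two selected residues, a model scoring 10.0 on one residue but lacking the other averages 5.0, and would lose to a model scoring 9.0 and 9.2. This is why a significant number of missing residues can compromise the selection.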

5. How to cite use of Pipedream

Sharff A, Keller P, Vonrhein C, Smart O, Womack T, Flensburg C, Paciorek W, Tickle I, Fogh R, Wojdyr M and Bricogne G (2023). Pipedream, version 1.4.1, Global Phasing Ltd, Cambridge, United Kingdom.

autoPROC:

Vonrhein C, Flensburg C, Keller P, Sharff A, Smart O, Paciorek W, Womack T and Bricogne G. "Data processing and analysis with the autoPROC toolbox". Acta Cryst. (2011). D67, 293-303.

BUSTER:

Bricogne G, Blanc E, Brandl M, Flensburg C, Keller P, Paciorek W, Roversi P, Sharff A, Smart O, Vonrhein C, Womack T. (2011). BUSTER version X.Y.Z. Global Phasing Ltd, Cambridge, United Kingdom.

Rhofit:

Womack T, Smart O, Sharff A, Flensburg C, Keller P, Paciorek W, Vonrhein C and Bricogne G. (2011). Rhofit, version X.Y.Z. Global Phasing Ltd, Cambridge, United Kingdom.

PDB-REDO:

Joosten RP, Joosten K, Cohen SX, Vriend G, and Perrakis A. (2011). Automatic rebuilding and optimization of crystallographic structures in the Protein Data Bank. Bioinformatics. 27. 3392-3398.

XDS:

Kabsch W. "XDS". Acta Cryst. (2010). D66, 125-132.

CCP4:

Collaborative Computational Project, Number 4. "The CCP4 Suite: Programs for Protein Crystallography". Acta Cryst. (1994). D50, 760-763.

6. Appendix A: Non-isomorphism and Limited MR

A certain degree of non-isomorphism is expected and allowed for in Pipedream.

Pipedream assesses non-isomorphism in terms of the relative difference in the cell parameters (cell angle changes being referred to 1 radian) between the reference structure and the experimental data, using the following formula:

The relative difference in cell parameters is defined as $|\frac{\Delta a}{a_{exp}}| + |\frac{\Delta b}{b_{exp}}| + |\frac{\Delta c}{c_{exp}}| + |\frac{\Delta \alpha}{57.296}| + |\frac{\Delta \beta}{57.296}| + |\frac{\Delta \gamma}{57.296}|$

The larger the relative cell parameter difference, the more one might expect to have to reorient the reference structure to best match the experimental data. The limited MR procedure is configured to set the maximum angular range for the rotation function to ±5.0°. This limit has been approximately matched to the amount of reorientation that might be seen with a relative cell dimension difference of up to 0.25. With a difference in relative cell dimensions > 0.25, there is a possibility that the limited MR procedure will not be able to move the input model sufficiently and thus may fail.

If the relative difference in cell parameters exceeds 0.25, Pipedream will print a warning message in the summary.out file indicating that the limited MR procedure (and thus all subsequent steps) MAY be compromised/fail due to the degree of non-isomorphism.

If this is the case, Pipedream can be re-run with the -bigrotrange flag. This will double the angular range for the rotation function to ±10.0°, allowing for more extensive reorientation. However, further failure would indicate more extensive problems that are beyond the scope of Pipedream’s limited MR approach.
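
The relative cell parameter difference defined above is straightforward to compute ahead of a run. The following Python sketch follows the formula term by term; the function names are hypothetical, and cells are assumed to be given as (a, b, c, alpha, beta, gamma) with lengths in Å and angles in degrees.

```python
from typing import Sequence

DEGREES_PER_RADIAN = 57.296  # value used in the formula above


def relative_cell_difference(
    ref_cell: Sequence[float], exp_cell: Sequence[float]
) -> float:
    """Sum of relative length differences and angle differences in radians."""
    if len(ref_cell) != 6 or len(exp_cell) != 6:
        raise ValueError("cells must be given as (a, b, c, alpha, beta, gamma)")
    # Cell lengths: difference relative to the experimental value.
    lengths = sum(
        abs((ref - exp) / exp) for ref, exp in zip(ref_cell[:3], exp_cell[:3])
    )
    # Cell angles: difference in degrees referred to 1 radian.
    angles = sum(
        abs(ref - exp) / DEGREES_PER_RADIAN
        for ref, exp in zip(ref_cell[3:], exp_cell[3:])
    )
    return lengths + angles


def limited_mr_may_fail(
    ref_cell: Sequence[float], exp_cell: Sequence[float], threshold: float = 0.25
) -> bool:
    """True when the difference exceeds the 0.25 warning threshold."""
    return relative_cell_difference(ref_cell, exp_cell) > threshold
```

For example, a 1Å change in a 50Å axis contributes 0.02 to the total, and a 5.7296° change in one angle contributes 0.1.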

7. Appendix B: Refinement Protocols

The pre-Rhofit refinement protocols are defined as follows:

7.1. Version 1

default: run 2 rounds of BUSTER refinement

`Run_1:`	`-RB 9.0,4.0 -autoncs_noprune`	Rigid Body refinement (1st big cycle) - resolution restricted to range 9.0 - 4.0Å. autoncs with no pruning.
`Run_2:`	`-M TLSbasic -autoncs LigandPresent="yes" WaterMinDistance=2.3 NoUpdateWatersCycles=2`	Turn on TLS refinement and autoncs. Turn on "ligand-chasing" algorithm to elicit unmodelled density at the end of big cycle 3.

thorough: run 3 rounds of BUSTER refinement

`Run_1:`	`-RB 9.0,4.0 -autoncs_noprune`	Same as default protocol above.
`Run_2:`	`-M TLSbasic -WAT 3 -autoncs`	Turn on TLS refinement and autoncs. Turn on water addition after big cycle 3.
`Run_3:`	`-TLS -L -autoncs`	Continue TLS refinement and autoncs. Turn on "ligand-chasing" algorithm to elicit unmodelled density.

quick: performs a single BUSTER refinement

`Run_1:`	`-RB 9.0,4.0 -autoncs LigandPresent="yes" WaterMinDistance=2.3 NoUpdateWatersCycles=1`	Rigid Body refinement (1st big cycle) - resolution restricted to range 9.0 - 4.0Å. Turn on autoncs. Turn on "ligand-chasing" algorithm to elicit unmodelled density at the end of big cycle 2.

7.2. Version 2

default: run 2 rounds of BUSTER refinement

`Run_1:`	`-RB 9.0,4.0 -autoncs_noprune -nsmall 500`	Rigid Body refinement (1st big cycle) - resolution restricted to range 9.0 - 4.0Å. autoncs with no pruning. Up to 500 small cycles per big cycle.
`Run_2:`	`-M TLSbasic -M WaterUpdate2ndShell -WAT 3 -autoncs -nsmall 500`	Turn on TLS refinement and autoncs. Add 1st and 2nd shell waters from the end of big cycle 3.

thorough: run 3 rounds of BUSTER refinement

`Run_1:`	`-RB 9.0,4.0 -autoncs_noprune -nsmall 500`	Same as default protocol above.
`Run_2:`	`-M TLSbasic -WAT 3 -autoncs -nsmall 500`	Turn on TLS refinement and autoncs. Add 1st shell waters after big cycle 3.
`Run_3:`	`-TLS -autoncs -M WaterUpdate2ndShell -nsmall 500`	Continue TLS refinement and autoncs. Add 2nd shell waters.

quick: performs a single BUSTER refinement

`Run_1:`	`-RB 9.0,4.0 -autoncs -nsmall 500 -M WaterUpdate2ndShell -WAT 3`	Rigid Body refinement (1st big cycle) - resolution restricted to range 9.0 - 4.0Å. Turn on autoncs. Add 1st and 2nd shell waters from the end of big cycle 3.

A further run of refinement is performed at the end of each of the above, as follows:

`Run_L:`	`-autoncs -nsmall 500 -TLS -L FftMapMaxHighResLimit=1.9`	Carry on TLS and autoncs refinement. Turn on "ligand-chasing" algorithm to elicit unmodelled density. Restrict calculation of map coefficients to 1.9Å.

The subsequent run of Rhofit will use the output model from the final run of BUSTER, together with the output mtz file from this "extra" BUSTER run. The rationale behind this is to separate "genuine" waters added to the model from waters added by the ligand-chasing (-L) algorithm.

8. Appendix C: Pipedream Data Input Plugin

The plugin mechanism has been implemented to allow the user to query a database(s) / other source(s) to automatically provide Pipedream with both the identity and location of any or all of the various data sources (x-ray images or pre-processed mtz file, reference mtz file, input model(s), ligand restraint dictionary/dictionaries) required.

8.1. Use of plugin

In order to access the plugin functionality, you will need to provide a script/binary, which should run as:

pluginscript "<identifier>"

where <identifier> is a string (or strings), possibly one or more unique database identifiers, that the script will interpret and act upon.

The internals of this script/binary (what it does based on the input) are entirely up to the user to decide. However, the required output from this script/binary is the identity and location of the required Pipedream input data, provided in JSON format. This must be the ONLY information written by the script/binary to standard output.

The file permissions on this script/binary must be set to ensure that it is executable.

In order to configure Pipedream to see and run this executable, the environment variable BDG_TOOL_PIPEDREAM_PLUGIN must be defined to point to the script / binary.

For example:

setenv BDG_TOOL_PIPEDREAM_PLUGIN /software/local/bin/pluginscript

or

export BDG_TOOL_PIPEDREAM_PLUGIN=/software/local/bin/pluginscript

We would strongly recommend adding these to the $BDG_home/setup_local.csh and $BDG_home/setup_local.sh files respectively.
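
Before launching a run, it can be useful to confirm that the variable is set and points to an executable file. The following Python sketch does this with standard library calls; the function `find_plugin_executable` is a hypothetical helper and not something Pipedream provides.

```python
import os
from typing import Mapping, Optional


def find_plugin_executable(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the plugin path if BDG_TOOL_PIPEDREAM_PLUGIN names an executable file."""
    env = os.environ if env is None else env
    path = env.get("BDG_TOOL_PIPEDREAM_PLUGIN")
    if not path:
        return None  # variable unset or empty
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return path
    return None  # file missing, or permissions do not allow execution
```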

Pipedream can be run to invoke this mechanism with the -plugin command line option, e.g.

pipedream -plugin "<identifier>"

Please note that the argument to the -plugin option should be specified inside double quotes.

The actual command that Pipedream will execute is

$BDG_TOOL_PIPEDREAM_PLUGIN <identifier>

The location and identity of any one or indeed all of the required data inputs to Pipedream can be provided in this manner.

Any of the required inputs that are not returned through this mechanism can (and must) be specified as usual on the Pipedream command line, for example:

pipedream -plugin 12345 -xyzin input.pdb -rhofit ligand.cif

Please note that command line options take precedence over information returned by the plugin mechanism, irrespective of order on the command line and thus can be used to override any information returned by the plugin. For instance if Pipedream is run as above and the plugin returns the input model, the use of the -xyzin flag will tell Pipedream to use input.pdb in place of any input model returned by the plugin.
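
The precedence rule amounts to a simple override: values given on the command line replace values returned by the plugin, and everything else is taken from the plugin. A minimal Python sketch of that rule follows; the function `merge_inputs` and the use of option names as dictionary keys are illustrative assumptions, not a description of Pipedream's internals.

```python
from typing import Dict, Mapping


def merge_inputs(
    plugin_inputs: Mapping[str, str], cli_inputs: Mapping[str, str]
) -> Dict[str, str]:
    """Combine inputs; command-line values override plugin values."""
    merged = dict(plugin_inputs)
    merged.update(cli_inputs)  # order on the command line does not matter
    return merged
```

If the plugin returns an input model and a reference mtz file, and the command line supplies `-xyzin input.pdb -rhofit ligand.cif`, the merged inputs use input.pdb as the model, keep the plugin's reference mtz file and add ligand.cif.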

8.2. Required output from the plugin script/binary

The required structure/format of the JSON output from the plugin script/binary is as follows:

{
 "PipedreamInput": {
  "PipedreamExperimentalData": {
    "<INPUTDATATYPE>": "<INPUTDATASPEC>"
  },
  "PipedreamModelData": {
    "PipedreamInputPDB": "/data/input/input.pdb",
    "PipedreamReferenceMTZ": "/data/input/reference.mtz",
    "PipedreamInputRestraints": "/data/input/cofactor.cif",
    "PipedreamRhofitRestraints": "/data/input/ligand.cif"
  }
 }
}

The output contains a number of defined, nested JSON objects.

The primary object should be "PipedreamInput". Nested below this, there are two secondary objects: "PipedreamExperimentalData", which contains information pertaining to the experimental data (raw X-ray images or pre-processed mtz file) required, and "PipedreamModelData", which contains information pertaining to the input model(s) and restraint dictionaries required.

If populated, the object "PipedreamExperimentalData" should contain a single name / value pair, of the form "INPUTDATATYPE": "INPUTDATASPEC", which must follow one of the following patterns:

a) "PipedreamImagedir": "<INPUTDATASPEC>"

where INPUTDATASPEC shows the full path to the directory containing the raw X-ray images (the equivalent of the -imagedir option in Pipedream). For example:

"PipedreamImagedir": "/data/input/lyso-123"

b) "PipedreamImageScan": "<INPUTDATASPEC>"

where INPUTDATASPEC shows a full autoPROC scan definition to define the location and specific image scan/ranges (the equivalent of the -imagescan option in Pipedream). For example:

"PipedreamImageScan": "lyso-123,/data/input/lyso-123,lyso-123_1_###.img,1,180"

c) "PipedreamH5Master": "<INPUTDATASPEC>"

where INPUTDATASPEC shows the full path to a master input file for Eiger H5 data (the equivalent of the -h5master option in Pipedream). For example:

"PipedreamH5Master": "/data/input/lyso-123.master"

d) "PipedreamHklin": "<INPUTDATASPEC>"

where INPUTDATASPEC shows the full path to and name of a pre-processed mtz file (the equivalent of the -hklin option in Pipedream). For example:

"PipedreamHklin": "/data/input/lyso-123.mtz"

ONLY ONE of the above name / value pairs may be defined, otherwise Pipedream will terminate.
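A plugin author may wish to check this constraint before handing output to Pipedream. The following Python sketch does so under stated assumptions: the function name experimental_data_entry is hypothetical, and rejecting unrecognised names is a choice made in this sketch rather than documented Pipedream behaviour.

```python
import json

# The four names listed in patterns a) to d) above.
DATA_KEYS = {
    "PipedreamImagedir",
    "PipedreamImageScan",
    "PipedreamH5Master",
    "PipedreamHklin",
}


def experimental_data_entry(plugin_json):
    """Return the (name, value) pair in PipedreamExperimentalData, or None.

    Raises ValueError if more than one pair, or an unrecognised name, is found.
    Hypothetical helper, not part of Pipedream.
    """
    doc = json.loads(plugin_json)
    data = doc.get("PipedreamInput", {}).get("PipedreamExperimentalData", {})
    if not data:
        return None  # object absent or not populated
    unknown = set(data) - DATA_KEYS
    if unknown:
        raise ValueError(f"unrecognised name(s): {sorted(unknown)}")
    if len(data) > 1:
        raise ValueError(f"only one pair may be defined, found {sorted(data)}")
    return next(iter(data.items()))
```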

If populated, the "PipedreamModelData" object may contain any combination of the "PipedreamInputPDB", "PipedreamReferenceMTZ", "PipedreamInputRestraints" and "PipedreamRhofitRestraints" name / value pairs. These are the equivalent of the -xyzin, -hklref, -l and -rhofit options in Pipedream.
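To make the format concrete, the following is a minimal sketch of a plugin written in Python. It is an illustration only: the SAMPLES lookup table and the build_output function are hypothetical (a real plugin would more likely query a database or LIMS), and writing the JSON to standard output is an assumption about how Pipedream collects the plugin's output.

```python
#!/usr/bin/env python3
import json
import sys

# Hypothetical lookup table keyed by the identifier given to -plugin.
SAMPLES = {
    "12345": {
        "data": {"PipedreamHklin": "/data/input/lyso-123.mtz"},
        "model": {
            "PipedreamInputPDB": "/data/input/input.pdb",
            "PipedreamRhofitRestraints": "/data/input/ligand.cif",
        },
    },
}


def build_output(identifier):
    """Build the nested PipedreamInput object for one identifier."""
    sample = SAMPLES[identifier]  # raises KeyError for an unknown identifier
    return {
        "PipedreamInput": {
            "PipedreamExperimentalData": sample["data"],
            "PipedreamModelData": sample["model"],
        }
    }


# The identifier is expected as the single command line argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumption: the JSON is read from standard output.
    print(json.dumps(build_output(sys.argv[1]), indent=1))
```

In this sketch only one experimental data entry is returned per identifier, and any model data not listed in the table would still have to be given on the Pipedream command line.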

9. Appendix D: Pipedream Data Harvesting

Pipedream has been specifically designed to be easy to run from automated pipelines. As well as automating the setup and running of Pipedream, it is also important to be able to harvest the results that it generates.

The summary.out file is intended as a human-readable summary of the outcome of a Pipedream run. Although it contains all of the significant statistics from each step of the process, indicating how the job has progressed, we would caution against using it for automated data harvesting. It has not been formatted for easy data capture and, although we try to keep its format reasonably stable, we cannot guarantee that updates and improvements to Pipedream will not require changes to summary.out that break data harvesting scripts.

For this reason, Pipedream also writes all of the significant data to a far more robust, machine-readable XML file, summary.xml, which has been specifically designed for automated data harvesting. Although future updates may add information to summary.xml, they should have no impact on scripts written to harvest data from previous revisions. We would therefore strongly recommend using summary.xml for data harvesting purposes.
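As a starting point for a harvesting script, the following Python sketch walks an XML file and lists every element that carries text, together with its path from the root. The element names inside summary.xml are not described in this appendix, so the sketch deliberately assumes none of them. The function names flatten and harvest are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten(element, prefix=""):
    """Yield (path, text) for every element that carries non-blank text."""
    path = f"{prefix}/{element.tag}"
    text = (element.text or "").strip()
    if text:
        yield path, text
    for child in element:
        yield from flatten(child, path)


def harvest(xml_path):
    """Return a list of (path, text) pairs found in the given XML file."""
    root = ET.parse(xml_path).getroot()
    return list(flatten(root))
```

A list of pairs is returned rather than a dictionary so that repeated elements are not silently overwritten. A harvesting script would typically call harvest on the path to a summary.xml file and then select the paths of interest.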

10. Appendix E: Revision History

1.4.0

Released May 2023
Add "ALL" argument to -chains command
Entire model will be hydrogenated after Rhofit if resolution is better than 2.0Å
Add -hydrogenation_res option to control the resolution cutoff for full hydrogenation
Add -hydrogenation_zeroocc to add hydrogens at zero occupancy (rather than full) if the whole model is hydrogenated.

1.3.1

Released July 2021
autoPROC summary.tar.gz accepted as valid input to -autoprocdir option
more robust treatment of multi-sweep datasets with small differences in wavelength between sweeps
updated JSON output

1.3.0

Released February 2021
Include Perl JSON module PP in distribution
Remove Perl JSON module Tiny from distribution
Included new version 2 refinement protocols
enhanced reporting of waters and ligand sites in summary.out

1.2.5

Released September 2020
Included Perl XML module TreePP in distribution
TreePP is written by Yusuke Kawasaki (http://www.kawa.net/)
TreePP Repository = https://github.com/kawanet/XML-TreePP

1.2.1

Released 14th May 2018
Introduction of plugin mechanism for data input

1.2.0

Released 27th November 2017
First incorporation of use of PDB_REDO programs

1.1.2

Released 8th May 2017
More comprehensive Staraniso output use
ability to input Eiger .h5 files
Multiple minor fixes / improvements

1.1.1

Released 24th February 2016
First release to allow use of Staraniso output data

1.1.0

Released in snapshot 16th March 2015
First release of multiple model input functionality
Resolution range for rigid body refinement limited to 4.0Å

1.0.0

Initial general release of Pipedream (released 4th April 2014)

0.1.4

Released 17th November 2012.
adaptation to allow use with "large" structures
added phaser "refine" step to limited MR.
added checkdeps functionality.

0.1.3

Released 31st October 2012.
added imagescan option for autoPROC.
added option to input scalepack or d*TREK reflection files.
reference mtz file now optional.
integrated buster-report into Pipedream
included stand-alone limited MR script, lmr.

0.1.2

Released 23rd October 2011.
Limited MR modified to allow individual chains/groups of chains to be moved independently.
Added option of "Brute Force" Translation function.
make use of openmp in phaser.
added short post-refinement step on "top" solution from Rhofit.

0.1.1

Initial consortium release of Pipedream (released 9th August 2011)