autoSHARP User Manual
Chapter 2

Interpreting the autoSHARP output

Copyright    © 2001-2006 by Global Phasing Limited
 
  All rights reserved.
 
  This software is proprietary to and embodies the confidential technology of Global Phasing Limited (GPhL). Possession, use, duplication or dissemination of the software is authorised only pursuant to a valid written licence from GPhL.
 
Documentation    (2001-2006)  Clemens Vonrhein
 
Contact sharp-develop@GlobalPhasing.com


This chapter gives some additional information about the output autoSHARP presents. Remember that you might sometimes need to press the "Reload" button (or "Shift+Reload", "ALT-R", ...) to get a fresh version of some of the HTML documents while autoSHARP is still running.


Introduction

All information about the various steps autoSHARP performs is available from the main log file: sharpfiles/logfiles/<project>.<id>/LISTautoSHARP.html. Each step provides a details link to another LISTautoSHARP.html file. From there, the details link will give you access to the full log file of each program that was run.

Whenever there is an explanations link it will lead you directly to the relevant section of the manual.

Remember that this log file is generated while the job is running - so you need to reload the document from time to time (using the reload button of your browser).

Depending on the Speed/Accuracy setting, some of the following steps might be skipped or significantly shorter.


Checking supplied information

This step does little more than check the information you supplied through the "autoSHARP Control Panel" for syntactic correctness. If anything goes wrong here (and the program stops with an error message), you can probably correct it within the interface based on that error message.

General

In this section, mainly the information about sequence, number of residues and molecular weight is checked. These values are all assumed to refer to the asymmetric unit only. Make sure that all information presented is consistent; the easiest way to ensure this is to supply a sequence file.

autoSHARP tries to figure out how many molecules you might have in the asymmetric unit. At this stage this is best done from the sequence file: if the sequence file contains several copies of the same amino acid sequence, autoSHARP will automatically determine how many molecules there are. From that, the number of residues and the molecular weight of each monomer are determined.

If only a single copy of the sequence is given, autoSHARP will try to determine the most probable number of molecules in the asymmetric unit by calculating the theoretical solvent content for each possibility. The number of molecules that gives the solvent content with the highest probability is used. If no information whatsoever about the content of the asymmetric unit is given, a single protein molecule with a solvent content of 0.5 is assumed.
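
As a purely illustrative sketch of this kind of calculation (written in Python; not autoSHARP's actual code), the most probable copy number can be estimated from the Matthews volume V_M = V_cell / (Z * n_mol * MW). The cell volume, space group multiplicity and monomer weight below are hypothetical example values, and picking the copy number whose solvent content is closest to 50% is only a crude stand-in for the proper probability weighting:

    # Illustrative sketch only: estimate the most probable number of molecules
    # per asymmetric unit from the Matthews volume and the resulting solvent content.
    def solvent_fraction(cell_volume, n_sym, n_mol, mol_weight):
        """Solvent fraction from the Matthews coefficient V_M = V / (Z * n * MW)."""
        v_m = cell_volume / (n_sym * n_mol * mol_weight)    # A^3 per Dalton
        return 1.0 - 1.23 / v_m                             # 1.23 derived from the protein partial specific volume

    def most_probable_n_mol(cell_volume, n_sym, mol_weight, max_n=8):
        """Pick the copy number whose solvent content is closest to ~50%
        (a crude stand-in for the full probability distribution)."""
        candidates = []
        for n in range(1, max_n + 1):
            s = solvent_fraction(cell_volume, n_sym, n, mol_weight)
            if 0.2 < s < 0.8:                               # keep physically plausible values only
                candidates.append((abs(s - 0.5), n, s))
        return min(candidates)[1:] if candidates else (1, 0.5)

    # Example: 100x100x100 A cell, space group with 4 symmetry operators, 25 kDa monomer
    print(most_probable_n_mol(1.0e6, 4, 25000.0))           # -> (4, ~0.51)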

A simple check on the CCP4 version is done (at the time of writing, the latest version 6.0.1 is recommended).

Each dataset

For each dataset, the information about wavelength, f' and f'', a file of known heavy atom positions, and the number and type of atoms to use is checked. If no f' and/or f'' values are supplied they will be calculated - something that should be avoided (or at least thoroughly checked by you) if the data were collected close to the absorption edge of the heavy atom!

Summary

A short summary is given to confirm that everything is properly defined (or autoSHARP was able to extract the necessary information). A list of used files is given. The program will stop on errors.


Converting files

If you started from reflexion files that aren't already in MTZ format, autoSHARP will try to convert these. As with all file format conversions there is an unlimited number of things that can go wrong (although autoSHARP tries to avoid most of them). Make sure to check things like resolution limits, cell dimensions, space group and number of reflections to see if these values are the ones you expect.

Anomalous differences are analysed to detect suspiciously large values (which happen when the scaling/merging program failed to reject one member of an anomalous pair).


Extracting additional information

Based on the data that autoSHARP now has, various kinds of information can be extracted. These include:

  1. resolution limits

    Especially when files were converted from a non-MTZ format it is a good idea to check that the limits shown are the ones you expect. Problems in file conversion quite often show up here. It is also a good idea to give autoSHARP data only up to a reasonable resolution: the usual I/sig(I) and completeness criteria should be used.

  2. Guessing molecular weight

    If neither molecular weight nor number of residues nor a sequence file is given, a solvent content of 50% and standard protein density is assumed. This will give a very rough estimate of molecular weight (and number of residues) in the asymmetric unit.

  3. Wilson plot based overall B-factor

    If data to better than 3 Å are available, a Wilson plot is used to estimate the scale and overall temperature factor (a minimal sketch of this estimate follows this list). Especially at lower resolution this estimate can be quite a bit off.

    The overall temperature factor is used during heavy atom refinement as the starting point for the heavy atom B-factor. If it cannot be calculated here (because the resolution is too low) it defaults to 30.0 Å².

  4. theoretical solvent content

    Using the number of residues in the asymmetric unit (with an average number of atoms per residue and some ordered water molecules), a theoretical solvent content can be calculated. This will differ slightly from the solvent content calculated from the Matthews coefficient.

  5. statistics on F and F/sigma(F)

    Some information about the average values of the amplitudes and their standard deviations is presented. If the average F/sd(F) is rather low or very high a warning is printed (this could be due to problems with the proper estimation of standard deviations, e.g. in SCALA or SCALEPACK; but if your crystals diffracted much better than the resolution limit used, a high <F/sd(F)> should not be surprising).

  6. Consistency

    If data from several files are used, the space groups and cell parameters are checked for consistency. Cell parameters are considered identical if the differences are below 1%. In MAD cases the cell of the first wavelength is used; in SIR(AS)/MIR(AS) it is the cell of the native dataset.

    See also the discussion of cell parameters in the SHARP manual.

  7. Matthews coefficient

    Using all available information, a Matthews coefficient is calculated for a range of possible numbers of molecules. The solvent content based on these values will differ slightly from the one above (which takes ordered waters into account).

  8. SAD/MAD-specific analysis

    If we're doing a MAD/SAD run with Se or S heavy atoms and a sequence is given, we can check the number of sites you specified against this sequence and the most likely number of monomers in the asymmetric unit (determined using Matthews coefficient). This is a very valuable check on consistency in these special cases. However, e.g. in Se-Met MAD experiments an N-terminal Met residue is quite often very disordered. So don't be surprised if not all possible Se-Met atoms are found. Furthermore, in S-SAD phasing it is possible that the crystallisation buffer contains sulfates - and some of them might be bound to the protein and picked up during heavy atom detection.

  9. Overall values

    Minimum, maximum and overall resolution limits for all datasets are calculated. Additionally, the common resolution range between the various datasets is shown.
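
To make the Wilson plot estimate from item 3 above a little more concrete, here is a minimal, assumed sketch (in Python) of how an overall B-factor can be obtained from binned mean intensities. It treats the sum of scattering factors as a constant and is not the procedure autoSHARP actually runs:

    # Assumed sketch: fit ln(<I>) against (sin(theta)/lambda)^2; the slope is -2B.
    import numpy as np

    def wilson_b(intensities, s2, n_bins=20):
        """intensities: observed I per reflection; s2: (sin(theta)/lambda)^2 per reflection."""
        edges = np.linspace(s2.min(), s2.max(), n_bins + 1)
        x, y = [], []
        for lo, hi in zip(edges[:-1], edges[1:]):
            sel = (s2 >= lo) & (s2 < hi)
            if sel.sum() > 10:                              # skip poorly populated bins
                x.append(0.5 * (lo + hi))
                y.append(np.log(intensities[sel].mean()))
        slope, intercept = np.polyfit(x, y, 1)
        return -0.5 * slope, np.exp(intercept)              # overall B (in A^2) and scale

    # usage with hypothetical arrays:  b_overall, scale = wilson_b(i_obs, s2_obs)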

Summary

A short summary is given to confirm that everything is properly defined (or autoSHARP was able to extract the necessary information). The program will stop on errors.


Collecting and analysing all data

All necessary data from the various data files is collected into a single MTZ file. Anomalous differences (if present) are analysed.

Analysing anomalous differences

Newer versions of MTZ files should have the appropriate "missing number flags" set for unmeasured data. To make sure that the resulting MTZ file conforms to this standard, checks on the standard deviation(s) for the amplitude (SMID) and (if present) the anomalous differences (SANO) are done. Furthermore, a simple analysis of the absolute values of the anomalous differences is performed: it is very unlikely that an anomalous difference should be nearly twice as large as the mean amplitude for that reflection. Earlier versions of SCALEPACK seem to have had problems in cases where one of the Friedel pair wasn't collected properly (and should have been rejected during merging but wasn't). All suspiciously large anomalous differences are flagged and highly suspicious ones are rejected.
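
A hedged sketch (in Python) of the kind of sanity check described above; the threshold used here is illustrative only and is not the criterion autoSHARP applies:

    import numpy as np

    def flag_suspicious_anomalous(f_plus, f_minus, ratio_cut=1.8):
        """Flag reflections whose anomalous difference approaches twice the mean amplitude.
        ratio_cut = 1.8 is an illustrative value only."""
        dano = np.abs(f_plus - f_minus)                     # anomalous difference
        fmean = 0.5 * (f_plus + f_minus)                    # mean amplitude of the Friedel pair
        return dano > ratio_cut * fmean                     # boolean mask of suspicious reflections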

Collecting data

The relevant columns for a given dataset are extracted, renamed and appended to the main MTZ file for this project. Please note that each autoSHARP run will produce a new MTZ file in your sharpfiles/datafiles directory. To avoid overwriting older MTZ files, a prefix might be added (a running index, starting at 1).


Adding test set column

A FreeR flag column is added to the MTZ file. The fraction to use is calculated so that between 5 and 10% of the reflections constitute a test set (with ideally 1000 reflections per set).
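
As an illustration of this rule (the exact logic inside autoSHARP may differ), the fraction can be thought of as 1000 reflections divided by the total number of reflections, clamped to the 5-10% range:

    # Minimal sketch, assuming the 5-10% window and the ~1000-reflection target quoted above.
    def free_r_fraction(n_reflections, target=1000, lo=0.05, hi=0.10):
        """Fraction giving roughly `target` test reflections, clamped to [lo, hi]."""
        return min(hi, max(lo, target / n_reflections))

    # e.g. 50000 reflections -> 0.05 (5%); 15000 -> ~0.067; 8000 -> 0.10 (10%)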

Extend test set(s)

After extending the test set(s), the completeness for each data set (amplitude and - if present - anomalous differences) is presented. These are calculated only in the resolution range available for a specific data set. You might want to check this against your statistics from data processing to make sure nothing got lost during file conversions etc.

autoSHARP will give you a warning if the completeness is exceptionally low. You might want to reconsider whether you really want to include such data at this stage of your structure determination. Obviously you want to collect a complete dataset. But even when restrictions (beam time, cell parameters, space group, crystal decay, ...) don't allow a highly complete and redundant dataset, it should (in general) be possible to collect a reasonably complete dataset. Here, autoSHARP considers completeness below ~ 70% to be suspicious.


Scaling/analysis of merged data

All datasets are scaled relative to each other (if requested) and several quantities are used for analysis. Analysis is always done using SCALEIT. The second run will always exclude large outliers.

Checking for outliers

As an additional test to find problematic reflections before actually trying to scale or analyse the data, we look at the normalised structure factors (E values) for a given reflection. If we have several datasets we can assume that the E values shouldn't differ too much.

Furthermore, the R-factors at low resolution between the various datasets are analysed to spot potential problems in the low-resolution range.
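
A minimal sketch (in Python) of how such an E-value consistency check could look, assuming a simple spread criterion; the actual test autoSHARP applies may differ:

    import numpy as np

    def flag_e_outliers(e_values, max_spread=3.0):
        """e_values: array of shape (n_reflections, n_datasets) holding |E| per dataset.
        Flags reflections whose |E| values differ by more than max_spread (illustrative cut-off)."""
        spread = e_values.max(axis=1) - e_values.min(axis=1)
        return spread > max_spread                          # boolean mask of problematic reflections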

Scaling against itself

If the first dataset (i.e. the native in SIR(AS)/MIR(AS) or the first wavelength in SAD/MAD) contains anomalous data, it is "scaled" against itself. This is done only to get a reliable outlier rejection criterion for the anomalous differences.

Scaling against a reference

All datasets are scaled/analysed against all the others. Any results from scaling against a non-reference dataset are used only for analysis, i.e. the compilation of the various cross-tables at the end. Only the scaling results against the first dataset (native or first wavelength) are actually used.

If you specified that all datasets are already scaled (on the first page of the "autoSHARP Control Panel"), you should expect scale factors very close to 1.0. If these differ considerably, you should check your scaling procedure and consult the details for more information.

The gradient information given is for a normal probability analysis (Howell & Smith, 1992). Several different R-factors are presented. For a discussion of these values see the SCALEIT documentation.
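
For readers unfamiliar with this analysis, the following assumed sketch (in Python) shows how such a gradient can be obtained: the sigma-normalised differences between two datasets are sorted and fitted against the expected normal order statistics, and a gradient near 1 means the differences are explained by the estimated errors alone. This is a generic illustration, not the SCALEIT implementation:

    import numpy as np
    from scipy.stats import norm

    def normal_probability_gradient(f1, sig1, f2, sig2):
        """Gradient of a normal probability plot of (F1 - F2) / sqrt(sig1^2 + sig2^2)."""
        delta = np.sort((f1 - f2) / np.sqrt(sig1 ** 2 + sig2 ** 2))
        n = delta.size
        expected = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # expected normal order statistics
        slope, _ = np.polyfit(expected, delta, 1)
        return slope                                           # ~1 if errors alone explain the differences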

FHSCAL run

If scaling was requested for SIR(AS) or MIR(AS) data, the program FHSCAL is used for scaling. This should take a certain degree of heavy-atom substitution into account. Analysis is still done using SCALEIT.

Calculate correlation coefficients on anomalous differences

In the case of a MAD experiment, the correlation coefficient between the anomalous differences at different wavelengths can be used to assess the overall quality of the data. Furthermore, it might help to decide on a sensible resolution cut-off for heavy atom detection. (This analysis is based on ideas from the XPREP program by G. Sheldrick.)
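
An assumed sketch (in Python) of such a correlation analysis, computed in resolution bins; the bin layout and the rule of thumb quoted in the final comment are illustrative choices, not autoSHARP's exact criteria:

    import numpy as np

    def anomalous_cc_by_bin(dano1, dano2, d_spacing, n_bins=10):
        """Correlation between the anomalous differences of two wavelengths in
        equally populated resolution bins (d_spacing in Angstrom)."""
        edges = np.percentile(d_spacing, np.linspace(0, 100, n_bins + 1))
        result = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            sel = (d_spacing >= lo) & (d_spacing <= hi)
            if sel.sum() > 20:
                result.append((lo, hi, np.corrcoef(dano1[sel], dano2[sel])[0, 1]))
        return result   # a common rule of thumb keeps data where the CC stays above ~0.25-0.30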

Statistics between all datasets

All relevant R-factors for each pair of datasets are calculated in the common resolution range and presented as a table. If you use several derivatives in MIR(AS), the cross-table of R-factors can give you some information about clusters of isomorphous datasets. It can also help to establish which datasets scale well together and which might be better left out at the beginning of a structure solution.

In the case of a MAD experiment, the correlations between the anomalous differences at the various wavelengths are used to judge quality and to choose a reasonable high-resolution cut-off for the heavy atom detection algorithm.


Additional analysis (NCS, sequence ...)

Since even more information (about data quality and the possible number of molecules) is now available, the self-rotation and native Patterson functions are calculated (and analysed) to get some initial indications about the most likely number of monomers in the asymmetric unit.


Get optimised values of anomalous scattering (RANTAN only)

For heavy-atom detection in a MAD experiment using the program RANTAN, an FM (or FA) value is calculated. For this to work well, the given f'/f'' values should be as precise as possible! All possible outliers are rejected before these values are calculated.


Calculate E values (RANTAN only)

For heavy-atom detection using the program RANTAN, all available sources of substructure information (isomorphous/dispersive and anomalous differences) are used to calculate E values (normalised structure factors). The test and working sets are extracted into separate files. These will be used to find the heavy atom substructure.


Finding sites with RANTAN

The direct methods program RANTAN is used to detect the heavy atom substructure. For that it uses all available sources of phase information (isomorphous/dispersive and anomalous differences) as well as the FA values in the case of MAD. For each of these values it generates several phase sets that are subsequently analysed to find the best set of initial heavy atoms.

General

For the specific space group, some information about indeterminate axes and possible origin translations is given. Especially in MIR(AS) experiments this means that possible solutions for the various derivatives might not be on the same origin.

Processing each dataset

For each source of phase information in each data set, the Harker section(s) of a Patterson map are printed. Then the corresponding E values (normalised structure factors) are used to generate 3 phase sets. Each phase set represents a possible substructure solution and is used to extract and analyse this possible solution.

Phase sets

The direct methods program RANTAN (Yao, 1981) uses normalised structure factors to generate several phase sets with possible solutions to the substructure. It seems to work well with up to 20 atoms in the substructure. However, in some space groups (low-symmetry ones such as monoclinic, for example) it seems to have problems. One thing to look at in the details is the number of reflections used for calculating PSI(zero): if only very few are used, the resulting statistics are unreliable and the phase sets picked might not be the best ones.

Weeding

Because of the problems described above, some "weeding" has to be done to get rid of sites that show one or more of the following characteristics:

Hopefully, the highest sites found directly from the phase set will remain in the list. However, if many of the highest sites are removed, this solution is probably not very reliable.

Statistics

Based on the (remaining) list of possible heavy atom sites, increasing subsets of these positions are used to calculate correlation coefficients between observed and calculated E values (normalised structure factors). Starting with all positions with peak heights above 6 sigma, the list of positions is increased by reducing this sigma cut-off in steps of one, down to a final value of 3 sigma. At each step, several criteria are used to determine whether the inclusion of additional positions is likely to increase the correlation coefficients.
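
The following sketch (in Python) illustrates the principle; it ignores space group symmetry and atomic B-factors and is not the scoring actually used here:

    import numpy as np

    def e_calc(hkl, sites, occupancies):
        """|E| calculated from fractional sites: |sum_j occ_j * exp(2*pi*i*h.x_j)|, normalised."""
        phases = 2.0 * np.pi * hkl @ sites.T                  # shape (n_reflections, n_sites)
        f = (occupancies * np.exp(1j * phases)).sum(axis=1)
        e = np.abs(f)
        return e / np.sqrt((e ** 2).mean())                   # normalise so that <|E|^2> = 1

    def cc_vs_subset(e_obs, hkl, sites, occupancies):
        """Correlation with observed |E| as sites are added one at a time (strongest first)."""
        order = np.argsort(occupancies)[::-1]
        ccs = []
        for n in range(1, len(order) + 1):
            sel = order[:n]
            ccs.append(np.corrcoef(e_obs, e_calc(hkl, sites[sel], occupancies[sel]))[0, 1])
        return ccs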

As a rule of thumb:

Summary

In MIR(AS) experiments, the best possible substructure solutions are presented for each derivative. In SIR(AS), SAD or MAD experiments, the overall best solution is shown.


Finding sites with SHELXC/SHELXD

Sorting available datasets

The available datasets are sorted so that the scenario most likely to give a correct set of sites is used first. E.g. in a 3-wavelength MAD experiment this results in the order:

  1. MAD with all wavelengths - should have cleanest heavy-atom substructure signal
  2. SAD (peak wavelength) - since it has largest anomalous signal
  3. MAD with first two wavelengths - in case of radiation damage

Trying all (sorted) possibilities

Since SHELXC requires SCALEPACK formatted files, it might be necessary to convert the MTZ files into a "pseudo-SCALEPACK" format. If the initial data was already in the correct format, this is obviously not necessary.

For each of the scenarios determined above, SHELXC is run to create a reflection input file for SHELXD. The data analysis from this program is also used to determine an adequate resolution cut-off (to use only data with significant signal).

For the SHELXD run, the resolution cut-offs as well as the limits on E values are adjusted to get a large enough number of reflections (while at the same time avoiding too many reflections, which would slow things down). Therefore, SHELXD might be stopped and restarted with slightly different parameters.

Whenever a solution with a higher CC(all) is found, some information and plots are presented. If a good solution has been found (depending on the Speed/Accuracy setting you used), SHELXD is stopped and the heavy atom solution is analysed. For heavy-atom soaks, there might not be a clear "jump" in occupancy between the last correct and the first wrong solution.

Analysis of results

The solution finally used is reported. This is the set of sites that will go into the initial SHARP refinement.


Create a SIN file

Based on the initial set of heavy atom positions an input file for SHARP is created. This will have sensible defaults for refinement strategy, heavy atom occupancy and temperature factor. You can use this SIN file outside of autoSHARP by loading it into the "SHARP Input Editor" to modify it and run your own SHARP. This is a way to quickly get started if you're not sure about the hierarchy used in SHARP.


Run first round of SHARP

The first run of SHARP will refine the initial set of heavy atom positions and calculate residual maps. It is run in a separate directory from the autoSHARP job.


Automatic interpretation of residual maps

The residual maps of the first SHARP run are analysed to see if these initial sites survived refinement and if additional sites are present in the residual maps. The decisions for adding additional sites are made quite conservatively - so it is always a good idea to check the various residual maps by hand.

Checking existing sites

The set of initial sites is checked to make sure that, for each dataset, these sites are present in the corresponding model map. If for any dataset this heavy atom model map doesn't show enough density for a given site, that site is excluded. Dispersive differences in MAD are ignored and only the more reliable anomalous differences are used.

Analysis of residual map for wrong f'/f''

If a significant number of sites have residual density (in the anomalous residual map for that wavelength) exactly at the heavy atom position, the refinement of f' and/or f'' is switched on.

Checking residual maps for new sites

All (reliable) residual maps are searched for possible new sites. Only sites present in all of these residual maps at the same time are considered. The current list of heavy atom sites is updated for each dataset, up to the user-specified total number of heavy atoms expected (see the input preparation manual).


Run final round of SHARP

If new sites were added in the previous step, another round of refinement is done (in a new sub-directory). Otherwise the first SHARP run is used. In either case, this leads to the first set of phases (original hand).

Since the handedness is not yet determined, another (phasing only) run of SHARP will produce a second set of phases (inverted hand). This is again done in a new sub-directory of your sharpfiles/logfiles directory.

This way you will have either 2 (no change after automatic interpretation of the residual maps) or at least 3 (addition/deletion of sites during automatic interpretation of the residual maps) different SHARP runs, with the last two producing phases in the original and inverted hands.


Running density modification

To find out which of the two phase sets has the correct hand, a single cycle of solvent flipping is performed. The one with the higher score is (usually) the correct hand. However, if your experimental phases are very poor (or wrong), the relative difference between these two scores might not be above the noise level and the wrong hand might be picked after all.

Automatic density modification

Once a decision about the correct hand has been made, these phases are used in an iterative density modification procedure in which the solvent content is varied to optimise the overall score. This usually leads to a good initial map. However, once you are sure that these phases are correct, it is a good idea to fine-tune the parameters to get the best map possible for building your initial model.

At the end a button will give access to the "Phase Improvement and Interpretation Control Panel" where you can view the solvent flattened map.


Running automatic building

If the resolution of your data is sufficiently high and the density modification gave a good enough map (judged on simple statistics), automated model building with ARP/wARP is performed. If you supplied a sequence file, side chain docking is attempted. This can easily lead to a nearly complete model. However, you might want to improve on that by altering the parameters used by the ARP/wARP protocol.

At the end, a button will give access to the "Phase Improvement and Interpretation Control Panel" where you can view the solvent flattened map as well as the final ARP/wARP map and model.


Looking at results

Depending on the interface you are using to run autoSHARP (our default interface, the CCP4i interface we provide, or the command line), results will be presented in a variety of ways: