autoSHARP User Manual |
Chapter 2 |
Copyright | © 2001-2006 by Global Phasing Limited |
All rights reserved. | |
This software is proprietary to and embodies the confidential technology of Global Phasing Limited (GPhL). Possession, use, duplication or dissemination of the software is authorised only pursuant to a valid written licence from GPhL. | |
Documentation | (2001-2006) Clemens Vonrhein |
Contact | sharp-develop@GlobalPhasing.com |
This manual will give some additional information about the output autoSHARP is presenting. Remember that you might need to press the "Reload" button (or "Shift+Reload" or "ALT-R" or ...) sometimes to get a fresh version of some of the HTML documents (while autoSHARP is still running).
Whenever there is an explanations link it will lead you directly to the relevant section of the manual.
Remember that this log file is generated while the job is running - so you need to reload the document from time to time (using the reload button of your browser).
Depending on the Speed/Accuracy setting, some of the following steps might be skipped or significantly shorter.
This section mainly checks the information about sequence, number of residues and molecular weight. These are all assumed to be for the asymmetric unit only. Make sure that all information presented is consistent. The easiest approach is to supply a sequence file.
autoSHARP tries to figure out how many molecules within the asymmetric unit you might have. At this stage this can best be done from the sequence file. So if this sequence file contains several copies of the same amino acid sequence, autoSHARP will automatically determine how many molecules there are. From that, the number of residues and molecular weight of each monomer is determined.
If only a single copy of the sequence is given, autoSHARP will try to get the most probable number of molecules within the asymmetric unit by calculating the theoretical solvent content for each possibility. The number of molecules that gives the solvent content with the highest probability is used. If no information whatsoever about the content of the asymmetric unit is given, a single protein molecule based on a solvent content of 0.5 is assumed.
A simple check on the CCP4 version is done (at the time of writing, the latest version 6.0.1 is recommended).
For each dataset, the information about wavelength, f' and f'', a file of known heavy atom positions, and the number and type of atoms to use is checked. If no f' and/or f'' values are supplied they will be calculated - which should be avoided (or at least thoroughly checked by you) if the data were collected close to the absorption edge of the heavy atom!
A short summary is given to confirm that everything is properly defined (or autoSHARP was able to extract the necessary information). A list of used files is given. The program will stop on errors.
Anomalous differences are analysed to detect suspiciously large values (which happen when the scaling/merging program failed to reject one member of an anomalous pair).
Especially when files were converted from non-MTZ format it is a good idea to check if the limits shown are the ones you expected. Problems in file conversion quite often show up here. It is also a good idea to only give data up to a reasonable resolution to autoSHARP: the usual I/sig(I) and completeness criteria should be used.
If neither molecular weight nor number of residues nor a sequence file is given, a solvent content of 50% and standard protein density is assumed. This will give a very rough estimate of molecular weight (and number of residues) in the asymmetric unit.
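As a rough illustration of this fallback estimate, the following Python sketch derives a molecular weight and residue count from the cell volume. It is not autoSHARP's own code: the function names, the protein specific volume of 1.23 Å³/Da (the value conventionally used together with the Matthews coefficient) and the average residue mass of 110 Da are assumptions made for this example.

```python
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume in Å^3 from cell lengths (Å) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)

def rough_asu_content(volume, n_symops, solvent=0.5,
                      protein_volume_per_da=1.23, residue_mass=110.0):
    """Very rough molecular weight (Da) and residue count per asymmetric unit.

    solvent=0.5 is the default quoted in the text; protein_volume_per_da and
    residue_mass are assumed typical values, not autoSHARP settings.
    """
    asu_volume = volume / n_symops
    mw = asu_volume * (1.0 - solvent) / protein_volume_per_da
    return mw, round(mw / residue_mass)

# Example: orthorhombic cell with 4 symmetry operators
volume = cell_volume(50.0, 60.0, 70.0, 90.0, 90.0, 90.0)
mw, n_res = rough_asu_content(volume, n_symops=4)
```

The result should be treated as an order-of-magnitude guide only, as the text above says.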
If data to better than 3 Å is available, a Wilson plot is used to estimate scale and overall temperature factor. Especially at lower resolutions this can be quite a bit off.
The overall temperature factor is used during heavy atom refinement as a starting point for the heavy atom B-factor. If it is not possible to calculate it here (due to too low resolution) it defaults to 30.0 Å².
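The idea of a Wilson plot can be sketched in Python as a straight-line fit of ln(<I>/Σf²) against (sin θ/λ)² = 1/(4d²), whose slope is -2B. This is a minimal sketch and not the procedure autoSHARP actually runs: the function names, the input layout and the scale convention (<I> = scale · Σf² · exp(-2B/(4d²))) are assumptions. Only the 3 Å requirement and the 30.0 Å² default are taken from the text.

```python
import math

def wilson_fit(shells):
    """Least-squares line through ln(<I>/sum_f2) versus s2 = 1/(4 d^2).

    shells: list of (d_spacing, mean_intensity, sum_f2) per resolution shell,
    where sum_f2 is the summed squared scattering factors of the expected
    content. Returns (scale, b_factor).
    """
    xs = [1.0 / (4.0 * d * d) for d, _, _ in shells]
    ys = [math.log(mean_i / sum_f2) for _, mean_i, sum_f2 in shells]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return math.exp(intercept), -slope / 2.0

def starting_b_factor(shells, default_b=30.0, resolution_needed=3.0):
    """Wilson B if data extend beyond resolution_needed (Å), else the default."""
    if len(shells) < 2 or min(d for d, _, _ in shells) >= resolution_needed:
        return default_b
    return wilson_fit(shells)[1]
```

With noisy or low-resolution data the fitted line is poorly determined, which is why the estimate "can be quite a bit off".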
Using the number of residues in the asymmetric unit (with average number of atoms and some ordered water molecules), a theoretical solvent content can be calculated. This will differ slightly from the solvent content as calculated by the Matthews coefficient.
Some information about average values for amplitudes and standard deviations is presented. If the average F/sd(F) is rather low or very high, a warning is printed. This could be because of problems with the proper estimation of standard deviations, e.g. in SCALA or SCALEPACK. However, if your crystals diffracted much better than the resolution limit used, a high <F/sd(F)> should not be surprising.
If data from several files is used, the space groups and cell parameters are checked to be consistent. Cell parameters are considered identical if the differences are below 1%. In MAD cases, the cell of the first wavelength is used; in SIR(AS)/MIR(AS) it is the cell of the native dataset.
See also the discussion of cell parameters in the SHARP manual.
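The 1% criterion for cell parameters can be illustrated with a small Python sketch. The function name and the choice to apply the same relative tolerance to cell lengths and angles are assumptions for this example; the manual only states the 1% threshold.

```python
def cells_consistent(cell_a, cell_b, tolerance=0.01):
    """True if every cell parameter differs by less than `tolerance` (relative).

    cell_a, cell_b: (a, b, c, alpha, beta, gamma) in Å and degrees.
    """
    return all(abs(x - y) / max(abs(x), abs(y)) < tolerance
               for x, y in zip(cell_a, cell_b))

# Example: two wavelengths of a MAD experiment
peak = (50.0, 60.0, 70.0, 90.0, 90.0, 90.0)
remote = (50.3, 60.1, 70.2, 90.0, 90.0, 90.0)
same_cell = cells_consistent(peak, remote)
```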
Using all available information, a Matthews coefficient is calculated for a range of possible numbers of molecules. The solvent content based on these values will differ slightly from the one above (which takes ordered waters into account).
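A table of this kind can be sketched as follows, using the standard definition of the Matthews coefficient, Vm = V / (number of symmetry operators × number of molecules × molecular weight), and the usual approximation solvent = 1 - 1.23/Vm. The function name, the argument names and the upper limit of 8 copies are assumptions for this example and do not describe autoSHARP's internals.

```python
def matthews_table(volume, n_symops, monomer_mw, max_copies=8):
    """List (n_molecules, Vm in Å^3/Da, solvent fraction) for 1..max_copies.

    Stops as soon as the implied solvent fraction is no longer positive.
    """
    rows = []
    for n in range(1, max_copies + 1):
        vm = volume / (n_symops * n * monomer_mw)
        solvent = 1.0 - 1.23 / vm
        if solvent <= 0.0:
            break
        rows.append((n, vm, solvent))
    return rows

# Example: 210000 Å^3 cell, 4 symmetry operators, 10 kDa monomer
for n, vm, solvent in matthews_table(210000.0, 4, 10000.0):
    print(f"{n} molecule(s): Vm = {vm:.2f}, solvent = {solvent:.0%}")
```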
If we're doing a MAD/SAD run with Se or S heavy atoms and a sequence is given, we can check the number of sites you specified against this sequence and the most likely number of monomers in the asymmetric unit (determined using Matthews coefficient). This is a very valuable check on consistency in these special cases. However, e.g. in Se-Met MAD experiments an N-terminal Met residue is quite often very disordered. So don't be surprised if not all possible Se-Met atoms are found. Furthermore, in S-SAD phasing it is possible that the crystallisation buffer contains sulfates - and some of them might be bound to the protein and picked up during heavy atom detection.
Minimum, maximum and overall resolution limits for all datasets are calculated. Additionally, the common resolution range between the various datasets is shown.
A short summary is given to confirm that everything is properly defined (or autoSHARP was able to extract the necessary information). The program will stop on errors.
Analysing anomalous differences
Newer versions of MTZ files should have the appropriate "missing number flags" set for unmeasured data. To make sure that the resulting MTZ file conforms to this standard, checks on the standard deviation(s) for amplitudes (SMID) and (if present) anomalous differences (SANO) are done. Furthermore, a simple analysis of the absolute value of the anomalous differences is performed: it is very unlikely that an anomalous difference should be nearly twice as large as the mean amplitude for this reflection. Earlier versions of SCALEPACK seem to have had some problems in cases where one member of a Friedel pair wasn't collected properly (which should have been rejected during merging but wasn't). All suspiciously large anomalous differences are flagged and highly suspicious ones are rejected.
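The test on anomalous differences can be pictured with the sketch below. Since F = (F+ + F-)/2 and DANO = F+ - F-, the ratio |DANO|/F can never exceed 2, and values close to 2 mean that one member of the pair is nearly zero. The two thresholds (1.5 for flagging, 1.9 for rejection) and the function name are purely hypothetical; the manual does not state the values autoSHARP uses.

```python
def classify_anomalous(f_mean, d_ano, flag_ratio=1.5, reject_ratio=1.9):
    """Return "ok", "flag" or "reject" for one reflection.

    f_mean: mean amplitude; d_ano: anomalous difference F(+) - F(-).
    flag_ratio and reject_ratio are illustrative thresholds only.
    """
    if f_mean <= 0.0:
        return "reject"
    ratio = abs(d_ano) / f_mean
    if ratio >= reject_ratio:
        return "reject"
    if ratio >= flag_ratio:
        return "flag"
    return "ok"

# Example: (mean amplitude, anomalous difference) pairs
reflections = [(100.0, 10.0), (100.0, -160.0), (100.0, 195.0)]
labels = [classify_anomalous(f, d) for f, d in reflections]
```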
The relevant columns for a given dataset are extracted, renamed and appended to the main MTZ file for this project. Please note that each autoSHARP run will again produce a new MTZ file in your sharpfiles/datafiles directory. To avoid overwriting older MTZ files, a prefix might be added (a running index, starting at 1).
After extending the test set(s), the completeness for each data set (amplitude and - if present - anomalous differences) is presented. These are calculated only in the resolution range available for a specific data set. You might want to check this against your statistics from data processing to make sure nothing got lost during file conversions etc.
autoSHARP will give you a warning if the completeness is exceptionally low. You might want to reconsider if you really want to include the data at this stage of your structure determination. Obviously you want to collect a complete dataset. But even when restrictions (beam time, cell parameters, space group, crystal decay, ...) don't allow a highly complete and redundant dataset it should (in general) be possible to collect a reasonably complete dataset. Here, autoSHARP considers completeness below ~ 70% to be suspicious.
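A minimal sketch of such a completeness check is given below. The input layout (a list of d-spacing and value pairs covering all possible reflections, with None for unmeasured ones) and the function names are assumptions for illustration; only the restriction to the dataset's own resolution range and the ~70% threshold come from the text.

```python
def completeness(reflections):
    """Fraction of measured reflections within the dataset's own resolution range.

    reflections: list of (d_spacing, value) for all possible reflections,
    with value None when the reflection is unmeasured.
    """
    measured_d = [d for d, value in reflections if value is not None]
    if not measured_d:
        return 0.0
    d_min, d_max = min(measured_d), max(measured_d)
    in_range = [value for d, value in reflections if d_min <= d <= d_max]
    return sum(value is not None for value in in_range) / len(in_range)

def is_suspicious(fraction, threshold=0.70):
    """True if the completeness falls below the ~70% level quoted above."""
    return fraction < threshold

# Example: the 1.5 Å reflection lies outside the measured range and is ignored
data = [(10.0, 5.0), (5.0, None), (3.0, 4.0), (2.0, 3.0), (1.5, None)]
fraction = completeness(data)
```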
As an additional test to find problematic reflections before actually trying to scale or analyse the data, we look at the normalised structure factors (E values) for a given reflection. If we have several datasets, we can assume that the E values shouldn't differ too much.
Furthermore, the R-factors at low resolution between the various datasets are analysed to spot potential problems in the low resolution range.
If the first dataset (i.e. native in SIR(AS)/MIR(AS) or first wavelength in SAD/MAD) contains anomalous data, it is "scaled" against itself. This is done only to get a reliable outlier rejection criterion for anomalous differences.
All data sets are scaled/analysed against all the others. Any results from scaling to a non-reference data set are used only in analysis, i.e. the compilation of the various cross-tables at the end. Only the scaling results against the first dataset (native or first wavelength) are actually used.
If you specified that all datasets are already scaled (on the first page of the "autoSHARP Control Panel"), you should expect scale factors very close to 1.0. If these differ considerably, you should check your scaling procedure and consult the details for more information.
The gradient information given is for a normal probability analysis (Howell & Smith, 1992). Several different R-factors are presented. For a discussion of these values see the SCALEIT documentation.
If scaling was requested for SIR(AS) or MIR(AS) data, the program FHSCAL is used for scaling. This should take a certain degree of heavy-atom substitution into account. Analysis is still done using SCALEIT.
Calculate correlation coefficients on anomalous differences
In case of a MAD experiment, the correlation coefficient between anomalous differences of different wavelengths can be used to assess the overall quality of the data. Furthermore, it might help to decide on a sensible resolution cut-off for heavy atom detection. (This analysis is based on ideas from the XPREP program by G. Sheldrick.)
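The kind of analysis described here can be sketched in Python as a Pearson correlation coefficient per resolution shell. This is an illustration, not the code autoSHARP or XPREP runs: the input layout, the function names and the 0.3 threshold used to suggest a cut-off are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either list has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def anomalous_cc_by_shell(dano_a, dano_b, shells):
    """Correlate anomalous differences of two wavelengths per shell.

    dano_a, dano_b: dict mapping (h, k, l) -> (d_spacing, anomalous difference).
    shells: list of (d_max, d_min) in Å, ordered from low to high resolution.
    Returns a list of (d_max, d_min, n_pairs, cc), cc being None for n_pairs < 2.
    """
    common = sorted(set(dano_a) & set(dano_b))
    table = []
    for d_max, d_min in shells:
        pairs = [(dano_a[hkl][1], dano_b[hkl][1]) for hkl in common
                 if d_min < dano_a[hkl][0] <= d_max]
        cc = pearson(*zip(*pairs)) if len(pairs) >= 2 else None
        table.append((d_max, d_min, len(pairs), cc))
    return table

def suggested_cutoff(table, min_cc=0.3):
    """High resolution limit of the last shell before the CC drops below min_cc."""
    cutoff = None
    for _, d_min, _, cc in table:
        if cc is None or cc < min_cc:
            break
        cutoff = d_min
    return cutoff
```

A cut-off chosen this way keeps only the shells in which the anomalous signal is reproducible between wavelengths.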
Statistics between all datasets
All relevant R-factors for each pair of datasets are calculated in the common resolution range and presented as a table. If you use several derivatives in MIR(AS) the cross-table of R-factors can give you some information about clusters of isomorphous datasets. It can also help to establish which datasets scale well together and which might be better to leave out at the beginning of a structure solution.
In case of a MAD experiment, the correlations between the various anomalous differences for each wavelength are used to judge quality and a reasonable high resolution cut-off for the heavy atom detection algorithm.
For the specific space group, some information about indeterminate axes and possible origin translations is given. Especially in MIR(AS) experiments, this has the effect that possible solutions for the various derivatives might not be on the same origin.
For each source of phase information in each data set, the Harker section(s) of a Patterson map are printed. Then the corresponding E values (normalised structure factors) are used to generate 3 phase sets. Each phase set represents a possible substructure solution and is used to extract and analyse this possible solution.
The direct methods program RANTAN (Yao, 1981) uses normalised structure factors to generate several phase sets with possible solutions to the substructure. It seems to work well with up to 20 atoms in the substructure. However, in some space groups (low symmetry like monoclinic for example) it seems to have problems. One thing to look at in the details is the number of reflections used for calculating PSI(zero): if only very few are used the resulting statistics are unreliable and the phase sets picked might not be the best ones.
Because of the problems described above, some "weeding" has to be done to get rid of sites that show one or more of the following characteristics:
Hopefully, the highest sites found directly from the phase set should remain in the list. However, if a lot of the highest sites are removed this solution is probably not very reliable.
Based on the (remaining) list of possible heavy atom sites, an increasing subset of these positions is used to calculate correlation coefficients between observed and calculated E values (normalised structure factors). Starting with all positions with peak heights above 6 sigma, the list of positions is increased by reducing this sigma cut-off in steps of one down to a final value of 3 sigma. At each step, several criteria are used to determine if inclusion of additional positions is likely to increase the correlation coefficients.
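The stepwise lowering of the sigma cut-off can be sketched as the loop below. The criteria autoSHARP actually applies at each step are not listed here, so the `score` callable is a hypothetical stand-in (for example, a correlation coefficient between observed and calculated E values); only the 6 to 3 sigma range and the step of one are taken from the text.

```python
def grow_site_list(peaks, score, start_sigma=6, final_sigma=3):
    """Try increasingly large subsets of peaks and keep the best-scoring one.

    peaks: list of (height_in_sigma, (x, y, z)).
    score: callable taking a list of peaks and returning a number where
           higher is better (hypothetical stand-in for the real criteria).
    Returns (best_subset, best_score).
    """
    best_subset, best_score = [], float("-inf")
    for cutoff in range(start_sigma, final_sigma - 1, -1):
        subset = [peak for peak in peaks if peak[0] > cutoff]
        if not subset:
            continue
        current = score(subset)
        if current > best_score:
            best_subset, best_score = subset, current
    return best_subset, best_score

# Example with a toy score that rewards peaks above 4 sigma and penalises the rest
peaks = [(8.2, (0.10, 0.20, 0.30)), (6.5, (0.40, 0.10, 0.25)),
         (5.1, (0.70, 0.35, 0.05)), (4.4, (0.15, 0.80, 0.60)),
         (3.6, (0.55, 0.55, 0.90)), (3.2, (0.95, 0.05, 0.45))]
toy_score = lambda sites: sum(1 if height > 4 else -1 for height, _ in sites)
sites, value = grow_site_list(peaks, toy_score)
```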
As a rule of thumb:
In MIR(AS) experiments, the best possible substructure solutions are presented for each derivative. In SIR(AS), SAD or MAD experiments, the overall best solution is shown.
The available datasets are sorted so that the scenario which is most likely to give a correct set of sites is used first. E.g. in a 3-wavelength MAD experiment, this results in the order
Trying all (sorted) possibilities
Since SHELXC requires SCALEPACK formatted files, it might be necessary to convert the MTZ files into a "pseudo-SCALEPACK" format. If the initial data was already in the correct format, this is obviously not necessary.
For each of the scenarios determined above, SHELXC is run to create a reflection input file for SHELXD. The data analysis from this program is also used to determine an adequate resolution cut-off (to use only data with significant signal).
For the SHELXD run, the resolution cuts as well as the limits on E-values are adjusted to get a large enough number of reflections (but at the same time avoiding too many reflections that would slow things down). Therefore, SHELXD might be stopped and restarted with slightly different parameters.
Whenever a solution with a higher CC(all) is found, some information and plots are presented. If a good solution has been found (depending on the Speed/Accuracy setting you used), SHELXD is stopped and the heavy atom solution is analysed. For heavy-atom soaks, there might not be a clear "jump" in occupancy between the last correct and the first wrong solution.
The finally used solution is reported. This is the set of sites that will go into the initial SHARP refinement.
The set of initial sites is checked to make sure that, for each dataset, these sites are present in the corresponding model map. If, for any dataset, this heavy atom model map doesn't show enough density for a given site, it is excluded. Dispersive differences in MAD are ignored and only the more reliable anomalous differences are used.
Analysis of residual map for wrong f'/f''
If a significant number of sites have residual density (in the anomalous residual map for that wavelength) exactly at the heavy atom position, the refinement of f' and/or f'' is switched on.
Checking residual maps for new sites
All (reliable) residual maps are searched for possible new sites. Only sites present in all of these residual maps at the same time are considered. The current list of heavy atom sites is updated for each dataset up to the user-specified total number of heavy atoms expected (see input preparation manual).
Since the handedness is not yet determined, another (phasing only) run of SHARP will produce a second set of phases (inverted hand). This is again done in a new sub-directory to your sharpfiles/logfiles directory.
This way you will have either 2 (no change after automatic interpretation of residual map) or at least 3 (addition/deletion of sites during automatic interpretation of residual map) different SHARP runs with the last two producing phases in the original and inverted hand.
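What "inverted hand" means for the heavy atom sites can be illustrated with the sketch below, which inverts fractional coordinates through the origin. How SHARP itself performs the inversion is not described here, and the function name is an assumption. The sketch also deliberately ignores two complications: in enantiomorphic space group pairs (e.g. P41/P43) the space group has to be swapped as well, and in a few space groups the inversion is not through the origin.

```python
def invert_hand(sites):
    """Invert fractional coordinates through the origin, wrapped into [0, 1).

    sites: list of (x, y, z) fractional coordinates.
    Does not handle enantiomorphic space group pairs or space groups whose
    inversion point is not the origin.
    """
    return [tuple((-coord) % 1.0 for coord in xyz) for xyz in sites]

# Example: one site and its counterpart in the other hand
original = [(0.10, 0.25, 0.90)]
inverted = invert_hand(original)
```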
Automatic density modification
Once a decision on the correct hand has been made, these phases will be used in an iterative density modification cycle, where the solvent content is varied to optimise the overall score. This will usually lead to a good initial map. However, once you are sure that these phases are correct, it is a good idea to try to fine-tune the parameters to get the best map possible for building your initial model.
At the end a button will give access to the "Phase Improvement and Interpretation Control Panel" where you can view the solvent flattened map.
At the end a button will give access to the "Phase Improvement and Interpretation Control Panel" where you can view the solvent flattened map as well as the final ARP/wARP map and model.