SHARP/autoSHARP User Manual |
Chapter 4 |
Copyright | © 2001-2006 by Global Phasing Limited |
All rights reserved. | |
This software is proprietary to and embodies the confidential technology of Global Phasing Limited (GPhL). Possession, use, duplication or dissemination of the software is authorised only pursuant to a valid written licence from GPhL. | |
Documentation | (2001-2006) Clemens Vonrhein |
(1995-1998) Eric de La Fortelle | |
Contact | sharp-develop@GlobalPhasing.com |
General information. No refinable parameters here, just the usual crystallographic setup.
List of all possible sites. These are called Geometric Sites. A Geometric Site (or G-Site) is a positional placeholder in the unit cell (defined by its only parameters, the fractional coordinates x, y, z), which can be subsequently referred to in the description of substituents. These "points" provide a general mechanism for specifying common sites shared between several isomorphous components (in MIR) or between several wavelengths (in MAD).
Refinable parameters : x, y, z.
Refinable parameters : none .
Refinable parameters : occupancy , temperature factor (isotropic or anisotropic).
Refinable parameters : none.
Similarly, to accommodate variability caused by crystal decay, the scale parameters and Lack-of-isomorphism (LOI) parameters are refined at this level as well.
Refinable parameters : scale K and temperature scale factor B, global non-isomorphism parameters, model imperfection parameters, scattering factors f' , f".
Please note that not all characters are allowed here: letters, numbers and "-" are ok, but special characters like &, %, @, _, "." etc might cause problems in some of the scripts. You also have to avoid any white-space (" " or tabs).
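The naming rule above can be checked before a name is entered. The following is a minimal sketch, assuming the rule is exactly "ASCII letters, digits and '-' only"; the function name `is_safe_name` is hypothetical and not part of SHARP.

```python
import re

# Assumption: a name is "safe" if it consists only of ASCII letters,
# digits and "-" (no white-space, no other special characters).
_SAFE_NAME = re.compile(r"[A-Za-z0-9-]+")


def is_safe_name(name: str) -> bool:
    """Return True if `name` contains only letters, digits and '-'."""
    return _SAFE_NAME.fullmatch(name) is not None
```

For instance, `is_safe_name("lyso-Hg1")` is accepted, while names containing "_", "." or a space are rejected.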
As an extra protection against 'hidden' outliers (i.e. those that will not show up in the compulsory histograms of intensity, isomorphous/anomalous differences etc. before ML refinement), another round of rejections can be added, based on the value of the likelihood for each reflection. This is always turned on for users defined as "standard" level in the Preferences settings. For "advanced" (or higher) users, we recommend leaving it on except for very special applications where you are sure there are no outliers in the dataset.
If the ML Parameter refinement tick box is activated, SHARP will vary all parameters marked for refinement until it reaches a maximum value of the log-likelihood function. The refinement stops when the step in parameter space (in units of standard deviations) is considered sufficiently small.
There is some strategy involved in this procedure, because parameters differ greatly in their impact on the refinement. In simple terms, some parameters have much more influence on the increase of the likelihood than others, and should therefore be refined first to smooth out the convergence.
FOR SHARP 2.0.0 and higher:
You can specify the starting BIG CYCLE number, the final BIG CYCLE number, and the number of small cycles within each BIG CYCLE.
You can fine-tune the refinement strategy for each BIG CYCLE. The defaults are probably correct for nearly all cases. It is always a good idea to refine the most important parameters first and to add further parameters gradually as the refinement progresses (see also the description of the STRATEGY keyword).
An element in the Hessian matrix is discarded if it 'belongs' to a pair of atomic variables derived from two different G-sites with a distance above Sparse_cut (symmetry equivalents are taken into account).
Since this is an approximation, convergence to the optimum set of parameters is slower - but for very large structures with many G-sites, the construction of the sparse Hessian is significantly faster. This can give an overall reduction of compute time necessary to achieve convergence.
Tests indicate that the use of a sparse approximation is useful if there are more than 50 G-sites using a Sparse_cut value of 16 Å. It is recommended to run an additional BIG cycle without the sparse approximation to verify that the approximation doesn't lead to a wrong stationary point in parameter space.
(see also description of SPARSE keyword)
If a G-site is being weeded it will have:
(see also description of WEED keyword)
A further obvious use of these coefficients is to use them in subsequent phase improvement and interpretation steps (see the phase improvement and interpretation user guide). For this, additional items are calculated per reflection (Hendrickson-Lattman coefficients, figure-of-merit etc).
Note : If you used external phase information during this phase calculation step within SHARP, you have to be careful: the coefficients and values calculated here will include this external phase information. This combination may only be adequate if the external phase information is independent of the heavy atom model you are using for phase calculation.
Important note : The cell dimensions that SHARP will use are those of the MTZ file (the program will stop in case it detects an inconsistency). But beware: these cell dimensions should be those of the REFERENCE dataset. The usual convention is to have the cell dimensions of the native dataset as cell dimensions for the global MTZ file. This is fine in the general case where the native is the reference, but if for some reason (correlated non-isomorphism) you want to have another dataset as the reference (first wavelength of COMPOUND-1), then you must accordingly change the cell parameters in the global MTZ file.
Note : Only light atoms (excluding hydrogens) should be specified here, i.e. all non-hydrogen atoms that are not part of the heavy-atom model and are therefore invisible to any source of phase information. Alternatively, you can specify the number of amino acid residues in the asymmetric unit and an approximate atomic composition will be calculated - or vice versa. This is a good check of whether your input values are reasonable.
Example and syntax :
C 1250 O 5234 N 2200 S 5
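To make the syntax concrete, here is a minimal sketch of how such a composition line could be read, assuming it is a white-space separated sequence of "symbol count" pairs. The function name `parse_composition` is hypothetical; this is not SHARP's own parser.

```python
def parse_composition(text: str) -> dict:
    """Parse 'C 1250 O 5234 ...' into {'C': 1250, 'O': 5234, ...}.

    Assumption: tokens alternate between a chemical symbol and an
    integer atom count, separated by white-space.
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating 'symbol count' pairs")
    composition = {}
    for symbol, count in zip(tokens[0::2], tokens[1::2]):
        # Repeated symbols are summed rather than overwritten.
        composition[symbol] = composition.get(symbol, 0) + int(count)
    return composition
```

Summing the counts of the parsed dictionary gives the total number of non-hydrogen light atoms, which is a quick plausibility check against the expected size of the asymmetric unit.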
If, in addition, this file is group-readable, it will appear in the menu of datafiles you can use. Selecting it will trigger additional control options to appear at the bottom of the page.
Where do these Hendrickson-Lattman coefficients come from ?
If the information comes from an independent heavy-atom phasing experiment, these coefficients are usually output by the phasing programs. In any case, you should re-refine and phase this other derivative using SHARP.
If you have a known fragment, or an imperfect MR solution, there is no established way of producing these coefficients. The CCP4 program SIGMAA (Read, 1986) calculates these coefficients, but until recently did not output them. In recent versions, it may be possible to make SIGMAA write them out. If you're interested in a simple program that computes them from a phase and its associated weight (e.g. from SIGMAA), you can use the mkhl binary that comes with the distribution.
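For orientation, the sketch below shows one standard way of turning a phase and a figure-of-merit into Hendrickson-Lattman coefficients. It assumes an acentric reflection with a unimodal phase distribution P(phi) proportional to exp(X cos(phi - phi0)), for which the figure-of-merit equals I1(X)/I0(X), so that A = X cos(phi0), B = X sin(phi0) and C = D = 0. Whether mkhl uses exactly this conversion is not stated here; the function names and the cap `x_max` are hypothetical.

```python
import math


def bessel_ratio(x: float) -> float:
    """Return I1(x)/I0(x) for x >= 0 from the power series of the
    modified Bessel functions (terms are built up iteratively)."""
    half_sq = (x / 2.0) ** 2
    term0, term1 = 1.0, x / 2.0  # k = 0 terms of I0 and I1
    i0, i1 = term0, term1
    for k in range(1, 400):
        term0 *= half_sq / (k * k)
        term1 *= half_sq / (k * (k + 1))
        i0 += term0
        i1 += term1
    return i1 / i0


def phase_to_hl(phase_deg: float, fom: float, x_max: float = 100.0):
    """Return (A, B, C, D) for an acentric reflection.

    Assumption: unimodal distribution, so C = D = 0 and the
    concentration X solves I1(X)/I0(X) = fom (found by bisection).
    """
    fom = min(max(fom, 0.0), bessel_ratio(x_max))  # keep X within [0, x_max]
    lo, hi = 0.0, x_max
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if bessel_ratio(mid) < fom:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    phi = math.radians(phase_deg)
    return (x * math.cos(phi), x * math.sin(phi), 0.0, 0.0)
```

Centric reflections and bimodal distributions need a different treatment and are deliberately left out of this sketch.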
If you place your known fragment or model as model_*.pdb into the datafiles directory of your SHARP installation, you can use a utility in the Phase Improvement and Interpretation Panel to calculate these coefficients (and produce a correct MTZ file). This might be the easiest way.
Warning : These coefficients may be biased, especially in the case of a Molecular Replacement solution. Do not take them at face value, especially at high resolution. You could try to blur them with a factor that increases exponentially with resolution, or cut off the high resolution. A rational approach to this problem, involving maximum-likelihood refinement of an imperfection parameter based on the Luzzati model (Luzzati, 1952) has been developed by Gérard Bricogne in BUSTER, but it is not yet distributed.
Are there alternative ways of incorporating additional information ?
Yes: in the Phase Improvement and Interpretation Panel you can use an existing (initial) model within the general density improvement procedure.
The other way to create a large number of G-Sites is to press the Add button after having specified the name of a 'coordinate file' in the following menu. A 'coordinate file' is an ASCII file located in the datafiles directory, with the extension .hatom, containing lines that follow the syntax :
ATOM Se 0.0903 0.3885 0.1297
The first two fields must be present and separated by spaces, but are not interpreted. Then there are three fractional coordinates x, y and z, in free format, separated by spaces. The rest of the line is ignored.
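The coordinate-file syntax described here can be read with a few lines of code. This is a minimal sketch under the stated rules (two uninterpreted leading fields, then x, y, z, rest ignored); the function names are hypothetical, and skipping blank lines is an assumption of the sketch.

```python
def parse_hatom_line(line: str):
    """Return (x, y, z) from one coordinate-file line.

    Fields 1 and 2 must be present but are not interpreted;
    fields 3-5 are fractional coordinates; anything after is ignored.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("need two leading fields and three coordinates")
    x, y, z = (float(value) for value in fields[2:5])
    return (x, y, z)


def parse_hatom(text: str):
    """Parse the whole file content; blank lines are skipped (assumption)."""
    return [parse_hatom_line(line) for line in text.splitlines() if line.strip()]
```

Each returned triplet corresponds to one G-Site that would be created from the file.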
Inverting the hand of heavy atom positions usually involves inversion through the origin (0,0,0); only in I41 (1/2,0,0), I4122 (1/2,0,1/4) and F4132 (1/4,1/4,1/4) is the centre of inversion different.
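A minimal sketch of the hand inversion on fractional coordinates follows. It assumes that the triplets quoted for I41, I4122 and F4132 act as the vector t in the operation (x, y, z) -> t - (x, y, z), which is the form in which these special cases are commonly quoted, and that all other space groups use t = (0,0,0). This reading of the triplets, the function name and the dictionary are assumptions of the sketch; changing the space group to its enantiomorph where applicable (e.g. P41 to P43) is not handled.

```python
# Assumed translation t for the operation xyz -> t - xyz (see lead-in).
INVERSION_SHIFT = {
    "I41": (0.5, 0.0, 0.0),
    "I4122": (0.5, 0.0, 0.25),
    "F4132": (0.25, 0.25, 0.25),
}


def invert_hand(xyz, space_group: str):
    """Return the inverted-hand fractional coordinates, wrapped into [0, 1)."""
    shift = INVERSION_SHIFT.get(space_group, (0.0, 0.0, 0.0))
    return tuple((t - v) % 1.0 for v, t in zip(xyz, shift))
```

Applying `invert_hand` twice returns the original position (up to rounding), which is a convenient self-check.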
space group | polar axes |
---|---|
P1 | x, y, z |
P2, P21, C2 | y |
P4, P41, P42, P43, I4, I41 | z |
P3, P31, P32, R3, H3 | z |
P6, P61, P62, P63, P64, P65 | z |
This option is only present when the details of all G-sites are visible. You can check how many C-Sites have been assigned so far to a given G-Site in the 'Usage' column of the G-Site table.
This first dataset is also a reference for lack-of-isomorphism (LOI). LOI is in its very nature a relative quantity. A crystal is never 'non-isomorphous' in itself, but relative to another crystal, which is taken as a reference. The reference for non-isomorphism is usually the native crystal, but it does not need to be. A native crystal can be defined as second compound (C-2) and have non-isomorphism defined for it with respect to a derivative defined as first compound (C-1).
In general, the reference dataset should be the one with the highest information content, i.e. the most complete one, diffracting to the highest resolution. However, it can happen that two derivatives share a similar pattern of non-isomorphism (e.g. if they have been soaked in similar conditions). In that case it is usually preferable to have one of them as the reference for non-isomorphism, to avoid correlation.
To get an idea of which datasets are related, you could calculate R-factors between all pairs of datasets (using the same resolution range), e.g. with the CCP4 program SCALEIT. The resulting table of R-factors can point you to clusters of very similar datasets. If you encounter a large cluster with significantly lower R-factors, it is advisable to pick the best dataset of this cluster as the reference. This kind of analysis is done automatically if you run the autoSHARP procedure - although (for the moment) no automatic choice is made.
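The pairwise comparison can be illustrated with a small sketch. It assumes that each dataset is held as a dictionary mapping (h, k, l) to an amplitude, that the datasets are already on a common scale and cut to the same resolution range, and that R = sum|F1 - F2| / sum(0.5 * (F1 + F2)) over common reflections. This is one common definition and may differ from what SCALEIT reports; all names are hypothetical.

```python
def r_factor(a: dict, b: dict) -> float:
    """R-factor on amplitudes over the reflections common to both datasets."""
    common = a.keys() & b.keys()
    if not common:
        raise ValueError("no common reflections")
    numerator = sum(abs(a[hkl] - b[hkl]) for hkl in common)
    denominator = sum(0.5 * (a[hkl] + b[hkl]) for hkl in common)
    return numerator / denominator


def r_factor_table(datasets: dict) -> dict:
    """Return {(name1, name2): R} for every unordered pair of datasets."""
    names = sorted(datasets)
    return {
        (first, second): r_factor(datasets[first], datasets[second])
        for i, first in enumerate(names)
        for second in names[i + 1:]
    }
```

Pairs with markedly lower values in the resulting table are candidates for belonging to the same cluster.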
Note : In our SHARP convention of a SIR(AS)/MIR(AS) experiment, the native dataset is also considered a compound. In practical terms, SHARP gives no special status to the native, which is then considered a 'derivative without a heavy atom'.
The first dataset belonging to the first COMPOUND is by convention a reference for scale and lack-of-isomorphism. This means that the scale parameters for the first dataset of the first COMPOUND (the dataset of W-1) should not be refined. However, the scale factor (K) can be marked for estimation, in which case absolute scaling will be performed based on the chemical composition (if any) given on the global page.
This first dataset is also a reference for lack-of-isomorphism (LOI). LOI is in its very nature a relative quantity. A crystal is never 'non-isomorphous' in itself, but relative to another "reference" crystal. The reference for non-isomorphism is usually the native crystal, but it does not need to be. A native crystal can be defined as COMPOUND 2 and have non-isomorphism defined for it with respect to a derivative defined as COMPOUND 1. In a "MAD + native" calculation, one of the MAD wavelengths should always be the reference: otherwise the correlated non-isomorphisms of the MAD datasets relative to a native reference dataset are not treated correctly. (See above on how to find the best reference dataset.)
Note : The first letter of the chemical symbol must be uppercase, the second (and others) lowercase for the symbol to be recognised.
Examples : Br, Hg, Au
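The capitalisation rule can be checked or enforced mechanically. A minimal sketch, assuming a symbol consists of ASCII letters only; the function names are hypothetical, and the check does not verify that the symbol is a real element.

```python
def is_valid_symbol(symbol: str) -> bool:
    """True if the first letter is uppercase and all others are lowercase."""
    if not symbol or not (symbol.isascii() and symbol.isalpha()):
        return False
    return symbol[0].isupper() and symbol[1:] == symbol[1:].lower()


def normalise_symbol(symbol: str) -> str:
    """Rewrite e.g. 'HG' or 'hg' as 'Hg'."""
    return symbol[:1].upper() + symbol[1:].lower()
```

With this, "Br", "Hg" and "Au" pass the check, whereas "BR" or "hg" would first need `normalise_symbol`.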
Note: to specify the C-site as a spherical cluster the Tag field should contain the cluster name and atom type separated by a ":". When adding C-sites from scratch specify the full tag-name:atom-type in the "type" field.
Examples of atom type and spherical cluster name pairs : (Ta and Ta6Br12:Ta), (W and W18:W)
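The "cluster-name:atom-type" convention can be split as in the following sketch. It assumes that a field without ":" denotes a plain atom type and that exactly one ":" separates the two parts; the function name is hypothetical.

```python
def parse_site_type(field: str):
    """Return (cluster_name, atom_type); cluster_name is None for a plain atom."""
    cluster, separator, atom = field.partition(":")
    if not separator:
        return (None, field)  # no ':' present: ordinary atom type
    if not cluster or not atom or ":" in atom:
        raise ValueError("expected 'cluster-name:atom-type'")
    return (cluster, atom)
```

For the examples above, "Ta6Br12:Ta" yields the cluster name "Ta6Br12" with atom type "Ta", and "W" is returned as a plain atom type.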
This is just a way of describing the effect of occupancy and temperature factor on the atom (C-site) positioned at this G-site: it is going to be "tuned" up and down if these parameters vary.
The one thing that can't vary from one crystal to another is the number of chemical sites : therefore, you will not be allowed to create/delete sites at this level (in pathological cases, something similar can be done by setting the occupancy to zero and not refining it).
The way to refine temperature factors is defined in the column under the heading B refinement. You have a choice between
The minimal assignments that are required are those for FMID (structure factor amplitude) and SMID (estimated standard deviation on the measurement of FMID).
If there is anomalous diffraction present in the data, then FMID = 0.5 * ( F+ + F- ) and you will need to specify three additional columns: DANO, the anomalous difference ( DANO = F+ - F- ); SANO, the estimated standard deviation on the measurement of DANO; and ISYM.
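The two definitions, FMID = 0.5 * (F+ + F-) and DANO = F+ - F-, and their inverse can be written out as a short sketch. The function names are hypothetical; standard deviations (SMID, SANO) are not handled here because their derivation is not given in this section.

```python
def mid_and_dano(f_plus: float, f_minus: float):
    """Return (FMID, DANO) for a fully measured anomalous pair."""
    fmid = 0.5 * (f_plus + f_minus)
    dano = f_plus - f_minus
    return (fmid, dano)


def split_pair(fmid: float, dano: float):
    """Inverse operation: recover (F+, F-) from FMID and DANO."""
    return (fmid + 0.5 * dano, fmid - 0.5 * dano)
```

Both functions apply only when both members of the pair were measured.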
ISYM is used when only one member of the anomalous pair is measured. In that case DANO = 0 and FMID is either F+ or F-. Without knowing which, you would be led to make a statistical error by assuming that FMID is the average of the two. ISYM provides the necessary information :
Pseudo-absolute scaling will only work if you have indicated the atomic composition of the asymmetric unit in the general info page, and if your data extends to a resolution higher than 3.5Å. (Advanced users have the possibility to even refine the scale factor for the reference dataset, although this is not recommended).
Typical values for parameter NISO_BGLO (global non-isomorphism parameter on isomorphous differences) range from 0 to 3. A value of 3 already indicates quite high lack of isomorphism.
The parameter NANO_BGLO (global non-isomorphism parameter on anomalous differences) can safely have larger values (up to 10 and more): but they usually indicate either a deficiency in the estimation of standard deviations for the measured intensities or inaccuracies in the anomalous scattering model.
The values of NISO_CLOC (model imperfection parameter on isomorphous differences) and NANO_CLOC (model imperfection parameter on anomalous differences) are expressed as a fraction of the total diffracted intensity and should therefore remain between 0 and 1. Exceptionally high values show that a significant amount of signal is not yet parametrised in the heavy atom model and is considered "noise".
Especially for a MAD experiment, starting values for these parameters are very important : we recommend using values from fluorescence measurements, or at the very least, starting values from the Sasaki tables (Sasaki, 1989), possibly corrected for white lines and other non-calculable edge features. (These step-wise tables are accessible on-line from the BATCH page in the interface). You can also use the CCP4 program CROSSEC to calculate values at exact wavelengths. If no fluorescence data is available, "expert" values (e.g. from beam-line staff) for common elements like Se might even be better than the calculated/tabulated values.
Warning : Refining f', f" and the occupancy of all sites for a given heavy-atom type will lead to a redundant parametrisation - and probably to eigenvalues being filtered in the inversion of the Hessian matrix. It is preferable to avoid this by not refining f' in SIR(AS) or MIR(AS). In MAD you should keep f' (and possibly f" as well) fixed for one of the wavelengths - usually the remote wavelength, where these values are well determined. If your f' and f" values come from a very good fluorescence scan, or they belong to a wavelength far away from the edge, it might be best to keep both of them fixed until the later stages of the refinement procedure. The anomalous residual maps are a very good indicator of whether a refinement of f' and/or f" might be necessary.
If you are happy with the parameter values in that file, just press OK to run the calculation, or Save to set up everything and start the calculation later using the 'Restart' facility in the Control Panel. If you see something that needs to be corrected, just press Cancel to close the verification page, and modify whatever needs changing. Then Submit again.