SHARP/autoSHARP User Manual
Chapter 4
Copyright © 1999-2005 by Clemens Vonrhein
Copyright © 1995-1998 by Eric de La Fortelle
and the Buster Development Group.
All rights reserved.
General information. No refinable parameters here, just the usual crystallographic setup.
List of all possible sites. These are called Geometric Sites. A Geometric Site (or G-Site) is a positional placeholder in the unit cell (defined by its only parameters, the fractional coordinates x, y, z), which can be subsequently referred to in the description of substituents. These "points" provide a general mechanism for specifying common sites shared between several isomorphous components (in MIR) or between several wavelengths (in MAD).
Refinable parameters : x, y, z.
Refinable parameters : none.
Refinable parameters : occupancy, temperature factor (isotropic or anisotropic).
Refinable parameters : none.
Similarly, to accommodate variability caused by crystal decay, the scale parameters and Lack-of-isomorphism (LOI) parameters are refined at this level as well.
Refinable parameters : scale K and temperature scale factor B, global non-isomorphism parameters, model imperfection parameters, scattering factors f' , f".
Please note that not all characters are allowed here: letters, numbers and "-" are OK, but special characters such as &, %, @, _ and "." might cause problems in some of the scripts. You must also avoid any whitespace (spaces or tabs).
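The character rule above can be checked before a name is typed into the interface. The following is only a minimal sketch: the function name is_safe_identifier is hypothetical, and it assumes that "letters" means plain ASCII letters.

```python
import re

# Assumption: only ASCII letters, digits and "-" are considered safe.
_SAFE_NAME = re.compile(r"[A-Za-z0-9-]+")


def is_safe_identifier(name: str) -> bool:
    """Return True if name uses only letters, digits and '-' (no whitespace)."""
    return _SAFE_NAME.fullmatch(name) is not None
```

For example, "lyso-mad1" passes this check, whereas "lyso_mad" and "my project" do not.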
Title
The title is there to remind you what is specific to the calculation
you are currently undertaking. Keep in mind that the graphical
interface makes it very easy to generate many different jobs. You will
need the title to sort out which is which. For these reasons, the
first few words of the title will appear in the pop-ups alongside with
the project ID.
As an extra protection against 'hidden' outliers (i.e. those that will not show up in the compulsory histograms of intensity, isomorphous/anomalous differences etc. before ML refinement), another round of rejections can be added, based on the value of the likelihood for each reflection. This will always be turned on for users defined as "standard" level in the Preferences settings. For users at "advanced" level (or higher), we recommend not turning this off except for very special applications, where you are sure there is no outlier in the dataset.
ML parameter refinement
Note : Before refinement, a procedure will automatically try
to estimate good starting values for all parameters marked for
estimation, to ensure that refinement starts reasonably close to the
target values of the parameters.
If the ML Parameter refinement tick box is activated, SHARP will vary all parameters marked for refinement until it reaches a maximum value of the log-likelihood function. The refinement stops when the step in parameter space (in units of standard deviations) is considered sufficiently small.
There is some strategy involved in this procedure, because parameters differ greatly in their impact on the refinement. In simple terms, some parameters have much more influence over the increase of likelihood than others, and should therefore be refined first to smooth out the convergence.
FOR SHARP 2.0.0 and higher:
You can specify the starting BIG CYCLE number, the final BIG CYCLE number as well as the number of small cycles within each BIG CYCLE.
You can fine-tune the refinement strategy for each BIG CYCLE. The defaults are probably correct for nearly all cases. It is always a good idea to start by refining the most important parameters and to add further parameters gradually as the refinement progresses (see also the description of the STRATEGY keyword).
FOR SHARP 2.0.0 and higher: Sparseness
It is possible to use a sparse approximation to the full Hessian
matrix. The sparseness is defined using a distance cut-off,
Sparse_cut, between G-sites.
An element in the Hessian matrix is discarded if it 'belongs' to a pair of atomic variables derived from two different G-sites with a distance above Sparse_cut (symmetry equivalents are taken into account).
Since this is an approximation, convergence to the optimum set of parameters is slower - but for very large structures with many G-sites, the construction of the sparse Hessian is significantly faster. This can give an overall reduction of compute time necessary to achieve convergence.
Tests indicate that the sparse approximation is useful when there are more than 50 G-sites, with a Sparse_cut value of 16 Å. It is recommended to run an additional BIG cycle without the sparse approximation to verify that the approximation has not led to a wrong stationary point in parameter space.
(see also description of SPARSE keyword)
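The distance criterion can be illustrated with a small sketch. This is not SHARP's implementation: it assumes an orthogonal unit cell and considers lattice translations only, not the full set of symmetry equivalents that SHARP takes into account. The names min_image_distance and sparse_mask are hypothetical.

```python
import math


def min_image_distance(a, b, cell):
    """Shortest distance (in Å) between fractional points a and b.

    Assumption: orthogonal cell with edge lengths cell = (a, b, c);
    only lattice translations are considered.
    """
    d2 = 0.0
    for fa, fb, length in zip(a, b, cell):
        df = (fa - fb) % 1.0
        df = min(df, 1.0 - df)  # wrap to the nearest periodic image
        d2 += (df * length) ** 2
    return math.sqrt(d2)


def sparse_mask(sites, cell, sparse_cut=16.0):
    """mask[i][j] is True if the Hessian block for G-sites i, j is kept."""
    n = len(sites)
    return [
        [min_image_distance(sites[i], sites[j], cell) <= sparse_cut
         for j in range(n)]
        for i in range(n)
    ]
```

Blocks for which the mask is False correspond to pairs of G-sites further apart than Sparse_cut, i.e. the elements that would be discarded.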
FOR SHARP 2.0.0 and higher: Weeding
Weeding is a simple mechanism to detect possibly incorrect sites in
the heavy atom model. Such incorrect sites can cause problems for the
optimization procedure. So if you are not certain that all your sites
are correct we strongly recommend using this weeding mechanism.
Weeding is done at the beginning of each new BIG cycle.
If a G-site is being weeded it will have:
(see also description of WEED keyword)
Residual (Log-Likelihood Gradient = LLG) maps
Even though the parameters of the current heavy atom model are optimal
at convergence, this model can still be incomplete or wrong. The most
common instances of this are :
Centroid electron density map
If you want to inspect the electron density map at the current
stage of the refinement, this option will make SHARP calculate the relevant Fourier
coefficients for you. These coefficients are computed according to the
time-honored "Best Fourier" method of Blow and Crick (Blow & Crick, 1959), extended to two-dimensional
centroid structure factors. More information can be found in the SHARP output guide.
A further obvious use of these coefficients is to use them in subsequent phase improvement and interpretation steps (see the phase improvement and interpretation user guide). For this, additional items are calculated per reflection (Hendrickson-Lattman coefficients, figure-of-merit etc).
Note : In cases where you used external phase information during this phase calculation step within SHARP you have to be careful: the coefficients and values calculated here will include this external phase information. This combination might only be adequate, if the external phase information is independent of the heavy atom model you are using for phase calculation.
Important note : The cell dimensions that SHARP will use are those of the MTZ file (the program will stop in case it detects an inconsistency). But beware: these cell dimensions should be those of the REFERENCE dataset. The usual convention is to have the cell dimensions of the native dataset as cell dimensions for the global MTZ file. This is fine in the general case where the native is the reference, but if for some reason (correlated non-isomorphism) you want to have another dataset as the reference (first wavelength of COMPOUND-1), then you must accordingly change the cell parameters in the global MTZ file.
Note : Only light atoms (excluding hydrogens) should be specified here, i.e. all non-hydrogen atoms that are not part of the heavy-atom model and are therefore invisible to any source of phase information. Alternatively, you can specify the number of amino acid residues in the asymmetric unit and an approximate atomic composition will be calculated - or vice versa. This is a good check of whether your input values are reasonable.
Example and syntax :
C 1250 O 5234 N 2200 S 5
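As an illustration of this syntax, a composition string of alternating element symbols and atom counts can be read as follows. This is a minimal sketch, not the parser used by SHARP, and the function name parse_composition is hypothetical.

```python
def parse_composition(text):
    """Parse 'C 1250 O 5234 ...' into a dict {symbol: count}.

    Assumption: whitespace-separated pairs of element symbol and
    integer atom count, as in the example above.
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating symbols and counts")
    composition = {}
    for symbol, count in zip(tokens[0::2], tokens[1::2]):
        composition[symbol] = composition.get(symbol, 0) + int(count)
    return composition
```

Summing the counts gives the total number of non-hydrogen atoms, which can be compared against the number expected from the residue count.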
If, in addition, this file is group-readable, it will appear in the menu of datafiles you can use. Selecting it will trigger additional control options to appear at the bottom of the page.
Where do these Hendrickson-Lattman coefficients come from ?
If the information comes from an independent heavy-atom phasing experiment, these coefficients are usually output by the phasing programs. In any case, you should re-refine and phase this other derivative using SHARP.
If you have a known fragment, or an imperfect MR solution, there is no established way of producing these coefficients. The CCP4 program SIGMAA (Read, 1986) calculates these coefficients, but until recently did not output them. In recent versions, it may be possible to make SIGMAA write them out. If you're interested in a simple program that uses a phase and its associated weight (e.g. from SIGMAA), you can use the mkhl binary that comes with the distribution.
If you place your known fragment or model as model_*.pdb into the datafiles directory of your SHARP installation, you can use a utility in the Phase Improvement and Interpretation Panel to calculate these coefficients (and produce a correct MTZ file). This might be the easiest way.
Warning : These coefficients may be biased, especially in the case of a Molecular Replacement solution. Do not take them at face value, especially at high resolution. You could try to blur them with a factor that increases exponentially with resolution, or cut off the high resolution. A rational approach to this problem, involving maximum-likelihood refinement of an imperfection parameter based on the Luzzati model (Luzzati, 1952) has been developed by Gérard Bricogne in BUSTER, but it is not yet distributed.
Are there alternative ways of incorporating additional information ?
Yes: in the Phase Improvement and Interpretation Panel you can use an existing (initial) model within the general density improvement procedure.
Adding one or more G-sites
Clicking on Create, after having specified
how many new sites you want to create, will add that number of extra
lines to the G-site table below. You then have to fill the
newly-created empty fields with the appropriate fractional
coordinates.
The other way to create a large number of G-Sites is to press the Add button after having specified the name of a 'coordinate file' in the following menu. A 'coordinate file' is an ASCII file located in the datafiles directory, with the extension .hatom, and containing information that follows the syntax :
ATOM Se 0.0903 0.3885 0.1297
The first two fields must be present and separated by spaces, but are not interpreted. Then there are three fractional coordinates x, y and z, in free format, separated by spaces. The rest of the line is ignored.
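The coordinate file syntax described here can be read with a few lines of code, for instance to check a file before placing it in the datafiles directory. This is only a sketch of the stated rules (two uninterpreted fields, three fractional coordinates, remainder ignored); the function names parse_hatom_line and parse_hatom are hypothetical.

```python
def parse_hatom_line(line):
    """Return (x, y, z) from one line of a .hatom coordinate file.

    The first two fields must be present but are not interpreted;
    anything after the three coordinates is ignored.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected two fields followed by x, y, z")
    x, y, z = (float(value) for value in fields[2:5])
    return x, y, z


def parse_hatom(text):
    """Parse all non-empty lines of a coordinate file given as a string."""
    return [parse_hatom_line(line)
            for line in text.splitlines() if line.strip()]
```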
Inverting hand
In most cases of heavy atom refinement and phasing it might not be
clear if the correct hand is used. The Invert button enables the user to invert the
handedness of all G-sites at once. The program will automatically
switch to the enantiomorph space group if necessary.
Inverting the hand of heavy atom positions usually involves inversion
through the origin (0,0,0). Only in I41 (1/2,0,0),
I4122 (1/2,0,1/4) and F4132 (1/4,1/4,1/4) is the
centre of inversion different.
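Inversion through a centre c maps a position x to 2c - x. The sketch below applies this to fractional coordinates, using the centres listed above. It is an illustration only: the names INVERSION_CENTRES and invert_site are hypothetical, wrapping the result into the range [0, 1) is a choice made here, and the switch to the enantiomorph space group is not handled.

```python
# Centres of inversion that differ from the origin, as listed above.
INVERSION_CENTRES = {
    "I41": (0.5, 0.0, 0.0),
    "I4122": (0.5, 0.0, 0.25),
    "F4132": (0.25, 0.25, 0.25),
}


def invert_site(xyz, space_group):
    """Invert fractional coordinates through the appropriate centre.

    Each coordinate v becomes 2*c - v, wrapped into [0, 1).
    """
    centre = INVERSION_CENTRES.get(space_group, (0.0, 0.0, 0.0))
    return tuple((2.0 * c - v) % 1.0 for v, c in zip(xyz, centre))
```

Applying invert_site twice returns the original position (modulo lattice translations), which is a convenient sanity check.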
Setting refinement flags
To reset all coordinates of all G-sites automatically to either
"refine" or "norefine" you just have to use the
Set button. Be aware that no check for
space groups with polar axes will be done (i.e. where the origin is not
defined along one or more axes and you might need to fix one or more
of your coordinates). These include
| space group | polar axes |
|---|---|
| P1 | x, y, z |
| P2, P21, C2 | y |
| P4, P41, P42, P43, I4, I41 | z |
| P3, P31, P32, R3, H3 | z |
| P6, P61, P62, P63, P64, P65 | z |
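Since the Set button performs no such check, the table can be turned into a simple lookup to remind yourself which coordinates may need to be fixed. This is a sketch only: the names POLAR_AXES and polar_axes are hypothetical, and the lookup covers just the space groups listed in the table.

```python
# Polar axes for the space groups listed in the table above.
POLAR_AXES = {}
for _groups, _axes in [
    (("P1",), "xyz"),
    (("P2", "P21", "C2"), "y"),
    (("P4", "P41", "P42", "P43", "I4", "I41"), "z"),
    (("P3", "P31", "P32", "R3", "H3"), "z"),
    (("P6", "P61", "P62", "P63", "P64", "P65"), "z"),
]:
    for _group in _groups:
        POLAR_AXES[_group] = set(_axes)


def polar_axes(space_group):
    """Return the set of polar axes, or None if the group is not in the table."""
    return POLAR_AXES.get(space_group)
```

A return value of None means only that the space group does not appear in the table above.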
Deleting G-Sites
You can also delete G-Sites by first clicking a number of Delete ? tick boxes and then pressing the Delete button. But beware: all C-sites that are
placed at this geometrical position will be removed as well.
This option is only present when the details of all G-sites are visible. You can check how many C-Sites have been assigned so far to a given G-Site in the 'Usage' column of the G-Site table.
This first dataset is also a reference for lack-of-isomorphism (LOI). LOI is in its very nature a relative quantity. A crystal is never 'non-isomorphous' in itself, but relative to another crystal, which is taken as a reference. The reference for non-isomorphism is usually the native crystal, but it does not need to be. A native crystal can be defined as second compound (C-2) and have non-isomorphism defined for it with respect to a derivative defined as first compound (C-1).
How to choose the correct reference ?
In general, the reference dataset should be the one with the highest information content, i.e. most complete and diffracting to the highest resolution. However, it can happen that two derivatives share a similar pattern of non-isomorphism (e.g. if they have been soaked in similar conditions). In that case it is usually preferable to have one of them as a reference for non-isomorphism to avoid correlation.
To get an idea of which datasets are related, you could calculate R-factors between all datasets (using the same resolution range), e.g. with the CCP4 program SCALEIT. The resulting table of R-factors can point you to clusters of very similar datasets. If you encounter a large cluster with significantly lower R-factors, it is advisable to pick the best dataset out of this cluster as a reference. This kind of analysis is done automatically if you run the autoSHARP procedure - although (for the moment) no automatic choice is made.
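The idea of a pairwise R-factor table can be sketched as follows. This is not what SCALEIT computes internally: it assumes the amplitudes are already on a common scale and uses one common symmetric definition, R = sum|F1 - F2| / sum 0.5 (F1 + F2), over the reflections shared by both datasets. The names r_factor and r_table are hypothetical.

```python
from itertools import combinations


def r_factor(f1, f2):
    """R-factor between two datasets given as dicts {(h, k, l): amplitude}.

    Assumption: amplitudes are already scaled to each other; only
    reflections present in both datasets are used.
    """
    common = f1.keys() & f2.keys()
    numerator = sum(abs(f1[hkl] - f2[hkl]) for hkl in common)
    denominator = sum(0.5 * (f1[hkl] + f2[hkl]) for hkl in common)
    if denominator == 0:
        raise ValueError("no common reflections with non-zero amplitude")
    return numerator / denominator


def r_table(datasets):
    """Return {(name_a, name_b): R} for every pair of named datasets."""
    return {
        (a, b): r_factor(datasets[a], datasets[b])
        for a, b in combinations(sorted(datasets), 2)
    }
```

Pairs with markedly lower values in such a table would indicate a cluster of similar datasets in the sense described above.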
Note : In our SHARP convention of a SIR(AS)/MIR(AS) experiment, the native dataset is also considered a compound. In practical terms, SHARP gives no special status to the native, which is then considered a 'derivative without a heavy atom'.
The first dataset belonging to the first COMPOUND is by convention a reference for scale and lack-of-isomorphism. This means that the scale parameters for the first dataset of the first COMPOUND (the dataset of W-1) should not be refined. However, the scale factor (K) can be marked for estimation, in which case absolute scaling will be performed based on the chemical composition (if any) given on the global page.
This first dataset is also a reference for lack-of-isomorphism (LOI). LOI is in its very nature a relative quantity. A crystal is never 'non-isomorphous' in itself, but relative to another "reference" crystal. The reference for non-isomorphism is usually the native crystal, but it does not need to be. A native crystal can be defined as COMPOUND 2 and have non-isomorphism defined for it with respect to a derivative defined as COMPOUND 1. In a "MAD + native" calculation, one of the MAD wavelengths should always be the reference: otherwise the correlated non-isomorphism of the MAD datasets relative to a native reference dataset is not treated correctly. (See above on how to find the best reference dataset.)
Adding or deleting a C-site
To add a C-site you need to specify a chemical identity for a
corresponding G-site. You can do this for a whole list of sites by
giving a "type" and a starting G-site number. Deleting a
C-site follows the same mechanism as for G-sites: after selecting one
or more C-sites the Delete button will
remove these C-sites from the current compound.
Specifying a C-site
For each C-site entry, a correspondence must be established with one
of the G-sites in the list and with a heavy-atom (to be written in
the small free-format field under the heading Atom).
Note : The first letter of the chemical symbol must be uppercase, the second (and others) lowercase for the symbol to be recognized.
Examples : Br, Hg, Au
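The capitalisation rule can be checked, or applied, with a couple of lines. This sketch only tests the letter case and does not verify that the symbol is a real element; the names is_valid_symbol and normalise_symbol are hypothetical.

```python
import re

# One uppercase letter followed by zero or more lowercase letters.
_SYMBOL = re.compile(r"[A-Z][a-z]*")


def is_valid_symbol(symbol):
    """Return True if symbol follows the required capitalisation."""
    return _SYMBOL.fullmatch(symbol) is not None


def normalise_symbol(symbol):
    """Convert e.g. 'HG' or 'hg' to 'Hg'."""
    return symbol.strip().capitalize()
```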
This is just a way of describing the effect of occupancy and temperature factor on the atom (C-site) positioned at this G-site: it is going to be "tuned" up and down if these parameters vary.
The one thing that can't vary from one crystal to another is the number of chemical sites : therefore, you will not be allowed to create/delete sites at this level (in pathological cases, something similar can be done by setting the occupancy to zero and not refining it).
Specifying a T-site
By definition, each T-site entry corresponds to the C-site with the
same number. It is compulsory to specify an occupancy and a
temperature factor (B-factor). Additionally, if you think it necessary
(usually because you noticed characteristic features in the residual maps) six anisotropic
increments to the isotropic B-factor can be specified in the form of
the unique upper triangle of a symmetric tensor.
The way to refine temperature factors is defined in the column under the heading B refinement. You have a choice between
Setting refinement flags
To reset all occupancies and/or all (isotropic) temperature factors of
all T-sites and automatically set their refinement flags to either
"refine" or "norefine" you just have to use the
Set button. Negative values for occupancy
and/or temperature factor signal to the program to leave these values
as they are.
The minimal assignments that are required are those for FMID (structure factor amplitude) and SMID (estimated standard deviation on the measurement of FMID).
If there is anomalous diffraction present in the data, then FMID = 0.5 * ( F+ + F- ) and you will need to specify three additional columns: DANO, the anomalous difference ( DANO = F+ - F- ); SANO, the estimated standard deviation on the measurement of DANO; and ISYM.
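The two definitions can be written out explicitly. The sketch below is an illustration only: the function name fmid_and_dano is hypothetical, the case where only one member of the pair was measured simply follows the convention DANO = 0 with FMID set to the measured amplitude, and no ISYM value is produced.

```python
def fmid_and_dano(f_plus=None, f_minus=None):
    """Return (FMID, DANO) for one reflection.

    With both members measured: FMID = 0.5 * (F+ + F-), DANO = F+ - F-.
    With only one member measured: FMID is that amplitude and DANO = 0.
    """
    if f_plus is None and f_minus is None:
        raise ValueError("at least one of F+ and F- is required")
    if f_plus is None:
        return f_minus, 0.0
    if f_minus is None:
        return f_plus, 0.0
    return 0.5 * (f_plus + f_minus), f_plus - f_minus
```

For example, F+ = 105 and F- = 95 give FMID = 100 and DANO = 10.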
ISYM is used when only one member of the anomalous pair is measured. In that case DANO = 0 and either FMID = F+ or FMID = F-. Without knowing which, you would be led into a statistical error by assuming that FMID is the average of the two. ISYM provides the necessary information :
Scaling parameters
By convention, the first dataset in the list (first wavelength of
first crystal of first compound) is always taken as the reference for
all other scales - the other scaling factors will then be relative to
this one.
Estimation of reference scaling parameters : pseudo-absolute scale.
Even though the first dataset is taken as a reference, some rough
absolute scaling can be performed on it by activating the estimate button for the multiplier scale factor
(K). The temperature scale factor (B) could be put on absolute scale as well, but
in practice SHARP will not allow
it, since this is equivalent to setting the average temperature factor
of the protein atoms to zero.
Pseudo-absolute scaling will only work if you have indicated the
atomic composition of the asymmetric unit in the general info page,
and if your data extend to a resolution higher than 3.5 Å.
(Advanced users have the possibility to even refine the scale
factor for the reference dataset, although this is not recommended).
Estimation of non-reference scaling parameters : relative scale.
For all other datasets, the estimation of multiplier and temperature
scale factors (relative to those of the reference) is triggered in the
same way, by activating the estimate
buttons for these parameters.
ML refinement of constant and 'temperature' scaling parameters
Starting from the estimated values of these parameters (if the
estimate button was set) or otherwise from the values you
typed in, the scaling parameters will be refined along with all
other requested parameters.
Global non-isomorphism parameter
This parameter describes one component of the lack of isomorphism (the
one that increases with resolution) as a random perturbation of all
atoms, assumed to be uniformly distributed over the whole asymmetric
unit (Luzzati, 1952). When applied to the
anomalous signal its meaning becomes less intuitive: but it can be
useful (as an "extra variance" parameter) to detect and partially
correct for a bad estimation of measurement errors, especially at high
resolution.
Typical values for parameter NISO_BGLO (global non-isomorphism parameter on isomorphous differences) range from 0 to 3. A value of 3 already indicates quite high lack of isomorphism.
The parameter NANO_BGLO (global non-isomorphism parameter on anomalous differences) can safely have larger values (up to 10 and more): but they usually indicate either a deficiency in the estimation of standard deviations for the measured intensities or inaccuracies in the anomalous scattering model.
Model imperfection parameters
This parameter describes the remaining inaccuracies (decreasing
with resolution) that arise from a yet imperfect knowledge of the
heavy-atom substitution. It could also be called "low-resolution
lack-of-isomorphism".
The values of NISO_CLOC (model imperfection parameter on isomorphous differences) and NANO_CLOC (model imperfection parameter on anomalous differences) are expressed as a fraction of the total diffracted intensity and should therefore remain between 0 and 1. Exceptionally high values show that a significant amount of signal is not yet parameterized in the heavy atom model and is considered "noise".
Anomalous scattering properties by atom type
All the atom types specified at the crystal level are by default given
the values of f' and f" (anomalous scattering factors) at the Cu Kα
wavelength for that atom type. You can correct these values if the
wavelength is different and specify whether you want to refine them.
Especially for a MAD experiment, starting values for these parameters are very important : we recommend using values from fluorescence measurements, or at the very least, starting values from the Sasaki tables (Sasaki, 1989), possibly corrected for white lines and other non-calculable edge features. (These step-wise tables are accessible on-line from the BATCH page in the interface). You can also use the CCP4 program CROSSEC to calculate values at exact wavelengths. If no fluorescence data is available, "expert" values (e.g. from beam-line staff) for common elements like Se might even be better than the calculated/tabulated values.
Warning : Refining f', f" and the occupancy of all sites for a given heavy-atom type will lead to a redundant parametrisation - and probably to eigenvalues being filtered in the inversion of the Hessian matrix. It is preferable to avoid this by not refining f' in SIR(AS) or MIR(AS). In MAD you should keep f' (and possibly f" as well) fixed for one of the wavelengths - usually the remote wavelength, where these values are well determined. If your f' and f" values come from a very good fluorescence scan or they belong to a wavelength far away from the edge, it might be best to keep both of them fixed until the later stages of the refinement procedure. The anomalous residual maps are a very good indicator of whether a refinement of f' and/or f" might be necessary.
If you are happy with the parameter values in that file, just press OK to run the calculation, or Save to set up everything and start the calculation later using the 'Restart' facility in the Control Panel. If you see something that needs to be corrected, just press Cancel to close the verification page and modify what needs to be changed. Then Submit again.