SHARP/autoSHARP User Manual
Chapter 4

SHARP Input Preparation Guide

Copyright    © 2001-2006 by Global Phasing Limited
 
  All rights reserved.
 
  This software is proprietary to and embodies the confidential technology of Global Phasing Limited (GPhL). Possession, use, duplication or dissemination of the software is authorised only pursuant to a valid written licence from GPhL.
 
Documentation    (2001-2006)  Clemens Vonrhein
  (1995-1998)  Eric de La Fortelle
 
Contact sharp-develop@GlobalPhasing.com


This document describes the various control statements that SHARP provides. It is best used at the time of job preparation as a reference.



General information about preparing input for SHARP

One major innovation in SHARP is the hierarchical organisation of the collection of parameters (both refinable and non-refinable) that describe : This data structure is organised as a tree, with a root and four successive hierarchical levels. Note : The tree is visualised in the left-hand side of the SHARP input editor and can be edited using the buttons at the top of the window. As you can see from the description above, this data structure enables SHARP to process SAD, SIR(AS), MIR(AS) or MAD data or any mixture of them in a general and flexible way. The graphical user interface provides an easy way to accurately define your particular type of experiment.


Preparing input for refinement jobs

If it is the first time you are using SHARP on this data, filling in this form will require some time and concentration. To simplify the procedure, the following pieces of information should be ready at hand :


Field-by-field description of the input pages

The following is a list of the various fields and options that you can edit to tell SHARP exactly what you want to do. The SHARP input editor contains hyper-links to these items at the appropriate position.


Identification

Project name

A given crystallographic problem will probably need several runs of refinement, possibly some trial-and-error, heavy-atom model updates for minor sites etc. Our convention is to give the same name to all these runs, and increment an ordinal number as an extension to this stem. The stem can be modified in the field following "Project ID". For example, if you choose "Lysozyme-MAD-X31" as your project ID, your successive runs of this project will be named Lysozyme-MAD-X31.1 , Lysozyme-MAD-X31.2 etc.

Please note that not all characters are allowed here: letters, numbers and "-" are ok, but special characters like &, %, @, _, "." etc might cause problems in some of the scripts. You also have to avoid any white-space (" " or tabs).
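The naming convention and the character restriction above can be checked before a job is set up. The following is a minimal sketch, not part of SHARP: the function names `is_safe_project_id` and `run_name` are hypothetical, and the rule is simply the one stated above (letters, numbers and "-" only, no white-space).

```python
import re

# Only letters, digits and "-" are described as safe in the text above.
_PROJECT_ID_RE = re.compile(r"[A-Za-z0-9-]+")


def is_safe_project_id(project_id: str) -> bool:
    """Return True if the project ID uses only letters, digits and '-'."""
    return _PROJECT_ID_RE.fullmatch(project_id) is not None


def run_name(project_id: str, run_number: int) -> str:
    """Build the name of a run as '<project ID>.<ordinal number>'."""
    if not is_safe_project_id(project_id):
        raise ValueError(f"unsafe characters in project ID: {project_id!r}")
    return f"{project_id}.{run_number}"


if __name__ == "__main__":
    print(run_name("Lysozyme-MAD-X31", 1))
    print(is_safe_project_id("Lysozyme_MAD X31"))
```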

Title

The title is there to remind you what is specific to the calculation you are currently undertaking. Keep in mind that the graphical interface makes it very easy to generate many different jobs. You will need the title to sort out which is which. For these reasons, the first few words of the title will appear in the pop-ups alongside with the project ID.


Calculation Options

Outlier rejection using likelihood histogram

(For a "standard" user this option is always on.)

As an extra protection against 'hidden' outliers (i.e. those that will not show up in the compulsory histograms of intensity, isomorphous/anomalous differences etc. before ML refinement), another round of rejections can be added, based on the value of the likelihood for each reflexion. This will always be turned on for users defined as "standard" level in the Preferences settings. If "advanced" (or higher), we recommend not to turn this off except for very special applications, where you are sure there is no outlier in the dataset.

ML parameter refinement

Note : Before refinement, a procedure will automatically try to estimate good starting values for all parameters marked for estimation, to ensure that refinement starts reasonably close to the target values of the parameters.

If the ML Parameter refinement tick box is activated, SHARP will vary all parameters marked for refinement until it reaches a maximum value of the log-likelihood function. The refinement stops when the step in parameter space (in units of standard deviations) is considered sufficiently small.

There is some strategy involved in this procedure, because parameters differ greatly in their impact on the refinement. In simple terms, some parameters have much more influence on the increase of likelihood than others, and should therefore be refined first to smooth out the convergence.

    
FOR SHARP 2.0.0 and higher: You can specify the starting BIG CYCLE number, the final BIG CYCLE number as well as the number of small cycles within each BIG CYCLE.

You can fine-tune the refinement strategy for each BIG CYCLE. The defaults are probably correct for nearly all cases. It is always a good idea to refine the most important parameters first and to add further parameters gradually as the refinement progresses.

(see also description of STRATEGY keyword)

    

FOR SHARP 2.0.0 and higher: Sparseness

It is possible to use a sparse approximation to the full Hessian matrix. The sparseness is defined using a distance cut-off, Sparse_cut, between G-sites.

An element in the Hessian matrix is discarded if it 'belongs' to a pair of atomic variables derived from two different G-sites with a distance above Sparse_cut (symmetry equivalents are taken into account).

Since this is an approximation, convergence to the optimum set of parameters is slower - but for very large structures with many G-sites, the construction of the sparse Hessian is significantly faster. This can give an overall reduction of compute time necessary to achieve convergence.

Tests indicate that the use of a sparse approximation is useful if there are more than 50 G-sites using a Sparse_cut value of 16 Å. It is recommended to run an additional BIG cycle without the sparse approximation to verify that the approximation doesn't lead to a wrong stationary point in parameter space.

(see also description of SPARSE keyword)
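To illustrate the distance criterion, the sketch below decides which pairs of G-sites would keep their cross-terms in the Hessian. It is a simplified illustration, not SHARP's implementation: it assumes Cartesian coordinates in Å and ignores symmetry equivalents (which SHARP does take into account), and the function name `sparse_pair_mask` is hypothetical.

```python
import math
from itertools import combinations


def sparse_pair_mask(cartesian_sites, sparse_cut=16.0):
    """Return the set of G-site index pairs (i, j), i < j, that are kept.

    A pair is discarded when the distance between the two sites is above
    sparse_cut. Simplification: symmetry equivalents are NOT considered here.
    """
    kept = set()
    for (i, site_i), (j, site_j) in combinations(enumerate(cartesian_sites), 2):
        if math.dist(site_i, site_j) <= sparse_cut:
            kept.add((i, j))
    return kept


if __name__ == "__main__":
    sites = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
    # Only sites 0 and 1 are within 16 A of each other.
    print(sparse_pair_mask(sites))
```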

FOR SHARP 2.0.0 and higher: Weeding

Weeding is a simple mechanism to detect possibly incorrect sites in the heavy atom model. Such incorrect sites can cause problems for the optimisation procedure. So if you are not certain that all your sites are correct we strongly recommend using this weeding mechanism. Weeding is done at the beginning of each new BIG cycle.

If a G-site is being weeded it will have:

The method is an eigenvalue analysis of the block-diagonal elements corresponding to the coordinates of the G-sites. Each G-site is given a score based on both the magnitudes and signs of its eigenvalues. The G-sites are then ranked, and the mean and standard uncertainty are computed from the scores of those G-sites whose eigenvalues all have the correct sign.

(see also description of WEED keyword)

Residual (Log-Likelihood Gradient = LLG) maps

Even though the parameters of the current heavy atom model are optimal at convergence, this model can still be incomplete or wrong. The most common instances of this are : To detect these problems a residual map is calculated. It will show positive features where your current heavy atom model is lacking and negative features where it contains wrong information. Therefore, these maps are a very valuable tool for improving your heavy atom model. The highest positive and negative peaks of the residual maps need to be examined carefully to detect and characterise these features. More information can be found in the SHARP output guide.

Centroid electron density map

In case you want to inspect the electron density map at the current stage of the refinement, this option will make SHARP calculate the relevant Fourier coefficients for you. These coefficients are computed according to the time-honoured method of Blow and Crick's "Best Fourier" (Blow & Crick, 1959), extended to two-dimensional centroid structure factors. More information can be found in the SHARP output guide.

A further obvious use of these coefficients is to use them in subsequent phase improvement and interpretation steps (see the phase improvement and interpretation user guide). For this, additional items are calculated per reflection (Hendrickson-Lattman coefficients, figure-of-merit etc).

Note : In cases where you used external phase information during this phase calculation step within SHARP you have to be careful: the coefficients and values calculated here will include this external phase information. This combination might only be adequate, if the external phase information is independent of the heavy atom model you are using for phase calculation.


Datafile, Symmetry and Cell

Datafile

Remember (see here) that all the measurement info has to be included in a multi-column MTZ file. If it is present in the datafiles directory , if it is readable by the 'sharp' account, and if it has the right extension (.data.mtz), it will be listed in this menu.
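The three conditions above (location, readability, extension) can be checked from the command line before opening the interface. This is a minimal sketch under stated assumptions: the function name `list_candidate_datafiles` is hypothetical, the directory path must be supplied by you, and `os.access` tests readability for the account running the script, which is only meaningful if that is the 'sharp' account.

```python
import os
from pathlib import Path


def list_candidate_datafiles(datafiles_dir):
    """List files in datafiles_dir that end in .data.mtz and are readable."""
    directory = Path(datafiles_dir)
    return sorted(
        path.name
        for path in directory.iterdir()
        if path.is_file()
        and path.name.endswith(".data.mtz")
        and os.access(path, os.R_OK)  # readable by the current account
    )


if __name__ == "__main__":
    # "." is a placeholder; use the datafiles directory of your installation.
    for name in list_candidate_datafiles("."):
        print(name)
```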

Space group

The space group will be automatically picked up when you change the selected data-file. Make sure that this is the correct space group (screw axes included etc). Also check, that you are using normal settings, e.g.

Cell

Cell dimensions in Angströms, angles in degrees. You should pick them up from the MTZ file. These are the cell parameters for the reference dataset.

Important note : The cell dimensions that SHARP will use are those of the MTZ file (the program will stop in case it detects an inconsistency). But beware: these cell dimensions should be those of the REFERENCE dataset. The usual convention is to have the cell dimensions of the native dataset as cell dimensions for the global MTZ file. This is fine in the general case where the native is the reference, but if for some reason (correlated non-isomorphism) you want to have another dataset as the reference (first wavelength of COMPOUND-1), then you must accordingly change the cell parameters in the global MTZ file.


Other information

Chemical composition of the asymmetric unit

The atomic composition of the asymmetric unit is only used in case you ask for an estimation of the absolute scale of the data (done by estimating the scale factor for the reference wavelength). It is therefore not compulsory, but finding occupancies in the [0,1] range (as should be the case if the data is on near-absolute scale) can be a welcome sanity check for the behaviour of the refinement.

Note : Only light atoms (excluding hydrogens) should be specified here (i.e. specify all non-hydrogen atoms that are not part of the heavy-atom model and therefore invisible to any source of phase information). Alternatively, you can specify the number of amino acid residues in the asymmetric unit and an approximate atomic composition will be calculated - or vice versa. This is a good check of whether your input values here are reasonable.

Example and syntax :

C 1250
O 5234
N 2200
S    5
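
The syntax shown above (one element symbol and one atom count per line) is simple enough to check with a few lines of code. The sketch below is an illustration only: `parse_composition` is a hypothetical helper, and it assumes that blank lines may be skipped and that every other line has exactly two fields.

```python
def parse_composition(text):
    """Parse 'element count' lines into a dict, e.g. {'C': 1250, ...}."""
    composition = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 2:
            raise ValueError(f"expected 'element count', got: {line!r}")
        element, count = fields
        composition[element] = int(count)
    return composition


if __name__ == "__main__":
    example = "C 1250\nO 5234\nN 2200\nS    5\n"
    print(parse_composition(example))
```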

External phase information (Advanced users only)

If you have phase information from other sources (Molecular Replacement (MR) model, known fragment, non-isomorphous derivative ...) and you want to use it to help in all SHARP calculations, you should have this information encoded in Hendrickson-Lattman coefficients (compulsory column names : HLA HLB HLC HLD), added to your existing *.data.mtz MTZ file , and deposit this MTZ file in the datafiles directory of your SHARP installation, with extension .data.mtz

If, in addition, this file is group-readable, it will appear in the menu of datafiles you can use. Selecting it will trigger additional control options to appear at the bottom of the page.

Where do these Hendrickson-Lattman coefficients come from ?

If the information comes from an independent heavy-atom phasing experiment, these coefficients are usually output by the phasing programs. In any case, you should re-refine and phase this other derivative using SHARP.

In case you have a known fragment, or an imperfect MR, there is no established way of producing these coefficients. The CCP4 program SIGMAA (Read, 1986) calculates these coefficients, but until recently did not output them. In recent versions, it may be possible to make SIGMAA write them out. If you're interested in a simple program that uses a phase and its associated weight (e.g. from SIGMAA), you can use the mkhl binary that comes with the distribution.

If you place your known fragment or model as model_*.pdb into the datafiles directory of your SHARP installation, you can use a utility in the Phase Improvement and Interpretation Panel to calculate these coefficients (and produce a correct MTZ file). This might be the easiest way.

Warning : These coefficients may be biased, especially in the case of a Molecular Replacement solution. Do not take them at face value, especially at high resolution. You could try to blur them with a factor that increases exponentially with resolution, or cut off the high resolution. A rational approach to this problem, involving maximum-likelihood refinement of an imperfection parameter based on the Luzzati model (Luzzati, 1952) has been developed by Gérard Bricogne in BUSTER, but it is not yet distributed.

Are there alternative ways of incorporating additional information ?

Yes: in the Phase Improvement and Interpretation Panel you can use an existing (initial) model within the general density improvement procedure.


Geometric Site Editor (list of sites)

At the top of the hierarchical data organisation in SHARP stands a list of positions for ALL heavy-atoms in ALL compounds (see introduction). Because we only specify at this stage the three coordinates of a point in real-space, this piece of information will be called a Geometric Site (G-Site in short).

Adding one or more G-sites

Clicking on Create after having specified how many new sites you wanted to create, will add that number of extra lines in the G-site table below. You then have to fill the newly-created empty fields with the appropriate fractional coordinate.

The other way to create a large number of G-Sites, is to press the Add button after having specified a name of 'coordinate file' in the following menu. A 'coordinate file' is an ASCII file located in the datafiles directory, with extension (.hatom), and containing information that follows the syntax :

ATOM Se 0.0903  0.3885  0.1297
The first two fields must be present and separated by spaces, but are not interpreted. Then there are three fractional coordinates x, y and z, in free format, separated by spaces. The rest of the line is ignored.
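
The line format described above can be read with a short parser. This is a sketch rather than the reader SHARP uses: the names `parse_hatom_line` and `parse_hatom` are hypothetical, and skipping blank lines is an assumption.

```python
def parse_hatom_line(line):
    """Return (x, y, z) fractional coordinates from one coordinate-file line.

    The first two fields must be present but are not interpreted; anything
    after the three coordinates is ignored.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected two labels and three coordinates: {line!r}")
    x, y, z = (float(value) for value in fields[2:5])
    return (x, y, z)


def parse_hatom(text):
    """Parse all non-blank lines of a coordinate file (blank lines skipped)."""
    return [parse_hatom_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(parse_hatom("ATOM Se 0.0903  0.3885  0.1297\n"))
```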

Inverting hand

In most cases of heavy atom refinement and phasing it might not be clear if the correct hand is used. The Invert button enables the user to invert the handedness of all G-sites at once. The program will automatically switch to the enantiomorph space group if necessary.

Inverting the hand of heavy atom positions usually involves inversion through the origin (0,0,0) - only in I41 (1/2,0,0), I4122 (1/2,0,1/4) and F4132 (1/4,1/4,1/4) the centre of inversion is different.
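
As an illustration of the coordinate operation only, inversion of a fractional coordinate x through a centre c gives 2c - x, which reduces to -x for the origin. The sketch below is not the code behind the Invert button: the function `invert_hand` and its `centre` argument are hypothetical, the centre to use for a given space group must be taken from the text above, and the switch to the enantiomorph space group is not handled.

```python
def invert_hand(sites, centre=(0.0, 0.0, 0.0)):
    """Invert fractional coordinates through the given centre: x' = 2c - x."""
    return [
        tuple(2.0 * c - x for x, c in zip(site, centre))
        for site in sites
    ]


if __name__ == "__main__":
    sites = [(0.25, 0.5, 0.125)]
    inverted = invert_hand(sites)
    print(inverted)
    # Inverting twice gives back the original coordinates.
    print(invert_hand(inverted))
```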

Setting refinement flags

To reset all coordinates of all G-sites automatically to either "refine" or "norefine" you just have to use the Set button. Be aware that no check for space groups with polar axes will be done (i.e. where the origin is not defined along one or more axes and you might need to fix one or more of your coordinates). These include

space group                     polar axes
P1                              x, y, z
P2, P21, C2                     y
P4, P41, P42, P43, I4, I41      z
P3, P31, P32, R3, H3            z
P6, P61, P62, P63, P64, P65     z
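
Because the Set button does not perform this check, it can be useful to look up the polar axes yourself before marking coordinates for refinement. The sketch below only encodes the table above; `POLAR_AXES` and `polar_axes` are hypothetical names, and an empty result means "not in this table", not necessarily "no polar axis".

```python
# Polar axes by space group, transcribed from the table above.
_TABLE = [
    (("P1",), ("x", "y", "z")),
    (("P2", "P21", "C2"), ("y",)),
    (("P4", "P41", "P42", "P43", "I4", "I41"), ("z",)),
    (("P3", "P31", "P32", "R3", "H3"), ("z",)),
    (("P6", "P61", "P62", "P63", "P64", "P65"), ("z",)),
]

POLAR_AXES = {group: axes for groups, axes in _TABLE for group in groups}


def polar_axes(space_group):
    """Return the polar axes listed for this space group, or () if not listed."""
    return POLAR_AXES.get(space_group, ())


if __name__ == "__main__":
    for name in ("P21", "P41", "P212121"):
        print(name, polar_axes(name))
```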

Deleting G-Sites

You can also delete G-Sites by first clicking a number of Delete ? tick boxes and then pressing the Delete button. But beware: all C-sites that are placed at this geometrical position will be removed as well.

This option is only present when the details of all G-sites are visible. You can check how many C-Sites have been assigned so far to a given G-Site in the 'Usage' column of the G-Site table.


What is a reference dataset ?

Definition

The first wavelength of the first crystal belonging to the first compound is by convention a reference for scale and lack-of-isomorphism. This means that the scale parameters for the first dataset of the first compound (the dataset of W-1) will not be refined. The SCAL_K (constant scaling factor) parameter can be marked for estimation, in which case absolute scaling will be performed based on the chemical composition (if any) given in the Global page.

This first dataset is also a reference for lack-of-isomorphism (LOI). LOI is in its very nature a relative quantity. A crystal is never 'non-isomorphous' in itself, but relative to another crystal, which is taken as a reference. The reference for non-isomorphism is usually the native crystal, but it does not need to be. A native crystal can be defined as second compound (C-2) and have non-isomorphism defined for it with respect to a derivative defined as first compound (C-1).

How to choose the correct reference ?

In general, the reference dataset should be the one with the highest information content, i.e. most complete and diffracting to the highest resolution. However, it can happen that two derivatives share a similar pattern of non-isomorphism (eg if they have been soaked in similar conditions). In that case it is usually preferable to have one of them as a reference for non-isomorphism to avoid correlation.

To get an idea of related datasets you could calculate e.g. R-factors between all datasets (using the same resolution range). This can be done e.g. with the CCP4 program SCALEIT. The resulting table of R-factors can point you to clusters of very similar datasets. If you encounter a large cluster with significantly lower R-factors it is advisable to pick the best dataset out of this cluster as a reference. This kind of analysis is done automatically if you run the autoSHARP procedure - although (for the moment) no automatic choice is made.
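
The pairwise comparison described above can be sketched as follows. This is not the SCALEIT calculation: it uses one common definition, R = sum|Fa - Fb| / sum(0.5 * (Fa + Fb)), and assumes that the amplitudes are already on a common scale and that each list holds the same reflections in the same order. The function names are hypothetical.

```python
def r_factor(f_a, f_b):
    """R-factor between two amplitude lists (same reflections, same order)."""
    if len(f_a) != len(f_b):
        raise ValueError("amplitude lists must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(f_a, f_b))
    denominator = sum(0.5 * (a + b) for a, b in zip(f_a, f_b))
    return numerator / denominator


def r_factor_table(datasets):
    """Return {(name_a, name_b): R} for every pair of named datasets."""
    names = sorted(datasets)
    return {
        (name_a, name_b): r_factor(datasets[name_a], datasets[name_b])
        for index, name_a in enumerate(names)
        for name_b in names[index + 1:]
    }


if __name__ == "__main__":
    data = {
        "native": [100.0, 200.0, 150.0],
        "deriv1": [110.0, 190.0, 160.0],
        "deriv2": [111.0, 189.0, 161.0],
    }
    for pair, value in sorted(r_factor_table(data).items()):
        print(pair, round(value, 4))
```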


C-Site Editor
(Compound level)

Once the list of G-Sites has been established, for each compound we will need to assign a chemical identity to the relevant subset of sites in that list. These sites then become, in our terminology, 'Chemical Sites' (C-Sites in short).

Note : In our SHARP convention of a SIR(AS)/MIR(AS) experiment, the native dataset is also considered a compound. In practical terms, SHARP gives no special status to the native, which is then considered a 'derivative without a heavy atom'.

The first dataset belonging to the first COMPOUND is by convention a reference for scale and lack-of-isomorphism. This means that the scale parameters for the first dataset of the first COMPOUND (the dataset of W-1) should not be refined. However, the scale factor (K) can be marked for estimation, in which case absolute scaling will be performed based on the chemical composition (if any) given on the global page.

This first dataset is also a reference for lack-of-isomorphism (LOI). LOI is in its very nature a relative quantity. A crystal is never 'non-isomorphous' in itself, but relative to another "reference" crystal. The reference for non-isomorphism is usually the native crystal, but it does not need to be. A native crystal can be defined as COMPOUND 2 and have non-isomorphism defined for it with respect to a derivative defined as COMPOUND 1. In a "MAD + native" calculation, one of the MAD wavelengths should always be the reference: otherwise the correlated non-isomorphisms of the MAD datasets relative to a native reference dataset are not treated correctly. (See above on how to find the best reference dataset.)

Adding or deleting a C-site

To add a C-site you need to specify a chemical identity for a corresponding G-site. You can do this for a whole list of sites by giving a "type" and a starting G-site number. Deleting a C-site follows the same mechanism as for G-sites: after selecting one or more C-sites the Delete button will remove these C-sites from the current compound.

Specifying a C-site

For each C-site entry, a correspondence must be established with one of the G-sites in the list and with a heavy-atom (to be written in the small free-format field under the heading Atom).

Note : The first letter of the chemical symbol must be uppercase, the second (and others) lowercase for the symbol to be recognised.

Examples : Br, Hg, Au

Note: to specify the C-site as a spherical cluster the Tag field should contain the cluster name and atom type separated by a ":". When adding C-sites from scratch specify the full tag-name:atom-type in the "type" field.

Examples of atom type and spherical cluster name pairs : (Ta and Ta6Br12:Ta), (W and W18:W)
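
The two conventions above (capitalisation of the chemical symbol, and "cluster:atom" for spherical clusters) can be checked with a small helper. This is a sketch only: `parse_site_type` is a hypothetical name, and the check covers the capitalisation rule, not whether the symbol is a real element known to SHARP.

```python
import re

# First letter uppercase, any following letters lowercase (e.g. Br, Hg, W).
_SYMBOL_RE = re.compile(r"[A-Z][a-z]*")


def parse_site_type(site_type):
    """Split 'Ta6Br12:Ta' into ('Ta6Br12', 'Ta'); plain 'Hg' gives (None, 'Hg')."""
    cluster, separator, atom = site_type.rpartition(":")
    if separator and not cluster:
        raise ValueError(f"missing cluster name before ':' in {site_type!r}")
    if _SYMBOL_RE.fullmatch(atom) is None:
        raise ValueError(f"chemical symbol not recognised: {atom!r}")
    return (cluster if separator else None, atom)


if __name__ == "__main__":
    for text in ("Br", "Ta6Br12:Ta", "W18:W"):
        print(text, parse_site_type(text))
```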


T-Site Editor
(Crystal level)

Once the chemical identity of the sites is established (at the compound level), we also need to define their occupancies and temperature factors to make them "tunable" sites (T-Sites). This is done at the crystal level: because these parameters can vary - for the same compound - from one physical crystal to another.

What are "tunable" sites ?

This is just a way of describing the effect of occupancy and temperature factor on the atom (C-site) positioned at this G-site: it is going to be "tuned" up and down if these parameters vary.

The one thing that can't vary from one crystal to another is the number of chemical sites : therefore, you will not be allowed to create/delete sites at this level (in pathological cases, something similar can be done by setting the occupancy to zero and not refining it).

Specifying a T-site

By definition, each T-site entry corresponds to the C-site with the same number. It is compulsory to specify an occupancy and a temperature factor (B-factor). Additionally, if you think it necessary (usually because you noticed characteristic features in the residual maps) six anisotropic increments to the isotropic B-factor can be specified in the form of the unique upper triangle of a symmetric tensor.
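
To make "the unique upper triangle of a symmetric tensor" concrete, the sketch below expands six values into the full 3x3 tensor. The ordering used here (row by row: 11, 12, 13, 22, 23, 33) is an assumption made for illustration; check the order expected by the input editor before entering values. The function name `symmetric_tensor` is hypothetical.

```python
def symmetric_tensor(upper_triangle):
    """Expand six upper-triangle values into a symmetric 3x3 tensor.

    Assumed order (illustration only): b11, b12, b13, b22, b23, b33.
    """
    b11, b12, b13, b22, b23, b33 = upper_triangle
    return [
        [b11, b12, b13],
        [b12, b22, b23],
        [b13, b23, b33],
    ]


if __name__ == "__main__":
    for row in symmetric_tensor([1.0, 0.2, 0.0, 1.5, -0.1, 0.8]):
        print(row)
```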

The way to refine temperature factors is defined in the column under the heading B refinement. You have a choice between

Setting refinement flags

To reset all occupancies and/or all (isotropic) temperature factors of all T-sites and automatically set their refinement flags to either "refine" or "norefine" you just have to use the Set button. Negative values for occupancy and/or temperature factor signal to the program to leave these values as they are.


Wavelength level editor

At present, this page is only used to specify inner and outer resolution limits for a wavelength (i.e. all data of a given crystal that have been recorded at the same X-ray wavelength). This level nevertheless is important for generality, since different wavelengths have to be defined in a MAD experiment.


Batch level editor

What we call a batch is a 'time slice' in your crystallographic data collection experiment. Within this batch all parameters are assumed to be constant. It can also be defined in a more operational manner as "a set of images that merge well with each other" ! However, defining different time slices for your collected data is hardly ever used at the moment. The batch level is the lowest hierarchical level in our data structure.

Column assignment

Because the information in the crystallographic data file (MTZ format) is column-driven, it is necessary to give the program for each batch the name of the columns that correspond to the information it needs.

The minimal assignments that are required are those for FMID (structure factor amplitude) and SMID (estimated standard deviation on the measurement of FMID).

If there is anomalous diffraction present in the data, then FMID = 0.5 * ( F+ + F- ) and you will need to specify three additional columns: DANO, the anomalous difference ( DANO = F+ - F- ); SANO, the estimated standard deviation on the measurement of DANO; and ISYM, described below.

ISYM is used when only one member of the anomalous pair is measured. In that case DANO = 0. , and FMID = F+ OR FMID = F- . But you do not know which, so that you are led to making a statistical error by assuming FMID is the average of the two. ISYM provides the necessary information :

The ISYM column is produced by CCP4 program truncate, for each dataset for which anomalous pairs have been recorded: you just need to include it into your merged data-file.
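
The definitions of FMID and DANO given above can be written out as a short function. This is an illustration of the formulas only, not a replacement for truncate: the function name is hypothetical, `None` stands for a missing member of the anomalous pair, and the ISYM value and the standard deviations are deliberately not computed here.

```python
def mean_and_anomalous_difference(f_plus, f_minus):
    """Return (FMID, DANO) from F+ and F-; use None for a missing measurement.

    Both measured:   FMID = 0.5 * (F+ + F-), DANO = F+ - F-.
    One measured:    FMID is the measured value and DANO = 0.
    """
    if f_plus is not None and f_minus is not None:
        return 0.5 * (f_plus + f_minus), f_plus - f_minus
    measured = f_plus if f_plus is not None else f_minus
    if measured is None:
        raise ValueError("at least one of F+ and F- must be measured")
    return measured, 0.0


if __name__ == "__main__":
    print(mean_and_anomalous_difference(100.0, 96.0))
    print(mean_and_anomalous_difference(100.0, None))
```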

Scaling parameters

By convention, the first dataset in the list (first wavelength of first crystal of first compound) is always taken as the reference for all other scales - the other scaling factors will then be relative to this one.

Estimation of reference scaling parameters : pseudo-absolute scale.

Even though the first dataset is taken as a reference, some rough absolute scaling can be performed on it by activating the estimate button for the multiplier scale factor (K). The temperature scale factor (B) could be put on absolute scale as well, but in practice SHARP will not allow it, since this is equivalent to setting the average temperature factor of the protein atoms to zero.

Pseudo-absolute scaling will only work if you have indicated the atomic composition of the asymmetric unit in the general info page, and if your data extends to a resolution higher than 3.5Å. (Advanced users have the possibility to even refine the scale factor for the reference dataset, although this is not recommended).

Estimation of non-reference scaling parameters : relative scale.

For all other datasets, the estimation of multiplier and temperature scale factors (relative to those of the reference) is triggered in the same way, by activating the estimate buttons for these parameters.

ML refinement of constant and 'temperature' scaling parameters

Starting from the estimated values of these parameters (if the estimate button was set) or otherwise from the values you typed in, the scaling parameters will be refined along with all other requested parameters.

Global non-isomorphism parameter

This parameter describes one component of the lack of isomorphism (the one that increases with resolution) as a random perturbation of all atoms, assumed to be uniformly distributed over the whole asymmetric unit (Luzzati, 1952). When applied to the anomalous signal its meaning becomes less intuitive: but it can be useful (as an "extra variance" parameter) to detect and partially correct for a bad estimation of measurement errors, especially at high resolution.

Typical values for parameter NISO_BGLO (global non-isomorphism parameter on isomorphous differences) range from 0 to 3. A value of 3 already indicates quite high lack of isomorphism.

The parameter NANO_BGLO (global non-isomorphism parameter on anomalous differences) can safely have larger values (up to 10 and more): but they usually indicate either a deficiency in the estimation of standard deviations for the measured intensities or inaccuracies in the anomalous scattering model.

Model imperfection parameters

This parameter describes the remaining inaccuracies (decreasing with resolution) that arise from a yet imperfect knowledge of the heavy-atom substitution. It could also be called "low-resolution lack-of-isomorphism".

The values of NISO_CLOC (model imperfection parameter on isomorphous differences) and NANO_CLOC (model imperfection parameter on anomalous differences) are expressed as a fraction of the total diffracted intensity and should therefore remain between 0 and 1. Exceptionally high values show that a significant amount of signal is not yet parametrised in the heavy atom model and is therefore treated as "noise".
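
The typical ranges quoted in this section can be turned into a quick sanity check on refined values. The sketch below is not part of SHARP: `check_nonisomorphism` is a hypothetical helper, the parameter values must be copied from your own output, and the thresholds are simply those stated above (NISO_BGLO around 3 is already high; NISO_CLOC and NANO_CLOC should stay between 0 and 1). NANO_BGLO is not flagged, since larger values are described as safe.

```python
def check_nonisomorphism(params):
    """Return a list of warnings for values outside the ranges quoted above."""
    warnings = []
    niso_bglo = params.get("NISO_BGLO")
    if niso_bglo is not None and niso_bglo >= 3.0:
        warnings.append(f"NISO_BGLO = {niso_bglo}: quite high lack of isomorphism")
    for name in ("NISO_CLOC", "NANO_CLOC"):
        value = params.get(name)
        if value is not None and not 0.0 <= value <= 1.0:
            warnings.append(f"{name} = {value}: expected a value between 0 and 1")
    return warnings


if __name__ == "__main__":
    refined = {"NISO_BGLO": 3.5, "NANO_BGLO": 8.0, "NISO_CLOC": 0.2}
    for message in check_nonisomorphism(refined):
        print(message)
```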

Anomalous scattering properties by atom type

All the atom types specified at the crystal level are by default given the values of f' and f" (anomalous scattering factors) at CuK(alpha) wavelength for this atomic type. You can correct these values if the wavelength is different and specify whether you want to refine them.

Especially for a MAD experiment, starting values for these parameters are very important : we recommend using values from fluorescence measurements, or at the very least, starting values from the Sasaki tables (Sasaki, 1989), possibly corrected for white lines and other non-calculable edge features. (These step-wise tables are accessible on-line from the BATCH page in the interface). You can also use the CCP4 program CROSSEC to calculate values at exact wavelengths. If no fluorescence data is available, "expert" values (e.g. from beam-line staff) for common elements like Se might even be better than the calculated/tabulated values.

Warning : Refining f', f" and the occupancy of all sites for a given heavy-atom type will lead to a redundant parametrisation - and probably to eigenvalues being filtered in the inversion of the Hessian matrix. It is preferable to avoid this by not refining f' in SIR(AS) or MIR(AS). In MAD you should keep f' (and possibly f" as well) fixed for one of the wavelengths - usually the remote wavelength, where these values are well determined. If your f' and f" values come from a very good fluorescence scan or they belong to a wavelength far away from the edge it might be best to keep both of them fixed until the later stages of the refinement procedure. The anomalous residual maps are a very good indicator to see if a refinement of f' and/or f" might be necessary.


Checking the calculation

Once you are satisfied with all your parameters at all levels of the hierarchy, you can press Submit. You will be presented with a verification page: this is the ASCII file that will be read by SHARP when it starts running the calculation.

If you are happy with the parameter values in that file, just press OK to run the calculation, or Save to set up everything and start the calculation later using the 'Restart' facility in the Control Panel. If you see something that needs to be corrected, just press Cancel to close the verification page, and modify what needs to be. Then Submit again.


Last modified: Mon Sep 26 12:39:39 BST 2011