SHARP/autoSHARP User Manual, Chapter 5

# SHARP Output Guide

Copyright © 2001-2006 by Global Phasing Limited. All rights reserved. This software is proprietary to and embodies the confidential technology of Global Phasing Limited (GPhL). Possession, use, duplication or dissemination of the software is authorised only pursuant to a valid written licence from GPhL.

Documentation: Clemens Vonrhein (2001-2006), Eric de La Fortelle (1995-1997).

Contact: sharp-develop@GlobalPhasing.com

This document describes the output of the SHARP program and how to interpret it. Our convention in the main output generated by SHARP is that, wherever we think some explanation may be necessary, a hyper-link "explanation" leads directly to the relevant information.

## Introduction

The SHARP output is divided into four sections :
• Preparation of data

This section is present in all types of calculations. The crystallographic data are read, filtered, analysed statistically and, optionally, put on scale by a robust estimation procedure.

• ML Refinement

All parameters that have been marked for refinement in the input will be allowed to vary in order to maximise the log-likelihood function. This refinement follows a Newton method with full second-order derivative information. The Hessian matrix (the normal matrix of the maximisation) is filtered of its positive eigenvalues, so that the refinement only proceeds in the parameter subspace where the function is concave. The Hessian matrix is also used to calculate standard deviations for all parameters. This in turn provides an objective way to measure the size of each step (parameter update) in parameter space, in units of standard deviations. Convergence is achieved when this step becomes small enough.

• Residual maps

Once convergence is achieved for the parameters of the current model, it is possible to detect unaccounted-for substitution features using the various kinds of residual maps provided. The Fourier coefficients for these maps are the components of the gradient of the log-likelihood function. This means that after a Fourier Synthesis positive peaks will appear where the data "expects" more heavy-atom density and negative peaks where the data "wants" less heavy-atom density.

• Electron-density maps

Once the substitution model is satisfactory (no significant peaks left in the residual maps), it is time to inspect the electron-density maps, post-process them using density modification algorithms and try to interpret them in terms of molecular structure. Fourier coefficients calculated by SHARP are an extension of the "best phases" in the sense of Blow and Crick (Blow & Crick, 1959): instead of being one-dimensional phase centroids they are two-dimensional centroids of the probability distribution of the complex native structure factor.

Note : The phase information output by SHARP is not limited to the centroid structure factor that is used to calculate the electron-density map. A complete summary of the phase probability distribution is available via Hendrickson-Lattman (HL) coefficients. In many cases, the quality of phasing that SHARP provides may not be noticeable at the level of centroid maps, but only after the HL coefficients have been used in a density modification procedure.
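To make the HL representation concrete, here is a minimal numerical sketch (not SHARP code; the function name and the use of NumPy are our own) of how the phase probability distribution implied by the four coefficients A, B, C, D yields a figure of merit:

```python
import numpy as np

def fom_from_hl(A, B, C, D, n=3600):
    """Figure of merit implied by Hendrickson-Lattman coefficients.

    P(phi) ~ exp(A cos phi + B sin phi + C cos 2phi + D sin 2phi);
    the FOM is the modulus of the centroid of this distribution on
    the unit circle.
    """
    phi = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    logp = (A * np.cos(phi) + B * np.sin(phi)
            + C * np.cos(2 * phi) + D * np.sin(2 * phi))
    p = np.exp(logp - logp.max())   # subtract max to avoid overflow
    p /= p.sum()                    # normalise to a probability distribution
    centroid = np.sum(p * np.exp(1j * phi))
    return np.abs(centroid)
```

Note how a sharp unimodal distribution (large A) gives a FOM near 1, while an equally sharp bimodal one (large C) gives a FOM near 0 - which is exactly why centroid-level statistics can understate the quality of the phase information carried by the HL coefficients.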

Depending on the calculation options chosen, not all of these sections will be present.

### Reading the input parameter file

Before the calculation starts, SHARP has to read the parameter file that has been prepared through the graphical interface. In order to check that the most basic pieces of information have been properly understood and processed by the program, four self-explanatory hyper-links are presented.

#### Cell information

Shows the values of the cell parameters as understood by SHARP, and standard operations performed on them, to calculate :
• reciprocal cell parameters
• the volume of the unit cell
• the Brookhaven matrix (which, multiplied by a vector of fractional coordinates, returns the vector of coordinates in Å), and the inverse Brookhaven matrix (which performs the inverse operation).
Make sure that this is the cell of the reference dataset.
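As an illustration (a NumPy sketch, not the routine SHARP uses; the function name is ours), the Brookhaven matrix for the standard convention with a along x and b in the x-y plane can be built from the cell parameters as follows; its inverse performs the reverse operation and its determinant is the unit-cell volume:

```python
import numpy as np

def brookhaven_matrix(a, b, c, alpha, beta, gamma):
    """Fractional -> orthogonal (Angstrom) matrix, Brookhaven/PDB
    convention (a along x, b in the x-y plane).  Angles in degrees."""
    al, be, ga = np.radians([alpha, beta, gamma])
    cx = c * np.cos(be)
    cy = c * (np.cos(al) - np.cos(be) * np.cos(ga)) / np.sin(ga)
    cz = np.sqrt(c * c - cx * cx - cy * cy)
    return np.array([[a,   b * np.cos(ga), cx],
                     [0.0, b * np.sin(ga), cy],
                     [0.0, 0.0,            cz]])
```

For an orthorhombic cell the matrix is simply diagonal in a, b, c, so fractional coordinate 0.5 along a 10 Å axis maps to 5 Å, as expected.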

#### Symmetry information

Presents the name of the space group as understood by SHARP and the results of group analysis operations performed on the group operators. Since the symmetry operators are fetched from a library file you might want to check that the correct operators are present.

#### Atomic scattering information

Shows the composition of the asymmetric unit as understood by SHARP and the coefficients that it will use to calculate resolution-dependent scattering factors for these atoms.

#### What SHARP understood of the input parameter file

If you want to check that the rest of the parameters have been read properly, this link takes you to a copy of the ASCII input file - written by SHARP after parsing it.

## Preparation of data

### Changing the resolution bins

In order to make significant statistical calculations in all the resolution bins, it is necessary to ensure that all these bins are properly populated, i.e. that they contain at least 100 reflexions each. If this is not the case, SHARP outputs the message "Bin population problem". To get over this, we first suppose that the user has not given resolution limits that correspond to the actual limits of the data. The limits are then re-set to the precise boundaries of accepted reflexions. If this is not enough to overcome the bin population problem, the only remaining remedy is to reduce the number of bins iteratively until the above criterion is satisfied.

Advanced users can specify the exact number of bins during preparation of the SHARP input file. A maximum of 20 bins is possible.

### Accepted reflexions

All reflexions in the file are read. Those that have at least two measurements within the resolution limits that you indicated are retained for further use. A link to details why some reflexions might have been rejected is provided.

### Initial outlier rejection

This is a simple protection against the most obvious outliers. Histograms are calculated on quantities that do not require knowledge of the heavy-atom substitution (amplitudes, isomorphous differences, anomalous differences if present). The "tail" of these distributions is iteratively cut at 5 standard deviations (iterations are needed because the action of cutting the tail reduces the standard deviation).
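The iterative tail-cutting can be sketched as follows (an illustration in NumPy, not SHARP's actual implementation; the function name is ours):

```python
import numpy as np

def clip_tails(x, nsigma=5.0, max_iter=20):
    """Iteratively reject values more than nsigma standard deviations
    from the mean.  Iteration is needed because each rejection shrinks
    the standard deviation, which may expose further outliers."""
    x = np.asarray(x, dtype=float)
    keep = np.ones(x.size, dtype=bool)
    for _ in range(max_iter):
        m, s = x[keep].mean(), x[keep].std()
        new_keep = np.abs(x - m) <= nsigma * s
        if new_keep.sum() == keep.sum():   # converged: no change
            break
        keep = new_keep
    return keep
```

A single gross outlier inflates the initial standard deviation; the first pass removes it, and subsequent passes operate on the now much tighter distribution.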

### Computing statistics

Once the most visible outliers are out of the way, SHARP performs several passes through the data in order to collect statistical quantities that will be used in different areas of the program. These quantities are :
• the mean squared amplitudes for each batch
• the mean squared isomorphous differences for each batch
• the mean squared anomalous differences for each batch
• the mean squared native amplitudes
Each of these is presented as a function of resolution (in resolution bins).

Note 1 : The statistics of mean squared isomorphous - or in the case of MAD dispersive - and anomalous differences as a function of resolution (as plotted in this section in a log-graph) are a very good indicator of the quality of your derivative. The plot should more or less follow a straight line. It has repeatedly happened that high-resolution data deemed worthless by other criteria have been very useful for phasing. As a general rule DO NOT DISCARD DATA WITH LOW PHASING POWER. These data are extremely useful for refining scale factors, heavy-atom temperature factors and other parameters. In addition, their full potential is only expressed in the density modification step, where high-resolution reflexions with a FOM of 0.1 or 0.2 on average are crucial to obtain the best possible map as an outcome. You are strongly advised to judge for yourself in the tutorials.

Note 2 : When there is no native dataset (e.g. in MAD), the mean squared native amplitudes will be approximated as those of the reference dataset until big cycle 3. After that they are more accurately estimated by subtracting the heavy-atom contribution from the statistics of the reference dataset. This estimation will be done again at the start of each new big cycle, because its accuracy depends on the exactness of the substitution model.

### Estimation of the absolute scale

If selected, SHARP will estimate a pseudo-absolute scale for the first dataset based on the atomic composition of the asymmetric unit. The only purpose of this is to provide an extra check on the chemical reasonableness of the heavy-atom occupancies. The refinement can proceed unharmed on any other scale, provided the starting occupancies are not too far from the "scaled" occupancies.

### Estimation of the relative scale

Scaling of all datasets relative to the reference is extremely important to start the refinement under good conditions. The estimate provided by this robust algorithm is usually precise - so it should be used whenever you are not absolutely sure of the scaling of one dataset to another. Obviously, after this has been done once, it needn't be repeated in subsequent SHARP refinements for the same dataset.

The Wilson-plot is there to show you how well the logarithm of the relative scale fits a straight line (as a function of resolution). Any significant departure from the line is an alarming symptom and that dataset should be revisited. If all Wilson-plots show a comparable pattern of misbehaviour, the reference dataset itself may be questionable.
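The straight-line check described above can be sketched as follows (a NumPy illustration with a hypothetical function name; SHARP's own scaling is more elaborate). The intercept gives the log relative scale, the slope the relative temperature factor, and the residuals show how far each resolution bin departs from the line:

```python
import numpy as np

def relative_wilson(inv_d2, mean_f2_ref, mean_f2_set):
    """Relative Wilson plot: fit ln(<F_set^2>/<F_ref^2>) as a straight
    line in 1/d^2 and report the per-bin residuals."""
    inv_d2 = np.asarray(inv_d2, float)
    y = np.log(np.asarray(mean_f2_set, float) / np.asarray(mean_f2_ref, float))
    slope, intercept = np.polyfit(inv_d2, y, 1)
    residuals = y - (slope * inv_d2 + intercept)
    return slope, intercept, residuals
```

Large residuals in a few bins of one dataset point at that dataset; a shared pattern of residuals across all datasets points at the reference.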

### Estimation of lack of isomorphism

The parameters for the lack of isomorphism can also be estimated prior to ML refinement. The mean lack of closure is first calculated (in resolution bins) by a faster method generalising what is described in Terwilliger & Eisenberg (1983). In a second step, a Wilson-plot analysis yields an initial estimate of the "global" non-isomorphism parameters according to the Luzzati model of non-isomorphism. (Luzzati (1952), Read (1986), Dumas (1994)).

#### Details

If you follow the corresponding hyper-link you can see the details of the iterative LOI estimation procedure in up to five cycles. At each cycle, a Wilson-plot shows you how well the mean square lack of closure follows the Luzzati statistical model (straight line). Deviations from the straight line can be attributed to :
• incomplete heavy-atom model, characterised by values above the line at low resolution
• non-crystallographic symmetry, characterised by a hump at medium resolution
• questionable data quality, characterised by marked wiggles around the line
Note 1 : Only the isomorphous lack of isomorphism is supposed to follow Luzzati statistics. Wilson plots of the anomalous lack of closure will usually not follow a straight line.

Always have a look at the table printed in the main SHARP log-file for all estimated parameters: any comment after one parameter should be taken as a warning and investigated further. If your refinement starts with very large values of non-isomorphism it might not recover from these starting values during the refinement.

Note 2 : by default the estimation of non-isomorphism parameters is switched off in the interface.

## Maximum-Likelihood refinement

The second and main section of each SHARP run is the refinement of all selected parameters. This follows the strategy described by BIG CYCLEs.

### Outlier rejection using likelihood histograms

In order to detect and reject outliers that were not apparent in the first histograms based on simple data analysis and statistics, we provide a second filter for the data based on the value of the log-likelihood function for each reflexion. Reflexions for which the log-likelihood is very small will maximally disagree with the model parameters at the current stage of the refinement. These reflexions may be outliers, but they can also be well measured but in maximal disagreement with the current model. Therefore, outliers according to this criterion must be rejected carefully in order to avoid bias towards the current values. A mild filter (5 standard deviations) is chosen.

This likelihood filtering procedure is applied again to all reflexions (including those which were previously rejected by that procedure) at the beginning of each new big cycle. The number of rejected reflexions based on this filter should in general decrease during the refinement.

### List of refined parameters

Because some parameters have a much greater influence on the likelihood maximisation process than others (especially if the starting model is far from the solution), SHARP refines some parameters before others : the list of refined parameters gets augmented at the start of the first three big cycles. Parameters can also be withdrawn from refinement if they bump into non-physical values (e.g. lack-of-isomorphism parameters below 0).

Therefore, the list of parameters that are being refined may vary during a run of SHARP. You will be given a link to the current list at the beginning of the ML refinement. Every time the list gets modified it will be presented too. The tree-like representation on the left side mirrors the hierarchical organisation of parameters within SHARP.

### Auxiliary Cycle Information

The "Auxiliary Cycle Information" file contains details about the current small cycle.

For each small cycle SHARP computes the likelihood, the gradient (1st-order derivatives) and the Hessian (2nd-order derivatives) at the current set of parameters. From this information a trial point is determined and the likelihood value at that point is computed. If this value has increased, the trial point becomes the new current point and SHARP continues with another small cycle. If the value has decreased, a new trial point is determined using a smaller radius of the trust region.
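The small-cycle logic can be sketched as a basic trust-region loop (an illustrative NumPy implementation under our own simplifying assumptions, such as doubling the radius after an accepted step; this is not SHARP's actual code):

```python
import numpy as np

def trust_region_maximise(f, grad, hess, x0, radius=1.0,
                          tol=1e-8, max_cycles=100):
    """Maximise f: propose a Newton step capped at the trust radius;
    accept it if f increases, otherwise shrink the radius and retry."""
    x, fx = np.asarray(x0, float), f(x0)
    for _ in range(max_cycles):
        g, H = grad(x), hess(x)
        step = np.linalg.solve(-H, g)      # Newton step for a maximum
        n = np.linalg.norm(step)
        if n < tol:                        # converged: step is tiny
            break
        if n > radius:
            step *= radius / n             # stay inside the trust region
        trial = x + step
        ft = f(trial)
        if ft > fx:                        # likelihood increased: accept
            x, fx = trial, ft
            radius *= 2.0
        else:                              # decreased: shrink the region
            radius *= 0.5
    return x
```

On a simple concave function this converges to the maximum in a handful of small cycles, mimicking the accept/shrink behaviour reported in the Auxiliary Cycle Information file.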

What you'll see in the file depends on your user level.

For each trial step, a line with the value of the trust-region radius is printed.

• EXPERT:

There are links to files with the gradient and (diagonal elements) of the Hessian of the likelihood at the current set of parameters as well as a link to a file with the eigenvalues of the Hessian.

If you add the keyword PRINT_HESS to the general section of the SIN file, the Hessian file will contain all elements and there will also be a link to a file with the eigenvectors. The latter is useful to detect problems due to over-parametrisation.

Also, at each trial point, additional information is printed which can help if you report problems to the SHARP developers.

### Weeding

When weeding has been requested the file contains a list of G-Sites for which the refinement of coordinates and (optionally) B-factor(s) has been stopped for this BIG cycle. If the user level is EXPERT, more details are given.

### Refinement step in parameter space

We call parameter space the configuration space spanned by all parameters that are refined at the current cycle. Thus, the set of refinable parameters at any cycle can be described as a point in this space. One iteration of the refinement creates a new point with its coordinates being the updated values of all refined parameters.

#### Details

The details page for this section gives you access to the mechanism of the log-likelihood maximisation procedure. It reports the value of the log-likelihood function at the start of the iteration. Don't be surprised if this value is negative: remember it's a log!

Two hyper-links point to the first and second-order derivatives of the log-likelihood function with respect to the refined parameters. This is used to check if a value connected to a particular parameter is surprising (for instance, positive values for second-order derivatives are not healthy).

The next analysis is performed on the normal matrix and consists of an eigenvalue decomposition of that matrix. This algorithm looks for 'principal directions' in parameter space, and tells you which combinations of parameters are badly conditioned (coordinates of filtered eigenvalue(s)).

For each parameter, the shift is tested to check whether it would lead to an invalid parameter value (Allowed range analysis).

The rest of the page consists of technical comments about the fastest way to maximise the function.

Note on eigenvalues : If the refinement is well-conditioned, all eigenvalues should be negative (meaning that the log-likelihood function is concave at that point in parameter space). In practice, it happens quite often that a few eigenvalues are filtered during the first cycles of a new refinement, meaning that the starting point is far from the maximum of the log-likelihood function. If the number of filtered eigenvalues does not decrease rapidly - and is not zero at convergence - this is a strong diagnostic of a pathology in the description of the heavy-atom substitution.

Remedies : You should then find out what combinations of parameters are associated with the most unfavourable (i.e. strongly positive) eigenvalues, and figure out where the problem comes from.

• In MIR, it may happen that lack-of-isomorphism parameters are estimated at a high value because the starting parameters are very far from optimal. If the refinement converges too rapidly thereafter, it sometimes happens that lack-of-isomorphism parameters remain "stuck" at high values and cannot refine because the function is not convex at that point. In this case, the coordinates of that filtered eigenvalue are almost 0 everywhere and almost 1 for that parameter. You should set that parameter back to zero (or half its value). Keep the refined value of all other parameters and start refining again.
• In MAD, it is sometimes the case that a false maximum is found - with filtered eigenvalues. This usually arises from a wrong starting combination of f' and f'' parameters. You should try refinement again from the start, with a different (more realistic) set of f' and f'' values - ideally from a good fluorescence measurement.
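The eigenvalue filtering itself can be sketched as follows (a NumPy illustration with our own function name, not SHARP's code): diagonalise the Hessian, discard non-negative eigenvalues, and build the step from the pseudo-inverse restricted to the concave subspace.

```python
import numpy as np

def filter_hessian(H):
    """Eigen-decompose a (symmetric) Hessian and zero out the
    directions with non-negative eigenvalues, so that a Newton step
    -H_pinv @ g moves only where the log-likelihood is concave.
    Returns the filtered pseudo-inverse and the number of filtered
    eigenvalues."""
    w, V = np.linalg.eigh(H)
    filtered = w >= 0.0
    w_inv = np.where(filtered, 0.0, 1.0 / w)
    H_pinv = (V * w_inv) @ V.T   # pseudo-inverse on the concave subspace
    return H_pinv, int(filtered.sum())
```

The count of filtered eigenvalues returned here corresponds to the diagnostic discussed in the note above: it should drop to zero as the refinement approaches a well-behaved maximum.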

### Step length

In order to calculate distances between two points in parameter space, we need to define a metric in that space. A unitary metric is useless because the refinement mixes parameters on very different scales (such as temperature factors and coordinates). The "natural" metric is then provided by the Hessian matrix of the log-likelihood function.

To provide a simple image of this, let us consider the case when only one parameter is being refined. Its standard deviation is then best approximated by the square root of the inverse of the second-order derivative of the log-likelihood function with respect to this parameter. The "natural" measure of distance is then given as a number of standard deviations, also called CHI SQUARE.

This simple picture can be generalised in the multi-dimensional parameter space : the reduced CHI SQUARE distance is then the distance in the metric of the Hessian matrix. It still can be understood intuitively as a "generalised number of standard deviations".
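In code, this generalised distance is just the norm of the shift in the metric of minus the Hessian (a one-line NumPy sketch with an illustrative function name):

```python
import numpy as np

def step_length(shift, hessian):
    """'Generalised number of standard deviations' for a parameter
    update: the length of the shift vector in the metric given by
    minus the Hessian of the log-likelihood."""
    shift = np.asarray(shift, float)
    return float(np.sqrt(shift @ (-np.asarray(hessian, float)) @ shift))
```

In the one-parameter case this reduces to |shift| divided by the standard deviation, exactly as in the simple picture above: a curvature of -4 means a standard deviation of 0.5, so a shift of 1.0 is a step of 2 standard deviations.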

### Lack of isomorphism

This link points to a page where tables of mean square lack of isomorphism values are displayed as a function of resolution. The values "re-calculated" from the parameter model used in the ML refinement are compared to statistics of lack of closure calculated directly from the data.

Near the bottom of the page the same information is available as a plot. This way you can check if the current model for lack of isomorphism fits the noise in the data properly on average. According to our experience, the model for "isomorphous" lack of isomorphism is valid - except when there are several NCS-related molecules in the asymmetric unit. The "anomalous" lack of isomorphism often displays a bad fit to the lack of closure analysis. But it is usually smaller than the measurement noise anyhow.

### Other statistics

This hyper-link provides you with some common heavy-atom refinement statistics during the course of the refinement cycles. These statistics are: the Cullis R-factor, the Kraut R-factor and the phasing power. These figures will be displayed as tables (function of resolution) and you will be able to view a graphical summary of these tables.

The definitions of these statistical quantities are :

Rcullis = <phase-integrated lack of closure> / < | Fph - Fp | >

Rkraut = <phase-integrated lack of closure> / < | Fph | >

Ppower = < [ | Fh(calc) | / phase-integrated lack of closure ] >
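A literal transcription of these three definitions (a NumPy sketch with hypothetical names; for simplicity the lack of closure is passed in as an already phase-integrated per-reflexion array):

```python
import numpy as np

def heavy_atom_statistics(fp, fph, fh_calc, lack_of_closure):
    """Cullis R-factor, Kraut R-factor and phasing power from
    per-reflexion arrays of |Fp|, |Fph|, |Fh(calc)| and the
    phase-integrated lack of closure."""
    fp, fph = np.asarray(fp, float), np.asarray(fph, float)
    fh = np.asarray(fh_calc, float)
    loc = np.asarray(lack_of_closure, float)
    r_cullis = loc.mean() / np.abs(fph - fp).mean()
    r_kraut = loc.mean() / fph.mean()
    p_power = np.mean(fh / loc)
    return r_cullis, r_kraut, p_power
```

In a real run these averages would be accumulated per resolution bin, matching the tables in the output.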

## Residual maps

The residual maps provided by SHARP are a very valuable tool to check the current heavy atom model for errors and missing sites.

### Type of residual map

SHARP calculates coefficients for all possible residual maps at the lowest level (i.e. anomalous and isomorphous residual maps for all batches).
• In SIRAS or MIRAS, it is important to compare the isomorphous and anomalous maps for each batch. These have different degrees of clarity and may offer different kinds of information.

Example : if the heavy-atom substituent in the crystal is PtCl2, the platinum will appear in both isomorphous and anomalous residual maps in all batches of this compound. But the two chlorines will only be seen (if the resolution and the quality of the data are good enough) in the isomorphous residual map.

• In MAD, the best information is found in the anomalous map of the wavelength where f" is maximal.
• In MAD, if a specific anomalous residual map shows peaks exactly on all (or most) of the heavy atom positions, it probably indicates an error in f''. These values could be refined or better starting values supplied.

### What to look for in a residual map

The residual map, as explained in the introduction, shows positive peaks wherever the data wants "more heavy-atom density" and negative peaks wherever the data wants "less heavy-atom density". By default, positive density will be displayed in red, negative density in blue and density for the known heavy-atoms in green.

A good rule of thumb for determining if a peak is significant, is as follows :

• A peak above 6 sigma levels is signal
• A peak under 5 sigma levels is noise
In practice, various shapes are (ideally) characteristic of physical effects :
• A positive peak far from any known site is a minor site
• One or more positive peaks close to a known site - without negative peaks in the vicinity - are light-atom ligands of the heavy-atom
• An anti-symmetrical arrangement of positive and negative peaks close to a known site is characteristic of anisotropic thermal motion of that site
• At high resolution, a lone negative peak close to a known site is 'something' (usually a water molecule) that has been pushed away by the substituent

## Electron-density map

### Statistics

FOM (Figure Of Merit) statistics are usually associated with the calculation of phases and are shown in resolution bins. These quantities describe a "confidence level" for the calculated centroid phases.

Note : These figures measure the quality of the centroid structure factor and not of the phase distributions. For instance, in a SIR case, very sharp bimodal phase distributions will yield a poor average figure of merit. Thus, the statistical quality of the phasing is better appreciated by consulting the phasing power statistics at convergence.

### Displaying an electron-density map

SHARP calculates Fourier coefficients for an electron-density map by taking the two-dimensional centroid of the probability distribution for the native complex structure factor. The two-dimensional centroid is then the centre of gravity of a two-dimensional probability distribution.

Why take the centroid ? Blow & Crick (1959) have demonstrated the power of taking centroids as Fourier coefficients to plot an electron-density map. They have applied it to the one-dimensional case, namely when the modulus |Fp| is assumed to be perfectly known. SHARP frees itself from this assumption, so we have to work in the whole complex plane of the Harker diagram, thus fully exploiting the optimality of the centroid.
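The two-dimensional centroid is simply the probability-weighted average over the complex plane, as this small NumPy sketch (illustrative names, not SHARP code) shows:

```python
import numpy as np

def centroid_structure_factor(samples, weights):
    """Two-dimensional centroid: centre of gravity of a probability
    distribution sampled over the complex plane (complex sample
    points with associated probability weights)."""
    w = np.asarray(weights, float)
    w = w / w.sum()                           # normalise the weights
    return np.sum(w * np.asarray(samples, complex))
```

For a sharply bimodal distribution with two opposite phases and known amplitude the centroid collapses to zero (minimising the expected map error by down-weighting that reflexion), while for two phases 90 degrees apart it has roughly 0.7 times the amplitude - the one-dimensional Blow & Crick behaviour recovered as a special case.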

At this level you can choose what you want to see in the electron density map:

• protein/DNA/RNA alone (i.e. the part of the structure which is common to all datasets without any heavy atoms). This will mean that in Se-MAD the Met side-chain has a hole in it!

• density of reference: this will contain any heavy atoms that were defined for the reference. So in Se-MAD it will have proper Se-Met side-chains.

• density with average heavy atom contribution: contains protein/DNA/RNA and an average of all heavy atoms declared in SHARP.

### Solvent flattening (density modification)

This button takes you to the Phase Improvement and Interpretation Control Panel where you can start various protocols for improving the quality of your electron-density maps. The value given in the input field is ignored in this version of SHARP/autoSHARP (only there for backwards-compatibility).
Last modification: 25.07.06