SHARP/autoSHARP User Manual
Chapter 5

SHARP Output Guide

Copyright © 2001-2006 by Global Phasing Limited

All rights reserved.

This software is proprietary to and embodies the confidential technology of Global Phasing Limited (GPhL). Possession, use, duplication or dissemination of the software is authorised only pursuant to a valid written licence from GPhL.

Documentation: (2001-2006) Clemens Vonrhein
               (1995-1997) Eric de La Fortelle

Contact: sharp-develop@GlobalPhasing.com


This document describes the output of the SHARP program and how to interpret it. Our convention in the main output generated by SHARP is that, wherever we think some explanation may be necessary, a hyper-link "explanation" can be followed to view the relevant information directly.



Introduction 

The SHARP output is divided into four sections. Depending on the calculation options chosen, not all of these sections will be present.

Reading the input parameter file

Before the calculation starts, SHARP has to read the parameter file that has been prepared through the graphical interface. In order to check that the most basic pieces of information have been properly understood and processed by the program, four self-explanatory hyper-links are presented.

Cell information

Shows the values of the cell parameters as understood by SHARP, and the standard quantities calculated from them. Make sure that this is the cell of the reference dataset.
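
As an illustration of the kind of derived quantities involved, here is a minimal sketch (not SHARP code; the function names are ours) of how a cell volume and reciprocal-cell lengths follow from the direct cell parameters:

# Hypothetical illustration (not SHARP code): derive the cell volume and
# reciprocal-cell lengths from the direct cell a, b, c, alpha, beta, gamma.
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Volume of the unit cell; angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca*ca - cb*cb - cg*cg + 2*ca*cb*cg)

def reciprocal_cell_lengths(a, b, c, alpha, beta, gamma):
    """Reciprocal cell lengths a*, b*, c* (reciprocal angles omitted for brevity)."""
    v = cell_volume(a, b, c, alpha, beta, gamma)
    sa, sb, sg = (math.sin(math.radians(x)) for x in (alpha, beta, gamma))
    return (b * c * sa / v, a * c * sb / v, a * b * sg / v)

print(cell_volume(78.0, 78.0, 37.0, 90.0, 90.0, 90.0))
print(reciprocal_cell_lengths(78.0, 78.0, 37.0, 90.0, 90.0, 90.0))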

Symmetry information

Presents the name of the space group as understood by SHARP and the results of the group-analysis operations performed on its symmetry operators. Since the symmetry operators are fetched from a library file, you might want to check that the correct operators are present.

Atomic scattering information

Shows the composition of the asymmetric unit as understood by SHARP and the coefficients that it will use to calculate resolution-dependent scattering factors for these atoms.
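
By way of illustration only (the coefficient values below are invented, not taken from any real scattering-factor table), such a resolution-dependent scattering factor is commonly parameterised as a sum of Gaussians plus a constant:

# Hypothetical illustration (not SHARP code): a resolution-dependent atomic
# scattering factor evaluated from tabulated Gaussian coefficients.
import math

def scattering_factor(a, b, c, stol2):
    """f(sin(theta)/lambda) from coefficients a[0..3], b[0..3] and constant c.

    stol2 is (sin(theta)/lambda)**2 in A^-2.
    """
    return sum(ai * math.exp(-bi * stol2) for ai, bi in zip(a, b)) + c

# Illustrative, made-up coefficients (not from any real table):
a = [2.0, 1.5, 1.0, 0.5]
b = [10.0, 5.0, 1.0, 0.2]
c = 0.3
for stol2 in (0.0, 0.05, 0.1, 0.25):
    print(stol2, round(scattering_factor(a, b, c, stol2), 3))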

What SHARP understood of the input parameter file

If you want to check that the rest of the parameters have been read properly, this link takes you to a copy of the ASCII input file as written out by SHARP after parsing it.


Preparation of data

Changing the resolution bins

In order to make statistically significant calculations in all the resolution bins, it is necessary to ensure that all these bins are properly populated, i.e. that they contain at least 100 reflexions each. If this is not the case, SHARP outputs the message "Bin population problem". To overcome this, SHARP first assumes that the user has not given resolution limits that correspond to the actual limits of the data: the limits are then re-set to the precise boundaries of the accepted reflexions. If this is not enough to overcome the bin population problem, the only remaining remedy is to reduce the number of bins iteratively until the above criterion is satisfied.

Advanced users can specify the exact number of bins during preparation of the SHARP input file. A maximum of 20 bins is possible.
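
The bin-adjustment logic described above might look roughly like the following sketch (an assumption about the flow, not the actual SHARP code):

# Hypothetical sketch of the bin-population check (not SHARP code).
import numpy as np

MIN_PER_BIN = 100

def choose_bins(d_spacings, n_bins_start=20):
    """Return bin edges (in 1/d^2) such that every bin has >= MIN_PER_BIN reflexions."""
    s2 = np.sort(1.0 / d_spacings**2)          # reciprocal resolution, ascending
    lo, hi = s2[0], s2[-1]                     # re-set limits to the actual data extent
    for n_bins in range(n_bins_start, 0, -1):  # reduce the number of bins iteratively
        edges = np.linspace(lo, hi, n_bins + 1)
        counts, _ = np.histogram(s2, bins=edges)
        if counts.min() >= MIN_PER_BIN:
            return edges
    raise ValueError("Bin population problem: too few reflexions")

# Example with synthetic d-spacings between 2.0 and 40 A:
edges = choose_bins(np.random.uniform(2.0, 40.0, size=5000))
print(len(edges) - 1, "bins")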

Accepted reflexions

All reflexions in the file are read. Those that have at least two measurements within the resolution limits that you indicated are retained for further use. A link to details of why some reflexions might have been rejected is provided.

Initial outlier rejection

This is a simple protection against the most obvious outliers. Histograms are calculated on quantities that do not require knowledge of the heavy-atom substitution (amplitudes, isomorphous differences, anomalous differences if present). The "tail" of these distributions is iteratively cut at 5 standard deviations (iterations are needed because the action of cutting the tail reduces the standard deviation).
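
A minimal sketch of such an iterative 5-standard-deviation cut (an illustration of the idea only, not the SHARP implementation):

# Hypothetical sketch of iterative tail-cutting at 5 standard deviations.
import numpy as np

def clip_tail(values, n_sigma=5.0, max_iter=20):
    """Iteratively reject values lying more than n_sigma from the mean."""
    keep = np.ones(len(values), dtype=bool)
    for _ in range(max_iter):
        mean = values[keep].mean()
        sigma = values[keep].std()
        new_keep = np.abs(values - mean) <= n_sigma * sigma
        if new_keep.sum() == keep.sum():      # converged: no further rejections
            break
        keep = new_keep                       # cutting the tail shrinks sigma,
    return keep                               # so another pass may reject more

diffs = np.random.normal(0.0, 1.0, 10000)
diffs[:5] = 50.0                              # a few obvious outliers
print((~clip_tail(diffs)).sum(), "rejected")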

Computing statistics

Once the most visible outliers are out of the way, SHARP performs several passes through the data in order to collect statistical quantities that will be used in different areas of the program. Each of these quantities is presented as a function of resolution (in resolution bins).

Note 1 : The statistics of mean squared isomorphous - or, in the case of MAD, dispersive - and anomalous differences as a function of resolution (as plotted in this section in a log-graph) are a very good indicator of the quality of your derivative. The plot should more or less follow a straight line. It has repeatedly happened that high-resolution data deemed worthless by other criteria have been very useful for phasing. As a general rule, DO NOT DISCARD DATA WITH LOW PHASING POWER. These data are extremely useful for refining scale factors, heavy-atom temperature factors and other parameters. In addition, their full potential is only expressed in the density modification step, where high-resolution reflexions with a FOM of 0.1 or 0.2 on average are crucial to obtain the best possible map as an outcome. You are strongly advised to judge for yourself in the tutorials.

Note 2 : When there is no native dataset (eg in MAD), the mean squared native amplitudes will be approximated as those of the reference dataset until big cycle 3. After that they are more accurately estimated by subtracting the heavy-atom contribution to the statistics of the reference dataset. This estimation will be done again at the start of each new big cycle, because their accuracy depends on the exactness of the substitution model.

Estimation of the absolute scale

If selected, SHARP will estimate a pseudo-absolute scale for the first dataset, based on the atomic composition of the asymmetric unit. The only purpose of this is to provide an extra check on the chemical reasonableness of the heavy-atom occupancies. The refinement can proceed unharmed on any other scale, provided the starting occupancies are not too far from the "scaled" occupancies.
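
For illustration only, a deliberately simplified Wilson-type estimate of such a scale (ignoring the overall temperature factor; the expected Sum f_j^2 and the numbers below are invented, and SHARP's actual procedure may differ):

# Hypothetical, simplified pseudo-absolute scale estimate (illustration only).
import numpy as np

def absolute_scale(f_obs, expected_sum_f2):
    """Scale k such that k^2 <|Fobs|^2> matches the expected Sum(f_j^2)
    from the asymmetric-unit composition (temperature factor neglected)."""
    return np.sqrt(expected_sum_f2 / np.mean(f_obs**2))

expected_sum_f2 = 2.5e5                        # hypothetical composition-based value
f_obs = np.random.uniform(50.0, 500.0, 1000)   # arbitrary observed amplitudes
print("pseudo-absolute scale:", round(absolute_scale(f_obs, expected_sum_f2), 3))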

Estimation of the relative scale

Scaling of all datasets relative to the reference dataset is extremely important to start the refinement under good conditions. The estimate provided by this robust algorithm is usually precise, so it should be used whenever you are not absolutely sure of the scaling of one dataset relative to another. Obviously, once this has been done it need not be repeated in subsequent SHARP refinements for the same dataset.

The Wilson plot is there to show you how well the logarithm of the relative scale fits a straight line (as a function of resolution). Any significant departure from the line is an alarming symptom, and that dataset should be revisited. If all Wilson plots show a comparable pattern of misbehaviour, the reference dataset itself may be questionable.
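
The following sketch illustrates the idea of such a relative-scale Wilson fit (our own illustration, not the SHARP algorithm): the logarithm of the ratio of mean squared amplitudes per resolution bin is fitted against (sin theta / lambda)^2, the intercept giving a relative scale and the slope a relative temperature factor.

# Hypothetical relative-scaling Wilson fit (illustration only).
# ln( <F_other^2> / <F_ref^2> ) ~= 2 ln(k) - 2*dB * (sin(theta)/lambda)^2 per bin.
import numpy as np

def relative_scale_fit(stol2_bin_centres, mean_f2_ref, mean_f2_other):
    """Least-squares line through the log-ratio; returns (k, delta_B)."""
    y = np.log(mean_f2_other / mean_f2_ref)
    slope, intercept = np.polyfit(stol2_bin_centres, y, 1)
    k = np.exp(0.5 * intercept)       # relative scale factor
    delta_b = -0.5 * slope            # relative temperature factor (A^2)
    return k, delta_b

# Synthetic bins: "other" dataset is 1.3x the reference with dB = 5 A^2
stol2 = np.linspace(0.005, 0.12, 15)
f2_ref = np.exp(-20.0 * stol2) * 1e4
f2_other = (1.3**2) * f2_ref * np.exp(-2 * 5.0 * stol2)
print(relative_scale_fit(stol2, f2_ref, f2_other))   # approx. (1.3, 5.0)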

Estimation of lack of isomorphism

The parameters for the lack of isomorphism can also be estimated prior to ML refinement. The mean lack of closure is first calculated (in resolution bins) by a faster method generalising what is described in Terwilliger & Eisenberg (1983). In a second step, a Wilson-plot analysis yields an initial estimate of the "global" non-isomorphism parameters according to the Luzzati model of non-isomorphism (Luzzati, 1952; Read, 1986; Dumas, 1994).

Details

If you follow the corresponding hyper-link you can see the details of the iterative LOI estimation procedure in up to five cycles. At each cycle, a Wilson plot shows you how well the mean square lack of closure follows the Luzzati statistical model (a straight line). Deviations from the straight line can be attributed to several causes.

Note 1 : Only the isomorphous lack of isomorphism is supposed to follow Luzzati statistics. Wilson plots of the anomalous lack of closure will usually not follow a straight line.

Always have a look at the table printed in the main SHARP log-file for all estimated parameters: any comment printed after a parameter should be taken as a warning and investigated further. If your refinement starts with very large values of non-isomorphism it might not recover from these starting values during the refinement.

Note 2 : By default, the estimation of non-isomorphism parameters is switched off in the interface.


Maximum-Likelihood refinement

The second and main section of each SHARP run is the refinement of all selected parameters. This follows the strategy described under BIG CYCLEs.

Outlier rejection using likelihood histograms

In order to detect and reject outliers that were not apparent in the first histograms based on simple data analysis and statistics, we provide a second filter for the data, based on the value of the log-likelihood function for each reflexion. Reflexions for which the log-likelihood is very small are those that most strongly disagree with the model parameters at the current stage of the refinement. These reflexions may be outliers, but they can also be well measured and simply in maximal disagreement with the current model. Therefore, outliers according to this criterion must be rejected carefully in order to avoid bias towards the current parameter values. A mild filter (5 standard deviations) is chosen.

This likelihood filtering procedure is applied again to all reflexions (including those which were previously rejected by that procedure) at the beginning of each new big cycle. The number of rejected reflexions based on this filter should in general decrease during the refinement.
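
A minimal sketch of such a mild log-likelihood filter (an illustration of the idea, not the SHARP implementation):

# Hypothetical sketch of the log-likelihood outlier filter. It is re-applied
# to all reflexions, including previously rejected ones, each big cycle.
import numpy as np

def likelihood_filter(loglik, n_sigma=5.0):
    """Reject reflexions whose log-likelihood lies more than n_sigma
    standard deviations below the mean (a deliberately mild criterion)."""
    mean, sigma = loglik.mean(), loglik.std()
    return loglik < mean - n_sigma * sigma

loglik = np.random.normal(-3.0, 0.5, 20000)
loglik[:10] -= 10.0                          # a few reflexions in gross disagreement
print(likelihood_filter(loglik).sum(), "reflexions rejected this big cycle")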

List of refined parameters

Because some parameters have a much greater influence on the likelihood maximisation process than others (especially if the starting model is far from the solution), SHARP refines some parameters before others: the list of refined parameters is augmented at the start of each of the first three big cycles. Parameters can also be withdrawn from refinement if they bump into non-physical values (e.g. lack-of-isomorphism parameters below 0).

Therefore, the list of parameters that are being refined may vary during a run of SHARP. You will be given a link to the current list at the beginning of the ML refinement. Every time the list gets modified it will be presented too. The tree-like representation on the left side mirrors the hierarchical organisation of parameters within SHARP.

Auxiliary Cycle Information

The "Auxiliary Cycle Information" file contains details about the current small cycle.

For each small cycle SHARP computes the likelihood, the gradient (1st-order derivatives) and the Hessian (2nd-order derivatives) at the current set of parameters. From this information a trial point is determined and the likelihood value at that point is computed. If this value has increased, the trial point becomes the new current point and SHARP continues with another small cycle. If the value has decreased, a new trial point is determined using a smaller radius of the trust region.
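
The accept/shrink logic of such a trust-region small cycle can be sketched as follows (a toy illustration under our own assumptions, not the SHARP implementation):

# Hypothetical sketch of the accept/shrink logic in a trust-region step.
import numpy as np

def trust_region_cycle(params, loglik, gradient, hessian, radius):
    """One small cycle: propose a step within the trust region, accept it if
    the log-likelihood increases, otherwise shrink the trust region."""
    step = -np.linalg.solve(hessian, gradient)  # Newton-like trial step
    norm = np.linalg.norm(step)
    if norm > radius:                           # stay inside the trust region
        step *= radius / norm
    trial = params + step
    if loglik(trial) > loglik(params):          # improvement: accept the trial point
        return trial, radius
    return params, 0.5 * radius                 # no improvement: shrink the region

# Toy quadratic log-likelihood with its maximum at (1, 2):
target = np.array([1.0, 2.0])
loglik = lambda p: -np.sum((p - target) ** 2)
p, radius = np.zeros(2), 1.0
for _ in range(10):
    g = -2.0 * (p - target)                     # gradient of the toy log-likelihood
    h = -2.0 * np.eye(2)                        # Hessian of the toy log-likelihood
    p, radius = trust_region_cycle(p, loglik, g, h, radius)
print(p)                                        # approaches (1, 2)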

What you will see in the file depends on your user level.

Weeding

When weeding has been requested the file contains a list of G-Sites for which the refinement of coordinates and (optionally) B-factor(s) has been stopped for this BIG cycle. If the user level is EXPERT, more details are given.

Refinement step in parameter space

We call parameter space the configuration space spanned by all parameters that are refined at the current cycle. Thus, the set of refinable parameters at any cycle can be described as a point in this space. One iteration of the refinement creates a new point whose coordinates are the updated values of all refined parameters.

Details

The details page for this section gives you access to the mechanism of the log-likelihood maximisation procedure. It reports the value of the log-likelihood function at the start of the iteration. Don't be surprised if this value is negative: remember it's a log!

Two hyper-links point to the first- and second-order derivatives of the log-likelihood function with respect to the refined parameters. These can be used to check whether a value connected to a particular parameter is surprising (for instance, positive values for second-order derivatives are not healthy).

The next analysis is performed on the normal matrix and consists of an eigenvalue decomposition of that matrix. This algorithm looks for 'principal directions' in parameter space, and tells you which combinations of parameters are badly conditioned (the coordinates of any filtered eigenvalue(s)).

For each parameter, the proposed shift is tested to check whether it would lead to an invalid parameter value (Allowed range analysis).

The rest of the page consists of technical comments about the fastest way to maximise the function.

Note on eigenvalues : If the refinement is well-conditioned, all eigenvalues should be negative (meaning that the log-likelihood function is concave at that point in parameter space). In practice, it happens quite often that a few eigenvalues are filtered during the first cycles of a new refinement, meaning that the starting point is far from the maximum of the log-likelihood function. If the number of filtered eigenvalues does not decrease rapidly - and is not zero at convergence - this is a strong diagnostic of a pathology in the description of the heavy-atom substitution.

Remedies : You should then find out what combinations of parameters are associated with the most unfavourable (i.e. strongly positive) eigenvalues, and figure out where the problem comes from.
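
A minimal sketch of such an eigen-analysis (not SHARP code; the parameter names below are hypothetical): decompose the normal matrix, flag non-negative eigenvalues, and report which parameters dominate the corresponding eigenvectors.

# Hypothetical sketch of the eigenvalue analysis of the normal matrix.
import numpy as np

def analyse_conditioning(hessian, param_names):
    """Flag non-negative ("filtered") eigenvalues - all eigenvalues should be
    negative at a well-behaved point - and show which parameters dominate
    the offending eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(hessian)     # symmetric matrix
    for value, vector in zip(eigvals, eigvecs.T):
        if value >= 0.0:                           # badly conditioned direction
            worst = np.argsort(-np.abs(vector))[:3]
            combo = ", ".join(f"{param_names[i]} ({vector[i]:+.2f})" for i in worst)
            print(f"filtered eigenvalue {value:.3g}: dominated by {combo}")

# Toy 3-parameter Hessian with one badly conditioned direction:
names = ["occupancy", "B-factor", "x-coordinate"]      # hypothetical parameters
h = np.array([[-4.0, 0.0, 0.0],
              [ 0.0,  0.5, 0.4],
              [ 0.0,  0.4, -3.0]])
analyse_conditioning(h, names)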

Step length

In order to calculate distances between two points in parameter space, we need to define a metric in that space. A unitary metric is useless because the refinement mixes parameters on very different scales (such as temperature factors and coordinates). The "natural" metric is then provided by the Hessian matrix of the log-likelihood function.

To provide a simple image of this, let us consider the case when only one parameter is being refined. Its standard deviation is then best approximated by the square root of the inverse of the second-order derivative of the log-likelihood function with respect to this parameter. The "natural" measure of distance is then given as a number of standard deviations, also called CHI SQUARE.

This simple picture can be generalised to the multi-dimensional parameter space: the reduced CHI SQUARE distance is then the distance in the metric of the Hessian matrix. It can still be understood intuitively as a "generalised number of standard deviations".
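
A minimal numeric sketch of this generalised distance (our illustration; the normalisation by the number of parameters used for the "reduced" form is our assumption):

# Hypothetical sketch of the step length in the Hessian metric.
import numpy as np

def chi_square_step_length(step, hessian):
    """Distance of a parameter step in the Hessian metric.

    For a single parameter with second derivative d2L, sigma ~ sqrt(-1/d2L),
    so the step in 'standard deviations' is |dp| * sqrt(-d2L). The
    multi-dimensional generalisation is sqrt( dp^T (-H) dp ).
    """
    n = len(step)
    return np.sqrt(step @ (-hessian) @ step / n)   # reduced by the dimension

# One-parameter check: d2L = -4 gives sigma = 0.5, so a step of 1.0 is 2 sigma.
print(chi_square_step_length(np.array([1.0]), np.array([[-4.0]])))   # 2.0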

Lack of isomorphism

This link points to a page where tables of mean square lack of isomorphism values are displayed as a function of resolution. The values "re-calculated" from the parameter model used in the ML refinement are compared to statistics of lack of closure calculated directly from the data.

Near the bottom of the page the same information is available as a plot. This way you can check whether the current model for lack of isomorphism fits the noise in the data properly on average. In our experience, the model for "isomorphous" lack of isomorphism is valid, except when there are several NCS-related molecules in the asymmetric unit. The "anomalous" lack of isomorphism often displays a bad fit to the lack-of-closure analysis, but it is usually smaller than the measurement noise anyhow.

Other statistics

This hyper-link provides you with some common heavy-atom refinement statistics during the course of the refinement cycles. These statistics are the Cullis R-factor, the Kraut R-factor and the phasing power. These figures are displayed as tables (as a function of resolution) and you can also view a graphical summary of these tables.

The definitions of these statistical quantities are :

Rcullis = <phase-integrated lack of closure> / < | Fph - Fp | >

Rkraut = <phase-integrated lack of closure> / < | Fph | >

Ppower = < [ | Fh(calc) | / phase-integrated lack of closure ] >
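
To make the definitions concrete, here is a toy accumulation of these statistics from per-reflexion values (an illustration of the formulas above only; in SHARP the phase-integrated lack of closure comes from the full probability distributions):

# Hypothetical accumulation of Cullis and Kraut R-factors and phasing power
# from per-reflexion quantities (illustration of the definitions above).
import numpy as np

def ha_statistics(f_p, f_ph, f_h_calc, lack_of_closure):
    """All inputs are arrays over acceptable reflexions; lack_of_closure is the
    phase-integrated lack of closure for each reflexion."""
    r_cullis = np.mean(lack_of_closure) / np.mean(np.abs(f_ph - f_p))
    r_kraut = np.mean(lack_of_closure) / np.mean(np.abs(f_ph))
    p_power = np.mean(np.abs(f_h_calc) / lack_of_closure)
    return r_cullis, r_kraut, p_power

# Toy numbers only:
f_p = np.array([100.0, 220.0, 150.0])
f_ph = np.array([110.0, 200.0, 165.0])
f_h = np.array([12.0, 25.0, 18.0])
loc = np.array([6.0, 9.0, 7.0])
print(ha_statistics(f_p, f_ph, f_h, loc))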


Residual maps

The residual maps provided by SHARP are a very valuable tool to check the current heavy atom model for errors and missing sites.

Type of residual map

SHARP calculates coefficients for all possible residual maps at the lowest level (i.e. anomalous and isomorphous residual maps for all batches).

What to look for in a residual map

The residual map, as explained in the introduction, shows positive peaks wherever the data wants "more heavy-atom density" and negative peaks wherever the data wants "less heavy-atom density". By default, positive density will be displayed in red, negative density in blue and density for the known heavy-atoms in green.

A good rule of thumb can be used to determine whether a peak is significant.

In practice, various peak shapes are (ideally) characteristic of particular physical effects.

Electron-density map

Statistics

FOM (Figure Of Merit) statistics are usually associated with the calculation of phases and are shown in resolution bins. These quantities describe a "confidence level" for the calculated centroid phases.

Note : These figures measure the quality of the centroid structure factor, not of the phase distributions. For instance, in a SIR case, very sharp bimodal phase distributions will still yield a poor average figure of merit. Thus, the statistical quality of the phasing is better appreciated by consulting the phasing power statistics at convergence.

Displaying an electron-density map

SHARP calculates Fourier coefficients for an electron-density map by taking the two-dimensional centroid of the probability distribution for the native complex structure factor, i.e. the centre of gravity of that two-dimensional probability distribution.

Why take the centroid ? Blow & Crick (1959) demonstrated the power of taking centroids as Fourier coefficients to compute an electron-density map. They applied it to the one-dimensional case, namely when the modulus |Fp| is assumed to be perfectly known. SHARP frees itself from this assumption, so we have to work in the whole complex plane of the Harker diagram, thus fully exploiting the optimality of the centroid.
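
As a sketch of what such a centroid looks like numerically (a toy sampled distribution under our own assumptions; SHARP works with its full analytical distributions), note how a sharp bimodal distribution still gives a low figure of merit, as mentioned in the Note above:

# Hypothetical sketch of the two-dimensional centroid (centre of gravity) of a
# structure-factor probability distribution sampled in the complex plane.
import numpy as np

def centroid_coefficient(prob, re_part, im_part):
    """Return the centroid F_best = <F> of a sampled 2-D probability
    distribution and the figure of merit m = |<F>| / <|F|>."""
    prob = prob / prob.sum()
    f = re_part + 1j * im_part
    f_best = np.sum(prob * f)                       # centre of gravity
    fom = np.abs(f_best) / np.sum(prob * np.abs(f)) # low for bimodal distributions
    return f_best, fom

# Toy bimodal distribution: two sharp peaks of equal weight at phases +/- 60 deg.
amp, phases = 100.0, np.deg2rad([60.0, -60.0])
re = amp * np.cos(phases)
im = amp * np.sin(phases)
prob = np.array([0.5, 0.5])
print(centroid_coefficient(prob, re, im))           # |F_best| = 50, m = 0.5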

At this level you can choose what you want to see in the electron-density map.

Solvent flattening (density modification)

This button takes you to the Phase Improvement and Interpretation Control Panel, where you can start various protocols for improving the quality of your electron-density maps. The value given in the input field is ignored in this version of SHARP/autoSHARP (it is only there for backwards compatibility).
Last modification: 25.07.06