SHARP/autoSHARP User Manual, Chapter 5

# SHARP Output Guide

Copyright © 2001-2006 by Global Phasing Limited. All rights reserved. This software is proprietary to and embodies the confidential technology of Global Phasing Limited (GPhL). Possession, use, duplication or dissemination of the software is authorised only pursuant to a valid written licence from GPhL.

Documentation: Clemens Vonrhein (2001-2006), Eric de La Fortelle (1995-1997).

Contact: sharp-develop@GlobalPhasing.com

This document describes the output of the SHARP program and how to interpret it. Our convention in the main output generated by SHARP is that, wherever we think some explanation may be necessary, a hyper-link "explanation" leads directly to the relevant information.

## Introduction

The SHARP output is divided into four sections :
• Preparation of data

This section is present in all types of calculations. The crystallographic data are read, filtered, analysed statistically and, optionally, put on scale by a robust estimation procedure.

• ML Refinement

All parameters that have been marked for refinement in the input will be allowed to vary in order to maximise the log-likelihood function. This refinement follows a Newton method with full second-order derivative information. The Hessian matrix (the normal matrix of the maximisation) is filtered of its positive eigenvalues, so that the refinement only proceeds in the parameter subspace where the function is concave. The Hessian matrix is also used to calculate standard deviations for all parameters. This in turn provides an objective way to measure the size of each step (parameter update) in parameter space, in units of standard deviations. Convergence is achieved when this step becomes small enough.

• Residual maps

Once convergence is achieved for the parameters of the current model, it is possible to detect unaccounted-for substitution features using the various kinds of residual maps provided. The Fourier coefficients for these maps are the components of the gradient of the log-likelihood function. This means that after a Fourier Synthesis positive peaks will appear where the data "expects" more heavy-atom density and negative peaks where the data "wants" less heavy-atom density.

• Electron-density maps

Once the substitution model is satisfactory (no significant peaks left in the residual maps), it is time to inspect the electron-density maps, post-process them using density modification algorithms and try to interpret them in terms of molecular structure. Fourier coefficients calculated by SHARP are an extension of the "best phases" in the sense of Blow and Crick (Blow & Crick, 1959): instead of being one-dimensional phase centroids they are two-dimensional centroids of the probability distribution of the complex native structure factor.

Note : The phase information output by SHARP is not limited to the centroid structure factor that is used to calculate the electron-density map. A complete summary of the phase probability distribution is available via Hendrickson-Lattman (HL) coefficients. In many cases, the quality of phasing that SHARP provides may not be noticeable at the level of centroid maps, but only after the HL coefficients have been used in a density modification procedure.
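To make the HL representation concrete, here is a minimal numerical sketch (not SHARP code; the function name and the use of NumPy are our own) of how the phase probability distribution implied by the four coefficients A, B, C, D yields a figure of merit:

```python
import numpy as np

def fom_from_hl(A, B, C, D, n=3600):
    """Figure of merit implied by Hendrickson-Lattman coefficients.

    P(phi) ~ exp(A cos phi + B sin phi + C cos 2phi + D sin 2phi);
    the FOM is the modulus of the centroid of this distribution on
    the unit circle.
    """
    phi = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    logp = (A * np.cos(phi) + B * np.sin(phi)
            + C * np.cos(2 * phi) + D * np.sin(2 * phi))
    p = np.exp(logp - logp.max())   # subtract max to avoid overflow
    p /= p.sum()                    # normalise to a probability distribution
    centroid = np.sum(p * np.exp(1j * phi))
    return np.abs(centroid)
```

Note how a sharp unimodal distribution (large A) gives a FOM near 1, while an equally sharp bimodal one (large C) gives a FOM near 0 - which is exactly why centroid-level statistics can understate the quality of the phase information carried by the HL coefficients.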

Depending on the calculation options chosen, not all of these sections will be present.

### Reading the input parameter file

Before the calculation starts, SHARP has to read the parameter file that has been prepared through the graphical interface. In order to check that the most basic pieces of information have been properly understood and processed by the program, four self-explanatory hyper-links are presented.

#### Cell information

Shows the values of the cell parameters as understood by SHARP, and standard operations performed on them, to calculate :
• reciprocal cell parameters
• the volume of the unit cell
• the Brookhaven matrix (which, multiplied by a vector of fractional coordinates, returns the vector of coordinates in Å), and the inverse Brookhaven matrix (which performs the inverse operation).
Make sure that this is the cell of the reference dataset.
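As an illustration (a NumPy sketch, not the routine SHARP uses; the function name is ours), the Brookhaven matrix for the standard convention with a along x and b in the x-y plane can be built from the cell parameters as follows; its inverse performs the reverse operation and its determinant is the unit-cell volume:

```python
import numpy as np

def brookhaven_matrix(a, b, c, alpha, beta, gamma):
    """Fractional -> orthogonal (Angstrom) matrix, Brookhaven/PDB
    convention (a along x, b in the x-y plane).  Angles in degrees."""
    al, be, ga = np.radians([alpha, beta, gamma])
    cx = c * np.cos(be)
    cy = c * (np.cos(al) - np.cos(be) * np.cos(ga)) / np.sin(ga)
    cz = np.sqrt(c * c - cx * cx - cy * cy)
    return np.array([[a,   b * np.cos(ga), cx],
                     [0.0, b * np.sin(ga), cy],
                     [0.0, 0.0,            cz]])
```

For an orthorhombic cell the matrix is simply diagonal in a, b, c, so fractional coordinate 0.5 along a 10 Å axis maps to 5 Å, as expected.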

#### Symmetry information

Presents the name of the space group as understood by SHARP and the results of group analysis operations performed on the group operators. Since the symmetry operators are fetched from a library file you might want to check that the correct operators are present.

#### Atomic scattering information

Shows the composition of the asymmetric unit as understood by SHARP and the coefficients that it will use to calculate resolution-dependent scattering factors for these atoms.

#### What SHARP understood of the input parameter file

If you want to check that the rest of the parameters have been read properly, this link takes you to a copy of the ASCII input file - written by SHARP after parsing it.

## Preparation of data

### Changing the resolution bins

In order to make significant statistical calculations in all the resolution bins, it is necessary to ensure that all these bins are properly populated, i.e. that they contain at least 100 reflexions each. If this is not the case, SHARP outputs the message "Bin population problem". To get over this, we first suppose that the user has not given resolution limits that correspond to the actual limits of the data. The limits are then re-set to the precise boundaries of accepted reflexions. If this is not enough to overcome the bin population problem, the only remaining remedy is to reduce the number of bins iteratively until the above criterion is satisfied.

Advanced users can specify the exact number of bins during preparation of the SHARP input file. A maximum of 20 bins is possible.

### Accepted reflexions

All reflexions in the file are read. Those that have at least two measurements within the resolution limits that you indicated are retained for further use. A link to details why some reflexions might have been rejected is provided.

### Initial outlier rejection

This is a simple protection against the most obvious outliers. Histograms are calculated on quantities that do not require knowledge of the heavy-atom substitution (amplitudes, isomorphous differences, anomalous differences if present). The "tail" of these distributions is iteratively cut at 5 standard deviations (iterations are needed because the action of cutting the tail reduces the standard deviation).
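The iterative tail-cutting can be sketched as follows (an illustration in NumPy, not SHARP's actual implementation; the function name is ours):

```python
import numpy as np

def clip_tails(x, nsigma=5.0, max_iter=20):
    """Iteratively reject values more than nsigma standard deviations
    from the mean.  Iteration is needed because each rejection shrinks
    the standard deviation, which may expose further outliers."""
    x = np.asarray(x, dtype=float)
    keep = np.ones(x.size, dtype=bool)
    for _ in range(max_iter):
        m, s = x[keep].mean(), x[keep].std()
        new_keep = np.abs(x - m) <= nsigma * s
        if new_keep.sum() == keep.sum():   # converged: no change
            break
        keep = new_keep
    return keep
```

A single gross outlier inflates the initial standard deviation; the first pass removes it, and subsequent passes operate on the now much tighter distribution.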

### Computing statistics

Once the most visible outliers are out of the way, SHARP performs several passes through the data in order to collect statistical quantities that will be used in different areas of the program. These quantities are :
• the mean squared amplitudes for each batch
• the mean squared isomorphous differences for each batch
• the mean squared anomalous differences for each batch
• the mean squared native amplitudes
Each of these is presented as a function of resolution (in resolution bins).

Note 1 : The statistics of mean squared isomorphous - or in the case of MAD dispersive - and anomalous differences as a function of resolution (as plotted in this section in a log-graph) are a very good indicator of the quality of your derivative. The plot should more or less follow a straight line. It has repeatedly happened that high-resolution data deemed worthless by other criteria have been very useful for phasing. As a general rule DO NOT DISCARD DATA WITH LOW PHASING POWER. These data are extremely useful for refining scale factors, heavy-atom temperature factors and other parameters. In addition, their full potential is only expressed in the density modification step, where high-resolution reflexions with a FOM of 0.1 or 0.2 on average are crucial to obtain the best possible map as an outcome. You are strongly advised to judge for yourself in the tutorials.

Note 2 : When there is no native dataset (e.g. in MAD), the mean squared native amplitudes will be approximated as those of the reference dataset until big cycle 3. After that they are more accurately estimated by subtracting the heavy-atom contribution from the statistics of the reference dataset. This estimation will be done again at the start of each new big cycle, because its accuracy depends on the exactness of the substitution model.

### Estimation of the absolute scale

If selected, SHARP will estimate a pseudo-absolute scale for the first dataset based on the atomic composition of the asymmetric unit. The only purpose of this is to provide an extra check on the chemical reasonableness of the heavy-atom occupancies. The refinement can proceed unharmed on any other scale, provided the starting occupancies are not too far from the "scaled" occupancies.

### Estimation of the relative scale

Scaling of all datasets relative to the reference is extremely important to start the refinement under good conditions. The estimate provided by this robust algorithm is usually precise - so it should be used whenever you are not absolutely sure of the scaling of one dataset to another. Obviously, after this has been done once, it needn't be repeated in subsequent SHARP refinements for the same dataset.

The Wilson-plot is there to show you how well the logarithm of the relative scale fits a straight line (as a function of resolution). Any significant departure from the line is an alarming symptom and that dataset should be revisited. If all Wilson-plots show a comparable pattern of misbehaviour, the reference dataset itself may be questionable.
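The straight-line check described above can be sketched as follows (a NumPy illustration with a hypothetical function name; SHARP's own scaling is more elaborate). The intercept gives the log relative scale, the slope the relative temperature factor, and the residuals show how far each resolution bin departs from the line:

```python
import numpy as np

def relative_wilson(inv_d2, mean_f2_ref, mean_f2_set):
    """Relative Wilson plot: fit ln(<F_set^2>/<F_ref^2>) as a straight
    line in 1/d^2 and report the per-bin residuals."""
    inv_d2 = np.asarray(inv_d2, float)
    y = np.log(np.asarray(mean_f2_set, float) / np.asarray(mean_f2_ref, float))
    slope, intercept = np.polyfit(inv_d2, y, 1)
    residuals = y - (slope * inv_d2 + intercept)
    return slope, intercept, residuals
```

Large residuals in a few bins of one dataset point at that dataset; a shared pattern of residuals across all datasets points at the reference.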

### Estimation of lack of isomorphism

The parameters for the lack of isomorphism can also be estimated prior to ML refinement. The mean lack of closure is first calculated (in resolution bins) by a faster method generalising what is described in Terwilliger & Eisenberg (1983). In a second step, a Wilson-plot analysis yields an initial estimate of the "global" non-isomorphism parameters according to the Luzzati model of non-isomorphism. (Luzzati (1952), Read (1986), Dumas (1994)).

#### Details

If you follow the corresponding hyper-link you can see the details of the iterative LOI estimation procedure in up to five cycles. At each cycle, a Wilson-plot shows you how well the mean square lack of closure follows the Luzzati statistical model (straight line). Deviations from the straight line can be attributed to :
• incomplete heavy-atom model, characterised by values above the line at low resolution
• non-crystallographic symmetry, characterised by a hump at medium resolution
• questionable data quality, characterised by marked wiggles around the line
Note 1 : Only the isomorphous lack of isomorphism is supposed to follow Luzzati statistics. Wilson plots of the anomalous lack of closure will usually not follow a straight line.

Always have a look at the table printed in the main SHARP log-file for all estimated parameters: any comment after one parameter should be taken as a warning and investigated further. If your refinement starts with very large values of non-isomorphism it might not recover from these starting values during the refinement.

Note 2 : by default the estimation of non-isomorphism parameters is switched off in the interface.

## Maximum-Likelihood refinement

The second and main section of each SHARP run is the refinement of all selected parameters. This follows the strategy described by BIG CYCLEs.

### Outlier rejection using likelihood histograms

In order to detect and reject outliers that were not apparent in the first histograms based on simple data analysis and statistics, we provide a second filter for the data based on the value of the log-likelihood function for each reflexion. Reflexions for which the log-likelihood is very small will maximally disagree with the model parameters at the current stage of the refinement. These reflexions may be outliers, but they can also be well measured but in maximal disagreement with the current model. Therefore, outliers according to this criterion must be rejected carefully in order to avoid bias towards the current values. A mild filter (5 standard deviations) is chosen.

This likelihood filtering procedure is applied again to all reflexions (including those which were previously rejected by that procedure) at the beginning of each new big cycle. The number of rejected reflexions based on this filter should in general decrease during the refinement.

### List of refined parameters

Because some parameters have a much greater influence on the likelihood maximisation process than others (especially if the starting model is far from the solution), SHARP refines some parameters before others : the list of refined parameters gets augmented at the start of the first three big cycles. Parameters can also be withdrawn from refinement if they bump into non-physical values (e.g. lack-of-isomorphism parameters below 0).

Therefore, the list of parameters that are being refined may vary during a run of SHARP. You will be given a link to the current list at the beginning of the ML refinement. Every time the list gets modified it will be presented too. The tree-like representation on the left side mirrors the hierarchical organisation of parameters within SHARP.

### Auxiliary Cycle Information

The "Auxiliary Cycle Information" file contains details about the current small cycle.

For each small cycle SHARP computes the likelihood, the gradient (1st-order derivatives) and the Hessian (2nd-order derivatives) at the current set of parameters. From this information a trial point is determined and the likelihood value at that point is computed. If this value has increased, the trial point becomes the new current point and SHARP continues with another small cycle. If the value has decreased, a new trial point is determined using a smaller radius of the trust region.
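The small-cycle logic can be sketched as a basic trust-region loop (an illustrative NumPy implementation under our own simplifying assumptions, such as doubling the radius after an accepted step; this is not SHARP's actual code):

```python
import numpy as np

def trust_region_maximise(f, grad, hess, x0, radius=1.0,
                          tol=1e-8, max_cycles=100):
    """Maximise f: propose a Newton step capped at the trust radius;
    accept it if f increases, otherwise shrink the radius and retry."""
    x, fx = np.asarray(x0, float), f(x0)
    for _ in range(max_cycles):
        g, H = grad(x), hess(x)
        step = np.linalg.solve(-H, g)      # Newton step for a maximum
        n = np.linalg.norm(step)
        if n < tol:                        # converged: step is tiny
            break
        if n > radius:
            step *= radius / n             # stay inside the trust region
        trial = x + step
        ft = f(trial)
        if ft > fx:                        # likelihood increased: accept
            x, fx = trial, ft
            radius *= 2.0
        else:                              # decreased: shrink the region
            radius *= 0.5
    return x
```

On a simple concave function this converges to the maximum in a handful of small cycles, mimicking the accept/shrink behaviour reported in the Auxiliary Cycle Information file.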

What you'll see in the file depends on your user level.

For each trial step, a line with the value of the trust-region radius is printed.

• EXPERT:

There are links to files with the gradient and (diagonal elements) of the Hessian of the likelihood at the current set of parameters as well as a link to a file with the eigenvalues of the Hessian.

If you add the keyword PRINT_HESS to the general section of the SIN file, the Hessian file will contain all elements and there will also be a link to a file with the eigenvectors. The latter is useful to detect problems due to over-parametrisation.

Also, at each trial point, additional information is printed which can help if you report problems to the SHARP developers.

### Weeding

When weeding has been requested the file contains a list of G-Sites for which the refinement of coordinates and (optionally) B-factor(s) has been stopped for this BIG cycle. If the user level is EXPERT, more details are given.

### Refinement step in parameter space

We call parameter space the configuration space spanned by all parameters that are refined at the current cycle. Thus, the set of refinable parameters at any cycle can be described as a point in this space. One iteration of the refinement creates a new point with its coordinates being the updated values of all refined parameters.

#### Details

The details page for this section gives you access to the mechanism of the log-likelihood maximisation procedure. It reports the value of the log-likelihood function at the start of the iteration. Don't be surprised if this value is negative: remember it's a log!

Two hyper-links point to the first and second-order derivatives of the log-likelihood function with respect to the refined parameters. This is used to check if a value connected to a particular parameter is surprising (for instance, positive values for second-order derivatives are not healthy).

The next analysis is performed on the normal matrix and consists of an eigenvalue decomposition of that matrix. This algorithm looks for 'principal directions' in parameter space, and tells you which combinations of parameters are badly conditioned (coordinates of filtered eigenvalue(s)).

For each parameter, the shift is tested to check whether it would lead to an invalid parameter value (Allowed range analysis).

The rest of the page consists of technical comments about the fastest way to maximise the function.

Note on eigenvalues : If the refinement is well-conditioned, all eigenvalues should be negative (meaning that the log-likelihood function is concave at that point in parameter space). In practice, it happens quite often that a few eigenvalues are filtered during the first cycles of a new refinement, meaning that the starting point is far from the maximum of the log-likelihood function. If the number of filtered eigenvalues does not decrease rapidly - and is not zero at convergence - this is a strong diagnostic of a pathology in the description of the heavy-atom substitution.

Remedies : You should then find out what combinations of parameters are associated with the most unfavourable (i.e. strongly positive) eigenvalues, and figure out where the problem comes from.

• In MIR, it may happen that lack-of-isomorphism parameters are estimated at a high value because the starting parameters are very far from optimal. If the refinement converges too rapidly thereafter, it sometimes happens that lack-of-isomorphism parameters remain "stuck" at high values and cannot refine because the function is not convex at that point. In this case, the coordinates of that filtered eigenvalue are almost 0 everywhere and almost 1 for that parameter. You should set that parameter back to zero (or half its value). Keep the refined value of all other parameters and start refining again.
• In MAD, it is sometimes the case that a false maximum is found - with filtered eigenvalues. This usually arises from a wrong starting combination of f' and f'' parameters. You should try refinement again from the start, with a different (more realistic) set of f' and f'' values - ideally from a good fluorescence measurement.
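The eigenvalue filtering itself can be sketched as follows (a NumPy illustration with our own function name, not SHARP's code): diagonalise the Hessian, discard non-negative eigenvalues, and build the step from the pseudo-inverse restricted to the concave subspace.

```python
import numpy as np

def filter_hessian(H):
    """Eigen-decompose a (symmetric) Hessian and zero out the
    directions with non-negative eigenvalues, so that a Newton step
    -H_pinv @ g moves only where the log-likelihood is concave.
    Returns the filtered pseudo-inverse and the number of filtered
    eigenvalues."""
    w, V = np.linalg.eigh(H)
    filtered = w >= 0.0
    w_inv = np.where(filtered, 0.0, 1.0 / w)
    H_pinv = (V * w_inv) @ V.T   # pseudo-inverse on the concave subspace
    return H_pinv, int(filtered.sum())
```

The count of filtered eigenvalues returned here corresponds to the diagnostic discussed in the note above: it should drop to zero as the refinement approaches a well-behaved maximum.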

### Step length

In order to calculate distances between two points in parameter space, we need to define a metric in that space. A unitary metric is useless because the refinement mixes parameters on very different scales (such as temperature factors and coordinates). The "natural" metric is then provided by the Hessian matrix of the log-likelihood function.

To provide a simple image of this, let us consider the case when only one parameter is being refined. Its standard deviation is then best approximated by the square root of the inverse of the second-order derivative of the log-likelihood function with respect to this parameter. The "natural" measure of distance is then given as a number of standard deviations, also called CHI SQUARE.

This simple picture can be generalised in the multi-dimensional parameter space : the reduced CHI SQUARE distance is then the distance in the metric of the Hessian matrix. It still can be understood intuitively as a "generalised number of standard deviations".
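In code, this generalised distance is just the norm of the shift in the metric of minus the Hessian (a one-line NumPy sketch with an illustrative function name):

```python
import numpy as np

def step_length(shift, hessian):
    """'Generalised number of standard deviations' for a parameter
    update: the length of the shift vector in the metric given by
    minus the Hessian of the log-likelihood."""
    shift = np.asarray(shift, float)
    return float(np.sqrt(shift @ (-np.asarray(hessian, float)) @ shift))
```

In the one-parameter case this reduces to |shift| divided by the standard deviation, exactly as in the simple picture above: a curvature of -4 means a standard deviation of 0.5, so a shift of 1.0 is a step of 2 standard deviations.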

### Lack of isomorphism

This link points to a page where tables of mean square lack of isomorphism values are displayed as a function of resolution. The values "re-calculated" from the parameter model used in the ML refinement are compared to statistics of lack of closure calculated directly from the data.

Near the bottom of the page the same information is available as a plot. This way you can check if the current model for lack of isomorphism fits the noise in the data properly on average. According to our experience, the model for "isomorphous" lack of isomorphism is valid - except when there are several NCS-related molecules in the asymmetric unit. The "anomalous" lack of isomorphism often displays a bad fit to the lack of closure analysis. But it is usually smaller than the measurement noise anyhow.

### Other statistics

This hyper-link provides you with some common heavy-atom refinement statistics during the course of the refinement cycles. These statistics are: the Cullis R-factor, the Kraut R-factor and the phasing power. These figures will be displayed as tables (function of resolution) and you will be able to view a graphical summary of these tables.

The definitions of these statistical quantities are :

Rcullis = <phase-integrated lack of closure> / < | Fph - Fp | >

Rkraut = <phase-integrated lack of closure> / < | Fph | >

Ppower = < [ | Fh(calc) | / phase-integrated lack of closure ] >
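A literal transcription of these three definitions (a NumPy sketch with hypothetical names; for simplicity the lack of closure is passed in as an already phase-integrated per-reflexion array):

```python
import numpy as np

def heavy_atom_statistics(fp, fph, fh_calc, lack_of_closure):
    """Cullis R-factor, Kraut R-factor and phasing power from
    per-reflexion arrays of |Fp|, |Fph|, |Fh(calc)| and the
    phase-integrated lack of closure."""
    fp, fph = np.asarray(fp, float), np.asarray(fph, float)
    fh = np.asarray(fh_calc, float)
    loc = np.asarray(lack_of_closure, float)
    r_cullis = loc.mean() / np.abs(fph - fp).mean()
    r_kraut = loc.mean() / fph.mean()
    p_power = np.mean(fh / loc)
    return r_cullis, r_kraut, p_power
```

In a real run these averages would be accumulated per resolution bin, matching the tables in the output.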

## Residual maps

The residual maps provided by SHARP are a very valuable tool to check the current heavy atom model for errors and missing sites.

### Type of residual map

SHARP calculates coefficients for all possible residual maps at the lowest level (i.e. anomalous and isomorphous residual maps for all batches).
• In SIRAS or MIRAS, it is important to compare the isomorphous and anomalous maps for each batch. These have different degrees of clarity and may offer different kinds of information.

Example : if the heavy-atom substituent in the crystal is PtCl2, the platinum will appear in both isomorphous and anomalous residual maps in all batches of this compound. But the two chlorines will only be seen (if the resolution and the quality of the data are good enough) in the isomorphous residual map.

• In MAD, the best information is found in the anomalous map of the wavelength where f" is maximal.
• In MAD, if a specific anomalous residual map shows peaks exactly on all (or most) of the heavy atom positions, it probably indicates an error in f''. These values could be refined or better starting values supplied.

### What to look for in a residual map

The residual map, as explained in the introduction, shows positive peaks wherever the data wants "more heavy-atom density" and negative peaks wherever the data wants "less heavy-atom density". By default, positive density will be displayed in red, negative density in blue and density for the known heavy-atoms in green.

A good rule of thumb for determining if a peak is significant, is as follows :

• A peak above 6 sigma levels is signal
• A peak under 5 sigma levels is noise
In practice, various shapes are (ideally) characteristic of physical effects :
• A positive peak far from any known site is a minor site
• One or more positive peaks close to a known site - without negative peaks in the vicinity - are light-atom ligands of the heavy-atom
• An anti-symmetrical arrangement of positive and negative peaks close to a known site is characteristic of anisotropic thermal motion of that site
• At high resolution, a lone negative peak close to a known site is 'something' (usually a water molecule) that has been pushed away by the substituent

## Electron-density map

### Statistics

FOM (Figure Of Merit) statistics are usually associated with the calculation of phases and are shown in resolution bins. These quantities describe a "confidence level" for the calculated centroid phases.

Note : These figures measure the quality of the centroid structure factor and not of the phase distributions. For instance, in a SIR case, very sharp bimodal phase distributions will yield a poor average figure of merit. Thus, the statistical quality of the phasing is better appreciated by consulting the phasing power statistics at convergence.

### Displaying an electron-density map

SHARP calculates Fourier coefficients for an electron-density map by taking the two-dimensional centroid of the probability distribution for the native complex structure factor. The two-dimensional centroid is then the centre of gravity of a two-dimensional probability distribution.

Why take the centroid ? Blow & Crick (1959) have demonstrated the power of taking centroids as Fourier coefficients to plot an electron-density map. They have applied it to the one-dimensional case, namely when the modulus |Fp| is assumed to be perfectly known. SHARP frees itself from this assumption, so we have to work in the whole complex plane of the Harker diagram, thus fully exploiting the optimality of the centroid.
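The two-dimensional centroid is simply the probability-weighted average over the complex plane, as this small NumPy sketch (illustrative names, not SHARP code) shows:

```python
import numpy as np

def centroid_structure_factor(samples, weights):
    """Two-dimensional centroid: centre of gravity of a probability
    distribution sampled over the complex plane (complex sample
    points with associated probability weights)."""
    w = np.asarray(weights, float)
    w = w / w.sum()                           # normalise the weights
    return np.sum(w * np.asarray(samples, complex))
```

For a sharply bimodal distribution with two opposite phases and known amplitude the centroid collapses to zero (minimising the expected map error by down-weighting that reflexion), while for two phases 90 degrees apart it has roughly 0.7 times the amplitude - the one-dimensional Blow & Crick behaviour recovered as a special case.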

At this level you can choose what you want to see in the electron density map:

• protein/DNA/RNA alone (i.e. the part of the structure which is common to all datasets without any heavy atoms). This will mean that in Se-MAD the Met side-chain has a hole in it!

• density of reference: this will contain any heavy atoms that were defined for the reference. So in Se-MAD it will have proper Se-Met side-chains.

• density with average heavy atom contribution: contains protein/DNA/RNA and an average of all heavy atoms declared in SHARP.

### Solvent flattening (density modification)

This button takes you to the Phase Improvement and Interpretation Control Panel where you can start various protocols for improving the quality of your electron-density maps. The value given in the input field is ignored in this version of SHARP/autoSHARP (only there for backwards-compatibility).
Last modification: 25.07.06