[sharp-discuss] question on mir phasing

Fri May 28 10:23:41 CEST 2010

Hi Francis,

On Thu, May 27, 2010 at 01:18:34PM -0600, Francis E Reyes wrote:
> The reason I ask is because a lot of people emphasize low resolution  
> data, but I can sympathize with the OP. One screens diffraction images, 
> collects as high resolution as possible, and can (usually) solve the 
> structure.

I can also sympathize - but (there's always a but):

 -  "collects as high resolution as possible"

   ==> Why? What is the (sensible) reason for that? Maybe:

   * you need the high resolution to see enough detail for some
     important biological question to be answered

   * you want more reflections so refinement becomes easier
     (parameter/observation ratio)

   * a high-resolution limit of 1.94A just sounds nicer than 2.8A (or:
     "The high-resolution structure of ..." is a better title for a
     paper) ... ;-)

   Collecting high resolution data in itself is neither good nor bad -
   it's what you need the data for that determines the usefulness of
   it.

 - "and can (usually) solve the structure."

   That is something quite different: how was it solved? With
   molecular replacement or experimental phasing (or combination)? Or
   maybe direct methods?

   Some general and simplified rules:

     MR needs complete low resolution data (remember that you try to
     place a large 'blob' through rotation/translation - this is a low
     resolution problem).

     Experimental methods need accurate data (to have good
     anomalous/isomorphous/dispersive differences). One of the main
     steps here is density modification - and e.g. solvent flattening
     or histogram matching like to distinguish the protein region from
     the solvent region ... which again is a low resolution problem
     (to find the blob of protein versus the fuzzy solvent region).

     Direct methods like very high resolution data.

   If you use direct methods (SHELXD, HySS, SnB etc) to locate your
   heavy atom substructure, you will also use normalised structure
   factors (of differences). To have accurately determined E values,
   you need complete resolution bins everywhere - or at least fairly
   complete with a _random_ set of reflections missing (so that the
   average in that bin stays the same).

   However (and this is the important bit): when we're talking about
   low completeness in the low resolution shells we are looking at two
   effects.

     1) reflections that can't be measured because of beamstop

        This will affect all kinds of decisions in the same way that
        having very high resolution data will determine if you can use
        anisotropic B-factor refinement, model alternative
        conformations etc.

        With missing low resolution you might not be able to solve a
        structure with MR: the information about location and
        orientation is mainly included in those low-resolution data.

        You won't be able to do density modification efficiently
        (solvent flattening, NCS averaging etc).

        Bulk solvent correction in refinement will be very unstable
        since you're missing the data that would influence the bulk
        solvent model parametrisation: so your model (PDB file plus
        bulk solvent) will be in error relative to the reality of your
        crystal from which you collected your data. This can have all
        kinds of funny effects: ripples at the surface of the protein,
        difficulties in modeling solvent structure at surface, weird
        connectivity problems in otherwise perfect density etc.

     2) overloaded reflections

        This is a non-random set of reflections that are missing: not
        just any (say) 30% of reflections in the lowest resolution
        shell, but the 30% strongest ones. So we have a systematic
        error here - with the same effects as above (unable to match
        model/parametrisation with physical reality).

        And for experimental phasing: those strongest reflections
        would have been measured/integrated most accurately (strong,
        high I/sigI values) and therefore would have given you the
        most accurate and largest difference values for substructure
        solution and phasing ... so one is missing the most valuable
        bits of data.

        I'd rather throw away 1000 high-resolution reflections than
        miss the 20 strongest low-resolution reflections due to
        overloads ;-)
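
To illustrate the difference between a random and a systematic loss of
reflections in one resolution bin, here is a minimal Python sketch. It
is only an illustration under assumptions that are not taken from any
real dataset: the intensities are simulated from an exponential
(acentric Wilson) distribution with an arbitrary mean of 100, and the
helper names drop_random and drop_strongest are made up for this
example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def drop_random(intensities, fraction, rng):
    """Remove a random subset of reflections from one resolution bin."""
    keep = len(intensities) - int(round(fraction * len(intensities)))
    return rng.sample(intensities, keep)

def drop_strongest(intensities, fraction):
    """Remove the strongest reflections, as overloads would."""
    keep = len(intensities) - int(round(fraction * len(intensities)))
    return sorted(intensities)[:keep]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical bin: 5000 simulated intensities with mean 100 (assumption)
full = [rng.expovariate(1.0 / 100.0) for _ in range(5000)]

random_loss = drop_random(full, 0.30, rng)
overload_loss = drop_strongest(full, 0.30)

print("mean, all reflections:       ", round(mean(full), 1))
print("mean, 30% missing at random: ", round(mean(random_loss), 1))
print("mean, 30% strongest missing: ", round(mean(overload_loss), 1))
```

With these made-up numbers the random loss leaves the bin mean close
to the full value, whereas dropping the strongest 30% roughly halves
it - and normalisation to E values relies on exactly that mean.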

> The structure that I collected on during Rapidata (which was 
> solved)

Congratulations!

> has a low resolution completeness (as judged by phenix.xtriage) 
> from 28.6 - 10.61 A of about 68.7%.

Which I would classify as very problematic: the beamstop probably
restricts you to 30A (which is ok) and the 30% missing reflections are
all overloads.

> However, 100% in all bins up to the 
> resolution limit of 2.0A.

2A is nice for refinement - but for structure solution (either MR or
HA phasing) I'd rather have a 2.5-2.8A dataset with the low resolution
range more or less complete. And by lowering the dose (no overloads)
and moving the detector away (smaller missing region behind beamstop)
you could get 95% complete data in the 40-10A range maybe?
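
As a rough sketch of why a longer detector distance helps at low
resolution, one can apply Bragg's law to the edge of the beamstop
shadow. The assumptions here are mine and not from the experiment
discussed: a flat detector perpendicular to the beam, and an unusable
region around the direct beam that has a fixed radius on the detector
face (whether that holds depends on how the beamstop is mounted). The
wavelength, radius and distances are hypothetical values.

```python
import math

def low_resolution_limit(wavelength_A, shadow_radius_mm, distance_mm):
    """Lowest-resolution d-spacing (in A) just outside the beamstop shadow.

    Bragg's law: d = lambda / (2 sin(theta)), with the scattering angle
    2*theta = atan(shadow radius / crystal-to-detector distance).
    """
    two_theta = math.atan2(shadow_radius_mm, distance_mm)
    return wavelength_A / (2.0 * math.sin(two_theta / 2.0))

# Hypothetical setup: 1.0 A wavelength, 5 mm shadow radius on the detector
for distance in (150.0, 300.0):
    d_max = low_resolution_limit(1.0, 5.0, distance)
    print(f"distance {distance:5.0f} mm -> low-resolution limit about {d_max:.0f} A")
```

Under these assumptions, doubling the distance roughly doubles the
lowest-resolution d-spacing that can still be recorded.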

Yes, it is annoying to maybe have to collect two datasets: one with
complete low-resolution data for solving the structure and another
high-resolution one for refinement ... but that might just be 2 trips
to the synchrotron instead of half a dozen where the first step
(structure solution) failed because of low-resolution issues.

> By default HKL2000 does not output the low resolution bins (the
> first bin in this dataset was 50-5A and in the scale log its at a
> 100% completion).

I don't know HKL2000, but I'm sure you could run scalepack with a
resolution setting of 30-4.0 to get that statistic? Otherwise, SFTOOLS
is very good at this (you can specify the exact number of bins to
use).
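
If neither program is at hand, the statistic itself is simple enough
to sketch in a few lines of Python. This is not how scalepack, SFTOOLS
or phenix.xtriage do it internally; the function name
completeness_by_bin, the input layout and the toy data are made up
for illustration, and it assumes you already have a list of all
possible reflections with their d-spacings and a measured/unmeasured
flag.

```python
def completeness_by_bin(reflections, edges):
    """Completeness per resolution bin.

    reflections: list of (d_spacing_in_A, was_measured) for every
                 reflection that could have been observed
    edges:       resolution limits in A, from low to high resolution,
                 e.g. [30.0, 10.0, 4.0]
    """
    rows = []
    for low, high in zip(edges, edges[1:]):
        in_bin = [measured for d, measured in reflections if high < d <= low]
        n_possible = len(in_bin)
        n_measured = sum(in_bin)
        fraction = n_measured / n_possible if n_possible else None
        rows.append((low, high, n_possible, n_measured, fraction))
    return rows

# Made-up toy data: one low-resolution reflection is missing
toy = [(25.0, False), (15.0, True), (12.0, True),
       (8.0, True), (5.0, True), (4.5, True)]

for low, high, n_possible, n_measured, fraction in completeness_by_bin(toy, [30.0, 10.0, 4.0]):
    print(f"{low:5.1f} - {high:4.1f} A: {n_measured}/{n_possible} measured")
```

The choice of bin edges matters: with a single wide bin the missing
low-resolution reflection is diluted by the many complete
higher-resolution ones, which is how a 50-5A bin can look fine while
a 30-10A bin does not.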

> Is it only when you have issues that it becomes valuable? (poor  
> anomalous signal, etc etc)

Turn that around: if you don't have that (valuable) good low
resolution data it becomes an issue ... every time. Yes, you might get
around it because the structure is 'just' a compound soak in the same
spacegroup (so no structure solution required) and your biological
question (did it bind?) can ignore all those messy features like
breaks in main-chain density or noisy solvent boundary.

> P.S. the Reply-To: field for sharp-discuss defaults to the sender, it  
> would be useful for the replies to head back to the list so those of us 
> who are simply 'watching' can listen in and hopefully learn something.

Yes(ish): the logic here is that we don't want people accidentally
replying to the whole group if they only want to reply to a single
person. I think that is the same default as e.g. for CCP4bb? One might
always use the group-reply feature of the email client to get the
reply to list and sender ...

Cheers

Clemens

-- 

***************************************************************
* Clemens Vonrhein, Ph.D.     vonrhein AT GlobalPhasing DOT com
*
*  Global Phasing Ltd.
*  Sheraton House, Castle Park 
*  Cambridge CB3 0AX, UK
*--------------------------------------------------------------
* BUSTER Development Group      (http://www.globalphasing.com)
***************************************************************