Content:

Introduction

Here are some viewpoints, explanations and background information relating to the numerous recent discussions (e.g. on the CCP4bb) about various aspects of data handling - and what apparently should or should not be done. We try to stay as neutral as possible and refrain from pointing to particular software: we might misrepresent or misunderstand software from other groups and it is better for readers to form their own opinions anyway.

Please also check the more software-specific pages at the BUSTER Wiki (since October 2020) - especially the section about combining mmCIF reflection data from processing and refinement - as well as the STARANISO server (since December 2019) and the BUSTER discussion list (exchange February 2023).

Data truncation

Selecting reflection data can be a balancing act: are we excluding useful signal or are we including problematic noise? When talking about data truncation we might want to consider different ways this might happen or be done:

  • The experimental setup might prevent the collection of actually observable reflection data via the beamstop, a beamstop holder, some inactive detector area (module gaps), an inadequate crystal-detector distance (detector too far away) etc.
  • Applying an isotropic cut-off, i.e. using a sphere in reciprocal space that defines the region outside which all integrated intensities are to be discarded: this cut-off surface is usually described by a so-called "high-resolution cutoff" and is often based on specific data quality metrics (<I/sigI>, CC1/2, Completeness, Rpim etc. - or combinations thereof) computed in local shells.
  • Applying an ellipsoidal cut-off, i.e. using an ellipsoid in reciprocal space that defines the region outside which all integrated intensities are to be discarded: this is usually described by the three principal axes of the ellipsoid centred at the origin, i.e. three orthogonal d* vectors (diffraction limits) that in turn are based on some analysis of the reflection intensities in terms of anisotropy, i.e. the falloff in intensity in different directions.
  • Applying an anisotropic cut-off, i.e. a general cut-off surface in reciprocal space that defines the region outside which all integrated intensities are to be discarded: this cannot be described by a simple resolution cutoff (resulting in an isotropic/spherical cut-off surface, i.e. a sphere) or three diffraction limits (resulting in an ellipsoidal cut-off surface, i.e. an ellipsoid), but requires an explicit mask in reciprocal space. However, one can fit an ellipsoid to that general cut-off surface to have some simplified approximate description of it.
  • Outlier rejection during data analysis and refinement as done by some refinement programs (either by default or upon user request) to exclude those measurements that do not agree well with the current model.
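
To illustrate the difference between the isotropic and the ellipsoidal case above, here is a minimal sketch that is not taken from any particular program. It assumes that each reflection is already available as a Cartesian reciprocal-space vector s (in 1/Å) and that the ellipsoid is given by three orthonormal axis directions plus one diffraction limit (in Å) per axis. All function and parameter names are ours and purely illustrative; a general anisotropic cut-off would need an explicit mask instead of such a closed-form test.

```python
import math


def inside_sphere(s, d_min):
    """Isotropic cut-off: keep the reflection if |s| <= 1/d_min.

    s     -- Cartesian reciprocal-space vector (1/Angstrom), assumed given
    d_min -- high-resolution cutoff in Angstrom
    """
    return math.sqrt(sum(c * c for c in s)) <= 1.0 / d_min


def inside_ellipsoid(s, axes, d_limits):
    """Ellipsoidal cut-off: keep the reflection if it lies inside the
    ellipsoid centred at the origin.

    axes     -- three orthonormal unit vectors (principal axes)
    d_limits -- diffraction limit in Angstrom along each principal axis
    """
    total = 0.0
    for axis, d_lim in zip(axes, d_limits):
        # Component of s along this principal axis.
        proj = sum(a * c for a, c in zip(axis, s))
        # (proj / d*)^2 with d* = 1/d_lim being the semi-axis length.
        total += (proj * d_lim) ** 2
    return total <= 1.0
```

With diffraction limits of e.g. 2.0 Å along the first axis and 3.0 Å along the other two, a reflection at 2.5 Å resolution is kept if it lies along the first axis but discarded if it lies along the second, whereas a spherical cut-off treats both identically.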

Anisotropic correction

This is usually applied to amplitudes only, and at different stages:

  • During data processing (i.e. before a model is available): at this point the correction factors (anisotropic B-factors, Bij) come from an analysis of data anisotropy.
  • During refinement (by scaling the observed data to the model using anisotropic B-factors, Bij): these rescaled observations will then be used to provide the map coefficients for electron-density maps (2mFo-DFc) and difference-density maps (mFo-DFc). Different refinement programs also write the observed data into the same output reflection file as the map coefficients - some with their original/unmodified values, others with modified (rescaled and anisotropically corrected) values.
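
As a rough sketch of what such a correction does numerically, the following assumes one common convention: an amplitude scale factor exp(-(1/4) s^T B s), with s a Cartesian reciprocal-space vector in 1/Å and B a symmetric 3x3 tensor in Å^2 that describes only the anisotropic falloff to be removed. Programs differ in sign conventions, in the coordinate frame used for Bij and in how the isotropic part is treated, so the names and choices below are assumptions for illustration only.

```python
import math


def aniso_scale(s, B):
    """Amplitude scale factor exp(-(1/4) s^T B s) (assumed convention).

    s -- Cartesian reciprocal-space vector (1/Angstrom)
    B -- symmetric 3x3 anisotropic B tensor (Angstrom^2), as nested lists
    """
    q = sum(s[i] * B[i][j] * s[j] for i in range(3) for j in range(3))
    return math.exp(-0.25 * q)


def correct_amplitude(F, sigF, s, B):
    """Undo the anisotropic falloff for one amplitude and its sigma.

    Both values are multiplied by the same factor, so F/sigF is unchanged.
    """
    k = 1.0 / aniso_scale(s, B)
    return F * k, sigF * k
```

Since amplitude and sigma are scaled together, the correction changes the values seen by downstream programs without changing the signal-to-noise ratio of an individual reflection, which is one reason to keep the uncorrected values available as well.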

Conversion from intensities to amplitudes

Multiple programs exist for doing this: all of those will carry the input intensities through into the output. Downstream programs/processes need to take care not to disrupt the connection between the original intensities and the resulting amplitudes: MTZ files are a very good and structured format to keep collections of tightly related data items (intensities, amplitudes, anomalous data etc.) intact and together.
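
The idea of keeping intensities and amplitudes together can be sketched with a small record type. The conversion used here is deliberately naive (F = sqrt(I) with simple error propagation, and only for positive intensities); it is a stand-in and not the Bayesian treatment (e.g. French & Wilson) that real conversion programs apply to weak and negative intensities. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Reflection:
    """One merged reflection: the intensity is always kept."""
    hkl: Tuple[int, int, int]
    intensity: float
    sig_intensity: float
    amplitude: Optional[float] = None
    sig_amplitude: Optional[float] = None


def add_naive_amplitude(refl):
    """Return a copy with amplitude fields filled in, intensity untouched.

    Naive placeholder: F = sqrt(I), sigF = sigI / (2 F), only for I > 0.
    Non-positive intensities are returned unchanged (no amplitude).
    """
    if refl.intensity <= 0.0:
        return refl
    f = math.sqrt(refl.intensity)
    return replace(refl, amplitude=f,
                   sig_amplitude=refl.sig_intensity / (2.0 * f))
```

The point of the sketch is the data structure rather than the formula: the amplitude is added next to the intensity instead of replacing it, so a later step can always go back to the original measurement.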

Deposition considerations

There are good reasons for applying any of those truncation or correction methods described above - and good reasons not to do so: different stages of data analysis, structure solution or model refinement might require different views or versions of the same underlying diffraction data. The important point is to provide as complete a record of the reflection data as possible so that downstream programs (and future methods) have access to the data they require or want. Then all of the above topics, methods and approaches don't really matter since a user is able to do something different with the same data - maybe because new methods are available, personal preferences differ or as a check and validation of a deposited structural model. Therefore, the deposition process itself is as important as writing any part of the methods section of a paper: reproducibility is key here.

For those reasons, we think access to the following would be ideal:

  • Deposition of original, raw diffraction data:
    • There are multiple general public archives available, as well as specific solutions provided by synchrotrons or academic institutions.
    • Data needs to be easily accessible (i.e. without the need for registration, login or contacting authors) as soon as a structure has been released in the PDB archive: it needs to be as easily accessible as a PDB entry itself (if associated with one).
    • We encourage the use of "standard" archiving tools that work nicely on all platforms - especially on those usually used for data (re)processing: *.tar.gz files are nice and simple while e.g. ZIP files can be problematic (partial ZIP files are useless) - especially if provided by online on-the-fly generation.
      • Multi-file datasets should be packed into larger archives if possible: it is much easier to download a *.tar file containing *.cbf.gz files than to download 3600 files one after the other.
      • One compression is enough: no need to have *.bz2 files inside a *.tar.gz file.
    • Raw diffraction data associated with a released PDB entry should be accessible immediately and not "upon request" (including e.g. requiring loading from a tape library or applying for some access token etc).
  • Processed diffraction data as an mmCIF file with multiple data blocks:
    • scaled+unmerged measurements (intensities) without any cut-off
    • scaled+merged intensities without any cut-off
    • scaled+merged intensities and amplitudes after cut-off
  • Refinement results as an mmCIF file with (potentially) multiple data blocks:
    • original intensity/amplitude data as given to the refinement program on input
    • map coefficients for the different types of maps a user might want to inspect (electron density, difference density, F(early)-F(late), anomalous differences, event maps, interpolated maps etc).
    • this needs to be combined with the mmCIF file from data processing to provide all levels of data from scaled+unmerged to final refinement results.
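
Regarding the packaging of raw diffraction images mentioned above, a minimal sketch using only the Python standard library could look as follows. It assumes that the images are already individually compressed as *.cbf.gz files in a single directory and writes a plain (uncompressed) *.tar archive, so that there is only one level of compression. The function name and the default file pattern are our own choices for illustration.

```python
import tarfile
from pathlib import Path


def pack_images(image_dir, archive_path, pattern="*.cbf.gz"):
    """Pack already-compressed image files into one plain tar archive.

    Returns the number of files written to the archive.
    """
    files = sorted(Path(image_dir).glob(pattern))
    # Mode "w" writes an uncompressed tar: the images are gzipped already,
    # so a second compression layer would gain little.
    with tarfile.open(str(archive_path), "w") as tar:
        for f in files:
            # Store only the file name, not the full local path.
            tar.add(str(f), arcname=f.name)
    return len(files)
```

A call such as pack_images("images", "dataset.tar") then yields a single file to download instead of thousands of individual images.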