DepositionMmCif

Content

Introduction

Relevant files from refinement

From BUSTER
From other refinement packages

Additional files from data processing

From autoPROC
From STARANISO
Combining deposition-ready files from autoPROC (or STARANISO) with data from various refinement programs using aB_deposition_combine

General notes

Notes regarding wwPDB deposition

Additional examples, FAQs and use cases

Introduction

The information here (as of October 11th, 2022) is kept as up-to-date as possible - but be prepared for changes at external sites or in software systems beyond our control.

Our refinement program BUSTER has provided mmCIF files ready for deposition to the wwPDB since the 20190214 release: these are automatically generated at the end of each refinement. To stay aware of recent developments in this area, please:

check the current release notes,

ensure your BUSTER installation is up-to-date,

see if the issues page has release-specific information,

be careful about using conversion or harvesting tools (like PDB_EXTRACT or SF_CONVERT) as part of the deposition process. These are, for example, not always aware of autoPROC as a processing package - or of the type of analysis STARANISO performs - and are not necessarily up-to-date regarding the correct source (log files etc) for a particular mmCIF item. All this can easily result in confusion and in incorrect data making it into the archived file(s).

Relevant files from refinement

A refinement program should always provide two PDBx/mmCIF files ready for deposition: one for the model and one for the reflection data. In general, these would just be a different format/representation of the output PDB (model) and MTZ (reflection) files that a user might be more familiar with. If there are no deposition-ready PDBx/mmCIF files generated automatically, various conversion tools are available to generate those - but it is always better to use the PDBx/mmCIF files generated natively by the refinement program itself!

From BUSTER

The two relevant files for deposition of BUSTER refinement results are

BUSTER_model.cif: This file contains the atomic model as well as any non-standard restraint information (e.g. for a unique ligand or compound).

It also contains the refinement statistics (R values, number of reflections used etc).
It does not contain the data processing statistics ("Table 1" containing e.g. completeness, redundancy, CC1/2, Rpim etc): this needs to come from the data processing stage and extra care is needed to use the correct set of data quality metrics clearly associated with the data used during refinement!

BUSTER_refln.cif: This file can contain several so-called data blocks. The first one contains all the standard reflection data (observed amplitudes and their sigmas ¹, model structure factors, figure-of-merit, Hendrickson-Lattman coefficients, test-set flags - see also here) as well as map coefficients (mFo-DFc difference map and 2mFo-DFc map) for all observed reflections.

To accommodate different types of map coefficients (isotropically filled-in 2mFo-DFc and/or anisotropically filled-in 2mFo-DFc coefficients as defined by e.g. STARANISO), additional data blocks can be present, with the relevant information stored in the "_diffrn.details" item that describes each data block.

From other refinement packages

Please consult the relevant information of the refinement program used to find out how to generate those two deposition-ready PDBx/mmCIF files directly.

Additional files from data processing

See also here for additional background information.

The reflection file (in deposition-ready PDBx/mmCIF format) from refinement often doesn't contain the full reflection data available from the original data processing - but only a subset relevant to the final stages of refinement. Therefore, we would like to combine that reflection data subset with the richer reflection data from the data processing step leading to the input reflection data used during refinement.

From autoPROC

autoPROC provides its own deposition-ready PDBx/mmCIF files for reflection data - including all the processing (scaling and merging) data quality metrics.

These need to be combined with the model and reflection data files coming from BUSTER in order to have only two mmCIF files (one model and one reflection) as expected by the deposition system.

In most cases the relevant file(s) from data processing will be called

Data_1_autoPROC_STARANISO_all.cif (for anisotropic/STARANISO analysis - corresponds to staraniso_alldata-unique.mtz), or
Data_2_autoPROC_TRUNCATE_all.cif (for traditional/isotropic analysis - corresponds to truncate-unique.mtz)

When processing multi-sweep and/or multi-wavelength data there could be fairly self-explanatory insertions into the file naming to distinguish datasets.

A user needs to pick the relevant one, depending on what (MTZ) file from data processing was picked for further structure solution and refinement.

The aB_deposition_combine tool should be used to (1) detect/check for the correct processing result and (2) combine it with the two files from BUSTER - after taking potential re-indexing and space group (SG) differences into account. It will also ensure that the correct set of data quality metrics from data processing is transferred into the model mmCIF file (where they are kept for historical reasons within archived PDB entries and expected by the deposition system).

See aB_deposition_combine -h for further information about the usage.

Please ensure the use of autoPROC and/or STARANISO is correctly referenced.

If the data quality metrics should be provided manually to the deposition system, the relevant scaling/merging statistics for STARANISO data (staraniso_alldata-unique.mtz) are in staraniso_alldata-unique.table1. Please do not use PDB_EXTRACT with a randomly picked log file from that processing directory: it will extract incorrect values!

For the traditional (isotropic) data (truncate-unique.mtz) these are in truncate-unique.table1. Again, do not use PDB_EXTRACT with a randomly picked log file from that processing directory: it will extract incorrect values!

From STARANISO

See here for details.

Combining deposition-ready files from autoPROC (or STARANISO) with data from various refinement programs using aB_deposition_combine

Let's discuss what the aB_deposition_combine tool tries to achieve: for each reflection data block within the multi-block mmCIF files produced by autoPROC (which it tries to find within the directory given by the -aP flag), it will try and match up reflections with those reported by BUSTER (in BUSTER_refln.cif). The assumption is that the autoPROC result files in the given output directory were used without any major modification as input to BUSTER - e.g. the MTZ file staraniso_alldata-unique.mtz.

However, that might not have been the case here: if you e.g. used intermediate data (unscaled or even scaled intensities) with some other scaling program or a different procedure to go from intensities to amplitudes, the amplitudes in BUSTER_refln.cif (describing the data as input into refinement) will be different from those in e.g. Data_1_autoPROC_STARANISO_all.cif.

You might then see in the log file something like

NOTE : found 32/100 in
       ./process_03//Data_1_autoPROC_STARANISO_all.cif
       (1_staraniso)

NOTE : too few matches found

telling you that some matching amplitude/sigma pairs were found - but not quite enough (only 32 out of 100). Maybe some other changes (re-indexing or re-scaling? a changed SG assignment?) occurred between the autoPROC job and the final BUSTER refinement?

You can change some of the decision making e.g. by running

  aB_deposition_combine \
    autoBUSTER_DepositionCombine_FindProcessingCif_RandomHit=0.2 \
    ...

to allow for further analysis even if only 20% of the initial comparisons (between 100 random reflections) are successful. Or increase the autoBUSTER_DepositionCombine_FindProcessingCif_RandomFuz parameter (default = 0.02) to allow for more difference between amplitudes. However, you should also double check if ./process_03/ is the autoPROC result directory containing the actually used reflection data that went into BUSTER.
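The kind of check controlled by these two parameters can be pictured with a small sketch. This is not the actual aB_deposition_combine implementation: the function names, the data layout (dictionaries keyed by Miller index) and the treatment of the "fuzz" as a relative tolerance are assumptions made purely for illustration. Only the sample size of 100 and the default fuzz of 0.02 are taken from the text above.

```python
import random


def count_random_matches(refinement, processing, n_sample=100, fuzz=0.02, seed=0):
    """Compare (F, sigF) pairs for a random sample of reflections.

    Both inputs are dicts mapping (h, k, l) -> (F, sigF).
    Hypothetical helper: the real tool may sample and compare differently.
    """
    rng = random.Random(seed)
    keys = sorted(refinement)
    sample = rng.sample(keys, min(n_sample, len(keys)))

    def close(a, b):
        # Assumption: "fuzz" is interpreted as a relative tolerance.
        return abs(a - b) <= fuzz * max(abs(a), abs(b), 1e-12)

    hits = 0
    for hkl in sample:
        if hkl not in processing:
            continue
        (f1, s1), (f2, s2) = refinement[hkl], processing[hkl]
        if close(f1, f2) and close(s1, s2):
            hits += 1
    return hits, len(sample)


def enough_matches(hits, total, min_fraction):
    # min_fraction plays the role of the RandomHit setting (e.g. 0.2).
    return total > 0 and hits / total >= min_fraction
```

With such a scheme, unmodified data would give 100/100 hits, whereas any rescaling of the amplitudes beyond the tolerance would bring the count down sharply. Loosening either parameter only hides that difference, so it is worth understanding its origin first.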

Another potential problem can be so-called daisy-chaining of reflection data - e.g. taking the staraniso_alldata-unique.mtz from autoPROC into MR, then the output MTZ file from that MR step into refinement program A and the resulting output MTZ file into refinement with program B. That is always a recipe for confusion with potential data modification or rescaling happening.

In our hands, if e.g. the staraniso_alldata-unique.mtz file was taken as-is for refinement you should then see 100/100 reflections matching and the tool creating the final combined versions without any problem.

General notes

Make sure to deposit the "all" versions of Rmerge, Rmeas (=Rrim) and Rpim and not the "within" ones! Do not deposit R-values from XDS/XSCALE if run with the FRIEDEL'S_LAW=FALSE flag set: these correspond to the "within" version and don't describe the merged intensity (IMEAN) values themselves.
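To illustrate why the two versions differ, here is a minimal sketch using the commonly quoted definitions of Rmerge, Rmeas and Rpim. The function names and the observation layout (unique index, '+' or '-' for the Friedel mate, intensity) are hypothetical, and restricting the sums to multiply measured reflections is one common convention, not necessarily what a given program does. The only difference between "all" and "within" in this sketch is whether I+ and I- observations are grouped together or kept apart.

```python
from collections import defaultdict
from math import sqrt


def group_observations(observations, keep_friedel_separate):
    """Group intensities per unique reflection ("all") or per Friedel mate ("within")."""
    groups = defaultdict(list)
    for hkl, sign, intensity in observations:
        key = (hkl, sign) if keep_friedel_separate else hkl
        groups[key].append(intensity)
    return list(groups.values())


def merging_r_values(groups):
    """Return (Rmerge, Rmeas, Rpim) for groups of repeated intensity measurements."""
    num_merge = num_meas = num_pim = denom = 0.0
    for obs in groups:
        n = len(obs)
        if n < 2:
            continue  # a single measurement says nothing about agreement
        mean = sum(obs) / n
        dev = sum(abs(i - mean) for i in obs)
        num_merge += dev
        num_meas += sqrt(n / (n - 1)) * dev
        num_pim += sqrt(1 / (n - 1)) * dev
        denom += sum(obs)
    if denom == 0:
        raise ValueError("no multiply measured reflections")
    return num_merge / denom, num_meas / denom, num_pim / denom


# Toy data: I+ and I- differ systematically (e.g. anomalous signal).
obs = [((1, 0, 0), "+", 100.0), ((1, 0, 0), "+", 104.0),
       ((1, 0, 0), "-", 90.0), ((1, 0, 0), "-", 94.0)]
r_all = merging_r_values(group_observations(obs, keep_friedel_separate=False))
r_within = merging_r_values(group_observations(obs, keep_friedel_separate=True))
```

In this toy case the "within" values come out lower than the "all" values, because the difference between I+ and I- never enters the comparison. They therefore do not describe the agreement behind the merged IMEAN values.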

Remember that completeness is a measure of the number of actually observed reflections relative to the number of reflections that are expected to be theoretically observable.

For anisotropic (STARANISO) data, only the reflections within the ellipsoid fitted to the cut-off surface (as determined by STARANISO) would ever be observable. So as a measure of the quality of the experiment (how well was it designed to record all observable reflections), the "Completeness (ellipsoidal)" is the correct value to look at.

The "Completeness (spherical)" assumes that all reflections within a sphere (or spherical shell) could be observed. This is true for isotropic data or for the (lower resolution) enclosed sphere of anisotropic data - but falls short for anisotropic data. However, taken together with the "Completeness (ellipsoidal)" it gives a good idea about the extent of anisotropy the data shows.

It should be noted that these two definitions are actually very similar - just that the cut-off surface is defined differently:
The isotropic analysis assumes isotropy and therefore enforces a spherical cut-off surface.
The anisotropic analysis makes no such assumption and therefore doesn't enforce a particular shape on the cut-off surface - and, contrary to popular belief, it neither assumes nor enforces anisotropy!
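The relation between the two completeness values can be shown with a toy model. Everything in this sketch is an assumption for illustration: reciprocal space is a simple cubic grid, the cut-off surface is an axis-aligned ellipsoid, and the experiment is idealised so that every reflection inside the ellipsoid was measured. It is not how STARANISO determines its cut-off surface.

```python
from itertools import product


def inside_ellipsoid(point, semi_axes):
    """True if a reciprocal-space point lies within an axis-aligned ellipsoid."""
    return sum((c / a) ** 2 for c, a in zip(point, semi_axes)) <= 1.0


def toy_completeness(semi_axes, step=0.1):
    """Return (spherical, ellipsoidal) completeness for an idealised experiment."""
    r_max = max(semi_axes)  # the sphere extends to the best diffraction limit
    n = int(round(r_max / step))
    grid = [tuple(i * step for i in idx)
            for idx in product(range(-n, n + 1), repeat=3)]
    in_sphere = {p for p in grid if inside_ellipsoid(p, (r_max,) * 3)}
    in_ellipsoid = {p for p in grid if inside_ellipsoid(p, semi_axes)}
    observed = in_ellipsoid  # everything observable was actually measured
    return (len(observed & in_sphere) / len(in_sphere),
            len(observed & in_ellipsoid) / len(in_ellipsoid))


spherical, ellipsoidal = toy_completeness((1.0, 1.0, 0.5))
```

For data diffracting half as far in one direction, the ellipsoidal completeness of this perfect experiment is 1.0 while the spherical value is only about one half. With three equal semi-axes both values coincide.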

Be careful when using other automatic data extraction tools, since they might use a different file/source for those metrics: always check the deposited values with the values clearly presented by the actually used data processing software (e.g. autoPROC output like summary.html or the PDF reports, the STARANISO server results page etc).

See also this Gemmi tool for help in preparing reflection data mmCIF files containing both merged and unmerged data blocks - in case you didn't use autoPROC (where those data blocks are already contained in the mmCIF files).

This is ongoing work together with the wwPDB PDBx/mmCIF Working Group - for more details see:

Notes regarding wwPDB deposition/validation

The above information should provide adequate instructions when using both autoPROC+STARANISO and BUSTER for the data processing and refinement stages ... but what if different systems were used? A user might encounter a wwPDB deposition/validation problem similar to this:

So I tried to run a wwPDB validation job by submitting as mmCIF
structure factor file Data_1_autoPROC_STARANISO_all.cif, and as
mmCIF coordinate file the cif file output by phenix.refine. However,
I still get a "Structure factor file is missing freeR set" error.

We have to remember the distinction between the information given in the _refln.pdbx_r_free_flag column of a reflection mmCIF data block and the notion of a "freeR set" mentioned in the error message one gets.

The _refln.pdbx_r_free_flag assigns a numerical value to each reflection and there are basically three different systems currently or historically used in crystallography:

values of either 0 or 1 (with e.g. 5% random reflections given a 1)

values of either 0 or 1 (with e.g. 5% random reflections given a 0)

a random number between 0 and 19 (so each number represents 5% of reflections)

At the point of data processing it is not yet clear which set of reflections will be used as the test-set (i.e. excluded from refinement and used e.g. for computation of an Rfree value). Of course, in the first two situations one can hazard an educated guess, since probably no one would exclude 95% of reflections and only use 5%.
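That educated guess can be written down as a short sketch. The function name is hypothetical and the logic is deliberately conservative: it only answers for a two-valued 0/1 column with a clear minority, and gives up otherwise.

```python
from collections import Counter


def guess_test_set_value(flags):
    """Return the flag value that most likely marks the test-set, or None if unclear."""
    counts = Counter(flags)
    if set(counts) == {0, 1} and counts[0] != counts[1]:
        # Binary convention: the minority value is presumably the test-set.
        return min(counts, key=counts.get)
    # e.g. values 0..19: any of them could have been chosen at refinement time.
    return None
```

For the third convention (values 0 to 19) no such guess is possible from the reflection file alone, which is why the choice made at refinement time has to be recorded somewhere.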

When it comes to refinement, the program needs to be told which test-set flag value (_refln.pdbx_r_free_flag in mmCIF or FreeR_flag/FREE etc in MTZ) marks reflections as belonging to the test-set.

This decision by the user - or most likely the refinement program using some defaults - needs to be communicated to the deposition system, which could be done in two ways:

Upon deposition, provide the test-set flag value (_refln.pdbx_r_free_flag) that was used during refinement - in the same way that one can tell the refinement program which value to use. As far as we know this is not currently possible.

Alternatively, add another item to the _refln category (_refln.status) to provide a simple marker (ultimately based on the _refln.pdbx_r_free_flag value and a user/refinement program decision). This can only be done to the output reflection file of the refinement program.
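The second option amounts to a simple mapping, sketched here under the assumption that only two of the _refln.status codes are needed: "f" for test-set reflections and "o" for observed reflections used in refinement. Real files can carry further codes (e.g. for unobserved or excluded reflections), which this hypothetical helper ignores.

```python
def refln_status(free_flags, test_set_value):
    """Map each _refln.pdbx_r_free_flag value to a _refln.status code."""
    # test_set_value is the decision taken by the user or the refinement program.
    return ["f" if flag == test_set_value else "o" for flag in free_flags]
```

The same flag column gives a different status column depending on test_set_value, which is exactly the piece of information a data processing program cannot know.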

The current situation can quickly give the following understandable impression

Thus it is not quite true that Data_1_autoPROC_STARANISO_all.cif is
"deposition-ready", because it appears to be so only if combined
with BUSTER-derived coordinates.

because of (1) the loss of information and provenance when going from data processing into subsequent use of the reflection data and (2) the assumptions the deposition system makes about what a "typical" set of two mmCIF files should look like. What we provide in the autoPROC+BUSTER world is a way of combining the different deposition-ready files in order to create two files (model and reflection) from the following input:

reflection mmCIF from data processing with several data blocks:

first one assumed to be the one typically used for refinement

all contain a _refln.pdbx_r_free_flag item

data quality metrics ("Table 1") for each data block - since they are different depending on the type of data (with/without data cut-off, with/without anisotropic correction, early or late subsets etc)

reflection mmCIF from refinement (often with several data blocks in order to store/represent different sets of map coefficients):

first one reporting the input data (unaltered in case of BUSTER) and containing a _refln.status item

model mmCIF file containing refinement statistics

We then need to do the following in order for the validation/deposition system to be happy:

Check for identity of the first data block in the refinement reflection mmCIF file with one of the data blocks in the data processing reflection mmCIF file (usually this is the first one);

to ensure we are going to combine the correct reflection data files;

if there was some kind of provenance tracking between data processing and refinement (e.g. via ISPyB or as part of a closed system like CCP4i2, XChem or such) this check would not be needed ... but it doesn't hurt to be extra vigilant and to check.

Transfer the data quality metrics ("Table 1") from the found data processing mmCIF file into the refinement model mmCIF file - since the deposition system expects to find this block there (for historical reasons going back to the times where no diffraction data was deposited at all).

Combine the two reflection files in the right order, i.e. first the data blocks as output from refinement followed by data blocks from data processing:

order matters here because the validation/deposition system only really looks at the first data block (e.g. for that _refln.status flag) and simply carries the other data blocks through into the archive.
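The three steps above can be sketched as follows. This is a conceptual outline only, not the internals of aB_deposition_combine: data blocks are represented as plain dictionaries instead of parsed mmCIF, and the function name, the min_fraction threshold and the relative tolerance are all hypothetical.

```python
def combine_for_deposition(refinement_blocks, processing_blocks, model,
                           min_fraction=0.9, fuzz=0.02):
    """Return (model with Table 1, ordered list of reflection data blocks).

    Each block is a dict with a "refln" entry mapping (h, k, l) -> (F, sigF);
    processing blocks also carry a "table1" dict of data quality metrics.
    """
    first = refinement_blocks[0]["refln"]

    def fraction_matching(candidate):
        if not first:
            return 0.0
        ok = sum(
            1 for hkl, pair in first.items()
            if hkl in candidate
            and all(abs(a - b) <= fuzz * max(abs(a), abs(b))
                    for a, b in zip(pair, candidate[hkl]))
        )
        return ok / len(first)

    # Step 1: find the processing block that the refinement input came from.
    source = next((b for b in processing_blocks
                   if fraction_matching(b["refln"]) >= min_fraction), None)
    if source is None:
        raise ValueError("no processing data block matches the refinement input")

    # Step 2: transfer the data quality metrics into the model file.
    combined_model = dict(model, table1=source["table1"])

    # Step 3: refinement blocks first, then all processing blocks.
    combined_reflections = list(refinement_blocks) + list(processing_blocks)
    return combined_model, combined_reflections
```

Raising an error when nothing matches, rather than falling back to the first processing block, reflects the point made above: combining the wrong files is worse than not combining them at all.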

So we would argue that all our mmCIF files are deposition ready after all (they contain the complete and correct information) - just that the validation/deposition system has certain assumptions that are tricky to meet: a reflection file from data processing will never contain a _refln.status flag since this is a derived quantity computed within downstream processes.

One might then rightly ask:

Surely it would be desirable to allow users to deposit
autoPROC-derived data, even if for whatever reason they used for
refinement a package different from BUSTER?

Absolutely - other refinement packages/systems should probably provide a similar tool to our aB_deposition_combine to combine the often much richer reflection mmCIF from data processing with the more limited reflection mmCIF from refinement ... at least as long as the deposition system itself lacks the flexibility to allow for the same combination steps and checks outlined above.

Remember that one can always "just" deposit the reflection and model mmCIF files coming out of refinement (any package) - with certain significant limitations:

the data processing statistics ("Table 1") are not available inside the model mmCIF file and need to be provided manually (with a lot of potential for mistakes);

the richer information from data processing (as e.g. provided by autoPROC with early/late datasets, uncorrected data or data without cut-off) is lost;

the actual "observed data" might have been modified by the refinement program using model information (e.g. re-scaling of Fobs to Fcalc by default) - which (as a side-note) is different from e.g. an anisotropic correction from STARANISO done with purely the data itself.

¹ These are unmodified from the input data!