

Best practice

A normal autoPROC (20240710 and later versions) run will create a whole range of reflection data in MTZ or mmCIF format. Not all of them are worth keeping: it is always best to refer to the main summary.html file which will describe all relevant output (anything not mentioned in there is of intermediate nature and can be ignored). If you want to keep/transfer only a single file: take the summary.tar.gz file - which will (once unpacked) provide the summary.html and all relevant files.

  • As a working file for the scaled+merged reflection data, the most important (only?) file you need to take is the staraniso_alldata-unique.mtz file.
    • We do not recommend using the truncate-unique.mtz file.
    • If you are looking for the scaled+unmerged reflection data, use the aimless_alldata_unique.mtz file (files like XDS_ASCII.HKL are unrelated to the actual final and recommended merged reflection data!).
  • For deposition at a later point you also want to take the Data_1_autoPROC_STARANISO_all.cif file. But as mentioned above: summary.tar.gz contains everything.

User problems with OneDep ("missing freeR set") - Oct 2022

The information on DepositionMmCif should provide adequate instructions when using both autoPROC+STARANISO and BUSTER for the data processing and refinement stages ... but what if different systems were used? A user might encounter a wwPDB deposition/validation problem similar to this:

So I tried to run a wwPDB validation job by submitting as mmCIF
structure factor file Data_1_autoPROC_STARANISO_all.cif, and as
mmCIF coordinate file the cif file output by phenix.refine. However,
I still get a "Structure factor file is missing freeR set" error.

We have to remember the distinction between the information given in the _refln.pdbx_r_free_flag column of a reflection mmCIF data block and the notion of a "freeR set" mentioned in the error message one gets.

  • The _refln.pdbx_r_free_flag assigns a numerical value to each reflection and there are basically three different systems currently or historically used in crystallography:
    • values of either 0 or 1 (with e.g. 5% random reflections given a 1)
    • values of either 0 or 1 (with e.g. 5% random reflections given a 0)
    • a random number between 0 and 19 (so each number represents 5% of reflections)
  • At the point of data processing it is not yet clear which set of reflections will be used as the test-set (i.e. excluded from refinement and used e.g. for computation of an Rfree value). Of course, in the first two situations one can hazard an educated guess, since probably no one would exclude 95% of reflections and only use 5%.
  • When it comes to refinement, the program needs to be told which test-set flag value (_refln.pdbx_r_free_flag in mmCIF or FreeR_flag/FREE etc in MTZ) marks reflections as belonging to the test-set.
  • This decision by the user - or most likely the refinement program using some defaults - needs to be communicated to the deposition system, which could be done in two ways:
    • Upon deposition, provide the test-set flag value (_refln.pdbx_r_free_flag) that was used during refinement - in the same way that one can tell the refinement program which value to use. As far as we know this is not currently possible.
    • Alternatively, add another item to the _refln category (_refln.status) to provide a simple marker (ultimately based on the _refln.pdbx_r_free_flag value and a user/refinement program decision). This can only be done to the output reflection file of the refinement program.
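
The "educated guess" mentioned above can be written down in a few lines. This is a minimal sketch in Python, assuming the test-set flags have already been read into a list of integers. The function name guess_free_value is hypothetical and not part of autoPROC, BUSTER or OneDep. It only covers the two-valued conventions and deliberately refuses to guess for the multi-valued (e.g. 0 to 19) convention, where the choice has to come from the refinement program or the user.

```python
from collections import Counter

def guess_free_value(flags):
    """Return the likely free-set flag value, or None if it cannot be inferred.

    flags: iterable of integer test-set flags, one per reflection.
    """
    counts = Counter(flags)
    if len(counts) != 2:
        # 0..19 style (or a single value): all sets have similar size,
        # so the data alone cannot tell us which one is the free set.
        return None
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    if n_minor >= n_major:
        # 50/50 split: no sensible guess possible.
        return None
    # The minority value is assumed to mark the free set.
    return minor

print(guess_free_value([1] * 5 + [0] * 95))           # 5% flagged with 1
print(guess_free_value([0] * 5 + [1] * 95))           # 5% flagged with 0
print(guess_free_value([i % 20 for i in range(200)])) # 20 test-sets
```

The first two calls return the minority value (1 and 0 respectively), the third returns None: with 20 equally sized test-sets, the reflection file alone cannot say which one was set aside.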

The current situation can quickly give the following understandable impression

Thus it is not quite true that Data_1_autoPROC_STARANISO_all.cif is
"deposition-ready", because it appears to be so only if combined
with BUSTER-derived coordinates.

because of (1) the loss of information and provenance when going from data processing into subsequent use of reflection data and (2) the assumptions the deposition system makes about what a "typical" set of two mmCIF files should look like. What we provide in the autoPROC+BUSTER world is a way of combining the different deposition-ready files in order to create two files (model and reflection) from the following input:

  • reflection mmCIF from data processing with several data blocks:
    • first one assumed to be the one typically used for refinement
    • all contain a _refln.pdbx_r_free_flag item
    • data quality metrics ("Table 1") for each data block - since they are different depending on the type of data (with/without data cut-off, with/without anisotropic correction, early or late subsets etc)
  • reflection mmCIF from refinement (often with several data blocks in order to store/represent different sets of map coefficients):
    • first one reporting the input data (unaltered in case of BUSTER) and containing a _refln.status item
  • model mmCIF file containing refinement statistics

We then need to do the following in order for the validation/deposition system to be happy:

  • Check for identity of the first data block in the refinement reflection mmCIF file with one of the data blocks in the data processing reflection mmCIF file (usually this is the first one);
    • to ensure we are going to combine the correct reflection data files;
    • if there was some kind of provenance tracking between data processing and refinement (e.g. via ISPyB or as part of a closed system like CCP4i2, XChem or such) this check would not be needed ... but it doesn't hurt to be extra vigilant and to check.
  • Transfer the data quality metrics ("Table 1") from the found data processing mmCIF file into the refinement model mmCIF file - since the deposition system expects to find this block there (for historical reasons going back to the times where no diffraction data was deposited at all).
  • Combine the two reflection files in the right order, i.e. first the data blocks as output from refinement followed by data blocks from data processing:
    • order matters here because the validation/deposition system only really looks at the first data block (e.g. for that _refln.status flag) and will simply carry the other data blocks through into the archive.
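
The identity check and the ordering step above can be sketched as follows. This is an illustration in Python under simplifying assumptions, not the implementation of aB_deposition_combine. Each data block is represented by a hypothetical dict with the keys "name" and "data" (mapping (h, k, l) to (F, sigF)), and identity is tested by exact equality, whereas real files would need a tolerance for rounding.

```python
def combine_blocks(refinement_blocks, processing_blocks):
    """Check provenance and order data blocks: refinement output first.

    Returns the name of the matching data processing block (whose
    "Table 1" metrics would go into the model file) and the ordered list.
    """
    # First refinement block is assumed to report the (unaltered) input data.
    reference = refinement_blocks[0]["data"]
    matches = [b["name"] for b in processing_blocks if b["data"] == reference]
    if not matches:
        raise ValueError("refinement input matches no data processing block")
    # Refinement blocks go first, data processing blocks follow.
    return matches[0], refinement_blocks + processing_blocks


refinement = [
    {"name": "refine_input", "data": {(1, 0, 0): (10.0, 1.0)}},
    {"name": "map_coeffs", "data": {}},
]
processing = [
    {"name": "aniso", "data": {(1, 0, 0): (10.0, 1.0)}},
    {"name": "iso", "data": {(1, 0, 0): (12.0, 1.0)}},
]
matched, ordered = combine_blocks(refinement, processing)
print(matched, [b["name"] for b in ordered])
```

Raising an error when no block matches reflects the purpose of the check: combining reflection files that do not belong together should fail loudly rather than produce a plausible-looking deposition.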

So we would argue that all our mmCIF files are deposition-ready after all (they contain the complete and correct information) - it is just that the validation/deposition system makes certain assumptions that are tricky to meet: a reflection file from data processing will never contain a _refln.status flag, since this is a derived quantity computed within downstream processes.

One might then rightly ask:

Surely it would be desirable to allow users to deposit
autoPROC-derived data, even if for whatever reason they used for
refinement a package different from BUSTER?

Absolutely - other refinement packages/systems should probably provide a similar tool to our aB_deposition_combine to combine the often much richer reflection mmCIF from data processing with the more limited reflection mmCIF from refinement ... at least as long as the deposition system itself lacks the flexibility to allow for the same combination steps and checks outlined above.

Remember that one can always "just" deposit the reflection and model mmCIF files coming out of refinement (any package) - with certain significant limitations:

  • the data processing statistics ("Table 1") are not available inside the model mmCIF file and need to be provided manually (with a lot of potential for mistakes);
  • the richer information from data processing (as e.g. provided by autoPROC with early/late datasets, uncorrected data or data without cut-off) is lost;
  • the actual "observed data" might have been modified by the refinement program using model information (e.g. re-scaling of Fobs to Fcalc by default) - which (as a side-note) is different from e.g. an anisotropic correction from STARANISO done with purely the data itself.

Test-set flag principles - Apr 2026

The following OneDep message ("Structure factor file is missing freeR set") is causing our users a lot of headache:

Screenshot_2026-04-15_15-53-12.png

Here is our assessment of it:

  • The message is triggered by a _refln loop within a data block not having a _refln.status item.
  • The reflection data file we provide contains multiple datablocks - most are for merged reflection data (_refln category), but also one for unmerged reflection data (_diffrn_refln category).
  • The message text is slightly misleading and unclear for users, since all our merged datablocks contain a test-set flag.
    • Remember that there is a difference between a "test-set flag" (assigning a reflection to one of N test-sets) and the "free-set" (setting one of those test-sets aside from refinement to compute e.g. R-free - in which case one could call it the "rfree-set"). Precise nomenclature and correct wording/usage matter.
  • Our provided multi-datablock reflection data file will contain data directly from the data processing step, plus the data that went into refinement and the data that came out of refinement (e.g. map coefficients).
    • Data from data processing can be richer and more varied than what was finally used in refinement (e.g. F(early) and F(late) amplitudes for radiation damage detection maps).
  • The PDBx/mmCIF dictionary defines the _refln.status item as:
       Classification of a reflection so as to indicate its status with
       respect to inclusion in the refinement and the calculation of R
       factors.
  • As we see it, what is causing misunderstanding on the wwPDB OneDep side and causing problems to our users is as follows:
    • At the point of data-processing there is no knowledge of whether (i) a specific reflection will be used in the final refinement step (before deposition), or (ii) if it is included, how it will be used for the computation of R factors (R_work, R_free etc).
    • The test-set flags (usually integer numbers between 0 and N, where N is often 19, leading to 5% of reflections in each test-set) as assigned during data processing cannot necessarily be assumed to be the same as those actually used during final refinements. They might for instance have been auto-assigned during automatic processing at a synchrotron site and then replaced with a set of reference test-set flags (often computed to a very high resolution to cover any upcoming dataset collected e.g. during a large scale fragment screening campaign).
    • Even if they were identical, the selection of a particular test-set as the free set is done at the point of refinement (selecting e.g. test-set 0 as the free-set, or test-set 1, or another one - or even doing complete cross-validation by using each test-set in turn).
    • Refinement will often use a subset of reflections (low or high resolution limits, outlier rejection etc) so that the status flag as defined above can only be uniquely assigned to the output file of the last refinement step.
      • This is why BUSTER will place that data as the first datablock of the multi-datablock reflection file.
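
To see in advance which data blocks are likely to trigger the message, one can list every data block that has _refln items but no _refln.status item. This is a minimal sketch in Python based on our assessment above, not on OneDep's actual code. It is a line-based scan rather than a full CIF parser (it ignores multi-line text fields, for example), and the function name blocks_missing_status is hypothetical.

```python
def blocks_missing_status(cif_text):
    """Return names of data blocks with _refln items but no _refln.status."""
    missing = []
    name, tags = None, set()

    def flush():
        # Record the block just finished if it has _refln items without status.
        has_refln = any(t.startswith("_refln.") for t in tags)
        if name is not None and has_refln and "_refln.status" not in tags:
            missing.append(name)

    for line in cif_text.splitlines():
        fields = line.split()
        token = fields[0] if fields else ""
        if token.lower().startswith("data_"):
            flush()
            name, tags = token[5:], set()
        elif token.startswith("_"):
            tags.add(token.lower())  # CIF item names are case-insensitive
    flush()
    return missing


example = """\
data_refinement
loop_
_refln.index_h
_refln.index_k
_refln.index_l
_refln.status
_refln.F_meas_au
1 0 0 o 10.0
data_processing
loop_
_refln.index_h
_refln.pdbx_r_free_flag
1 0
data_unmerged
loop_
_diffrn_refln.index_h
1
"""
print(blocks_missing_status(example))
```

In this toy example only the data processing block is reported: the refinement block carries _refln.status, and the unmerged block uses the _diffrn_refln category and is therefore not considered.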

The official wwPDB documentation has

  • Structure Factors
    • The structure factor file can either be in mmCIF or mtz formats and should include h, k, l, F, SigmaF (and/or I and SigmaI) and test flags.

But:

  • F+SigmaF or I+SigmaI are not "structure factors", they are "structure factor amplitudes" or "intensities" (plus associated standard uncertainties). Therefore, the reflection data file associated with a PDB entry from a diffraction experiment can't be (and should never have been) called a "structure factor file".
  • "test flags" are a way to associate a given merged reflection with a particular test set (often numbered 0 to 19 to have 5% of reflections in each test set) - without giving us any information about what those test sets are used for. We need an additional "rfree flag" per reflection to tell us which reflections were used for computing Rwork and which for computing Rfree. Most refinement programs will actually use the "test flag" and a parameter (e.g. "0" or "1") indicating that all reflections with a test set flag of that value will be used for the Rfree computation, while all others go into the Rwork computation.
  • The URL doesn't contain any definitions or format descriptions - only a few examples.
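
The relation between a "test flag" and an "rfree flag" can be illustrated with a short Python sketch. It assumes that only two of the _refln.status codes from the PDBx/mmCIF dictionary are needed ("o" for reflections used in refinement, "f" for those set aside for the free R factor). It ignores all other codes (e.g. for reflections outside the resolution limits or rejected as outliers), and the function name refln_status is hypothetical.

```python
def refln_status(test_flag, free_value):
    """Map one reflection's test-set flag onto a _refln.status code.

    free_value is the parameter given to the refinement program: the
    test-set flag value that marks the free set.
    """
    return "f" if test_flag == free_value else "o"


test_flags = [0, 1, 2, 19, 0]

# Same test flags, two different refinement-time choices of free set.
status_if_0 = [refln_status(t, free_value=0) for t in test_flags]
status_if_1 = [refln_status(t, free_value=1) for t in test_flags]
print(status_if_0)
print(status_if_1)
```

The same list of test flags gives two different status columns depending on free_value, which is why the status can only be assigned once the refinement-time choice is known.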

Therefore:

  • It does not make sense to require that every merged datablock in a reflection data mmCIF file should carry a _refln.status item in a _refln loop.
  • Users should not be required to transfer refinement-specific information (like that _refln.status item) back onto original reflection data from data processing - especially because
    • this is nearly impossible to do correctly and completely (see above), and
    • the wwPDB sites are currently not doing anything with those additional datablocks anyway as far as we know.

This incorrect warning/error message from OneDep has caused a lot of problems over several years and we have tried to support users on a case-by-case basis to get the data through the annotation process unchanged. Unfortunately, this has not triggered a fix in the OneDep system as-is. We feel very strongly that the bogus check (_refln.status in datablocks N>1) should be removed from OneDep as a matter of urgency: it is preventing a lot of our users from depositing metadata-rich, multi-datablock reflection data, whereas the prevailing Zeitgeist is that depositions should be as "rich" as possible to help train Machine Learning engines.


Reflection data file warning messages - Apr 2026

If you see the following block of warnings

6-back-to-square-one.png

you have to remember first that

  • the message about "missing freeR set" is incorrect;
  • you should never switch to using pdb_extract, which would in the best case result in a much poorer set of reflection data with far less metadata - and in the worst case in completely wrong values.

Also:

The warnings about "unwanted CIF item" are misleading

  • _audit.creation_method is a standard PDBx/mmCIF item that we use to record which procedure/program created that datablock.
  • OneDep seems to assume that only OneDep itself can populate that category.

Why OneDep is complaining about the _diffrn_radiation_wavelength.wt item is unclear:

  • it is apparently present in 82.2% of all PDB entries
  • maybe OneDep again thinks it has sole responsibility for that item

The warnings about an "abnormal" value are incorrect

  • the assessment of "abnormal" is not encoded into the PDBx/mmCIF dictionary itself as far as we can see
  • this must be a OneDep-specific internal check

The warnings about missing "mandatory items" are incorrect

                    The dataset used for the refinement should be listed as a first
                    data block and should contain diffraction indices h,k,l, observed
                    amplitudes and/or intensities, their respective sigma values
                    and refinement test set.
    • That soft (?) rule clearly shouldn't apply to those additional data blocks
  • data blocks 2 and 3 contain only map coefficients
  • since there is only one possible set of item names to hold electron density structure factors (amplitude _refln.pdbx_FWT and phase _refln.pdbx_PHWT), we need to use additional data blocks to hold additional versions of electron density
    • the main data block contains the 2mFo-DFc map coefficients for all observations (i.e. where we have a Fo/Fobs value)
    • the additional data blocks contain the map coefficients with DFc completion applied, i.e. reflections with missing Fo/Fobs are given the DFc term - which is better than using them with a zero value in computing maps
    • BUSTER distinguishes between the iso-fill (any reflection to the highest diffraction limit of any HKL) and the aniso-fill (a reflection is classified as missing if a tool like STARANISO assigned it as "observable" given the average significance of surrounding observations).
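
The idea of DFc completion described above can be sketched in Python as follows. This is a simplified illustration and not BUSTER's implementation: it treats only the amplitude of an acentric reflection, takes the figure of merit m and the scale factor D as given inputs, and uses the hypothetical function name map_amplitude.

```python
def map_amplitude(m, fo, d, fc):
    """Return a 2mFo-DFc amplitude, or DFc when Fo is missing.

    m:  figure of merit for this reflection
    fo: observed amplitude, or None if the reflection was not measured
    d:  scale factor D applied to the calculated amplitude
    fc: calculated amplitude from the model
    """
    if fo is None:
        # Completion: use the DFc term instead of leaving the term at zero.
        return d * fc
    return 2.0 * m * fo - d * fc


observed = map_amplitude(m=0.9, fo=100.0, d=0.8, fc=90.0)
completed = map_amplitude(m=0.9, fo=None, d=0.8, fc=90.0)
print(observed, completed)
```

Which reflections count as "missing" (and therefore receive the DFc term) is exactly where the iso-fill and aniso-fill variants differ. That selection is outside the scope of this sketch.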

Bottom line: we think that all those warnings are either misleading or incorrect and can be ignored.


I've lost some files

What should you do if you "lost" those important files (summary.tar.gz or Data_1_autoPROC_STARANISO_all.cif) but you still want to deposit rich, multi-datablock reflection data?

If you only have aimless_alldata_unmerged.mtz, aimless_alldata.mtz, staraniso_alldata-unique.mtz, staraniso_alldata-unique.cif and staraniso_alldata.log, you could run

 % aP_deposition_prep -p 1
 % gemmi mtz2cif --no-comments --no-history --separate aimless_alldata.mtz aimless_alldata_unmerged.mtz 2_aimless_alldata.cif
 % cat 1_autoPROC_STARANISO_all.cif 2_aimless_alldata.cif > Data_1_autoPROC_STARANISO_all.cif

This should create the mmCIF file Data_1_autoPROC_STARANISO_all.cif, which is similar to the one originally created by autoPROC itself (but with less rich metadata).

Now you should be able to run

  % aB_deposition_combine -aP Data_1_autoPROC_STARANISO_all.cif BUSTER_model.cif BUSTER_refln.cif

to get two deposition-ready mmCIF files

aB_deposition_combine_model.cif
aB_deposition_combine_refln.cif