We will be working in a separate directory - and create two subdirectories in there (one for the deposited data and one as a work directory): you should be able to just cut-and-paste the commands given in the green code blocks below. Sometimes those command-lines are broken up using a continuation character (backslash) at the end of a line: this is standard shell syntax and makes them easier to read on the pages here.
Please note that this tutorial was written up using XDS version 20230630!
With a series of shell commands
mkdir Deposited cd Deposited # Raw diffraction data # use "curl -O" instead of "wget" if the latter is not available wget https://data.proteindiffraction.org/sgc/SETDB1-x207_5kco.tar.bz2 tar -xvf SETDB1-x207_5kco.tar.bz2 ln -s SETDB1-x207_5kco/data Images # MR search model # use "curl -O" instead of "wget" if the latter is not available wget -q "https://files.rcsb.org/download/3DLM.pdb" egrep "^LINK|^SSBOND|^CRYST1|^ATOM|^HETATM|^ANISOU|^TER" 3DLM.pdb > start.pdb # currently deposited model (for comparison): fetch_PDB_gemmi 5KCO | tee fetch_PDB_gemmi.log ln -s r5kcosf.mtz deposited.mtz ln -s 5kco.pdb deposited.pdb cd ..
we should now have all relevant (deposited) data in the subdirectory Deposited.
Running
mkdir Work cd Work # create some symbolic links for files to be used here ln -s ../Deposited/Images . ln -s ../Deposited/start.pdb . ln -s ../Deposited/deposited.pdb . ln -s ../Deposited/deposited.mtz . # getting some Grade2 restraint dictionaries # use "curl -O" instead of "wget" if the latter is not available wget https://www.globalphasing.com/buster/wiki/plugin/attachments/GPhLTutorials5KCO/6RO.restraints.cif
will get us all required files:
To check if our software is correctly set up and configured, we can run
buster_maponly -p deposited.pdb -m deposited.mtz -o maponly.mtz | tee maponly.logto get a set of map coefficients in the maponly.mtz output file:
This will also report the R/Rfree-values for this model (against the deposited reflection data):
Rwork/Rfree/Rall = 0.2020 0.2284 0.2033
This can be done in three different ways.
process -I Images -d process.01 | tee process.01.lis
This has the advantage of avoiding any kind of bias towards the actual crystal form (cell/SG) of that dataset: sometimes the crystal form changes e.g. due to a co-crystallisation with a particular compound. However, if this dataset comes in previously observed crystal form, the associated cell/SG is not enforced - which can become relevant when not all screw-axes are observed (so processing derives at P21212 just because the last 2-fold screw was not measured) or when a particular setting is required.
If working within a larger project with several datasets, the correct SG/cell might have to be manually set/adjusted and a consistent set of test-set flags needs to be established afterwards.
process -I Images \ cell="56 64 71 90 90 90" \ symm=P212121 \ -d process.02 | tee process.02.lis
This will enforce a specific cell/SG already at the indexing stage, which can be problematic if e.g. the unit-cell has doubled and the discarding of half of the (maybe weak) spots is thus enforced.
The test-set flags automatically created at the end of processing will not be consistent with any previously determined model/dataset (see use of check_indexing and add_freerflag.sh below).
process -I Images \ -ref deposited.mtz \ -d process.03 | tee process.03.lis
Here we not only enforce a specific cell/SG (extracted from the MTZ file header: see e.g. mtzana deposited.mtz or gemmi mtz deposited.mtz output), but will also ensure that the newly processed data is consistently indexed (for spacegroups that allow different equivalent indexing solutions). Finally, the same set of test-set flags will be used and (if necessary) extended to the full resolution limits of the newly processed data.
Some "highlights" from the processing results in process.03/summary.html:
The final data quality can be seen in the STARANISO 2D plots:
and the "Table 1" statistics:
Compare these to the deposited values.
This mode of processing is recommended when working within the same project and crystal form on a large number of datasets.
As we see a fair amount of unindexed spots in specific resolution ranges
we can try and fit those "spikes" to the known resolution ranges using
aP_fit_wvl_to_spots process.03/SPOT.XDS | tee aP_fit_wvl_to_spots.log
This gives us
NOTE : will use GXPARM file autoPROC/GXPARM.XDS
NOTE : found a a total of 42797 non-indexed spots
autoPROC/SPOT.XDS within ice-ring ranges : 4688 = 4.07 %
...
initial wavelength = 0.92819 (13.3576 keV)
distance = 247.447 mm
refined wavelength = 0.91662 (13.5262 keV) [factor = 0.98754]
distance = 247.492 mm [factor = 1.00018]
autoPROC/SPOT.XDS within ice-ring ranges : 5449 = 4.73 %
...
This is not a very clear ice-ring pattern, so the fit might not be very accurate/clear. Sometimes it can be useful to encounter a dataset with severe ice-rings during a particular beamline shift (and use that one to run this kind of analysis). If there is a significant error/drift in the energy actually used at the beamline compared to the one recorded in the metadata, it most likely is the same throughout a full shift and the corrected value (from the aP_fit_wvl_to_spots analysis) can be used throughout.
We could therefore re-run the data-processing with
process -I Images \ wave="0.91662" \ -ref deposited.mtz \ -d process.04 | tee process.04.lis
Why does it matter? Although the integrated intensities will be identical no matter what wavelength we are using, the associated unit cell will differ. With an incorrect value that unit cell will either be slightly too large ot too small, and during refinement the refinement program will try very hard to keep the geometry (think: "bonds") similar to the external restraints (e.g. Engh&Huber values for amino-acid residues). This will require a higher weight on the geometry part of refinement which corresponds to a lower weight on the X-Ray part - resulting in a worse fit to the experimental data and probably poorer electron-density and fiference maps. It might not be very obvious or relevant for the type of question one wants to ask the model/density in a specific project, but one never knows.
In this example, using a corrected wavelength value leads to a 40% higher weight on the X-Ray term while achieving the same geometry (as judged on rms(bond) values).
As we have seen, this dataset contains several distinct lattices that seem to be present throughout the image range. By default, autoPROC will pick the most populated one to use in all subsequent steps. If those additional lattices (or rather: "indexing solutions", since there could be other reasons for getting those additional solutions like twinning, incorrect crystal-detector distance or incorrect image width leading to overlaps, increase in cell dimensions due to radiation damage, etc) are of interest, one can pick a specific indexing solution by running
process -I Images \ wave="0.91662" \ -ref deposited.mtz \ XdsOptimizeIdxrefPickSolution=02 \ -d process_02.04 | tee process_02.04.lis process -I Images \ wave="0.91662" \ -ref deposited.mtz \ XdsOptimizeIdxrefPickSolution=03 \ -d process_03.04 | tee process_03.04.lis
If you are feeling really adventurous, you could try and combine those three datasets (since they probably come from distinct crystals) via
mkdir scale_01-02-03.04 ( cd scale_01-02-03.04 combine_files \ -f ../process.04/INTEGRATE.mtz -P A 01 1 \ -f ../process_02.04/INTEGRATE.mtz -P A 02 1 \ -f ../process_03.04/INTEGRATE.mtz -P A 03 1 \ -o combine.mtz | tee combine_files.log )
The whole process of scaling and merging is guided by the Project, Crystal, Dataset model used by CCP4 MTZ files. The important things to remember here:
The combination of reflection data (always use the INTEGRATE.mtz file!) is followed by scaling using
( cd scale_01-02-03.04 && aP_scale -mtz combine.mtz -freemtz ../deposited.mtz | tee aP_scale.log )Always run aP_scale in a separate directory: it creates a lot of files that might overwrite other, already existing files (or aP_scale itself might get confused by existing files with the same name).
You will see the decision making during the iterative scaling process at play: autoPROC will decide on different (isotropic) resolution cutoff values during the scale determination. The final set of scales is then applied to the full set of intensities before merging - and those scaled+merged intensities are then handed over to STARANISO for analysis: determination of anisotropic cut-off surface, application of that cut-off surface to intensities, conversion of intensities to amplitudes using the adequate anisotropic prior and (finall) correction of those amplitudes for the observed anisotropy.
The same refinement with BUSTER (using "-L" for ligand detection) gives R/Rfree of 0.2335/0.2581 for the "normal" data and 0.2292/0.2575 when refining against the data combined from the different lattices.
If no reference MTZ file was provided during data processing, the newly processed reflection data might require some transformations in relation to existing reference data. This can be done e.g. via
check_indexing -v deposited.mtz process.01/staraniso_alldata-unique.mtz check_indexing -v deposited.mtz process.02/staraniso_alldata-unique.mtz
In the case here (P212121 symmetry) nothing very interesting will happen: there are no alternative indexing possibilities, potentially missed screw axes or different settings.
Of course, there are other programs available to perform similar checks (see also our own aP_select_pdb for a variant): the important point is that any newly processed data should be consistent with any previously available data of the same crystal form to avoid confusion solely because of some trivial re-indexing or difference in settings.
Within a larger project, the test-set flags for a given crystal form should be the same across different datasets. This ensures that Rfree-values computed even after a minimal amount of refinement are meaningful and not biased. We provide a simple tool (that internally runs the usual CCP4 programs) which should be run after any indexing/setting ambiguity has been resolved:
add_freerflag.sh -f deposited.mtz -m process.01/staraniso_alldata-unique.mtz add_freerflag.sh -f deposited.mtz -m process.02/staraniso_alldata-unique.mtz
If the reference file deposited.mtz was less complete (e.g. lower resolution) than the MTZ file, the test-set flag will be extended. This would happen afresh for every new dataset handled this way - which is why the creation of a highly-optimistic reference MTZ file (with test-set flags to the highest resolution envisaged or hoped for) would be a good idea.
The aB_autorefine interface to BUSTER will run a series of individual BUSTER refinement jobs with some automatic decision making in between (when to use TLS, ADP, solvent model update, outlier rejection etc). This is not a quick process, but provides a very consistent and reliable way of getting a large number of structures to a reasnoable state for subsequent analysis and manual adjustment.
aB_autorefine \ -p start.pdb \ -m process.03/staraniso_alldata-unique.mtz \ -d aB_autorefine.01 | tee aB_autorefine.01.log
or (if we want to use the combined data from all three lattices that also used the correct wavelength):
aB_autorefine \ -p start.pdb \ -m scale_01-02-03.04/staraniso_alldata-unique.mtz \ -d aB_autorefine.02 | tee aB_autorefine.02.log
Does it make a difference? Well, we get
main lattice : diffraction limits = 1.539 1.386 1.467 R/Rfree = 0.1970/0.2385 three lattices : 1.535 1.409 1.465 0.1916/0.2366
which might (or might not) point to some significant differences ...
Looking at the model/maps of that initial refinement shows enough features to do a few simple fixes:
With a follow-up refinement using
pdb2occ -p aB_autorefine.01/04/refine-coot-0.pdb \ -o aB_autorefine.01/04/refine-coot-0.occ refine \ -p aB_autorefine.01/04/refine-coot-0.pdb \ -m aB_autorefine.01/03/Plots/LogLike_filtered.mtz \ -M ADP \ -M HydrogenHybridModel \ -Gelly aB_autorefine.01/04/refine-coot-0.occ \ -d aB_autorefine.01/05 2>&1 | tee aB_autorefine.01/05.lis
we get to R/Rfree = 0.1825/0.2166 (a significant drop from 0.1970/0.2385 before the manual rebuilding). There still seem to be a couple of buffer molecules missing (probably DMSO and sulfate?) -and after fixing thos we can run
pdb2occ -p aB_autorefine.01/05/refine-coot-0.pdb \ -o aB_autorefine.01/05/refine-coot-0.occ refine \ -p aB_autorefine.01/05/refine-coot-0.pdb \ -m aB_autorefine.01/03/Plots/LogLike_filtered.mtz \ -M ADP \ -M HydrogenHybridModel \ -Gelly aB_autorefine.01/05/refine-coot-0.occ \ -d aB_autorefine.01/06 2>&1 | tee aB_autorefine.01/06.lis
to get R/Rfree = 0.1764/0.2095. That would then be a good model and map to use for ligand fitting in RhoFit.
A single BUSTER refinement run can be run using
refine -RB 4.0 \ -p start.pdb \ -m process.04/staraniso_alldata-unique.mtz \ -d refine.01 | tee refine.01.log
There are a lot of options available (see refine -h for details) for fine-tuning the behaviour, although we think that the defaults should be adequate for most situations. We start here with an initial rigid-body refinement (-RB flag) using only lower-resolution data, since the starting model (3DZQ) might need a slight re-arrangement, even if it is the same crystal form (unit cell parameters might have changed a little bit). Any subsequent refinements - e.g. after manual adjustments of the model - doesn't need to start with such a RB-refinement step of course.
Please note that this job will internally run BUSTER several times (so-called "big cycles", with updates to the solvent mask and the X-Ray weighting) and that before the last of those cycles a feature called "void correction" is activated (to account for cavities that should be excluded from the bulk solvent model).
This uses the so-called -L feature (Vonrhein, C. and Bricogne, G., 2005. Automated Structure Refinement for High-Throughput Ligand Detection with BUSTER-TNT. Acta Cryst. Sect A, 61, p.c248.) as described in more detail here. The "Polder maps" in Phenix follow a similar idea.
refine -RB 4.0 -L \ -p start.pdb \ -m process.04/staraniso_alldata-unique.mtz \ -d refine-L.01 | tee refine-L.01.log
(the -L flag is also suppoerted by aB_autorefine). The result should hopefully be a clearer difference-density map for the bound ligand.
It might be better to use aB_autorefine as a starting point for an updated solvent model, since it will run this only after the input model has gone through some additional refinement cycles (including RB, TLS etc). As a test (and for this tutorial) one could run
refine -RB 4.0 -WAT 2 \ -p start.pdb \ -m process.04/staraniso_alldata-unique.mtz \ -d refine-WAT.01 | tee refine-WAT.01.log
to (1) start with some rigid-body refinement and (2) switch on solvent update after big cycle 2.
We will need a description of the ligand in form of a mmCIF restraints dictionary before we can try and fit the ligand into the (hopefully clear) difference density map that provides our evidence for a bound compound. There are several ways of generating those type of restraint files (and the different programs should create files that are compatible with each other), with Grade2 the latest incarnation of our own approach to this. If you have the required CSD software from CCDC installed, you should be able to run
grade2 -P 6RO(for a known identifier in the chemical components dictionary), or
grade2 -r 6RO 'C[S](=O)(=O)Nc1ccc(Cl)cc1'if using a SMILES string. Please also see the extensive Grade2 documentation and the Grade2 webserver.
We will be using the results of the BUSTER refinement in "ligand detection" mode - and search in all potential binding sites ("clusters"):
rhofit -l 6RO.restraints.cif \ -allclusters \ -p refine-L.01/refine.pdb \ -m refine-L.01/refine.mtz \ -d rhofit.01 | tee rhofit.01.log
The reason for adding the -allclusters command should become clear when visualising the results:
( cd rhofit.01 && visualise-rhofit-coot)
| Site 1: missing C-terminal residues | |
| Site 2: incorrect loop | |
| Site 3: correct | |
| site 4: noise |
A better starting point is the manually corrected model (see above), i.e.
rhofit -l 6RO.restraints.cif \ -p aB_autorefine.01/06/refine.pdb \ -m aB_autorefine.01/06/refine.mtz \ -d rhofit.02 | tee rhofit.02.log
When using
( cd rhofit.02 && visualise-rhofit-coot)we can see that the only "interesting region" found (marked by a so-called cluster) is the actual binding site and that the top hit looks rather good.
If one is happy with the top hit (of each binding site), this can be put immediately into refinement using e.g.
refine -l 6RO.restraints.cif \ -p rhofit.01/merged.pdb \ -m process.04/staraniso_alldata-unique.mtz \ -d refine-6RO.01 | tee refine-6RO.01.log
However, the above command uses the initial refinement results (where we still have some misplaced loop, side-chains etc) and therefore have several incorrectly identified binding sites resulting in poor fits. Moving along the manually rebuilt route means using the unique (single-site) fit from RhoFit:
pdb2occ -p rhofit.02/merged.pdb -o rhofit.02/merged.occ refine -l 6RO.restraints.cif \ -p rhofit.02/merged.pdb \ -m process.04/staraniso_alldata-unique.mtz \ -d refine-6RO.02 | tee refine-6RO.02.log
giving us R/Rfree = 0.2042/0.2295.
Especially when working with ligand structures, buster-report can be very useful for analysing the overall quality of the model and of a particular compound. It can be invoked in several ways:
Once a BUSTER jobs has finished, running
buster-report \ -d refine-6RO.02 \ -dreport refine-6RO.02-BR
will create reports in various formats within the output directory:
refine-6RO.02-BR/index.html # HTML pages refine-6RO.02-BR/report.pdf # PDF reoport refine-6RO.02-BR/report.tsv refine-6RO.02-BR/report.xml
Some of the interesting plots visible in index.html are:
| A summary of refinement statistics | |
| Reciprocal space correlation plot | |
| Run conditions | |
| log-likelihood analysis as a function of resolution (showing outliers) | |
| Specific ligand analysis | |
| Molprobity analysis |
If you want buster-report to be run automatically after BUSTER, just add -report to the command-line.
This is the default anyway and you will have the full buster-report output of the complex ligand structure - so no specific argument required.
Putting all of the steps above into a single system:
It is very important to have a good starting/APO model for this procedure to give reliable results. Ideally, this should be a complete model refined (to convergence) against a high-quality dataset - for a crystal where the only difference is the lack of compound (e.g. a DMSO-only soak or such). The reference MTZ file needs to be consistent with this starting/APO model.
We will also use here the -allclusters keyword (for the same reason as with our standalone RhoFit job above):
pipedream -xyzin start.pdb \ -hklref deposited.mtz \ -imagedir Images \ -rhofit 6RO.restraints.cif \ -allclusters \ -d pipedream.01 | tee pipedream.01.log
(this takes 30 minutes on a i9-13900 chip). The final model and ligand-fit can be visualised using
coot --pdb pipedream.01/postrefine-*/refine.pdb \ --auto pipedream.01/postrefine-*/refine.mtz \ --dictionary 6RO.restraints.cif
pipedream -xyzin start.pdb \ -hklref deposited.mtz \ -hklin process.04/staraniso_alldata-unique.mtz \ -rhofit 6RO.restraints.cif \ -d pipedream.02 | tee pipedream.02.log