Atom Naming Features

This chapter describes how Grade2 produces the atom IDs (also known as atom names) of individual atoms in a ligand molecule.

Please note that, because of limitations in the legacy Protein Data Bank (PDB) format, Grade2 sets all atom IDs to uppercase and attempts wherever possible to keep them to 4 or fewer characters in length. This is because the PDB format is currently used by BUSTER and other crystallographic tools.

Default Atom IDs if they are set in the input

Where possible Grade2, by default, will reuse atom names from the input file. For instance, all PDB chemical components have specified atom IDs and it is important to use these to ensure consistency and compatibility with existing PDB data.

Atom IDs are also set in:

  • All CIF restraint dictionaries.

  • Most (but not all) MOL2 files. MOL2 files offer a flexible method for manipulating atom IDs within a molecule. The CSD-core program Mercury provides a user-friendly interface for editing MOL2 files and adjusting atom IDs, as demonstrated in the FAQ on editing a molecule.

  • Some SDF files.

If atom IDs are set in the input but you want to use different atom names, Grade2 has a number of options for setting atom IDs that will override the input IDs.

Please note that all lower case letters in atom IDs are altered to uppercase by Grade2 as programs such as BUSTER require that atom IDs are all uppercase.
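The uppercase and length conventions above can be illustrated with a minimal Python sketch. The function name normalize_atom_id is hypothetical and is not part of Grade2; the sketch only uppercases an ID and reports whether it fits in 4 characters, and makes no attempt to reproduce how Grade2 itself shortens long IDs.

```python
from typing import Tuple


def normalize_atom_id(atom_id: str, max_len: int = 4) -> Tuple[str, bool]:
    """Uppercase an atom ID and report whether it fits the legacy PDB width.

    Hypothetical helper for illustration only, not a Grade2 function.
    """
    upper = atom_id.strip().upper()
    return upper, len(upper) <= max_len


for raw in ["ca", "Cl24", "oxt", "c10a1"]:
    print(raw, "->", normalize_atom_id(raw))
```

An ID that is flagged as too long would need renaming before it could be written to a legacy PDB file.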

Default atom IDs if not already set in the input

If atom IDs are not set in the input then Grade2 will, by default, base the atom IDs on the order of the atoms, unless the molecule is a typical amino acid. The first non-hydrogen atom will be assigned an atom ID composed of its element abbreviation (made uppercase), followed by 1. Subsequent non-hydrogen atoms will be assigned IDs made up of their element followed by their position in the input list.

Using a SMILES string N(C)[C@@H](C)[C@H](O)c1ccccc1 for ephedrine as an example, Grade2 will set atom IDs:

Grade2 atom labels for ephedrine ``N(C)[C@@H](C)[C@H](O)c1ccccc1``

The first atom in the SMILES string is a nitrogen so it is assigned atom ID N1. The second atom is a carbon and so it gets ID C2. The oxygen atom is the sixth non-hydrogen atom and so it is assigned O6.

Hydrogen atom IDs all start with H, followed by the list number taken from the atom to which they are attached, and then A, B or C if there is more than one hydrogen atom attached. So in the ephedrine example above the hydrogen atom attached to nitrogen N1 is given the ID H1. As there are three hydrogen atoms attached to C2 they are assigned IDs H2A, H2B and H2C.
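The order-based scheme described above can be sketched in a few lines of Python. This is an illustration of the rules as stated in this section, not Grade2's implementation: the function name default_atom_ids is hypothetical, and the hydrogen counts for ephedrine are entered by hand rather than derived from the SMILES string.

```python
from string import ascii_uppercase
from typing import List, Tuple


def default_atom_ids(heavy_atoms: List[Tuple[str, int]]) -> List[Tuple[str, List[str]]]:
    """Assign order-based IDs (hypothetical helper, for illustration only).

    heavy_atoms: (element symbol, number of attached hydrogens), in input order.
    Returns one (heavy atom ID, [hydrogen IDs]) pair per non-hydrogen atom.
    """
    result = []
    for position, (element, n_h) in enumerate(heavy_atoms, start=1):
        heavy_id = f"{element.upper()}{position}"
        if n_h == 1:
            # A single hydrogen takes the parent's number with no letter.
            h_ids = [f"H{position}"]
        else:
            # Two or more hydrogens get suffixes A, B, C; zero gives an empty list.
            h_ids = [f"H{position}{ascii_uppercase[i]}" for i in range(n_h)]
        result.append((heavy_id, h_ids))
    return result


# Ephedrine, N(C)[C@@H](C)[C@H](O)c1ccccc1, hydrogen counts filled in by hand.
ephedrine = [("N", 1), ("C", 3), ("C", 1), ("C", 3), ("C", 1), ("O", 1),
             ("C", 0), ("C", 1), ("C", 1), ("C", 1), ("C", 1), ("C", 1)]
ids = default_atom_ids(ephedrine)
for heavy_id, h_ids in ids:
    print(heavy_id, h_ids)
```

The first two entries are N1 with H1 and C2 with H2A, H2B and H2C, matching the IDs quoted in the text.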

It should be noted that, as SMILES strings are not unique, different atom IDs can be assigned for the same molecule. If this is a problem then the Grade2 option --rdkit_canonical_atom_ids, discussed below, sets the IDs from a canonical atom order that is independent of the input order.

Default atom IDs for recognized amino acids

Typical alpha amino acids with an amino group and a single beta carbon atom

Grade2 will now, by default, recognize typical amino acids when supplied with an input that lacks atom IDs (aka atom names), for instance a SMILES string. The exact requirement used is that the molecule matches the SMARTS pattern:

[$([NX3H2,NX4H3+])][CX4H]([#6])[CX3](=[OX1])[OX2H,OX1-]

The pattern specifies that the molecule must have either a neutral NH2 or an NH3+ amino group, followed by a 4-valent carbon atom with one hydrogen atom and one carbon atom attached, and then a neutral or charged carboxylic acid. A wider range of amino acids is recognized when the --aa_loose option is used (see next section).

If a typical amino acid is recognized then the PDB-standard atom IDs (N CA C O OXT CB) will be set for the main chain and beta carbon atoms and for the hydrogen atoms that they are bonded to. In addition, the ligand's atoms will be reordered so that the main chain atoms are first in the list. Currently, side chain atoms are assigned atom IDs using their numerical order (rather than PDB-style Greek letter remoteness codes CG CD CE etc). So using 4-fluoroglutamate from SMILES C(C(F)C(=O)O)[C@@H](C(=O)O)N as an example, Grade2 will assign atom IDs:

Grade2 atom labels for fluoroglutamate C(C(F)C(=O)O)[C@@H](C(=O)O)N
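The main chain renaming and reordering described above can be sketched as follows. This is a rough illustration under stated assumptions, not Grade2's code: the function name name_main_chain is hypothetical, the role-to-atom mapping is assigned by hand (in practice it would come from a SMARTS match), the output order N CA C O OXT CB simply follows the listing in the text, and side chain IDs are left out because their exact numbering is not specified here.

```python
from typing import Dict, List, Tuple

# Order taken from the listing in the text; assumed, not confirmed, to be
# the order used when main chain atoms are moved to the front.
MAIN_CHAIN_IDS = ("N", "CA", "C", "O", "OXT", "CB")


def name_main_chain(n_atoms: int, match: Dict[str, int]) -> Tuple[List[int], Dict[int, str]]:
    """Return (new atom order, {original index: PDB-style ID}).

    Hypothetical helper: match maps each main chain role to a 0-based
    index in the input atom list.
    """
    order = [match[name] for name in MAIN_CHAIN_IDS]
    main = set(order)
    # Remaining (side chain) atoms keep their relative input order.
    order += [i for i in range(n_atoms) if i not in main]
    ids = {match[name]: name for name in MAIN_CHAIN_IDS}
    return order, ids


# 4-fluoroglutamate, C(C(F)C(=O)O)[C@@H](C(=O)O)N: 11 non-hydrogen atoms,
# roles assigned by hand from the SMILES atom order.
match = {"CB": 0, "CA": 6, "C": 7, "O": 8, "OXT": 9, "N": 10}
order, ids = name_main_chain(11, match)
print(order)
print(ids)
```

Atoms 1 to 5 of the input (the remainder of the side chain) follow the six named atoms in the new order.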

It should be noted that the --antecedent option can normally be used to assign more atom IDs from the parent amino acid, as shown below for 4-fluoroglutamate.

If you prefer for the renaming not to happen, then the Grade2 command-line --no_aa_labels option turns it off, leaving standard numerical order based atom IDs.

Note that, currently, no alterations are made if the input file specifies atom IDs (for example CIF restraint dictionaries and most MOL2 files).

In addition to setting main chain atom IDs, the output restraint dictionary will have the CCP4-extension CIF item _chem_comp.group set to peptide. This enables Grade2 CIF restraint dictionaries to be used in Coot to replace protein residues with modified amino acids.

Setting atom IDs for "exotic" amino acids with the --aa_loose option

Following a user request, the atom naming feature has been extended to a wide range of "exotic" amino acids when the command-line option --aa_loose is used. If the option is not used but atom names could be set, then a warning message is produced in the terminal output, for instance:

WARNING: The molecule is an "GLY-like alpha amino acid with an amino group", so ....
WARNING: ---- could set conventional amino acid atom IDs. If you want ....
WARNING: ---- this done, then please rerun with the option: --aa_loose
WARNING:

If a molecule is recognized as an amino acid by the --aa_loose option, the output restraint dictionary will have the CCP4-extension CIF item _chem_comp.group set to peptide. Please note that the setup of restraints between an "exotic" amino acid and adjacent monomers depends on the program using the restraint dictionary, and that setting atom IDs is not likely to be sufficient to ensure that correct restraints are used.

The amino acid classes that are currently recognized by --aa_loose are detailed below. If there is any need for recognition of any other class of amino acid then please let us know.

alpha amino acid with CB and N-modification

This pattern allows modification of the nitrogen atom by a single carbon atom. The SMARTS used is:

[$([NX3])]([#6])[CX4H]([#6])[CX3](=[OX1])[OX2H,OX1-]

Atom IDs N CN CA C O OXT CB will be set. Please note that for PDB chemical components there is no standard atom name for the carbon atom attached to the nitrogen, but CN is used in N-methyl-L-serine https://www.rcsb.org/ligand/5JP and seems sensible.

For an example, given the SMILES input C[C@@H](C(=O)O)NCC the following atom IDs will be set:

Grade2 atom labels for n-ethyl-alanine C[C@@H](C(=O)O)NCC

AIB-like alpha amino acid with an amino group

This pattern matches alpha amino acids with two C beta atoms and an unmodified amino group. The SMARTS used is:

[$([NX3H2,NX4H3+])][CX4]([#6])([#6])[CX3](=[OX1])[OX2H,OX1-]

Atom IDs N CA CB1 CB2 C O OXT will be set. For an example, given the SMILES input NC(C)(CO)C(O)=O the following atom IDs will be set:

Grade2 atom labels for alpha_methyl_serine NC(C)(CO)C(O)=O

AIB-like alpha amino acid with N-modification

This pattern matches alpha amino acids with two C beta atoms and a nitrogen modified by a carbon atom. The SMARTS used is:

[$([NX3])]([#6])[CX4]([#6])([#6])[CX3](=[OX1])[OX2H,OX1-]

Atom IDs N CN CA CB1 CB2 C O OXT will be set. For an example, given the SMILES input CNC(C)(CO)C(O)=O the following atom IDs will be set:

Grade2 atom labels for n_methyl_alpha_methyl_serine CNC(C)(CO)C(O)=O

GLY-like alpha amino acid with an amino group

This pattern matches alpha amino acids that are similar to glycine in that no beta carbon atom is present and that the amino nitrogen atom is either a neutral NH2 or a NH3+. The SMARTS used is:

[$([NX3H2,NX4H3+])][CX4][CX3](=[OX1])[OX2H,OX1-]

Atom IDs N CA C O OXT will be set. For an example, given the SMILES input F[C@@H](C(=O)O)N the following atom IDs will be set:

Grade2 atom labels for fluoroglycine F[C@@H](C(=O)O)N

GLY-like alpha amino acid with N-modification

This pattern matches alpha amino acids that are similar to glycine but have a N-modification involving a carbon atom. The SMARTS used is:

[$([NX3])]([#6])[CX4][CX3](=[OX1])[OX2H,OX1-]

Atom IDs N CN CA C O OXT will be set. For an example, given the SMILES input F[C@@H](C(=O)O)NC the following atom IDs will be set:

Grade2 atom labels for n_methyl_fluoroglycine F[C@@H](C(=O)O)NC

beta amino acid

This pattern matches beta amino acids. Please note that, unlike the previous patterns, the matching is promiscuous allowing matches with N-modification and modification at both the CA and CB atoms.

The SMARTS used is:

[$([NX3])][#6][#6][CX3](=[OX1])[OX2H,OX1-]

Atom IDs N CB CA C O OXT will be set. Please note that for PDB chemical components there is no standard atom name for the extra main chain carbon atom, but CB is used in both beta-alanine https://www.rcsb.org/ligand/BAL and 62H https://www.rcsb.org/ligand/62H. For an example, given the SMILES input FCC(CN)C(=O)O the following atom IDs will be set:

Grade2 atom labels for beta_fluoromethylalanine FCC(CN)C(=O)O


Grade2 options to set atom IDs

The --antecedent_disregard_element option

The --antecedent_disregard_element option (which can be shortened to -ad) is similar to --antecedent except that atoms are not required to have the same element to match. Where possible, atom IDs are altered so that the non-element part of matching atoms is maintained. So for example, if atom CL24 is matched to a fluorine atom it will be given the atom ID F24 (provided there is not another atom with that label).
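The renaming rule described above (keep the non-element part, swap the element prefix, avoid clashes) can be sketched in Python. The function name transfer_atom_id is hypothetical and this is not Grade2's code. The antecedent atom's element is passed explicitly because an ID such as CA could be read as either a carbon or a calcium. What Grade2 does when the preferred ID is already taken is not described here, so the sketch simply returns None in that case.

```python
from typing import Optional, Set


def transfer_atom_id(antecedent_id: str, antecedent_element: str,
                     new_element: str, used: Set[str]) -> Optional[str]:
    """Carry the non-element part of an antecedent atom ID over to a new element.

    Hypothetical helper for illustration only. Returns None if the ID does
    not start with the antecedent element or the new ID is already in use.
    """
    prefix = antecedent_element.upper()
    atom_id = antecedent_id.upper()
    if not atom_id.startswith(prefix):
        return None
    candidate = new_element.upper() + atom_id[len(prefix):]
    if candidate in used:
        return None
    return candidate


print(transfer_atom_id("CL24", "Cl", "F", {"C1", "N2"}))
print(transfer_atom_id("CL24", "Cl", "F", {"F24"}))
```

With no clash, the chlorine ID CL24 becomes F24 for a matched fluorine atom, as in the example in the text.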

Taking for example the cyclin-dependent kinase inhibitors SC8 and SC9, running grade2 for each in turn:

$ grade2 --PDB_ligand SC8
...
$ grade2 --PDB_ligand SC9
...

As can be seen below, the PDB component definitions of the two inhibitors SC8 and SC9 have consistent atom numbers for the central pyrazolopyrimidine ring, but the halogenophenyl and pyridine rings have distinct numbering and atom IDs.

comparing the atom names of PDB components SC8 and SC9

Rerunning Grade2 for SC9 with the --antecedent_disregard_element option:

$ grade2 --PDB_ligand SC9 -ad SC8.restraints.cif -o SC9_ad_SC8
...

overrides the input atom IDs and instead sets atom IDs by matching atoms from SC8:

atom names from grade2 --PDB_ligand SC9 -ad SC8.restraints.cif

It can be seen that all atoms are matched to their equivalents in SC8, including those in both the halogenophenyl and pyridine rings.

The --antecedent_disregard_element option is useful to set consistent IDs and produce aligned 2D diagrams for series of related inhibitors.

Basing atom IDs on the RDKit canonical SMILES string with --rdkit_canonical_atom_ids

The default procedure for setting atom IDs used by Grade2, described above, uses the atom order of the input molecule. This means that it is common for two restraint dictionaries for a single compound to have completely different atom naming because the atom orders of the input descriptions are different. To avoid this problem the --rdkit_canonical_atom_ids option (short option -R) can be used. This uses the atom order in the RDKit canonical SMILES string as a basis for the atom IDs. As the RDKit canonical SMILES is independent of the input atom order, this will produce the same atom IDs for a single compound whatever the source.

For example, using three different SMILES strings describing ephedrine, grade2 -R will produce the same atom IDs:

atom IDs for ephedrine from -R

Hydrogen atom IDs are based on the list number of the non-hydrogen atom to which they are attached, as described above.

Please note that --rdkit_canonical_atom_ids wipes any existing atom IDs and that atoms are reordered by the option.

Basing atom IDs on the InChI canonical atom order with --inchi_canonical_atom_ids

Using the RDKit canonical SMILES atom order to produce consistent atom IDs for a single molecule, with the --rdkit_canonical_atom_ids option, works well. One problem, however, is that canonical SMILES strings produced by different programs are not consistent and so the atom IDs are not universal. Dashti et al. (2017) introduced the idea of using the canonical atom order found as part of calculating the International Chemical Identifier (InChI) of a molecule to produce ALATIS unique identifiers. The --inchi_canonical_atom_ids option uses this idea and produces atom IDs from the InChI canonical atom order. For non-hydrogen atoms, the numerical part of the --inchi_canonical_atom_ids atom ID is the same as the ALATIS ID.

Once again using three different SMILES strings describing ephedrine as an example, grade2 --inchi_canonical_atom_ids produces:

atom IDs for ephedrine from --inchi_canonical_atom_ids

As expected, --inchi_canonical_atom_ids produces consistent atom IDs regardless of the atom order in the input SMILES string. However, atoms with consecutive IDs are often far apart in the molecule: for instance, atom C1 is bonded to atom C8 and is not adjacent to atom C2. This makes the IDs less "user-friendly" but more universal than those from --rdkit_canonical_atom_ids (which, for me, are more intuitive).