Feature request: Retrieve mapping between SMILES and SELFIES tokens
Is it possible to get a map showing which SMILES tokens were used to generate which SELFIES tokens (or vice versa)?
I am looking for a feature like this:
>>> smiles = 'CCO'
>>> encoder(smiles, get_mapping=True)
('[C][C][O]', [0, 1, 2])
In this simple example, [0, 1, 2] would imply that the first SMILES token (C) is mapped to the first SELFIES token ([C]), and so on.
Motivation: I think this feature could be very useful to close the gap between RDKit and SELFIES. One example is scaffolds: say we have a molecule and want to retrieve its scaffold and decorate it with a generative model. With SMILES it’s easy (see the example below), but with SELFIES it’s not possible (as far as I understand).
My questions:
- Is it, in principle, possible to obtain such a mapping?
- If yes, is there already a way to obtain it with the current package?
- If no, is this a feature that seems worth implementing?
Discussion:
Such a mapping would imply a standardized way of splitting the strings into tokens. Fortunately, we already have split_selfies. For SMILES, I think the tokenizer from the Found in Translation paper could be a good choice since it’s widely used. (I’m using that tokenizer in the example below.)
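To make the tokenization assumption concrete, here is a minimal sketch that splits both string types with plain regular expressions. The SMILES pattern is the same one used in the example below. `split_selfies_sketch` and `tokenize_smiles_sketch` are hypothetical names; the former is only a stand-in for `selfies.split_selfies` that handles bracketed symbols and ignores the `.` separator, so it illustrates the idea rather than reproducing the library's implementation.

```python
import re

# SMILES token pattern, identical to the one in the example further down.
SMILES_REGEX = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|'
    r'-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)


def tokenize_smiles_sketch(smiles):
    """Split a SMILES string into tokens (hypothetical helper)."""
    # re.split keeps the captured separators; drop the empty strings between them.
    return [token for token in SMILES_REGEX.split(smiles) if token]


def split_selfies_sketch(selfies):
    """Split a SELFIES string into its bracketed symbols (hypothetical stand-in)."""
    return re.findall(r'\[[^\]]*\]', selfies)


smiles_tokens = tokenize_smiles_sketch('CCO')
selfies_tokens = split_selfies_sketch('[C][C][O]')
```

With both strings split into lists, a mapping can be expressed purely in terms of list positions, which is what the requested feature would return.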
==== EXAMPLE ====
This is just the appendix to the post. It’s an example of how to retrieve which SMILES tokens constitute the scaffold of a given molecule. As far as I can tell, this is currently not possible with SELFIES.
First, some boring setup:
from rdkit import Chem
from selfies import encoder, decoder, split_selfies
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol
from pytoda.smiles.processing import tokenize_smiles
import re
# Setup tokenizer
NON_ATOM_CHARS = set(list(map(str, range(1, 10))) + ['/', '\\', '(', ')', '#', '=', '.', ':', '-'])
regexp = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|'
    r'-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)
smiles_tokenizer = lambda smi: [token for token in regexp.split(smi) if token]
Example molecule (left) and RDKit-extracted scaffold (right):
smiles = 'CCOc1[nH]c(N=Cc2ccco2)c(C#N)c1C#N'
mol = Chem.MolFromSmiles(smiles)
atom_symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
scaffold = GetScaffoldForMol(mol)
# List of ints pointing to scaffold atoms as they occur in SMILES
scaffold_atoms = mol.GetSubstructMatches(scaffold)[0]
smiles_tokens = smiles_tokenizer(smiles)
atom_id = -1
for token in smiles_tokens:
    if token not in NON_ATOM_CHARS:
        # Found atom
        atom_id += 1
        if atom_id in scaffold_atoms:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
    else:
        # Non-atom characters (bonds, branches, ring closures)
        if (atom_id in scaffold_atoms and atom_id + 1 in scaffold_atoms) or atom_id == scaffold_atoms[-1]:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
Output will be:
C --> not on scaffold
C --> not on scaffold
O --> not on scaffold
c --> on scaffold
1 --> on scaffold
[nH] --> on scaffold
c --> on scaffold
( --> on scaffold
N --> on scaffold
= --> on scaffold
C --> on scaffold
c --> on scaffold
2 --> on scaffold
c --> on scaffold
c --> on scaffold
c --> on scaffold
o --> on scaffold
2 --> on scaffold
) --> on scaffold
c --> on scaffold
( --> not on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold
) --> not on scaffold
c --> on scaffold
1 --> on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold
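The atom-counting part of the loop above can also be written as a small standalone function, which is roughly the SMILES half of the mapping being requested. This is only a sketch: `token_atom_indices` is a hypothetical helper, it relies on the same assumption as the loop (atoms are numbered in the order they appear in the SMILES string), and it inherits the limits of NON_ATOM_CHARS (for instance, two-digit ring closures such as %10 are not listed there).

```python
# Same set as in the setup above: tokens that do not denote an atom.
NON_ATOM_CHARS = set(list(map(str, range(1, 10))) + ['/', '\\', '(', ')', '#', '=', '.', ':', '-'])


def token_atom_indices(tokens):
    """For each SMILES token, return the index of the atom it denotes, or None.

    Hypothetical helper; assumes atoms are indexed in order of appearance.
    """
    indices = []
    atom_id = -1
    for token in tokens:
        if token in NON_ATOM_CHARS:
            indices.append(None)      # bond, branch or ring-closure token
        else:
            atom_id += 1
            indices.append(atom_id)   # next atom in string order
    return indices


# Hand-tokenized 'CCOc1ccccc1'; the ring atoms are atoms 3 to 8.
tokens = ['C', 'C', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
indices = token_atom_indices(tokens)
ring_atoms = set(range(3, 9))
on_ring = [idx in ring_atoms for idx in indices]
```

A SELFIES counterpart of this function is exactly what is missing, because the position of an atom symbol in the SELFIES string is not guaranteed to follow the SMILES atom order.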
Trying to achieve the same with SELFIES does not seem to work, because selfies.encoder does not fully preserve the order of the tokens passed. It preserves it to a large extent (which is great), but around ring symbols it usually breaks. I feel like I would need to reverse-engineer the context-free grammar to solve this.
Here are the tokens in SMILES (left) and SELFIES (right), each listed in string order:
C [C]
C [C]
O [O]
c [C]
1 [NHexpl]
[nH] [C]
c [Branch1_1]
( [Branch2_3]
N [N]
= [=C]
C [C]
c [=C]
2 [C]
c [=C]
c [O]
c [Ring1]
o [Branch1_1]
2 [=C]
) [Branch1_1]
c [Ring1]
( [C]
C [#N]
# [C]
N [Expl=Ring1]
) [=C]
c [C]
1 [#N]
C
#
N
Top GitHub Comments
Sorry that it took so long to get this into the main repo, but it is finally in now. Thanks a lot @whitead, and welcome to the developer team 😃
We implemented something like this for leruli, based on the amazing SELFIES 2.0 code for atom mappings. You can try it interactively by searching for any molecule and hitting the “explain SELFIES” button on the result page.
We’d be happy to contribute that code, but the points @alstonlo made are very valid. So far, what our code can do is identify which atoms and bonds are created by which SELFIES token. For SMILES, that would at least allow a one-to-one mapping of heavy atoms between SELFIES and SMILES tokens, except for the bond orders.
How about this code structure: selfies.decoder() gets an optional argument, “atom_mapping”, defaulting to False. If set, the function returns not only the existing SMILES but also a dictionary whose keys are SELFIES token indices and whose values are SMILES atom indices. That would allow tracing in both directions. If that is in line with your thoughts on the API of the package, I’m happy to prepare a pull request.
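As a rough illustration of how such a dictionary could be traced in both directions, here is a minimal sketch. The mapping is written by hand for `[C][C][O]` and `CCO`; "atom_mapping" is only the argument name proposed above, not an existing parameter, and `invert_mapping` is a hypothetical helper.

```python
def invert_mapping(selfies_to_smiles):
    """Turn {SELFIES token index: SMILES atom index} into the reverse lookup.

    Hypothetical helper. Values are lists so the reverse lookup still works
    if more than one SELFIES token were to point at the same atom.
    """
    smiles_to_selfies = {}
    for selfies_idx, smiles_idx in selfies_to_smiles.items():
        smiles_to_selfies.setdefault(smiles_idx, []).append(selfies_idx)
    return smiles_to_selfies


# Hand-written mapping for SELFIES '[C][C][O]' -> SMILES 'CCO' (illustrative only).
selfies_to_smiles = {0: 0, 1: 1, 2: 2}
smiles_to_selfies = invert_mapping(selfies_to_smiles)
```

SELFIES tokens that create no atom (branch and ring symbols, for example) would simply be absent from the keys under this convention.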