Feature request: Retrieve mapping between SMILES and SELFIES tokens

See original GitHub issue

Is it possible to get a map showing which SMILES tokens were used to generate which SELFIES tokens (or vice versa)?

I am looking for a feature like this:

>>> smiles = 'CCO'
>>> encoder(smiles, get_mapping=True)
('[C][C][O]', [0, 1, 2])

In this simple example [0,1,2] would imply that the first SMILES token (C) is mapped to the first selfies token ([C]) and so on.
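
To make the requested return value concrete, here is a minimal sketch of how such an index list could be consumed. The helper pair_tokens is hypothetical (it is not part of selfies), and it assumes, as in the example above, that entry i of the mapping is the index of the SELFIES token generated from SMILES token i.

```python
def pair_tokens(smiles_tokens, selfies_tokens, mapping):
    """Pair each SMILES token with the SELFIES token it maps to.

    Hypothetical helper: mapping[i] is assumed to be the index of the
    SELFIES token generated from SMILES token i.
    """
    if len(mapping) != len(smiles_tokens):
        raise ValueError("mapping needs one entry per SMILES token")
    return [(smi, selfies_tokens[j]) for smi, j in zip(smiles_tokens, mapping)]


# The 'CCO' example from above, already split into tokens by hand
pairs = pair_tokens(['C', 'C', 'O'], ['[C]', '[C]', '[O]'], [0, 1, 2])
for smi, sf in pairs:
    print(smi, '-->', sf)
```

For less trivial molecules the list would no longer be 0, 1, 2, ..., which is exactly the information that cannot be recovered by position alone.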

Motivation: I think this feature could be very useful to close the gap between RDKit and SELFIES. One example is scaffolds. Say we have a molecule and want to retrieve its scaffold and decorate it with a generative model. With SMILES it’s easy (see example below), but with SELFIES it’s not possible (as far as I understand).

My questions:

Is it, in principle, possible to obtain such a mapping?
If yes, is there already a way to obtain it with the current package?
If no, is this a feature that seems worth implementing?

Discussion: Such a mapping would imply a standardized way of splitting the strings into tokens. Fortunately, we already have split_selfies. For SMILES, I think the tokenizer from the Found in Translation paper could be a good choice, since it’s widely used. (I’m using that tokenizer in the example below.)
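
On the SELFIES side, token boundaries are unambiguous because every symbol is bracketed. The sketch below is a simplified, dependency-free stand-in for split_selfies, not the library function itself; the name split_bracket_tokens is made up here, and it assumes the input consists only of bracketed symbols and '.' separators.

```python
import re

# One token is either a bracketed symbol such as [C] or [=C], or a '.' separator
_SELFIES_TOKEN = re.compile(r'\[[^\]]*\]|\.')


def split_bracket_tokens(selfies_string):
    """Split a SELFIES string into its bracketed symbols (illustrative only)."""
    tokens = _SELFIES_TOKEN.findall(selfies_string)
    # Reject input containing anything outside brackets and dots
    if ''.join(tokens) != selfies_string:
        raise ValueError("unexpected characters outside of brackets")
    return tokens


tokens = split_bracket_tokens('[C][C][O]')
# Token indices like these are what a SMILES <-> SELFIES mapping would refer to
print(list(enumerate(tokens)))
```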

==== EXAMPLE ====

This is just the appendix to the post. It’s an example of how to retrieve which SMILES tokens constitute the scaffold of a given molecule. As far as I can tell, this is currently not possible with SELFIES.

First, some boring setup:

import re

from rdkit import Chem
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol
from selfies import encoder, decoder, split_selfies

# Characters that are SMILES syntax (ring digits, bonds, branches), not atoms
NON_ATOM_CHARS = set(list(map(str, range(1, 10))) + ['/', '\\', '(', ')', '#', '=', '.', ':', '-'])
# Tokenizer regex from the Found in Translation paper. In a raw string,
# a single literal backslash is matched by \\ (\\\\ would require two).
regexp = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|'
    r'-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)


def smiles_tokenizer(smi):
    # split() keeps the captured tokens; drop the empty strings in between
    return [token for token in regexp.split(smi) if token]

Example molecule (left) and RDKit-extracted scaffold (right): [image: Screenshot 2021-05-06 at 11 59 07]

smiles = 'CCOc1[nH]c(N=Cc2ccco2)c(C#N)c1C#N'
mol = Chem.MolFromSmiles(smiles)
atom_symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
scaffold = GetScaffoldForMol(mol)
# List of ints pointing to scaffold atoms as they occur in SMILES
scaffold_atoms = mol.GetSubstructMatches(scaffold)[0]
smiles_tokens = smiles_tokenizer(smiles)

atom_id = -1
for token in smiles_tokens:
    if token not in NON_ATOM_CHARS:
        # Found atom
        atom_id += 1
        if atom_id in scaffold_atoms:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
    else:
        # Non-Atom-Chars
        if (atom_id in scaffold_atoms and atom_id+1 in scaffold_atoms) or atom_id==scaffold_atoms[-1]:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')

Output will be:

C --> not on scaffold
C --> not on scaffold
O --> not on scaffold
c --> on scaffold
1 --> on scaffold
[nH] --> on scaffold
c --> on scaffold
( --> on scaffold
N --> on scaffold
= --> on scaffold
C --> on scaffold
c --> on scaffold
2 --> on scaffold
c --> on scaffold
c --> on scaffold
c --> on scaffold
o --> on scaffold
2 --> on scaffold
) --> on scaffold
c --> on scaffold
( --> not on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold
) --> not on scaffold
c --> on scaffold
1 --> on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold

Trying to achieve the same with SELFIES does not seem to work, because selfies.encoder does not fully preserve the order of the tokens passed. It preserves the order to a large extent (which is great), but it usually breaks around ring symbols. I feel like I would need to reverse-engineer the context-free grammar to solve this. Here are the tokens in SMILES and SELFIES, respectively:

C [C]
C [C]
O [O]
c [C]
1 [NHexpl]
[nH] [C]
c [Branch1_1]
( [Branch2_3]
N [N]
= [=C]
C [C]
c [=C]
2 [C]
c [=C]
c [O]
c [Ring1]
o [Branch1_1]
2 [=C]
) [Branch1_1]
c [Ring1]
( [C]
C [#N]
# [C]
N [Expl=Ring1]
) [=C]
c [C]
1 [#N]
C
#
N

Issue Analytics

State:
Created 2 years ago
Comments:16 (13 by maintainers)

Top GitHub Comments

MarioKrenn6240 commented, Feb 16, 2022

Sorry that it took so long to get this into the main repo, but it is finally in now. Thanks a lot @whitead, and welcome to the developer team 😃

ferchault commented, Nov 16, 2021

We implemented something like this based on the amazing SELFIES 2.0 code for atom mappings for leruli:

You can try it interactively by searching any molecule and hitting the “explain SELFIES” button on the result page

We’d be happy to contribute that code, but the points @alstonlo made are very valid. So far, what our code can do is identify which atoms and bonds are created by which SELFIES token. For SMILES, that would at least allow a one-to-one mapping of heavy atoms between SELFIES and SMILES tokens, except for the bond orders.

How about this code structure: selfies.decoder() gets an optional argument, “atom_mapping”, defaulting to False. If set, the function returns not only the existing SMILES, but also a dictionary whose keys are SELFIES token indices and whose values are SMILES atom indices. That would allow tracing in both directions. If that is in line with your thoughts on the API of the package, I’m happy to prepare a pull request.
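
The “both directions” part of this proposal could be covered by inverting the returned dictionary. The sketch below does not call selfies at all: the atom_mapping argument is only a proposal in this comment, the helper name invert_atom_mapping is made up, and the dictionary is an illustrative value with the proposed shape (SELFIES token index as key, SMILES atom index as value).

```python
def invert_atom_mapping(selfies_to_smiles):
    """Turn {selfies_token_index: smiles_atom_index} into the reverse lookup."""
    smiles_to_selfies = {}
    for token_idx, atom_idx in selfies_to_smiles.items():
        if atom_idx in smiles_to_selfies:
            # The reverse lookup is only well defined for one-to-one mappings
            raise ValueError(f"SMILES atom {atom_idx} is mapped twice")
        smiles_to_selfies[atom_idx] = token_idx
    return smiles_to_selfies


# Made-up mapping: SELFIES tokens 1 and 2 are absent, as they would be if
# they did not create an atom (for example a branch token and its index token).
example_mapping = {0: 0, 3: 1, 4: 2}
reverse = invert_atom_mapping(example_mapping)
print(reverse)
```

Tokens that create no atom simply have no entry, so a caller would need to decide separately how to attribute them, which matches the caveat about bond orders above.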
