
[QUESTION]: generating MoleculeDataset with SMILES and additional features (NOT CSV INPUT)


What are you trying to do?

I have a GCNN model built with chemprop that takes SMILES and one additional (binary) feature as inputs. When predicting test compounds, I am not loading data from a CSV file; if I were, I could simply pass the path to the file containing the additional features as an argument to chemprop_predict. Instead, I have the SMILES in a Python list. So I tried to generate a MoleculeDataset object with the utility function get_data_from_smiles from chemprop/data/utils.py. However, that function accepts a list of features generators but not precomputed additional features. I therefore modified it as follows:

Previous attempts

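from collections import OrderedDict
from logging import Logger
from typing import List, Optional

# Import paths assume the chemprop v1.x package layout.
from chemprop.data import MoleculeDatapoint, MoleculeDataset
from chemprop.data.utils import filter_invalid_smiles


def get_data_from_smiles_with_additional_features(smiles: List[str],
                         skip_invalid_smiles: bool = True,
                         logger: Optional[Logger] = None,
                         features: Optional[List] = None,
                         features_generator: Optional[List[str]] = None) -> MoleculeDataset:
    """
    Converts SMILES to a MoleculeDataset.

    :param smiles: A list of SMILES strings.
    :param skip_invalid_smiles: Whether to skip and filter out invalid smiles.
    :param logger: Logger.
    :param features: List of additional features, one entry per SMILES string.
    :param features_generator: List of features generators.
    :return: A MoleculeDataset with all of the provided SMILES.
    """
    debug = logger.debug if logger is not None else print

    # Without this guard, zip(smiles, None) raises a TypeError.
    if features is None:
        features = [None] * len(smiles)
    # zip() silently truncates to the shorter input, so check the lengths first.
    if len(features) != len(smiles):
        raise ValueError('features must contain one entry per SMILES string.')

    data = MoleculeDataset([
        MoleculeDatapoint(
            smiles=smile,
            features=feature,
            row=OrderedDict({'smiles': smile}),
            features_generator=features_generator
        ) for smile, feature in zip(smiles, features)
    ])

    # Filter out invalid SMILES
    if skip_invalid_smiles:
        original_data_len = len(data)
        data = filter_invalid_smiles(data)

        if len(data) < original_data_len:
            debug(f'Warning: {original_data_len - len(data)} SMILES are invalid.')

    return data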

Question

The predictions I obtain with this function match the command-line predictions made with a separate features file containing the same values. However, I have only tested it with a single feature (i.e., one column in the file). Is there a better way to handle this when I have multiple additional features?
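One way to extend the single-feature case to several features is to combine the per-feature columns into one row per molecule, so that each SMILES string is paired with a fixed-length feature vector. The following is a minimal sketch in plain Python; the helper name build_feature_rows and the example feature names and values are hypothetical and are not part of chemprop.

```python
from typing import Dict, List, Sequence


def build_feature_rows(columns: Dict[str, Sequence[float]],
                       n_molecules: int) -> List[List[float]]:
    """Combine per-feature columns into one feature row per molecule.

    Hypothetical helper, not a chemprop function.
    """
    for name, values in columns.items():
        if len(values) != n_molecules:
            raise ValueError(
                f"Feature '{name}' has {len(values)} values, "
                f"expected {n_molecules}."
            )
    # Row i collects the i-th value of every column, in insertion order.
    return [
        [float(values[i]) for values in columns.values()]
        for i in range(n_molecules)
    ]


smiles = ['CCO', 'c1ccccc1', 'CC(=O)O']
# Made-up feature columns, for illustration only.
columns = {
    'is_aromatic': [0, 1, 0],
    'n_heavy_atoms': [3, 6, 4],
}
feature_rows = build_feature_rows(columns, len(smiles))
```

Each entry of feature_rows could then be passed as the per-molecule features value in the function above, probably after converting it to a NumPy array. The column order would presumably have to match the order used when the model was trained.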

Thank you, Vishal


Top GitHub Comments


shihchengli commented, Sep 22, 2022

import torch
from chemprop.args import TrainArgs
path = '/path/to/checkpoint'
state = torch.load(path, map_location=lambda storage, loc: storage)
args = TrainArgs()
args.from_dict(vars(state["args"]), skip_unsettable=True)
print(args.features_size)  # expected size of the additional features

@iwwwish You can use the above code to get the size of molecular features. I will close this issue now. If you have other questions about Chemprop, feel free to open a new issue.
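Once the expected size is known, the additional features can be checked against it before prediction. The following is a minimal sketch in plain Python; the helper name check_additional_features is hypothetical, and treating a features_size of None or 0 as "the model takes no additional features" is an assumption, not something the comment above confirms.

```python
from typing import Optional, Sequence


def check_additional_features(
        features_size: Optional[int],
        feature_rows: Optional[Sequence[Sequence[float]]]) -> None:
    """Raise ValueError if the features do not fit the expected size.

    Hypothetical helper. Assumes features_size is None or 0 when the
    model was trained without additional features.
    """
    if not features_size:
        if feature_rows:
            raise ValueError('Model expects no additional features.')
        return
    if not feature_rows:
        raise ValueError(
            f'Model expects {features_size} additional features per molecule.'
        )
    for i, row in enumerate(feature_rows):
        if len(row) != features_size:
            raise ValueError(
                f'Row {i} has {len(row)} features, expected {features_size}.'
            )


# Passes silently: two molecules with two features each.
check_additional_features(2, [[0.0, 3.0], [1.0, 6.0]])
```

Here the first argument would be the value read from args.features_size in the snippet above.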


iwwwish commented, Sep 22, 2022

That worked! Thank you.

Is there a way to tell from the model file itself whether the model takes additional features as input and, if so, what shape those features are expected to have?

Best, Vishal
