[QUESTION]: generating MoleculeDataset with SMILES and additional features (NOT CSV INPUT)


What are you trying to do?

I have a GCNN model built with chemprop that takes SMILES and one additional (binary) feature as inputs. When predicting test compounds, I am not loading data from a CSV file (if I were, I could simply pass the path to the file containing the additional features for the test compounds as an argument to chemprop_predict). Instead, I have the SMILES available in a Python list. So I tried to generate a MoleculeDataset object using the utility function get_data_from_smiles in chemprop/data/utils.py. However, that function accepts a list of feature generators as an argument, not the additional features themselves. I therefore modified it as follows:

Previous attempts

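from collections import OrderedDict
from logging import Logger
from typing import List

# Import paths assume a chemprop 1.x installation.
from chemprop.data import MoleculeDatapoint, MoleculeDataset
from chemprop.data.utils import filter_invalid_smiles


def get_data_from_smiles_with_additional_features(
        smiles: List[str],
        skip_invalid_smiles: bool = True,
        logger: Logger = None,
        features: List = None,
        features_generator: List[str] = None) -> MoleculeDataset: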
    """
    Converts SMILES to a MoleculeDataset.

    :param smiles: A list of SMILES strings.
    :param skip_invalid_smiles: Whether to skip and filter out invalid smiles.
    :param logger: Logger.
    :param features: List of additional feature vectors, one per SMILES string (same order as smiles).
    :param features_generator: List of features generators.
    :return: A MoleculeDataset with all of the provided SMILES.
    """
    debug = logger.debug if logger is not None else print

    data = MoleculeDataset([
        MoleculeDatapoint(
            smiles=smile,
            features=feature,
            row=OrderedDict({'smiles': smile}),
            features_generator=features_generator
        ) for smile, feature in zip(smiles, features)
    ])

    # Filter out invalid SMILES
    if skip_invalid_smiles:
        original_data_len = len(data)
        data = filter_invalid_smiles(data)

        if len(data) < original_data_len:
            debug(f'Warning: {original_data_len - len(data)} SMILES are invalid.')

    return data

Question

The predictions I obtain with this function match the predictions from the command line when I supply a separate file containing the same features. However, I have only used a single feature (i.e., one column in the file). Is there a better way to handle this when I have multiple additional features?

Thank you, Vishal
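For the multi-feature case asked about above, one option is to assemble one feature vector per molecule before calling the modified function. The following is only a minimal sketch, not part of chemprop: build_feature_rows, the column names, and the example values are hypothetical. It assumes each additional feature is held as its own list with one value per SMILES string. It also checks lengths explicitly, because zip in the function above silently stops at the shorter input.

```python
from typing import Dict, List, Sequence


def build_feature_rows(columns: Dict[str, Sequence[float]],
                       n_molecules: int) -> List[List[float]]:
    """Turn per-feature columns into one feature vector per molecule."""
    for name, values in columns.items():
        if len(values) != n_molecules:
            raise ValueError(
                f"Feature '{name}' has {len(values)} values, expected {n_molecules}."
            )
    # Row i holds every feature of molecule i, in the dict's column order.
    return [[float(col[i]) for col in columns.values()]
            for i in range(n_molecules)]


smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
columns = {
    "binary_flag": [0, 1, 0],             # hypothetical binary feature
    "some_descriptor": [0.5, 1.2, 0.9],   # hypothetical continuous feature
}
features = build_feature_rows(columns, len(smiles))
```

The resulting features list has one entry per SMILES string and could then be passed as the features argument of the function above. The column order presumably has to match the order used when the model was trained.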

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

shihchengli commented, Sep 22, 2022 (2 reactions)
import torch
from chemprop.args import TrainArgs
path = '/path/to/checkpoint'
state = torch.load(path, map_location=lambda storage, loc: storage)
args = TrainArgs()
args.from_dict(vars(state["args"]), skip_unsettable=True)
args.features_size  # expected shape of the additional features

@iwwwish You can use the above code to get the size of molecular features. I will close this issue now. If you have other questions about Chemprop, feel free to open a new issue.
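Once the expected size is known, it can be compared against the feature vectors before prediction. This is a hedged sketch only: check_feature_width is a hypothetical helper, not a chemprop function, and it assumes that an expected size of None or 0 means the model was trained without additional features.

```python
from typing import Optional, Sequence


def check_feature_width(features: Optional[Sequence[Sequence[float]]],
                        expected_size: Optional[int]) -> None:
    """Raise ValueError if the feature vectors do not match the expected size."""
    if not expected_size:
        # Assumption: None or 0 means "no additional features expected".
        if features is not None:
            raise ValueError("No additional features expected, but some were given.")
        return
    if features is None:
        raise ValueError(
            f"Expected {expected_size} additional features per molecule, got none."
        )
    for i, row in enumerate(features):
        if len(row) != expected_size:
            raise ValueError(
                f"Row {i} has {len(row)} features, expected {expected_size}."
            )


# Passes silently: two molecules, two features each.
check_feature_width([[0.0, 0.5], [1.0, 1.2]], expected_size=2)
```

In practice the expected_size argument would be the args.features_size value read from the checkpoint as shown above.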

iwwwish commented, Sep 22, 2022 (0 reactions)

That worked! Thank you.

Is there a way to tell from the model file itself whether the model takes additional features as input and, if so, the expected shape of those features?

Best, Vishal
