Understanding multiclass on Chemprop
I have a CSV file that looks like the example below:
CAN_SMILES | class1 | class2 | class3 |
---|---|---|---|
SMILES1 | 0 | 0 | 1 |
SMILES2 | 0 | 0 | 1 |
SMILES3 | 0 | 0 | 1 |
I then trained the NN with:
chemprop_train --data_path data.csv --dataset_type multiclass \
--save_dir 00_all_datapoints_dedup_random_split.train.accuracy \
--metric accuracy --split_type random --split_sizes 0.8 0.2 0.0 \
--gpu 1 --dropout 0 --ensemble_size 1 --num_folds 1 --hidden_size 300 \
--ffn_hidden_size 300 --smiles_column CAN_SMILES \
--target_columns class1 class2 class3 \
--multiclass_num_classes 3
What I was expecting from the predictions was something like [P(class1), P(class2), P(class3)], with those three probabilities summing to 1. However, I am getting the following:
CAN_SMILES | class1 | class2 | class3 | class1_class_0 | class1_class_1 | class1_class_2 |
---|---|---|---|---|---|---|
SMILES1 | 0 | 0 | 1 | [0.4089986979961395, 0.3245898485183716, 0.2664114236831665] | [0.4382680356502533, 0.18726319074630737, 0.3744688332080841] | [0.37828463315963745, 0.417496919631958, 0.20421843230724335] |
SMILES2 | 0 | 0 | 1 | [0.4023689031600952, 0.3176715075969696, 0.27995961904525757] | [0.42649418115615845, 0.20301619172096252, 0.3704896569252014] | [0.365766704082489, 0.4299478232860565, 0.20428545773029327] |
SMILES3 | 0 | 0 | 1 | [0.4001476764678955, 0.3139444887638092, 0.28590789437294006] | [0.4191807508468628, 0.2087540626525879, 0.3720651865005493] | [0.3638008236885071, 0.41967928409576416, 0.21651984751224518] |
I am completely lost with these results. As I mentioned above, I was expecting a single output vector with the probabilities for each class. Also, I don't understand the label class1_class_0: what does it refer to? I inspected the code, but it is still not clear to me. I hope someone can help me understand this. As you can see, I have three columns for three different classes, and I want Chemprop to predict the probability of each one as a single vector [P(class1), P(class2), P(class3)].
Note: I also tried saving the CSV file with rows of the form SMILES, [class1, class2, class3], but Chemprop does not seem to parse that.
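One way to read the output above is to check what each bracketed list adds up to. The sketch below copies the three lists from the SMILES1 row and sums each one. It only inspects the numbers already shown and assumes nothing about Chemprop internals.

```python
# Probabilities copied verbatim from the SMILES1 row of the prediction output.
row = {
    "class1_class_0": [0.4089986979961395, 0.3245898485183716, 0.2664114236831665],
    "class1_class_1": [0.4382680356502533, 0.18726319074630737, 0.3744688332080841],
    "class1_class_2": [0.37828463315963745, 0.417496919631958, 0.20421843230724335],
}

# Sum each list separately; a probability distribution should total about 1.
sums = {column: sum(probs) for column, probs in row.items()}

for column, total in sums.items():
    print(f"{column}: sum = {total:.6f}")
```

Each list totals 1 to within floating-point error, so every list looks like a complete three-way probability distribution on its own. A plausible reading (an interpretation, not something the output states) is that each of the three target columns is being treated as a separate 3-class problem, rather than the three columns together forming one 3-class problem.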
Top GitHub Comments
Dear @muammar, this looks great. Happy I could help!
Hey @muammar, if the classes are mutually exclusive (which they seem to be), you need to reformat your CSV so that it has a single target column of integer class labels, with class1 = 0, class2 = 1 and class3 = 2.
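The reformatting described above could be sketched as follows, using only the Python standard library. This is an illustration under assumptions: the sample rows are made-up placeholders (the rows in the question all belong to class3), and the output column name `label` is hypothetical, not something Chemprop requires.

```python
import csv
import io

# Stand-in for data.csv. The SMILES values and one-hot rows are placeholders.
csv_text = """CAN_SMILES,class1,class2,class3
SMILES1,0,0,1
SMILES2,1,0,0
SMILES3,0,1,0
"""

class_columns = ["class1", "class2", "class3"]


def one_hot_to_label(row):
    """Return the 0-based index of the single column set to 1."""
    flags = [int(row[column]) for column in class_columns]
    if sum(flags) != 1:
        # The mapping is only valid for mutually exclusive classes.
        raise ValueError(f"expected exactly one active class, got {flags}")
    return flags.index(1)


rows = list(csv.DictReader(io.StringIO(csv_text)))
converted = [(row["CAN_SMILES"], one_hot_to_label(row)) for row in rows]

# Write the single-label table; "label" is a hypothetical column name.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["CAN_SMILES", "label"])
writer.writerows(converted)
print(buffer.getvalue())
```

With a file in this shape, the training command from the question would presumably pass the single column to `--target_columns` (here `label`) and keep `--multiclass_num_classes 3`, so that the model returns one three-way probability vector per molecule.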