Reported performance on TransE differs significantly (correct hyperparameters used)
Description
Reported performance of TransE differs from the results published at https://docs.ampligraph.org/en/latest/experiments.html, even with the parameters given there, on the FB15K-237 dataset.
Actual Behavior
```
mr_score(ranks0)               232.22932772286916
mrr_score(ranks0)              0.23103557722066143
hits_at_n_score(ranks0, n=1)   0.10348370682062824
hits_at_n_score(ranks0, n=3)   0.29459829728936293
hits_at_n_score(ranks0, n=10)  0.4654320383599178
```
Expected Behavior
The expected results, together with the hyperparameters used, are posted at https://docs.ampligraph.org/en/latest/experiments.html.
Steps to Reproduce
```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss
from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features.models import TransE
from ampligraph.utils import save_model
from ampligraph.evaluation import hits_at_n_score, mr_score, evaluate_performance, mrr_score

X = load_fb15k_237()

model = TransE(batches_count=64, seed=0, epochs=4000, k=400, eta=30,
               optimizer='adam', optimizer_params={'lr': 0.0001},
               loss='multiclass_nll',
               regularizer='LP', regularizer_params={'lambda': 0.0001, 'p': 2})

model.fit(X['train'])
save_model(model, model_name_path='transe_seed_0.pkl')

filter = np.concatenate((X['train'], X['valid'], X['test']))
ranks0 = evaluate_performance(X['test'], model, filter, verbose=False)

mr = mr_score(ranks0)
mrr = mrr_score(ranks0)
hits_1 = hits_at_n_score(ranks0, n=1)
hits_3 = hits_at_n_score(ranks0, n=3)
hits_10 = hits_at_n_score(ranks0, n=10)
```
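As a follow-up usage note: the script above saves the trained model with save_model, so the metrics can be recomputed later without retraining. The sketch below assumes restore_model from ampligraph.utils (the counterpart of save_model in AmpliGraph 1.x) and reuses the file name from the script.

```python
import numpy as np
from ampligraph.datasets import load_fb15k_237
from ampligraph.utils import restore_model  # assumed counterpart of save_model
from ampligraph.evaluation import evaluate_performance, mr_score, mrr_score, hits_at_n_score

# Reload the model saved by the reproduction script and recompute the filtered metrics.
X = load_fb15k_237()
model = restore_model(model_name_path='transe_seed_0.pkl')

filter_triples = np.concatenate((X['train'], X['valid'], X['test']))
ranks = evaluate_performance(X['test'], model, filter_triples=filter_triples, verbose=False)

print('MR:     ', mr_score(ranks))
print('MRR:    ', mrr_score(ranks))
print('Hits@10:', hits_at_n_score(ranks, n=10))
```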
Issue Analytics
- Created 2 years ago
- Comments: 7 (3 by maintainers)
Got it! Thanks so much for helping me out with such a detailed reply, and thanks a ton for the code!
It depends on your use case. If x_test is a set of made-up hypotheses, which may or may not be facts, then x_filter shouldn't contain x_test.
But if x_test is made up of known facts, then we must include it in the filter. This is what is commonly done in the KG community, and is the standard evaluation protocol described in Bordes et al.
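A minimal sketch of the two filtering setups described above, assuming the AmpliGraph 1.x evaluate_performance API and the variable names from the reproduction script:

```python
import numpy as np
from ampligraph.evaluation import evaluate_performance

# Case 1: x_test holds known facts (standard benchmark protocol, Bordes et al.).
# Include the test triples in the filter so that other true triples are not
# penalized during ranking.
filter_with_test = np.concatenate((X['train'], X['valid'], X['test']))
ranks = evaluate_performance(X['test'], model, filter_triples=filter_with_test, verbose=False)

# Case 2: x_test holds made-up hypotheses that may or may not be facts.
# Leave the test triples out of the filter.
filter_without_test = np.concatenate((X['train'], X['valid']))
ranks = evaluate_performance(X['test'], model, filter_triples=filter_without_test, verbose=False)
```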
What you have done above is correct: `'x_valid': X['valid'][::2]`.
You can also set it to X['valid'], but we didn't see much difference in performance. Each early stopping check takes a lot of time, so we reduce the validation set size just for speed.
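A hedged sketch of that early-stopping setup, assuming the early_stopping_params dictionary accepted by fit() in AmpliGraph 1.x (the burn_in/check_interval/stop_interval values below are illustrative, not the ones used for the reported results):

```python
# Early stopping on every other validation triple, for speed, as discussed above.
model.fit(X['train'],
          early_stopping=True,
          early_stopping_params={
              'x_valid': X['valid'][::2],  # reduced validation set; X['valid'] also works
              'criteria': 'mrr',           # metric monitored on the validation split
              'burn_in': 100,              # illustrative: epochs before the first check
              'check_interval': 50,        # illustrative: epochs between checks
              'stop_interval': 4,          # illustrative: non-improving checks before stopping
          })
```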
We include `X['test']` in the filter for the standard datasets, as the `X['test']` triples are known facts.