Unintuitive behaviour with cross_validation.random_train_test_split

random_train_test_split makes it easy to split the interactions matrix into train and test datasets, but if your data has weights you have to apply random_train_test_split twice with the same random_state parameter. My concern is that it would be intuitive to do something like:

import numpy as np

from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split

users = np.random.choice([0., 1., 2.], (10, 1))
items = np.random.choice([0., 1., 2.], (10, 1))
weight = np.random.rand(10,1)
data = np.concatenate((users, items, weight), axis=1)

dataset = Dataset()
dataset.fit(users=np.unique(data[:, 0]), items=np.unique(data[:, 1]))
interactions, weight = dataset.build_interactions((i[0], i[1], i[2]) for i in data)

test_percentage = 0.2
random_state = np.random.RandomState(seed=1)

train, test = random_train_test_split(
    interactions=interactions,
    test_percentage=test_percentage,
    random_state=random_state
)
train_weight, test_weight = random_train_test_split(
    interactions=weight,
    test_percentage=test_percentage,
    random_state=random_state
)

np.array_equal(train.row, train_weight.row)
np.array_equal(train.col, train_weight.col)
np.array_equal(test.row, test_weight.row)
np.array_equal(test.col, test_weight.col)

>>> False
>>> False
>>> False
>>> False

This will result in an incorrect split because the state of the random_state changes after the first call to random_state.shuffle. For the above example to work as intended you need to make separate but identical RandomStates:

random_state_interaction = np.random.RandomState(seed=1)
random_state_weight = np.random.RandomState(seed=1)

train, test = random_train_test_split(
    interactions=interactions,
    test_percentage=test_percentage,
    random_state=random_state_interaction
)
train_weight, test_weight = random_train_test_split(
    interactions=weight,
    test_percentage=test_percentage,
    random_state=random_state_weight
)

np.array_equal(train.row, train_weight.row)
np.array_equal(train.col, train_weight.col)
np.array_equal(test.row, test_weight.row)
np.array_equal(test.col, test_weight.col)

>>> True
>>> True
>>> True
>>> True

It works but I think it’s a little awkward. Two possible solutions/suggestions:

1. Only require a seed parameter and create a RandomState inside the cross_validation._shuffle function. This has the added benefit of fitting in with the larger libraries that only require a seed and not a RandomState generator. I also don’t see any additional flexibility gained by passing in a generator instead of a simple integer.
2. Make a copy of random_state before applying shuffle in cross_validation._shuffle.
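
Both suggestions can be sketched with NumPy alone. The helper names below (shuffle_with_seed, shuffle_with_copied_state) are hypothetical and are not part of LightFM; the sketch only assumes that the shuffle applies one permutation, drawn from the generator, to parallel arrays.

```python
import numpy as np


def shuffle_with_seed(uids, iids, data, seed):
    """Suggestion 1 (hypothetical helper): accept an integer seed and
    build the RandomState internally, so equal seeds give equal orderings."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(data))
    return uids[order], iids[order], data[order]


def shuffle_with_copied_state(uids, iids, data, random_state):
    """Suggestion 2 (hypothetical helper): shuffle with a copy of the
    generator so the caller's random_state is left untouched."""
    rng = np.random.RandomState()
    rng.set_state(random_state.get_state())
    order = rng.permutation(len(data))
    return uids[order], iids[order], data[order]


uids = np.arange(10)
iids = np.arange(10, 20)
ones = np.ones(10)
weights = np.linspace(0.1, 1.0, 10)

# The problem from the issue: a shared generator advances between calls,
# so two consecutive draws produce different permutations.
shared = np.random.RandomState(seed=1)
shared_orders_match = np.array_equal(shared.permutation(10), shared.permutation(10))

# Suggestion 1: the same seed yields the same ordering for both arrays.
seed_rows, _, _ = shuffle_with_seed(uids, iids, ones, seed=1)
seed_rows_w, _, _ = shuffle_with_seed(uids, iids, weights, seed=1)

# Suggestion 2: one generator can be passed twice without drifting.
state = np.random.RandomState(seed=1)
copy_rows, _, _ = shuffle_with_copied_state(uids, iids, ones, state)
copy_rows_w, _, _ = shuffle_with_copied_state(uids, iids, weights, state)
```

With either helper, the interactions and the weights end up in the same order, which is the property the two-generator workaround above achieves by hand.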

Thoughts?

Issue Analytics

State:
Created 5 years ago
Comments: 15 (1 by maintainers)

Top GitHub Comments

igorkf commented, Nov 18, 2020

After splitting into train and test, how can I know which users are in train or test? I split into train/test, and now I want to evaluate the model with my own metric (NDCG), but only on the users that are in the test matrix.
How can I pick the users from the test matrix?
After this I would like to map these users back to my original data.
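
One possible approach, sketched under assumptions: the split matrices are SciPy COO matrices, so the distinct values of test.row are the internal indices of users with at least one test interaction. The user_id_map dictionary below is a hypothetical stand-in for the mapping from original user IDs to internal indices; I believe Dataset.mapping() returns such a dictionary as the first element of its tuple, but that is worth checking against the LightFM documentation.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy stand-in for the test matrix returned by the split (3 users x 4 items).
rows = np.array([0, 2, 2])
cols = np.array([1, 0, 3])
vals = np.ones(3)
test = coo_matrix((vals, (rows, cols)), shape=(3, 4))

# Hypothetical mapping: original user ID -> internal row index.
user_id_map = {"alice": 0, "bob": 1, "carol": 2}

# Internal indices of users that have at least one test interaction.
test_user_indices = np.unique(test.row)

# Invert the mapping to get back to the original user IDs.
index_to_user = {idx: uid for uid, idx in user_id_map.items()}
test_users = [index_to_user[int(idx)] for idx in test_user_indices]
```

A custom metric such as NDCG could then be computed only for the users listed in test_users.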

maciejkula commented, Jul 14, 2018

Interactions for the WARP and BPR losses are binary only. The values have no effect.
You cannot swap them.

In the implementation, the presence or absence of an entry in the interaction matrix determines whether a gradient descent step is taken to update the model to encode a preference. The weight determines the magnitude of that step.
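
A toy illustration of that idea, not LightFM's actual WARP or BPR code: the sketch below assumes a simple squared-error objective and a hypothetical weighted_updates function. It takes one step per stored entry and scales that step by the entry's weight, while absent entries trigger no update at all.

```python
def weighted_updates(entries, learning_rate=0.1):
    """Toy sketch: one gradient step per stored (user, item, weight) entry.

    Assumes the loss 0.5 * (1 - score)**2 for each stored entry, so the
    gradient with respect to the score is (score - 1).
    """
    scores = {}
    for user, item, weight in entries:
        current = scores.get((user, item), 0.0)
        gradient = current - 1.0
        # The weight scales the size of the step, not whether it happens.
        scores[(user, item)] = current - learning_rate * weight * gradient
    return scores


# Two stored interactions for user 0; the pair (0, 3) is absent.
entries = [(0, 1, 1.0), (0, 2, 3.0)]
scores = weighted_updates(entries)
```

Here the entry with weight 3.0 moves three times as far as the entry with weight 1.0, and the absent pair (0, 3) never appears in scores.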