Unintuitive behaviour with cross_validate.random_train_test_split
The random_train_test_split function makes it easy to split the interactions matrix into train and test datasets, but if you have data with weights you have to apply random_train_test_split twice with the same random_state parameter. My concern is that the intuitive way to do this looks like:
import numpy as np

from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split

# Build a toy (user, item, weight) dataset.
users = np.random.choice([0., 1., 2.], (10, 1))
items = np.random.choice([0., 1., 2.], (10, 1))
weight = np.random.rand(10, 1)
data = np.concatenate((users, items, weight), axis=1)

dataset = Dataset()
dataset.fit(users=np.unique(data[:, 0]), items=np.unique(data[:, 1]))
interactions, weight = dataset.build_interactions((i[0], i[1], i[2]) for i in data)
test_percentage = 0.2
random_state = np.random.RandomState(seed=1)
train, test = random_train_test_split(
interactions=interactions,
test_percentage=test_percentage,
random_state=random_state
)
train_weight, test_weight = random_train_test_split(
interactions=weight,
test_percentage=test_percentage,
random_state=random_state
)
np.array_equal(train.row, train_weight.row)
>>> False
np.array_equal(train.col, train_weight.col)
>>> False
np.array_equal(test.row, test_weight.row)
>>> False
np.array_equal(test.col, test_weight.col)
>>> False
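The cause is easy to demonstrate in isolation (a small standalone sketch, not from the example above): a RandomState mutates on every draw, so two consecutive shuffle calls on the same instance produce different permutations.
import numpy as np

random_state = np.random.RandomState(seed=1)
first = np.arange(10)
second = np.arange(10)
random_state.shuffle(first)   # advances the generator's internal state
random_state.shuffle(second)  # starts from the already-advanced state
np.array_equal(first, second)
>>> False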
The same thing happens inside random_train_test_split: the weights end up split inconsistently with the interactions, because the state of the random_state changes after the first call to random_state.shuffle. For the above example to work as intended you need to create two separate but identical RandomState instances:
random_state_interaction = np.random.RandomState(seed=1)
random_state_weight = np.random.RandomState(seed=1)
train, test = random_train_test_split(
interactions=interactions,
test_percentage=test_percentage,
random_state=random_state_interaction
)
train_weight, test_weight = random_train_test_split(
interactions=weight,
test_percentage=test_percentage,
random_state=random_state_weight
)
np.array_equal(train.row, train_weight.row)
>>> True
np.array_equal(train.col, train_weight.col)
>>> True
np.array_equal(test.row, test_weight.row)
>>> True
np.array_equal(test.col, test_weight.col)
>>> True
It works, but I think it's a little awkward. Two possible solutions/suggestions:
- Only require a seed parameter and create a RandomState inside the cross_validate._shuffle method. This has the added benefit of matching the larger libraries that only require a seed and not a RandomState generator. I also don't see any additional flexibility gained by passing in a generator instead of a simple integer.
- Make a copy of random_state before applying shuffle in cross_validate._shuffle (a sketch of this approach follows below).
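For illustration, a minimal user-side version of the second suggestion; the helper consistent_train_test_split is hypothetical (not part of lightfm) and simply hands every call a deep copy of the same generator:
import copy

import numpy as np
from lightfm.cross_validation import random_train_test_split

def consistent_train_test_split(matrices, test_percentage=0.2, seed=1):
    # Hypothetical helper: split several aligned COO matrices identically
    # by giving every random_train_test_split call a fresh copy of the
    # same RandomState, so each call consumes identical randomness.
    random_state = np.random.RandomState(seed=seed)
    return [
        random_train_test_split(
            interactions=matrix,
            test_percentage=test_percentage,
            random_state=copy.deepcopy(random_state),
        )
        for matrix in matrices
    ]

(train, test), (train_weight, test_weight) = consistent_train_test_split(
    [interactions, weight]
)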
Thoughts?
Top GitHub Comments
After splitting into train and test, how can I know which users are in train or test? I split into train/test, and now I want to evaluate the model with my own metric (NDCG), but only on the users that are in the test matrix.
How can I pick the users from the test matrix?
After this I would like to map these users back to my original data.
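One way to do this, assuming the splits are the scipy COO matrices that random_train_test_split returns: the row attribute holds the internal user indices, and Dataset.mapping() recovers the original identifiers.
import numpy as np

# Internal user indices that appear in the test split (COO format).
test_user_internal_ids = np.unique(test.row)

# Dataset.mapping() returns (user id map, user feature map,
# item id map, item feature map); invert the user id map to get
# back from internal indices to the original user identifiers.
user_id_map = dataset.mapping()[0]
internal_to_original = {internal: original for original, internal in user_id_map.items()}
test_users = [internal_to_original[i] for i in test_user_internal_ids]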
In the implementation, the presence or absence of an entry in the interaction matrix determines whether a gradient descent step is taken to update the model to encode a preference. The weight determines the magnitude of that step.
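In code, that means the interaction matrix and the weight matrix are passed to fit together; a minimal sketch, reusing the split from above (sample_weight is LightFM's actual parameter name):
from lightfm import LightFM

model = LightFM(loss='warp')
# Entries present in `train` decide whether an update happens at all;
# the matching entries in `train_weight` scale the size of each update.
model.fit(train, sample_weight=train_weight, epochs=10)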