Different latent vectors for same (test)users
Hi,
I created a model for a retailer with 38k customers and 36k articles and a sparsity of 0.55%. One part of the analysis is to find similar customers. To get the "neighbours" I took the latent features of the users and computed the dot product between them (similar to the LightFM example). In order to get a better understanding of my model I created some test customers and calculated their similarity. Two of those test customers are (apart from a different name) exactly the same: both bought the same (one) article once and are in the same industry (a user feature). In my understanding they should have identical vectors and therefore a dot product of 1. But unfortunately the vectors are different and the dot product is -0.057281606.
Does somebody have an explanation for how this can happen?
Thanks in advance!
Best, Moritz
Those are the normalized vectors:

Testuser 1:
[ 0.13190027 0.26906827 -0.0822762 0.21407925 0.13211617 0.2566751
0.17467268 0.02340734 0.09154253 0.2269812 -0.26795995 0.06671422
0.08172801 0.1463228 0.21353354 0.12667963 -0.02653628 -0.0790253
0.03541145 0.09163333 0.05831769 0.4006284 0.14730851 0.28267866
-0.05757256 0.1948472 -0.08183019 0.28852767 0.09479482 0.30822176]
Testuser 2:
[-0.15500401 0.2269293 0.21974853 -0.02541173 -0.16325705 -0.13497874
-0.17643186 0.09408431 0.04687239 0.20745914 0.27600515 0.02616096
-0.24575633 -0.27663186 -0.10878677 0.27803454 0.08667072 0.06445353
0.20262392 0.1274841 0.30217353 -0.04354052 0.29860505 0.30625728
0.0359767 -0.15467772 -0.09467538 -0.12735379 -0.20820434 0.06034918]
customer1 = "T0000001"
customer2 = "T0000004"
num_user = dataset.interactions_shape()[0]
user_x1 = mappings.kundennummer2row[customer1]
user_x2 = mappings.kundennummer2row[customer2]
user_embeddings_norm = (model.user_embeddings[:num_user].T
/ np.linalg.norm(model.user_embeddings[:num_user], axis=1)).T
similarity = np.dot(user_embeddings_norm[user_x2], user_embeddings_norm[user_x1])
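As an aside, the Dataset class also exposes its own id-to-row mappings, which could replace a hand-rolled kundennummer2row dict. A minimal sketch, assuming the dataset, customer1 and customer2 names from above:

# dataset.mapping() returns four dicts:
# (user id -> row, user feature -> column, item id -> row, item feature -> column)
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()
user_x1 = user_id_map[customer1]  # row index of "T0000001"
user_x2 = user_id_map[customer2]  # row index of "T0000004"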
That’s how I built the dataset:
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(items=artikel_meta["Artikelnummer"],
            users=kunden_meta["Hauptkundennummer"],
            item_features=artikel_meta["Warengruppe"].unique(),
            user_features=kunden_meta["Branchenschlüssel"].unique())

# one (user, item, weight) triple per grouped 2019 sales row
(interactions, weights) = dataset.build_interactions(
    [(x["Hauptkundennummer"], x["Artikelnummer"], x["Kernumsatz"])
     for index, x in sales_data_2019_grouped.iterrows()])
def prepare_features_format(data, id_column, feature_columns):
    # build [(entity id, [feature values as strings]), ...] as expected by
    # Dataset.build_user_features / Dataset.build_item_features
    features = []
    for row in range(data.shape[0]):
        features.append((data[id_column][row],
                         [str(data[feature][row]) for feature in feature_columns]))
    return tuple(features)

item_features = dataset.build_item_features(
    prepare_features_format(artikel_meta, "Artikelnummer", ["Warengruppe"]))
user_features = dataset.build_user_features(
    prepare_features_format(kunden_meta, "Hauptkundennummer", ["Branchenschlüssel"]))
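For reference, the helper above produces the (id, [feature names]) structure that build_user_features and build_item_features expect. A minimal sketch, with made-up Branchenschlüssel values purely to illustrate the shape of the data:

# Hypothetical example of what prepare_features_format(kunden_meta, ...) returns:
# one (Hauptkundennummer, [feature values]) pair per customer; the industry
# code "4711" is invented for illustration only.
example_user_features = (
    ("T0000001", ["4711"]),
    ("T0000004", ["4711"]),
)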
Some translations: "Artikelnummer" = "item number", "Hauptkundennummer" = "customer number", "Warengruppe" = "product group", "Branchenschlüssel" = "industry code", "Kernumsatz" = "core revenue"
Top GitHub Comments
Ah, I left out something in my explanation. You must provide the user feature matrix as an argument to get_user_representations(). While you can also use model.user_embeddings, you have to make sure that you add up all of the user's embeddings prior to calculating similarity with other users. I'm also not sure if you missed this or not, so I'll walk through an example below just in case it's helpful!

Imagine you use the Dataset class to build both your interactions matrix and your user and item features. You set user_identity_features=True, and you have two other user features, device_is_ios and device_is_android, and each of these features can be 1 or 0.

If you build your user feature matrix, it will have shape (num_users, num_user_features), where num_user_features = num_users + 2. This is because you are building a unique user feature for each user as well as the 2 extra features. This also means that your user_embeddings matrix will have shape (num_user_features, num_components). That is, you get an embedding for each unique user feature and for the extra device_is_* features.

So, when you want to calculate a user's "representation" in order to calculate similarity, you need to add up both the user's unique embedding and their device_is_* embeddings together.
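A minimal sketch of the two options described above, assuming the dataset, model, the user_features matrix and the row indices user_x1 / user_x2 from the question:

import numpy as np

# Option 1: let LightFM combine the per-feature embeddings by passing the
# user feature matrix built with Dataset.build_user_features().
_, user_repr = model.get_user_representations(features=user_features)

# Option 2: combine them manually. user_features has shape
# (num_users, num_user_features) and model.user_embeddings has shape
# (num_user_features, num_components), so the sparse product sums each
# user's identity embedding and feature embeddings.
user_repr_manual = user_features.dot(model.user_embeddings)

# Cosine similarity between the two test users.
user_repr_norm = user_repr / np.linalg.norm(user_repr, axis=1, keepdims=True)
similarity = np.dot(user_repr_norm[user_x1], user_repr_norm[user_x2])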
@EthanRosenthal Thank you, Ethan! When I am using item_features, the model has lower precision compared to pure CF. In your comment you mention device_is_*. Does it have to be in one-hot format? For example, my items data:

So, I am building the features as
item_features = dataset.build_item_features(
    [(i.article_id, [i.section_primary, i.writer_name]) for i in items.itertuples()])
Thanks!