Different latent vectors for same (test)users
Hi,
I created a model for a retailer with 38k customers and 36k articles and a sparsity of 0.55%. One part of the analysis is to find similar customers. To get the "neighbours" I took the latent features of the users and computed the dot product between them (similar to the LightFM example). In order to get a better understanding of my model I created some test customers and calculated their similarity. Two of those test customers are (apart from a different name) exactly the same: both bought the same (one) article once and are in the same industry (a user feature). In my understanding they should have identical vectors and therefore a dot product of 1. But unfortunately the vectors are different and the dot product is -0.057281606.
Does somebody have an explanation for how this can happen?
Thanks in advance!
Best, Moritz
Those are the normalized vectors:

Testuser 1:
[ 0.13190027 0.26906827 -0.0822762 0.21407925 0.13211617 0.2566751
0.17467268 0.02340734 0.09154253 0.2269812 -0.26795995 0.06671422
0.08172801 0.1463228 0.21353354 0.12667963 -0.02653628 -0.0790253
0.03541145 0.09163333 0.05831769 0.4006284 0.14730851 0.28267866
-0.05757256 0.1948472 -0.08183019 0.28852767 0.09479482 0.30822176]
Testuser 2:
[-0.15500401 0.2269293 0.21974853 -0.02541173 -0.16325705 -0.13497874
-0.17643186 0.09408431 0.04687239 0.20745914 0.27600515 0.02616096
-0.24575633 -0.27663186 -0.10878677 0.27803454 0.08667072 0.06445353
0.20262392 0.1274841 0.30217353 -0.04354052 0.29860505 0.30625728
0.0359767 -0.15467772 -0.09467538 -0.12735379 -0.20820434 0.06034918]
customer1 = "T0000001"
customer2 = "T0000004"
num_user = dataset.interactions_shape()[0]
user_x1 = mappings.kundennummer2row[customer1]
user_x2 = mappings.kundennummer2row[customer2]
user_embeddings_norm = (model.user_embeddings[:num_user].T
/ np.linalg.norm(model.user_embeddings[:num_user], axis=1)).T
similarity = np.dot(user_embeddings_norm[user_x2], user_embeddings_norm[user_x1])
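As an aside, the Dataset class also exposes its own id-to-row mappings, which could replace a hand-rolled kundennummer2row dict. A minimal sketch, assuming the dataset, customer1 and customer2 names from above:

# dataset.mapping() returns four dicts:
# (user id -> row, user feature -> column, item id -> row, item feature -> column)
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()
user_x1 = user_id_map[customer1]  # row index of "T0000001"
user_x2 = user_id_map[customer2]  # row index of "T0000004"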
That’s how I built the dataset:
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(items=artikel_meta["Artikelnummer"],
            users=kunden_meta["Hauptkundennummer"],
            item_features=artikel_meta["Warengruppe"].unique(),
            user_features=kunden_meta["Branchenschlüssel"].unique())

# one (user, item, weight) triple per grouped 2019 sales row
(interactions, weights) = dataset.build_interactions(
    [(x["Hauptkundennummer"], x["Artikelnummer"], x["Kernumsatz"])
     for index, x in sales_data_2019_grouped.iterrows()])
def prepare_features_format(data, id_column, feature_columns):
    # build [(entity id, [feature values as strings]), ...] as expected by
    # Dataset.build_user_features / Dataset.build_item_features
    features = []
    for row in range(data.shape[0]):
        features.append((data[id_column][row],
                         [str(data[feature][row]) for feature in feature_columns]))
    return tuple(features)

item_features = dataset.build_item_features(
    prepare_features_format(artikel_meta, "Artikelnummer", ["Warengruppe"]))
user_features = dataset.build_user_features(
    prepare_features_format(kunden_meta, "Hauptkundennummer", ["Branchenschlüssel"]))
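For reference, the helper above produces the (id, [feature names]) structure that build_user_features and build_item_features expect. A minimal sketch, with made-up Branchenschlüssel values purely to illustrate the shape of the data:

# Hypothetical example of what prepare_features_format(kunden_meta, ...) returns:
# one (Hauptkundennummer, [feature values]) pair per customer; the industry
# code "4711" is invented for illustration only.
example_user_features = (
    ("T0000001", ["4711"]),
    ("T0000004", ["4711"]),
)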
Some translations: "Artikelnummer" = "item number", "Hauptkundennummer" = "customer number", "Warengruppe" = "product group", "Branchenschlüssel" = "industry code", "Kernumsatz" = "core revenue"
Top GitHub Comments
Ah, I left out something in my explanation. You must provide the user feature matrix as an argument to get_user_representations(). While you can also use model.user_embeddings, you have to make sure that you add up all of the user's embeddings prior to calculating similarity with other users. I'm also not sure if you missed this or not, so I'll walk through an example below just in case it's helpful!

Imagine you use the Dataset class to build both your interactions matrix and your user and item features. You set user_identity_features=True, and you have two other user features, device_is_ios and device_is_android, and each of these features can be 1 or 0.

If you build your user feature matrix, it will have shape (num_users, num_user_features), where num_user_features = num_users + 2. This is because you are building a unique user feature for each user as well as the 2 extra features. This also means that your user_embeddings matrix will have shape (num_user_features, num_components). That is, you get an embedding for each unique user feature and for the extra device_is_* features.

So, when you want to calculate a user's "representation" in order to calculate similarity, you need to add up both the user's unique embedding and their device_is_* embeddings together.
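A minimal sketch of the two options described above, assuming the dataset, model, the user_features matrix and the row indices user_x1 / user_x2 from the question:

import numpy as np

# Option 1: let LightFM combine the per-feature embeddings by passing the
# user feature matrix built with Dataset.build_user_features().
_, user_repr = model.get_user_representations(features=user_features)

# Option 2: combine them manually. user_features has shape
# (num_users, num_user_features) and model.user_embeddings has shape
# (num_user_features, num_components), so the sparse product sums each
# user's identity embedding and feature embeddings.
user_repr_manual = user_features.dot(model.user_embeddings)

# Cosine similarity between the two test users.
user_repr_norm = user_repr / np.linalg.norm(user_repr, axis=1, keepdims=True)
similarity = np.dot(user_repr_norm[user_x1], user_repr_norm[user_x2])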
@EthanRosenthal Thank you, Ethan! When I am using item_features, the model has lower precision compared to pure CF. In your comment you mention device_is_*. Does it have to be in one-hot format? For example, my items data:

So, I am building the features as
item_features = dataset.build_item_features(
    [(i.article_id, [i.section_primary, i.writer_name]) for i in items.itertuples()])
Thanks!