
Different latent vectors for same (test)users

See original GitHub issue

Hi,

I created a model for a retailer with 38k customers, 36k articles, and a sparsity of 0.55%. One part of the analysis is to find similar customers. To get the “neighbours”, I took the latent features of the users and computed the dot product between them (similar to the LightFM example). To better understand my model, I created some test customers and calculated their similarity. Two of those test customers are (apart from a different name) exactly the same: both bought the same (single) article once and are in the same industry (a user feature). In my understanding they should have identical vectors and therefore a dot product of 1. But unfortunately the vectors are different and the dot product is -0.057281606.

Does somebody have an explanation for how this can happen?

Thanks in advance!

Best, Moritz

These are the normalized vectors.

Testuser 1:

[ 0.13190027  0.26906827 -0.0822762   0.21407925  0.13211617  0.2566751
  0.17467268  0.02340734  0.09154253  0.2269812  -0.26795995  0.06671422
  0.08172801  0.1463228   0.21353354  0.12667963 -0.02653628 -0.0790253
  0.03541145  0.09163333  0.05831769  0.4006284   0.14730851  0.28267866
 -0.05757256  0.1948472  -0.08183019  0.28852767  0.09479482  0.30822176]

Testuser 2:

[-0.15500401  0.2269293   0.21974853 -0.02541173 -0.16325705 -0.13497874
 -0.17643186  0.09408431  0.04687239  0.20745914  0.27600515  0.02616096
 -0.24575633 -0.27663186 -0.10878677  0.27803454  0.08667072  0.06445353
  0.20262392  0.1274841   0.30217353 -0.04354052  0.29860505  0.30625728
  0.0359767  -0.15467772 -0.09467538 -0.12735379 -0.20820434  0.06034918]

import numpy as np

customer1 = "T0000001"
customer2 = "T0000004"
num_user = dataset.interactions_shape()[0]
user_x1 = mappings.kundennummer2row[customer1]
user_x2 = mappings.kundennummer2row[customer2]

# L2-normalize each user's embedding row, then take the dot product
# (i.e. cosine similarity)
user_embeddings_norm = (model.user_embeddings[:num_user].T
                        / np.linalg.norm(model.user_embeddings[:num_user], axis=1)).T

similarity = np.dot(user_embeddings_norm[user_x2], user_embeddings_norm[user_x1])
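The dot product of two L2-normalized vectors is their cosine similarity, which equals 1 only when the vectors are identical. A minimal, self-contained sketch of that check with plain NumPy, using two hypothetical random vectors rather than the ones printed above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical 30-dimensional embedding rows (random stand-ins).
v1 = rng.normal(size=30)
v2 = rng.normal(size=30)

# L2-normalize each vector, then take the dot product: cosine similarity.
v1 /= np.linalg.norm(v1)
v2 /= np.linalg.norm(v2)

print(np.dot(v1, v1))  # identical vectors -> 1.0
print(np.dot(v1, v2))  # different vectors -> somewhere in [-1, 1]
```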

This is how I built the dataset:

from lightfm.data import Dataset

dataset = Dataset()

dataset.fit(items=artikel_meta["Artikelnummer"],
            users=kunden_meta["Hauptkundennummer"],
            item_features=artikel_meta["Warengruppe"].unique(),
            user_features=kunden_meta["Branchenschlüssel"].unique())

(interactions, weights) = dataset.build_interactions(
    [(x["Hauptkundennummer"], x["Artikelnummer"], x["Kernumsatz"])
     for index, x in sales_data_2019_grouped.iterrows()])

def prepare_features_format(data, id_col, feature_columns):
    # Build (entity_id, [feature, ...]) pairs, the format expected by
    # Dataset.build_item_features / Dataset.build_user_features
    features = []
    for row in range(data.shape[0]):
        features.append((data[id_col].iloc[row],
                         [str(data[feature].iloc[row]) for feature in feature_columns]))
    return tuple(features)

item_features = dataset.build_item_features(
    prepare_features_format(artikel_meta, "Artikelnummer", ["Warengruppe"]))
user_features = dataset.build_user_features(
    prepare_features_format(kunden_meta, "Hauptkundennummer", ["Branchenschlüssel"]))

Some translations: “Artikelnummer” = “article number”, “Hauptkundennummer” = “main customer number”, “Warengruppe” = “product group”, “Branchenschlüssel” = “industry code”, “Kernumsatz” = “core revenue”

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
EthanRosenthal commented, Jul 10, 2020

Ah, I left something out of my explanation. You must provide the user feature matrix as an argument to get_user_representations(). While you can also use model.user_embeddings directly, you have to make sure that you add up all of a user's feature embeddings prior to calculating similarity with other users. I'm also not sure whether you missed this or not, so I'll walk through an example below just in case it's helpful!

Imagine you use the Dataset class to build both your interactions matrix and your user and item features. You set user_identity_features=True, and you have two other user features, device_is_ios and device_is_android, each of which can be 1 or 0.

If you build your user feature matrix, it will have shape (num_users, num_user_features), where num_user_features = num_users + 2. This is because you build a unique identity feature for each user as well as the 2 extra features. This also means that your user_embeddings matrix will have shape (num_user_features, num_components). That is, you get an embedding for each unique user feature and each of the extra device_is_* features.

So, when you want to calculate a user’s “representation” in order to calculate similarity, you need to add up both the user’s unique embedding and their device_is_* embedding together.
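That summation can be sketched with plain NumPy/SciPy, using toy shapes and random values as stand-ins for LightFM's actual matrices. A user's representation is the row of user_features multiplied by the embedding matrix, i.e. the sum of the embeddings of all of that user's features (which is also what get_user_representations(features=...) computes in LightFM):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy setup: 3 users and 2 extra features (device_is_ios,
# device_is_android), so the user-feature matrix has 3 + 2 = 5 columns.
num_users, num_extra, num_components = 3, 2, 4
rng = np.random.default_rng(0)

# One identity feature per user plus one device feature per user.
# Users 0 and 1 are on iOS, user 2 is on Android.
user_features = csr_matrix(np.array([
    [1, 0, 0, 1, 0],   # user 0: own identity feature + device_is_ios
    [0, 1, 0, 1, 0],   # user 1: own identity feature + device_is_ios
    [0, 0, 1, 0, 1],   # user 2: own identity feature + device_is_android
]))

# Stand-in for model.user_embeddings: one row per user feature.
embeddings = rng.normal(size=(num_users + num_extra, num_components))

# A user's representation is the SUM of the embeddings of all their
# features; user_features @ embeddings computes this for every user.
representations = user_features @ embeddings
print(representations.shape)  # (3, 4)

# Cosine similarity between users 0 and 1: they share the device
# embedding but have different identity embeddings, so it is
# generally not 1 even though their device feature is identical.
norm = representations / np.linalg.norm(representations, axis=1, keepdims=True)
sim_01 = float(norm[0] @ norm[1])
```

Note that with user_identity_features=True, two users with identical features still get distinct, independently initialized identity embeddings, so their summed representations will usually differ.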

0 reactions
MPADAB commented, Nov 22, 2020

@EthanRosenthal Thank you, Ethan! When I use item_features, the model has lower precision compared to pure CF. In your comment you mention device_is_*: does it have to be in one-hot format? For example, my items data:

      article_id section_primary      writer_name
0      1.9134852         culture  אפרת רובינשטיין
1      1.9141164         culture       אורון שמיר
2      1.9179619         culture      דייב איצקוף

So, I am building the features as:

item_features = dataset.build_item_features([(i.article_id, [i.section_primary, i.writer_name])
                                             for i in items.itertuples()])

Thanks!
